Comment: Update the about page to the new documentation format

The SURF Data Archive allows users to safely archive up to petabytes of valuable research data to ensure the long-term accessibility and reproducibility of their work. The Data Archive uses tape technology to provide affordable, safe, and secure data storage. Data ingested into the Data Archive is kept in two tape libraries at two physically separate locations in the Netherlands. The Data Archive is also connected to SURF’s compute infrastructure via a fast network connection, allowing fast staging of archived data and seamless depositing and retrieval of data. Users are given a login, which enables immediate, 24/7 access to the service.

Why? - preserve your data!

Modern scientific research generates more and more data. Keeping all of this data directly accessible online in the HOME directories would be possible, but too costly. Experience shows that very large datasets do not always need to be online, so they can be moved to tape, which is much cheaper than disk. With tape robots and appropriate software, storage on tape is almost transparent to the end user.

The Data Archive supports:

  • Long-term and safe preservation of data;
  • High-level support for making optimal use of the service.

Data access is facilitated by several data transfer protocols that can be employed in a Linux or Windows environment:

  • Via the internet, using SSH, (HPN)SCP, SFTP, rsync, GridFTP;
  • Via iRODS federations that allow implementation and execution of user-defined data policies. Currently, the EUDAT-B2SAFE policies are available: http://www.eudat.eu/services/b2safe

For more details and user stories, please refer to the SURF website.

What? - The tape back-end and the Data Migration Facility (DMF)

Disk space capacity in the archive is handled by selecting which file systems the Data Migration Facility (DMF) will manage and by specifying the volume of free space that will be maintained on each file system. Space management begins with a list of user files that are ranked according to file size and file age.
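The ranking step can be pictured with a minimal Python sketch. The names `Candidate`, `rank_candidates` and `select_to_free` are hypothetical, and the score used here (size multiplied by age) is only an assumption for illustration; this page does not state the formula DMF actually applies.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    path: str
    size_bytes: int
    age_days: float


def rank_candidates(files):
    """Order files so that large, old files come first.

    The score (size multiplied by age) is a hypothetical weighting chosen
    for illustration; it is not DMF's actual formula.
    """
    return sorted(files, key=lambda f: f.size_bytes * f.age_days, reverse=True)


def select_to_free(files, bytes_needed):
    """Pick top-ranked files until at least bytes_needed would be freed."""
    selected, freed = [], 0
    for f in rank_candidates(files):
        if freed >= bytes_needed:
            break
        selected.append(f)
        freed += f.size_bytes
    return selected
```

In this sketch, a file system that must regain a given amount of free space would work down the ranked list until the target is met, so large files that have not been touched for a long time are the first to have their disk space released.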

Users can access the archive infrastructure via the computing infrastructure of SURF (shown in red) or directly via Services Noces, iRODS, and the repository (shown in blue). High-performance transfer protocols are possible using dCache (shown in green). 

Once the data is in the DMF file system, it can be copied to two tape libraries where it is safely stored offline. Data migration occurs in two stages. First, a file is migrated to an offline medium (tape). Once the offline copy is secure, the file is eligible to have its data blocks released (this usually occurs after a minimum space threshold is reached). A file with all offline copies completed is called fully backed up. A file that is fully backed up but whose data blocks have not been released is called a dual-state file; its data exists both online and offline simultaneously. After a file's data blocks have been released, the file is called an offline file.

Migrated files remain catalogued in their original directories and are accessed as if they were still on disk. The only difference users might notice is a delay in access time.
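The file states described above can be summarised in a short Python sketch. The enum and function names are hypothetical and only model the definitions on this page (fully backed up, dual-state, offline); they are not DMF commands, and the label for a file that is not yet fully backed up is an assumption.

```python
from enum import Enum


class FileState(Enum):
    NOT_BACKED_UP = "on disk only, offline copies incomplete"  # label is an assumption
    DUAL_STATE = "fully backed up, data blocks still on disk"
    OFFLINE = "fully backed up, data blocks released"


def classify(offline_copies_done, copies_required, blocks_on_disk):
    """Return the state of a file following the definitions in the text."""
    fully_backed_up = offline_copies_done >= copies_required
    if not fully_backed_up:
        if not blocks_on_disk:
            # Blocks are only released once all offline copies are complete.
            raise ValueError("blocks released before the file was fully backed up")
        return FileState.NOT_BACKED_UP
    return FileState.DUAL_STATE if blocks_on_disk else FileState.OFFLINE
```

With two tape copies required, `classify(2, 2, True)` describes a dual-state file, and the same file becomes offline once its data blocks are released. Reading an offline file is what causes the delay in access time mentioned above.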

How? - Guidelines for storing data on the archive

The Data Archive is designed to store large amounts of research data for long-term archiving.

This means that there are a few guidelines we ask our users to follow to make optimal use of the system and keep performance to a maximum:

  • Try to store files of significant size (> 1 GB) as much as possible. Smaller files will always be accepted but will lower the performance of restoring your files from tape.
  • If you have many small files, make sure to pack them using a file archiving tool like tar or dmftar.
  • Try to pack your files before uploading them to the archive, possibly by using dmftar which allows remote tarring.
  • Organise your files so that, if data is needed again, only the relevant parts of the data set have to be restored from tape.
  • Avoid storing unpacked software packages; these usually contain a lot of small files. Instead, pack these as well, or refer to a specific software repository.
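As an illustration of the packing guideline, the sketch below bundles a directory of small files into one tar file using Python's standard `tarfile` module. It stands in for tar and does not reproduce dmftar's behaviour (such as remote tarring); the function name `pack_directory` and the example paths are hypothetical.

```python
import tarfile
from pathlib import Path


def pack_directory(source_dir, archive_path):
    """Pack source_dir into one uncompressed tar file.

    Returns the number of members in the archive (the directory entry
    itself plus everything below it).
    """
    source = Path(source_dir)
    with tarfile.open(archive_path, "w") as tar:
        # arcname keeps paths inside the archive relative to the directory name
        tar.add(source, arcname=source.name)
    with tarfile.open(archive_path, "r") as tar:
        return len(tar.getnames())
```

For example, `pack_directory("results_2023", "results_2023.tar")` would turn a directory of many small files into a single file before it is uploaded. Write the tar file outside the directory being packed so the archive does not include itself.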

System login and general usage

Data Archive: Data transfers to and from the service

What can I store on the Data Archive?

The Data Archive is a storage facility for archiving research data and research data only. In principle, all data must be packed into larger files (see above).

In general, what may and may not be stored is listed under "Do Store" and "Do Not Store" below.

Who is it for?

The Data Archive is available to all SURF members. Access can be granted through several streams depending on the user’s affiliation and institutional contracts.  

Affiliate of SURF Member Institution – Using SURF Computing Services

If you are a researcher at a SURF member institution, you can submit an individual request to SURF for Data Archive access as part of the application process for the computing services (e.g. Snellius and Research Cloud). For more details, see the access to the computing services page.

If you are a researcher at a SURF member institution with a dedicated contract for Data Archive use across the whole institution, an individual request does not need to be submitted and capacity will be available on demand. This type of contract is negotiated between SURF and the institution’s IT service. To find out if your institution has this type of contract and the stipulations of use, please contact your local IT service department.

Affiliate of SURF Member Institution – Require a Dedicated Contract for Research Group/Project

If you are a researcher at a SURF member institution who requires dedicated space on the Data Archive for a project or research group and does not require any other SURF computing services, you can request a quote based on the amount of space and the duration required via servicedesk@surf.nl or our service portal.

Affiliate of Educational/Research Facility or Private Business Enterprise Not Associated with SURF

If you are a member of an organisation that is not a SURF member but would like to take advantage of our Data Archive services, please email us at servicedesk@surf.nl. Our advisors will determine if you are eligible to use the Data Archive and can provide additional information about the applicable rates.

Why should I archive my data?

There are two main reasons to archive data: scientific integrity and data reuse.

Scientific Integrity

The Netherlands Code of Conduct for Research Integrity has been adopted by the NWO, KNAW, NFU, VSNU and TO2 Federation. It is based on a set of principles: honesty, scrupulousness, transparency, independence, and responsibility. By retaining all the data required to recreate the research done, researchers ensure that their work is transparent and honest.

Data Reuse

Equally important to transparency and honesty in scientific research is maximizing the value of research. Researchers can store datasets that may prove useful in future research endeavours (either through novel analyses or aggregation with other datasets). This helps researchers to FAIRify their data and reduces the investment cost of future research (as long-term cold storage is usually cheaper than repeating the experiment).

What data can I store?

The Data Archive is aimed specifically at the secure storage of large volumes of data that are not actively in use. For instance, researchers can opt to 'freeze' data from an article or store raw data which may be reused in some yet unknown future research. The Data Archive is not meant for keeping an incremental copy of data/directories in case the primary copy is lost (aka backups) or as a cheaper option for hosting large datasets that are undergoing regular access or processing on one of our computing services. For more details about effective data storage and the effects on the system see Effective Archive File Management.

Guidelines

  • Files should be between 1GB and 200GB in size and average file size should not fall below 1GB
    • Details on how to pack directories of small files into larger files or divide extra large files into appropriately sized chunks can be found here
  • Files should be packed before uploading them to the archive
    • We recommend using dmftar as detailed here
  • Plan out how data is spread across the file and folder structure such that specific data will be easy to find and restore without recalling unneeded data
    • Documenting the contents of files and folders with a text file is also strongly recommended; even a couple of words per file can save a lot of frustration down the road
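The size guideline above can be checked before uploading with a small Python sketch. The function names are hypothetical, and the constant `GB` assumes decimal gigabytes, since the guideline does not say whether GB or GiB is meant.

```python
import math

GB = 10**9  # assumption: decimal gigabytes; the guideline does not specify GB vs GiB
MIN_SIZE = 1 * GB
MAX_SIZE = 200 * GB


def size_verdict(size_bytes):
    """Say what to do with a file of the given size."""
    if size_bytes < MIN_SIZE:
        return "pack with other files"
    if size_bytes > MAX_SIZE:
        return "split into chunks"
    return "ok"


def chunks_needed(size_bytes):
    """Number of chunks of at most 200 GB needed to hold the file."""
    return max(1, math.ceil(size_bytes / MAX_SIZE))


def average_ok(sizes):
    """True if the average file size does not fall below 1 GB."""
    return bool(sizes) and sum(sizes) / len(sizes) >= MIN_SIZE
```

For instance, a 450 GB file falls outside the recommended range and would need three chunks, while a set of files averaging less than 1 GB is a candidate for packing into larger files first.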

Do Not Store

  • Unpacked software packages
    • Packed software packages are acceptable
  • Backups of working directories (e.g. Snellius project space)
  • Computer/phone/peripherals backups
  • Personal data not related to your research output (e.g. your music library, photo collections, tax returns, etc.)
  • Unpacked research datasets

Do Store

  • Data from completed research projects
    • Raw data like unprocessed audio files, field notes or readings straight from machine equipment
    • Processed data like digitized drawings, transcribed interviews, validated datasets and anonymized survey results
    • Analyzed data like models, graphs, tables and texts that convey useful information, decisions or conclusions
  • Key temporal snapshots of datasets processed on one of the compute services
  • Data from your research project after it has completed
  • Temporary storage of your research data while you are using computational facilities at SURF
  • Identity files for authentication and authorization
  • Configuration files for applications and tools used on the archive

What is not allowed:

  • Unpacked research data sets (with many smaller files)
  • Backups of your computer, phone or other peripherals
  • Personal data not belonging to your research output
  • Your music or photo collections

Furthermore, we kindly ask you not to unpack packed data files on the archive itself. Transfer them to your computation or processing facility first and unpack them there.

Obtain an account

We are pleased to help you gain access to the Data Archive, answer your questions, or assist you with specific requests you may have about the service and SURF. Please contact us by email via helpdesk@surfsara.nl or through the portal on this page.

Specific details on obtaining accounts by affiliates of one of the Dutch Universities or Grand Technology Institutes can be found on our main website. Detailed information on the Data Archive is provided in the Service Level Specification.

Data can survive your project

Normally, when a project ends, the login and accompanying files are removed after a retention period of 6 months.
You can, however, arrange for files in the archive to be retained. For details, email helpdesk@surfsara.nl or use the portal on this page.

EUDAT - B2SAFE

B2SAFE is a robust, secure and accessible data management service. It allows common repositories to reliably implement data management policies, even in multiple administrative domains. Moreover, research communities and large-scale projects can use B2SAFE to replicate their data robustly and securely. B2SAFE provides tools to apply data policies at various locations so that identical collections of data are managed based on the same policies.

Data policies are sets of rules and processes for regulating data management. B2SAFE is in line with the policies developed within EUDAT and can be expanded to include the policies of a research community, department or institute.

Related Pages


If you have any questions about whether your use case is appropriate for the Data Archive or need advice on how to format your data for optimal storage, please contact our advisors at servicedesk@surf.nl or our service desk portal. An Acceptable Use Policy is also available.

Where is my data stored?

The Data Archive maintains two tape libraries for security and redundancy at two physically separate locations, in the Amsterdam and Haarlemmermeer municipalities. When data is uploaded to the Data Archive using SSH, (HPN)SCP, SFTP, rsync, GridFTP, iRODS, etc., it first lands on online disk space managed by the Data Migration Facility (DMF). The DMF then migrates files from the disk space to the tape libraries until your data is available on both. Once your data is safely stored in both tape libraries, it may be removed from the disk space, at which point it is offline. Offline data can be accessed in the same manner as online data, though users may notice a delay in access time.

How can I start using it?

To set up an account you can contact us via servicedesk@surf.nl or the service portal. Any questions or special requests can also be directed there.
