The SURF Data Archive allows the user to safely archive up to petabytes of valuable research data. The Data Archive uses tape library technology to store data sets for the long term and allows access at any time.
Data ingested into the SURF Data Archive is kept in two tape libraries at two different locations in Amsterdam, the Netherlands. The Data Archive is connected to our compute infrastructure via a fast network connection, which allows rapid staging of archived data. Users are given a login, which enables immediate, 24/7 access to the service.
Why? - preserve your data!
Modern scientific research tends to generate more and more data. It would be possible to keep all this data directly accessible online in the HOME directories, but that would be too costly. Experience shows that very large datasets are not always needed online, so they can be moved to tape, which is much cheaper than disk. Thanks to tape robots and appropriate software, storage on tape is almost transparent to the end user.
The Data Archive supports:
- Long-term and safe preservation of data;
- High-level support concerning the optimal use of the service.
Data access is facilitated by several data transfer protocols that can be employed in a Linux or Windows environment:
- Via the internet, using SSH, (HPN)SCP, SFTP, rsync, GridFTP;
- Via iRODS federations that allow implementation and execution of user-defined data policies. Currently, the EUDAT-B2SAFE policies are available: http://www.eudat.eu/services/b2safe
For more details and user stories, please refer to the SURF website.
What? - The tape back-end and the Data Migration Facility (DMF)
Disk space capacity in the archive is handled by selecting which file systems the Data Migration Facility (DMF) will manage and by specifying the volume of free space that will be maintained on each file system. Space management begins with a list of user files that are ranked according to file size and file age.
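The ranking step can be illustrated with a toy sketch. The exact formula DMF uses is not given here, so the size-times-age score below is purely an assumption, as are the names Candidate, rank_candidates and select_for_release. The sketch only shows the general idea: large, old files are considered first until enough space is covered.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    path: str
    size_bytes: int
    age_days: float  # assumption: age measured in days


def rank_candidates(files):
    """Order files so that large, old files come first.

    The size * age score is an illustrative assumption, not DMF's formula.
    """
    return sorted(files, key=lambda f: f.size_bytes * f.age_days, reverse=True)


def select_for_release(files, bytes_to_free):
    """Pick ranked files until the requested amount of space is covered."""
    selected, freed = [], 0
    for f in rank_candidates(files):
        if freed >= bytes_to_free:
            break
        selected.append(f)
        freed += f.size_bytes
    return selected
```

In this model a small, recently created file ends up at the bottom of the list, so it is the last one considered when space has to be freed.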
Users can access the archive infrastructure via the computing infrastructure of SURF (shown in red) or directly via service nodes, iRODS, and the repository (shown in blue). High-performance transfer protocols are possible using dCache (shown in green).
Once the data is in the DMF file system, it can be copied to two tape libraries, where it is safely stored offline. Data migration occurs in two stages. First, a file is migrated to an offline medium (tape). Once the offline copy is secure, the file is eligible to have its data blocks released (this usually occurs after a minimum space threshold is reached). A file with all offline copies completed is called fully backed up. A file that is fully backed up but whose data blocks have not been released is called a dual-state file; its data exists both online and offline simultaneously. After a file's data blocks have been released, the file is called an offline file.
Migrated files remain catalogued in their original directories and are accessed as if they were still on disk. The only difference users might notice is a delay in access time.
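The file states described above can be summarised in a small model. This is a minimal sketch for illustration only and does not use any DMF interface. The class ArchivedFile, the state name REGULAR, and the recall method are assumptions introduced here; the default of two tape copies follows the two tape libraries mentioned in the text.

```python
from enum import Enum


class FileState(Enum):
    REGULAR = "regular"        # assumed name: data on disk, tape copies incomplete
    DUAL_STATE = "dual-state"  # fully backed up, data blocks still on disk
    OFFLINE = "offline"        # fully backed up, data blocks released


class ArchivedFile:
    def __init__(self, name, copies_required=2):
        self.name = name
        self.copies_required = copies_required
        self.tape_copies = 0
        self.on_disk = True

    @property
    def fully_backed_up(self):
        return self.tape_copies >= self.copies_required

    def write_tape_copy(self):
        """Stage 1: record one more offline copy, up to the required number."""
        if self.tape_copies < self.copies_required:
            self.tape_copies += 1

    def release_blocks(self):
        """Stage 2: free the disk blocks, allowed only once fully backed up."""
        if not self.fully_backed_up:
            raise RuntimeError("all offline copies must exist before release")
        self.on_disk = False

    def recall(self):
        """Assumed behaviour: accessing an offline file brings its data back to disk."""
        self.on_disk = True

    @property
    def state(self):
        if not self.fully_backed_up:
            return FileState.REGULAR
        return FileState.DUAL_STATE if self.on_disk else FileState.OFFLINE
```

The guard in release_blocks mirrors the rule that data blocks are released only after the offline copies are secure, which is why an offline file can always be restored from tape.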
How? - Guidelines for storing data on the archive
The Data Archive is designed to store large amounts of research data for long-term archiving.
This means that there are a few guidelines we ask our users to follow to make optimal use of the system and keep performance to a maximum:
- Try to store files of significant size (> 1 GB) as much as possible. Smaller files will always be accepted but will lower the performance of restoring your files from tape.
- If you have many small files, make sure to pack them using a file archiving tool like tar or dmftar.
- Try to pack your files before uploading them to the archive, possibly by using dmftar which allows remote tarring.
- Organise your files in such a way that in case the files are needed again only parts of the data set need to be restored from tape.
- Avoid storing unpacked software packages, because these usually contain a lot of small files. Instead, pack these as well, or refer to a specific software repository.
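As an illustration of the packing guideline, the sketch below bundles a directory of small files into a single tar file using Python's standard tarfile module. The function name pack_directory and the example paths are hypothetical. dmftar has its own command-line interface, which is not shown here.

```python
import tarfile
from pathlib import Path


def pack_directory(source_dir, archive_path):
    """Pack source_dir into one uncompressed tar file and return the member count."""
    source = Path(source_dir)
    with tarfile.open(str(archive_path), "w") as tar:
        # arcname stores paths relative to the directory name, not the full path
        tar.add(str(source), arcname=source.name)
    # Reopen the archive to confirm that it is readable and count its members
    with tarfile.open(str(archive_path), "r") as tar:
        return len(tar.getnames())
```

A call such as pack_directory("run_042", "run_042.tar") would produce one file to upload instead of many small ones. Packing one archive per logical subset of a data set keeps later restores limited to the parts that are actually needed.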
What can I store on the Data Archive?
The Data Archive is a storage facility for archiving research data and research data only. In principle, all data must be packed into larger files (see above).
In general the following is allowed:
- Data from your research project after it has completed
- Temporary storage of your research data during usage of computational facilities at SURF
- Identity files for authentication and authorization
- Configuration files of applications and tools used on the archive
What is not allowed:
- Unpacked research data sets (with many smaller files)
- Backups of your computer, phone or other peripherals
- Personal data not belonging to your research output
- Your music or photo collections
Furthermore, we kindly ask you not to unpack packed data files on the archive itself, but to transfer them first to your computation or processing facility and unpack them there.
Obtain an account
We are pleased to help you gain access to the Data Archive, answer your questions, or assist you with specific requests you may have about the service and SURF. Please contact us by email via firstname.lastname@example.org or via the portal on this page.
Specific details on obtaining accounts by affiliates of one of the Dutch Universities or Grand Technology Institutes can be found on our main website. Detailed information on the Data Archive is provided in the Service Level Specification.
Data can survive your project
Normally, when a project ends, the login and accompanying files are removed after a retention period of 6 months.
You can arrange, however, that files in the archive are retained. For details, send an email to email@example.com or use the portal on this page.
EUDAT - B2SAFE
B2SAFE is a robust, secure and accessible data management service. It allows common repositories to reliably implement data management policies, even in multiple administrative domains. Moreover, research communities and large-scale projects can use B2SAFE to replicate their data robustly and securely. B2SAFE provides tools to apply data policies at various locations so that identical collections of data are managed based on the same policies.
Data policies are sets of rules and processes for regulating data management. B2SAFE is in line with the policies developed within EUDAT and can be expanded to include the policies of a research community, department or institute.