This document defines the Preservation Policy for the Data Repository service.
The following definitions are important and used throughout the rest of this document, unless explicitly stated otherwise:
- 'Repository' or 'service': the Data Repository service as provided by SURF
- 'Digital Object': an entity containing metadata and possibly a bundle of files served by the Repository service
- 'Visitor': any visitor to the site, whether registered, logged in or solely visiting anonymously
- 'User': any registered user that is logged in
- 'Producer': any user creating new records and uploading data to the service
- 'Consumer': any user or visitor downloading files of a data set
This document contains the preservation policy plan of the SURF Data Repository. It describes the policies for preservation and curation of your data during its publication period.
This policy document describes the principles of data preservation and curation as applied by the SURF Data Repository (hereafter “Repository”) service. The document aims to conform to the OAIS Reference Model.
The SURF Data Repository is a service provided by SURF, The Netherlands. SURF is a subsidiary of the SURF Cooperation.
The mission and vision of the SURF Cooperation can be found here.
One part of the mission of SURF is the provision of services to researchers in the areas of computing, data storage, visualization, networking, cloud and e-Science. Regarding data storage, this includes the long-term storage and preservation of research data. SURF enables researchers to store their (active) research data on several different data-oriented services:
- SURFdrive: for the sharing of documents and small data sets to encourage easy collaboration between researchers and institute employees
- Data Archive: long-term storage and preservation of static end-of-project research data
- Data Repository: long-term storage and preservation of large-scale research data sets that need to be publicly available, discoverable and citeable
This policy document applies only to the SURF Data Repository service. It covers all data sets, files, workflows and views contained in the service, but not other material such as web pages, infrastructure and documentation. All other data stored on the Data Archive service is exempt from this policy.
The objective of the SURF Data Repository is to enable researchers to publish all research data sets used for their publications, regardless of the data volume, the number of files, or the structure in which the data is published. The Repository allows researchers to manage their objects using the web interface. Users cannot remove data they previously uploaded that is contained in an object they own, but they can make the object inactive.
Every object published within the Data Repository is subject to the preservation policy described in this document. This includes the files and metadata that belong to these objects, as well as the established links between objects and the data provenance resulting from any changes made. The purpose of preservation is to make sure that all objects will be readable, findable, authentic and understandable now and in the future.
Data Archive service
The Data Archive service serves as the storage back-end of the Data Repository service. This service only provides bit-level data preservation, which is ensured by a fully automatic, periodic process. All other preservation procedures are built on top of this infrastructure and are part of the Data Repository service.
Each data set consists of at least a minimum default metadata set and at least one file. There is no limit on the amount of data stored or the number of files.
For deposits made in a specific collection or community, additional metadata sets might need to be filled in. These are stored separately from the default metadata set. Any later changes to the metadata are stored in new versions of that specific metadata set (see ‘Versioning’ below).
Preferred and accepted file formats
The SURF Data Repository accepts specific file formats based on file extension and file mime-type. The list of file formats is maintained in a separate file that is integrated with the Repository system and can be referenced publicly on the website.
The list distinguishes between preferred and accepted file formats. In principle, a file format is considered usable only when it is an open format, is expected to remain available far into the future, and is used by the scientific communities. File formats that have all these qualities are preferred and may be curated in the future when necessary (see ‘Data curation’ below). Other file formats, which are not frequently used or may be closed formats, are accepted but not actively curated.
You can find the list of accepted and preferred formats here.
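As an illustration of how a check on file extension and mime-type could distinguish preferred from accepted formats, here is a minimal Python sketch. The format tables and the function name `classify_format` are hypothetical and are not taken from the Repository; the actual list is the one maintained by the service.

```python
from pathlib import PurePath

# Hypothetical example tables (extension -> expected mime-type).
# The real list is maintained by the Repository and will differ.
PREFERRED_FORMATS = {".csv": "text/csv", ".txt": "text/plain"}
ACCEPTED_FORMATS = {".zip": "application/zip"}


def classify_format(filename: str, mime_type: str) -> str:
    """Return 'preferred', 'accepted' or 'unlisted' for an uploaded file.

    Both the extension and the reported mime-type must match an entry.
    """
    extension = PurePath(filename).suffix.lower()
    for label, table in (("preferred", PREFERRED_FORMATS),
                         ("accepted", ACCEPTED_FORMATS)):
        if extension in table and table[extension] == mime_type:
            return label
    return "unlisted"
```

A file named `results.csv` reported as `text/csv` would be classified as preferred under these example tables, while the same file reported with a mismatching mime-type would come back as unlisted.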
Secure high-performance storage
All files uploaded and published in the Repository are by default stored using the SURF Data Archive as storage back-end. This infrastructure uses tape media to archive large data files and has a large disk cache to enable fast up- and downloads.
Checksums and fixity checking
For all files uploaded or published through the Data Archive, checksums are calculated using the well-established MD5 hashing algorithm. While MD5 does not protect against deliberate tampering, for example after a service breach, it remains well suited to data fixity checks.
The service automatically performs periodic fixity checks by comparing the stored checksum of every published file with a newly calculated one. This ensures that any change in the stored data is detected and reported automatically. If such an unlikely event happens, the original data will be restored using backups and replicated data.
All data is stored twice, at different geographical locations. If one of the tapes shows any errors regarding the data, the data will be automatically restored from the other copy. This ensures the data integrity at all times.
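For readers unfamiliar with checksums, the following Python sketch shows how an MD5 checksum of a file can be computed with the standard library. It is only an illustration: the function name `md5_checksum` and the chunk size are assumptions and do not describe the actual implementation used by the service.

```python
import hashlib


def md5_checksum(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that large files do not have to
    fit in memory. The chunk size is an arbitrary choice.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Consumers can use the same approach to verify a downloaded file against a published checksum.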
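The fixity check and the restore from the second copy can be sketched as follows. This is a simplified Python illustration under stated assumptions: the two copies are represented as ordinary file paths, and the names `md5_of` and `check_fixity` are hypothetical. The real process runs on the tape infrastructure of the Data Archive and is not documented here.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path):
    """Return the MD5 hex digest of a file (read whole, fine for a sketch)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def check_fixity(primary, replica, stored_checksum):
    """Compare a file against its stored checksum and repair it if needed.

    Returns 'ok' if the primary copy is intact, 'restored' if it was
    replaced by an intact replica, and 'failed' if both copies differ
    from the stored checksum.
    """
    if md5_of(primary) == stored_checksum:
        return "ok"
    if md5_of(replica) == stored_checksum:
        shutil.copyfile(replica, primary)  # overwrite the damaged copy
        return "restored"
    return "failed"
```

The design choice illustrated here is that the stored checksum, not either copy, is the reference: a replica is only used for a restore after it has itself passed the check.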
Storage medium migration
All data in the Data Repository is managed using the Data Archive, which uses tape infrastructure to store all data. The Data Archive uses proprietary tape storage technology, which is updated once newer technology is established and widely used. Using well-tested methodology, data stored on old tape media is migrated to new tape media technology. During this process, the data integrity is automatically checked using the stored checksums.
When the tape storage facility is updated using a different technology, the data will only be transferred when it is absolutely certain the new technology provides the reliability and performance required to deliver the service needed by the Data Repository.
No actual backup of the file data is done, except for the secondary (meta)data of the Data Repository itself. At any time, the file systems of the Data Archive can be restored by using an inode backup made daily. Full data restore has been tested to ensure usability of these backups.
Data curation
By default, the Data Repository provides no means to perform data curation on the ingested data. This means that the data should be stored in a supported file format before the publication is made. If this is not possible or is deemed unnecessary at the time, the data can be stored as provided.
If, after publication, either the service provider or the data owner determines that the data needs to be converted to another file format or restructured to improve the usability of the publication, a specific data curation project can be carried out in collaboration with the data advisors of SURF. This is done based on a mutual understanding between SURF and the data owner and is recorded in a tailor-made contract.
Versioning
All metadata is automatically versioned by the Repository system upon any change made by the data owner or the system.
When a new version is created from an existing data object, the new version contains a copy of the original data files and metadata. The old and new versions are kept side by side, so old versions can always be retrieved.
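The idea of versioning metadata on every change, while keeping old versions retrievable, can be illustrated with a small append-only history. This Python sketch is a conceptual model only: the class `MetadataHistory` and its methods are hypothetical and do not reflect how the Repository system stores versions internally.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class MetadataHistory:
    """Append-only list of metadata versions; nothing is overwritten."""

    versions: list = field(default_factory=list)

    def update(self, changes: dict) -> int:
        """Store a new version with `changes` applied; return its number."""
        current = copy.deepcopy(self.versions[-1]) if self.versions else {}
        current.update(changes)
        self.versions.append(current)
        return len(self.versions)  # version numbers start at 1

    def get(self, version=None) -> dict:
        """Return a copy of the given version (default: the latest)."""
        index = len(self.versions) if version is None else version
        return copy.deepcopy(self.versions[index - 1])
```

For example, after two calls to `update`, `get(1)` still returns the metadata as it was after the first change, while `get()` returns the latest state.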
You will find further information related to the Repository service in the Documentation pages.