Introduction - Only large files, please
In many cases, raw or processed data needs to be archived for long-term storage or future use. At SURF, the Data Archive provides efficient archiving of large data sets by storing files on tape.
It is important that you put a few large files on the archive file system rather than many small ones. By small we mean less than 100 MB; by large we mean larger than 1 GB. The reason is that if you archive many files, chances are that they will be spread over hundreds of different tapes. Retrieving these files will then take a long time, because every tape mount takes a significant amount of time.
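To get a feeling for how many small files a folder contains before archiving it, a quick count can help. The sketch below is only an illustration: the folder name results and its contents are made up, and the threshold of 100,000,000 bytes is one reading of "100 MB".

```shell
# Hypothetical demo data in a throwaway directory
demo=$(mktemp -d) && cd "$demo"
mkdir results
for i in 1 2 3; do echo "run $i" > "results/run$i.log"; done

# Count regular files smaller than 100,000,000 bytes (the 'c' suffix means bytes)
small=$(find results -type f -size -100000000c | wc -l)
echo "files under 100 MB: $small"
```

If the count is high, packing the folder into one archive file first is the better route.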
Packing - the old fashioned way
If you have many small files, it is best to combine them into one large file, for example using the tar command. Assume you have some small files in a folder called output. You can combine them into one so-called tape archive (tar) file.
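A minimal sketch of such a command, assuming GNU tar. The throwaway directory and the sample files are placeholders created only so that the example runs on its own:

```shell
# Hypothetical demo layout: a folder 'output' with a few small files, and a 'workdir'
demo=$(mktemp -d) && cd "$demo"
mkdir output workdir
for i in 1 2 3; do echo "result $i" > "output/result$i.txt"; done

# c = create, v = verbose, f = archive file to write
tar -cvf workdir/output.tar output
```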
Here the option 'c' indicates to create a new file output.tar in the workdir, which contains the files in the output directory. The option 'f' specifies the location of the archive and 'v' means verbose output to the user. The reverse of the above action would be:
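As a sketch, again assuming GNU tar and a made-up demo layout, extraction replaces the 'c' option with 'x':

```shell
# Hypothetical setup: build a small archive first, then remove the original folder
demo=$(mktemp -d) && cd "$demo"
mkdir output workdir
echo "result" > output/result.txt
tar -cf workdir/output.tar output
rm -r output

# x = extract, v = verbose, f = archive file to read
tar -xvf workdir/output.tar
```

The folder output is recreated in the current directory with its original contents.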
Optimal packed file size
When using a tool like tar or dmftar always take the total size of the files you are about to pack into consideration. Although the Data Archive system prefers larger files, files that are too large (> 1 TB) lead to other difficulties.
If your data set is several terabytes in size, consider packing several subfolders separately. This makes it easier to retrieve individual parts when only those parts are needed for processing.
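Packing subfolders separately could look like the following sketch. The folder names dataset, run1, run2 and packed are hypothetical; the loop simply writes one tar file per subfolder.

```shell
# Hypothetical demo layout with two subfolders
demo=$(mktemp -d) && cd "$demo"
mkdir -p dataset/run1 dataset/run2 packed
echo a > dataset/run1/a.dat
echo b > dataset/run2/b.dat

# One archive per subfolder, so each part can be retrieved on its own
for dir in dataset/*/; do
  name=$(basename "$dir")
  tar -cf "packed/$name.tar" -C dataset "$name"
done
ls packed
```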
In short, given your situation:
- The easiest route is to use dmftar (see next section)
- We strongly recommend using this tool, as it manages all aspects automatically, from staging files to packing your data set into a multi-volume archive consisting of files of 100 GB each by default
- When using tar is the only option and you have a large data set you would like to pack, please use the right options with the tool:
This is the same command as in the previous section, but now the -M option indicates to create multiple files instead of one large file, and the -L option with value '10GB' indicates to create files no larger than 10 GB. If your data set is larger than that, more files will be generated.
When unpacking these files, you can just point to the first file of the series without any additional options.
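The multi-volume mechanism can be tried out at a small scale. The sketch below assumes GNU tar and uses made-up names (data, restore, vol1.tar to vol4.tar) and a tiny volume size of 100 KiB so that it runs quickly; for real data a much larger value would be used with -L. Listing the volumes with repeated -f options is one way to avoid interactive prompts for the next volume, and the sketch takes the conservative route of passing -M and all volume names on extraction as well.

```shell
# Hypothetical demo data: three files of 80 KiB each
demo=$(mktemp -d) && cd "$demo"
mkdir data restore
for name in a b c; do head -c 81920 /dev/zero > "data/$name.dat"; done

# -M = multi-volume, -L 100 = at most 100 x 1024 bytes per volume
# Volumes are written in the order of the -f options; unused names are not created
tar -c -M -L 100 -f vol1.tar -f vol2.tar -f vol3.tar -f vol4.tar data

# Read the volumes back in the same order into a separate folder
tar -x -M -f vol1.tar -f vol2.tar -f vol3.tar -f vol4.tar -C restore
cmp data/c.dat restore/data/c.dat && echo "restored intact"
```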
dmftar - the one tool for your archive data
In order to use this service optimally, data files are best stored in archive files that are usually much larger than the individual files. Because this can be difficult to achieve using standard tools like tar, the dmftar tool has been developed at SURF and is available on your HPC system to largely automate the archive and restore processes of your data files.
dmftar is a wrapper for the Linux tool gnutar. It automatically creates archive files of any size (default 100 GB) and can transfer them to the archive file system if necessary.
This page will briefly explain how to use the dmftar tool on the Data Archive, Lisa cluster or Snellius supercomputer.
Things to note
- dmftar is currently only available on Snellius, Lisa and the Data Archive services
- All commands are issued on the HPC system after logging in (see SSH Usage)
- For remote archiving functionality (from system to system), key-based authentication needs to be set up
- Please refer to the Data Archive Usage page for general guidelines concerning the data archiving on this service
- When using dmftar, please do not invoke too many processes at the same time, since that can put stress on the storage facility. If your data can be archived slowly, use staging jobs on the Cartesius and Lisa systems.
Please type the following command to get insight into how to use the dmftar tool:
The first few lines will briefly explain how the tool can be used:
As is visible, the tool supports various operational modes and can be used to create or extract archive files locally or remotely. Add a pattern during extraction to extract only the files you need from an archive.
The operational modes are:
Not all modes are discussed here, but you are free to use them on any system.
In many cases, the tarring of a folder is sufficient to efficiently archive any data on the archive file system. For example, to archive a folder called 'data' for the long term, issue the following command:
This will create an additional folder called data.dmftar, which contains the tarred equivalent of the original folder in files (also called volumes) of 100 GB in size. If tarring is successful, the original folder can be removed and the folder data.dmftar can be transferred to the archive file system using standard copy commands.
To retrieve all the original files again, use dmftar to extract the data from the archive folder:
This will recreate the original folder data again (as it is stored in the archive files as well) with all the original data.
One can also list the contents of a dmftar folder:
A user can invoke direct transfer of a newly created dmftar folder to another system (usually the archive service) by adding user and system information:
This will create the archive files locally and will directly transfer them to the home folder of the user on the Data Archive. In order to make this work fully automatically, key-based authentication and key-forwarding need to be enabled.
The same thing can be done on remotely stored dmftar folders in order to restore the archived files locally:
This will locally create a folder called data (as implied by the dmftar folder) containing the original data files.
Extracting specific files and folders
When only specific files or folders instead of the whole dmftar folder contents are needed, an extra pattern argument can be added in order to do so. Please note that in all cases the original root folder of the dmftar folder needs to be given in the pattern to make sure the files are found in the archive.
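Because dmftar wraps GNU tar, the rule about the root folder can be illustrated with plain tar. The sketch below is not a dmftar invocation; it uses a throwaway layout with a root folder data, subfolders data1 to data9 and nine files in each, and the archive name data.tar and folder restore are assumptions made for the example.

```shell
# Hypothetical demo layout packed with plain tar
demo=$(mktemp -d) && cd "$demo"
for i in 1 2 3 4 5 6 7 8 9; do
  mkdir -p "data/data$i"
  for j in 1 2 3 4 5 6 7 8 9; do
    echo "sample $i-$j" > "data/data$i/data$i-$j.dat"
  done
done
tar -cf data.tar data
mkdir restore

# Works: the member path starts with the root folder 'data'
tar -xf data.tar -C restore data/data1/data1-1.dat

# Fails: without the root folder the member is not found in the archive
if ! tar -xf data.tar -C restore data1/data1-1.dat 2>/dev/null; then
  echo "member not found without the root folder"
fi
```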
In the following examples, a dmftar folder data.dmftar has been prepared with a root directory data and subdirectories data[1-9], each having 9 data files named data[1-9]-[1-9].dat. To check the contents of the dmftar folder:
To extract a single file, say data1-1.dat, the file name and path are added as the extraction pattern:
To extract the files in subdirectory data2 of dmftar folder data.dmftar, this will work:
If you want to use wildcards, for example to extract only the first file of every subfolder in the dmftar folder, the option -o with value --wildcards needs to be added to make sure the underlying tar accepts patterns with wildcards:
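What the option does in the underlying tool can be sketched with plain GNU tar (shown here instead of dmftar, purely as an illustration). The layout mirrors the one described above; the archive name data.tar and the folder restore are assumptions made for the example.

```shell
# Hypothetical demo layout: data/data1..data9, each with data<i>-1.dat .. data<i>-9.dat
demo=$(mktemp -d) && cd "$demo"
for i in 1 2 3 4 5 6 7 8 9; do
  mkdir -p "data/data$i"
  for j in 1 2 3 4 5 6 7 8 9; do
    echo "sample $i-$j" > "data/data$i/data$i-$j.dat"
  done
done
tar -cf data.tar data
mkdir restore

# Quote the pattern so the shell passes it to tar unchanged
tar -xf data.tar -C restore --wildcards 'data/data*/data*-1.dat'
find restore -type f | sort
```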
Now 9 files have been extracted, each in its own subfolder of the folder data.
The dmftar tool provides several options that alter how the tool works and what it produces. Enter the dmftar command without any options to see a full list of possibilities.
Please choose your options with care. By default the most useful options are set.
In all cases, dmftar will calculate checksums for every generated tar file within a dmftar folder. Checksums make it possible to assess at a later stage whether the data has remained the same over time. If data has changed, the checksum will be different from the one calculated before. The default checksum algorithm used by dmftar is blake2b, but others are supported as well:
In general, the higher the number in the name of the algorithm, the more computationally intensive the calculation of the checksum will be, but the more reliable the check will be when comparing a freshly calculated checksum with the one stored in the dmftar folder. For archiving purposes, the blake2b algorithm is considered sufficient and reliable, and it is faster than the md5 algorithm that was the default in previous versions of dmftar. You can learn more about checksums here.
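The principle of comparing a stored checksum with a freshly calculated one can be tried by hand. The sketch below uses the b2sum utility from GNU coreutils, which computes BLAKE2b checksums; it is an illustration of the idea and not how dmftar itself stores or verifies its checksums. The file name volume.tar is a stand-in for an archive volume.

```shell
# Hypothetical stand-in for an archive volume
demo=$(mktemp -d) && cd "$demo"
printf 'archive volume contents\n' > volume.tar

# Store the checksum next to the file, then verify it
b2sum volume.tar > volume.tar.b2
b2sum -c volume.tar.b2

# Simulate a change to the data: verification now fails
printf 'unexpected change\n' >> volume.tar
if ! b2sum -c volume.tar.b2 >/dev/null 2>&1; then
  echo "checksum mismatch detected"
fi
```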
xx64 checksum algorithms are available on systems that fulfill at least one of the following:
- Have the xxhash Python module installed
- Have the xxhash software package installed
To recalculate the checksums in a dmftar folder and compare them with the original, use the
After you have created the dmftar archive files, you can verify whether all files present in the resulting files are also present in the original directory structure and vice versa using the
Removing dmftar files
The dmftar tool creates read-only files within a dmftar folder as a security precaution. First make sure that the dmftar archive files are copied to the archive. Then, if you want to delete the original dmftar files from your system, you can use the --delete-archive option of the dmftar command: