Introduction - Only large files, please

In many cases, raw or processed data needs to be archived for long-term storage or future use. At SURF, the Data Archive provides efficient archiving of large data sets by storing files on tape.

It is important that you put few large files on the archive file system rather than many small ones. By small we mean less than 100 MB, by large we mean larger than 1 GB. The reason is that if you archive many small files, chances are they will be spread over hundreds of different tapes. Getting these files back will then take a long time, because every tape mount takes a significant amount of time.
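
If you are not sure how your data set is distributed, a quick check with standard Linux tools can help. The sketch below assumes your data lives in a folder called output (the same example folder used in the next section):

# Count how many files are smaller than 100 MB (candidates for packing)
find output/ -type f -size -100M | wc -l
# Show the total size of the folder
du -sh output/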

Packing - the old-fashioned way

If you have many small files, it is best to combine them into one large file, for example using the tar command. Assuming you have some small files in a folder called output, you can combine them into a single so-called tape archive (tar) file:

tar cvf /archive/<username>/<workdir>/<output>.tar output*

Here the option 'c' tells tar to create a new archive file output.tar in the workdir, containing the files from the output directory. The option 'f' specifies the location of the archive, and 'v' enables verbose output. The reverse of the above action, where dmget -a first stages the archive file from tape back to disk before extraction, would be:

cd $HOME/workdir
dmget -a /archive/<username>/<workdir>/<output>.tar
tar xvf /archive/<username>/<workdir>/<output>.tar

Optimal packed file size

When using a tool like tar or dmftar always take the total size of the files you are about to pack into consideration. Although the Data Archive system prefers larger files, files that are too large (> 1 TB) lead to other difficulties.

If your data set is several terabytes in size, consider packing several subfolders separately. This makes it easier to retrieve only the specific parts you need when you later process the data.
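
One possible sketch, reusing the tar approach from the previous section and assuming one subfolder per part of the data set under data/ (names and paths are placeholders):

# Pack each subfolder into its own tar file on the archive file system
for dir in data/*/; do
    name=$(basename "$dir")
    tar cvf /archive/<username>/<workdir>/"$name".tar "$dir"
done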

In short, given your situation:

  • The easiest route is to use dmftar (see next section)
    • We strongly recommend using this tool, as it manages all aspects automatically, from staging files to packing your data set into a multi-volume archive consisting of multiple files of 100 GB each by default
  • When tar is the only option and you have a large data set to pack, please use the right options with the tool:
tar -M -L 10GB -cvf /archive/<username>/<workdir>/<output>.tar output*

This is the same command as in the previous section, but the -M option tells tar to create a multi-volume archive consisting of multiple files instead of one large file, and the -L option with value '10GB' limits each file to 10 GB. If your data set is larger than that, more files will be generated.

When unpacking these files, you can just point to the first file of the series without any additional options.

dmftar - the one tool for your archive data

To make optimal use of this service, data files are best bundled into archive files that are much larger than the individual data files. Because this can be difficult to achieve with standard tools like tar, the dmftar tool was developed at SURF and is available on your HPC system to largely automate archiving and restoring your data files.

In short, dmftar is a wrapper around GNU tar that automatically creates archive files of a configurable size (100 GB by default) and can transfer them to the archive file system if necessary.

This page will briefly explain how to use the dmftar tool on the Data Archive, Lisa cluster or Snellius supercomputer.

Things to note

  • dmftar is currently only available on Snellius, Lisa and the Data Archive services
  • All commands are issued on the HPC system after logging in (see SSH Usage)
  • For remote archiving functionality (from system to system), key-based authentication needs to be set up (see the sketch after this list)
  • Please refer to the Data Archive Usage page for general guidelines concerning the data archiving on this service
  • When using dmftar, please do not start too many processes at the same time, since this may stress the storage facility. If your data does not need to be archived immediately, use staging jobs on the Snellius and Lisa systems.
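
A minimal sketch of setting up key-based authentication from an HPC system to the Data Archive, assuming an ed25519 key and the archive host name used elsewhere on this page (adapt the user name and key type to your situation):

# Generate a key pair on the HPC system (accept the defaults, optionally set a passphrase)
ssh-keygen -t ed25519
# Copy the public key to the archive server
ssh-copy-id <user>@archive.surfsara.nl
# Check that login now works without a password prompt
ssh <user>@archive.surfsara.nl echo ok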

Getting help

Please type the following command to get insight into how to use the dmftar tool:

dmftar --help

The first few lines will briefly explain how the tool can be used:

usage: dmftar -c       [options] -f [[USER@]REMOTE:]DEST/ [FILE|DIR ..]
or:    dmftar -x       [options] -f [[USER@]REMOTE:]DIR/ [PATTERN ..]
or:    dmftar -t|-d|-V [options] -f [[USER@]REMOTE:]DIR/
or:    dmftar -i                 -f DIR/

As the usage message shows, the tool supports various operational modes and can be used to create or extract archive files locally or remotely. Add a pattern during extraction to extract only the files you need from an archive.

The operational modes are:

  -c, --create             Create a new archive
  -t, --list               List archive contents
  -x, --extract            Extract archive
  -d, --download           Download remote archive
  -V, --verify             Verify volume checksums
      --verify-content     Compare archive index with directory entries
  -i, --interactive        Interactively browse and extract local archive
      --regen-index        Re-create the index files
      --regen-checksum     Re-create volume checksums
      --delete-archive     Delete archive (*no* questions asked)
  -f, --archive=DIR        Set local or remote archive directory

Not all modes are discussed here, but you are free to use them on any system.
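
For instance, the interactive mode listed above can be used to browse a local archive and pick entries to extract; a minimal example, assuming a local dmftar folder called data.dmftar:

dmftar -i -f data.dmftar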

Default usage

In many cases, tarring a folder is sufficient to efficiently archive data on the archive file system. For example, to archive a folder called 'data' for the long term, issue the following command:

dmftar -c -f data.dmftar data/

This will create an additional folder called data.dmftar, which contains the tarred equivalent of the original folder in files (also called volumes) of 100 GB each. Once tarring has completed successfully, the original folder can be removed and the folder data.dmftar can be transferred to the archive file system using standard copy commands like cp or mv.
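
For example, copying the dmftar folder (including all volumes, checksums and index files) to the archive could look like this (paths are placeholders):

cp -r data.dmftar /archive/<username>/<workdir>/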

To retrieve all the original files again, use dmftar to extract the data from the archive folder:

dmftar -x -f data.dmftar

This will recreate the original folder data (its name is stored in the archive files as well) with all the original data.

One can also list the contents of a dmftar folder:

dmftar -t -f data.dmftar

Remote usage

A user can transfer a newly created dmftar folder directly to another system (usually the archive service) by adding user and system information:

dmftar -c -f <user>@archive.surfsara.nl:data.dmftar data/

This will create the archive files locally and transfer them directly to the user's home folder on the Data Archive. For this to work fully automatically, key-based authentication and key forwarding need to be enabled.
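
A minimal sketch of enabling key forwarding from your own machine, assuming your key is already accepted by both the HPC system and the archive (host name and key file below are examples):

# Start an SSH agent on your workstation and load your private key
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
# Log in to the HPC system with agent forwarding enabled (-A), so that
# dmftar can authenticate to archive.surfsara.nl on your behalf
ssh -A <user>@snellius.surf.nl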

The same can be done with remotely stored dmftar folders in order to restore the archived files locally:

dmftar -x -f <user>@archive.surfsara.nl:data.dmftar

This will locally create a folder called data (as implied by the name of the dmftar folder) containing the original data files.

Extracting specific files and folders

When only specific files or folders are needed instead of the whole contents of the dmftar folder, an extra pattern argument can be added. Please note that in all cases the pattern must include the original root folder of the dmftar folder to make sure the files are found in the archive.

In the following examples, a dmftar folder data.dmftar has been prepared with a root directory data and subdirectories data[1-9], each containing 9 data files named data[1-9]-[1-9].dat. To check the contents of the dmftar folder:

dmftar -t -f data.dmftar

To extract a single file, say data1-1.dat, the file name and path are added as the extraction pattern:

dmftar -x -f data.dmftar/ data/data1/data1-1.dat

To extract the files in subdirectories data1 and data2 of dmftar folder data.dmftar, this will work:

dmftar -x -f data.dmftar/ data/data{1..2}
verifying checksum for volume #1
extracting data.dmftar/

If you want to use wildcards, for example to extract only the first file of every subfolder in the dmftar folder, the option -o with value --wildcards needs to be added to make sure the underlying tar accepts patterns with wildcards:

dmftar -x -o=--wildcards -f data.dmftar/ data/data*/data1-1.dat

Now 9 files have been extracted, each into its own subfolder of the folder data.

Advanced options

The dmftar tool provides several options that alter how the tool works and what it produces. Enter the dmftar command without any options to see the full list of possibilities.

      --force-local        Archive dir is local even if it has a colon
  -L, --tape-length=SIZE   Set volume size (default: '100.0 GB')
  -o, --options=OPTIONS    Pass extra options to 'tar'
      --tar=BINARY         Set path to GNU tar binary (default: /bin/tar)
      --checksum=CHECKSUM  Set checksum algorithm (default: blake2b)
      --no-checksum        Skip all checksumming
      --keep-cache         Do not delete any downloaded volumes
      --conf=CONFIGFILE    Change config file (default: /etc/dmftar.conf)
      --files-from=FILE    Extract members listed in file
      --files-from0=FILE   Extract null-terminated members listed in file
  -v, --verbose            Show verbose output
  -q, --quiet              Suppress informational messages
      --debug              Show debug messages
      --version            Display version number and exit

Please choose your options with care; the defaults are suitable for most use cases.
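
As an illustration, the volume size can be lowered when creating an archive. The SIZE value below mirrors the format of the default shown in the help text ('100.0 GB'); check dmftar --help on your system for the exact syntax it accepts:

# Create volumes of roughly 10 GB instead of the default 100 GB
dmftar -c -L '10.0 GB' -f data.dmftar data/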

Checksums

In all cases, dmftar calculates checksums for every generated tar file within a dmftar folder. These checksums make it possible to assess at a later stage whether the data has remained the same over time. If data has changed, the checksum will differ from the one calculated before. The default checksum algorithm used by dmftar is blake2b, but others are supported as well:

blake2b blake2s md5 sha1 sha224 sha256 sha384 sha3_224 sha3_256 sha3_384 sha3_512 sha512 xxh128 xxh32 xxh64

In general, the higher the number in the name of the algorithm, the more computationally intensive the checksum calculation will be, but also the more reliable the check when comparing a freshly calculated checksum with the one stored in the dmftar folder. For archiving purposes, the blake2b algorithm is considered sufficiently reliable, and it is faster than md5, which was the default in previous versions of dmftar.
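
For example, to select one of the algorithms listed above instead of the default blake2b when creating an archive, use the --checksum option described under Advanced options:

dmftar -c --checksum=sha256 -f data.dmftar data/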

The xxh128, xxh32, and xxh64 checksum algorithms are available on systems that fulfill at least one of the following:

  • Have the xxhash Python module installed

  • Have the xxhash software package installed
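
A quick way to check whether the xxhash Python module is available on your system:

python3 -c "import xxhash" && echo "xxhash module available"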

To recalculate the checksums in a dmftar folder and compare them with the original, use the -V option:

dmftar -V -f archive.dmftar

Verify contents

After you have created the dmftar archive files, you can verify that all files present in the resulting archive are also present in the original directory structure, and vice versa, using the --verify-content option:

dmftar --verify-content -f archive.dmftar

Removing dmftar files

The dmftar tool creates read-only files within a dmftar folder as a safety precaution. First make sure that the dmftar archive files have been copied to the archive. Then, if you want to delete the original dmftar files from your system, use the --delete-archive option of the dmftar command:

dmftar --delete-archive -f <file_name>.dmftar/
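
A short sketch of a safe removal workflow, assuming the dmftar folder has already been copied to the archive file system (paths are placeholders):

# First verify the checksums of the copy on the archive file system
dmftar -V -f /archive/<username>/<workdir>/data.dmftar
# Only then delete the local copy (no questions asked!)
dmftar --delete-archive -f data.dmftar/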