Tape storage is a cost-effective technology for long-term data retention. Data can be archived safely and robustly at a fraction of the cost of other technologies. However, the tape library infrastructure is optimized for the large files typically selected for archiving, and it performs drastically worse when used to store large volumes of small files.

Size Definitions

Small files:      <100 MB
Large files:      >1 GB
Too large files:  >1 TB
Ideal size range: 1 GB - 200 GB

How the Data Archive Stores Files

All data on the Data Archive is managed by the Data Migration Facility (DMF). When you upload your data, it is stored on a DMF-managed file system. This file system is an “online” disk pool from which users can access and retrieve their data quickly and easily. However, the data will not stay there: the DMF server automatically triggers a migration procedure in which files are individually scheduled for migration to the offline tape medium based on their size and age. Once all the data has been safely stored on two separate tapes, the copy on the DMF file system may be removed to free space on the disk pool for incoming or outgoing data.
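If you want to see where your files currently reside, the DMF client utilities can report each file's migration state. The sketch below assumes the dmls utility is available on the archive system; the exact set of states may differ per installation:

# dmls behaves like ls -l but appends the DMF state of each file,
# e.g. REG (disk only), DUL (disk and tape), OFL (tape only).
dmls -l <data>/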

When storing data on the Data Archive, it is important to use large files only. Depending on when your data is migrated, it will be spread across multiple physical tapes in the library, even if all files appear to be in the same folder. The consequence is that when you want to retrieve five files from the same folder, the robot in the tape library may have to fetch, mount, and locate your data on up to five separate tapes. If the same data were stored together in one large file instead of five smaller files, it would require only one trip and occupy only one tape drive. These delays in data access and retrieval grow substantially as the number of files increases.

Conversely, files larger than 1 TB also cause performance issues. They are difficult to migrate, as the chances of timeouts and failed transfers increase with file size. They also take a long time to retrieve in full, which becomes especially problematic if you only need to retrieve part of your data.
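To spot such files before uploading, the following one-liner is a sketch assuming GNU find (which has no T size suffix, hence 1024G); <data> is a placeholder for your folder:

# List files larger than 1 TB, with human-readable sizes.
find <data>/ -type f -size +1024G -exec ls -lh {} +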

Data Archive File Size Agreement

All users of the Data Archive agree to bundle their files as much as possible, so that stored files fall within the ideal range of the Data Archive (1 GB - 200 GB) and the average file size stays above 1 GB. Full details can be found in our acceptable use policy. In the section below you can find information on packing your files. If you require any additional help with managing your data, or want to discuss how you can organise your data, you can request assistance from our advisors by contacting us via the service desk.
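To check whether a folder already meets this agreement, the following one-liner is a sketch (assuming GNU find; <data> is a placeholder for your folder) that reports the file count and average file size:

# Sum all file sizes and divide by the file count.
find <data>/ -type f -printf '%s\n' \
  | awk '{ total += $1; n++ } END { if (n) printf "%d files, average size %.1f MB\n", n, total / n / 1e6 }'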

Packing as a Solution for Storing Multiple Small Files

The easiest way to store numerous small files without compromising the performance of the Data Archive is to pack them into larger archive files. SURF has developed a utility called dmftar that is optimized for the Data Archive and makes this process as simple as possible. Below you can find instructions on how to use dmftar in a few common situations. Further details on how to use dmftar, or tar if dmftar is not available, can be found on the linked pages.

I have many small files on the Data Archive. How do I pack them and remove the small files?

  1. Take a moment to sort your small files into a logical set of folders and document their contents in a text file for future reference.

  2. Pack each folder with many small files into an archive file (or multivolume archive files if the contents are larger than 100 GB; a combined sketch follows these steps):

    dmftar -c -f <data>.dmftar <data>/
  3. After packing, you can verify whether all the files in the archive file match those in the original directory structure:

    dmftar --verify-content -f <data>.dmftar
  4. Once you have double-checked that all your files were successfully packed, you can remove the original folders of small files:

    rm -R <data>/
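Putting the steps together, a minimal sketch using a hypothetical folder name run42 (the name is an assumption):

# Pack the folder into an archive file.
dmftar -c -f run42.dmftar run42/

# Verify the archive contents against the original folder.
dmftar --verify-content -f run42.dmftar

# Remove the original folder once the verification succeeds.
rm -R run42/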

I have many small files on Snellius/Lisa. How do I pack and transfer them to the Data Archive?

  1. Take a moment to sort your small files into a logical set of folders and document their contents in a text file for future reference.

  2. Using dmftar you can transfer newly created archive files directly to your home folder on the Data Archive by adding user and system information to your packing command (provided key-based authentication and key-forwarding have been enabled; a combined sketch follows these steps):

    dmftar -c -f <login>@archive.surfsara.nl:<data>.dmftar <data>/
  3. After packing, you can verify whether all the files in the archive file match those in the original directory structure:

    dmftar --verify-content -f <data>.dmftar
  4. Once you have double-checked that the archived files were successfully transferred and contain all your small files, the local archive files can be removed with the following command:

    dmftar --delete-archive -f <data>.dmftar/
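Putting the steps together, a minimal sketch with hypothetical names (login jdoe, folder run42; both are assumptions):

# Pack run42/ and transfer the archive to the Data Archive in one step.
dmftar -c -f jdoe@archive.surfsara.nl:run42.dmftar run42/

# Verify the local copy of the archive against the original folder.
dmftar --verify-content -f run42.dmftar

# Remove the local archive files once the transfer has been verified.
dmftar --delete-archive -f run42.dmftar/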

How do I download a file to Snellius/Lisa from a packed archive file on the Data Archive?

dmftar can directly unpack and restore files remotely from the Data Archive (provided key-based authentication and key-forwarding have been enabled).

To preview the contents of an archive file:

dmftar -t -f <login>@archive.surfsara.nl:<data>.dmftar

To restore a whole archive file, use:

dmftar -x -f <login>@archive.surfsara.nl:<data>.dmftar

To restore a single file from the archive file remotely, use:

dmftar -x -f <login>@archive.surfsara.nl:<data>.dmftar/ <data>/<data1>/data1-1.dat
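As an illustration, a sketch with hypothetical names (login jdoe, archive run42.dmftar, file run42/config/params.txt; all are assumptions):

# List the archive contents remotely to find the exact path of the file.
dmftar -t -f jdoe@archive.surfsara.nl:run42.dmftar

# Restore only that file to the current directory on Snellius/Lisa.
dmftar -x -f jdoe@archive.surfsara.nl:run42.dmftar/ run42/config/params.txt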

I have a file >1 TB in size. How do I pack it into appropriately sized archive file bundles?

dmftar handles the bundling of large files automatically. Simply pack a large file as you would a folder of small files:

dmftar -c -f <data>.dmftar <bigfile>.dat

This will create a series of archive files with a default size of 100 GB each.

If you wish to unpack these files in the future, simply point dmftar at the archive:

dmftar -x -f <data>.dmftar
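For instance, a minimal sketch with a hypothetical 1.5 TB input file simulation.dat (the name is an assumption):

# Pack the large file; dmftar splits it into ~100 GB volumes automatically.
dmftar -c -f simulation.dmftar simulation.dat

# Verify the packed contents against the original before removing it.
dmftar --verify-content -f simulation.dmftar

# Restore the full file later.
dmftar -x -f simulation.dmftar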
