Introduction

SURF provides the Data Archive service to store large files or datasets for long-term preservation, or as temporary scale-out storage for users of the compute infrastructure. Files are stored on tapes that are managed and accessed by a tape library. A conventional disk-based file system is available for receiving files from the tapes and from the file systems of remote systems, for example the compute systems Lisa and Snellius.

As a user you can focus on structuring your files and folders and on making sure your data is staged when needed. You do not have to worry about where your data is stored on tape or which tape you need; this is handled automatically by the system.

Tools

Depending on the operating system you are using, different tools are available to access the Data Archive service.

If you only need to list your files or want to add files to the archive, you can also use a file transfer application that supports the SSH and/or SFTP protocols (like WinSCP for Windows or Cyberduck for macOS). To retrieve data using these applications, you first need to stage the data before you can transfer it to another system.

macOS/Linux

Use the built-in terminal application.

Windows

Use PuTTY or MobaXterm.

System login and general usage

The Data Archive system is Unix-based and requires you to log in to the login node with your username and password. The system also allows key-based authentication. A more in-depth introduction and explanation can be found on the SSH usage page.
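
Key-based authentication can be set up as for any other SSH server. The commands below are a minimal sketch, assuming you work from a macOS/Linux terminal; the key type is only an example:

ssh-keygen -t ed25519                          # create a key pair (skip if you already have one)
ssh-copy-id <username>@archive.surfsara.nl     # copy your public key to the archive login node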

Does your login have access to the archive? 


To find out if this is the case, do the following:

  1. In your browser, go to https://portal.surfsara.nl/ and log in with the username that was communicated to you
  2. Go to the menu on the left-hand side and select "Your profile"
  3. On the right-hand side you will see all systems you have access to. If "Data Archive" is not listed there, you do not have access.
  4. If you think you should have access to the system, please open a ticket at https://servicedesk.surfsara.nl/ and provide us with your grant number or your login.

Users of Lisa, Snellius

Users of the compute infrastructure who additionally have access to the Data Archive service have their own directory on the archive file system, visible from the compute infrastructure. You can access your archive directory via:

cd /archive/<username>

Archive-only users

If you only have access to the archive and not to other parts of the compute infrastructure, you can log in directly to the archive. Login can be done using a terminal or a tool that supports SFTP or SSH connections. Logging in directly on the Data Archive can be done via:

ssh <username>@archive.surfsara.nl

The SSH application may notify you about the authenticity of a newly connected host:

The authenticity of host 'archive.surfsara.nl (<IP>)' can't be established.
ECDSA key fingerprint is SHA256:<hash>.
Are you sure you want to continue connecting (yes/no)?

You can safely continue by entering 'yes', as long as the host is 'archive.surfsara.nl' and the IP address is in the range 145.100.xx.xx.

Basic commands

The usage of the archive file system is essentially transparent: the standard Unix commands (cp, ls, mv, ...) can be used to handle the files in the archive, but there are special commands to do things more efficiently. These are the so-called DMF commands and are explained later.

User home folder

When logged in directly to the archive, you start in your so-called home folder:

pwd
/archive/<username>

By default, you can only store data in this folder or any subfolder within it.

Shared folders

You may have access to one or more shared folders on the archive system. This means data can be accessed and/or modified in a location other than your home folder, for example a folder of another user, a group or an institution.

In this case you can simply change to that shared folder, e.g.:

cd /archive/<shared folder>

If you are copying data to the shared folder from a remote system using scp or another tool, make sure to use the absolute path as the target while authenticating with your own username:

scp file <username>@archive.surfsara.nl:/archive/<shared folder>/.

Create a new folder

A new subfolder, here named “workdir”, can be created with the following command:

mkdir /archive/<username>/workdir

Delete a file or folder

You cannot undo a delete operation, so be careful what you remove from the system!

To delete a file or folder, do the following:

rm -rf workdir

Removed files and folders are subject to the retention period policy of the archive.

Moving data within the archive or from a compute infrastructure

To copy data files to the folder created earlier, you can run the following command when working on e.g. Snellius:

cp $HOME/workdir/output* /archive/<username>/workdir

What happens here is that files from your Snellius home folder are copied to the stage area of the archive file system. Later on, these files will automatically be migrated to tape.

Now the other way around:

dmget -a /archive/barbara/workdir/output*
cp /archive/barbara/workdir/output* $HOME/workdir

Here, files are copied from tape to the stage area (if not already there), and subsequently copied to your home folder. See below to read more about staging files.

Retrieving files from tape to the stage area can take some time.

Transferring data to the archive from your local system

To transfer data to the archive, different transfer protocols can be used, such as SSH and GridFTP.

Default transfer tools

Use your terminal application to invoke transfer commands. The command syntax for tools such as scp, rsync and sftp is essentially the same:

[scp, rsync, sftp] <datafile> <user>@archive.surfsara.nl:<destination folder>

Example:

scp transferdata.tar.gz <user>@archive.surfsara.nl:/archive/<user>/workdir

For uploading with rsync, use the option -W or --whole-file. This prevents recalls from tape when existing files are updated, which significantly improves performance.

rsync -a -W <files> <user>@archive.surfsara.nl:/archive/<user>/

Transferring data from the archive to your local system

To transfer data from the archive to your own system, the same protocols can be used.

Default transfer tools

Use your terminal application to invoke transfer commands. The command syntax for tools such as scp, rsync and sftp is essentially the same:

[scp, rsync, sftp] <user>@archive.surfsara.nl:<folder>/<file> <local destination>

Example:

scp <user>@archive.surfsara.nl:file.tar.gz transferdata.tar.gz  

Note: no specific source folder is used here. When the folder is omitted, the source defaults to your own home folder on the archive.
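
Keep in mind that offline files must be staged before they can be retrieved. One possible approach from your local system is to trigger the staging over SSH first and then download the file. The sketch below assumes a file workdir/file.tar.gz in your archive home folder (the dmget command is explained in the DMF commands section below):

ssh <user>@archive.surfsara.nl "dmget -a workdir/file.tar.gz"    # stage the file from tape
scp <user>@archive.surfsara.nl:workdir/file.tar.gz .             # then transfer it to your local system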

Other transfer options

See Data Archive: Data transfers to and from the service for more data transfer options.

Parallel transfers using GridFTP

GridFTP allows for fast parallel transfers to and from the archive. You need special configuration, knowledge and tools to use this protocol. Please contact us via the service desk if you want to use it.

Special usage

The Data Archive service makes use of the Data Migration Facility (DMF) package, a hierarchical storage management system that eases the use of tape-based storage. Any user can make use of several so-called DMF commands that stage or migrate data files from and to the tape storage environment.

Basic principles

The Data Archive consists of a staging area and the tape storage environment. Files that are online exist in both areas, while files that are offline only exist on tape and therefore cannot be used directly. These latter files first need to be staged to the staging area. You therefore need to determine the actual status of your files before they are used (in another program or copied to a different system) and, if one or more of them are offline, make sure they are staged first.

DMF commands

Users of Snellius, Lisa and the archive can use the following commands:

  • dmput: Explicitly migrate files from the staging area to tape
  • dmget: Explicitly recall files or parts of files (also called staging) from tape to the staging area
  • dmcopy: Copy all or part of the data from a migrated file to an online file
  • dmls: List files and determine the status of these files
  • dmattr: Test in shell scripts whether a file is online or offline (see the sketch below)
  • dmfind: Search for files

See the online manual pages for details.
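
For example, dmattr can be used in a small shell script to recall only the files that are actually offline. The snippet below is a minimal sketch, assuming dmattr prints the DMF state (such as OFL or DUL) of a file by default:

# Stage only the files in the current folder that are still offline
for f in *; do
    if [ "$(dmattr "$f")" = "OFL" ]; then
        dmget -a "$f"    # recall the file from tape to the staging area
    fi
done

In practice you can also simply run dmget on all files, since files that are already in the staging area are not affected.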

DMF file status

A file status can have several values in DMF:

  • REG: Regular files are user files residing only on disk
  • MIG: Migrating files are files which are being copied from disk to tape
  • UNM: Unmigrating files are files which are being copied from tape to disk
  • Migrated files can be either of the following:
    • DUL: Dual-state files whose data resides both online and offline
    • OFL: Offline files whose data is no longer on disk

Although directories also show a status when listed, they will always remain regular (REG), as they are not stored on tape but only defined in the file system itself.

Listing file statuses

To determine the status (online or offline) of one or more files, use the "dmls" command:

$ dmls -l
-rw-------    1 hthta    staff     632792 Jul 26  1999 (OFL) file1
-rw-------    1 hthta    staff     632792 Jul 27  1999 (OFL) file2
-rw-------    1 hthta    staff      15884 Jul 27  1999 (REG) file3
-rw-------    1 hthta    staff     632792 Aug  2  1999 (DUL) file4
-rw-------    1 hthta    staff     632792 Jun 19 23:20 (MIG) file5

Here, two files are offline (status OFL) and not readily available for usage, one file is not stored on tape at all (status REG), one file is both on tape and in the staging area (status DUL), and one file is being migrated to tape (status MIG).

Staging files

To stage a file from tape to the staging area, use the "dmget" command in a specific directory on the archive system:

$ dmget <file name>

To stage multiple files at once, use a wildcard in a specific directory:

$ cd <folder>
$ dmget *

Once this command completes, the file listing above looks like:

$ dmls -l
-rw-------    1 hthta    staff     632792 Jul 26  1999 (DUL) file1
-rw-------    1 hthta    staff     632792 Jul 27  1999 (DUL) file2
-rw-------    1 hthta    staff      15884 Jul 27  1999 (DUL) file3
-rw-------    1 hthta    staff     632792 Aug  2  1999 (DUL) file4
-rw-------    1 hthta    staff     632792 Jun 19 23:20 (DUL) file5 

All files with status (OFL) have changed to (DUL). Files that were already in the staging area are not affected.

Storing files

It is important that you do not put many small files on the archive file system, but preferably only a few larger files. By small we mean less than 100 MB, by large we mean larger than 1 GB. If you have many small files, it is best to combine them into one large file, for example using the tar command:

cd $HOME/workdir
tar cvf /archive/barbara/workdir/output.tar output*

The reason is that if you archive many files, chances are that they will be allocated on hundreds of different tapes. Getting these files back will then take a lot of time, because every tape mount takes a significant amount of time.

The reverse of the above action would be:

cd $HOME/workdir
dmget -a /archive/barbara/workdir/output.tar
tar xvf /archive/barbara/workdir/output.tar

SURF provides additional tools to partially automate the checks and staging commands when storing data, see Data Archive: Optimal Archiving using dmftar guide for more information.

Archive and batch

In general, do not handle archived files in a batch job. Before a job starts, the files should already be available on the normal file system; use the examples above to make sure this is the case.

Archive and the Lisa system

On Lisa, the archive file system is not available on the batch nodes, so all handling of the archive file system has to be done interactively.

Archive and the Snellius system

On Snellius, the archive file system is available on the login and service nodes. One possibility is to create a dependent job, where the handling of archive data is done on a cheap service node. Please see the Batch - Howto.
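
As an illustration, such a job chain could look like the sketch below, assuming the Slurm batch system is used (see the Batch - Howto for the actual partitions and details). The script names stage_data.sh and compute_job.sh are hypothetical; the first would contain the dmget and cp commands from the examples above, the second the actual computation:

# Submit the staging job and capture its job ID (--parsable prints only the ID)
STAGE_ID=$(sbatch --parsable stage_data.sh)
# The compute job starts only after the staging job has finished successfully
sbatch --dependency=afterok:${STAGE_ID} compute_job.sh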