Introduction
SURF provides the Data Archive service to store large files or datasets for long-term preservation, or as scale-out storage for archiving the final data of compute infrastructure users. Files are stored on tapes that are managed and accessed by a tape library. A conventional disk-based file system is available for receiving files from the tapes and from the file systems of remote systems, for example the compute systems (e.g. Lisa and Snellius).
As a user, you can focus on structuring your files and folders and on making sure data is staged when you need it. You don't have to worry about where your data is stored on tape or which tape you need; the system handles this automatically.
Tools
Depending on the operating system you are using, different tools are available to access the Data Archive service.
If you only need to list your files or want to add files to the archive, you can also use a file transfer application that supports the SSH and/or SFTP protocols (like WinSCP for Windows or Cyberduck for MacOS). To retrieve data using these applications, you first need to stage the data before you can transfer it to another system.
MacOS/Linux
Use the built-in terminal application.
Windows
If you need to list your files or want to add files, you can also use a file transfer application that supports the SSH and/or SFTP protocols (like WinSCP).
System login and general usage
The Data Archive system is Unix-based and requires you to log in to the login node with your username and password. The system also allows key-based authentication. A more in-depth introduction and explanation can be found on the SSH usage page.
Does your login have access to the archive?
To find out if this is the case, do the following:
- In your browser, go to https://portal.surfsara.nl/ and log in with the username that was communicated to you
- Go to the menu on the left hand side and select "Your profile" (see image on the right)
- On the right hand side you will see all systems you have access to. If the "Data Archive" is not listed there you have no access.
- If you think you should have access to the system, please open a ticket at https://servicedesk.surfsara.nl/ and provide us with your grant number or your login.
Users of Lisa, Snellius
Users of the compute infrastructure who (additionally) have access to the Data Archive service have their own directory on the archive file system, visible from the compute infrastructure. A user can access their archive directory via:
cd /archive/<username>
Archive-only users
If a user has access only to the archive and not to other parts of the compute infrastructure, they can log in directly to the archive. This can be done using a terminal or a tool supporting SFTP or SSH connections. Logging in directly on the Data Archive can be done via:
ssh <username>@archive.surfsara.nl
The SSH application may ask you to confirm the authenticity of a newly connected host:
The authenticity of host 'archive.surfsara.nl (<IP>)' can't be established. ECDSA key fingerprint is SHA256:<hash>. Are you sure you want to continue connecting (yes/no)?
You can safely continue by entering 'yes', as long as the host is 'archive.surfsara.nl' and the IP is in the range 145.100.xx.xx.
Basic commands
The usage of the archive file system is essentially transparent: the standard Unix commands (cp, ls, mv, ...) can be used to handle the files in the archive, but there are special commands to do things more efficiently. These are the so-called DMF commands, which are explained later.
User home folder
If logged in directly to the archive, the user starts in the so-called home folder:
$ pwd
/archive/<username>
By default, you can only store data in this folder or any subfolder within it.
Shared folders
A user may have access to one or more shared folders on the archive system. This means data can be accessed and/or modified in a location other than the user's home folder, for example a folder of another user, a group or an institution.
In this case you can simply change path to that shared folder, e.g.:
cd /archive/<shared folder>
If you are copying data using scp or another tool to the shared folder from a remote system, make sure to use the absolute path in the target while using your own user name as authentication:
scp file <username>@archive.surfsara.nl:/archive/<shared folder>/.
Create a new folder
A new subfolder, here named “workdir”, can be created with the following command:
mkdir /archive/<username>/workdir
Delete a file or folder
You cannot undo a delete operation, so be careful what you remove from the system!
To delete a file or folder, do the following:
rm -rf workdir
Removed files and folders are subject to the retention period policy of the archive.
Moving data within the archive or from a compute infrastructure
To copy data files to the folder created earlier, one can run the following command when working in e.g. Snellius:
cp $HOME/workdir/output* /archive/<username>/workdir
Here, files from your Snellius home folder are copied to the stage area of the archive file system. Later on, these files are automatically replicated to tape.
Now the other way around:
dmget -a /archive/barbara/workdir/output*
cp /archive/barbara/workdir/output* $HOME/workdir
Here, files are copied from tape to the stage area (if not already there), and subsequently copied to your home folder. See below to read more about staging files.
Retrieving files from tape to the stage area can take some time.
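The two steps above (stage first, then copy) can be wrapped in a small shell function so that the copy only starts once staging has succeeded. This is a minimal sketch: the function name retrieve_from_archive is hypothetical and not a SURF tool, and the only DMF call it makes is dmget -a as used above.

```shell
# Hypothetical helper: stage files with dmget (when available), then copy them.
# Usage: retrieve_from_archive <destination folder> <file>...
retrieve_from_archive() {
    local dest=$1
    shift
    # Only call dmget where it exists, so the function falls back to a
    # plain copy on systems without the DMF commands.
    if command -v dmget >/dev/null 2>&1; then
        dmget -a "$@" || return 1
    fi
    # Copy only after staging has finished successfully.
    mkdir -p "$dest" && cp -- "$@" "$dest"/
}
```

On Snellius this could be called with $HOME/workdir as the destination and the output* files in your archive workdir as the remaining arguments.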
Transferring data to the archive from your local system
To transfer data to the archive, different transfer protocols can be used, such as SSH and GridFTP.
Default transfer tools
Use your terminal application to invoke transfer commands. The commands for tools such as scp, rsync and sftp are all the same.
[scp, rsync, sftp] <datafile> <user>@archive.surfsara.nl:<destination folder>
Example:
scp transferdata.tar.gz <user>@archive.surfsara.nl:/archive/<user>/workdir
For uploading using rsync, use the option -W (or --whole-file). This prevents recalls from tape when files are being updated, which improves performance considerably.
rsync -a -W <files> <user>@archive.surfsara.nl:/archive/<user>/
Transferring data from the archive to your local system
To transfer data from the archive to your own system, the same protocols can be used.
Default transfer tools
Use your terminal application to invoke transfer commands. The commands for tools such as scp, rsync and sftp are all the same.
[scp, rsync, sftp] <user>@archive.surfsara.nl:<folder>/<file> <local destination>
Example:
scp <user>@archive.surfsara.nl:file.tar.gz transferdata.tar.gz
Note: no source folder is specified here. When the folder is omitted, the source defaults to your home folder on the archive.
Other transfer options
See "Data Archive: Data transfers to and from the service" for more data transfer options.
Parallel transfers using GridFTP
GridFTP allows for fast parallel transfers to and from the archive. You need special configuration, knowledge and tools to use this protocol. Please contact us via the service desk if you want to use it.
Special usage
The Data Archive service makes use of the Data Migration Facility (DMF) package, a hierarchical storage management system that eases the use of tape-based storage. Any user can use the so-called DMF commands, which stage or migrate data files from and to the tape storage environment.
Basic principles
The Data Archive consists of a staging area and the tape storage environment. Files that are online exist in both areas, while files that are offline exist only on tape and therefore cannot be used directly; they need to be staged to the staging area first. Before using files (in another program, or when copying them to a different system), check their status and stage any that are offline.
DMF commands
Users of Snellius, Lisa and the archive can use the following commands:
- dmput: Explicitly migrate files from the staging area to tape
- dmget: Explicitly recall files or parts of files (also called staging) from tape to the staging area
- dmcopy: Copy all or part of the data from a migrated file to an online file
- dmls: List files and determine their status
- dmattr: Test in shell scripts whether a file is online or offline
- dmfind: Search for files
See the online manual pages for details.
DMF file status
A file status can have several values in DMF:
- REG: Regular files are user files residing only on disk
- MIG: Migrating files are files which are being copied from disk to tape
- UNM: Unmigrating files are files which are being copied from tape to disk
- Migrated files can be either of the following:
  - DUL: Dual-state files whose data resides both online and offline
  - OFL: Offline files whose data is no longer on disk
Although directories themselves have statuses as well when listed, they will always remain regular (REG) as they are not stored on tape. They are only defined in the file system itself.
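In a script, the statuses above can be reduced to one question: is the file's data on disk right now? The sketch below encodes the list as a shell case statement. The function name dmf_data_on_disk is hypothetical, and grouping UNM with OFL (data not yet fully on disk) is an interpretation of the descriptions above rather than documented DMF behaviour.

```shell
# Hypothetical helper: succeed (exit 0) if a DMF status code means the
# file's data is present on disk, fail (exit 1) if it is not.
dmf_data_on_disk() {
    case "$1" in
        REG|MIG|DUL) return 0 ;;   # data is (still) present on disk
        UNM|OFL)     return 1 ;;   # data is not yet, or no longer, fully on disk
        *)           echo "unknown DMF status: $1" >&2; return 2 ;;
    esac
}
```

A caller can then write, for example, `if dmf_data_on_disk "$status"; then ...` and stage the file in the else branch.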
Listing file statuses
To determine the status (online or offline) of one or more files, use the "dmls" command:
$ dmls -l
-rw------- 1 hthta staff 632792 Jul 26 1999 (OFL) file1
-rw------- 1 hthta staff 632792 Jul 27 1999 (OFL) file2
-rw------- 1 hthta staff 15884 Jul 27 1999 (REG) file3
-rw------- 1 hthta staff 632792 Aug 2 1999 (DUL) file4
-rw------- 1 hthta staff 632792 Jun 19 23:20 (MIG) file5
Here two files are offline (status OFL) and not readily available for usage, one file is not stored on tape at all (status REG), one file is on tape and in the staging area (status DUL) and one file is being migrated to tape (status MIG).
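To pick out the offline files from such a listing automatically, the output can be filtered with awk. This is a sketch that assumes the layout shown above, with the status in parentheses directly before the file name; list_offline is a hypothetical helper name.

```shell
# Hypothetical helper: read `dmls -l` output on stdin and print the names
# of the files whose status is (OFL).
list_offline() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "(OFL)") {
                # Everything after the status field is the file name.
                name = $(i + 1)
                for (j = i + 2; j <= NF; j++) name = name " " $j
                print name
                break
            }
        }
    }'
}
```

It would be used as `dmls -l | list_offline`.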
Staging files
To stage a file from tape to the staging area, use the "dmget" command in a specific directory on the archive system:
$ dmget <file name>
To stage multiple files at once, use a wildcard in a specific directory:
$ cd <folder>
$ dmget *
Once this command completes, the file listing above looks like:
$ dmls -l
-rw------- 1 hthta staff 632792 Jul 26 1999 (DUL) file1
-rw------- 1 hthta staff 632792 Jul 27 1999 (DUL) file2
-rw------- 1 hthta staff 15884 Jul 27 1999 (DUL) file3
-rw------- 1 hthta staff 632792 Aug 2 1999 (DUL) file4
-rw------- 1 hthta staff 632792 Jun 19 23:20 (DUL) file5
All the files with (OFL) changed their status to (DUL). Any files that are already in the staging area will not be affected.
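A wildcard only covers a single directory. To stage everything below a folder, including subfolders, the file list can be built with find and handed to dmget in bulk. The sketch below assumes GNU-style find and xargs; stage_tree and the DMGET override variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: stage all regular files below a folder.
# Set DMGET=echo for a dry run that only prints the command that would run.
stage_tree() {
    local dir=$1
    # -print0 / -0 keep file names with spaces intact; xargs passes many
    # files to each dmget call instead of starting one call per file.
    # -r skips the call entirely when no files are found.
    find "$dir" -type f -print0 | xargs -0 -r ${DMGET:-dmget} -a
}
```

Running `DMGET=echo stage_tree <folder>` first shows which files would be requested before any tape recall is triggered.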
Storing files
It is important that you don't put many small files on the archive file system, but preferably only a few large files. By small we mean less than 100 MB, by large we mean larger than 1 GB. If you have many small files, it is best to combine them into one large file, for example using the tar command:
cd $HOME/workdir
tar cvf /archive/barbara/workdir/output.tar output*
The reason is that if you archive many files, chances are that they will be spread over hundreds of different tapes. Getting these files back will then take a long time, because every tape mount takes a significant amount of time.
The reverse of the above action would be:
cd $HOME/workdir
dmget -a /archive/barbara/workdir/output.tar
tar xvf /archive/barbara/workdir/output.tar
SURF provides additional tools to partially automate the checks and staging commands when storing data; see the guide "Data Archive: Optimal Archiving using dmftar" for more information.
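To judge whether a folder should be combined into a tar file before archiving, it can help to count how many of its files fall under the 100 MB threshold mentioned above. The following sketch does that with find. The helper name count_small_files is hypothetical, and 100 MB is taken here as 100 * 1024 * 1024 bytes, which is an assumption about how the threshold is meant.

```shell
# Hypothetical helper: count regular files below a folder that are smaller
# than 100 MB (assumed here to mean 100 * 1024 * 1024 bytes).
count_small_files() {
    local dir=$1
    local limit=$((100 * 1024 * 1024))
    # The "c" suffix makes find compare sizes in bytes.
    find "$dir" -type f -size "-${limit}c" | wc -l | tr -d ' '
}
```

A high count for a folder suggests packing it with tar before copying it to the archive.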
Archive and batch
In general, do not handle archived files in a batch job. Before a job starts, the files should be available on the normal file system, using the examples above.
Archive and the Lisa system
On Lisa, the archive file system is not available on the batch nodes, so all handling of the archive file system has to be done interactively.
Archive and the Snellius system
On Snellius, the archive file system is available on the login and service nodes. One possibility is to create a dependent job, in which the archive data is handled on a low-cost service node. Please see the Batch - Howto.
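A dependent job chain could be sketched as follows, assuming the batch system is Slurm (this page does not say so, so treat it as an assumption). The function name submit_chain, the SBATCH override variable and the two job script arguments are hypothetical; the partition needed to land the first job on a service node is deliberately left out, because it is not given here.

```shell
# Hypothetical helper: submit a staging job, then a compute job that starts
# only after the staging job has finished successfully.
# Usage: submit_chain <staging job script> <compute job script>
submit_chain() {
    local stage_script=$1 compute_script=$2
    local stage_id
    # --parsable makes sbatch print just the job ID (optionally followed
    # by ";cluster"), which is easy to capture.
    stage_id=$(${SBATCH:-sbatch} --parsable "$stage_script") || return 1
    stage_id=${stage_id%%;*}
    # afterok: run the compute job only if the staging job exited with 0.
    ${SBATCH:-sbatch} --dependency="afterok:${stage_id}" "$compute_script"
}
```

The staging script would contain the dmget and cp commands, so that the compute job finds its input on the normal file system when it starts.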