Purpose

Sometimes your jobs won't run, or you can't compile programs, because of a "quota exceeded" error, and the problem persists even though checking /scratch-local or /scratch-shared shows no excess usage. This document can help you resolve such issues.

If what you actually want is to clean up your home directory or project space, you may find the tutorial on efficiently cleaning up your home directory or project space helpful.

Explanation: Why can't I find the data on scratch?

Background

Most nodes have a local scratch filesystem. "Local" is meant very literally: it refers to the SSD drives physically installed in the node. The variable $TMPDIR therefore points to that specific piece of hardware on the node you are currently logged in to. You can see this for yourself by following the steps below after logging in to Snellius:

cd $TMPDIR            
hostname ; pwd                    
touch MYFILE          
  • The first command moves you to your local scratch directory, usually under /scratch-local
  • The second command, `hostname ; pwd`, shows you the node you are on and your current path
  • The third command creates an empty file called `MYFILE`.

Now change to a different node, say int5 for this example, check the hostname and path for good measure, and get a directory listing:

ssh int5
hostname ; pwd
ls -l *

You will see that the file you have just created is not there. The reason is that you are now looking at the local scratch SSD of node int5, not at the one where you created the file. The same holds for all nodes: the interactive nodes int4-5 and gcn1, the batch nodes (all the nodes that execute jobs), and the staging and cbuild nodes.

Quota errors are often caused by the number of inodes used (files and directories) rather than by the amount of storage. This can be confusing: it is hard to fill 8 TB of space, but quite easy to exceed 1 million inodes when a machine learning project generates many small files. To learn how to prevent this, check out this document: Best Practices for Data Formats in Deep Learning
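
If you want to check whether inodes rather than disk space are the problem, standard tools can give a quick indication. This is a minimal sketch assuming GNU coreutils/findutils; ~/my_project is just a placeholder for a directory you want to inspect, and the exact quota-reporting tools on Snellius may differ:

df -i $HOME                        # inode usage of the filesystem behind your home directory
find ~/my_project -xdev | wc -l    # count the files and directories in one project tree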

Where is your data?

The particular problem of orphaned scratch data is caused by jobs running on the batch nodes: they do their usual computation work and write results to the local scratch of the node(s) where the job runs. Once the job is done, the data stays on the scratch of those nodes. Sometimes you are lucky and remember where the job ran, or you can look up the node list with the `sacct` command. But most of the time this is not obvious, or a job ran on several nodes, and it's not easy to know which one is causing the error.
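
If you do want to check which nodes a recent job used, `sacct` can list them for you. A minimal sketch, assuming a standard SLURM setup; the time window and format fields are only examples:

# list your jobs of the past week together with the nodes they ran on
sacct -u $USER -S $(date -d '-7 days' +%F) --format=JobID,JobName%20,NodeList,State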

How to resolve the issue

A note first:

The process involves using the `find` command to walk through the node-specific directories of a GPFS filesystem and delete the content it finds in your per-node scratches. This takes time, and we may kill any process on the int nodes that takes more than a certain amount of time or runs out of bounds. It is also an annoyance to keep it running in the background or to keep a terminal open just for that. The int nodes are also far less powerful than the staging or cbuild nodes, which are meant specifically for tasks like these. The procedure I am going to explain therefore involves creating a job script and sending it to the cbuild partition for execution. The cost for that is between 1 and 32 SBU per hour; since this process uses only a single CPU and takes some 45 minutes on a cbuild node, it should cost you about 1 SBU in total.

The procedure:

It all boils down to this single line of code: 

find -O3 /gpfs/scratch1/nodespecific/ -maxdepth 2 -type d -user YOURLOGIN -exec rm -rf {} \;
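
If you first want to see which directories would be removed, you can run the same search without the delete action; a harmless dry run you can do from a login node:

# dry run: only print the matching directories, delete nothing
find -O3 /gpfs/scratch1/nodespecific/ -maxdepth 2 -type d -user YOURLOGIN -print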


But rather than running it interactively, we are going to wrap it into a job script and send it to cbuild. Here is the script:

#!/bin/bash

#SBATCH -p cbuild                # run on the cbuild partition
#SBATCH -t 01:00:00              # walltime of 1 hour
#SBATCH --mail-type=END,FAIL     # send an email when the job ends or fails
#SBATCH --mail-user=put_your@mail.here

srun find -O3 /gpfs/scratch1/nodespecific/ -maxdepth 2 -type d -user YOURLOGIN -exec rm -rf {} \;


Substitute YOURLOGIN with your login (obviously) and put your email address in the --mail-user argument; you will then get an email when the job ends or fails.
I set the walltime to 1 hour just to be sure there is enough time; the job shouldn't need more than that. You are only charged for the time the job actually runs, not for the full walltime, but you are welcome to set it to whatever you want.

The final step is to save the script (let's say we call it "scratchcleanup.sh") and send it to the queue as a regular SLURM job:

sbatch scratchcleanup.sh
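
While the job waits or runs, you can keep an eye on it with the usual SLURM tools, and afterwards verify that nothing of yours is left behind by re-running the find in print-only mode (both commands are optional):

squeue -u $USER        # is the cleanup job still queued or running?
find -O3 /gpfs/scratch1/nodespecific/ -maxdepth 2 -type d -user YOURLOGIN -print   # should return little or nothing once the cleanup is done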



A last consideration:

There is another trivial way of resolving this issue: all scratch directories keep data only for a limited period of time, the local scratches for 6 days and /scratch-shared for 14. If you can wait 6 days without doing any work, the problem will resolve itself on its own.




