Synopsis

A job script generally consists of the following parts.

  1. Specification of the job requirements for the batch system (number of nodes, expected runtime, etc)
  2. Loading of modules needed to run your application
  3. Preparing input data (e.g. copying input data from your home directory to scratch, preprocessing)
  4. Running your application
  5. Aggregating output data (e.g. post-processing, copying data from scratch to your home)

In the following paragraphs we explain each of these parts. Note that steps 2, 3 and 5 are not always required.

SLURM batch system

Synopsis

Snellius uses the workload manager SLURM. As a cluster workload manager, Slurm has three key functions.

First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.

Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.

Finally, it arbitrates contention for resources by managing a queue of pending work.

Defining the requirements of a job

For the batch system to know which nodes to allocate to you, and for how long, you need to start your job script by defining options for the SLURM batch system. SLURM reads these options from the start of your job script until it encounters the first line that is not a comment, not empty and does not start with #SBATCH - hence all your SLURM options should be at the start of your job script, before anything else.
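
For illustration, a minimal sketch of such a script header (the values are placeholders; the individual parts are explained in the sections below):

#!/bin/bash
#SBATCH -J myfirstjob
#SBATCH -N 1
#SBATCH -t 1:00:00
# The first regular command ends the option block; #SBATCH lines after this point are not read.
module load 2022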

Shell

You can define the interpreter for your job script with a shebang, just like in any other script. For example, if you want to use Bash, you should start your script with

#!/bin/bash

Job name

You can specify a name for the job allocation using

#SBATCH -J myfirstjob

When querying SLURM jobs on the system, both the job ID number and the specified job name will be shown. The default job name is the name of the batch script.

For job names on Snellius, valid characters are lowercase a to z, uppercase A to Z, numbers 0 to 9, period (.), underscore (_), forward slash (/) and hyphen (-).
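
As a quick check, you can list your own jobs with squeue; its default output includes both the job ID and the job name:

squeue -u $USER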


 Wall clock time

You can set the duration for which the nodes remain allocated to you (known as the wall clock time), using

#SBATCH -t 1:30:00

The duration can be specified in minutes, or in MM:SS, HH:MM:SS (as in the example) or D-HH:MM:SS format. In general, the maximum walltime for CPU jobs is 120 hours (5 days); for jobs submitted to the GPU queue, the maximum walltime is also 120 hours. Some queues have a different maximum walltime.

Choose your wall clock time carefully: if your job is still running after the wall clock time has been exceeded, it will be cancelled and you will lose your results!

In general, we recommend timing a minimal version of your computational task in order to estimate how long the total task will take. Then, choose your wall clock time liberally based on your expected runtime (e.g. 1.5-2x higher). Once you gain more experience in running similar-sized jobs, you could consider setting walltimes more accurately, as shorter jobs are slightly easier for the scheduler to fit in.
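
One way to refine this estimate is to check how long a finished test job actually took, for example with sacct (the job ID below is a placeholder):

sacct -j 1234567 --format=JobID,JobName,Elapsed,Timelimit,State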

Partition

On our systems, different types of nodes are grouped into partitions, each having different limits for wall clock time, job size, access, etc. Partitions can overlap, i.e. compute nodes can be contained in several partitions (e.g. the "normal" and "short" partitions share a large number of nodes, but they have different maximum wall time limits).

Users can therefore request a specific node type by submitting to the partition that contains it (from those available on each system).

To set the partition you want to submit to, use the -p argument. For example, to run on the thin partition (available on Snellius), you would specify:

#SBATCH -p thin

If you don't specify a partition, your job will be scheduled on the thin partition by default. You can check all available partitions, including their maximum walltime and number of nodes, using the sinfo command.

The available compute partitions that batch jobs can be submitted to can be listed by issuing the following command

sinfo

You can get a summary of the available partitions by using the -s flag.

sinfo -s

The available partitions of Snellius are described here
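
If you only need a few columns, sinfo also accepts an output format string; for example, the sketch below prints the partition name, time limit and node count columns:

sinfo -o "%P %l %D"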


Number of nodes

To define the number of nodes you need, use the -N argument. For example, if your job requires 1 node, you can use

#SBATCH -N 1

Alternatively, if your application is e.g. an MPI program, you may want to specify the number of (MPI-)tasks you want to to run using the -n argument. For example, to request an allocation for 16 tasks:

#SBATCH -n 16

SLURM is aware of how many cores each node contains, and typically requesting 16 tasks will allocate a node with 16 cores for you (unless you specify a different number of --tasks-per-node, see below).

Number of tasks per node

For practically all jobs, you will want to set the number of nodes and wall clock time. For particular jobs, you may also need to set the number of tasks per node. For example, MPI programs will use this number to determine how many processes to run per node. You can set it (to 16) using

#SBATCH --tasks-per-node 16

Sometimes, you may want to run fewer than 16 processes. For example, if your application uses too much memory to run more instances in parallel, or if you have an OpenMP/MPI hybrid application in which each MPI task is multithreaded. Finally, some applications run more efficiently if you leave one core free (i.e. specify 15 tasks for a 16-core node); the only way to know this is to test your application with both settings.

By default, this will allocate 1 CPU per task. If you would like to advise the SLURM controller that job steps will require a different number of processors per task, you can request this using the following options:

#SBATCH --ntasks 8
#SBATCH --cpus-per-task 2
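
As an illustration of how these options combine, the sketch below runs a hybrid MPI/OpenMP job with 8 tasks of 2 threads each (my_hybrid_program is a placeholder; mpiexec can be used instead of srun):

#SBATCH --ntasks 8
#SBATCH --cpus-per-task 2

# Give each MPI task as many OpenMP threads as CPUs were allocated to it
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun my_hybrid_program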

Number of GPUs per node

For GPU equipped nodes, you may want to request a specific number of GPU devices. This can be done using the GPU scheduling options provided in SLURM. 

To request two GPUs for the job, you can use the option:

#SBATCH --gpus=2

Similarly to the tasks-per-node option shown above, you can also request a fixed number of GPUs per node by adding the following option to your job script:

#SBATCH --gpus-per-node=4

Check the sbatch command help output for more options.
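
Putting this together, the header of a GPU job could look like the sketch below (the partition name gpu is an assumption; check sinfo for the partitions available on your system):

#SBATCH -p gpu
#SBATCH -N 1
#SBATCH --gpus-per-node=4
#SBATCH -t 2:00:00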


Memory requirements

You can explicitly request the amount of memory needed on the node (please consult each HPC system's node specifications), e.g.

#SBATCH --mem=60G


Additional requirements

You can request a node with specific features, using the --constraint option, e.g.:

#SBATCH --constraint=scratch-node

This will request a node with a local disk.

Alternatively, resources such as the number of GPUs (within a GPU partition) can be requested using the Generic RESources (GRES) option available in SLURM.

#SBATCH --gres=gpu:1

This will select one TitanRTX GPU out of the 4 available in each node.

Constraints and GRES are system and partition specific, please check the Snellius specifications to find out which features are available on each system's nodes. 

Automatic job restart after system failure

If a running job ends because of a system failure, the batch system will re-submit the job. Usually this is what you want, but in some cases it may not be desirable that your partially run jobs are started again. To prevent such an automatic restart, specify

#SBATCH --no-requeue

Running X11 programs

Running programs that use X11 is not recommended in a batch system, as there is no possibility to interact with the program anyway. However, for programs that insist on opening an X11 window, you can start an X-server in the job itself. At the start of the job script, put

export DISPLAY=:1
Xvfb $DISPLAY -auth /dev/null &

Note that it is not possible to get output from or provide input to the X11 window. 

Mail from a job

SLURM can send e-mail at various stages in your job. With

#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=jan.janssen@uni.edu

you can specify the stages and destination e-mail address.

Multiple requirements

Note that you can specify as many requirements as you want, e.g.

#SBATCH --constraint=avx2 -N 2 --tasks-per-node=8 -t 120:00:00

However, specifying many requirements limits the number of nodes that can run your job, and your job may stay in the queue longer. Also, if you ask for a combination of requirements that no node can satisfy, the job will be rejected. For example, all avx2 nodes have only 96 GB of memory, so if you specify

#SBATCH --constraint=avx2 --mem=1T

your job will be rejected.

Default options

SLURM will use default values for options that you don't explicitly set. For example, the default number of nodes is 1, and the default walltime is 10 minutes. However, for clarity, we recommend always setting these two options explicitly.

The default number of processes per node (--tasks-per-node) and the default number of threads per task (--cpus-per-task) is 1. This affects how many processes are launched by mpiexec when running an MPI / OpenMP / hybrid program. Thus, in order to use all cores in a node, make sure to set the number of tasks and the number of CPUs per task explicitly for MPI and OpenMP programs.
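
For example, to let an MPI program use every core of a single node, set the number of tasks per node explicitly (the value 128 below assumes a 128-core node; adjust it to the node type you request, and my_mpi_program is a placeholder):

#SBATCH -N 1
#SBATCH --tasks-per-node=128

mpiexec my_mpi_program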

Overview of SLURM options

Common options that can be passed to the SLURM batch system, either on the

sbatch

command-line, or as an

#SBATCH

attribute:

  • -N [value]
  • -n [value]
  • -c [value]
  • --tasks-per-node=[value]
  • --cpus-per-task=[value]
  • --mem=[value]
  • --constraint=[value]
  • -t [HH:MM:SS] or [MM:SS] or [minutes]
  • -p [partition]
  • --requeue

Additional options

The options presented above are just a subset of all available options that can be used in your job script. You can see all available SBATCH flags with the command:

sbatch --help



Loading modules

There are many software applications on our systems for which multiple versions have been installed. This is generally done because some users need an older version for compatibility, while others may want to use features only present in a newer version. The module system allows users to easily select which specific version of a software package they want to use. Modules also enable you to use specific compilers or link-specific library versions when installing your own software applications.

A more detailed introduction to using modules and the module environment can be found on this tutorial page.

The environment module system used on Snellius is Lmod.

To see which modules are available, use

module avail

or to find all possible modules and extensions 

module spider

To load a module, use

module load [module_name]

For example, to load the 2022 software environment:

module load 2022

For example, you can load Python by using the following commands

module load 2022
module load Python/3.10.4-GCCcore-11.3.0

Default modules are disabled on Snellius. You need to specify the full version of the module you want to load (e.g: Python/3.10.4-GCCcore-11.3.0)

In order to minimize bugs in job scripts and have more consistent behaviour, users should load the exact version of the software module.

To check which versions of python are available, you can use

module avail Python

which will show something like

Python/2.7.18-GCCcore-10.3.0-bare  Python/3.9.5-GCCcore-10.3.0  Python/3.9.5-GCCcore-10.3.0-bare

The module avail  command will match all parts of the complete module+version string. So, for example:

module avail GCC

will also match

pkgconfig/1.5.5-GCCcore-11.3.0-python
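
To verify which modules (and versions) are currently loaded in your session or job script, use:

module list

and to unload all of them again:

module purge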



Copy input data to scratch

The scratch disk provides temporary storage that is much faster than the home file system. This is particularly important if, for example, you launch 16 processes on a single node that each need to read in files (or worse, you launch an MPI program over 10 nodes, with 16 processes per node). Your input files will then be read 16 (or, in the MPI case: 160) times. If you have 1 or 2 tiny input files, it may be acceptable to read these directly from the home file system. If you have many files and/or very large files, however, this will put too high a load on the home file system - and slow down your job considerably.

The solution is to copy your input data to the local scratch disk of each node before starting your application. Then, each of the 16 processes on that node can read the input data from the local scratch disk. For the single-node example, you reduce the number of file reads from the home file system from 16 to 1 (i.e. only the copy operation).

Copying data to scratch for a single node job

The $TMPDIR environment variable points to a temporary folder on the local scratch disk and can be used to write files to scratch. For a single-node job, copying your input can simply be done with the cp command. For example, to copy a single big input file from your home directory to the local scratch disk

cp $HOME/big_input_file "$TMPDIR"

Or, to copy a whole directory with input files

cp -r $HOME/input_dir "$TMPDIR"

Copying data to scratch for a multi-node job

For the MPI example involving 10 nodes, copying the data to each of the local scratch disks would still result in the input files being read 10 times. To avoid that, we have developed the mpicopy tool. mpicopy reads the file from the home file system only on the first node and, from there, broadcasts it to all nodes assigned to you. To use the mpicopy tool you need to load the mpicopy and OpenMPI modules first. You can specify a target directory using the -o argument, but by default mpicopy copies to the $TMPDIR directory. For example

module load 2020
module load mpicopy
module load OpenMPI/4.0.3-GCC-9.3.0 
mpicopy $HOME/big_input_file "$TMPDIR"/

Note that mpicopy also copies directories recursively by default; you don't need to specify the -r option.

mpicopy $HOME/input_dir

An MPI program can then be started with the corresponding input file

mpiexec my_program "$TMPDIR"/big_input_file

Executing your program

The way you want to run your program may be very specific to your problem: potentially you need to do preprocessing, run an actual simulation, do postprocessing, etc. Moreover, it depends on whether - and how - your program is parallelized.

In this section, we limit ourselves to the simplest scenario: running a single instance of a serial (i.e. non-parallel) program, taking a single input file as an argument. For that, you add a line like

my_program "$TMPDIR"/big_input_file

to your job script.

This is the simplest example, but it is not the way you should generally use an HPC system! Running only a single instance of a serial program, you will use only one core of a single node, leaving the other cores idle. This, of course, is a waste of computational power (and a waste of your budget). In practice, you will want to use parallelization to use all cores in a node, or even multiple nodes.

Note: the need for parallelization depends on how 'heavy' your program is. If you have some simple pre- and post-processing steps, it is of course fine to run these as serial programs.
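
As a minimal illustration of using more cores with a serial program, you can start several independent instances in the background and wait for all of them to finish (the input file names are placeholders):

my_program "$TMPDIR"/input_file_1 &
my_program "$TMPDIR"/input_file_2 &
my_program "$TMPDIR"/input_file_3 &
my_program "$TMPDIR"/input_file_4 &
wait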


Temporary files

Some programs may generate temporary files. If your program does, and if you can set the location where it stores the temporary files (e.g. as an argument or through a configuration file), make your program use the (fast) scratch space (accessible using the "$TMPDIR" environment variable). Other programs will use the current directory (i.e. the directory your shell was in when you launched the program) to store temporary files. If that is the case, change to the "$TMPDIR" directory before launching your program. E.g.

cd "$TMPDIR"
$HOME/my_program

Do not use the /tmp directory to store temporary files; it may cause the node to crash! On the nodes, /tmp has a limited size and should only be used by system processes.


Copy output data from scratch

To copy output data from scratch back to your home folder after running a single-node job, use

cp "$TMPDIR"/output_file $HOME

or for a directory containing output

cp -r "$TMPDIR"/output_dir $HOME

WARNING: after your job finishes, the "$TMPDIR" on the scratch disk is removed by the batch system. If you forget to copy your results back to your home directory, they will be lost!

Final job script

Incorporating the instructions above, a complete job script could, for example, look like this:
#!/bin/bash
#Set job requirements
#SBATCH -n 16
#SBATCH -t 5:00

#Loading modules
module load 2022
module load Python/3.10.4-GCCcore-11.3.0

#Copy input file to scratch
cp $HOME/big_input_file "$TMPDIR"

#Create output directory on scratch
mkdir "$TMPDIR"/output_dir

#Execute a Python program located in $HOME, that takes an input file and output directory as arguments.
python $HOME/my_program.py "$TMPDIR"/big_input_file "$TMPDIR"/output_dir

#Copy output directory from scratch to home
cp -r "$TMPDIR"/output_dir $HOME

While this script illustrates the use of the most important elements of a job script (SBATCH arguments, modules, managing input and output), it is probably not efficient: if the Python program only runs on a single core, many cores are left idle, wasting resources.

Environment variables within a job

You can inspect all environment variables available in a job by submitting the following jobscript
#!/bin/bash
#SBATCH --nodes=1 --time=1:00
echo "-----"
echo "Non-SLURM environment"
echo "-----"
env | grep -v SLURM
echo "-----"
echo "SLURM environment"
echo "-----"
env | grep SLURM

and inspecting the output.

SLURM automatically sets several environment variables. A non-exhaustive list of the most useful ones is shown below.

Variable name            Description
USER                     Your username.
HOSTNAME                 Name of the computer currently running the script.
HOME                     Your home directory.
PWD                      Current directory the script is running in. Changes when you do cd in a script.
SLURM_SUBMIT_DIR         Directory where the sbatch command was executed.
SLURM_JOBID              ID number assigned to the job upon submission. The same number is shown by squeue.
SLURM_JOB_NAME           Name of the job script.
SLURM_NODELIST           A collapsed list of nodes assigned to this job (see comments below).
SLURM_ARRAY_TASK_ID      Array task ID for jobs submitted with the -a flag. For #SBATCH -a 1-8, this will be an integer between 1 and 8.
SLURM_NTASKS             Number of tasks for the job.
SLURM_NTASKS_PER_NODE    Number of tasks per node.

The list of nodes assigned to a job is stored in the variable $SLURM_NODELIST in a compact form (e.g. "tcn[1-10]" means that nodes tcn1, tcn2, ..., tcn10 are assigned to the job). To obtain the full list of compute nodes available to the job, you can use the following command:

scontrol show hostnames 

which will print all the available nodes as an ordered list, e.g.:

tcn1
tcn2
tcn3
tcn4
tcn5
tcn6
tcn7
tcn8
tcn9
tcn10
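
If you need the individual hostnames inside your job script, you can, for example, capture them in a bash array (a minimal sketch):

# Expand the compact node list into an array of hostnames
nodes=($(scontrol show hostnames "$SLURM_NODELIST"))
echo "This job runs on ${#nodes[@]} node(s); the first one is ${nodes[0]}"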

On Snellius, environment variables set by the user in the bash session from which the job is submitted are not exported to the job.

This applies both to variables set by the user (interactively or in .bashrc and similar configuration files) and to variables set by module files. This means that every variable you would like to have set at runtime needs to be explicitly exported in the job script. Similarly, any module you would like to have loaded at runtime needs to be explicitly loaded within the job script.
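
In practice this means that a job script should set up its own environment explicitly, for example (MY_SETTING is a placeholder variable):

#!/bin/bash
#SBATCH -N 1
#SBATCH -t 10:00

# Variables and modules from the login session are not inherited, so set them here
export MY_SETTING=some_value
module load 2022
module load Python/3.10.4-GCCcore-11.3.0

python $HOME/my_program.py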
