Synopsis

Snellius uses the workload manager SLURM. As a cluster workload manager, SLURM has three key functions.

First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.

Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.

Finally, it arbitrates contention for resources by managing a queue of pending work.

Defining the requirements of a job

For the batch system to know which nodes to allocate to you, and for how long, you need to start your job script by defining options for the SLURM batch system. SLURM reads your job script from the top and interprets every line that starts with #SBATCH, stopping at the first line that is neither a comment nor empty - hence all your SLURM options should be at the start of your job script, before any other commands.
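For illustration, the overall structure of a job script then looks like the following minimal sketch; the individual options are explained in the sections below, and my_program is a placeholder for your own executable:

#!/bin/bash
#SBATCH -J myfirstjob
#SBATCH -N 1
#SBATCH -t 1:30:00

# the actual commands to run come after all the #SBATCH lines
./my_program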

Shell

You can define the interpreter for your job script with a shebang, just like in any other script. For example, if you want to use Bash, you should start your script with

#!/bin/bash

Job name

You can specify a name for the job allocation using

#SBATCH -J myfirstjob

When querying SLURM jobs on the system, both the job ID number and the specified job name will be shown. The default job name is the name of the batch script.

For job names on Snellius, valid characters are lowercase a to z, uppercase A to Z, numbers 0 to 9, period (.), underscore (_), forward slash (/) and hyphen (-).


Wall clock time

You can set the duration for which the nodes remain allocated to you (known as the wall clock time), using

#SBATCH -t 1:30:00

The duration can be specified in minutes, or in the MM:SS or HH:MM:SS format (as in the example), or in the D-HH:MM:SS format. In general, the maximum walltime is 120 hours (5 days), both for CPU jobs and for jobs submitted to the GPU queue. Some queues have a different maximum walltime.
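For illustration, the following lines show the same option in the different accepted formats (these are alternatives, not meant to be combined in one script):

# 90 minutes
#SBATCH -t 90
# 1 hour and 30 minutes (HH:MM:SS)
#SBATCH -t 1:30:00
# 2 days and 12 hours (D-HH:MM:SS)
#SBATCH -t 2-12:00:00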

Choose your wall clock time carefully: if your job is still running after the wall clock time has been exceeded, it will be cancelled and you will lose your results!

In general, we recommend timing a minimal version of your computational task in order to estimate how long the total task will take. Then, choose your wall clock time liberally based on your expected runtime (e.g. 1.5-2x higher). Once you gain more experience in running similar-sized jobs, you could consider setting walltimes more accurately, as shorter jobs are slightly easier for the scheduler to fit in.

Partition

On our systems, different types of nodes are grouped into partitions, each having different limits for wall-clock time, job size, access restrictions, etc. Partitions can overlap, i.e. a compute node can be contained in several partitions (e.g. the "normal" and "short" partitions share a large number of nodes, but they have different maximum walltime limits).

Users can therefore request a specific node type by submitting to the partition that contains it (from those available on each system).

To set the partition you want to submit to, use the -p argument. For example, to run on the thin partition (available on Snellius), you would need to specify:

#SBATCH -p thin

If you don't specify a partition, your job will be scheduled on the thin partition by default. You can check all available partitions, including their maximum walltime and number of nodes, using the sinfo command.

The available compute partitions that batch jobs can be submitted to can be listed by issuing the following command

sinfo

You can get a summary of the available partitions by using the -s flag.

sinfo -s
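If you are only interested in a single partition, you can restrict the output to that partition, e.g. for the thin partition:

sinfo -p thin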

The available partitions of Snellius are described here


Number of nodes

To define the number of nodes you need, use the -N argument. For example, if your job requires 1 node, you can use

#SBATCH -N 1

Alternatively, if your application is e.g. an MPI program, you may want to specify the number of (MPI) tasks you want to run using the -n argument. For example, to request an allocation for 16 tasks:

#SBATCH -n 16

SLURM is aware of how many cores each node contains, and typically requesting 16 tasks will allocate a node with 16 cores for you (unless you specify a different number with --tasks-per-node, see below).
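You can also combine the two. As a sketch, the following requests 32 tasks spread over 2 nodes (how the tasks are distributed depends on the number of cores per node):

#SBATCH -N 2
#SBATCH -n 32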

Number of tasks per node

For practically all jobs, you will want to set the number of nodes and wall clock time. For particular jobs, you may also need to set the number of tasks per node. For example, MPI programs will use this number to determine how many processes to run per node. You can set it (to 16) using

#SBATCH --tasks-per-node 16

Sometimes, you may want to run fewer than 16 processes, for example if your application uses too much memory to run more instances in parallel, or if you have a hybrid OpenMP/MPI application in which each MPI task is multithreaded. Finally, some applications run more efficiently if you leave one core free (i.e. specify 15 tasks for a 16-core node); the only way to know this is to test your application with both settings.

By default, this will allocate 1 CPU per task. If you would like to advise the SLURM controller that job steps will require a different number of processors per task, you can request this using the following options:

#SBATCH --ntasks 8
#SBATCH --cpus-per-task 2
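As an illustration of how these options work together, the sketch below runs a hybrid MPI/OpenMP program with 8 MPI tasks of 2 threads each, using srun to launch the tasks; my_hybrid_program is a placeholder for your own executable:

#!/bin/bash
#SBATCH --ntasks 8
#SBATCH --cpus-per-task 2
#SBATCH -t 1:00:00

# give each MPI task as many OpenMP threads as it has CPUs allocated
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./my_hybrid_program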

Number of GPUs per node

For GPU-equipped nodes, you may want to request a specific number of GPU devices. This can be done using the GPU scheduling options provided by SLURM.

To request two GPUs for the job, you can use the option:

#SBATCH --gpus=2

Similar to the tasks-per-node option shown above, you can also request a fixed number of GPUs per node by adding the following option to your job script:

#SBATCH --gpus-per-node=4
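For example, a sketch of a GPU job request combining these options could look as follows (assuming your system has a partition named gpu with 4 GPUs per node):

#SBATCH -p gpu
#SBATCH -N 1
#SBATCH --gpus-per-node=4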

Check the sbatch command help output for more options.


Memory requirements

You can explicitly request the amount of memory needed on the node (please consult each HPC system's node specifications), e.g.

#SBATCH --mem=60G


Additional requirements

You can request a node with specific features, using the --constraint option, e.g.:

#SBATCH --constraint=scratch-node

This will request a node with a local disk.

Alternatively, resources such as the number of GPUs (within a GPU partition) can be requested using the Generic RESources (GRES) option available in SLURM.

#SBATCH --gres=gpu:1

This will select one TitanRTX GPU out of the 4 available in each node.

Constraints and GRES are system and partition specific; please check the Snellius specifications to find out which features are available on each system's nodes.

Automatic job restart after system failure

If a running job ends because of a system failure, the batch system will re-submit the job. Mostly, this is what you want, but in some cases it may not be desirable for your partially run jobs to be started again. To prevent such an automatic restart, specify

#SBATCH --no-requeue

Running X11 programs

Running programs that use X11 is not recommended in a batch system, as there is no possibility to interact with the program anyway. However, for programs that insist on opening an X11 window, you can start an X-server in the job itself. At the start of the job script, put

export DISPLAY=:1
Xvfb $DISPLAY -auth /dev/null &

Note that it is not possible to get output from or provide input to the X11 window. 
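Putting this together, a job script for such a program could look like the following sketch, where my_gui_program is a placeholder for your X11 application:

#!/bin/bash
#SBATCH -N 1
#SBATCH -t 1:00:00

# start a virtual framebuffer X server so the program can open its window
export DISPLAY=:1
Xvfb $DISPLAY -auth /dev/null &

./my_gui_program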

Mail from a job

SLURM can send e-mail at various stages in your job. With

#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=jan.janssen@uni.edu

you can specify the stages and destination e-mail address.
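Other stages such as FAIL and ALL are also accepted; for example, to be notified only when the job fails:

#SBATCH --mail-type=FAIL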

Multiple requirements

Note that you can specify as many requirements as you want, e.g.

#SBATCH --constraint=avx2 -N 2 --tasks-per-node=8 -t 120:00:00

However, specifying many requirements limits the number of nodes that can run your job, so your job may stay in the queue longer. Also, if you ask for a combination of requirements that no node can satisfy, the job will be rejected. For example, all avx2 nodes have only 96 GB of memory. So, if you specify

#SBATCH --constraint=avx2 --mem=1T

your job will be rejected.

Default options

SLURM will use default values for options that you don't explicitly set. E.g. the default number of nodes is 1, and the default walltime is 10 minutes. However, for clarity, we always recommend setting these two options explicitly.

The default number of processes per node (--tasks-per-node) and the default number of threads per task (--cpus-per-task) are both 1. This affects how many processes are launched by mpiexec when running an MPI / OpenMP / hybrid program. Thus, in order to use all cores in a node, make sure to set the number of tasks and the number of CPUs per task explicitly for MPI and OpenMP programs.
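For example, the following sketch starts one MPI process per core on a single node, here assuming a node with 32 cores (check the node specifications of your system for the actual core count):

#SBATCH -N 1
#SBATCH --tasks-per-node=32
#SBATCH --cpus-per-task=1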

Overview of SLURM options

Common options that can be passed to the SLURM batch system, either on the sbatch command line or as an #SBATCH attribute:

  • -N [value] : number of nodes
  • -n [value] : total number of tasks
  • -c [value] : number of CPUs per task (short form of --cpus-per-task)
  • --tasks-per-node=[value] : number of tasks per node
  • --cpus-per-task=[value] : number of CPUs per task
  • --mem=[value] : memory needed on the node
  • --constraint=[value] : required node features, e.g. scratch-node
  • -t [D-HH:MM:SS], [HH:MM:SS], [MM:SS] or [minutes] : wall clock time
  • -p [partition] : partition to submit to, e.g. thin
  • --requeue / --no-requeue : whether to restart the job automatically after a system failure

Additional options

The options presented above are just a subset of all available options that can be used in your job script. You can see all the available SBATCH flags with the command:

sbatch --help


