Defining the requirements of a job
For the batch system to know which nodes to allocate to you, and for how long, you need to start your job script by defining options for the SLURM batch system. SLURM reads your job script from the top and stops processing #SBATCH directives once it reaches the first line that is neither a comment nor whitespace - hence all your SLURM options should be at the start of your job script, before any other commands.
Shell
You can define the interpreter for your job script with a shebang, just like in any other script. For example, if you want to use Bash, you should start your script with
#!/bin/bash
Wall clock time
You can set the duration for which the nodes remain allocated to you (known as the wall clock time), using
#SBATCH -t 1:30:00
The duration can be specified in minutes, in the MM:SS or HH:MM:SS format (as in the example), or in the D-HH:MM:SS format. In general, the maximum walltime is 120 hours (5 days), both for CPU jobs and for jobs submitted to the GPU queue; some queues have a different maximum walltime.
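For example, a wall clock time of two days can be requested in the D-HH:MM:SS format as follows:
#SBATCH -t 2-00:00:00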
Choose your wall clock time carefully: if your job is still running when the wall clock time is exceeded, it will be cancelled and you will lose your results!
In general, we recommend timing a minimal version of your computational task in order to estimate how long the total task will take. Then, choose your wall clock time generously based on the expected runtime (e.g. 1.5-2x higher). Once you gain more experience in running similar-sized jobs, you can consider setting walltimes more tightly, as shorter jobs are slightly easier for the scheduler to fit in.
Partition
On our systems, different types of node are grouped into partitions, each having different limits for wall-clock time, job size, access restrictions, etc. Partitions can overlap, i.e. compute nodes can be contained in several partitions (e.g. the "normal" and "short" partitions share a large number of nodes, but they have different maximum wall time limits).
Users can therefore request a specific node type by submitting to the partition that contains it (from the ones available on each system).
To set the partition you want to submit to, use the -p argument. For example to run on the thin partition (available on Snellius) you would need to specify:
#SBATCH -p thin
If you don't specify a partition, your job will be scheduled on the thin partition by default. You can check all available partitions, including their maximum walltime and number of nodes, using the sinfo command.
The available compute partitions that batch jobs can be submitted to can be listed by issuing the following command
sinfo
You can get a summary of the available partitions by using the -s flag.
sinfo -s
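If you are only interested in a single partition, you can restrict the output to it with the -p flag; for example, for the thin partition used in the examples on this page:
sinfo -p thin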
The available partitions of Snellius are described here.
Number of nodes
To define the number of nodes you need, use the -N argument. For example, if your job requires 1 node, you can use
#SBATCH -N 1
Alternatively, if your application is e.g. an MPI program, you may want to specify the number of (MPI) tasks you want to run using the -n argument. For example, to request an allocation for 16 tasks:
#SBATCH -n 16
SLURM is aware of how many cores each node contains; typically, requesting 16 tasks will allocate 16 cores on a single node for you (unless you specify a different number of tasks per node with --tasks-per-node, see below).
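Putting the options discussed so far together, a minimal job script for an MPI program could look as follows (my_mpi_program is a placeholder for your own executable, and thin is used as an example partition):
#!/bin/bash
#SBATCH -t 1:30:00
#SBATCH -p thin
#SBATCH -n 16

srun ./my_mpi_program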
Number of tasks per node
For practically all jobs, you will want to set the number of nodes and wall clock time. For particular jobs, you may also need to set the number of tasks per node. For example, MPI programs will use this number to determine how many processes to run per node. You can set it (to 16) using
#SBATCH --tasks-per-node 16
Sometimes, you may want to run fewer than 16 processes per node. For example, if your application uses too much memory to run that many instances in parallel, or if you have a hybrid OpenMP/MPI application, in which each MPI task is multithreaded. Finally, some applications run more efficiently if you leave one core free (i.e. specify 15 tasks for a 16-core node); the only way to know this is to test your application with both settings.
By default, this will allocate 1 CPU per task. If you would like to advise the SLURM controller that job steps will require a different number of processors per task, you can request this using the following options:
#SBATCH --ntasks 8
#SBATCH --cpus-per-task 2
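For a hybrid MPI/OpenMP application, you would typically also set the number of OpenMP threads per task to match the --cpus-per-task value, for example by adding the following to the body of your job script (my_hybrid_program is a placeholder for your own executable):
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_hybrid_program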
Number of GPUs per node
For GPU equipped nodes, you may want to request a specific number of GPU devices. This can be done using the GPU scheduling options provided in SLURM.
To request two GPUs for the job, you can use the option:
#SBATCH --gpus=2
Similarly to the number of tasks per node option shown above, you can also request a fixed number of GPUs per node by adding the following option to your job script:
#SBATCH --gpus-per-node=4
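As an illustration, the header of a job script requesting one node with four GPUs might look like this (the partition name gpu is only an example; check sinfo for the partitions available on your system):
#SBATCH -p gpu
#SBATCH -N 1
#SBATCH --gpus-per-node=4
#SBATCH -t 12:00:00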
Check the sbatch command help output for more options.
Memory requirements
You can explicitly request the memory needed on the node (please consult each HPC system's node specifications) by using e.g.
#SBATCH --mem=60G
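Alternatively, SLURM lets you specify memory per allocated core rather than per node, using the --mem-per-cpu option; the value below is only an example:
#SBATCH --mem-per-cpu=4G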
Additional requirements
You can request a node with specific features, using the --constraint option, e.g.:
#SBATCH --constraint=avx2
This will request a CPU with the AVX2 instruction set.
Alternatively, resources such as the number of GPUs (within a GPU partition) can be requested using the Generic RESources (GRES) option available in SLURM.
#SBATCH --gres=gpu:1
This will select one TitanRTX GPU out of the 4 available on each node.
Constraints and GRES are system and partition specific; please check the Snellius specifications to find out which features are available on each system's nodes.
Automatic job restart after system failure
If a running job ends because of a system failure, the batch system will re-submit the job. Mostly, this is what you want, but in some cases it may not be desirable that your partially run jobs are started again. To prevent such an automatic restart, specify
#SBATCH --no-requeue
Running X11 programs
Running programs that use X11 is not recommended in a batch system, as there is no possibility to interact with the program anyway. However, for programs that insist on opening an X11 window, you can start an X-server in the job itself. At the start of the job script, put
export DISPLAY=:1
Xvfb $DISPLAY -auth /dev/null &
Note that it is not possible to get output from or provide input to the X11 window.
Mail from a job
SLURM can send e-mail at various stages in your job. With
#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=jan.janssen@uni.edu
you can specify the stages and destination e-mail address.
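Besides BEGIN and END, SLURM also supports --mail-type values such as FAIL and ALL. For example, to be notified only when a job fails:
#SBATCH --mail-type=FAIL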
Multiple requirements
Note that you can specify as many requirements as you want, e.g.
#SBATCH --constraint=avx2 -N 2 --tasks-per-node=8 -t 120:00:00
However, specifying many requirements limits the number of nodes that can run your job, and your job may stay in the queue for longer. Also, if you ask for a combination of requirements that no node can satisfy, the job will be rejected. E.g. all avx2 nodes have only 96 GB of memory, so if you specify
#SBATCH --constraint=avx2 --mem=1T
your job will be rejected.
Default options
SLURM will use default values for options that you don't explicitly set, e.g. the default number of nodes is 1 and the default walltime is 10 minutes. However, for clarity, we always recommend setting these two options explicitly.
The default number of processes per node (--tasks-per-node) and the default number of threads per task (--cpus-per-task) is 1. This affects how many processes are launched by mpiexec when running an MPI / OpenMP / hybrid program. Thus, in order to use all cores in a node, make sure to set the number of tasks and the number of CPUs per task explicitly for MPI and OpenMP programs.
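As an illustration, on a hypothetical node with 32 cores, a hybrid job could fill the whole node by requesting, for example, 8 tasks with 4 CPUs each (the core count is only an example; check the node specifications of your system):
#SBATCH -N 1
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=4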
Overview of SLURM options
Common options that can be passed to the SLURM batch system, either on the sbatch command line or as an #SBATCH attribute:
- -N [value]
- -n [value]
- -c [value]
- --tasks-per-node=[value]
- --cpus-per-task=[value]
- --mem=[value]
- --constraint=[value]
- -t [HH:MM:SS] or [MM:SS] or [minutes]
- -p [value]
- --requeue
Additional options
The options presented above are only a subset of all the options that can be used in your job script. You can see all the available SBATCH flags with the command:
sbatch --help