Steps to follow when writing a job script
- SLURM batch system
- Loading modules
- Copy input data to scratch
- Executing your program
- Copy output data from scratch
- Final job script
- Environment variables within a job
You can find some example job scripts here.
SLURM batch system
- Defining the requirements of a job
- Shell
- Job name
- Wall clock time
- Partition
- Number of nodes
- Number of tasks per node
- Number of GPUs per node
- Memory requirements
- Additional requirements
- Automatic job restart after system failure
- Running X11 programs
- Mail from a job
- Multiple requirements
- Default options
- Overview of SLURM options
- Additional options
Defining the requirements of a job
For the batch system to know which nodes to allocate to you, and for how long, you need to start your job script by defining options for the SLURM batch system. SLURM reads these directives from the start of your job script until it reaches the first line that is neither a comment nor empty and does not start with #SBATCH. Hence, all your SLURM options should be at the start of your job script, before anything else.
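For example, a minimal skeleton (the job name and resource values here are just placeholders) could look like this:
#!/bin/bash
#SBATCH -J example_job
#SBATCH -N 1
#SBATCH -t 1:00:00

# Regular commands start here; any #SBATCH lines placed below
# the first command are ignored by SLURM.
echo "Running on $(hostname)"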
Shell
You can define the interpreter for your job script with a shebang, just like in any other script. For example, if you want to use Bash, you should start your script with
#!/bin/bash
Job name
You can specify a name for the job allocation using
#SBATCH -J myfirstjob
When querying SLURM jobs on the system, it will show the job ID number as well as the specified job name. The default job name is the name of the batch script.
For job names in Snellius, valid characters are lowercase a to z, uppercase A to Z, numbers 0 to 9, period (.), underscore (_), forward slash (/) and hyphen (-).
Wall clock time
You can set the duration for which the nodes remain allocated to you (known as the wall clock time), using
#SBATCH -t 1:30:00
The duration can be specified in minutes, or in the MM:SS, HH:MM:SS (as in the example), or D-HH:MM:SS format. In general, the maximum walltime is 120 hours (5 days), both for CPU jobs and for jobs submitted to the GPU queue; some queues have a different maximum walltime.
Choose your wall clock time carefully: if your job is still running after the wall clock time has been exceeded, it will be cancelled and you will lose your results!
In general, we recommend timing a minimal version of your computational task in order to estimate how long the total task will take. Then, choose your wall clock time liberally based on your expected runtime (e.g. 1.5-2x higher). Once you gain more experience in running similar-sized jobs, you could consider setting walltimes more accurately, as shorter jobs are slightly easier for the scheduler to fit in.
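As a rough sketch of this approach (my_program and small_test_input are placeholders), run a reduced version of the task and time it, either interactively or in a short test job:
time my_program small_test_input
If the reduced run takes 10 minutes and the full task is roughly 6 times larger, the expected runtime is about an hour, so a wall clock time of 1.5 to 2 hours leaves a comfortable margin.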
Partition
On our systems, different types of nodes are grouped into partitions, each having different limits for wall clock time, job size, access, etc. Partitions can overlap, i.e. compute nodes can be contained in several partitions (e.g. the "normal" and "short" partitions share a large number of nodes but have different maximum wall time limits).
Users can therefore request a specific node type by submitting to the partition that contains it (from the ones available on each system).
To set the partition you want to submit to, use the -p argument. For example to run on the thin partition (available on Snellius) you would need to specify:
#SBATCH -p thin
If you don't specify a partition, your job will be scheduled on the thin partition by default. You can check all available partitions, including their maximum walltime and number of nodes, using the sinfo command.
The available compute partitions that batch jobs can be submitted to can be listed by issuing the following command
sinfo
You can get a summary of the available partitions by using the -s flag.
sinfo -s
The available partitions of Snellius are described here.
Number of nodes
To define the number of nodes you need, use the -N argument. For example, if your job requires 1 node, you can use
#SBATCH -N 1
Alternatively, if your application is e.g. an MPI program, you may want to specify the number of (MPI) tasks you want to run using the -n argument. For example, to request an allocation for 16 tasks:
#SBATCH -n 16
SLURM is aware of how many cores each node contains, and typically requesting 16 tasks will allocate a node with 16 cores for you (unless you specify a different number of --tasks-per-node, see below).
Number of tasks per node
For practically all jobs, you will want to set the number of nodes and wall clock time. For particular jobs, you may also need to set the number of tasks per node. For example, MPI programs will use this number to determine how many processes to run per node. You can set it (to 16) using
#SBATCH --tasks-per-node 16
Sometimes, you may want to run fewer than 16 processes per node. For example, if your application uses too much memory to run more instances in parallel, or if you have a hybrid OpenMP/MPI application, in which each MPI task is multithreaded. Finally, some applications run more efficiently if you leave one core free (i.e. specify 15 tasks for a 16-core node); the only way to find out is to test your application with both settings.
By default, this will allocate 1 CPU per task. If you would like to advise the SLURM controller that job steps will require a different number of processors per task, you can request this using the following options:
#SBATCH --ntasks 8
#SBATCH --cpus-per-task 2
Number of GPUs per node
For GPU equipped nodes, you may want to request a specific number of GPU devices. This can be done using the GPU scheduling options provided in SLURM.
To request two GPUs for the job, you can use the option:
#SBATCH --gpus=2
Similarly to the number of tasks per node option shown above, you can also request a fixed number of GPUs per node by adding to your job script the following option:
#SBATCH --gpus-per-node=4
Check the sbatch command help output for more options.
Memory requirements
You can explicitly request the amount of memory needed on the node (please consult each HPC system's node specifications), e.g.
#SBATCH --mem=60G
Additional requirements
You can request a node with specific features using the --constraint option, e.g.:
#SBATCH --constraint=scratch-node
This will request a node with a local disk.
Alternatively, resources such as GPU number (within a gpu partition) can be requested using the Generic RESources (GRES) option available in SLURM.
#SBATCH --gres=gpu:1
This will select one TitanRTX GPU out of the 4 available in each node.
Constraints and GRES are system- and partition-specific; please check the Snellius specifications to find out which features are available on each system's nodes.
Automatic job restart after system failure
If a running job ends because of a system failure, the batch system will re-submit the job. Usually this is what you want, but in some cases it may not be desirable that your partially run jobs are started again. To prevent such an automatic restart, specify
#SBATCH --no-requeue
Running X11 programs
Running programs that use X11 is not recommended in a batch system, as there is no possibility to interact with the program anyway. However, for programs that insist on opening an X11 window, you can start an X-server in the job itself. At the start of the job script, put
export DISPLAY=:1
Xvfb $DISPLAY -auth /dev/null &
Note that it is not possible to get output from or provide input to the X11 window.
Mail from a job
SLURM can send e-mail at various stages in your job. With
#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=jan.janssen@uni.edu
you can specify the stages and destination e-mail address.
Multiple requirements
Note that you can specify as many requirements as you want, e.g.
#SBATCH --constraint=avx2 -N 2 --tasks-per-node=8 -t 120:00:00
However, specifying many requirements limits the number of nodes that can run your job, and your job may be in the queue for longer. Also, if you ask for a combination of requirements that no node can satisfy, the job will be rejected. E.g. all avx2 nodes have only 96 GB of memory, so if you specify
#SBATCH --constraint=avx2 --mem=1T
your job will be rejected.
Default options
SLURM will use default values for options that you don't explicitly set. E.g. the default number of nodes is 1, and the default walltime is 10 minutes. However, for clarity, we recommend always setting these two options explicitly.
The default number of tasks per node (--tasks-per-node) and the default number of threads per task (--cpus-per-task) are both 1. This affects how many processes are launched by mpiexec when running an MPI / OpenMP / hybrid program. Thus, in order to use all cores in a node, make sure to set the number of tasks and the number of CPUs per task explicitly for MPI and OpenMP programs.
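As an illustration, a hedged sketch of a hybrid MPI/OpenMP job (my_hybrid_program is a placeholder, and the task and thread counts are chosen purely as an example for a node with 32 cores):
#!/bin/bash
#SBATCH -N 1
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=4
#SBATCH -t 1:00:00

# Give each MPI task the number of OpenMP threads that SLURM allocated to it
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# 8 MPI tasks x 4 threads each = 32 cores in use on the node
mpiexec my_hybrid_program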
Overview of SLURM options
Common options that can be passed to the SLURM batch system, either on the sbatch command line, or as an #SBATCH attribute:
- -N [value]
- -n [value]
- -c [value]
- --tasks-per-node=[value]
- --cpus-per-task=[value]
- --mem=[value]
- --constraint=[value]
- -t [HH:MM:SS] or [MM:SS] or [minutes]
- -p [partition]
- --requeue
Additional options
The options presented above are just a part of all available commands that can be used in your job script. You can see all the available SBATCH flags with the command:
sbatch --help
Loading modules
A more detailed introduction to using modules and the module environment can be found on this tutorial page.
The environment module system on Snellius is Lmod.
To see which modules are available, use
module avail
or to find all possible modules and extensions
module spider
To load a module, use
module load [module_name]
module load 2022
For example, you can load Python by using the following command
module load 2022
module load Python/3.10.4-GCCcore-11.3.0
Default modules are disabled on Snellius: you need to specify the full version of the module you want to load (e.g. Python/3.10.4-GCCcore-11.3.0).
In order to minimize bugs in job scripts and get more consistent behaviour, users should load the exact version of the software via the module.
To check which versions of python are available, you can use
module avail Python
which will show something like
Python/2.7.18-GCCcore-10.3.0-bare Python/3.9.5-GCCcore-10.3.0 Python/3.9.5-GCCcore-10.3.0-bare
The module avail command will match all parts of the complete module+version string. So, for example:
module avail GCC
will also match
pkgconfig/1.5.5-GCCcore-11.3.0-python
Copy input data to scratch
When many tasks (e.g. 16 on a single node) all read the same input from the home file system, this puts an unnecessary load on the shared file system. The solution is to copy your input data to the local scratch disk of each node before starting your application. Then, each of the 16 processes on that node can read the input data from the local scratch disk. For the single-node example, this reduces the number of file reads from the home file system from 16 to 1 (i.e. only the copy operation).
Copying data to scratch for a single node job
The $TMPDIR environment variable points to a temporary folder on the local scratch disk and can be used to write files to scratch. For a single-node job, copying your input can be done simply with the cp command. For example, to copy a single big input file from your home directory to the local scratch disk:
cp $HOME/big_input_file "$TMPDIR"
Or, to copy a whole directory with input files
cp -r $HOME/input_dir "$TMPDIR"
Copying data to scratch for a multi-node job
For the MPI example involving 10 nodes, copying the data to each of the local scratch disks would still result in the input files being read 10 times. To avoid that, we have developed the mpicopy tool. Mpicopy reads the file from the home file system only on the first node, and from there broadcasts it to all nodes that are assigned to you. To use the mpicopy tool you need to load the mpicopy and openmpi modules first. You can specify a target directory using the -o argument, but by default mpicopy copies to the $TMPDIR directory. For example
module load 2020
module load mpicopy
module load OpenMPI/4.0.3-GCC-9.3.0
mpicopy $HOME/big_input_file "$TMPDIR"/
Note that mpicopy also copies directories recursively by default; you don't need to specify the -r option.
mpicopy $HOME/input_dir
An MPI program can then be started with the corresponding input file
mpiexec my_program "$TMPDIR"/big_input_file
Executing your program
In this section, we limit ourselves to the simplest scenario: running a single instance of a serial (i.e. non-parallel) program, taking a single input file as an argument. For that, you add a line like
my_program "$TMPDIR"/big_input_file
to your job script.
This is the simplest example, but it is not the way you should generally use an HPC system! Running only a single instance of a serial program, you will use only one core in a single node, leaving the other cores idle. This, of course, is a waste of computational power (and a waste of your budget). In practice, you will want to use parallelization to use all cores in a node, or even multiple nodes.
Note: the need for parallelization depends on how 'heavy' your program is. If you have some simple pre- and post-processing steps, it is of course fine to run these as serial programs.
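As a sketch of one simple form of parallelization (the program name and input files are placeholders, and this assumes the 16 independent serial runs fit in the node's memory), you can start several serial instances in the background and wait for all of them to finish:
# Launch 16 independent serial instances, one per core
for i in $(seq 1 16); do
  my_program "$TMPDIR"/input_file_$i &
done

# Wait until all background instances have completed
wait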
Temporary files
Some programs may generate temporary files. If your program does, and if you can set the location where it stores the temporary files (e.g. as an argument or through a configuration file), make your program use the (fast) scratch space (accessible through the $TMPDIR environment variable). Other programs will use the current directory (i.e. the directory your shell was in when you launched the program) to store temporary files. If that is the case, change to the $TMPDIR directory before launching your program. E.g.
cd "$TMPDIR" $HOME/my_program
Do not use the /tmp directory to store temporary files, it may cause the node to crash! On the nodes, /tmp has a limited size and should only be used by system processes.
Copy output data from scratch
cp "$TMPDIR"/output_file $HOME
or for a directory containing output
cp -r "$TMPDIR"/output_dir $HOME
WARNING: after your job finishes, the "$TMPDIR" on the scratch disk is removed by the batch system. If you forget to copy your results back to your home directory, they will be lost!
Final job script
#!/bin/bash
#Set job requirements
#SBATCH -n 16
#SBATCH -t 5:00

#Loading modules
module load 2022
module load Python/3.10.4-GCCcore-11.3.0

#Copy input file to scratch
cp $HOME/big_input_file "$TMPDIR"

#Create output directory on scratch
mkdir "$TMPDIR"/output_dir

#Execute a Python program located in $HOME, that takes an input file and output directory as arguments.
python $HOME/my_program.py "$TMPDIR"/big_input_file "$TMPDIR"/output_dir

#Copy output directory from scratch to home
cp -r "$TMPDIR"/output_dir $HOME
While this script illustrates the use of the most important elements of a job script (SBATCH arguments, modules, managing input and output), it is probably not efficient: if the Python program only runs on a single core, many cores are left idle, wasting resources.
Environment variables within a job
You can inspect all environment variables available in a job by submitting the following jobscript
#!/bin/bash
#SBATCH --nodes=1 --time=1:00
echo "-----"
echo "Non-SLURM environment"
echo "-----"
env | grep -v SLURM
echo "-----"
echo "SLURM environment"
echo "-----"
env | grep SLURM
and inspecting the output.
SLURM automatically sets several environment variables. A non-exhaustive list of the most useful ones is shown here.
Variable name | Description |
---|---|
USER | Your username. |
HOSTNAME | Name of the computer currently running the script. |
HOME | Your home directory. |
PWD | Current directory the script is running in. Changes when you do cd in a script. |
SLURM_SUBMIT_DIR | Directory where the sbatch command was executed. |
SLURM_JOBID | ID number assigned to the job upon submission. The same number is seen in showq. |
SLURM_JOB_NAME | Name of the job script. |
SLURM_NODELIST | A collapsed list of nodes assigned to this job (see comments below). |
SLURM_ARRAY_TASK_ID | Array ID numbers for jobs submitted with the -a flag. For #SBATCH -a 1-8, this will be an integer between 1 and 8. |
SLURM_NTASKS | Number of tasks for the job. |
SLURM_NTASKS_PER_NODE | Number of tasks per node. |
The list of nodes assigned to a job is stored in the variable $SLURM_NODELIST in a compact form (e.g: "tcn[1-10]" means nodes tcn1, tcn2, ..., tcn10 are assigned to the job). To obtain the full list of compute nodes available for the job you can use the following command:
scontrol show hostnames
which will print all the available nodes as an ordered list, e.g.:
tcn1
tcn2
tcn3
tcn4
tcn5
tcn6
tcn7
tcn8
tcn9
tcn10
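For example, a small sketch (the file name is just an illustration) that saves the expanded node list to a file from within a job script:
# Expand the compact node list into one hostname per line and save it
scontrol show hostnames > hostfile.$SLURM_JOBID

# Inspect which nodes the job received
cat hostfile.$SLURM_JOBID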
On Snellius, we do not export environment variables that are set by the user within the bash session.
This applies both to variables set by the user (interactively or in .bashrc or similar configuration files) and to variables set by module files. This means that every variable you would like to be set at runtime needs to be explicitly exported in the job script. Similarly, any module you would like to have loaded at runtime needs to be explicitly loaded within the job script.