The way to use the system efficiently - i.e. using all the cores in a node, and all the nodes reserved for you - depends on the type of program you want to run. Here, we distinguish three types of programs.

  1. Serial programs. These programs have no parallelism programmed into them. They show up as a single process in top and only use a single core (max 100% CPU in top). The way to use hardware efficiently with serial programs is simply to run multiple serial programs at the same time.
  2. Multithreaded programs. These programs have parallelism programmed into them, e.g. using frameworks like OpenMP. These programs show up as a single process in top and can use multiple cores (in a 16-core machine, max 1600% CPU in top).
  3. MPI programs. These programs have parallelism programmed into them using the MPI framework. This framework is designed to spawn many processes, both within a node and across different nodes. MPI is aware of the batch system, and thus knows how many nodes are allocated to you and how many processes per node you want to launch. This makes it relatively easy to use all nodes efficiently. MPI processes are often single-core processes, but may also be multithreaded processes.

In this section, we focus on running existing programs in an efficient way. Of course, if you have the source code, you can parallelize serial programs and change them into multithreaded and/or MPI programs, but that is not our focus here.

Parallel execution of serial programs

Run multiple serial programs concurrently on a single node

To run multiple serial programs concurrently on a single node, you can start programs in the background:

$HOME/my_serial_program_1 &
$HOME/my_serial_program_2 &
#etc...
wait

Normally, a script would wait until my_serial_program_1 is finished before starting my_serial_program_2. However, the &-sign starts programs in the background: immediately after launching my_serial_program_1, the script continues and also launches my_serial_program_2, so the two programs run concurrently. The 'wait' statement makes sure the script does not continue before all background processes have finished. In a job script, this is important: without the wait, the script would finish almost immediately and the SLURM system would automatically kill all remaining background processes.

In the specific case that your programs are numbered in a regular way, you can also use a bash for-loop, e.g. to start my_serial_program_1 through my_serial_program_4:

for i in `seq 1 4`; do
  $HOME/my_serial_program_$i &
done
wait

The same approach works, of course, if you want to run multiple instances of the same program.

#Determine the number of cores in the system
NPROC=`nproc --all`

#Execute program
for i in `seq 1 $NPROC`; do
  $HOME/my_serial_program &
done
wait

Here, the command nproc --all is used to obtain the number of cores in the current system, so that we can easily launch one instance of my_serial_program per core.

Often, you'll need to pass different inputs to your serial program, for example because you want to perform the same analysis on different datasets, or because you want to perform a parameter sweep (i.e. vary a numerical input parameter over a certain range). If your input files are numbered in a logical way, e.g. input_file_1, input_file_2, etc., you can use the iterator of the for-loop to start multiple instances of your program, each with its own input file (and output file, if you want):

NPROC=`nproc --all`
for i in `seq 1 $NPROC`; do
  $HOME/my_serial_program "$TMPDIR"/input_file_$i "$TMPDIR"/output_file_$i &
done
wait

In a similar way, you could use the for-loop iterator to do a parameter sweep. E.g. if my_serial_program takes one (numerical) input parameter, and we want to do a sweep from 1, 2, ... 10:

#Execute program located in $HOME
for i in `seq 1 10`; do
  $HOME/my_serial_program $i &
done
wait

If you're doing parameter sweeps, the QCG-PilotJob application may be helpful to you. This software allows you to schedule multiple tasks, in particular tasks with different parameters, within a single Slurm allocation.


Run multiple serial programs concurrently on multiple nodes

Although it is possible to create a single job script that starts multiple serial programs on multiple nodes, this is generally not needed. Most often, you can simply submit the same (single-node) job script multiple times, with slight alterations (if needed) to use e.g. different input files or parameters, as shown in the sketch below.
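For example, a minimal sketch of a parameter sweep via separate jobs (my_job_script.sh is a hypothetical name for your own single-node job script):

for i in `seq 1 10`; do
  sbatch my_job_script.sh $i
done

Anything placed after the job script name on the sbatch command line is passed to the script as an argument, so inside my_job_script.sh the parameter is available as $1:

#Inside my_job_script.sh: pass the sweep parameter on to the program
$HOME/my_serial_program $1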

If you really need to start multiple serial programs on multiple nodes with a single job script, please see the Example job scripts section for an example of how to use srun or ssh for that purpose.
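To give a flavor of what such a script can look like, here is a minimal sketch that starts one instance of the program per allocated node as a separate job step, passing the loop index as an input parameter (as in the parameter sweep above):

#Launch one job step per allocated node, each running one serial program
for i in `seq 1 $SLURM_JOB_NUM_NODES`; do
  srun -N 1 -n 1 --exclusive $HOME/my_serial_program $i &
done
wait

Depending on your Slurm version, the --exclusive flag (or --exact on recent versions) may be needed so that the concurrent job steps do not block each other waiting for resources.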

Multithreaded programs

Multithreaded program on a single node

If your program is multithreaded and can use up to 16 cores, you can simply run

my_multithreaded_program "$TMPDIR"/input_file

on a single node and it will likely be able to use all cores in the node.

Some OpenMP programs use the OMP_NUM_THREADS environment variable to determine the maximum number of threads they are allowed to launch. You can set this equal to the number of physical cores by adding

export OMP_NUM_THREADS=`nproc --all`

to the job script, before you start your multithreaded program.

The operating system may decide to move threads from one core to another during execution. Often however, performance of multithreaded programs is better when each thread remains on the same core throughout execution of the program - this is known as core affinity. You can enable core affinity by adding the following environment variables

export OMP_PLACES=cores
export OMP_PROC_BIND=true

to the job script, again before you start your multithreaded program. There is no guarantee, however, that this improves performance, so experiment with it! You can read more about these variables in the official OpenMP API specifications document.

Multithreaded program on multiple nodes

Threads always have to reside on a machine with shared memory. That means you can start a multithreaded program on one node, but you cannot have a single multithreaded program running on multiple nodes.

As with serial programs, however, you can of course submit multiple jobs that each start a multithreaded program on a single node - and in that way distribute a workload.

MPI parallel execution

If your program is an MPI program, you can launch it using

srun my_program "$TMPDIR"/input_file

From the SLURM environment, srun knows how many nodes are available to you, and how many processes per node you requested. Thus, srun will launch the corresponding number (nodes x processes per node) of MPI processes. You can also launch an MPI program with an explicit number of tasks, which may be smaller than your allocation. For example, to launch my_program with 4 MPI processes:

srun -n 4 my_program "$TMPDIR"/input_file

In general, you'd want to spawn as many processes per node as there are cores, i.e. for the nodes with 16 cores, you'd have lines like these in your job script:

#SBATCH -N 10
#SBATCH --ntasks-per-node=16

The exception is when your program is a hybrid MPI/OpenMP program, as we'll see in the next section.

Hybrid MPI/OpenMP

Some programs use a combination of MPI and OpenMP for parallelization. For example, they use OpenMP for parallelization within a node, and MPI for parallelization across nodes (this is somewhat rare). Another example would be programs running on GPUs, where one MPI task is started per GPU, and (OpenMP) multithreading is used to employ multiple cores to 'feed' that GPU (this is somewhat more common, e.g. in deep learning applications). In that case, you need to choose a matching number of threads and MPI processes per node, so that their product does not exceed the total number of cores in the node. E.g. for a 16-core node,

#SBATCH -N 10
#SBATCH --ntasks-per-node=1
export OMP_NUM_THREADS=16

would be ok, as it would spawn one MPI process per node that is allowed to use 16 threads. Other options would be --ntasks-per-node=2 with OMP_NUM_THREADS=8, --ntasks-per-node=4 with OMP_NUM_THREADS=4, etc. Experiment to find out which gives the best performance. A complete sketch follows below.
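For illustration, here is a minimal sketch of a complete hybrid job script (my_hybrid_program and the input file are placeholders; --cpus-per-task tells Slurm how many cores to reserve for each MPI task, and the matching thread count is then read back from the SLURM_CPUS_PER_TASK variable):

#!/bin/bash
#SBATCH -N 10
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4

#One OpenMP thread per core reserved for each MPI task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun my_hybrid_program "$TMPDIR"/input_file

Note that on recent Slurm versions, job steps no longer inherit --cpus-per-task from the job allocation automatically, so you may need to pass it to srun explicitly (srun --cpus-per-task=$SLURM_CPUS_PER_TASK ...).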
