Common reasons for inefficient CPU jobs

The three most common reasons for a job to run inefficiently are:

  1. The file systems are not used efficiently.
  2. A number of nodes are reserved for a job, but the job is only running on one node.
  3. Only a single core is being used on each node.

In each of these cases, one or more CPU cores sit idle, either because they are waiting for data to be read from or written to disk (case 1), or because the application was not started in a way that uses all available cores (cases 2 and 3). This wastes resources and, more importantly, your CPU budget, because you pay for all the cores reserved by your job script (see accounting for Snellius).

The solutions are:

  • Case 1 - If you use many or large input files, copy them from the home file system to scratch before starting your program. Consider packing them into a single archive first (for example, a compressed tar archive), since one large file transfers more efficiently than many small ones. Write intermediate results only to scratch. If you have many or large output files, write them to scratch and copy them to the home file system when the program has finished. Again, consider packing them into an archive first.
  • Cases 2 and 3 - To use all nodes and cores, use appropriate parallelization. It is also essential to verify that your job runs on all nodes and cores the way you intended: mistakes are easy to make, and the difference between running on all cores of a node and running on just one may be a single character in your job script.
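
The staging pattern of case 1 can be sketched as a job script like the one below. This is a minimal sketch, not an official template: the directories under $HOME, the program name my_program and its options, the helper names, and the use of $TMPDIR as the scratch location are all assumptions. Adapt them to your own project and to the scratch file system you actually use.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=01:00:00

# Hypothetical locations; adjust to your own project layout.
INPUT_DIR="${INPUT_DIR:-$HOME/my_project/input}"
OUTPUT_DIR="${OUTPUT_DIR:-$HOME/my_project/output}"
# Assumption: inside a job, $TMPDIR points to scratch space.
SCRATCH_DIR="${SCRATCH_DIR:-${TMPDIR:-/tmp}/job_${SLURM_JOB_ID:-manual}}"

# Pack the contents of a directory into one compressed archive.
pack_dir() {            # usage: pack_dir SOURCE_DIR ARCHIVE
    tar -czf "$2" -C "$1" .
}

# Unpack an archive into a directory, creating the directory if needed.
unpack_archive() {      # usage: unpack_archive ARCHIVE TARGET_DIR
    mkdir -p "$2"
    tar -xzf "$1" -C "$2"
}

main() {
    mkdir -p "$SCRATCH_DIR/output" "$OUTPUT_DIR"

    # Stage in: write one archive to scratch, then unpack it there.
    pack_dir "$INPUT_DIR" "$SCRATCH_DIR/input.tar.gz"
    unpack_archive "$SCRATCH_DIR/input.tar.gz" "$SCRATCH_DIR/input"

    # Run with all reads and writes on scratch (my_program is a placeholder).
    my_program --input "$SCRATCH_DIR/input" --output "$SCRATCH_DIR/output"

    # Stage out: copy the results back to home as a single archive.
    pack_dir "$SCRATCH_DIR/output" "$OUTPUT_DIR/results_${SLURM_JOB_ID}.tar.gz"
}

# Run the workflow only when the script is executed as a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    main
fi
```

The point of the design is that the program itself only ever touches scratch; the home file system sees one archive going out and one coming back.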

How to verify if a job runs efficiently

To verify that your job is using all nodes and cores, you first need the ID of your job. The job ID is shown when you submit a job using sbatch, or you may find it in the queue using

squeue -u [username]
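
If you submit jobs from a script, the job ID can also be captured automatically. The sketch below assumes that sbatch reports its usual "Submitted batch job <ID>" line; the helper name job_id_from_sbatch and the script name my_job.sh are hypothetical.

```shell
# Read sbatch output on stdin and print the numeric job ID taken from the
# "Submitted batch job <ID>" line.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Example use (my_job.sh is a placeholder for your own job script):
#   job_id=$(sbatch my_job.sh | job_id_from_sbatch)
#   squeue -h -j "$job_id" -o "%N"    # print the nodes assigned to the job
```

The node list printed by the squeue call in the comment gives the host names of the nodes on which the job is running.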

Then, you can use the ssh command to log in to any of the nodes where your job is running:

ssh [node_hostname]

Once on the node, you can use the "top" Unix command to show the processes running on the node and their CPU utilisation.

The output you expect depends on the type of parallelisation you have used. If you are running a multithreaded program on a 16-core node, expect to see a single process using far more than 100% CPU (ideally close to 1600%, although in practice it is usually lower). For example, you may see something like:

top - 18:01:24 up 22 days,  8:00,  0 users,  load average: 6.89, 7.34, 4.94
Tasks: 212 total,   2 running, 210 sleeping,   0 stopped,   0 zombie
%Cpu(s): 43.1 us,  0.2 sy,  0.0 ni, 56.6 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  66062740 total,  1275784 used, 64786956 free,      988 buffers
KiB Swap:  3999996 total,        0 used,  3999996 free,   631744 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
10118 casparl   20   0 43784  668  564 R  1525  0.0   0:48.29 MyMultithreadedProgram
 9902 casparl   20   0 21908 3144 2848 S     0  0.0   0:00.01 bash
10110 casparl   20   0 21916 2556 2244 S     0  0.0   0:00.00 bash
10145 casparl   20   0 95696 3896 2824 S     0  0.0   0:00.00 sshd
10146 casparl   20   0 23636 2528 2120 R     0  0.0   0:00.00 top

This way, we have verified that our program indeed runs multithreaded, on all cores.
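
As a worked example of the arithmetic: top reports 1525% for the process above, and the node has 16 cores, so each core is on average about 95% busy. The helper below is only an illustrative sketch; its name is hypothetical.

```shell
# Print the average utilisation per core, in percent, given the %CPU value
# that top reports for a process and the number of cores on the node.
per_core_utilisation() {    # usage: per_core_utilisation CPU_PERCENT CORES
    awk -v cpu="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", cpu / cores }'
}

per_core_utilisation 1525 16    # prints 95.3
```

A value far below 100 indicates that some cores are idle for part of the time, or that fewer threads are running than there are cores.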

If you run multiple instances of a serial program in parallel, or a single-threaded MPI program, you should see one process per core (16 processes on a 16-core node), each ideally using close to 100% CPU. For example, inspecting the node we obtain:

top - 13:45:28 up 32 days, 28 min,  2 users,  load average: 6.30, 1.74, 0.59
Tasks: 309 total,   1 running, 308 sleeping,   0 stopped,   0 zombie
%Cpu(s): 94.4 us,  4.5 sy,  0.0 ni,  1.0 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 32866356 total, 24265008 free,  5624304 used,  2977044 buff/cache
KiB Swap:  3999996 total,  3999996 free,        0 used. 26567764 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
43682 casparl   20   0 2976144 358208 173084 S 100.0  1.1   0:09.38 MyMPIprogram
43689 casparl   20   0 2976820 363664 174172 S  99.7  1.1   0:09.25 MyMPIprogram
43691 casparl   20   0 2973572 354756 169236 S  99.3  1.1   0:09.46 MyMPIprogram
43695 casparl   20   0 2975944 363824 173528 S  99.0  1.1   0:09.23 MyMPIprogram
43685 casparl   20   0 2973568 358776 169416 S  98.3  1.1   0:09.31 MyMPIprogram
43681 casparl   20   0 2972544 359216 169528 S  98.0  1.1   0:09.29 MyMPIprogram
43680 casparl   20   0 2972544 357444 169112 S  97.4  1.1   0:09.40 MyMPIprogram
43683 casparl   20   0 2975192 378908 171928 S  97.4  1.2   0:09.27 MyMPIprogram
43693 casparl   20   0 2978540 361980 174192 S  97.4  1.1   0:09.22 MyMPIprogram
43687 casparl   20   0 2975300 361368 172636 S  97.0  1.1   0:09.16 MyMPIprogram
43690 casparl   20   0 2972544 377844 169588 S  97.0  1.1   0:09.31 MyMPIprogram
43694 casparl   20   0 2972544 357516 169508 S  97.0  1.1   0:09.29 MyMPIprogram
43684 casparl   20   0 2975660 360200 171068 S  96.4  1.1   0:09.35 MyMPIprogram
43686 casparl   20   0 2974208 357668 170708 S  96.4  1.1   0:09.35 MyMPIprogram
43688 casparl   20   0 2972544 356644 168752 S  95.0  1.1   0:09.29 MyMPIprogram
43692 casparl   20   0 2977632 360816 174820 S  94.4  1.1   0:09.09 MyMPIprogram
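
Counting processes by eye becomes tedious on nodes with many cores. The sketch below summarises top output per command name: the number of processes and their summed %CPU. It assumes the default top column layout shown above and command names without spaces; the function name summarise_top is hypothetical.

```shell
# Summarise top output read from stdin: for each command name, print the
# number of processes and their summed %CPU.
summarise_top() {
    awk '
        # The header row tells us which columns hold %CPU and COMMAND.
        $1 == "PID" {
            for (i = 1; i <= NF; i++) {
                if ($i == "%CPU")    cpu_col = i
                if ($i == "COMMAND") cmd_col = i
            }
            next
        }
        # Process rows come after the header and start with a numeric PID.
        cpu_col && $1 ~ /^[0-9]+$/ {
            count[$cmd_col]++
            total[$cmd_col] += $cpu_col
        }
        END {
            for (cmd in count)
                printf "%s %d %.1f\n", cmd, count[cmd], total[cmd]
        }
    '
}

# Example use on a compute node (batch mode, one iteration, own processes):
#   top -b -n 1 -u "$USER" | summarise_top
```

For a single-threaded MPI program on a 16-core node, you would hope to see 16 processes with a summed %CPU close to 1600.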

For more information on how to interpret the output, see

man top