Common reasons for inefficient CPU jobs
The three most common reasons for a job to run inefficiently are:
1. The file systems are not used efficiently.
2. A number of nodes are reserved for a job, but the job is only running on one node.
3. Only a single core is being used on each node.
In each of these cases, one or more CPU cores sit idle, either because they are waiting for data to be read from or written to disk (case 1), or because the application was not started in a way that uses all available cores (cases 2 and 3). This wastes resources and, more importantly, your CPU budget, because you are charged for all the cores reserved by your job script (see accounting for Snellius).
The solutions are:
- Case 1 - If you use many or large input files, copy them from the home file system to scratch before starting your program. Consider bundling them into a single archive first (using tar, optionally with compression). Write intermediate results only to scratch. If you produce many or large output files, write them to scratch and copy them to the home file system afterwards, again preferably as a single archive.
- Cases 2 and 3 - To use all nodes and cores, use appropriate parallelization. It is also essential to verify that your job runs on all nodes and cores the way you intended: mistakes are easy to make, and the difference between running on all cores of a node and running on just one may be a single character in your job script.
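The staging steps described for case 1 could be sketched as two small shell functions. This is a minimal sketch, not a Snellius-specific recipe: the function names stage_in and stage_out, the input/ and output/ subdirectories, and all paths passed in are hypothetical, and it assumes the input files were already packed into a tar archive on the home file system.

```shell
#!/bin/bash
# Sketch only: function names, directory layout and paths are hypothetical.

# Unpack an input archive onto scratch.
# $1 = tar archive (for example on the home file system)
# $2 = scratch directory
stage_in() {
    local archive="$1" scratch="$2"
    mkdir -p "$scratch/input"
    # Reading one archive replaces many small reads on the shared file system.
    tar -xf "$archive" -C "$scratch/input"
}

# Pack the results written to scratch into one compressed archive.
# $1 = scratch directory
# $2 = destination directory (for example on the home file system)
# $3 = name of the archive to create
stage_out() {
    local scratch="$1" dest="$2" name="$3"
    mkdir -p "$dest"
    tar -C "$scratch/output" -czf "$dest/$name" .
}
```

In a job script, stage_in would be called before the program starts and stage_out after it finishes, with the program itself reading from and writing to the scratch directory only.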
How to verify if a job runs efficiently
To verify that your job is using all nodes and cores, you first need the ID of your job. The job ID is shown when you submit a job using sbatch, or you can find it, together with the nodes the job is running on, in the queue using:
squeue -u [username]
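If you have several jobs in the queue, it may help to extract the node list of one job from the squeue output. The following is a minimal sketch: the function name nodes_of_job, the job ID 1234567 and the node names are hypothetical, and it assumes squeue is asked for the columns job ID, state and node list (format codes %i, %T and %N).

```shell
# Print the node list of a running job.
# $1 = job ID; lines of the form "JOBID STATE NODELIST" are read from stdin.
nodes_of_job() {
    awk -v id="$1" '$1 == id && $2 == "RUNNING" { print $3 }'
}

# On a system with Slurm, the input would come from squeue, for example:
#   squeue -u "$USER" -h -o "%i %T %N" | nodes_of_job 1234567
# A compressed node list such as tcn[101-102] (hypothetical names) can be
# expanded into one hostname per line with:
#   scontrol show hostnames "tcn[101-102]"
```

Pending jobs have no node list yet, so the function prints nothing for them.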
Then, you can use the ssh command to log in to any of the nodes where your job is running:
ssh [node_hostname]
Once on the node, you can use the Unix command top to show the processes running on the node and their CPU utilisation.
The output you should expect depends on the type of parallelisation you have used. If you are running a multithreaded program on a 16-core node, you should see a single process using far more than 100% CPU (ideally close to 1600%, although in practice it is usually lower). For example, you may see something like:
top - 18:01:24 up 22 days, 8:00, 0 users, load average: 6.89, 7.34, 4.94
Tasks: 212 total, 2 running, 210 sleeping, 0 stopped, 0 zombie
%Cpu(s): 43.1 us, 0.2 sy, 0.0 ni, 56.6 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 66062740 total, 1275784 used, 64786956 free, 988 buffers
KiB Swap: 3999996 total, 0 used, 3999996 free, 631744 cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10118 casparl 20 0 43784 668 564 R 1525 0.0 0:48.29 MyMultithreadedProgram
9902 casparl 20 0 21908 3144 2848 S 0 0.0 0:00.01 bash
10110 casparl 20 0 21916 2556 2244 S 0 0.0 0:00.00 bash
10145 casparl 20 0 95696 3896 2824 S 0 0.0 0:00.00 sshd
10146 casparl 20 0 23636 2528 2120 R 0 0.0 0:00.00 top
This way, we have verified that our program indeed runs multithreaded, on all cores.
If you run multiple instances of a serial program in parallel, or a single-threaded MPI program, you should see one process per core (16 processes on a 16-core node), each ideally using close to 100% CPU. For example, inspecting the node may show:
top - 13:45:28 up 32 days, 28 min, 2 users, load average: 6.30, 1.74, 0.59
Tasks: 309 total, 1 running, 308 sleeping, 0 stopped, 0 zombie
%Cpu(s): 94.4 us, 4.5 sy, 0.0 ni, 1.0 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32866356 total, 24265008 free, 5624304 used, 2977044 buff/cache
KiB Swap: 3999996 total, 3999996 free, 0 used. 26567764 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
43682 casparl 20 0 2976144 358208 173084 S 100.0 1.1 0:09.38 MyMPIprogram
43689 casparl 20 0 2976820 363664 174172 S 99.7 1.1 0:09.25 MyMPIprogram
43691 casparl 20 0 2973572 354756 169236 S 99.3 1.1 0:09.46 MyMPIprogram
43695 casparl 20 0 2975944 363824 173528 S 99.0 1.1 0:09.23 MyMPIprogram
43685 casparl 20 0 2973568 358776 169416 S 98.3 1.1 0:09.31 MyMPIprogram
43681 casparl 20 0 2972544 359216 169528 S 98.0 1.1 0:09.29 MyMPIprogram
43680 casparl 20 0 2972544 357444 169112 S 97.4 1.1 0:09.40 MyMPIprogram
43683 casparl 20 0 2975192 378908 171928 S 97.4 1.2 0:09.27 MyMPIprogram
43693 casparl 20 0 2978540 361980 174192 S 97.4 1.1 0:09.22 MyMPIprogram
43687 casparl 20 0 2975300 361368 172636 S 97.0 1.1 0:09.16 MyMPIprogram
43690 casparl 20 0 2972544 377844 169588 S 97.0 1.1 0:09.31 MyMPIprogram
43694 casparl 20 0 2972544 357516 169508 S 97.0 1.1 0:09.29 MyMPIprogram
43684 casparl 20 0 2975660 360200 171068 S 96.4 1.1 0:09.35 MyMPIprogram
43686 casparl 20 0 2974208 357668 170708 S 96.4 1.1 0:09.35 MyMPIprogram
43688 casparl 20 0 2972544 356644 168752 S 95.0 1.1 0:09.29 MyMPIprogram
43692 casparl 20 0 2977632 360816 174820 S 94.4 1.1 0:09.09 MyMPIprogram
For more information on how to interpret the output, see
man top
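Counting the busy processes by eye becomes tedious on nodes with many cores. As a rough sketch, the count could be taken from a single batch-mode snapshot of top instead. The function name count_busy and the 90% threshold are hypothetical choices, and the sketch assumes the default column layout shown above, in which %CPU is the ninth column.

```shell
# Count processes at or above a %CPU threshold in batch-mode top output.
# $1 = threshold in percent (default 90); top output is read from stdin.
count_busy() {
    awk -v min="${1:-90}" '
        $1 == "PID" { in_table = 1; next }   # process table starts after the header
        in_table && $9 + 0 >= min { n++ }    # column 9 is %CPU in the default layout
        END { print n + 0 }
    '
}

# On a node where your job is running, a snapshot could be piped in like this:
#   top -b -n 1 -u "$USER" | count_busy 90
```

For a job running one single-threaded process per core, the result should be close to the number of cores on the node. For a multithreaded program, a single process with a very high %CPU value is expected instead, so the count alone says little in that case.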