To verify whether your job uses the available GPUs, log in to the node running your job. You can do this by directly ssh-ing into the node with ssh <node-id>. Once on the node, run the command nvidia-smi to check the current usage ('volatility', i.e. the 'Volatile GPU-Util' column) of the GPUs in the node. There are a couple of potential outcomes, which we will discuss one by one.
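If you want to see just the utilization numbers and have them refresh automatically, nvidia-smi can also be run in query mode; the 5-second interval below is only an example:

nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.used --format=csv -l 5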
All GPUs show a (near) 0% volatility, and no processes are listed
Several things could be going on here. First of all, many codes only run certain parts on the GPU, so the GPUs will not necessarily be busy all the time. Rerun nvidia-smi every now and then to check if your process shows up. If it never shows up, it may not be running on the GPU at all: possibly the code doesn't have GPU support, or it was not compiled with GPU support. If you are using third-party code, check whether you need to set certain compiler flags during compilation to enable GPU support.
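Instead of rerunning the command by hand, you can let it refresh itself every few seconds, for example:

watch -n 5 nvidia-smi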
One or more processes are listed, but all GPUs show a (near) 0% volatility
The process may be spending a lot of time waiting for the CPU to deliver data to the GPU (i.e. the bandwidth between CPU and GPU is the bottleneck), or it may be a code that only runs certain parts on the GPU. Rerun nvidia-smi every now and then to see if the volatility is ever higher. If it is, the code probably runs partly on the CPU and partly on the GPU. If your process is listed but the volatility stays at 0%, your code has probably initialized the GPU but is not sending any work to it.
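To double-check which processes actually hold a context on the GPUs (and how much GPU memory they have allocated), you can list the compute applications explicitly; the field names below come from nvidia-smi's query interface and may differ slightly between driver versions:

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv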
One or more processes are listed, but only a single GPU shows non-zero volatility
Your code is either not suitable for running on multiple GPUs, or you may have to specify certain arguments when running your code. Many codes that run on multiple GPUs will automatically detect the number of GPUs in the system, but this is not always the case.
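How multi-GPU execution is enabled is entirely code-specific, so check the documentation of your application. As an illustration only: if your application happens to be built on PyTorch, you can quickly check how many GPUs the framework detects on the node with a one-liner:

python -c "import torch; print(torch.cuda.device_count())"

If this prints the full number of GPUs but only one is used, the application most likely needs an explicit option (or a different launcher) to use more than one.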
All GPUs show non-zero volatility, but (far) below 100%
Your GPUs may be waiting for data from the CPU part of the time (i.e. the bandwidth between CPU and GPU is the bottleneck). Alternatively, it may again be a code that runs partly on the CPU and partly on the GPU, in which case the GPU spends part of its time waiting for CPU calculations to finish.
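One way to get a feeling for whether data transfers play a role is to monitor GPU utilization and PCIe traffic side by side, for example with nvidia-smi's built-in device monitor ('u' selects utilization, 't' selects PCIe Rx/Tx throughput):

nvidia-smi dmon -s ut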
All GPUs show (almost) 100% volatility
You're most likely using the GPUs efficiently: your calculations saturate either the floating point performance of the GPU, or the memory bandwidth between GPU memory and the GPU cores. If you designed your own CUDA kernels and performance is still below what you expect, you may be limited by that memory bandwidth; in that case, more efficiently designed CUDA kernels may help you approach the maximum floating point capacity of the GPU.
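To find out whether an individual kernel is compute-bound or memory-bound, a GPU profiler is the most direct tool. For example, NVIDIA's Nsight Compute command line profiler can be run on your application (./my_app below is a placeholder for your own executable) and reports, per kernel, how close you are to the compute and memory limits of the device:

ncu ./my_app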
Volatility lists 'N/A'
This is the case when you work on a MIG instance instead of a full A100. MIG, or Multi-Instance GPU, is a technique by which a single physical GPU can be split into several smaller, virtual GPUs. It is useful for problems for which you cannot use a full A100 efficiently, e.g. because your problem size is too small. To know whether you are using a MIG instance efficiently, you have to use dcgmi dmon, which is explained below.