How to use dcgmi dmon to monitor GPU utilization
The dcgmi dmon monitoring tool is very powerful and can show many metrics about how the GPU is being used. A word of warning, though: monitoring in such detail may slightly slow down your code. It is therefore not advised to keep such monitoring running all the time, but only to use it when you want to investigate the runtime behaviour and efficiency of your program. First, you will want to figure out which GPU has been assigned to you, so that you can query the correct one. To do that, we run:
$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-2512619e-298b-bb53-09f3-7867b8d1454a)
This example would be a typical result for a job to which one GPU has been allocated, out of the four available in a GPU node.
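If you want to use this UUID later (e.g. in a job script), you can capture it in a shell variable. Below is a minimal sketch, assuming nvidia-smi is in your PATH and exactly one GPU is visible to your job:

# Store the UUID of the (single) GPU that is visible to our job
MY_GPU_UUID=$(nvidia-smi --query-gpu=uuid --format=csv,noheader | head -n 1)
echo "${MY_GPU_UUID}"    # e.g. GPU-2512619e-298b-bb53-09f3-7867b8d1454a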
Now, we run:
$ dcgmi discovery -l
4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:31:00.0                                         |
|        | Device UUID: GPU-ce4cef33-0cee-c533-c9ca-f2c11c07e3bf                |
+--------+----------------------------------------------------------------------+
| 1      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:32:00.0                                         |
|        | Device UUID: GPU-c30a2b41-7fc7-c244-2896-1a69df3bccc8                |
+--------+----------------------------------------------------------------------+
| 2      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:CA:00.0                                         |
|        | Device UUID: GPU-2512619e-298b-bb53-09f3-7867b8d1454a                |
+--------+----------------------------------------------------------------------+
| 3      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:E3:00.0                                         |
|        | Device UUID: GPU-357ed8c2-a0ea-a164-c218-0ca3862bd7bf                |
+--------+----------------------------------------------------------------------+
As you can see, dcgmi always shows all GPUs in the node, even the ones that have not been assigned to you. By comparing the UUIDs in the two outputs, we see that the GPU assigned to us has index '2' in dcgmi.
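If you prefer to automate this lookup rather than comparing UUIDs by eye, the following is a minimal sketch that parses the dcgmi discovery -l table. The awk logic assumes the table layout shown above and reuses the MY_GPU_UUID variable from the earlier snippet:

# Hypothetical helper: print the dcgmi GPU ID whose Device UUID matches our GPU.
# Assumes MY_GPU_UUID was set as in the snippet above.
dcgmi discovery -l | awk -v uuid="${MY_GPU_UUID}" '
  $2 ~ /^[0-9]+$/                   { id = $2 }   # remember the GPU ID of the current table row
  /Device UUID/ && index($0, uuid)  { print id }  # print it when the row contains our UUID
'

For the example output above, this would print 2.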
Now, we can run the monitoring, e.g.:
dcgmi dmon -e 203,204,1001,1002,1003,1004,1005,1009,1010,1011,1012,155 -i 2
Here, the number following -i is our GPU ID. The numbers following -e are the field IDs of the metrics we want to query. The most relevant metrics are probably the profiling metrics (see the official dcgmi documentation). Some example output:
#Entity ID   GPUTL  MCUTL  GRACT  SMACT  SMOCC  TENSO  DRAMA  PCITX      PCIRX      NVLTX  NVLRX  POWER (W)
GPU 0        82     39     0.779  0.562  0.233  0.140  0.296  31122275   148592109  0      0      204.109
GPU 0        82     39     0.797  0.575  0.238  0.143  0.303  24381091   150507324  0      0      204.109
GPU 0        6      3      0.409  0.302  0.128  0.077  0.154  23075223   112425741  0      0      173.644
GPU 0        24     11     0.247  0.191  0.084  0.052  0.091  21383893   114272271  0      0      207.260
For the exact meaning of each of these, read the official documentation here and here. In short, they are:
- GPUTL: GPU utilization (the same quantity as the 'GPU-Util' column printed by nvidia-smi). Indicates the percentage of time that any kernel is active on the GPU. It does not indicate how much of the GPU that kernel is using.
- MCUTL: Memory utilization.
- GRACT: The fraction of time the graphics or compute engines were active.
- SMACT: The fraction of time at least one warp was active on a multiprocessor, averaged over all multiprocessors.
- SMOCC: The fraction of resident warps on a multiprocessor, relative to the maximum number of concurrent warps supported on a multiprocessor.
- TENSO: Utilization of the Tensor Cores.
- DRAMA: The fraction of cycles where data was sent to or received from device memory.
- PCITX/PCIRX: PCIe bandwidth transmitted (TX) or received (RX).
- NVLTX/NVLRX: NVLink bandwidth transmitted (TX) or received (RX).
- POWER: Total power consumption of the GPU (in W).
Note that the official documentation gives some reasonable guidelines on what 'good' values are for these metrics. They are quite hard to interpret, as the maximum achievable value also depends on the nature and size of the computational problem. Still, GPUTL under 75%, SMACT below 0.5, or (in the case of deep learning) no Tensor Core usage (TENSO = 0.000) are indications that you are underutilizing the GPU. If your problem size is too small for a full A100, consider using one of our MIG-partitioned A100s.
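As a rough illustration, the sketch below samples a few of these metrics for a minute and prints a hint when the averages fall below the rules of thumb above. It is a hypothetical helper, not an official tool: the GPU ID (2), the thresholds and the awk parsing of the dmon output layout are assumptions you may need to adapt, and it relies on your dcgmi version supporting the -d (delay in ms) and -c (sample count) options.

# Hypothetical helper: average GPUTL (203), SMACT (1002) and TENSO (1004) over
# 60 one-second samples and flag possible under-utilization of GPU 2.
# With only these three fields requested, data rows look like: "GPU 2   82  0.562  0.140"
dcgmi dmon -e 203,1002,1004 -i 2 -d 1000 -c 60 | awk '
  /^GPU/ {
      gputl += $3; smact += $4; tenso += $5; n++
  }
  END {
      if (n == 0) { print "no samples collected"; exit 1 }
      printf "averages over %d samples: GPUTL=%.1f%% SMACT=%.3f TENSO=%.3f\n", n, gputl / n, smact / n, tenso / n
      if (gputl / n < 75)  print "hint: GPUTL below 75% - the GPU is idle part of the time"
      if (smact / n < 0.5) print "hint: SMACT below 0.5 - kernels occupy only part of the GPU"
      if (tenso / n == 0)  print "hint: TENSO is 0 - the Tensor Cores are not being used"
  }'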
There are many more metrics that you can query. You can list them using
dcgmi dmon --list
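By default, dcgmi dmon prints a new sample every second until you stop it with Ctrl-C. If your dcgmi version supports them (check dcgmi dmon --help), the -d and -c options let you control the sampling interval and the number of samples, for example:

# Sample every 5 seconds (-d is in milliseconds), 24 times (about 2 minutes), then exit
dcgmi dmon -e 203,204,1001,1002,1003,1004,1005,1009,1010,1011,1012,155 -i 2 -d 5000 -c 24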
How to use dcgmi dmon to monitor GPU utilization when using MIG instances
The approach is similar to the one above, with two differences:
- We need to determine which MIG instance we are using, rather than which GPU
- Some metrics cannot be measured for MIG instances.
First, we run:
$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-82f16363-e6df-5bed-3cef-c1a484831465)
  MIG 3g.20gb     Device  0: (UUID: MIG-5e639bae-ed68-53c8-b15c-0d3e2518ecc5)
to determine the UUID of the GPU we are running on. Then, we run:
$ nvidia-smi
Thu Oct 26 14:32:22 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:31:00.0 Off |                  Off |
| N/A   31C    P0              86W / 400W |   1609MiB / 40960MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    1   0   0  |             1572MiB / 19968MiB | 42     N/A |  3   0    2    0    0 |
|                  |                2MiB / 32767MiB |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory  |
|        ID   ID                                                             Usage       |
|=========================================================================================|
|    0    1    0     882588      C   ...on/3.10.4-GCCcore-11.3.0/bin/python      1526MiB |
+---------------------------------------------------------------------------------------+
Here, we extract the MIG GPU instance ID (the 'GI ID' column in the 'MIG devices' section, bottom left), which in this case is '1'.
Finally, we run:
$ dcgmi discovery -c
+-------------------+--------------------------------------------------------------------+
| Instance Hierarchy                                                                      |
+===================+====================================================================+
| GPU 0             | GPU GPU-82f16363-e6df-5bed-3cef-c1a484831465 (EntityID: 0)         |
| -> I 0/1          | GPU Instance (EntityID: 0)                                         |
|    -> CI 0/1/0    | Compute Instance (EntityID: 0)                                     |
| -> I 0/2          | GPU Instance (EntityID: 1)                                         |
|    -> CI 0/2/0    | Compute Instance (EntityID: 1)                                     |
+-------------------+--------------------------------------------------------------------+
| GPU 1             | GPU GPU-7f81592f-2b32-bd81-2307-81243a539e04 (EntityID: 1)         |
| -> I 1/1          | GPU Instance (EntityID: 7)                                         |
|    -> CI 1/1/0    | Compute Instance (EntityID: 7)                                     |
| -> I 1/2          | GPU Instance (EntityID: 8)                                         |
|    -> CI 1/2/0    | Compute Instance (EntityID: 8)                                     |
+-------------------+--------------------------------------------------------------------+
| GPU 2             | GPU GPU-7eb5ea70-e6ed-88db-c0b7-08d594288d6b (EntityID: 2)         |
| -> I 2/1          | GPU Instance (EntityID: 14)                                        |
|    -> CI 2/1/0    | Compute Instance (EntityID: 14)                                    |
| -> I 2/2          | GPU Instance (EntityID: 15)                                        |
|    -> CI 2/2/0    | Compute Instance (EntityID: 15)                                    |
+-------------------+--------------------------------------------------------------------+
| GPU 3             | GPU GPU-f69f57d5-0468-5793-f018-89852bd0ae5d (EntityID: 3)         |
| -> I 3/1          | GPU Instance (EntityID: 21)                                        |
|    -> CI 3/1/0    | Compute Instance (EntityID: 21)                                    |
| -> I 3/2          | GPU Instance (EntityID: 22)                                        |
|    -> CI 3/2/0    | Compute Instance (EntityID: 22)                                    |
+-------------------+--------------------------------------------------------------------+
Now we see that we are running on GPU 0, GPU instance 1: the former by searching for our GPU UUID (GPU-82f16363-e6df-5bed-3cef-c1a484831465), the latter by looking up the MIG GI ID (1) for that GPU in the left-hand column (the row '-> I 0/1').
We can query this particular MIG instance by adapting the -i argument: instead of the GPU ID, we pass i:EntityID, where EntityID is that of the MIG instance as printed by dcgmi discovery -c (WARNING: note that the EntityIDs are not consecutive; for example, the first instance on GPU 1 has EntityID 7):
dcgmi dmon -e 203,204,1001,1002,1003,1004,1005,1009,1010,1011,1012,155 -i i:2
#Entity ID   GPUTL  MCUTL  GRACT  SMACT  SMOCC  TENSO  DRAMA  PCITX  PCIRX  NVLTX  NVLRX  POWER (W)
GPU-I 0      N/A    N/A    0.978  0.734  0.327  0.130  0.371  N/A    N/A    N/A    N/A    138.425
GPU-I 0      N/A    N/A    0.978  0.735  0.328  0.130  0.372  N/A    N/A    N/A    N/A    82.189
GPU-I 0      N/A    N/A    0.527  0.395  0.175  0.071  0.199  N/A    N/A    N/A    N/A    145.837
GPU-I 0      N/A    N/A    0.947  0.711  0.317  0.126  0.360  N/A    N/A    N/A    N/A    144.452
This is actually the same use case as before, but now running on a 3g.20gb MIG instance (3/7ths of an A100 in terms of compute). As you can see, some metrics report N/A, as they cannot be measured for MIG instances. What we can clearly see, however, is that running this code on a MIG instance instead of a full A100 results in much higher SMACT, SMOCC and TENSO values. We also timed this code: it ran about 30% slower (on roughly 50% less hardware). Thus, if efficiency is more important than time to solution, this example is better off running on a MIG instance.
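Finally, if you want to script this lookup (for instance to start the monitoring from a job script), the sketch below extracts the EntityID from the dcgmi discovery -c table. The grep/sed parsing of the table layout shown above and the instance path I 0/1 are assumptions; adapt them to your own GPU and GPU instance:

# Hypothetical helper: look up the EntityID of GPU instance 1 on GPU 0
# from the "Instance Hierarchy" table, then monitor that MIG instance.
ENTITY_ID=$(dcgmi discovery -c | grep -F -e '-> I 0/1 ' | sed -n 's/.*EntityID: \([0-9]*\).*/\1/p')
echo "Monitoring MIG instance with EntityID ${ENTITY_ID}"
# Query only the profiling metrics and power; the other fields report N/A on MIG instances
dcgmi dmon -e 1001,1002,1003,1004,1005,155 -i "i:${ENTITY_ID}"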