How to use dcgmi dmon to monitor GPU utilization

The dcgmi dmon monitoring tool is very powerful and can show many metrics regarding the usage of the GPU. A word of warning though: monitoring in such detail may slightly slow down your code. It is therefore not advised to run such monitoring all the time, but only when you want to investigate the runtime behaviour and efficiency of your program. First, you'll want to figure out which GPU has been assigned to you, so that you can query the correct one. To do that, we run:

$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-2512619e-298b-bb53-09f3-7867b8d1454a)

This example would be a typical result for a job to which one GPU has been allocated, out of the four available in a GPU node.
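If you want to use this UUID in a script, you can also query it directly. The following is only a minimal sketch; nvidia-smi's --query-gpu option is standard, but note that it prints one UUID per line if more than one GPU has been assigned to your job:

# Store the UUID(s) of the GPU(s) assigned to this job in a shell variable
MY_GPU_UUID=$(nvidia-smi --query-gpu=uuid --format=csv,noheader)
echo "${MY_GPU_UUID}"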

Now, we run 

$ dcgmi discovery -l
4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:31:00.0                                         |
|        | Device UUID: GPU-ce4cef33-0cee-c533-c9ca-f2c11c07e3bf                |
+--------+----------------------------------------------------------------------+
| 1      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:32:00.0                                         |
|        | Device UUID: GPU-c30a2b41-7fc7-c244-2896-1a69df3bccc8                |
+--------+----------------------------------------------------------------------+
| 2      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:CA:00.0                                         |
|        | Device UUID: GPU-2512619e-298b-bb53-09f3-7867b8d1454a                |
+--------+----------------------------------------------------------------------+
| 3      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:E3:00.0                                         |
|        | Device UUID: GPU-357ed8c2-a0ea-a164-c218-0ca3862bd7bf                |
+--------+----------------------------------------------------------------------+

As you can see, dcgmi always shows all GPUs in the node, even the ones that have not been assigned to you. By comparing the UUIDs in the two outputs, we see that the GPU assigned to us has index '2' in dcgmi. Now, we can run the monitoring, e.g.

dcgmi dmon -e 203,204,1001,1002,1003,1004,1005,1009,1010,1011,1012,155 -i 2

Here, the number following the -i is our GPU ID. The numbers following the -e indicate the metrics we want to query. The most relevant metrics are probably the profiling metrics (see the official dcgmi documentation). Some example output:

#Entity ID  GPUTL      MCUTL      GRACT        SMACT        SMOCC        TENSO        DRAMA        PCITX       PCIRX          NVLTX      NVLRX     POWER W                                                                                                                                                                                                                                                                                                                                                                                              
GPU 0       82         39         0.779        0.562        0.233        0.140        0.296        31122275    148592109      0          0         204.109
GPU 0       82         39         0.797        0.575        0.238        0.143        0.303        24381091    150507324      0          0         204.109
GPU 0       6          3          0.409        0.302        0.128        0.077        0.154        23075223    112425741      0          0         173.644
GPU 0       24         11         0.247        0.191        0.084        0.052        0.091        21383893    114272271      0          0         207.260

For the exact meaning of each of these, read the official documentation here and here. In short, they are:

  • GPUTL: GPU utilization (similar to the volatile 'GPU-Util' printed by nvidia-smi). Indicates the percentage of time that any kernel is active on the GPU; it does not indicate how much of the GPU that kernel is using.
  • MCUTL: Memory utilization.
  • GRACT: Fraction of time the graphics or compute engines were active.
  • SMACT: The fraction of time at least one warp was active on a multiprocessor, averaged over all multiprocessors.
  • SMOCC: The fraction of resident warps on a multiprocessor, relative to the maximum number of concurrent warps supported on a multiprocessor.
  • TENSO: Utilization of the tensor cores.
  • DRAMA: Fraction of cycles where data was sent to or received from device memory.
  • PCITX/PCIRX: PCIe bandwidth transmitted (TX) or received (RX).
  • NVLTX/NVLRX: NVLink bandwidth transmitted (TX) or received (RX).
  • POWER: Total power consumption of the GPU, in Watts.

Note that the official documentation gives some reasonable guidelines for what 'good' values are for these metrics. They can be quite hard to interpret, as the maximum achievable values also depend on the nature and size of the computational problem. Still, a GPUTL under 75%, an SMACT below 0.5, or (in the case of deep learning) no Tensor Core usage at all (TENSO = 0.000) are indications that you're underutilizing the GPU. If your problem size is too small for a full A100, consider using one of our MIG-partitioned A100s.
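If you want a quick, automated check against these rules of thumb, you can let dcgmi dmon take a fixed number of samples and average them afterwards. The sketch below assumes the -d (sampling delay in milliseconds) and -c (sample count) options of dcgmi dmon, and GPU ID 2 as in the example above; the awk column index depends on the order of the fields passed to -e:

# Sample SMACT (field 1002) and TENSO (field 1004) once per second, 60 times
dcgmi dmon -e 1002,1004 -i 2 -d 1000 -c 60 > dmon_samples.txt

# Average the SMACT column (3rd column: entity label, entity ID, SMACT, TENSO)
awk '/^GPU/ {sum += $3; n++} END {if (n) printf "mean SMACT: %.3f over %d samples\n", sum/n, n}' dmon_samples.txt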

There are many more metrics that you can query. You can list them using

dcgmi dmon --list
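Putting the pieces together, the UUID matching from the previous steps can also be scripted, so that the monitoring automatically targets the GPU assigned to your job. This is only a sketch: it parses the table printed by dcgmi discovery -l, whose exact layout may differ between DCGM versions, so double-check it against your own output:

# UUID of the GPU assigned to this job
MY_GPU_UUID=$(nvidia-smi --query-gpu=uuid --format=csv,noheader)

# Look up the dcgmi GPU ID whose 'Device UUID' row matches that UUID.
# 'grep -B 2' also prints the 'Name:' row two lines above, which carries the GPU ID.
GPU_ID=$(dcgmi discovery -l | grep -B 2 "${MY_GPU_UUID}" | awk -F'|' '/Name:/ {gsub(/ /, "", $2); print $2}')

# Start the monitoring for that GPU
dcgmi dmon -e 203,204,1001,1002,1003,1004,1005,1009,1010,1011,1012,155 -i "${GPU_ID}"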

How to use dcgmi dmon to monitor GPU utilization when using MIG instances

The approach is similar to the one above, with two differences:

  • We need to determine which MIG instance we are using, rather than which GPU.
  • Some metrics cannot be measured for MIG instances.

First, we run

$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-82f16363-e6df-5bed-3cef-c1a484831465)
  MIG 3g.20gb     Device  0: (UUID: MIG-5e639bae-ed68-53c8-b15c-0d3e2518ecc5)

to determine the UUID of the GPU we are running on. Then, we run:

$ nvidia-smi
Thu Oct 26 14:32:22 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:31:00.0 Off |                  Off |
| N/A   31C    P0              86W / 400W |   1609MiB / 40960MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    1   0   0  |            1572MiB / 19968MiB  | 42    N/A |  3   0    2    0    0 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    1    0     882588      C   ...on/3.10.4-GCCcore-11.3.0/bin/python     1526MiB |
+---------------------------------------------------------------------------------------+

Here we extract the MIG GI ID (bottom left, under 'MIG devices'), which in this case is '1'.

Finally, we run:

$ dcgmi discovery -c
+-------------------+--------------------------------------------------------------------+
| Instance Hierarchy                                                                     |
+===================+====================================================================+
| GPU 0             | GPU GPU-82f16363-e6df-5bed-3cef-c1a484831465 (EntityID: 0)         |
| -> I 0/1          | GPU Instance (EntityID: 0)                                         |
|    -> CI 0/1/0    | Compute Instance (EntityID: 0)                                     |
| -> I 0/2          | GPU Instance (EntityID: 1)                                         |
|    -> CI 0/2/0    | Compute Instance (EntityID: 1)                                     |
+-------------------+--------------------------------------------------------------------+
| GPU 1             | GPU GPU-7f81592f-2b32-bd81-2307-81243a539e04 (EntityID: 1)         |
| -> I 1/1          | GPU Instance (EntityID: 7)                                         |
|    -> CI 1/1/0    | Compute Instance (EntityID: 7)                                     |
| -> I 1/2          | GPU Instance (EntityID: 8)                                         |
|    -> CI 1/2/0    | Compute Instance (EntityID: 8)                                     |
+-------------------+--------------------------------------------------------------------+
| GPU 2             | GPU GPU-7eb5ea70-e6ed-88db-c0b7-08d594288d6b (EntityID: 2)         |
| -> I 2/1          | GPU Instance (EntityID: 14)                                        |
|    -> CI 2/1/0    | Compute Instance (EntityID: 14)                                    |
| -> I 2/2          | GPU Instance (EntityID: 15)                                        |
|    -> CI 2/2/0    | Compute Instance (EntityID: 15)                                    |
+-------------------+--------------------------------------------------------------------+
| GPU 3             | GPU GPU-f69f57d5-0468-5793-f018-89852bd0ae5d (EntityID: 3)         |
| -> I 3/1          | GPU Instance (EntityID: 21)                                        |
|    -> CI 3/1/0    | Compute Instance (EntityID: 21)                                    |
| -> I 3/2          | GPU Instance (EntityID: 22)                                        |
|    -> CI 3/2/0    | Compute Instance (EntityID: 22)                                    |
+-------------------+--------------------------------------------------------------------+

Now, we see that we are running on GPU 0, GPU instance ID 1: we find the former by searching for the GPU UUID (GPU-82f16363-e6df-5bed-3cef-c1a484831465), and the latter by looking up the MIG GI ID (1) for that GPU in the left-hand column, i.e. the entry 'I 0/1', which has EntityID 0.
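Instead of reading the table by eye, you can also grep it for your GPU UUID; in this particular layout, the four lines following the matching GPU line are its GPU and Compute Instances (a minimal sketch):

$ dcgmi discovery -c | grep -A 4 "GPU-82f16363-e6df-5bed-3cef-c1a484831465"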

We can query this particular MIG instance by adapting the -i argument: instead of the GPU ID, we pass i:EntityID, where EntityID is that of the MIG instance as printed by dcgmi discovery -c (WARNING: please note that the EntityIDs are not consecutive: the first instance on GPU 1 has EntityID 7, not 2):

dcgmi dmon -e 203,204,1001,1002,1003,1004,1005,1009,1010,1011,1012,155 -i i:0

#Entity ID  GPUTL      MCUTL      GRACT        SMACT        SMOCC        TENSO        DRAMA        PCITX       PCIRX          NVLTX      NVLRX     POWER W
GPU-I 0     N/A        N/A        0.978        0.734        0.327        0.130        0.371        N/A         N/A            N/A        N/A       138.425
GPU-I 0     N/A        N/A        0.978        0.735        0.328        0.130        0.372        N/A         N/A            N/A        N/A       82.189
GPU-I 0     N/A        N/A        0.527        0.395        0.175        0.071        0.199        N/A         N/A            N/A        N/A       145.837
GPU-I 0     N/A        N/A        0.947        0.711        0.317        0.126        0.360        N/A         N/A            N/A        N/A       144.452

This is actually running the same use case as before, but on a 3g.20gb MIG instance (3/7th of an A100 in terms of compute). As you can see, some metrics report N/A, as they cannot be measured for MIG instances. What we can clearly see, though, is that running this code on a MIG instance instead of a full A100 results in much higher SMACT, SMOCC and TENSO values. We've also timed this code, and it ran about 30% slower (on 50% less hardware). Thus, if efficiency is more important than time to solution, this example is better off running on a MIG instance.
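In practice, you will usually want the monitoring to run alongside your job rather than in a separate interactive session. A simple pattern is to start dcgmi dmon in the background, run your program, and stop the monitor afterwards. In the sketch below, 'python train.py' is just a placeholder for your own program, and i:0 is the entity ID we determined above:

# Start monitoring the MIG instance in the background, logging to a file
dcgmi dmon -e 1001,1002,1003,1004,1005,155 -i i:0 -d 1000 > dmon_mig.log &
DMON_PID=$!

# Run the actual workload (placeholder)
python train.py

# Stop the monitor once the job is done
kill "${DMON_PID}"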

