How to use dcgmi dmon to monitor GPU utilization

The dcgmi dmon monitoring tool is very powerful and can show many metrics on the usage of the GPU. A word of warning, though: monitoring in such detail may slightly slow down your code. It is therefore not advised to use such monitoring all the time, but only when you want to investigate the runtime behaviour and efficiency of your program.

First, you'll want to figure out which GPU has been assigned to you, so that you can query the correct one. To do that, we run:

$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-2512619e-298b-bb53-09f3-7867b8d1454a)

This example would be a typical result for a job to which one GPU has been allocated, out of the four available in a GPU node.

Now, we run:

$ dcgmi discovery -l
4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:31:00.0                                         |
|        | Device UUID: GPU-ce4cef33-0cee-c533-c9ca-f2c11c07e3bf                |
+--------+----------------------------------------------------------------------+
| 1      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:32:00.0                                         |
|        | Device UUID: GPU-c30a2b41-7fc7-c244-2896-1a69df3bccc8                |
+--------+----------------------------------------------------------------------+
| 2      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:CA:00.0                                         |
|        | Device UUID: GPU-2512619e-298b-bb53-09f3-7867b8d1454a                |
+--------+----------------------------------------------------------------------+
| 3      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:E3:00.0                                         |
|        | Device UUID: GPU-357ed8c2-a0ea-a164-c218-0ca3862bd7bf                |
+--------+----------------------------------------------------------------------+
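
The GPU ID that dcgmi uses for your GPU is the one whose Device UUID matches the UUID printed by nvidia-smi -L. As a minimal sketch, assuming both outputs have been captured as text (here pasted in from the listings above, with the table rows shortened), the lookup could be scripted in Python. The function names are illustrative and not part of any NVIDIA tool.

```python
import re

# Sample text pasted from the two listings above. In a real job it could be
# captured with subprocess.run([...], capture_output=True, text=True).stdout.
NVIDIA_SMI_L = """\
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-2512619e-298b-bb53-09f3-7867b8d1454a)
"""

DCGMI_DISCOVERY_L = """\
| GPU ID | Device Information                                     |
| 0      | Name: NVIDIA A100-SXM4-40GB                            |
|        | Device UUID: GPU-ce4cef33-0cee-c533-c9ca-f2c11c07e3bf  |
| 1      | Name: NVIDIA A100-SXM4-40GB                            |
|        | Device UUID: GPU-c30a2b41-7fc7-c244-2896-1a69df3bccc8  |
| 2      | Name: NVIDIA A100-SXM4-40GB                            |
|        | Device UUID: GPU-2512619e-298b-bb53-09f3-7867b8d1454a  |
| 3      | Name: NVIDIA A100-SXM4-40GB                            |
|        | Device UUID: GPU-357ed8c2-a0ea-a164-c218-0ca3862bd7bf  |
"""


def allocated_gpu_uuids(nvidia_smi_text):
    """Return the GPU UUIDs listed by `nvidia-smi -L` (MIG lines are ignored)."""
    pattern = r"^GPU \d+:.*\(UUID: (GPU-[0-9a-f-]+)\)"
    return re.findall(pattern, nvidia_smi_text, re.MULTILINE)


def dcgm_ids_by_uuid(discovery_text):
    """Map each Device UUID in `dcgmi discovery -l` output to its GPU ID."""
    mapping = {}
    current_id = None
    for line in discovery_text.splitlines():
        # A row whose first cell is a number starts a new GPU entry.
        id_match = re.match(r"\|\s*(\d+)\s*\|", line)
        if id_match:
            current_id = int(id_match.group(1))
        uuid_match = re.search(r"Device UUID:\s*(GPU-[0-9a-f-]+)", line)
        if uuid_match and current_id is not None:
            mapping[uuid_match.group(1)] = current_id
    return mapping


uuids = allocated_gpu_uuids(NVIDIA_SMI_L)
ids = dcgm_ids_by_uuid(DCGMI_DISCOVERY_L)
print([ids[u] for u in uuids])  # GPU ID(s) to pass to `dcgmi dmon -i`; here [2]
```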

As you can see, dcgmi always shows all GPUs in the node, even the ones that have not been assigned to you. By comparing the UUIDs in both outputs, we see that the GPU that was assigned to us has GPU ID 2 in dcgmi, even though nvidia-smi lists it as GPU 0. Now, we can run the monitoring, e.g.:

$ dcgmi dmon -e 203,204,1001,1002,1003,1004,1005,1009,1010,1011,1012,155 -i 2

Here, the number following -i is our GPU ID. The numbers following -e are the field IDs of the metrics we want to query. The most relevant metrics are probably the profiling metrics (field IDs 1001 and up; see the official dcgmi documentation). Some example output:

#Entity ID  GPUTL      MCUTL      GRACT        SMACT        SMOCC        TENSO        DRAMA        PCITX       PCIRX          NVLTX      NVLRX     POWER W
GPU 0       82         39         0.779        0.562        0.233        0.140        0.296        31122275    148592109      0          0         204.109
GPU 0       82         39         0.797        0.575        0.238        0.143        0.303        24381091    150507324      0          0         204.109
GPU 0       6          3          0.409        0.302        0.128        0.077        0.154        23075223    112425741      0          0         173.644
GPU 0       24         11         0.247        0.191        0.084        0.052        0.091        21383893    114272271      0          0         207.260

For the exact meaning of each of these columns, see the official DCGM documentation. In short, they are:

  • GPUTL: GPU utilization (similar to the 'GPU-Util' value printed by nvidia-smi). Indicates the percentage of time that any kernel is active on the GPU. It does not indicate how much of the GPU that kernel is using.
  • MCUTL: Memory utilization: the percentage of time that device memory was being read or written.
  • GRACT: The fraction of time the graphics or compute engines were active.
  • SMACT: The fraction of time at least one warp was active on a multiprocessor, averaged over all multiprocessors.
  • SMOCC: The fraction of resident warps on a multiprocessor, relative to the maximum number of concurrent warps supported on a multiprocessor.
  • TENSO: Utilization of the tensor cores.
  • DRAMA: The fraction of cycles in which data was sent to or received from device memory.
  • PCITX/PCIRX: PCIe bandwidth transmitted (TX) or received (RX), in bytes per second.
  • NVLTX/NVLRX: NVLink bandwidth transmitted (TX) or received (RX), in bytes per second.
  • POWER: Total power consumption of the GPU, in watts.

Note that the official documentation gives some reasonable guidelines on what 'good' values are for these metrics. The values are quite hard to interpret, as the maximum achievable also depends on the nature and size of the computational problem. Still, a GPUTL under 75%, an SMACT below 0.5 or (in the case of Deep Learning) no Tensor Core usage (TENSO = 0.000) are indications that you're underutilizing the GPU. If your problem size is too small for a full A100, consider using one of our MIG-partitioned A100s.
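
These rules of thumb can also be checked automatically on captured dmon output. The following is a minimal Python sketch, assuming the output was saved as text and that the columns appear in the order of the -e argument used above. Averaging over the samples before comparing against the thresholds is a choice made here for illustration; the function names are hypothetical.

```python
# Sample text pasted from the example output above.
DMON_OUTPUT = """\
#Entity ID  GPUTL  MCUTL  GRACT  SMACT  SMOCC  TENSO  DRAMA  PCITX     PCIRX      NVLTX  NVLRX  POWER W
GPU 0       82     39     0.779  0.562  0.233  0.140  0.296  31122275  148592109  0      0      204.109
GPU 0       82     39     0.797  0.575  0.238  0.143  0.303  24381091  150507324  0      0      204.109
GPU 0       6      3      0.409  0.302  0.128  0.077  0.154  23075223  112425741  0      0      173.644
GPU 0       24     11     0.247  0.191  0.084  0.052  0.091  21383893  114272271  0      0      207.260
"""

# Column names in the order of the field IDs passed to -e (assumed).
COLUMNS = ["GPUTL", "MCUTL", "GRACT", "SMACT", "SMOCC", "TENSO",
           "DRAMA", "PCITX", "PCIRX", "NVLTX", "NVLRX", "POWER"]


def parse_dmon(text, columns=COLUMNS):
    """Turn dmon sample lines into dicts; 'N/A' becomes None."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        values = line.split()[2:]  # drop entity type and entity ID
        if len(values) != len(columns):
            continue  # not a sample line
        rows.append({name: (None if value == "N/A" else float(value))
                     for name, value in zip(columns, values)})
    return rows


def mean(rows, name):
    """Average of one metric over all samples that report a value."""
    values = [row[name] for row in rows if row[name] is not None]
    return sum(values) / len(values) if values else None


def underutilization_hints(rows):
    """Apply the three rules of thumb from the text to the sample averages."""
    hints = []
    gputl, smact, tenso = (mean(rows, n) for n in ("GPUTL", "SMACT", "TENSO"))
    if gputl is not None and gputl < 75:
        hints.append(f"GPUTL averages {gputl:.1f}% (below 75%)")
    if smact is not None and smact < 0.5:
        hints.append(f"SMACT averages {smact:.3f} (below 0.5)")
    if tenso is not None and tenso == 0.0:
        hints.append("TENSO is 0.000 (tensor cores unused)")
    return hints


rows = parse_dmon(DMON_OUTPUT)
for hint in underutilization_hints(rows):
    print(hint)
```

For the four samples shown above, this flags GPUTL and SMACT but not TENSO.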

There are many more metrics that you can query. You can list them using:

$ dcgmi dmon --list

How to use dcgmi dmon to monitor GPU utilization when using MIG instances

The approach is similar to the one above, with two differences:

  • We need to determine which MIG instance we are using, rather than which GPU.
  • Some metrics cannot be measured for MIG instances.

First, we run

$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-82f16363-e6df-5bed-3cef-c1a484831465)
  MIG 3g.20gb     Device  0: (UUID: MIG-5e639bae-ed68-53c8-b15c-0d3e2518ecc5)

to determine the UUID of the GPU we are running on. Then, we run:

$ nvidia-smi
Thu Oct 26 14:32:22 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:31:00.0 Off |                  Off |
| N/A   31C    P0              86W / 400W |   1609MiB / 40960MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    1   0   0  |            1572MiB / 19968MiB  | 42    N/A |  3   0    2    0    0 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    1    0     882588      C   ...on/3.10.4-GCCcore-11.3.0/bin/python     1526MiB |
+---------------------------------------------------------------------------------------+

Here we extract the MIG GI ID (the 'GI ID' column in the bottom left, under 'MIG devices'), which in this case is '1'.

Finally, we run:

$ dcgmi discovery -c
+-------------------+--------------------------------------------------------------------+
| Instance Hierarchy                                                                     |
+===================+====================================================================+
| GPU 0             | GPU GPU-82f16363-e6df-5bed-3cef-c1a484831465 (EntityID: 0)         |
| -> I 0/1          | GPU Instance (EntityID: 0)                                         |
|    -> CI 0/1/0    | Compute Instance (EntityID: 0)                                     |
| -> I 0/2          | GPU Instance (EntityID: 1)                                         |
|    -> CI 0/2/0    | Compute Instance (EntityID: 1)                                     |
+-------------------+--------------------------------------------------------------------+
| GPU 1             | GPU GPU-7f81592f-2b32-bd81-2307-81243a539e04 (EntityID: 1)         |
| -> I 1/1          | GPU Instance (EntityID: 7)                                         |
|    -> CI 1/1/0    | Compute Instance (EntityID: 7)                                     |
| -> I 1/2          | GPU Instance (EntityID: 8)                                         |
|    -> CI 1/2/0    | Compute Instance (EntityID: 8)                                     |
+-------------------+--------------------------------------------------------------------+
| GPU 2             | GPU GPU-7eb5ea70-e6ed-88db-c0b7-08d594288d6b (EntityID: 2)         |
| -> I 2/1          | GPU Instance (EntityID: 14)                                        |
|    -> CI 2/1/0    | Compute Instance (EntityID: 14)                                    |
| -> I 2/2          | GPU Instance (EntityID: 15)                                        |
|    -> CI 2/2/0    | Compute Instance (EntityID: 15)                                    |
+-------------------+--------------------------------------------------------------------+
| GPU 3             | GPU GPU-f69f57d5-0468-5793-f018-89852bd0ae5d (EntityID: 3)         |
| -> I 3/1          | GPU Instance (EntityID: 21)                                        |
|    -> CI 3/1/0    | Compute Instance (EntityID: 21)                                    |
| -> I 3/2          | GPU Instance (EntityID: 22)                                        |
|    -> CI 3/2/0    | Compute Instance (EntityID: 22)                                    |
+-------------------+--------------------------------------------------------------------+

Now, we see that we are running on GPU 0, GPU instance ID 1. We find the GPU by searching for the GPU UUID (GPU-82f16363-e6df-5bed-3cef-c1a484831465), and the instance by searching for the MIG GI ID (1) under that GPU in the left-hand column (the entry 'I 0/1'). The EntityID of this GPU instance is 0.
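
The same lookup can be scripted. Below is a minimal Python sketch, assuming the output of dcgmi discovery -c is available as text (the first two GPU blocks of the listing above are pasted in as sample data) and that the second number in 'I 0/1' is the GI ID reported by nvidia-smi, as described above. The function name is illustrative.

```python
import re

# First two GPU blocks pasted from the `dcgmi discovery -c` listing above.
DISCOVERY_C = """\
| GPU 0             | GPU GPU-82f16363-e6df-5bed-3cef-c1a484831465 (EntityID: 0)         |
| -> I 0/1          | GPU Instance (EntityID: 0)                                         |
|    -> CI 0/1/0    | Compute Instance (EntityID: 0)                                     |
| -> I 0/2          | GPU Instance (EntityID: 1)                                         |
|    -> CI 0/2/0    | Compute Instance (EntityID: 1)                                     |
| GPU 1             | GPU GPU-7f81592f-2b32-bd81-2307-81243a539e04 (EntityID: 1)         |
| -> I 1/1          | GPU Instance (EntityID: 7)                                         |
|    -> CI 1/1/0    | Compute Instance (EntityID: 7)                                     |
| -> I 1/2          | GPU Instance (EntityID: 8)                                         |
|    -> CI 1/2/0    | Compute Instance (EntityID: 8)                                     |
"""

GPU_ROW = re.compile(r"\|\s*GPU (\d+)\s*\|\s*GPU (GPU-[0-9a-f-]+) \(EntityID: \d+\)")
INSTANCE_ROW = re.compile(r"->\s*I (\d+)/(\d+)\s*\|\s*GPU Instance \(EntityID: (\d+)\)")


def gpu_instance_entity_id(discovery_text, gpu_uuid, gi_id):
    """Return the EntityID of GPU instance `gi_id` on the GPU with `gpu_uuid`.

    Returns None when no matching GPU instance is found.
    """
    gpu_index = None  # index of the GPU block we are in, if it is ours
    for line in discovery_text.splitlines():
        gpu = GPU_ROW.search(line)
        if gpu:
            gpu_index = gpu.group(1) if gpu.group(2) == gpu_uuid else None
            continue
        instance = INSTANCE_ROW.search(line)
        if (instance and gpu_index is not None
                and instance.group(1) == gpu_index
                and int(instance.group(2)) == gi_id):
            return int(instance.group(3))
    return None


entity_id = gpu_instance_entity_id(
    DISCOVERY_C, "GPU-82f16363-e6df-5bed-3cef-c1a484831465", gi_id=1)
print(f"-i i:{entity_id}")  # argument for dcgmi dmon
```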

We can query this particular MIG instance by passing i:EntityID to the -i argument instead of the GPU ID, where EntityID is that of the GPU instance printed by dcgmi discovery -c (WARNING: please note that the EntityIDs are not consecutive across GPUs: the first instance on GPU 1 has EntityID 7):

$ dcgmi dmon -e 203,204,1001,1002,1003,1004,1005,1009,1010,1011,1012,155 -i i:0

#Entity ID  GPUTL      MCUTL      GRACT        SMACT        SMOCC        TENSO        DRAMA        PCITX       PCIRX          NVLTX      NVLRX     POWER W
GPU-I 0     N/A        N/A        0.978        0.734        0.327        0.130        0.371        N/A         N/A            N/A        N/A       138.425
GPU-I 0     N/A        N/A        0.978        0.735        0.328        0.130        0.372        N/A         N/A            N/A        N/A       82.189
GPU-I 0     N/A        N/A        0.527        0.395        0.175        0.071        0.199        N/A         N/A            N/A        N/A       145.837
GPU-I 0     N/A        N/A        0.947        0.711        0.317        0.126        0.360        N/A         N/A            N/A        N/A       144.452

This is the same use case as before, but running on a 3g.20gb MIG instance (3/7 of an A100 in terms of compute). As you can see, some metrics report N/A because they cannot be measured for MIG instances. What we can clearly see, however, is that running this code on a MIG instance instead of a full A100 results in much higher SMACT, SMOCC and TENSO usage. We've also timed this code, and it ran 30% slower (on 50% less hardware). Thus, if efficiency to solution is more important than time to solution, this example is better off running on a MIG instance.
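
The trade-off between time to solution and efficiency can be made concrete with a little arithmetic. The sketch below assumes that the cost of a run scales with its runtime multiplied by the share of the GPU it occupies. That cost model and the function name are assumptions made for illustration, not part of the measurements.

```python
def relative_cost(slowdown, hardware_share):
    """Cost of a run relative to the full-GPU run (1.0 means equally expensive)."""
    return slowdown * hardware_share


# Figures quoted in the text: 30% slower, on 50% less hardware.
cost = relative_cost(slowdown=1.30, hardware_share=0.50)
print(f"relative cost: {cost:.2f}")  # values below 1.0 favour the MIG instance
```

With the quoted figures the MIG run comes out at about 0.65 of the cost of the full-GPU run, at the price of a longer wait for the result.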

A full list of field IDs is available with dcgmi dmon --list:

$ dcgmi dmon --list
________________________________________________________________________________
Long Name                                         Short Name          Field ID  
________________________________________________________________________________
driver_version                                     DRVER               1
nvml_version                                       NVVER               2
process_name                                       PRNAM               3
device_count                                       DVCNT               4
cuda_driver_version                                CDVER               5
name                                               DVNAM               50
brand                                              DVBRN               51
nvml_index                                         NVIDX               52
serial_number                                      SRNUM               53
uuid                                               UUID#               54
minor_number                                       MNNUM               55
oem_inforom_version                                OEMVR               56
pci_busid                                          PCBID               57
pci_combined_id                                    PCCID               58
pci_subsys_id                                      PCSID               59
system_topology_pci                                STVCI               60
system_topology_nvlink                             STNVL               61
system_affinity                                    SYSAF               62
cuda_compute_capability                            DVCCC               63
p2p_nvlink_status                                  P2PNS               64
compute_mode                                       CMMOD               65
persistance_mode                                   PMMOD               66
mig_mode                                           MGMOD               67
cuda_visible_devices                               CUVID               68
mig_max_slices                                     MIGMS               69
cpu_affinity_0                                     CAFF0               70
cpu_affinity_1                                     CAFF1               71
cpu_affinity_2                                     CAFF2               72
cpu_affinity_3                                     CAFF3               73
cc_mode                                            CCMOD               74
mig_attributes                                     MIGATT              75
mig_gi_info                                        MIGGIINFO           76
mig_ci_info                                        MIGCIINFO           77
ecc_inforom_version                                EIVER               80
power_inforom_version                              PIVER               81
inforom_image_version                              IIVER               82
inforom_config_checksum                            CCSUM               83
inforom_config_valid                               ICVLD               84
vbios_version                                      VBVER               85
mem_affinity_0                                     MAFF0               86
mem_affinity_1                                     MAFF1               87
mem_affinity_2                                     MAFF2               88
mem_affinity_3                                     MAFF3               89
bar1_total                                         B1TTL               90
sync_boost                                         SYBST               91
bar1_used                                          B1USE               92
bar1_free                                          B1FRE               93
gpm_support                                        GPMSPT              94
sm_clock                                           SMCLK               100
memory_clock                                       MMCLK               101
video_clock                                        VICLK               102
sm_app_clock                                       SACLK               110
mem_app_clock                                      MACLK               111
current_clocks_event_reasons                       DVCCTR              112
sm_max_clock                                       SMMAX               113
memory_max_clock                                   MMMAX               114
video_max_clock                                    VIMAX               115
autoboost                                          ATBST               120
supported_clocks                                   SPCLK               130
memory_temp                                        MMTMP               140
gpu_temp                                           TMPTR               150
gpu_mem_max_op_temp                                GMMOT               151
gpu_max_op_temp                                    GGMOT               152
gpu_temp_tlimit                                    GTLIMIT             153
power_usage                                        POWER               155
total_energy_consumption                           TOTEC               156
power_usage_instant                                POWINST             157
slowdown_temp                                      SDTMP               158
shutdown_temp                                      SHTMP               159
power_management_limit                             PMLMT               160
power_management_limit_min                         PMMIN               161
power_management_limit_max                         PMMAX               162
power_management_limit_default                     PMDEF               163
enforced_power_limit                               EPLMT               164
req_power_prof                                     RPPRM               165
enf_power_prof                                     EPPRM               166
val_power_prof                                     VPPRM               167
fabric_manager_status                              FMSTA               170
fabric_manager_failure_code                        FMFRC               171
fabric_cluster_uuid                                FCUID               172
fabric_clique_id                                   FMCID               173
pstate                                             PSTAT               190
fan_speed                                          FANSP               191
pcie_tx_throughput                                 TXTPT               200
pcie_rx_throughput                                 RXTPT               201
pcie_replay_counter                                RPCTR               202
gpu_utilization                                    GPUTL               203
mem_copy_utilization                               MCUTL               204
accounting_data                                    ACCDT               205
enc_utilization                                    ECUTL               206
dec_utilization                                    DCUTL               207
mem_util_samples                                   MUSAM               210
gpu_util_samples                                   GUSAM               211
graphics_pids                                      GPIDS               220
compute_pids                                       CMPID               221
xid_errors                                         XIDER               230
pcie_max_link_gen                                  PCIMG               235
pcie_max_link_width                                PCIMW               236
pcie_link_gen                                      PCILG               237
pcie_link_width                                    PCILW               238
power_violation                                    PVIOL               240
thermal_violation                                  TVIOL               241
sync_boost_violation                               SBVIO               242
board_limit_violation                              BLVIO               243
low_util_violation                                 LUVIO               244
reliability_violation                              RVIOL               245
app_clock_violation                                TAPCV               246
base_clock_violation                               TAPBC               247
fb_total                                           FBTTL               250
fb_free                                            FBFRE               251
fb_used                                            FBUSD               252
fb_resv                                            FBRSV               253
fb_USDP                                            FBUSP               254
c2c_link_count                                     C2CLC               285
c2c_link_status                                    C2CST               286
c2c_max_bandwidth                                  C2CMAXBW            287
ecc                                                ECCUR               300
ecc_pending                                        ECPEN               301
ecc_sbe_volatile_total                             ESVTL               310
ecc_dbe_volatile_total                             EDVTL               311
ecc_sbe_aggregate_total                            ESATL               312
ecc_dbe_aggregate_total                            EDATL               313
ecc_sbe_volatile_l1                                ESVL1               314
ecc_dbe_volatile_l1                                EDVL1               315
ecc_sbe_volatile_l2                                ESVL2               316
ecc_dbe_volatile_l2                                EDVL2               317
ecc_sbe_volatile_device                            ESVDV               318
ecc_dbe_volatile_device                            EDVDV               319
ecc_sbe_volatile_register                          ESVRG               320
ecc_dbe_volatile_register                          EDVRG               321
ecc_sbe_volatile_texture                           ESVTX               322
ecc_dbe_volatile_texture                           EDVTX               323
ecc_sbe_aggregate_l1                               ESAL1               324
ecc_dbe_aggregate_l1                               EDAL1               325
ecc_sbe_aggregate_l2                               ESAL2               326
ecc_dbe_aggregate_l2                               EDAL2               327
ecc_sbe_aggregate_device                           ESADV               328
ecc_dbe_aggregate_device                           EDADV               329
ecc_sbe_aggregate_register                         ESARG               330
ecc_dbe_aggregate_register                         EDARG               331
ecc_sbe_aggregate_texture                          ESATX               332
ecc_dbe_aggregate_texture                          EDATX               333
ecc_sbe_volatile_shared                            ESVSHM              334
ecc_dbe_volatile_shared                            EDVSHM              335
ecc_sbe_volatile_cbu                               ESVCBU              336
ecc_dbe_volatile_cbu                               EDVCBU              337
ecc_sbe_aggregate_shared                           ESDSHM              338
ecc_dbe_aggregate_shared                           EDDSHM              339
ecc_sbe_aggregate_cbu                              ESDCBU              340
ecc_dbe_aggregate_cbu                              EDDCBU              341
ecc_sbe_volatile_sram                              ESVSRM              342
ecc_dbe_volatile_sram                              EDVSRM              343
ecc_sbe_aggregate_sram                             ESDSRM              344
ecc_dbe_aggregate_sram                             EDDSRM              345
ecc_threshold_sram                                 ECCTHRSRM           346
gpu_memory_test_result                             MEMRES              350
diagnostics_test_result                            DIARES              351
pcie_test_result                                   PCIRES              352
targeted_stress_test_result                        STRRES              353
targeted_power_test_result                         POWRES              354
memory_bandwidth_test_result                       MBWRES              355
memory_test_result                                 MEMRES              356
pulse_test_result                                  PLSRES              357
eud_test_result                                    EUDRES              358
cpu_eud_test_result                                CPUEUDRES           359
software_test_result                               SWRES               360
nvbandwidth_test_result                            NVBRES              361
diag_status                                        DIAGSTATU           362
remap_rows_avail_max                               RRAM                385
remap_rows_avail_high                              RRAH                386
remap_rows_avail_partial                           RRAP                387
remap_rows_avail_low                               RRAL                388
remap_rows_avail_none                              RRAN                389
retired_pages_sbe                                  RPSBE               390
retired_pages_dbe                                  RPDBE               391
retired_pages_pending                              RPPEN               392
uncorrectable_remapped_rows                        URMPS               393
correctable_remapped_rows                          CRMPS               394
row_remap_failure                                  RRF                 395
row_remap_pending                                  RRP                 396
nvlink_flit_crc_error_count_l0                     NFEL0               400
nvlink_flit_crc_error_count_l1                     NFEL1               401
nvlink_flit_crc_error_count_l2                     NFEL2               402
nvlink_flit_crc_error_count_l3                     NFEL3               403
nvlink_flit_crc_error_count_l4                     NFEL4               404
nvlink_flit_crc_error_count_l5                     NFEL5               405
nvlink_flit_crc_error_count_l12                    NFEL12              406
nvlink_flit_crc_error_count_l13                    NFEL13              407
nvlink_flit_crc_error_count_l14                    NFEL14              408
nvlink_flit_crc_error_count_total                  NFELT               409
nvlink_data_crc_error_count_l0                     NDEL0               410
nvlink_data_crc_error_count_l1                     NDEL1               411
nvlink_data_crc_error_count_l2                     NDEL2               412
nvlink_data_crc_error_count_l3                     NDEL3               413
nvlink_data_crc_error_count_l4                     NDEL4               414
nvlink_data_crc_error_count_l5                     NDEL5               415
nvlink_data_crc_error_count_l12                    NDEL12              416
nvlink_data_crc_error_count_l13                    NDEL13              417
nvlink_data_crc_error_count_l14                    NDEL14              418
nvlink_data_crc_error_count_total                  NDELT               419
nvlink_replay_error_count_l0                       NREL0               420
nvlink_replay_error_count_l1                       NREL1               421
nvlink_replay_error_count_l2                       NREL2               422
nvlink_replay_error_count_l3                       NREL3               423
nvlink_replay_error_count_l4                       NREL4               424
nvlink_replay_error_count_l5                       NREL5               425
nvlink_replay_error_count_l12                      NREL12              426
nvlink_replay_error_count_l13                      NREL13              427
nvlink_replay_error_count_l14                      NREL14              428
nvlink_replay_error_count_total                    NRELT               429
nvlink_recovery_error_count_l0                     NRCL0               430
nvlink_recovery_error_count_l1                     NRCL1               431
nvlink_recovery_error_count_l2                     NRCL2               432
nvlink_recovery_error_count_l3                     NRCL3               433
nvlink_recovery_error_count_l4                     NRCL4               434
nvlink_recovery_error_count_l5                     NRCL5               435
nvlink_recovery_error_count_l12                    NRCL12              436
nvlink_recovery_error_count_l13                    NRCL13              437
nvlink_recovery_error_count_l14                    NRCL14              438
nvlink_recovery_error_count_total                  NRCLT               439
nvlink_bandwidth_l0                                NBWL0               440
nvlink_bandwidth_l1                                NBWL1               441
nvlink_bandwidth_l2                                NBWL2               442
nvlink_bandwidth_l3                                NBWL3               443
nvlink_bandwidth_l4                                NBWL4               444
nvlink_bandwidth_l5                                NBWL5               445
nvlink_bandwidth_l12                               NBWL12              446
nvlink_bandwidth_l13                               NBWL13              447
nvlink_bandwidth_l14                               NBWL14              448
nvlink_bandwidth_total                             NBWLT               449
gpu_nvlink_errors                                  GNVERR              450
nvlink_flit_crc_error_count_l6                     NFEL6               451
nvlink_flit_crc_error_count_l7                     NFEL7               452
nvlink_flit_crc_error_count_l8                     NFEL8               453
nvlink_flit_crc_error_count_l9                     NFEL9               454
nvlink_flit_crc_error_count_l10                    NFEL10              455
nvlink_flit_crc_error_count_l11                    NFEL11              456
nvlink_data_crc_error_count_l6                     NDEL6               457
nvlink_data_crc_error_count_l7                     NDEL7               458
nvlink_data_crc_error_count_l8                     NDEL8               459
nvlink_data_crc_error_count_l9                     NDEL9               460
nvlink_data_crc_error_count_l10                    NDEL10              461
nvlink_data_crc_error_count_l11                    NDEL11              462
nvlink_replay_error_count_l6                       NREL6               463
nvlink_replay_error_count_l7                       NREL7               464
nvlink_replay_error_count_l8                       NREL8               465
nvlink_replay_error_count_l9                       NREL9               466
nvlink_replay_error_count_l10                      NREL10              467
nvlink_replay_error_count_l11                      NREL11              468
nvlink_recovery_error_count_l6                     NRCL6               469
nvlink_recovery_error_count_l7                     NRCL7               470
nvlink_recovery_error_count_l8                     NRCL8               471
nvlink_recovery_error_count_l9                     NRCL9               472
nvlink_recovery_error_count_l10                    NRCL10              473
nvlink_recovery_error_count_l11                    NRCL11              474
nvlink_bandwidth_l6                                NBWL6               475
nvlink_bandwidth_l7                                NBWL7               476
nvlink_bandwidth_l8                                NBWL8               477
nvlink_bandwidth_l9                                NBWL9               478
nvlink_bandwidth_l10                               NBWL10              479
nvlink_bandwidth_l11                               NBWL11              480
nvlink_flit_crc_error_count_l15                    NFEL15              481
nvlink_flit_crc_error_count_l16                    NFEL16              482
nvlink_flit_crc_error_count_l17                    NFEL17              483
nvlink_data_crc_error_count_l15                    NDEL15              484
nvlink_data_crc_error_count_l16                    NDEL16              485
nvlink_data_crc_error_count_l17                    NDEL17              486
nvlink_replay_error_count_l15                      NREL15              487
nvlink_replay_error_count_l16                      NREL16              488
nvlink_replay_error_count_l17                      NREL17              489
nvlink_recovery_error_count_l15                    NRCL15              491
nvlink_recovery_error_count_l16                    NRCL16              492
nvlink_recovery_error_count_l17                    NRCL17              493
nvlink_bandwidth_l15                               NBWL15              494
nvlink_bandwidth_l16                               NBWL16              495
nvlink_bandwidth_l17                               NBWL17              496
nvlink_crc_err                                     NLCRC               497
nvlink_recovery_err                                NLREC               498
nvlink_replay_err                                  NLREP               499
virtualization_mode                                VMODE               500
supported_type_info                                SPINF               501
creatable_vgpu_type_ids                            CGPID               502
active_vgpu_instance_ids                           VGIID               503
vgpu_instance_utilizations                         VIUTL               504
vgpu_instance_per_process_utilization              VIPPU               505
enc_stats                                          ENSTA               506
fbc_stats                                          FBCSTA              507
fbc_sessions_info                                  FBCINF              508
vgpu_type_ids                                      VTID                509
vgpu_type_info                                     VTPINF              510
vgpu_type_name                                     VTPNM               511
vgpu_type_class                                    VTPCLS              512
vgpu_type_license                                  VTPLC               513
vgpu_instance_vm_id                                VVMID               520
vgpu_instance_vm_name                              VMNAM               521
vgpu_instance_type                                 VITYP               522
vgpu_instance_uuid                                 VUUID               523
vgpu_instance_driver_version                       VDVER               524
vgpu_instance_memory_usage                         VMUSG               525
vgpu_instance_license_status                       VLCST               526
vgpu_instance_frame_rate_limit                     VFLIM               527
vgpu_instance_enc_stats                            VSTAT               528
vgpu_instance_enc_sessions_info                    VSINF               529
vgpu_instance_fbc_stats                            VFSTAT              530
vgpu_instance_fbc_sessions_info                    VFINF               531
vgpu_instance_license_state                        VLCIST              532
vgpu_instance_pci_id                               VPCIID              533
vgpu_instance_gpu_instance_id                      VGII                534
infiniband_guid                                    IBGUID              571
chassis_serial                                     CSERIAL             572
chassis_slot_number                                CSLOTNUM            573
tray_index                                         CTRAYIDX            574
host_id                                            HOSTID              575
peer_type                                          PEERTYPE            576
module_id                                          MODULEID            577
nvlink_pprm_oper_recovery                          NLPRMOPRE           580
nvlink_ppcnt_recovery_time_since_last              NLRECLAST           581
nvlink_ppcnt_recovery_time_between_last_two        NLRECBTWN           582
nvlink_ppcnt_recovery_total_successful_events      NLRECOVER           583
nvlink_ppcnt_physical_successful_recovery_event    NLPHYREC            584
nvlink_ppcnt_physical_link_down_counter            NLLNKDOWN           585
nvlink_ppcnt_plr_rcv_codes                         NLPLRRXC            586
nvlink_ppcnt_plr_rcv_code_err                      NLPLRRXCE           587
nvlink_ppcnt_plr_rcv_uncorrectable_code            NLPLRRXCU           588
nvlink_ppcnt_plr_xmit_codes                        NLPLRTXC            589
nvlink_ppcnt_plr_xmit_retry_codes                  NLPLRTXRC           590
nvlink_ppcnt_plr_xmit_retry_events                 NLPLRTXRE           591
nvlink_ppcnt_plr_sync_events                       NLPLRSYNC           592
nvswitch_voltage_mvolt                             SWVOLT              701
nvswitch_current_iddq                              SWCUR               702
nvswitch_current_iddq_rev                          SCIDDQ              703
nvswitch_current_iddq_dvdd                         SCDVDD              704
nvswitch_power_vdd                                 SWPOWV              705
nvswitch_power_dvdd                                SWPOWD              706
nvswitch_power_hvdd                                SWPOWH              707
nvswitch_link_bandwidth_tx                         SWLNKTX             780
nvswitch_link_bandwidth_rx                         SWLNKRX             781
nvswitch_link_fatal_errors                         SWLNKFE             782
nvswitch_link_non_fatal_errors                     SWLNKNF             783
nvswitch_link_replay_errors                        SWLNKRP             784
nvswitch_link_recovery_errors                      SWLNKRC             785
nvswitch_link_flit_errors                          SWLNKFL             786
nvswitch_link_crc_errors                           SWLNKCR             787
nvswitch_link_ecc_errors                           SWLNKEC             788
nvswitch_link_latency_low_vc0                      SWVCLL0             789
nvswitch_link_latency_low_vc1                      SWVCLL1             790
nvswitch_link_latency_low_vc2                      SWVCLL2             791
nvswitch_link_latency_low_vc3                      SWVCLL3             792
nvswitch_link_latency_medium_vc0                   SWVCLM0             793
nvswitch_link_latency_medium_vc1                   SWVCLM1             794
nvswitch_link_latency_medium_vc2                   SWVCLM2             795
nvswitch_link_latency_medium_vc3                   SWVCLM3             796
nvswitch_link_latency_high_vc0                     SWVCLH0             797
nvswitch_link_latency_high_vc1                     SWVCLH1             798
nvswitch_link_latency_high_vc2                     SWVCLH2             799
nvswitch_link_latency_high_vc3                     SWVCLH3             800
nvswitch_link_latency_panic_vc0                    SWVCLP0             801
nvswitch_link_latency_panic_vc1                    SWVCLP1             802
nvswitch_link_latency_panic_vc2                    SWVCLP2             803
nvswitch_link_latency_panic_vc3                    SWVCLP3             804
nvswitch_link_latency_count_vc0                    SWVCLC0             805
nvswitch_link_latency_count_vc1                    SWVCLC1             806
nvswitch_link_latency_count_vc2                    SWVCLC2             807
nvswitch_link_latency_count_vc3                    SWVCLC3             808
nvswitch_link_crc_errors_lane0                     SWLACR0             809
nvswitch_link_crc_errors_lane1                     SWLACR1             810
nvswitch_link_crc_errors_lane2                     SWLACR2             811
nvswitch_link_crc_errors_lane3                     SWLACR3             812
nvswitch_link_ecc_errors_lane0                     SWLAEC0             813
nvswitch_link_ecc_errors_lane1                     SWLAEC1             814
nvswitch_link_ecc_errors_lane2                     SWLAEC2             815
nvswitch_link_ecc_errors_lane3                     SWLAEC3             816
nvswitch_link_crc_errors_lane4                     SWLACR4             817
nvswitch_link_crc_errors_lane5                     SWLACR5             818
nvswitch_link_crc_errors_lane6                     SWLACR6             819
nvswitch_link_crc_errors_lane7                     SWLACR7             820
nvswitch_link_ecc_errors_lane4                     SWLAEC4             821
nvswitch_link_ecc_errors_lane5                     SWLAEC5             822
nvswitch_link_ecc_errors_lane6                     SWLAEC6             823
nvswitch_link_ecc_errors_lane7                     SWLAEC7             824
nvlink_tx_bandwidth_link0                          NVLTXB0             825
nvlink_tx_bandwidth_link1                          NVLTXB1             826
nvlink_tx_bandwidth_link2                          NVLTXB2             827
nvlink_tx_bandwidth_link3                          NVLTXB3             828
nvlink_tx_bandwidth_link4                          NVLTXB4             829
nvlink_tx_bandwidth_link5                          NVLTXB5             830
nvlink_tx_bandwidth_link6                          NVLTXB6             831
nvlink_tx_bandwidth_link7                          NVLTXB7             832
nvlink_tx_bandwidth_link8                          NVLTXB8             833
nvlink_tx_bandwidth_link9                          NVLTXB9             834
nvlink_tx_bandwidth_link10                         NVLTXB10            835
nvlink_tx_bandwidth_link11                         NVLTXB11            836
nvlink_tx_bandwidth_link12                         NVLTXB12            837
nvlink_tx_bandwidth_link13                         NVLTXB13            838
nvlink_tx_bandwidth_link14                         NVLTXB14            839
nvlink_tx_bandwidth_link15                         NVLTXB15            840
nvlink_tx_bandwidth_link16                         NVLTXB16            841
nvlink_tx_bandwidth_link17                         NVLTXB17            842
nvlink_tx_bandwidth_total                          NTXBWLT             843
nvswitch_fatal_error                               SEN00               856
nvswitch_non_fatal_error                           SEN01               857
nvswitch_current_temperature                       TMP01               858
nvswitch_slowdown_temperature                      TMP02               859
nvswitch_shutdown_temperature                      TMP03               860
nvswitch_bandwidth_tx                              SWTX                861
nvswitch_bandwidth_rx                              SWRX                862
nvswitch_physical_id                               SWPHID              863
nvswitch_reset_required                            SWFRMVER            864
nvlink_id                                          LNKID               865
nvswitch_pcie_dom                                  SWPCIEDOM           866
nvswitch_pcie_bus                                  SWPCIEBUS           867
nvswitch_pcie_dev                                  SWPCIEDEV           868
nvswitch_pcie_fun                                  SWPCIEFUN           869
nvswitch_nvlink_status                             SWNVLNKST           870
nvswitch_nvlink_dev_type                           SWNVLNKDT           871
link_pcie_remote_dom                               LNKDOM              872
link_pcie_remote_bus                               LNKBUS              873
link_pcie_remote_dev                               LNKDEV              874
link_pcie_remote_func                              LNKFNC              875
link_dev_link_id                                   SWNVLNKID           876
link_dev_link_sid                                  SWNVLNSID           877
link_dev_uuid                                      SWNVDVUID           878
nvlink_rx_bandwidth_link0                          NVLRXB0             879
nvlink_rx_bandwidth_link1                          NVLRXB1             880
nvlink_rx_bandwidth_link2                          NVLRXB2             881
nvlink_rx_bandwidth_link3                          NVLRXB3             882
nvlink_rx_bandwidth_link4                          NVLRXB4             883
nvlink_rx_bandwidth_link5                          NVLRXB5             884
nvlink_rx_bandwidth_link6                          NVLRXB6             885
nvlink_rx_bandwidth_link7                          NVLRXB7             886
nvlink_rx_bandwidth_link8                          NVLRXB8             887
nvlink_rx_bandwidth_link9                          NVLRXB9             888
nvlink_rx_bandwidth_link10                         NVLRXB10            889
nvlink_rx_bandwidth_link11                         NVLRXB11            890
nvlink_rx_bandwidth_link12                         NVLRXB12            891
nvlink_rx_bandwidth_link13                         NVLRXB13            892
nvlink_rx_bandwidth_link14                         NVLRXB14            893
nvlink_rx_bandwidth_link15                         NVLRXB15            894
nvlink_rx_bandwidth_link16                         NVLRXB16            895
nvlink_rx_bandwidth_link17                         NVLRXB17            896
nvlink_rx_bandwidth_total                          NRXBWLT             897
gr_engine_active                                   GRACT               1001
sm_active                                          SMACT               1002
sm_occupancy                                       SMOCC               1003
tensor_active                                      TENSO               1004
dram_active                                        DRAMA               1005
fp64_active                                        FP64A               1006
fp32_active                                        FP32A               1007
fp16_active                                        FP16A               1008
pcie_tx_bytes                                      PCITX               1009
pcie_rx_bytes                                      PCIRX               1010
nvlink_tx_bytes                                    NVLTX               1011
nvlink_rx_bytes                                    NVLRX               1012
tensor_imma_active                                 TIMMA               1013
tensor_hmma_active                                 THMMA               1014
tensor_dfma_active                                 TDFMA               1015
integer_active                                     INTAC               1016
nvdec0_active                                      NVDEC0              1017
nvdec1_active                                      NVDEC1              1018
nvdec2_active                                      NVDEC2              1019
nvdec3_active                                      NVDEC3              1020
nvdec4_active                                      NVDEC4              1021
nvdec5_active                                      NVDEC5              1022
nvdec6_active                                      NVDEC6              1023
nvdec7_active                                      NVDEC7              1024
nvjpg0_active                                      NVJPG0              1025
nvjpg1_active                                      NVJPG1              1026
nvjpg2_active                                      NVJPG2              1027
nvjpg3_active                                      NVJPG3              1028
nvjpg4_active                                      NVJPG4              1029
nvjpg5_active                                      NVJPG5              1030
nvjpg6_active                                      NVJPG6              1031
nvjpg7_active                                      NVJPG7              1032
nvofa0_active                                      NVOFA0              1033
nvofa1_active                                      NVOFA1              1034
nvlink_l0_tx_bytes                                 NVL0T               1040
nvlink_l0_rx_bytes                                 NVL0R               1041
nvlink_l1_tx_bytes                                 NVL1T               1042
nvlink_l1_rx_bytes                                 NVL1R               1043
nvlink_l2_tx_bytes                                 NVL2T               1044
nvlink_l2_rx_bytes                                 NVL2R               1045
nvlink_l3_tx_bytes                                 NVL3T               1046
nvlink_l3_rx_bytes                                 NVL3R               1047
nvlink_l4_tx_bytes                                 NVL4T               1048
nvlink_l4_rx_bytes                                 NVL4R               1049
nvlink_l5_tx_bytes                                 NVL5T               1050
nvlink_l5_rx_bytes                                 NVL5R               1051
nvlink_l6_tx_bytes                                 NVL6T               1052
nvlink_l6_rx_bytes                                 NVL6R               1053
nvlink_l7_tx_bytes                                 NVL7T               1054
nvlink_l7_rx_bytes                                 NVL7R               1055
nvlink_l8_tx_bytes                                 NVL8T               1056
nvlink_l8_rx_bytes                                 NVL8R               1057
nvlink_l9_tx_bytes                                 NVL9T               1058
nvlink_l9_rx_bytes                                 NVL9R               1059
nvlink_l10_tx_bytes                                NVL10T              1060
nvlink_l10_rx_bytes                                NVL10R              1061
nvlink_l11_tx_bytes                                NVL11T              1062
nvlink_l11_rx_bytes                                NVL11R              1063
nvlink_l12_tx_bytes                                NVL12T              1064
nvlink_l12_rx_bytes                                NVL12R              1065
nvlink_l13_tx_bytes                                NVL13T              1066
nvlink_l13_rx_bytes                                NVL13R              1067
nvlink_l14_tx_bytes                                NVL14T              1068
nvlink_l14_rx_bytes                                NVL14R              1069
nvlink_l15_tx_bytes                                NVL15T              1070
nvlink_l15_rx_bytes                                NVL15R              1071
nvlink_l16_tx_bytes                                NVL16T              1072
nvlink_l16_rx_bytes                                NVL16R              1073
nvlink_l17_tx_bytes                                NVL17T              1074
nvlink_l17_rx_bytes                                NVL17R              1075
c2c_tx_all_bytes                                   C2CTXAB             1076
c2c_tx_data_bytes                                  C2CTXDB             1077
c2c_rx_all_bytes                                   C2CRXAB             1078
c2c_rx_data_bytes                                  C2CRXDB             1079
hostmem_cache_hit                                  HMCACHEHT           1080
hostmem_cache_miss                                 HMCACHEMS           1081
peermem_cache_hit                                  PMCACHEHT           1082
peermem_cache_miss                                 PMCACHEMS           1083
cpu_utilization_total                              CPUUT               1100
cpu_utilization_user                               CPUUU               1101
cpu_utilization_nice                               CPUUN               1102
cpu_utilization_sys                                CPUUS               1103
cpu_utilization_irq                                CPUUI               1104
cpu_temp                                           CPUTP               1110
cpu_temp_warn                                      CPUTW               1111
cpu_temp_crit                                      CPUTC               1112
cpu_clock                                          CPUCL               1120
cpu_power_utilization                              CPUPU               1130
cpu_power_limit                                    CPUPL               1131
sysio_power_utilization                            SIOPU               1132
module_power_utilization                           MODPU               1133
cpu_vendor_name                                    CPUVN               1140
cpu_model_name                                     CPUMN               1141
nvlink_xmit_packets                                NLXMITP             1200
nvlink_xmit_bytes                                  NLXMITB             1201
nvlink_rcv_packets                                 NLRCVP              1202
nvlink_rcv_bytes                                   NLRCVB              1203
nvlink_malformed_packets_err                       NLMALP              1204
nvlink_buffer_overrun_err                          NLBUFO              1205
nvlink_rcv_err                                     NLRCVE              1206
nvlink_rcv_remote_err                              NLRCVRE             1207
nvlink_rcv_generate_err                            NLRCVGE             1208
nvlink_rcv_local_link_integrity_err                NLLLIE              1209
nvlink_xmit_discards                               NLXMITD             1210
nvlink_link_recovery_successful                    NLLRS               1211
nvlink_link_recovery_failed                        NLLRF               1212
nvlink_link_recovery                               NLLR                1213
nvlink_symbol_err                                  NLSYME              1214
nvlink_symbol_ber                                  NLSYMB              1215
nvlink_symbol_ber_float                            NLSYMBF             1216
nvlink_effective_ber                               NLEFB               1217
nvlink_effective_ber_float                         NLEFBF              1218
nvlink_effective_errors                            NLEFE               1219
cx_health                                          CXHEALTH            1300
cx_active_pcie_link_width                          CXAPLKWD            1301
cx_active_pcie_link_speed                          CXAPLKSP            1302
cx_expect_pcie_link_width                          CXEPLKWD            1303
cx_expect_pcie_link_speed                          CXEPLKSP            1304
cx_correctable_err_status                          CXCERRST            1305
cx_correctable_err_mask                            CXCERRMK            1306
cx_uncorrectable_err_status                        CXUCERRST           1307
cx_uncorrectable_err_mask                          CXUCERRMK           1308
cx_uncorrectable_err_severity                      CXUCERRSE           1309
cx_dev_temp                                        CXTEMP              1310
c2c_link_error_intr                                C2CLNKINT           1400
c2c_link_error_replay                              C2CLNKERR           1401
c2c_link_error_replay_b2b                          C2CLNKERR           1402
c2c_link_power_state                               C2CLNKPWR           1403
nvlink_fec_history_count0                          NVFECHST0           1404
nvlink_fec_history_count1                          NVFECHST1           1405
nvlink_fec_history_count2                          NVFECHST2           1406
nvlink_fec_history_count3                          NVFECHST3           1407
nvlink_fec_history_count4                          NVFECHST4           1408
nvlink_fec_history_count5                          NVFECHST5           1409
nvlink_fec_history_count6                          NVFECHST6           1410
nvlink_fec_history_count7                          NVFECHST7           1411
nvlink_fec_history_count8                          NVFECHST8           1412
nvlink_fec_history_count9                          NVFECHST9           1413
nvlink_fec_history_count10                         NVFECHST1           1414
nvlink_fec_history_count11                         NVFECHST1           1415
nvlink_fec_history_count12                         NVFECHST1           1416
nvlink_fec_history_count13                         NVFECHST1           1417
nvlink_fec_history_count14                         NVFECHST1           1418
nvlink_fec_history_count15                         NVFECHST1           1419
clocks_event_power_cap_ns                          CLKPCNS             1420
clocks_event_boost_ns                              CLKBSTNS            1421
clocks_event_sw_thermal_slowdown_ns                CLKSWTHNS           1422
clocks_event_hw_thermal_slowdown_ns                CLKHWTHNS           1423
clocks_event_power_brake_slowdown_ns               CLKEPBSNS           1424
power_smoothing_enabled                            PWRSMEN             1425
power_smoothing_priv_level                         PWRSMPRLV           1426
power_smoothing_imm_ramp_down_enabled              PWRSMIRD            1427
power_smoothing_applied_tmp_ceil                   PWRSMCL             1428
power_smoothing_applied_tmp_floor                  PWRSMTFL            1429
power_smoothing_max_percent_tmp_floor_setting      PSMMINFLR           1430
power_smoothing_min_percent_tmp_floor_setting      PSMMINFLR           1431
power_smoothing_hw_circuitry_percent_lifetime_r    PSMHCPLR            1432
power_smoothing_max_num_preset_profiles            PSMMAXNP            1433
power_smoothing_profile_percent_tmp_floor          PSMPTFLR            1434
power_smoothing_profile_ramp_up_rate               PSMRURAT            1435
power_smoothing_profile_ramp_down_rate             PSMRDRAT            1436
power_smoothing_profile_ramp_down_hyst_val         PSMRDHV             1437
power_smoothing_active_preset_profile              PSMAPRPR            1438
power_smoothing_admin_override_percent_tmp_floo    PSMAOFLR            1439
power_smoothing_admin_override_ramp_up_rate        PSMAORUR            1440
power_smoothing_admin_override_ramp_down_rate      PSMAORDR            1441
power_smoothing_admin_override_ramp_down_hyst_v    PSMAORDH            1442
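
The numeric field IDs in the last column are what dcgmi dmon expects in its -e option. As a minimal sketch, the snippet below looks up a few IDs by their long names and joins them with commas. The names field_list, lookup_field_id and field_ids are illustrative only, and the three sample rows are copied from the listing above; in practice you would presumably feed in the full listing (assumed here to come from dcgmi dmon -l) instead.

```shell
# Sample rows copied from the listing above: long name, short name, field ID.
field_list='sm_active                                          SMACT               1002
sm_occupancy                                       SMOCC               1003
dram_active                                        DRAMA               1005'

# Print the field ID for an exact long name; prints nothing if it is not listed.
lookup_field_id() {
    printf '%s\n' "$field_list" | awk -v name="$1" '$1 == name { print $3 }'
}

# Collect the IDs of the chosen fields and join them with commas.
field_ids=$(for name in sm_active sm_occupancy dram_active; do
    lookup_field_id "$name"
done | paste -sd, -)

echo "$field_ids"
```

The resulting string can then be passed on, for example as dcgmi dmon -e "$field_ids", together with whichever GPU selection applies to your job.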

