How to use dcgmi dmon to monitor GPU utilization
The dcgmi dmon monitoring tool is very powerful and can show many metrics on the usage of the GPU. A word of warning, though: monitoring in such detail may slightly slow down your code. It is therefore not advised to monitor all the time, but only when you want to investigate the runtime behaviour and efficiency of your program. First, figure out which GPU has been assigned to you, so that you can query the correct one. To do that, we run:
$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-2512619e-298b-bb53-09f3-7867b8d1454a)
This example would be a typical result for a job to which one GPU has been allocated, out of the four available in a GPU node.
Now, we run
$ dcgmi discovery -l
4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:31:00.0                                         |
|        | Device UUID: GPU-ce4cef33-0cee-c533-c9ca-f2c11c07e3bf                |
+--------+----------------------------------------------------------------------+
| 1      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:32:00.0                                         |
|        | Device UUID: GPU-c30a2b41-7fc7-c244-2896-1a69df3bccc8                |
+--------+----------------------------------------------------------------------+
| 2      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:CA:00.0                                         |
|        | Device UUID: GPU-2512619e-298b-bb53-09f3-7867b8d1454a                |
+--------+----------------------------------------------------------------------+
| 3      | Name: NVIDIA A100-SXM4-40GB                                          |
|        | PCI Bus ID: 00000000:E3:00.0                                         |
|        | Device UUID: GPU-357ed8c2-a0ea-a164-c218-0ca3862bd7bf                |
+--------+----------------------------------------------------------------------+
As you can see, dcgmi always shows all GPUs in the node, even the ones that have not been assigned to you. Comparing the UUIDs in the two outputs shows that the GPU assigned to us has GPU ID 2 in dcgmi. Now, we can run the monitoring, e.g.
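The UUID comparison can also be scripted. The following is a minimal sketch, assuming the two tools print text shaped like the samples above; the function names (parse_smi_uuids, parse_dcgmi_gpus, assigned_dcgmi_ids) are hypothetical helpers, and the text is passed in as strings rather than obtained by calling the tools.

```python
import re

def parse_smi_uuids(smi_text):
    """Return the GPU UUIDs printed by `nvidia-smi -L` (MIG lines are ignored)."""
    return re.findall(r"^GPU \d+:.*\(UUID: (GPU-[0-9a-f-]+)\)", smi_text, flags=re.M)

def parse_dcgmi_gpus(discovery_text):
    """Map Device UUID -> dcgmi GPU ID from `dcgmi discovery -l` output."""
    mapping = {}
    gpu_id = None
    for line in discovery_text.splitlines():
        # Table rows look like "| 2      | Name: ... |"; borders have no "|" cells.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue
        if cells[0].isdigit():
            gpu_id = int(cells[0])  # first row of a new GPU block
        if cells[1].startswith("Device UUID:") and gpu_id is not None:
            mapping[cells[1].split(":", 1)[1].strip()] = gpu_id
    return mapping

def assigned_dcgmi_ids(smi_text, discovery_text):
    """dcgmi GPU IDs of the GPUs that nvidia-smi reports for this job."""
    by_uuid = parse_dcgmi_gpus(discovery_text)
    return [by_uuid[u] for u in parse_smi_uuids(smi_text) if u in by_uuid]

# Abbreviated sample text taken from the outputs above.
smi = "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-2512619e-298b-bb53-09f3-7867b8d1454a)"
discovery = """
| 1      | Name: NVIDIA A100-SXM4-40GB                           |
|        | Device UUID: GPU-c30a2b41-7fc7-c244-2896-1a69df3bccc8 |
| 2      | Name: NVIDIA A100-SXM4-40GB                           |
|        | Device UUID: GPU-2512619e-298b-bb53-09f3-7867b8d1454a |
"""
print(assigned_dcgmi_ids(smi, discovery))
```

For the sample text this yields the dcgmi GPU ID 2, the value passed to -i in the command that follows.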
dcgmi dmon -e 203,204,1001,1002,1003,1004,1005,1009,1010,1011,1012,155 -i 2
Here, the number following the -i is our GPU ID. The numbers following the -e indicate the metrics we want to query. The most relevant metrics are probably the profiling metrics (see the official dcgmi documentation). Some example output:
#Entity ID  GPUTL  MCUTL  GRACT  SMACT  SMOCC  TENSO  DRAMA  PCITX     PCIRX      NVLTX  NVLRX  POWER W
GPU 0       82     39     0.779  0.562  0.233  0.140  0.296  31122275  148592109  0      0      204.109
GPU 0       82     39     0.797  0.575  0.238  0.143  0.303  24381091  150507324  0      0      204.109
GPU 0       6      3      0.409  0.302  0.128  0.077  0.154  23075223  112425741  0      0      173.644
GPU 0       24     11     0.247  0.191  0.084  0.052  0.091  21383893  114272271  0      0      207.260
For the exact meaning of each of these, read the official DCGM documentation. In short, they are:
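- GPUTL: GPU utilization (similar to the GPU-Util value printed by nvidia-smi). Indicates the percentage of time that any kernel is active on the GPU. Does not indicate how much of the GPU that kernel is using.
- MCUTL: Memory utilization (mem_copy_utilization in the field list): the percentage of time during which device memory was being read or written.
- GRACT: The fraction of time during which any portion of the graphics or compute engines was active.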
- SMACT: The fraction of time at least one warp was active on a multiprocessor, averaged over all multiprocessors
- SMOCC: The fraction of resident warps on a multiprocessor, relative to the maximum number of concurrent warps supported on a multiprocessor
- TENSO: Utilization of the tensor cores
- DRAMA: Fraction of cycles where data was sent to or received from device memory.
- PCITX/RX: PCIe bandwidth transmitted (TX) or received (RX), in bytes per second.
- NVLTX/RX: NVLink bandwidth transmitted (TX) or received (RX), in bytes per second.
- POWER: Total power consumption of the GPU, in watts.
Note that the official documentation gives some reasonable guidelines on what 'good' values are for these metrics. The values are quite hard to interpret, as the maximum achievable also depends on the nature and size of the computational problem. Still, GPUTL under 75%, SMACT below 0.5 or (in the case of deep learning) no Tensor Core usage (TENSO = 0.000) are indications that you're underutilizing the GPU. If your problem size is too small for a full A100, consider using one of our MIG-partitioned A100s.
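These rules of thumb can be applied to captured dmon output automatically. The sketch below is one possible way to do that, not an official tool: parse_dmon_rows and underutilization_hints are hypothetical names, the column order is assumed to match the -e list used above, and averaging over all samples is an assumption of this example (the text above does not prescribe how to aggregate).

```python
# Column order assumed to match: -e 203,204,1001,1002,1003,1004,1005,1009,1010,1011,1012,155
FIELDS = ["GPUTL", "MCUTL", "GRACT", "SMACT", "SMOCC", "TENSO", "DRAMA",
          "PCITX", "PCIRX", "NVLTX", "NVLRX", "POWER"]

def parse_dmon_rows(text, fields=FIELDS):
    """Turn `dcgmi dmon` data rows into dicts; 'N/A' becomes None."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Data rows are "<entity type> <entity id> <one value per field>".
        if len(tokens) != len(fields) + 2 or tokens[0].startswith("#"):
            continue
        try:
            values = [None if t == "N/A" else float(t) for t in tokens[2:]]
        except ValueError:
            continue  # not a data row after all
        rows.append(dict(zip(fields, values)))
    return rows

def underutilization_hints(rows):
    """Check the averaged samples against the thresholds named in the text."""
    def mean(name):
        vals = [r[name] for r in rows if r.get(name) is not None]
        return sum(vals) / len(vals) if vals else None

    hints = []
    gputl, smact, tenso = mean("GPUTL"), mean("SMACT"), mean("TENSO")
    if gputl is not None and gputl < 75:
        hints.append(f"GPUTL averages {gputl:.1f}% (< 75%)")
    if smact is not None and smact < 0.5:
        hints.append(f"SMACT averages {smact:.3f} (< 0.5)")
    if tenso is not None and tenso == 0.0:
        hints.append("TENSO is 0.000: Tensor Cores are not used")
    return hints

sample = """
#Entity ID GPUTL MCUTL GRACT SMACT SMOCC TENSO DRAMA PCITX PCIRX NVLTX NVLRX POWER W
GPU 0 82 39 0.779 0.562 0.233 0.140 0.296 31122275 148592109 0 0 204.109
GPU 0 82 39 0.797 0.575 0.238 0.143 0.303 24381091 150507324 0 0 204.109
GPU 0 6 3 0.409 0.302 0.128 0.077 0.154 23075223 112425741 0 0 173.644
GPU 0 24 11 0.247 0.191 0.084 0.052 0.091 21383893 114272271 0 0 207.260
"""
for hint in underutilization_hints(parse_dmon_rows(sample)):
    print(hint)
```

On the four sample rows shown earlier, both the GPUTL and the SMACT hints are raised, while TENSO is non-zero. Metrics reported as N/A are skipped rather than counted as zero.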
There are many more metrics that you can query. You can list them using
dcgmi dmon --list
How to use dcgmi dmon to monitor GPU utilization when using MIG instances
The approach is similar to the one above, with two differences:
- We need to determine which MIG instance we are using, rather than which GPU.
- Some metrics cannot be measured for MIG instances.
First, we run
$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-82f16363-e6df-5bed-3cef-c1a484831465)
  MIG 3g.20gb Device 0: (UUID: MIG-5e639bae-ed68-53c8-b15c-0d3e2518ecc5)
to determine the UUID of the GPU we are running on. Then, we run:
$ nvidia-smi
Thu Oct 26 14:32:22 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:31:00.0 Off |                  Off |
| N/A   31C    P0              86W / 400W |   1609MiB / 40960MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    1   0   0  |            1572MiB / 19968MiB  | 42    N/A |  3   0    2    0    0 |
|                  |               2MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0    1    0     882588      C   ...on/3.10.4-GCCcore-11.3.0/bin/python     1526MiB |
+---------------------------------------------------------------------------------------+
Here we extract the MIG GI ID (bottom left, under 'MIG devices'), which in this case is '1'.
Finally, we run:
$ dcgmi discovery -c
+-------------------+--------------------------------------------------------------------+
| Instance Hierarchy                                                                     |
+===================+====================================================================+
| GPU 0             | GPU GPU-82f16363-e6df-5bed-3cef-c1a484831465 (EntityID: 0)         |
| -> I 0/1          | GPU Instance (EntityID: 0)                                         |
|    -> CI 0/1/0    | Compute Instance (EntityID: 0)                                     |
| -> I 0/2          | GPU Instance (EntityID: 1)                                         |
|    -> CI 0/2/0    | Compute Instance (EntityID: 1)                                     |
+-------------------+--------------------------------------------------------------------+
| GPU 1             | GPU GPU-7f81592f-2b32-bd81-2307-81243a539e04 (EntityID: 1)         |
| -> I 1/1          | GPU Instance (EntityID: 7)                                         |
|    -> CI 1/1/0    | Compute Instance (EntityID: 7)                                     |
| -> I 1/2          | GPU Instance (EntityID: 8)                                         |
|    -> CI 1/2/0    | Compute Instance (EntityID: 8)                                     |
+-------------------+--------------------------------------------------------------------+
| GPU 2             | GPU GPU-7eb5ea70-e6ed-88db-c0b7-08d594288d6b (EntityID: 2)         |
| -> I 2/1          | GPU Instance (EntityID: 14)                                        |
|    -> CI 2/1/0    | Compute Instance (EntityID: 14)                                    |
| -> I 2/2          | GPU Instance (EntityID: 15)                                        |
|    -> CI 2/2/0    | Compute Instance (EntityID: 15)                                    |
+-------------------+--------------------------------------------------------------------+
| GPU 3             | GPU GPU-f69f57d5-0468-5793-f018-89852bd0ae5d (EntityID: 3)         |
| -> I 3/1          | GPU Instance (EntityID: 21)                                        |
|    -> CI 3/1/0    | Compute Instance (EntityID: 21)                                    |
| -> I 3/2          | GPU Instance (EntityID: 22)                                        |
|    -> CI 3/2/0    | Compute Instance (EntityID: 22)                                    |
+-------------------+--------------------------------------------------------------------+
Now, we see that we are running on GPU 0, GPU instance ID 1; the former by searching for the GPU UUID (GPU-82f16363-e6df-5bed-3cef-c1a484831465), the latter by searching for the MIG GI ID (1) on that GPU on the left-hand side.
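The lookup described above can be sketched in a few lines. This is an illustration only: find_instance_entity_id is a hypothetical helper, and it assumes the `dcgmi discovery -c` text has the shape shown above, with rows of the form "I <GPU>/<GI ID>" carrying the EntityID of each GPU instance.

```python
import re

def find_instance_entity_id(discovery_text, gpu_uuid, gi_id):
    """Return the EntityID of GPU instance `gi_id` on the GPU with `gpu_uuid`, or None."""
    current_gpu = None  # index of the GPU block we are in, only set for our GPU
    for line in discovery_text.splitlines():
        gpu = re.search(r"\|\s*GPU (\d+)\s*\|\s*GPU (GPU-[0-9a-f-]+)", line)
        if gpu:
            current_gpu = gpu.group(1) if gpu.group(2) == gpu_uuid else None
            continue
        # Matches "-> I 0/1 | GPU Instance (EntityID: 0)" but not the "-> CI ..." rows.
        inst = re.search(r"->\s*I (\d+)/(\d+)\s*\|\s*GPU Instance \(EntityID: (\d+)\)", line)
        if (inst and current_gpu is not None
                and inst.group(1) == current_gpu and int(inst.group(2)) == gi_id):
            return int(inst.group(3))
    return None

# Abbreviated sample text taken from the output above.
discovery = """
| GPU 0             | GPU GPU-82f16363-e6df-5bed-3cef-c1a484831465 (EntityID: 0) |
| -> I 0/1          | GPU Instance (EntityID: 0)                                 |
|    -> CI 0/1/0    | Compute Instance (EntityID: 0)                             |
| -> I 0/2          | GPU Instance (EntityID: 1)                                 |
| GPU 1             | GPU GPU-7f81592f-2b32-bd81-2307-81243a539e04 (EntityID: 1) |
| -> I 1/1          | GPU Instance (EntityID: 7)                                 |
| -> I 1/2          | GPU Instance (EntityID: 8)                                 |
"""
entity = find_instance_entity_id(discovery, "GPU-82f16363-e6df-5bed-3cef-c1a484831465", 1)
print(f"-i i:{entity}")
```

The GPU UUID comes from nvidia-smi -L and the GI ID from the 'MIG devices' table of nvidia-smi, exactly as in the manual procedure.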
We can query this particular MIG instance by changing the -i argument from the GPU ID to i:EntityID, where EntityID is that of the GPU instance printed by dcgmi discovery -c. In this example, GPU instance 1 on GPU 0 (the row 'I 0/1') has EntityID 0. (WARNING: please note that the EntityIDs are not consecutive: the first instance on GPU 1 has EntityID 7.)
dcgmi dmon -e 203,204,1001,1002,1003,1004,1005,1009,1010,1011,1012,155 -i i:0
#Entity ID  GPUTL  MCUTL  GRACT  SMACT  SMOCC  TENSO  DRAMA  PCITX     PCIRX      NVLTX  NVLRX  POWER W
GPU-I 0     N/A    N/A    0.978  0.734  0.327  0.130  0.371  N/A       N/A        N/A    N/A    138.425
GPU-I 0     N/A    N/A    0.978  0.735  0.328  0.130  0.372  N/A       N/A        N/A    N/A    82.189
GPU-I 0     N/A    N/A    0.527  0.395  0.175  0.071  0.199  N/A       N/A        N/A    N/A    145.837
GPU-I 0     N/A    N/A    0.947  0.711  0.317  0.126  0.360  N/A       N/A        N/A    N/A    144.452
This is actually running the same use case as before, but on a 3g.20gb MIG instance (3/7th of an A100 in terms of compute). As you can see, some metrics report N/A, as they cannot be measured for MIG instances. But what we can clearly see is that running this code on a MIG instance instead of a full A100 results in much higher SMACT, SMOCC and TENSO usage. We've also timed this code, and it ran 30% slower (on 50% less hardware). Thus, if efficiency to solution is more important than time to solution, this example is better off running on a MIG instance.
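The trade-off can be made explicit with a little arithmetic. The sketch below uses the figures quoted above (30% slower, 50% of the hardware); the function relative_gpu_cost is a hypothetical helper, and treating "hardware used" as runtime multiplied by the fraction of the GPU occupied is an assumption of this example.

```python
def relative_gpu_cost(slowdown, hardware_fraction):
    """Hardware-time consumed relative to the run on the full GPU.

    slowdown:          runtime on the partition / runtime on the full GPU
                       (1.3 means 30% slower)
    hardware_fraction: share of the full GPU that the partition occupies
                       (0.5 means half of the hardware)
    """
    return slowdown * hardware_fraction

cost = relative_gpu_cost(slowdown=1.3, hardware_fraction=0.5)
print(f"relative hardware-time: {cost:.2f}")
print(f"saving versus the full GPU: {(1 - cost) * 100:.0f}%")
```

With these inputs the MIG run consumes 0.65 of the hardware-time of the full-GPU run, i.e. roughly a third less, at the price of a 30% longer wait.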
A full list of codes is available with dcgmi dmon --list :
$ dcgmi dmon --list
________________________________________________________________________________
Long Name  Short Name  Field ID
________________________________________________________________________________
driver_version  DRVER  1
nvml_version  NVVER  2
process_name  PRNAM  3
device_count  DVCNT  4
cuda_driver_version  CDVER  5
name  DVNAM  50
brand  DVBRN  51
nvml_index  NVIDX  52
serial_number  SRNUM  53
uuid  UUID#  54
minor_number  MNNUM  55
oem_inforom_version  OEMVR  56
pci_busid  PCBID  57
pci_combined_id  PCCID  58
pci_subsys_id  PCSID  59
system_topology_pci  STVCI  60
system_topology_nvlink  STNVL  61
system_affinity  SYSAF  62
cuda_compute_capability  DVCCC  63
p2p_nvlink_status  P2PNS  64
compute_mode  CMMOD  65
persistance_mode  PMMOD  66
mig_mode  MGMOD  67
cuda_visible_devices  CUVID  68
mig_max_slices  MIGMS  69
cpu_affinity_0  CAFF0  70
cpu_affinity_1  CAFF1  71
cpu_affinity_2  CAFF2  72
cpu_affinity_3  CAFF3  73
cc_mode  CCMOD  74
mig_attributes  MIGATT  75
mig_gi_info  MIGGIINFO  76
mig_ci_info  MIGCIINFO  77
ecc_inforom_version  EIVER  80
power_inforom_version  PIVER  81
inforom_image_version  IIVER  82
inforom_config_checksum  CCSUM  83
inforom_config_valid  ICVLD  84
vbios_version  VBVER  85
mem_affinity_0  MAFF0  86
mem_affinity_1  MAFF1  87
mem_affinity_2  MAFF2  88
mem_affinity_3  MAFF3  89
bar1_total  B1TTL  90
sync_boost  SYBST  91
bar1_used  B1USE  92
bar1_free  B1FRE  93
gpm_support  GPMSPT  94
sm_clock  SMCLK  100
memory_clock  MMCLK  101
video_clock  VICLK  102
sm_app_clock  SACLK  110
mem_app_clock  MACLK  111
current_clocks_event_reasons  DVCCTR  112
sm_max_clock  SMMAX  113
memory_max_clock  MMMAX  114
video_max_clock  VIMAX  115
autoboost  ATBST  120
supported_clocks  SPCLK  130
memory_temp  MMTMP  140
gpu_temp  TMPTR  150
gpu_mem_max_op_temp  GMMOT  151
gpu_max_op_temp  GGMOT  152
gpu_temp_tlimit  GTLIMIT  153
power_usage  POWER  155
total_energy_consumption  TOTEC  156
power_usage_instant  POWINST  157
slowdown_temp  SDTMP  158
shutdown_temp  SHTMP  159
power_management_limit  PMLMT  160
power_management_limit_min  PMMIN  161
power_management_limit_max  PMMAX  162
power_management_limit_default  PMDEF  163
enforced_power_limit  EPLMT  164
req_power_prof  RPPRM  165
enf_power_prof  EPPRM  166
val_power_prof  VPPRM  167
fabric_manager_status  FMSTA  170
fabric_manager_failure_code  FMFRC  171
fabric_cluster_uuid  FCUID  172
fabric_clique_id  FMCID  173
pstate  PSTAT  190
fan_speed  FANSP  191
pcie_tx_throughput  TXTPT  200
pcie_rx_throughput  RXTPT  201
pcie_replay_counter  RPCTR  202
gpu_utilization  GPUTL  203
mem_copy_utilization  MCUTL  204
accounting_data  ACCDT  205
enc_utilization  ECUTL  206
dec_utilization  DCUTL  207
mem_util_samples  MUSAM  210
gpu_util_samples  GUSAM  211
graphics_pids  GPIDS  220
compute_pids  CMPID  221
xid_errors  XIDER  230
pcie_max_link_gen  PCIMG  235
pcie_max_link_width  PCIMW  236
pcie_link_gen  PCILG  237
pcie_link_width  PCILW  238
power_violation  PVIOL  240
thermal_violation  TVIOL  241
sync_boost_violation  SBVIO  242
board_limit_violation  BLVIO  243
low_util_violation  LUVIO  244
reliability_violation  RVIOL  245
app_clock_violation  TAPCV  246
base_clock_violation  TAPBC  247
fb_total  FBTTL  250
fb_free  FBFRE  251
fb_used  FBUSD  252
fb_resv  FBRSV  253
fb_USDP  FBUSP  254
c2c_link_count  C2CLC  285
c2c_link_status  C2CST  286
c2c_max_bandwidth  C2CMAXBW  287
ecc  ECCUR  300
ecc_pending  ECPEN  301
ecc_sbe_volatile_total  ESVTL  310
ecc_dbe_volatile_total  EDVTL  311
ecc_sbe_aggregate_total  ESATL  312
ecc_dbe_aggregate_total  EDATL  313
ecc_sbe_volatile_l1  ESVL1  314
ecc_dbe_volatile_l1  EDVL1  315
ecc_sbe_volatile_l2  ESVL2  316
ecc_dbe_volatile_l2  EDVL2  317
ecc_sbe_volatile_device  ESVDV  318
ecc_dbe_volatile_device  EDVDV  319
ecc_sbe_volatile_register  ESVRG  320
ecc_dbe_volatile_register  EDVRG  321
ecc_sbe_volatile_texture  ESVTX  322
ecc_dbe_volatile_texture  EDVTX  323
ecc_sbe_aggregate_l1  ESAL1  324
ecc_dbe_aggregate_l1  EDAL1  325
ecc_sbe_aggregate_l2  ESAL2  326
ecc_dbe_aggregate_l2  EDAL2  327
ecc_sbe_aggregate_device  ESADV  328
ecc_dbe_aggregate_device  EDADV  329
ecc_sbe_aggregate_register  ESARG  330
ecc_dbe_aggregate_register  EDARG  331
ecc_sbe_aggregate_texture  ESATX  332
ecc_dbe_aggregate_texture  EDATX  333
ecc_sbe_volatile_shared  ESVSHM  334
ecc_dbe_volatile_shared  EDVSHM  335
ecc_sbe_volatile_cbu  ESVCBU  336
ecc_dbe_volatile_cbu  EDVCBU  337
ecc_sbe_aggregate_shared  ESDSHM  338
ecc_dbe_aggregate_shared  EDDSHM  339
ecc_sbe_aggregate_cbu  ESDCBU  340
ecc_dbe_aggregate_cbu  EDDCBU  341
ecc_sbe_volatile_sram  ESVSRM  342
ecc_dbe_volatile_sram  EDVSRM  343
ecc_sbe_aggregate_sram  ESDSRM  344
ecc_dbe_aggregate_sram  EDDSRM  345
ecc_threshold_sram  ECCTHRSRM  346
gpu_memory_test_result  MEMRES  350
diagnostics_test_result  DIARES  351
pcie_test_result  PCIRES  352
targeted_stress_test_result  STRRES  353
targeted_power_test_result  POWRES  354
memory_bandwidth_test_result  MBWRES  355
memory_test_result  MEMRES  356
pulse_test_result  PLSRES  357
eud_test_result  EUDRES  358
cpu_eud_test_result  CPUEUDRES  359
software_test_result  SWRES  360
nvbandwidth_test_result  NVBRES  361
diag_status  DIAGSTATU  362
remap_rows_avail_max  RRAM  385
remap_rows_avail_high  RRAH  386
remap_rows_avail_partial  RRAP  387
remap_rows_avail_low  RRAL  388
remap_rows_avail_none  RRAN  389
retired_pages_sbe  RPSBE  390
retired_pages_dbe  RPDBE  391
retired_pages_pending  RPPEN  392
uncorrectable_remapped_rows  URMPS  393
correctable_remapped_rows  CRMPS  394
row_remap_failure  RRF  395
row_remap_pending  RRP  396
nvlink_flit_crc_error_count_l0  NFEL0  400
nvlink_flit_crc_error_count_l1  NFEL1  401
nvlink_flit_crc_error_count_l2  NFEL2  402
nvlink_flit_crc_error_count_l3  NFEL3  403
nvlink_flit_crc_error_count_l4  NFEL4  404
nvlink_flit_crc_error_count_l5  NFEL5  405
nvlink_flit_crc_error_count_l12  NFEL12  406
nvlink_flit_crc_error_count_l13  NFEL13  407
nvlink_flit_crc_error_count_l14  NFEL14  408
nvlink_flit_crc_error_count_total  NFELT  409
nvlink_data_crc_error_count_l0  NDEL0  410
nvlink_data_crc_error_count_l1  NDEL1  411
nvlink_data_crc_error_count_l2  NDEL2  412
nvlink_data_crc_error_count_l3  NDEL3  413
nvlink_data_crc_error_count_l4  NDEL4  414
nvlink_data_crc_error_count_l5  NDEL5  415
nvlink_data_crc_error_count_l12  NDEL12  416
nvlink_data_crc_error_count_l13  NDEL13  417
nvlink_data_crc_error_count_l14  NDEL14  418
nvlink_data_crc_error_count_total  NDELT  419
nvlink_replay_error_count_l0  NREL0  420
nvlink_replay_error_count_l1  NREL1  421
nvlink_replay_error_count_l2  NREL2  422
nvlink_replay_error_count_l3  NREL3  423
nvlink_replay_error_count_l4  NREL4  424
nvlink_replay_error_count_l5  NREL5  425
nvlink_replay_error_count_l12  NREL12  426
nvlink_replay_error_count_l13  NREL13  427
nvlink_replay_error_count_l14  NREL14  428
nvlink_replay_error_count_total  NRELT  429
nvlink_recovery_error_count_l0  NRCL0  430
nvlink_recovery_error_count_l1  NRCL1  431
nvlink_recovery_error_count_l2  NRCL2  432
nvlink_recovery_error_count_l3  NRCL3  433
nvlink_recovery_error_count_l4  NRCL4  434
nvlink_recovery_error_count_l5  NRCL5  435
nvlink_recovery_error_count_l12  NRCL12  436
nvlink_recovery_error_count_l13  NRCL13  437
nvlink_recovery_error_count_l14  NRCL14  438
nvlink_recovery_error_count_total  NRCLT  439
nvlink_bandwidth_l0  NBWL0  440
nvlink_bandwidth_l1  NBWL1  441
nvlink_bandwidth_l2  NBWL2  442
nvlink_bandwidth_l3  NBWL3  443
nvlink_bandwidth_l4  NBWL4  444
nvlink_bandwidth_l5  NBWL5  445
nvlink_bandwidth_l12  NBWL12  446
nvlink_bandwidth_l13  NBWL13  447
nvlink_bandwidth_l14  NBWL14  448
nvlink_bandwidth_total  NBWLT  449
gpu_nvlink_errors  GNVERR  450
nvlink_flit_crc_error_count_l6  NFEL6  451
nvlink_flit_crc_error_count_l7  NFEL7  452
nvlink_flit_crc_error_count_l8  NFEL8  453
nvlink_flit_crc_error_count_l9  NFEL9  454
nvlink_flit_crc_error_count_l10  NFEL10  455
nvlink_flit_crc_error_count_l11  NFEL11  456
nvlink_data_crc_error_count_l6  NDEL6  457
nvlink_data_crc_error_count_l7  NDEL7  458
nvlink_data_crc_error_count_l8  NDEL8  459
nvlink_data_crc_error_count_l9  NDEL9  460
nvlink_data_crc_error_count_l10  NDEL10  461
nvlink_data_crc_error_count_l11  NDEL11  462
nvlink_replay_error_count_l6  NREL6  463
nvlink_replay_error_count_l7  NREL7  464
nvlink_replay_error_count_l8  NREL8  465
nvlink_replay_error_count_l9  NREL9  466
nvlink_replay_error_count_l10  NREL10  467
nvlink_replay_error_count_l11  NREL11  468
nvlink_recovery_error_count_l6  NRCL6  469
nvlink_recovery_error_count_l7  NRCL7  470
nvlink_recovery_error_count_l8  NRCL8  471
nvlink_recovery_error_count_l9  NRCL9  472
nvlink_recovery_error_count_l10  NRCL10  473
nvlink_recovery_error_count_l11  NRCL11  474
nvlink_bandwidth_l6  NBWL6  475
nvlink_bandwidth_l7  NBWL7  476
nvlink_bandwidth_l8  NBWL8  477
nvlink_bandwidth_l9  NBWL9  478
nvlink_bandwidth_l10  NBWL10  479
nvlink_bandwidth_l11  NBWL11  480
nvlink_flit_crc_error_count_l15  NFEL15  481
nvlink_flit_crc_error_count_l16  NFEL16  482
nvlink_flit_crc_error_count_l17  NFEL17  483
nvlink_data_crc_error_count_l15  NDEL15  484
nvlink_data_crc_error_count_l16  NDEL16  485
nvlink_data_crc_error_count_l17  NDEL17  486
nvlink_replay_error_count_l15  NREL15  487
nvlink_replay_error_count_l16  NREL16  488
nvlink_replay_error_count_l17  NREL17  489
nvlink_recovery_error_count_l15  NRCL15  491
nvlink_recovery_error_count_l16  NRCL16  492
nvlink_recovery_error_count_l17  NRCL17  493
nvlink_bandwidth_l15  NBWL15  494
nvlink_bandwidth_l16  NBWL16  495
nvlink_bandwidth_l17  NBWL17  496
nvlink_crc_err  NLCRC  497
nvlink_recovery_err  NLREC  498
nvlink_replay_err  NLREP  499
virtualization_mode  VMODE  500
supported_type_info  SPINF  501
creatable_vgpu_type_ids  CGPID  502
active_vgpu_instance_ids  VGIID  503
vgpu_instance_utilizations  VIUTL  504
vgpu_instance_per_process_utilization  VIPPU  505
enc_stats  ENSTA  506
fbc_stats  FBCSTA  507
fbc_sessions_info  FBCINF  508
vgpu_type_ids  VTID  509
vgpu_type_info  VTPINF  510
vgpu_type_name  VTPNM  511
vgpu_type_class  VTPCLS  512
vgpu_type_license  VTPLC  513
vgpu_instance_vm_id  VVMID  520
vgpu_instance_vm_name  VMNAM  521
vgpu_instance_type  VITYP  522
vgpu_instance_uuid  VUUID  523
vgpu_instance_driver_version  VDVER  524
vgpu_instance_memory_usage  VMUSG  525
vgpu_instance_license_status  VLCST  526
vgpu_instance_frame_rate_limit  VFLIM  527
vgpu_instance_enc_stats  VSTAT  528
vgpu_instance_enc_sessions_info  VSINF  529
vgpu_instance_fbc_stats  VFSTAT  530
vgpu_instance_fbc_sessions_info  VFINF  531
vgpu_instance_license_state  VLCIST  532
vgpu_instance_pci_id  VPCIID  533
vgpu_instance_gpu_instance_id  VGII  534
infiniband_guid  IBGUID  571
chassis_serial  CSERIAL  572
chassis_slot_number  CSLOTNUM  573
tray_index  CTRAYIDX  574
host_id  HOSTID  575
peer_type  PEERTYPE  576
module_id  MODULEID  577
nvlink_pprm_oper_recovery  NLPRMOPRE  580
nvlink_ppcnt_recovery_time_since_last  NLRECLAST  581
nvlink_ppcnt_recovery_time_between_last_two  NLRECBTWN  582
nvlink_ppcnt_recovery_total_successful_events  NLRECOVER  583
nvlink_ppcnt_physical_successful_recovery_event  NLPHYREC  584
nvlink_ppcnt_physical_link_down_counter  NLLNKDOWN  585
nvlink_ppcnt_plr_rcv_codes  NLPLRRXC  586
nvlink_ppcnt_plr_rcv_code_err  NLPLRRXCE  587
nvlink_ppcnt_plr_rcv_uncorrectable_code  NLPLRRXCU  588
nvlink_ppcnt_plr_xmit_codes  NLPLRTXC  589
nvlink_ppcnt_plr_xmit_retry_codes  NLPLRTXRC  590
nvlink_ppcnt_plr_xmit_retry_events  NLPLRTXRE  591
nvlink_ppcnt_plr_sync_events  NLPLRSYNC  592
nvswitch_voltage_mvolt  SWVOLT  701
nvswitch_current_iddq  SWCUR  702
nvswitch_current_iddq_rev  SCIDDQ  703
nvswitch_current_iddq_dvdd  SCDVDD  704
nvswitch_power_vdd  SWPOWV  705
nvswitch_power_dvdd  SWPOWD  706
nvswitch_power_hvdd  SWPOWH  707
nvlink_bandwidth_tx  SWLNKTX  780
nvswitch_link_bandwidth_rx  SWLNKRX  781
nvswitch_link_fatal_errors  SWLNKFE  782
nvswitch_link_non_fatal_errors  SWLNKNF  783
nvswitch_link_replay_errors  SWLNKRP  784
nvswitch_link_recovery_errors  SWLNKRC  785
nvswitch_link_flit_errors  SWLNKFL  786
nvswitch_link_crc_errors  SWLNKCR  787
nvswitch_link_ecc_errors  SWLNKEC  788
nvswitch_link_latency_low_vc0  SWVCLL0  789
nvswitch_link_latency_low_vc1  SWVCLL1  790
nvswitch_link_latency_low_vc2  SWVCLL2  791
nvswitch_link_latency_low_vc  SWVCLL3  792
nvswitch_link_latency_medium_vc0  SWVCLM0  793
nvswitch_link_latency_medium_vc1  SWVCLM1  794
nvswitch_link_latency_medium_vc2  SWVCLM2  795
nvswitch_link_latency_medium_vc3  SWVCLM3  796
nvswitch_link_latency_high_vc0  SWVCLH0  797
nvswitch_link_latency_high_vc1  SWVCLH1  798
nvswitch_link_latency_high_vc2  SWVCLH2  799
nvswitch_link_latency_high_vc3  SWVCLH3  800
nvswitch_link_latency_panic_vc0  SWVCLP0  801
nvswitch_link_latency_panic_vc1  SWVCLP1  802
nvswitch_link_latency_panic_vc2  SWVCLP2  803
nvswitch_link_latency_panic_vc3  SWVCLP3  804
nvswitch_link_latency_count_vc0  SWVCLC0  805
nvswitch_link_latency_count_vc1  SWVCLC1  806
nvswitch_link_latency_count_vc2  SWVCLC2  807
nvswitch_link_latency_count_vc3  SWVCLC3  808
nvswitch_link_crc_errors_lane0  SWLACR0  809
nvswitch_link_crc_errors_lane1  SWLACR1  810
nvswitch_link_crc_errors_lane2  SWLACR2  811
nvswitch_link_crc_errors_lane3  SWLACR3  812
nvswitch_link_ecc_errors_lane0  SWLAEC0  813
nvswitch_link_ecc_errors_lane1  SWLAEC1  814
nvswitch_link_ecc_errors_lane2  SWLAEC2  815
nvswitch_link_ecc_errors_lane3  SWLAEC3  816
nvswitch_link_crc_errors_lane4  SWLACR4  817
nvswitch_link_crc_errors_lane5  SWLACR5  818
nvswitch_link_crc_errors_lane6  SWLACR6  819
nvswitch_link_crc_errors_lane7  SWLACR7  820
nvswitch_link_ecc_errors_lane4  SWLAEC4  821
nvswitch_link_ecc_errors_lane5  SWLAEC5  822
nvswitch_link_ecc_errors_lane6  SWLAEC6  823
nvswitch_link_ecc_errors_lane7  SWLAEC7  824
nvlink_tx_bandwidth_link0  NVLTXB0  825
nvlink_tx_bandwidth_link1  NVLTXB1  826
nvlink_tx_bandwidth_link2  NVLTXB2  827
nvlink_tx_bandwidth_link3  NVLTXB3  828
nvlink_tx_bandwidth_link4  NVLTXB4  829
nvlink_tx_bandwidth_link5  NVLTXB5  830
nvlink_tx_bandwidth_link6  NVLTXB6  831
nvlink_tx_bandwidth_link7  NVLTXB7  832
nvlink_tx_bandwidth_link8  NVLTXB8  833
nvlink_tx_bandwidth_link9  NVLTXB9  834
nvlink_tx_bandwidth_link10  NVLTXB10  835
nvlink_tx_bandwidth_link11  NVLTXB11  836
nvlink_tx_bandwidth_link12  NVLTXB12  837
nvlink_tx_bandwidth_link13  NVLTXB13  838
nvlink_tx_bandwidth_link14  NVLTXB14  839
nvlink_tx_bandwidth_link15  NVLTXB15  840
nvlink_tx_bandwidth_link16  NVLTXB16  841
nvlink_tx_bandwidth_link17  NVLTXB17  842
nvlink_tx_bandwidth_total  NTXBWLT  843
nvswitch_fatal_error  SEN00  856
nvswitch_non_fatal_error  SEN01  857
nvswitch_current_temperature  TMP01  858
nvswitch_slowdown_temperature  TMP02  859
nvswitch_shutdown_temperature  TMP03  860
nvswitch_bandwidth_tx  SWTX  861
nvswitch_bandwidth_rx  SWRX  862
nvswitch_physical_id  SWPHID  863
nvswitch_reset_required  SWFRMVER  864
nvlink_id  LNKID  865
nvswitch_pcie_dom  SWPCIEDOM  866
nvswitch_pcie_bus  SWPCIEBUS  867
nvswitch_pcie_dev  SWPCIEDEV  868
nvswitch_pcie_fun  SWPCIEFUN  869
nvswitch_nvlink_status  SWNVLNKST  870
nvswitch_nvlink_dev_type  SWNVLNKDT  871
link_pcie_remote_dom  LNKDOM  872
link_pcie_remote_bus  LNKBUS  873
link_pcie_remote_dev  LNKDEV  874
link_pcie_remote_func  LNKFNC  875
link_dev_link_id  SWNVLNKID  876
link_dev_link_sid  SWNVLNSID  877
link_dev_uuid  SWNVDVUID  878
nvlink_rx_bandwidth_link0  NVLRXB0  879
nvlink_rx_bandwidth_link1  NVLRXB1  880
nvlink_rx_bandwidth_link2  NVLRXB2  881
nvlink_rx_bandwidth_link3  NVLRXB3  882
nvlink_rx_bandwidth_link4  NVLRXB4  883
nvlink_rx_bandwidth_link5  NVLRXB5  884
nvlink_rx_bandwidth_link6  NVLRXB6  885
nvlink_rx_bandwidth_link7  NVLRXB7  886
nvlink_rx_bandwidth_link8  NVLRXB8  887
nvlink_rx_bandwidth_link9  NVLRXB9  888
nvlink_rx_bandwidth_link10  NVLRXB10  889
nvlink_rx_bandwidth_link11  NVLRXB11  890
nvlink_rx_bandwidth_link12  NVLRXB12  891
nvlink_rx_bandwidth_link13  NVLRXB13  892
nvlink_rx_bandwidth_link14  NVLRXB14  893
nvlink_rx_bandwidth_link15  NVLRXB15  894
nvlink_rx_bandwidth_link16  NVLRXB16  895
nvlink_rx_bandwidth_link17  NVLRXB17  896
nvlink_rx_bandwidth_total  NRXBWLT  897
gr_engine_active  GRACT  1001
sm_active  SMACT  1002
sm_occupancy  SMOCC  1003
tensor_active  TENSO  1004
dram_active  DRAMA  1005
fp64_active  FP64A  1006
fp32_active  FP32A  1007
fp16_active  FP16A  1008
pcie_tx_bytes  PCITX  1009
pcie_rx_bytes  PCIRX  1010
nvlink_tx_bytes  NVLTX  1011
nvlink_rx_bytes  NVLRX  1012
tensor_imma_active  TIMMA  1013
tensor_hmma_active  THMMA  1014
tensor_dfma_active  TDFMA  1015
integer_active  INTAC  1016
nvdec0_active  NVDEC0  1017
nvdec1_active  NVDEC1  1018
nvdec2_active  NVDEC2  1019
nvdec3_active  NVDEC3  1020
nvdec4_active  NVDEC4  1021
nvdec5_active  NVDEC5  1022
nvdec6_active  NVDEC6  1023
nvdec7_active  NVDEC7  1024
nvjpg0_active  NVJPG0  1025
nvjpg1_active  NVJPG1  1026
nvjpg2_active  NVJPG2  1027
nvjpg3_active  NVJPG3  1028
nvjpg4_active  NVJPG4  1029
nvjpg5_active  NVJPG5  1030
nvjpg6_active  NVJPG6  1031
nvjpg7_active  NVJPG7  1032
nvofa0_active  NVOFA0  1033
nvofa1_active  NVOFA1  1034
nvlink_l0_tx_bytes  NVL0T  1040
nvlink_l0_rx_bytes  NVL0R  1041
nvlink_l1_tx_bytes  NVL1T  1042
nvlink_l1_rx_bytes  NVL1R  1043
nvlink_l2_tx_bytes  NVL2T  1044
nvlink_l2_rx_bytes  NVL2R  1045
nvlink_l3_tx_bytes  NVL3T  1046
nvlink_l3_rx_bytes  NVL3R  1047
nvlink_l4_tx_bytes  NVL4T  1048
nvlink_l4_rx_bytes  NVL4R  1049
nvlink_l5_tx_bytes  NVL5T  1050
nvlink_l5_rx_bytes  NVL5R  1051
nvlink_l6_tx_bytes  NVL6T  1052
nvlink_l6_rx_bytes  NVL6R  1053
nvlink_l7_tx_bytes  NVL7T  1054
nvlink_l7_rx_bytes  NVL7R  1055
nvlink_l8_tx_bytes  NVL8T  1056
nvlink_l8_rx_bytes  NVL8R  1057
nvlink_l9_tx_bytes  NVL9T  1058
nvlink_l9_rx_bytes  NVL9R  1059
nvlink_l10_tx_bytes  NVL10T  1060
nvlink_l10_rx_bytes  NVL10R  1061
nvlink_l11_tx_bytes  NVL11T  1062
nvlink_l11_rx_bytes  NVL11R  1063
nvlink_l12_tx_bytes  NVL12T  1064
nvlink_l12_rx_bytes  NVL12R  1065
nvlink_l13_tx_bytes  NVL13T  1066
nvlink_l13_rx_bytes  NVL13R  1067
nvlink_l14_tx_bytes  NVL14T  1068
nvlink_l14_rx_bytes  NVL14R  1069
nvlink_l15_tx_bytes  NVL15T  1070
nvlink_l15_rx_bytes  NVL15R  1071
nvlink_l16_tx_bytes  NVL16T  1072
nvlink_l16_rx_bytes  NVL16R  1073
nvlink_l17_tx_bytes  NVL17T  1074
nvlink_l17_rx_bytes  NVL17R  1075
c2c_tx_all_bytes  C2CTXAB  1076
c2c_tx_data_bytes  C2CTXDB  1077
c2c_rx_all_bytes  C2CRXAB  1078
c2c_rx_data_bytes  C2CRXDB  1079
hostmem_cache_hit  HMCACHEHT  1080
hostmem_cache_miss  HMCACHEMS  1081
peermem_cache_hit  PMCACHEHT  1082
peermem_cache_miss  PMCACHEMS  1083
cpu_utilization_total  CPUUT  1100
cpu_utilization_user  CPUUU  1101
cpu_utilization_nice  CPUUN  1102
cpu_utilization_sys  CPUUS  1103
cpu_utilization_irq  CPUUI  1104
cpu_temp  CPUTP  1110
cpu_temp_warn  CPUTW  1111
cpu_temp_crit  CPUTC  1112
cpu_clock  CPUCL  1120
cpu_power_utilization  CPUPU  1130
cpu_power_limit  CPUPL  1131
sysio_power_utilization  SIOPU  1132
module_power_utilization  MODPU  1133
cpu_vendor_name  CPUVN  1140
cpu_model_name  CPUMN  1141
nvlink_xmit_packets  NLXMITP  1200
nvlink_xmit_bytes  NLXMITB  1201
nvlink_rcv_packets  NLRCVP  1202
nvlink_rcv_bytes  NLRCVB  1203
nvlink_malformed_packets_err  NLMALP  1204
nvlink_buffer_overrun_err  NLBUFO  1205
nvlink_rcv_err  NLRCVE  1206
nvlink_rcv_remote_err  NLRCVRE  1207
nvlink_rcv_generate_err  NLRCVGE  1208
nvlink_rcv_local_link_integrity_err  NLLLIE  1209
nvlink_xmit_discards  NLXMITD  1210
nvlink_link_recovery_successful  NLLRS  1211
nvlink_link_recovery_failed  NLLRF  1212
nvlink_link_recovery  NLLR  1213
nvlink_symbol_err  NLSYME  1214
nvlink_symbol_ber  NLSYMB  1215
nvlink_symbol_ber_float  NLSYMBF  1216
nvlink_effective_ber  NLEFB  1217
nvlink_effective_ber_float  NLEFBF  1218
nvlink_effective_errors  NLEFE  1219
cx_health  CXHEALTH  1300
cx_active_pcie_link_width  CXAPLKWD  1301
cx_active_pcie_link_speed  CXAPLKSP  1302
connectx_expect_pcie_link_width  CXEPLKWD  1303
cx_expect_pcie_link_speed  CXEPLKSP  1304
cx_correctable_err_status  CXCERRST  1305
cx_correctable_err_mask  CXCERRMK  1306
cx_uncorrectable_err_status  CXUCERRST  1307
cx_uncorrectable_err_mask  CXUCERRMK  1308
cx_uncorrectable_err_severity  CXUCERRSE  1309
cx_dev_temp  CXTEMP  1310
c2c_link_error_intr  C2CLNKINT  1400
c2c_link_error_replay  C2CLNKERR  1401
c2c_link_error_replay_b2b  C2CLNKERR  1402
c2c_link_power_state  C2CLNKPWR  1403
nvlink_fec_history_count0  NVFECHST0  1404
nvlink_fec_history_count1  NVFECHST1  1405
nvlink_fec_history_count2  NVFECHST2  1406
nvlink_fec_history_count3  NVFECHST3  1407
nvlink_fec_history_count4  NVFECHST4  1408
nvlink_fec_history_count5  NVFECHST5  1409
nvlink_fec_history_count6  NVFECHST6  1410
nvlink_fec_history_count7  NVFECHST7  1411
nvlink_fec_history_count8  NVFECHST8  1412
nvlink_fec_history_count9  NVFECHST9  1413
nvlink_fec_history_count10  NVFECHST1  1414
nvlink_fec_history_count11  NVFECHST1  1415
nvlink_fec_history_count12  NVFECHST1  1416
nvlink_fec_history_count13  NVFECHST1  1417
nvlink_fec_history_count14  NVFECHST1  1418
nvlink_fec_history_count15  NVFECHST1  1419
clocks_event_power_cap_ns  CLKPCNS  1420
clocks_event_boost_ns  CLKBSTNS  1421
clocks_event_sw_thermal_slowdown_ns  CLKSWTHNS  1422
clocks_event_hw_thermal_slowdown_ns  CLKHWTHNS  1423
clocks_event_power_brake_slowdown_ns  CLKEPBSNS  1424
power_smoothing_enabled  PWRSMEN  1425
power_smoothing_priv_level  PWRSMPRLV  1426
power_smoothing_imm_ramp_down_enabled  PWRSMIRD  1427
power_smoothing_applied_tmp_ceil  PWRSMCL  1428
power_smoothing_applied_tmp_floor  PWRSMTFL  1429
power_smoothing_max_percent_tmp_floor_setting  PSMMINFLR  1430
power_smoothing_min_percent_tmp_floor_setting  PSMMINFLR  1431
power_smoothing_hw_circuitry_percent_lifetime_r  PSMHCPLR  1432
power_smoothing_max_num_preset_profiles  PSMMAXNP  1433
power_smoothing_profile_percent_tmp_floor  PSMPTFLR  1434
power_smoothing_profile_ramp_up_rate  PSMRURAT  1435
power_smoothing_profile_ramp_down_rate  PSMRDRAT  1436
power_smoothing_profile_ramp_down_hyst_val  PSMRDHV  1437
power_smoothing_active_preset_profile  PSMAPRPR  1438
power_smoothing_admin_override_percent_tmp_floo  PSMAOFLR  1439
power_smoothing_admin_override_ramp_up_rate  PSMAORUR  1440
power_smoothing_admin_override_ramp_down_rate  PSMAORDR  1441
power_smoothing_admin_override_ramp_down_hyst_v  PSMAORDH  1442
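Since the -e argument takes numeric field IDs, it can be convenient to build it from the short names in this list. The following is a rough sketch under the assumption that each data row of the list consists of a long name, a short name and a field ID separated by whitespace; parse_field_list and build_dmon_command are hypothetical helper names, and only the -e and -i options shown earlier are used.

```python
def parse_field_list(list_text):
    """Map short name -> field ID from `dcgmi dmon --list` output."""
    fields = {}
    for line in list_text.splitlines():
        tokens = line.split()
        # Data rows are "long_name SHORT_NAME field_id"; rulers and the header are skipped.
        if len(tokens) == 3 and tokens[2].isdigit():
            # Note: a few short names repeat in the list; the last one wins here.
            fields[tokens[1]] = int(tokens[2])
    return fields

def build_dmon_command(fields, short_names, entity):
    """Assemble the dcgmi dmon argument list for the given short names."""
    ids = ",".join(str(fields[name]) for name in short_names)
    return ["dcgmi", "dmon", "-e", ids, "-i", str(entity)]

# A few rows copied from the list above.
listing = """
Long Name Short Name Field ID
power_usage POWER 155
gpu_utilization GPUTL 203
mem_copy_utilization MCUTL 204
sm_active SMACT 1002
tensor_active TENSO 1004
"""
fields = parse_field_list(listing)
command = build_dmon_command(fields, ["GPUTL", "SMACT", "TENSO", "POWER"], "i:0")
print(" ".join(command))
```

The resulting list can be passed to subprocess.run on a node where dcgmi is installed; because some short names occur more than once in the list, ambiguous names are better given by field ID directly.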