Nvidia GPU Monitoring - v1 (15.0.0)

Collector Type: Agent

Category: Application Monitors

Application Name: nvidiagpumonitor

G2 Monitor Name: Agent G2 - Nvidia Gpu Monitor

Global Template Name: Agent G2 - Linux - Nvidia GPU Monitoring

Supported DCGM Version: 3.1.7

Supported Agent Version: 15.0.0

Configuration Parameters

| Name | Description | Default Value |
|---|---|---|
| Namespace | Namespace in which the DCGM exporter is running | gpu-operator |
| Port | Port on which metrics are exported | 9400 |
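The Port parameter points at the endpoint where the DCGM exporter publishes its metrics in the Prometheus text format. The following is a minimal sketch of reading and parsing that output, not the agent's actual collection logic. The `/metrics` path, the `fetch_metrics` and `parse_metrics` helper names, and the sample values are assumptions made for illustration; the raw exporter field names (for example `DCGM_FI_DEV_GPU_TEMP`) differ from the monitor names listed in this document.

```python
from typing import Dict, List, Tuple
from urllib.request import urlopen

Sample = Tuple[str, Dict[str, str], float]

# Illustrative exporter output; the values and labels are made up.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 41
DCGM_FI_DEV_POWER_USAGE{gpu="0",Hostname="node-a"} 63.5
"""


def fetch_metrics(host: str, port: int = 9400, timeout: float = 5.0) -> str:
    """Fetch the raw metrics text (assumes the exporter serves /metrics over HTTP)."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text lines into (name, labels, value) tuples.

    Simplified on purpose: assumes no timestamps after the value and no
    commas or escaped quotes inside label values.
    """
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, label_part = series.partition("{")
        labels: Dict[str, str] = {}
        for pair in label_part.rstrip("}").split(","):
            if "=" in pair:
                key, _, raw = pair.partition("=")
                labels[key.strip()] = raw.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels.get("gpu"), value)
```

Running the parser against the sample text first is a quick way to check the label handling before pointing `fetch_metrics` at a live exporter.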

Collected Metrics

| Monitor Name | Display Name | Description |
|---|---|---|
| nvidia_dcgm_power_usage | Nvidia Dcgm Power Usage | Power draw |
| nvidia_dcgm_mem_clock | Nvidia Dcgm Mem Clock Freq | Memory clock frequency |
| nvidia_dcgm_mem_copy_util | Nvidia Dcgm Mem Util | Memory utilization |
| nvidia_dcgm_fb_mem_used | Nvidia Dcgm Framebuffer Memory Used | Framebuffer memory used |
| nvidia_dcgm_gpu_temp | Nvidia Dcgm Gpu Temp | GPU temperature |
| nvidia_dcgm_memory_temp | Nvidia Dcgm Memory Temp | Memory temperature |
| nvidia_dcgm_gpu_util | Nvidia Dcgm Gpu Util | GPU utilization |

Nvidia GPU Monitoring - v2 (16.0.0)

Collector Type: Agent

Category: Application Monitors

Application Name: nvidiagpumonitor

G2 Monitor Name: Agent G2 - Nvidia Gpu Monitor - v2

Global Template Name: Agent G2 - Linux - Nvidia GPU Monitoring - v2

Supported DCGM Version: 3.1.7

Supported Agent Version: 16.0.0

Configuration Parameters

| Name | Description | Default Value |
|---|---|---|
| Namespace | Namespace in which the DCGM exporter is running | gpu-operator |
| Port | Port on which metrics are exported | 9400 |

Collected Metrics

| Monitor Name | Display Name | Description |
|---|---|---|
| nvidia_dcgm_fi_dev_sm_clock | Nvidia Dcgm Fi Dev Sm Clock | SM clock frequency |
| nvidia_dcgm_fi_dev_total_energy_consumption | Nvidia Dcgm Fi Dev Total Energy Consumption | Total energy consumption since boot (in mJ) |
| nvidia_dcgm_fi_dev_pcie_replay_counter | Nvidia Dcgm Fi Dev Pcie Replay Counter | Total number of PCIe retries |
| nvidia_dcgm_fi_dev_xid_errors | Nvidia Dcgm Fi Dev Xid Errors | Value of the last XID error encountered |
| nvidia_dcgm_fi_dev_nvlink_bandwidth_total | Nvidia Dcgm Fi Dev Nvlink Bandwidth Total | Total number of NVLink bandwidth counters for all lanes |
| nvidia_dcgm_fi_dev_vgpu_license_status | Nvidia Dcgm Fi Dev Vgpu License Status | vGPU license status |
| nvidia_dcgm_fi_prof_gr_engine_active | Nvidia Dcgm Fi Prof Gr Engine Active | Ratio of time the graphics engine is active (in %) |
| nvidia_dcgm_fi_prof_pipe_tensor_active | Nvidia Dcgm Fi Prof Pipe Tensor Active | Ratio of cycles the tensor (HMMA) pipe is active (in %) |
| nvidia_dcgm_fi_prof_dram_active | Nvidia Dcgm Fi Prof Dram Active | Ratio of cycles the device memory interface is active sending or receiving data (in %) |
| nvidia_dcgm_fi_prof_pcie_tx_bytes | Nvidia Dcgm Fi Prof Pcie Tx Bytes | Rate of data transmitted over the PCIe bus, including both protocol headers and data payloads (in bytes per second) |
| nvidia_dcgm_fi_prof_pcie_rx_bytes | Nvidia Dcgm Fi Prof Pcie Rx Bytes | Rate of data received over the PCIe bus, including both protocol headers and data payloads (in bytes per second) |
| nvidia_dcgm_fi_dev_fb_free | Nvidia Dcgm Fi Dev Fb Free | Framebuffer memory free (in MiB) |
| nvidia_dcgm_fi_dev_fb_used | Nvidia Dcgm Fi Dev Fb Used | Framebuffer memory used (in MiB) |
| nvidia_dcgm_power_usage | Nvidia Dcgm Power Usage | Power draw |
| nvidia_dcgm_mem_clock | Nvidia Dcgm Mem Clock Freq | Memory clock frequency |
| nvidia_dcgm_mem_copy_util | Nvidia Dcgm Mem Util | Memory utilization |
| nvidia_dcgm_fb_mem_used | Nvidia Dcgm Framebuffer Memory Used | Framebuffer memory used |
| nvidia_dcgm_gpu_temp | Nvidia Dcgm Gpu Temp | GPU temperature |
| nvidia_dcgm_memory_temp | Nvidia Dcgm Memory Temp | Memory temperature |
| nvidia_dcgm_gpu_util | Nvidia Dcgm Gpu Util | GPU utilization |
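Some of the v2 metrics are more useful in combination than alone. The sketch below shows two derived values that could be computed from them: an approximate framebuffer usage percentage from the used and free values (in MiB), and average power over an interval from two readings of the cumulative energy counter (in mJ). The function names are hypothetical and this is not something the monitor itself is documented to compute. It also assumes used plus free approximates total framebuffer memory, which ignores any reserved memory.

```python
from typing import Optional


def fb_used_percent(fb_used_mib: float, fb_free_mib: float) -> float:
    """Approximate framebuffer usage in percent from used and free MiB.

    Assumption: used + free is treated as the total; reserved memory,
    if any, is not accounted for.
    """
    total = fb_used_mib + fb_free_mib
    if total <= 0:
        return 0.0
    return 100.0 * fb_used_mib / total


def average_power_watts(
    energy_start_mj: float, energy_end_mj: float, elapsed_seconds: float
) -> Optional[float]:
    """Average power in watts between two energy-counter readings.

    The counter is cumulative since boot and reported in millijoules, so
    watts = (delta mJ / 1000) / seconds. Returns None when the counter went
    backwards, which would suggest a reset between the two readings.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta_mj = energy_end_mj - energy_start_mj
    if delta_mj < 0:
        return None
    return delta_mj / 1000.0 / elapsed_seconds


if __name__ == "__main__":
    # Illustrative readings only, not measured values.
    print(fb_used_percent(2048, 6144))
    print(average_power_watts(1_000_000, 4_000_000, 60))
```

Comparing the energy-derived average with nvidia_dcgm_power_usage over the same window can serve as a sanity check, since the latter is an instantaneous reading.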