This page lists the metrics collected by the NVIDIA Bright Cluster Manager integration and describes the default monitoring configuration applied when the integration is installed.
Use this information to understand what is monitored out-of-the-box and how to customize thresholds, templates, and alert behaviour for your environment.

Supported Metrics

The metrics collected by the integration are grouped into logical categories based on the monitored component type. Each category is represented as a tabbed table that displays the native type, metric names, display labels, units, supported versions, and descriptions.

Availability
Capacity
Performance
Thermal
Usage
No Category

Availability

| Native Type | Metric Name | Display Name | Units | Version | Description | Metric Group |
| --- | --- | --- | --- | --- | --- | --- |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_replyMisses | nvidia_bcm_linuxServer_nfs_server_replyMisses | count | 5.0.0 | Returns NFS Server Reply Miss on LinuxServer | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_rpcBadCalls | nvidia_bcm_linuxServer_nfs_server_rpcBadCalls | count | 5.0.0 | Returns NFS Server Reply RPC BadCalls | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_rpcBadAuth | nvidia_bcm_linuxServer_nfs_server_rpcBadAuth | count | 5.0.0 | Returns NFS Server Reply RPC BadAuth | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_rpcBadClnt | nvidia_bcm_linuxServer_nfs_server_rpcBadClnt | count | 5.0.0 | Returns NFS Server Reply RPC BadClnt | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_fileStale | nvidia_bcm_linuxServer_nfs_server_fileStale | count | 5.0.0 | Returns NFS Server File Stales | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_healthStatus | nvidia_bcm_headNode_healthStatus | NA | 1.0.0 | It monitors the health status of each component available. States: PASS-0, FAIL-1, UNKNOWN-2 | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_errorsRecv | nvidia_bcm_headNode_errorsRecv | Errors per Sec | 1.0.0 | It monitors Errors received per sec | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_errorsSent | nvidia_bcm_headNode_errorsSent | Errors per Sec | 1.0.0 | It monitors the Errors sent | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_hardwareCorruptedMemory | nvidia_bcm_headNode_hardwareCorruptedMemory | Bytes | 1.0.0 | It monitors the Hardware Corrupted Memory in Bytes | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_replyMisses | nvidia_bcm_headNode_nfs_server_replyMisses | count | 4.0.0 | Returns NFS Server Reply Miss on HeadNode | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_rpcBadAuth | nvidia_bcm_headNode_nfs_server_rpcBadAuth | NA | 4.0.0 | Returns NFS Server Reply RPC BadAuth | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_rpcBadCalls | nvidia_bcm_headNode_nfs_server_rpcBadCalls | count | 4.0.0 | Returns NFS Server Reply RPC BadCalls | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_rpcBadClnt | nvidia_bcm_headNode_nfs_server_rpcBadClnt | NA | 4.0.0 | Returns NFS Server Reply RPC BadClnt | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_healthStatus | nvidia_bcm_physicalNode_healthStatus | NA | 1.0.0 | It monitors the health status of each component available. States: PASS-0, FAIL-1, UNKNOWN-2 | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_blockedProcesses | nvidia_bcm_physicalNode_blockedProcesses | count | 1.0.0 | It monitors the count of Blocked Processes | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_errorsRecv | nvidia_bcm_physicalNode_errorsRecv | Errors per Sec | 1.0.0 | It monitors the Errors Received Per Sec | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_errorsSent | nvidia_bcm_physicalNode_errorsSent | Errors per Sec | 1.0.0 | It monitors the Errors Sent Per Sec | |

Capacity

| Native Type | Metric Name | Display Name | Units | Version | Description | Metric Group |
| --- | --- | --- | --- | --- | --- | --- |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_gpuUnits_total | nvidia_bcm_cluster_gpuUnits_total | count | 1.0.0 | It monitors the count of total GPU units | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_nodesTotal | nvidia_bcm_cluster_nodesTotal | count | 1.0.0 | It monitors the count of total number of nodes | |

Performance

| Native Type | Metric Name | Display Name | Units | Version | Description | Metric Group |
| --- | --- | --- | --- | --- | --- | --- |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_packets | nvidia_bcm_linuxServer_nfs_server_packets | count | 5.0.0 | Returns NFS Server Packets | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_replyHits | nvidia_bcm_linuxServer_nfs_server_replyHits | count | 5.0.0 | Returns NFS Server Reply Hits on LinuxServer | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_systemCpuTime | nvidia_bcm_headNode_systemCpuTime | jiffies | 1.0.0 | It monitors the System CPU Time | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_cpuWaitTime | nvidia_bcm_headNode_cpuWaitTime | jiffies | 1.0.0 | It monitors the System CPU Wait Time | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_client_rpcRetrans | nvidia_bcm_headNode_nfs_client_rpcRetrans | count | 4.0.0 | Returns NFS Client RPC Retrans | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_fileStale | nvidia_bcm_headNode_nfs_server_fileStale | count | 4.0.0 | Returns NFS Server File Stales | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_packets | nvidia_bcm_headNode_nfs_server_packets | count | 4.0.0 | Returns NFS Server Packets | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_replyHits | nvidia_bcm_headNode_nfs_server_replyHits | count | 4.0.0 | Returns NFS Server Reply Hits on HeadNode | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_rpcXdrCall | nvidia_bcm_headNode_nfs_server_rpcXdrCall | NA | 4.0.0 | Returns NFS Server Reply RPC XDR Call | |
| NVIDIA BCM Virtual Node | nvidia_bcm_headNode_nfs_server_rpcXdrCall | nvidia_bcm_headNode_nfs_server_rpcXdrCall | NA | 4.0.0 | Returns NFS Server Reply RPC XDR Call | |

Thermal

| Native Type | Metric Name | Display Name | Units | Version | Description | Metric Group |
| --- | --- | --- | --- | --- | --- | --- |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_smartHdaTemp | nvidia_bcm_cluster_smartHdaTemp | Celsius | 2.0.0 | It monitors the temperature of spindle disks | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_gpu_temperature | nvidia_bcm_headNode_gpu_temperature | Celsius | 1.0.0 | GPU temperature | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_gpu_temperature | nvidia_bcm_physicalNode_gpu_temperature | Celsius | 1.0.0 | GPU temperature | |

Usage

| Native Type | Metric Name | Display Name | Units | Version | Description | Metric Group |
| --- | --- | --- | --- | --- | --- | --- |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_cpuIdle | nvidia_bcm_cluster_cpuIdle | % | 1.0.0 | It monitors the % of CPU Idle Time | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_cpuUtilization | nvidia_bcm_cluster_cpuUtilization | % | 1.0.0 | It monitors the % of CPU Utilization | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_totalUsersLogin | nvidia_bcm_cluster_totalUsersLogin | count | 1.0.0 | It monitors the count of total users login | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_totalKnownUsers | nvidia_bcm_cluster_totalKnownUsers | count | 1.0.0 | It monitors the count of total known users | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_occupationRate | nvidia_bcm_cluster_occupationRate | % | 1.0.0 | It monitors the Occupation rate in % | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_queuedJobs | nvidia_bcm_cluster_queuedJobs | count | 1.0.0 | Returns number of jobs in queue on the cluster | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_runningJobs | nvidia_bcm_cluster_runningJobs | count | 1.0.0 | Returns number of jobs running on the cluster | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_uniqueUserLogincount | nvidia_bcm_cluster_uniqueUserLogincount | count | 1.0.0 | Returns number of unique users logged in to the cluster | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_memory_utilization | nvidia_bcm_linuxServer_memory_utilization | % | 5.0.0 | It monitors the percentage memory utilization of Linux server | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_disk_utilization | nvidia_bcm_linuxServer_disk_utilization | % | 5.0.0 | It monitors the percentage disk utilization of Linux server | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_cpu_utilization | nvidia_bcm_linuxServer_cpu_utilization | % | 5.0.0 | It monitors the percentage CPU utilization of Linux server | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_blockedProcesses | nvidia_bcm_headNode_blockedProcesses | count | 1.0.0 | It monitors the count of blocked processes | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_memoryFree | nvidia_bcm_headNode_memoryFree | GB | 1.0.0 | It monitors the free memory in GB | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_gpu_utilization | nvidia_bcm_headNode_gpu_utilization | % | 1.0.0 | Average GPU utilization percentage. | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_systemCpuTime | nvidia_bcm_physicalNode_systemCpuTime | jiffies | 1.0.0 | It monitors the System CPU Time | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_cpuWaitTime | nvidia_bcm_physicalNode_cpuWaitTime | jiffies | 1.0.0 | It monitors the CPU Wait Time | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_hardwareCorruptedMemory | nvidia_bcm_physicalNode_hardwareCorruptedMemory | Bytes | 1.0.0 | It monitors the Hardware Corrupted Memory in Bytes | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_memoryFree | nvidia_bcm_physicalNode_memoryFree | GB | 1.0.0 | It monitors the Free Memory in GB | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_gpu_utilization | nvidia_bcm_physicalNode_gpu_utilization | % | 1.0.0 | GPU utilization percentage. | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_nfsmount_totalSize | nvidia_bcm_physicalNode_nfsmount_totalSize | GB | 3.0.0 | It monitors the total size of the NFS mount file on the node | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_nfsmount_usedSize | nvidia_bcm_physicalNode_nfsmount_usedSize | GB | 3.0.0 | It monitors the used size of the NFS mount file on the node | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_nfsmount_utilization | nvidia_bcm_physicalNode_nfsmount_utilization | % | 3.0.0 | It monitors the percentage utilization of the NFS mount file on the node | |

No Category

| Native Type | Metric Name | Display Name | Units | Version | Description | Metric Group |
| --- | --- | --- | --- | --- | --- | --- |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_uptime | nvidia_bcm_cluster_uptime | h | | It monitors the cluster uptime (in hours) | |
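
The healthStatus metrics listed for the Head Node and Physical Node native types report a numeric state (PASS-0, FAIL-1, UNKNOWN-2). When post-processing exported samples, a small decoder can make those values readable. The following Python is a minimal sketch only: the `HealthStatus` enum and the `describe_health` helper are hypothetical names introduced here for illustration and are not part of the integration, and treating out-of-range values as UNKNOWN is an assumption.

```python
from enum import IntEnum


class HealthStatus(IntEnum):
    """Numeric states documented for the *_healthStatus metrics."""
    PASS = 0
    FAIL = 1
    UNKNOWN = 2


def describe_health(raw_value):
    """Return the state name for a raw healthStatus sample.

    Assumption: any value outside the documented set (or a value that
    cannot be read as an integer) is reported as UNKNOWN.
    """
    try:
        return HealthStatus(int(raw_value)).name
    except (TypeError, ValueError):
        return HealthStatus.UNKNOWN.name


if __name__ == "__main__":
    for sample in (0, "1", 2, 7):
        print(sample, "->", describe_health(sample))
```

A helper like this is convenient in report scripts, where a state name is easier to scan than a bare 0, 1, or 2.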

Default Monitoring Configurations

The NVIDIA Bright Cluster Manager integration provides default monitoring components that include:

  • Global Device Management Policies
  • Global Templates
  • Global Monitors

These configurations are applied automatically upon installation. Customize them for your environment by cloning the global versions and modifying thresholds or alerting behaviour.

We recommend completing this customization before installing the application to prevent unnecessary alerts and reduce noise.

Default Global Device Management Policies available

OpsRamp provides a Global Device Management Policy for each Native Type of NVIDIA Bright Cluster Manager. To find these policies, go to Setup -> Resources -> Device Management Policies and search the global scope for the suggested names. Each Device Management Policy follows this naming convention:
{appName nativeType - version}
Example: NVIDIA BrightCluster Manager NVIDIA BCM Head Node (that is, appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)

Default Global Templates available

OpsRamp provides a Global Template for each Native Type of NVIDIA Bright Cluster Manager. To find these templates, go to Setup -> Monitoring -> Templates and search the global scope for the suggested names. Each template follows this naming convention:
{appName nativeType 'Template' - version}
Example: nvidia-bright-cluster-manager NVIDIA BCM Head Node Template - 1 (that is, appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)

Default Global Monitors available

OpsRamp provides a Global Monitor for each Native Type that has monitoring support. To find these monitors, go to Setup -> Monitoring -> Monitors and search the global scope for the suggested names. Each monitor follows this naming convention:
{monitorKey appName nativeType - version}
Example: NVIDIA BCM Head Node Monitor nvidia-bright-cluster-manager NVIDIA BCM Head Node Container 1 (that is, monitorKey = NVIDIA BCM Head Node Monitor, appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)
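
When searching for many native types at once, it can help to generate the expected names from the three naming conventions above. The Python below is a minimal sketch that reads each pattern literally; the function names are hypothetical and are not an OpsRamp API. Note that only the template example on this page matches its pattern exactly, so treat the generated policy and monitor names as search hints rather than guaranteed exact matches.

```python
def policy_name(app_name, native_type, version):
    """Literal reading of {appName nativeType - version}."""
    return f"{app_name} {native_type} - {version}"


def template_name(app_name, native_type, version):
    """Literal reading of {appName nativeType 'Template' - version}."""
    return f"{app_name} {native_type} Template - {version}"


def monitor_name(monitor_key, app_name, native_type, version):
    """Literal reading of {monitorKey appName nativeType - version}."""
    return f"{monitor_key} {app_name} {native_type} - {version}"


if __name__ == "__main__":
    app = "nvidia-bright-cluster-manager"
    native = "NVIDIA BCM Head Node"
    print(policy_name(app, native, 1))
    print(template_name(app, native, 1))
    print(monitor_name("NVIDIA BCM Head Node Monitor", app, native, 1))
```

Pasting a generated string (or just its native type portion) into the search box under global scope should narrow the list to the matching policy, template, or monitor.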
