This page lists the metrics collected by the NVIDIA Bright Cluster Manager integration and describes the default monitoring configuration applied when the integration is installed. Use this information to understand what is monitored out of the box and how to customize thresholds, templates, and alert behaviour for your environment.
Supported Metrics
The metrics collected by the integration are grouped into logical categories based on the monitored component type. Each category is represented as a tabbed table that displays the native type, metric names, display labels, units, supported versions, and descriptions.
| Native Type | Metric Name | Display Name | Units | Version | Description | Metric Group |
|---|---|---|---|---|---|---|
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_replyMisses | nvidia_bcm_linuxServer_nfs_server_replyMisses | count | 5.0.0 | Returns NFS Server Reply Miss on LinuxServer | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_rpcBadCalls | nvidia_bcm_linuxServer_nfs_server_rpcBadCalls | count | 5.0.0 | Returns NFS Server Reply RPC BadCalls | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_rpcBadAuth | nvidia_bcm_linuxServer_nfs_server_rpcBadAuth | count | 5.0.0 | Returns NFS Server Reply RPC BadAuth | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_rpcBadClnt | nvidia_bcm_linuxServer_nfs_server_rpcBadClnt | count | 5.0.0 | Returns NFS Server Reply RPC BadClnt | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_fileStale | nvidia_bcm_linuxServer_nfs_server_fileStale | count | 5.0.0 | Returns NFS Server File Stales | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_healthStatus | nvidia_bcm_headNode_healthStatus | NA | 1.0.0 | It monitors the health status of each component available. States: PASS-0, FAIL-1, UNKNOWN-2 | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_errorsRecv | nvidia_bcm_headNode_errorsRecv | Errors per Sec | 1.0.0 | It monitors Errors received per sec | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_errorsSent | nvidia_bcm_headNode_errorsSent | Errors per Sec | 1.0.0 | It monitors the Errors sent | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_hardwareCorruptedMemory | nvidia_bcm_headNode_hardwareCorruptedMemory | Bytes | 1.0.0 | It monitors the Hardware Corrupted Memory in Bytes | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_replyMisses | nvidia_bcm_headNode_nfs_server_replyMisses | count | 4.0.0 | Returns NFS Server Reply Miss on HeadNode | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_rpcBadAuth | nvidia_bcm_headNode_nfs_server_rpcBadAuth | NA | 4.0.0 | Returns NFS Server Reply RPC BadAuth | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_rpcBadCalls | nvidia_bcm_headNode_nfs_server_rpcBadCalls | count | 4.0.0 | Returns NFS Server Reply RPC BadCalls | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_rpcBadClnt | nvidia_bcm_headNode_nfs_server_rpcBadClnt | NA | 4.0.0 | Returns NFS Server Reply RPC BadClnt | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_healthStatus | nvidia_bcm_physicalNode_healthStatus | NA | 1.0.0 | It monitors the health status of each component available. States: PASS-0, FAIL-1, UNKNOWN-2 | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_blockedProcesses | nvidia_bcm_physicalNode_blockedProcesses | count | 1.0.0 | It monitors the count of Blocked Processes | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_errorsRecv | nvidia_bcm_physicalNode_errorsRecv | Errors per Sec | 1.0.0 | It monitors the Errors Received Per Sec | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_errorsSent | nvidia_bcm_physicalNode_errorsSent | Errors per Sec | 1.0.0 | It monitors the Errors Sent Per Sec | |
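The healthStatus metrics report a numeric state (PASS-0, FAIL-1, UNKNOWN-2). The following is a minimal sketch of how a consumer of these values might turn a raw sample into a readable state. The `HealthStatus` and `decode_health_status` names are hypothetical, and mapping unexpected values to UNKNOWN is an assumption of this sketch rather than documented integration behaviour.

```python
from enum import IntEnum


class HealthStatus(IntEnum):
    """States listed in the healthStatus metric descriptions."""
    PASS = 0
    FAIL = 1
    UNKNOWN = 2


def decode_health_status(value):
    """Map a raw metric sample to a HealthStatus member.

    Assumption: any value that is missing, non-numeric, or outside 0-2
    is treated as UNKNOWN instead of raising an error.
    """
    try:
        return HealthStatus(int(value))
    except (TypeError, ValueError):
        return HealthStatus.UNKNOWN


if __name__ == "__main__":
    for sample in (0, 1, 2, 7, None):
        print(sample, decode_health_status(sample).name)
```

Using an `IntEnum` keeps the numeric value available for threshold comparisons while giving alert messages a readable label.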
| Native Type | Metric Name | Display Name | Units | Version | Description | Metric Group |
|---|---|---|---|---|---|---|
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_gpuUnits_total | nvidia_bcm_cluster_gpuUnits_total | count | 1.0.0 | It monitors the count of total GPU units | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_nodesTotal | nvidia_bcm_cluster_nodesTotal | count | 1.0.0 | It monitors the count of total number of nodes | |
| Native Type | Metric Name | Display Name | Units | Version | Description | Metric Group |
|---|---|---|---|---|---|---|
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_packets | nvidia_bcm_linuxServer_nfs_server_packets | count | 5.0.0 | Returns NFS Server Packets | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_replyHits | nvidia_bcm_linuxServer_nfs_server_replyHits | count | 5.0.0 | Returns NFS Server Reply Hits on LinuxServer | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_systemCpuTime | nvidia_bcm_headNode_systemCpuTime | jiffies | 1.0.0 | It monitors the System CPU Time | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_cpuWaitTime | nvidia_bcm_headNode_cpuWaitTime | jiffies | 1.0.0 | It monitors the System CPU Wait Time | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_client_rpcRetrans | nvidia_bcm_headNode_nfs_client_rpcRetrans | count | 4.0.0 | Returns NFS Client RPC Retrans | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_fileStale | nvidia_bcm_headNode_nfs_server_fileStale | count | 4.0.0 | Returns NFS Server File Stales | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_packets | nvidia_bcm_headNode_nfs_server_packets | count | 4.0.0 | Returns NFS Server Packets | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_replyHits | nvidia_bcm_headNode_nfs_server_replyHits | count | 4.0.0 | Returns NFS Server Reply Hits on HeadNode | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_nfs_server_rpcXdrCall | nvidia_bcm_headNode_nfs_server_rpcXdrCall | NA | 4.0.0 | Returns NFS Server Reply RPC XDR Call | |
| NVIDIA BCM Virtual Node | nvidia_bcm_headNode_nfs_server_rpcXdrCall | nvidia_bcm_headNode_nfs_server_rpcXdrCall | NA | 4.0.0 | Returns NFS Server Reply RPC XDR Call | |
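The systemCpuTime and cpuWaitTime metrics are reported in jiffies. On Linux, CPU accounting counters are expressed in clock ticks, commonly 100 per second. The sketch below converts a jiffies value to seconds under the assumption that these metrics use the same tick unit. The function name `jiffies_to_seconds` and its parameter are hypothetical, so verify the tick rate on the monitored node before relying on the result.

```python
import os


def jiffies_to_seconds(jiffies, ticks_per_second=None):
    """Convert a jiffies counter to seconds.

    ticks_per_second defaults to the SC_CLK_TCK value of the machine
    running this code. Pass the monitored node's value explicitly if it
    differs (assumption: the metric uses the Linux clock-tick unit).
    """
    if ticks_per_second is None:
        ticks_per_second = os.sysconf("SC_CLK_TCK")
    if ticks_per_second <= 0:
        raise ValueError("ticks_per_second must be positive")
    return jiffies / ticks_per_second


if __name__ == "__main__":
    # 250 jiffies at 100 ticks per second is 2.5 seconds.
    print(jiffies_to_seconds(250, ticks_per_second=100))
```

Passing the tick rate explicitly keeps the conversion reproducible when the script runs on a different host than the one being monitored.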
| Native Type | Metric Name | Display Name | Units | Version | Description | Metric Group |
|---|---|---|---|---|---|---|
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_smartHdaTemp | nvidia_bcm_cluster_smartHdaTemp | Celsius | 2.0.0 | It monitors the temperature of spindle disks | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_gpu_temperature | nvidia_bcm_headNode_gpu_temperature | Celsius | 1.0.0 | GPU temperature | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_gpu_temperature | nvidia_bcm_physicalNode_gpu_temperature | Celsius | 1.0.0 | GPU temperature |
| Native Type | Metric Name | Display Name | Units | Version | Description | Metric Group |
|---|---|---|---|---|---|---|
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_cpuIdle | nvidia_bcm_cluster_cpuIdle | % | 1.0.0 | It monitors the % of CPU Idle Time | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_cpuUtilization | nvidia_bcm_cluster_cpuUtilization | % | 1.0.0 | It monitors the % of CPU Utilization | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_totalUsersLogin | nvidia_bcm_cluster_totalUsersLogin | count | 1.0.0 | It monitors the count of total users login | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_totalKnownUsers | nvidia_bcm_cluster_totalKnownUsers | count | 1.0.0 | It monitors the count of total known users | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_occupationRate | nvidia_bcm_cluster_occupationRate | % | 1.0.0 | It monitors the Occupation rate in % | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_queuedJobs | nvidia_bcm_cluster_queuedJobs | count | 1.0.0 | Returns number of jobs in queue on the cluster | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_runningJobs | nvidia_bcm_cluster_runningJobs | count | 1.0.0 | Returns number of jobs running on the cluster | |
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_uniqueUserLogincount | nvidia_bcm_cluster_uniqueUserLogincount | count | 1.0.0 | Returns number of unique users logged in to the cluster | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_memory_utilization | nvidia_bcm_linuxServer_memory_utilization | % | 5.0.0 | It monitors the percentage memory utilization of Linux server | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_disk_utilization | nvidia_bcm_linuxServer_disk_utilization | % | 5.0.0 | It monitors the percentage disk utilization of Linux server | |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_cpu_utilization | nvidia_bcm_linuxServer_cpu_utilization | % | 5.0.0 | It monitors the percentage CPU utilization of Linux server | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_blockedProcesses | nvidia_bcm_headNode_blockedProcesses | count | 1.0.0 | It monitors the count of blocked processes | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_memoryFree | nvidia_bcm_headNode_memoryFree | GB | 1.0.0 | It monitors the free memory in GB | |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_gpu_utilization | nvidia_bcm_headNode_gpu_utilization | % | 1.0.0 | Average GPU utilization percentage. | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_systemCpuTime | nvidia_bcm_physicalNode_systemCpuTime | jiffies | 1.0.0 | It monitors the System CPU Time | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_cpuWaitTime | nvidia_bcm_physicalNode_cpuWaitTime | jiffies | 1.0.0 | It monitors the CPU Wait Time | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_hardwareCorruptedMemory | nvidia_bcm_physicalNode_hardwareCorruptedMemory | Bytes | 1.0.0 | It monitors the Hardware Corrupted Memory in Bytes | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_memoryFree | nvidia_bcm_physicalNode_memoryFree | GB | 1.0.0 | It monitors the Free Memory in GB | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_gpu_utilization | nvidia_bcm_physicalNode_gpu_utilization | % | 1.0.0 | GPU utilization percentage. | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_nfsmount_totalSize | nvidia_bcm_physicalNode_nfsmount_totalSize | GB | 3.0.0 | It monitors the total size of the NFS mount file on the node | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_nfsmount_usedSize | nvidia_bcm_physicalNode_nfsmount_usedSize | GB | 3.0.0 | It monitors the used size of the NFS mount file on the node | |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_nfsmount_utilization | nvidia_bcm_physicalNode_nfsmount_utilization | % | 3.0.0 | It monitors the percentage utilization of the NFS mount file on the node | |
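The three nfsmount metrics (totalSize, usedSize, utilization) are related. As a rough cross-check, the sketch below derives a utilization percentage from the two size metrics. It assumes utilization is simply used size divided by total size, which this page does not state, and the function name `nfs_mount_utilization` is hypothetical.

```python
def nfs_mount_utilization(used_gb, total_gb):
    """Return used/total as a percentage rounded to two decimals.

    Returns None when the total size is zero or negative, because a
    percentage is undefined in that case (assumption of this sketch).
    """
    if total_gb <= 0:
        return None
    return round(100.0 * used_gb / total_gb, 2)


if __name__ == "__main__":
    # Example values are illustrative, not taken from a real cluster.
    print(nfs_mount_utilization(250, 1000))
```

Comparing this derived value with the reported nfsmount_utilization sample can help confirm that a cloned template's threshold is set on the scale you expect.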
| Native Type | Metric Name | Display Name | Units | Version | Description | Metric Group |
|---|---|---|---|---|---|---|
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_uptime | nvidia_bcm_cluster_uptime | h | | It monitors the cluster uptime (in hours) | |
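The metric names in the tables above appear to share an `nvidia_bcm_<component>_<metric>` pattern, where the component is `cluster`, `headNode`, `physicalNode`, or `linuxServer`. The sketch below groups metric names by that component segment, which can be handy when building per-component dashboards. The pattern is inferred from the tables, not a documented contract, and the helper names are hypothetical.

```python
from collections import defaultdict

PREFIX = "nvidia_bcm_"


def component_of(metric_name):
    """Return the segment that follows the nvidia_bcm_ prefix."""
    if not metric_name.startswith(PREFIX):
        raise ValueError(f"unexpected metric name: {metric_name}")
    return metric_name[len(PREFIX):].split("_", 1)[0]


def group_by_component(metric_names):
    """Group metric names into a dict keyed by component segment."""
    groups = defaultdict(list)
    for name in metric_names:
        groups[component_of(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    names = [
        "nvidia_bcm_cluster_uptime",
        "nvidia_bcm_headNode_gpu_temperature",
        "nvidia_bcm_headNode_memoryFree",
    ]
    for component, members in group_by_component(names).items():
        print(component, members)
```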
Default Monitoring Configurations
The NVIDIA Bright Cluster Manager integration provides default monitoring components that include:
- Global Device Management Policies
- Global Templates
- Global Monitors
These configurations are applied automatically upon installation. Customize them for your environment by cloning the global versions and modifying thresholds or alerting behaviour.
Apply these customizations before installing the application to prevent unnecessary alerts and reduce noise.
Default Global Device Management Policies available
OpsRamp provides a Global Device Management Policy for each Native Type of NVIDIA Bright Cluster Manager. To find these policies, go to Setup -> Resources -> Device Management Policies and search in the global scope using the suggested names. Each Device Management Policy follows this naming convention: {appName nativeType - version}
Ex: NVIDIA BrightCluster Manager NVIDIA BCM Head Node (i.e., appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)
Default Global Templates available
OpsRamp provides a Global Template for each Native Type of NVIDIA Bright Cluster Manager. To find these templates, go to Setup -> Monitoring -> Templates and search in the global scope using the suggested names. Each template follows this naming convention: {appName nativeType 'Template' - version}
Ex: nvidia-bright-cluster-manager NVIDIA BCM Head Node Template - 1 (i.e., appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)
Default Global Monitors available
OpsRamp provides a Global Monitor for each Native Type that has monitoring support. To find these monitors, go to Setup -> Monitoring -> Monitors and search in the global scope using the suggested names. Each monitor follows this naming convention: {monitorKey appName nativeType - version}
Ex: NVIDIA BCM Head Node Monitor nvidia-bright-cluster-manager NVIDIA BCM Head Node Container 1 (i.e., monitorKey = NVIDIA BCM Head Node Monitor, appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)
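The three naming conventions above can be sketched as small string builders, which may help when generating search terms for many native types at once. The function names are hypothetical. The printed examples for policies and monitors do not match the stated patterns character for character, so treat the generated strings as search hints and confirm the exact names in the OpsRamp UI.

```python
def policy_name(app_name, native_type, version):
    """Build a name following {appName nativeType - version}."""
    return f"{app_name} {native_type} - {version}"


def template_name(app_name, native_type, version):
    """Build a name following {appName nativeType 'Template' - version}."""
    return f"{app_name} {native_type} Template - {version}"


def monitor_name(monitor_key, app_name, native_type, version):
    """Build a name following {monitorKey appName nativeType - version}."""
    return f"{monitor_key} {app_name} {native_type} - {version}"


if __name__ == "__main__":
    app = "nvidia-bright-cluster-manager"
    native = "NVIDIA BCM Head Node"
    print(policy_name(app, native, 1))
    print(template_name(app, native, 1))
    print(monitor_name("NVIDIA BCM Head Node Monitor", app, native, 1))
```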