Supported Metrics
This page lists the metrics collected by the NVIDIA Bright Cluster Manager integration and describes the default monitoring configuration applied when the integration is installed. Use this information to understand what is monitored out of the box and how to customize thresholds, templates, and alert behaviour for your environment.
The metrics collected by the integration are grouped into logical categories based on the monitored component type. Each category is shown in the table below, which displays the native type, metric names, display labels, units, descriptions, and supported application versions.
| Native Type | Metric Name | Display Name | Units | Description | Application Version |
|---|---|---|---|---|---|
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_nodeAvailabilityStatus | nvidia bcm cluster nodeAvailabilityStatus | NA | It monitors the availability status of each node. Possible status values: UP-0, Down-1, Closed-2, installing-3, installer_failed-4, installer_rebooting-5, installer_callinginit-6, installer_unreachable-7, installer_burning-8, burning-9, unknown-10, opening-11, going_down-12, pending-13, no data-14 (see the decoding sketch after this table). | 1.0.0 |
| | nvidia_bcm_cluster_slurmDaemon_status | nvidia bcm cluster slurmDaemon status | NA | It monitors the running status of slurmDaemon and the allocation status of each node. Possible states: allocated-0, idle-1, down-2. | 1.0.0 |
| | nvidia_bcm_cluster_gpuUnits_closed | nvidia bcm cluster gpuUnits closed | count | It monitors the count of GPU units in the closed state. | 1.0.0 |
| | nvidia_bcm_cluster_gpuUnits_down | nvidia bcm cluster gpuUnits down | count | It monitors the count of GPU units in the down state. | 1.0.0 |
| | nvidia_bcm_cluster_gpuUnits_total | nvidia bcm cluster gpuUnits total | count | It monitors the total count of GPU units. | 1.0.0 |
| | nvidia_bcm_cluster_gpuUnits_up | nvidia bcm cluster gpuUnits up | count | It monitors the count of GPU units in the up state. | 1.0.0 |
| | nvidia_bcm_cluster_cpuIdle | nvidia bcm cluster cpuIdle | % | It monitors the percentage of CPU idle time. | 1.0.0 |
| | nvidia_bcm_cluster_nodesTotal | nvidia bcm cluster nodesTotal | count | It monitors the total number of nodes. | 1.0.0 |
| | nvidia_bcm_cluster_nodesUp | nvidia bcm cluster nodesUp | count | It monitors the count of nodes in the UP state. | 1.0.0 |
| | nvidia_bcm_cluster_nodesDown | nvidia bcm cluster nodesDown | count | It monitors the count of nodes in the DOWN state. | 1.0.0 |
| | nvidia_bcm_cluster_nodesClosed | nvidia bcm cluster nodesClosed | count | It monitors the count of nodes in the closed state. | 1.0.0 |
| | nvidia_bcm_cluster_devicesTotal | nvidia bcm cluster devicesTotal | count | It monitors the total count of devices. | 1.0.0 |
| | nvidia_bcm_cluster_devicesUp | nvidia bcm cluster devicesUp | count | It monitors the count of devices in the UP state. | 1.0.0 |
| | nvidia_bcm_cluster_devicesDown | nvidia bcm cluster devicesDown | count | It monitors the count of devices in the DOWN state. | 1.0.0 |
| | nvidia_bcm_cluster_coresUp | nvidia bcm cluster coresUp | count | It monitors the count of cores in the UP state. | 1.0.0 |
| | nvidia_bcm_cluster_coresTotal | nvidia bcm cluster coresTotal | count | It monitors the total count of cores. | 1.0.0 |
| | nvidia_bcm_cluster_totalUsersLogin | nvidia bcm cluster totalUsersLogin | count | It monitors the count of total user logins. | 1.0.0 |
| | nvidia_bcm_cluster_totalKnownUsers | nvidia bcm cluster totalKnownUsers | count | It monitors the count of total known users. | 1.0.0 |
| | nvidia_bcm_cluster_occupationRate | nvidia bcm cluster occupationRate | % | It monitors the occupation rate as a percentage. | 1.0.0 |
| | nvidia_bcm_cluster_cpuUtilization | nvidia bcm cluster cpuUtilization | % | It monitors the percentage of CPU utilization. | 1.0.0 |
| | nvidia_bcm_cluster_failedJobs | nvidia bcm cluster failedJobs | count | Returns the number of failed jobs on the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_queuedJobs | nvidia bcm cluster queuedJobs | count | Returns the number of jobs queued on the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_runningJobs | nvidia bcm cluster runningJobs | count | Returns the number of jobs running on the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_nodesInQueue | nvidia bcm cluster nodesInQueue | count | Returns the number of nodes queued on the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_uniqueUserLogincount | nvidia bcm cluster uniqueUserLogincount | count | Returns the number of unique users logged in to the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_smartHdaTemp | nvidia bcm cluster smartHdaTemp | Celsius | It monitors the temperature of spindle disks. | 2.0.0 |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_healthStatus | nvidia bcm headNode healthStatus | NA | Health status (PASS or FAIL) of each component. | 1.0.0 |
| | nvidia_bcm_headNode_blockedProcesses | nvidia bcm headNode blockedProcesses | count | Blocked processes waiting for I/O. | 1.0.0 |
| | nvidia_bcm_headNode_systemCpuTime | nvidia bcm headNode systemCpuTime | jiffies | CPU time spent in system mode. | 1.0.0 |
| | nvidia_bcm_headNode_cpuWaitTime | nvidia bcm headNode cpuWaitTime | jiffies | CPU time spent in I/O wait mode. | 1.0.0 |
| | nvidia_bcm_headNode_errorsRecv | nvidia bcm headNode errorsRecv | Errors per Sec | Packets/s received with errors. | 1.0.0 |
| | nvidia_bcm_headNode_errorsSent | nvidia bcm headNode errorsSent | Errors per Sec | Packets/s sent with errors. | 1.0.0 |
| | nvidia_bcm_headNode_hardwareCorruptedMemory | nvidia bcm headNode hardwareCorruptedMemory | Bytes | Hardware-corrupted memory detected by ECC. | 1.0.0 |
| | nvidia_bcm_headNode_memoryFree | nvidia bcm headNode memoryFree | GB | Free system memory. | 1.0.0 |
| | nvidia_bcm_headNode_gpu_utilization | nvidia bcm headNode gpu utilization | % | Average GPU utilization percentage. | 1.0.0 |
| | nvidia_bcm_headNode_gpu_temperature | nvidia bcm headNode gpu temperature | Celsius | GPU temperature. | 1.0.0 |
| | nvidia_bcm_headNode_nfs_client_rpcRetrans | nvidia bcm headNode nfs client rpcRetrans | count | Returns NFS Client RPC retransmissions. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_fileStale | nvidia bcm headNode nfs server fileStale | count | Returns NFS Server stale file handles. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_packets | nvidia bcm headNode nfs server packets | count | Returns NFS Server packets. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_replyHits | nvidia bcm headNode nfs server replyHits | count | Returns NFS Server reply hits on the Head Node. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_replyMisses | nvidia bcm headNode nfs server replyMisses | count | Returns NFS Server reply misses on the Head Node. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_rpcBadAuth | nvidia bcm headNode nfs server rpcBadAuth | count | Returns NFS Server RPC BadAuth calls. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_rpcBadCalls | nvidia bcm headNode nfs server rpcBadCalls | count | Returns NFS Server RPC BadCalls. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_rpcBadClnt | nvidia bcm headNode nfs server rpcBadClnt | count | Returns NFS Server RPC BadClnt calls. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_rpcXdrCall | nvidia bcm headNode nfs server rpcXdrCall | count | Returns NFS Server RPC XDR calls. | 4.0.0 |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_healthStatus | nvidia bcm physicalNode healthStatus | NA | Health status (PASS or FAIL) of each component. | 1.0.0 |
| | nvidia_bcm_physicalNode_blockedProcesses | nvidia bcm physicalNode blockedProcesses | count | Blocked processes waiting for I/O. | 1.0.0 |
| | nvidia_bcm_physicalNode_systemCpuTime | nvidia bcm physicalNode systemCpuTime | jiffies | CPU time spent in system mode. | 1.0.0 |
| | nvidia_bcm_physicalNode_cpuWaitTime | nvidia bcm physicalNode cpuWaitTime | jiffies | CPU time spent in I/O wait mode. | 1.0.0 |
| | nvidia_bcm_physicalNode_errorsRecv | nvidia bcm physicalNode errorsRecv | Errors per Sec | Packets/s received with errors. | 1.0.0 |
| | nvidia_bcm_physicalNode_errorsSent | nvidia bcm physicalNode errorsSent | Errors per Sec | Packets/s sent with errors. | 1.0.0 |
| | nvidia_bcm_physicalNode_hardwareCorruptedMemory | nvidia bcm physicalNode hardwareCorruptedMemory | Bytes | Hardware-corrupted memory detected by ECC. | 1.0.0 |
| | nvidia_bcm_physicalNode_memoryFree | nvidia bcm physicalNode memoryFree | GB | Free system memory. | 1.0.0 |
| | nvidia_bcm_physicalNode_gpu_utilization | nvidia bcm physicalNode gpu utilization | % | GPU utilization percentage. | 1.0.0 |
| | nvidia_bcm_physicalNode_gpu_temperature | nvidia bcm physicalNode gpu temperature | Celsius | GPU temperature. | 1.0.0 |
| | nvidia_bcm_physicalNode_nfsmount_totalSize | nvidia bcm physicalNode nfsmount totalSize | GB | It monitors the total size of the NFS mount on the node. | 3.0.0 |
| | nvidia_bcm_physicalNode_nfsmount_usedSize | nvidia bcm physicalNode nfsmount usedSize | GB | It monitors the used size of the NFS mount on the node. | 3.0.0 |
| | nvidia_bcm_physicalNode_nfsmount_utilization | nvidia bcm physicalNode nfsmount utilization | % | It monitors the percentage utilization of the NFS mount on the node. | 3.0.0 |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_packets | nvidia bcm linuxServer nfs server packets | count | Returns NFS Server packets. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_replyHits | nvidia bcm linuxServer nfs server replyHits | count | Returns NFS Server reply hits on the Linux server. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_replyMisses | nvidia bcm linuxServer nfs server replyMisses | count | Returns NFS Server reply misses on the Linux server. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_rpcBadCalls | nvidia bcm linuxServer nfs server rpcBadCalls | count | Returns NFS Server RPC BadCalls. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_rpcBadAuth | nvidia bcm linuxServer nfs server rpcBadAuth | count | Returns NFS Server RPC BadAuth calls. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_rpcBadClnt | nvidia bcm linuxServer nfs server rpcBadClnt | count | Returns NFS Server RPC BadClnt calls. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_fileStale | nvidia bcm linuxServer nfs server fileStale | count | Returns NFS Server stale file handles. | 5.0.0 |
| | nvidia_bcm_linuxServer_memory_utilization | nvidia bcm linuxServer memory utilization | % | It monitors the percentage memory utilization of the Linux server. | 5.0.0 |
| | nvidia_bcm_linuxServer_disk_utilization | nvidia bcm linuxServer disk utilization | % | It monitors the percentage disk utilization of the Linux server. | 5.0.0 |
| | nvidia_bcm_linuxServer_cpu_utilization | nvidia bcm linuxServer cpu utilization | % | It monitors the percentage CPU utilization of the Linux server. | 5.0.0 |
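The nodeAvailabilityStatus and slurmDaemon status metrics above report numeric codes rather than labels. Below is a minimal Python sketch of how those codes can be decoded when post-processing collected values; the mappings are copied directly from the metric descriptions in the table, while the helper function itself is purely illustrative and not part of the integration.

```python
# Decode the numeric codes reported by the enumerated cluster metrics
# into human-readable labels. Mappings come from the table above.

NODE_AVAILABILITY_STATUS = {
    0: "UP", 1: "Down", 2: "Closed", 3: "installing",
    4: "installer_failed", 5: "installer_rebooting",
    6: "installer_callinginit", 7: "installer_unreachable",
    8: "installer_burning", 9: "burning", 10: "unknown",
    11: "opening", 12: "going_down", 13: "pending", 14: "no data",
}

SLURM_DAEMON_STATUS = {0: "allocated", 1: "idle", 2: "down"}

def decode_status(metric_name: str, value: float) -> str:
    """Return the label for an enumerated metric value (illustrative helper)."""
    table = {
        "nvidia_bcm_cluster_nodeAvailabilityStatus": NODE_AVAILABILITY_STATUS,
        "nvidia_bcm_cluster_slurmDaemon_status": SLURM_DAEMON_STATUS,
    }.get(metric_name, {})
    return table.get(int(value), f"unrecognized code {value}")

if __name__ == "__main__":
    print(decode_status("nvidia_bcm_cluster_nodeAvailabilityStatus", 2))  # -> "Closed"
```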
Default Monitoring Configurations
Bright Cluster Manager has default Global Device Management Policies, Global Templates, Global Monitors, and Global Metrics in OpsRamp. You can customize these default monitoring configurations for your business use cases by cloning the respective Global Templates and Global Device Management Policies. OpsRamp recommends doing this before installing the application to avoid noisy alerts and unwanted data.
Default Global Device Management Policies
OpsRamp has a Global Device Management Policy for each Native Type of NVIDIA Bright Cluster Manager. You can find these Device Management Policies at Setup > Resources > Device Management Policies by searching for the suggested names in the global scope. Each Device Management Policy follows the naming convention below:
{appName nativeType - version}
Ex: NVIDIA BrightCluster Manager NVIDIA BCM Head Node (i.e., appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)
Default Global Templates
OpsRamp has a Global Template for each Native Type of NVIDIA Bright Cluster Manager. You can find these templates at Setup > Monitoring > Templates by searching for the suggested names in the global scope. Each template follows the naming convention below:
{appName nativeType 'Template' - version}
Ex: nvidia-bright-cluster-manager NVIDIA BCM Head Node Template - 1 (i.e., appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)
Default Global Monitors
OpsRamp has a Global Monitor for each Native Type that has monitoring support. You can find these monitors at Setup > Monitoring > Monitors by searching for the suggested names in the global scope. Each monitor follows the naming convention below:
{monitorKey appName nativeType - version}
Ex: NVIDIA BCM Head Node Monitor nvidia-bright-cluster-manager NVIDIA BCM Head Node Container 1 (i.e., monitorKey = NVIDIA BCM Head Node Monitor, appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)
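Since all three naming conventions share the same components, the following sketch shows how the default names are assembled from appName, nativeType, monitorKey, and version. The component values reuse the examples above; the helper functions are illustrative only, and the rendered examples on this page show some variation, so treat this as the documented pattern rather than exact product output.

```python
# Assemble the default global names from the documented naming conventions.
# The input values mirror the examples given on this page.

def device_management_policy_name(app_name: str, native_type: str, version: int) -> str:
    # Convention: {appName nativeType - version}
    return f"{app_name} {native_type} - {version}"

def template_name(app_name: str, native_type: str, version: int) -> str:
    # Convention: {appName nativeType 'Template' - version}
    return f"{app_name} {native_type} Template - {version}"

def monitor_name(monitor_key: str, app_name: str, native_type: str, version: int) -> str:
    # Convention: {monitorKey appName nativeType - version}
    return f"{monitor_key} {app_name} {native_type} - {version}"

print(template_name("nvidia-bright-cluster-manager", "NVIDIA BCM Head Node", 1))
# -> "nvidia-bright-cluster-manager NVIDIA BCM Head Node Template - 1"
```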