Supported Metrics
This page lists the metrics collected by the NVIDIA Bright Cluster Manager integration and describes the default monitoring configuration applied when the integration is installed. Use this information to understand what is monitored out of the box and how to customize thresholds, templates, and alert behaviour for your environment.
The metrics collected by the integration are grouped into logical categories based on the monitored component type. Each category is shown in the table below, which displays the native type, metric names, display labels, units, descriptions, and supported application versions.
| Native Type | Metric Name | Display Name | Units | Description | Application Version |
|---|---|---|---|---|---|
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_nodeAvailabilityStatus | nvidia bcm cluster nodeAvailabilityStatus | NA | It monitors the availability status of each node. Possible status values: UP-0, Down-1, Closed-2, installing-3, installer_failed-4, installer_rebooting-5, installer_callinginit-6, installer_unreachable-7, installer_burning-8, burning-9, unknown-10, opening-11, going_down-12, pending-13, no data-14 (see the decoding sketch after this table). | 1.0.0 |
| | nvidia_bcm_cluster_slurmDaemon_status | nvidia bcm cluster slurmDaemon status | NA | It monitors the running status of slurmDaemon and the allocation status of each node. Possible states: allocated-0, idle-1, down-2. | 1.0.0 |
| | nvidia_bcm_cluster_gpuUnits_closed | nvidia bcm cluster gpuUnits closed | count | It monitors the count of GPU units in the closed state. | 1.0.0 |
| | nvidia_bcm_cluster_gpuUnits_down | nvidia bcm cluster gpuUnits down | count | It monitors the count of GPU units in the down state. | 1.0.0 |
| | nvidia_bcm_cluster_gpuUnits_total | nvidia bcm cluster gpuUnits total | count | It monitors the total count of GPU units. | 1.0.0 |
| | nvidia_bcm_cluster_gpuUnits_up | nvidia bcm cluster gpuUnits up | count | It monitors the count of GPU units in the up state. | 1.0.0 |
| | nvidia_bcm_cluster_cpuIdle | nvidia bcm cluster cpuIdle | % | It monitors the percentage of CPU idle time. | 1.0.0 |
| | nvidia_bcm_cluster_nodesTotal | nvidia bcm cluster nodesTotal | count | It monitors the total number of nodes. | 1.0.0 |
| | nvidia_bcm_cluster_nodesUp | nvidia bcm cluster nodesUp | count | It monitors the count of nodes in the UP state. | 1.0.0 |
| | nvidia_bcm_cluster_nodesDown | nvidia bcm cluster nodesDown | count | It monitors the count of nodes in the DOWN state. | 1.0.0 |
| | nvidia_bcm_cluster_nodesClosed | nvidia bcm cluster nodesClosed | count | It monitors the count of nodes in the closed state. | 1.0.0 |
| | nvidia_bcm_cluster_devicesTotal | nvidia bcm cluster devicesTotal | count | It monitors the total count of devices. | 1.0.0 |
| | nvidia_bcm_cluster_devicesUp | nvidia bcm cluster devicesUp | count | It monitors the count of devices in the UP state. | 1.0.0 |
| | nvidia_bcm_cluster_devicesDown | nvidia bcm cluster devicesDown | count | It monitors the count of devices in the DOWN state. | 1.0.0 |
| | nvidia_bcm_cluster_coresUp | nvidia bcm cluster coresUp | count | It monitors the count of cores in the UP state. | 1.0.0 |
| | nvidia_bcm_cluster_coresTotal | nvidia bcm cluster coresTotal | count | It monitors the total count of cores. | 1.0.0 |
| | nvidia_bcm_cluster_totalUsersLogin | nvidia bcm cluster totalUsersLogin | count | It monitors the count of total user logins. | 1.0.0 |
| | nvidia_bcm_cluster_totalKnownUsers | nvidia bcm cluster totalKnownUsers | count | It monitors the count of total known users. | 1.0.0 |
| | nvidia_bcm_cluster_occupationRate | nvidia bcm cluster occupationRate | % | It monitors the occupation rate as a percentage. | 1.0.0 |
| | nvidia_bcm_cluster_cpuUtilization | nvidia bcm cluster cpuUtilization | % | It monitors the percentage of CPU utilization. | 1.0.0 |
| | nvidia_bcm_cluster_failedJobs | nvidia bcm cluster failedJobs | count | Returns the number of failed jobs on the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_queuedJobs | nvidia bcm cluster queuedJobs | count | Returns the number of jobs queued on the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_runningJobs | nvidia bcm cluster runningJobs | count | Returns the number of jobs running on the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_nodesInQueue | nvidia bcm cluster nodesInQueue | count | Returns the number of nodes queued on the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_uniqueUserLogincount | nvidia bcm cluster uniqueUserLogincount | count | Returns the number of unique users logged in to the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_smartHdaTemp | nvidia bcm cluster smartHdaTemp | Celsius | It monitors the temperature of spindle disks. | 2.0.0 |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_healthStatus | nvidia bcm headNode healthStatus | NA | Health status (PASS or FAIL) of each component. | 1.0.0 |
| | nvidia_bcm_headNode_blockedProcesses | nvidia bcm headNode blockedProcesses | count | Blocked processes waiting for I/O. | 1.0.0 |
| | nvidia_bcm_headNode_systemCpuTime | nvidia bcm headNode systemCpuTime | jiffies | CPU time spent in system mode. | 1.0.0 |
| | nvidia_bcm_headNode_cpuWaitTime | nvidia bcm headNode cpuWaitTime | jiffies | CPU time spent in I/O wait mode. | 1.0.0 |
| | nvidia_bcm_headNode_errorsRecv | nvidia bcm headNode errorsRecv | Errors per Sec | Packets/s received with errors. | 1.0.0 |
| | nvidia_bcm_headNode_errorsSent | nvidia bcm headNode errorsSent | Errors per Sec | Packets/s sent with errors. | 1.0.0 |
| | nvidia_bcm_headNode_hardwareCorruptedMemory | nvidia bcm headNode hardwareCorruptedMemory | Bytes | Hardware-corrupted memory detected by ECC. | 1.0.0 |
| | nvidia_bcm_headNode_memoryFree | nvidia bcm headNode memoryFree | GB | Free system memory. | 1.0.0 |
| | nvidia_bcm_headNode_gpu_utilization | nvidia bcm headNode gpu utilization | % | Average GPU utilization percentage. | 1.0.0 |
| | nvidia_bcm_headNode_gpu_temperature | nvidia bcm headNode gpu temperature | Celsius | GPU temperature. | 1.0.0 |
| | nvidia_bcm_headNode_nfs_client_rpcRetrans | nvidia bcm headNode nfs client rpcRetrans | count | Returns NFS Client RPC retransmissions. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_fileStale | nvidia bcm headNode nfs server fileStale | count | Returns NFS Server stale file handles. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_packets | nvidia bcm headNode nfs server packets | count | Returns NFS Server packets. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_replyHits | nvidia bcm headNode nfs server replyHits | count | Returns NFS Server reply hits on the Head Node. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_replyMisses | nvidia bcm headNode nfs server replyMisses | count | Returns NFS Server reply misses on the Head Node. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_rpcBadAuth | nvidia bcm headNode nfs server rpcBadAuth | count | Returns NFS Server RPC BadAuth calls. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_rpcBadCalls | nvidia bcm headNode nfs server rpcBadCalls | count | Returns NFS Server RPC BadCalls. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_rpcBadClnt | nvidia bcm headNode nfs server rpcBadClnt | count | Returns NFS Server RPC BadClnt calls. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_rpcXdrCall | nvidia bcm headNode nfs server rpcXdrCall | count | Returns NFS Server RPC XDR calls. | 4.0.0 |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_healthStatus | nvidia bcm physicalNode healthStatus | NA | Health status (PASS or FAIL) of each component. | 1.0.0 |
| | nvidia_bcm_physicalNode_blockedProcesses | nvidia bcm physicalNode blockedProcesses | count | Blocked processes waiting for I/O. | 1.0.0 |
| | nvidia_bcm_physicalNode_systemCpuTime | nvidia bcm physicalNode systemCpuTime | jiffies | CPU time spent in system mode. | 1.0.0 |
| | nvidia_bcm_physicalNode_cpuWaitTime | nvidia bcm physicalNode cpuWaitTime | jiffies | CPU time spent in I/O wait mode. | 1.0.0 |
| | nvidia_bcm_physicalNode_errorsRecv | nvidia bcm physicalNode errorsRecv | Errors per Sec | Packets/s received with errors. | 1.0.0 |
| | nvidia_bcm_physicalNode_errorsSent | nvidia bcm physicalNode errorsSent | Errors per Sec | Packets/s sent with errors. | 1.0.0 |
| | nvidia_bcm_physicalNode_hardwareCorruptedMemory | nvidia bcm physicalNode hardwareCorruptedMemory | Bytes | Hardware-corrupted memory detected by ECC. | 1.0.0 |
| | nvidia_bcm_physicalNode_memoryFree | nvidia bcm physicalNode memoryFree | GB | Free system memory. | 1.0.0 |
| | nvidia_bcm_physicalNode_gpu_utilization | nvidia bcm physicalNode gpu utilization | % | GPU utilization percentage. | 1.0.0 |
| | nvidia_bcm_physicalNode_gpu_temperature | nvidia bcm physicalNode gpu temperature | Celsius | GPU temperature. | 1.0.0 |
| | nvidia_bcm_physicalNode_nfsmount_totalSize | nvidia bcm physicalNode nfsmount totalSize | GB | It monitors the total size of the NFS mount on the node. | 3.0.0 |
| | nvidia_bcm_physicalNode_nfsmount_usedSize | nvidia bcm physicalNode nfsmount usedSize | GB | It monitors the used size of the NFS mount on the node. | 3.0.0 |
| | nvidia_bcm_physicalNode_nfsmount_utilization | nvidia bcm physicalNode nfsmount utilization | % | It monitors the percentage utilization of the NFS mount on the node. | 3.0.0 |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_packets | nvidia bcm linuxServer nfs server packets | count | Returns NFS Server packets. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_replyHits | nvidia bcm linuxServer nfs server replyHits | count | Returns NFS Server reply hits on the Linux server. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_replyMisses | nvidia bcm linuxServer nfs server replyMisses | count | Returns NFS Server reply misses on the Linux server. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_rpcBadCalls | nvidia bcm linuxServer nfs server rpcBadCalls | count | Returns NFS Server RPC BadCalls. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_rpcBadAuth | nvidia bcm linuxServer nfs server rpcBadAuth | count | Returns NFS Server RPC BadAuth calls. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_rpcBadClnt | nvidia bcm linuxServer nfs server rpcBadClnt | count | Returns NFS Server RPC BadClnt calls. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_fileStale | nvidia bcm linuxServer nfs server fileStale | count | Returns NFS Server stale file handles. | 5.0.0 |
| | nvidia_bcm_linuxServer_memory_utilization | nvidia bcm linuxServer memory utilization | % | It monitors the percentage memory utilization of the Linux server. | 5.0.0 |
| | nvidia_bcm_linuxServer_disk_utilization | nvidia bcm linuxServer disk utilization | % | It monitors the percentage disk utilization of the Linux server. | 5.0.0 |
| | nvidia_bcm_linuxServer_cpu_utilization | nvidia bcm linuxServer cpu utilization | % | It monitors the percentage CPU utilization of the Linux server. | 5.0.0 |
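The nodeAvailabilityStatus and slurmDaemon status metrics above report numeric codes rather than labels. Below is a minimal Python sketch of how those codes can be decoded when post-processing collected values; the mappings are copied directly from the metric descriptions in the table, while the helper function itself is purely illustrative and not part of the integration.

```python
# Decode the numeric codes reported by the enumerated cluster metrics
# into human-readable labels. Mappings come from the table above.

NODE_AVAILABILITY_STATUS = {
    0: "UP", 1: "Down", 2: "Closed", 3: "installing",
    4: "installer_failed", 5: "installer_rebooting",
    6: "installer_callinginit", 7: "installer_unreachable",
    8: "installer_burning", 9: "burning", 10: "unknown",
    11: "opening", 12: "going_down", 13: "pending", 14: "no data",
}

SLURM_DAEMON_STATUS = {0: "allocated", 1: "idle", 2: "down"}

def decode_status(metric_name: str, value: float) -> str:
    """Return the label for an enumerated metric value (illustrative helper)."""
    table = {
        "nvidia_bcm_cluster_nodeAvailabilityStatus": NODE_AVAILABILITY_STATUS,
        "nvidia_bcm_cluster_slurmDaemon_status": SLURM_DAEMON_STATUS,
    }.get(metric_name, {})
    return table.get(int(value), f"unrecognized code {value}")

if __name__ == "__main__":
    print(decode_status("nvidia_bcm_cluster_nodeAvailabilityStatus", 2))  # -> "Closed"
```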
Default Monitoring Configurations
Bright Cluster Manager has default Global Device Management Policies, Global Templates, Global Monitors, and Global Metrics in OpsRamp. You can customize these default monitoring configurations for your business use cases by cloning the respective Global Templates and Global Device Management Policies. OpsRamp recommends doing this before installing the application to avoid noisy alerts and unwanted data.
Default Global Device Management Policies
OpsRamp has a Global Device Management Policy for each Native Type of NVIDIA Bright Cluster Manager. You can find these Device Management Policies at Setup > Resources > Device Management Policies by searching for the suggested names in the global scope. Each Device Management Policy follows the naming convention below:
{appName nativeType - version}
Ex: NVIDIA BrightCluster Manager NVIDIA BCM Head Node (i.e., appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)
Default Global Templates
OpsRamp has a Global Template for each Native Type of NVIDIA Bright Cluster Manager. You can find these templates at Setup > Monitoring > Templates by searching for the suggested names in the global scope. Each template follows the naming convention below:
{appName nativeType 'Template' - version}
Ex: nvidia-bright-cluster-manager NVIDIA BCM Head Node Template - 1 (i.e., appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)
Default Global Monitors
OpsRamp has a Global Monitor for each Native Type that has monitoring support. You can find these monitors at Setup > Monitoring > Monitors by searching for the suggested names in the global scope. Each monitor follows the naming convention below:
{monitorKey appName nativeType - version}
Ex: NVIDIA BCM Head Node Monitor nvidia-bright-cluster-manager NVIDIA BCM Head Node Container 1 (i.e., monitorKey = NVIDIA BCM Head Node Monitor, appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)
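Since all three naming conventions share the same components, the following sketch shows how the default names are assembled from appName, nativeType, monitorKey, and version. The component values reuse the examples above; the helper functions are illustrative only, and the rendered examples on this page show some variation, so treat this as the documented pattern rather than exact product output.

```python
# Assemble the default global names from the documented naming conventions.
# The input values mirror the examples given on this page.

def device_management_policy_name(app_name: str, native_type: str, version: int) -> str:
    # Convention: {appName nativeType - version}
    return f"{app_name} {native_type} - {version}"

def template_name(app_name: str, native_type: str, version: int) -> str:
    # Convention: {appName nativeType 'Template' - version}
    return f"{app_name} {native_type} Template - {version}"

def monitor_name(monitor_key: str, app_name: str, native_type: str, version: int) -> str:
    # Convention: {monitorKey appName nativeType - version}
    return f"{monitor_key} {app_name} {native_type} - {version}"

print(template_name("nvidia-bright-cluster-manager", "NVIDIA BCM Head Node", 1))
# -> "nvidia-bright-cluster-manager NVIDIA BCM Head Node Template - 1"
```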