Supported Metrics

This page lists the metrics collected by the NVIDIA Bright Cluster Manager integration and describes the default monitoring configuration applied when the integration is installed.
You can use this information to understand what is monitored out of the box and how to customize thresholds, templates, and alert behaviour for your environment.


The metrics collected by the integration are grouped into categories based on the monitored component type (the native type). Each category appears as a group of rows in the table below, which lists the native type, metric name, display name, units, description, and the application version that supports the metric.

Supported Metrics

| Native Type | Metric Name | Display Name | Units | Description | Application Version |
|---|---|---|---|---|---|
| NVIDIA BrightCluster Manager | nvidia_bcm_cluster_nodeAvailabilityStatus | nvidia bcm cluster nodeAvailabilityStatus | NA | Monitors the availability status of each node. Possible status values: UP-0, Down-1, Closed-2, installing-3, installer_failed-4, installer_rebooting-5, installer_callinginit-6, installer_unreachable-7, installer_burning-8, burning-9, unknown-10, opening-11, going_down-12, pending-13, no data-14 | 1.0.0 |
| | nvidia_bcm_cluster_slurmDaemon_status | nvidia bcm cluster slurmDaemon status | NA | Monitors the running status of the Slurm daemon and the allocation status of each node. Possible states: allocated-0, idle-1, down-2 | 1.0.0 |
| | nvidia_bcm_cluster_gpuUnits_closed | nvidia bcm cluster gpuUnits closed | count | Monitors the number of GPU units in the closed state. | 1.0.0 |
| | nvidia_bcm_cluster_gpuUnits_down | nvidia bcm cluster gpuUnits down | count | Monitors the number of GPU units in the down state. | 1.0.0 |
| | nvidia_bcm_cluster_gpuUnits_total | nvidia bcm cluster gpuUnits total | count | Monitors the total number of GPU units. | 1.0.0 |
| | nvidia_bcm_cluster_gpuUnits_up | nvidia bcm cluster gpuUnits up | count | Monitors the number of GPU units in the up state. | 1.0.0 |
| | nvidia_bcm_cluster_cpuIdle | nvidia bcm cluster cpuIdle | % | Monitors the percentage of CPU idle time. | 1.0.0 |
| | nvidia_bcm_cluster_nodesTotal | nvidia bcm cluster nodesTotal | count | Monitors the total number of nodes. | 1.0.0 |
| | nvidia_bcm_cluster_nodesUp | nvidia bcm cluster nodesUp | count | Monitors the number of nodes in the UP state. | 1.0.0 |
| | nvidia_bcm_cluster_nodesDown | nvidia bcm cluster nodesDown | count | Monitors the number of nodes in the DOWN state. | 1.0.0 |
| | nvidia_bcm_cluster_nodesClosed | nvidia bcm cluster nodesClosed | count | Monitors the number of nodes in the closed state. | 1.0.0 |
| | nvidia_bcm_cluster_devicesTotal | nvidia bcm cluster devicesTotal | count | Monitors the total number of devices. | 1.0.0 |
| | nvidia_bcm_cluster_devicesUp | nvidia bcm cluster devicesUp | count | Monitors the number of devices in the UP state. | 1.0.0 |
| | nvidia_bcm_cluster_devicesDown | nvidia bcm cluster devicesDown | count | Monitors the number of devices in the DOWN state. | 1.0.0 |
| | nvidia_bcm_cluster_coresUp | nvidia bcm cluster coresUp | count | Monitors the number of cores in the UP state. | 1.0.0 |
| | nvidia_bcm_cluster_coresTotal | nvidia bcm cluster coresTotal | count | Monitors the total number of cores. | 1.0.0 |
| | nvidia_bcm_cluster_totalUsersLogin | nvidia bcm cluster totalUsersLogin | count | Monitors the total number of user logins. | 1.0.0 |
| | nvidia_bcm_cluster_totalKnownUsers | nvidia bcm cluster totalKnownUsers | count | Monitors the total number of known users. | 1.0.0 |
| | nvidia_bcm_cluster_occupationRate | nvidia bcm cluster occupationRate | % | Monitors the occupation rate as a percentage. | 1.0.0 |
| | nvidia_bcm_cluster_cpuUtilization | nvidia bcm cluster cpuUtilization | % | Monitors the percentage of CPU utilization. | 1.0.0 |
| | nvidia_bcm_cluster_failedJobs | nvidia bcm cluster failedJobs | count | Returns the number of failed jobs on the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_queuedJobs | nvidia bcm cluster queuedJobs | count | Returns the number of queued jobs on the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_runningJobs | nvidia bcm cluster runningJobs | count | Returns the number of jobs running on the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_nodesInQueue | nvidia bcm cluster nodesInQueue | count | Returns the number of nodes queued on the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_uniqueUserLogincount | nvidia bcm cluster uniqueUserLogincount | count | Returns the number of unique users logged in to the cluster. | 1.0.0 |
| | nvidia_bcm_cluster_smartHdaTemp | nvidia bcm cluster smartHdaTemp | Celsius | Monitors the temperature of spindle disks. | 2.0.0 |
| NVIDIA BCM Head Node | nvidia_bcm_headNode_healthStatus | nvidia bcm headNode healthStatus | NA | Health status (PASS or FAIL) of each component. | 1.0.0 |
| | nvidia_bcm_headNode_blockedProcesses | nvidia bcm headNode blockedProcesses | count | Blocked processes waiting for I/O. | 1.0.0 |
| | nvidia_bcm_headNode_systemCpuTime | nvidia bcm headNode systemCpuTime | jiffies | CPU time spent in system mode. | 1.0.0 |
| | nvidia_bcm_headNode_cpuWaitTime | nvidia bcm headNode cpuWaitTime | jiffies | CPU time spent in I/O wait mode. | 1.0.0 |
| | nvidia_bcm_headNode_errorsRecv | nvidia bcm headNode errorsRecv | errors per sec | Packets per second received with errors. | 1.0.0 |
| | nvidia_bcm_headNode_errorsSent | nvidia bcm headNode errorsSent | errors per sec | Packets per second sent with errors. | 1.0.0 |
| | nvidia_bcm_headNode_hardwareCorruptedMemory | nvidia bcm headNode hardwareCorruptedMemory | Bytes | Hardware-corrupted memory detected by ECC. | 1.0.0 |
| | nvidia_bcm_headNode_memoryFree | nvidia bcm headNode memoryFree | GB | Free system memory. | 1.0.0 |
| | nvidia_bcm_headNode_gpu_utilization | nvidia bcm headNode gpu utilization | % | Average GPU utilization percentage. | 1.0.0 |
| | nvidia_bcm_headNode_gpu_temperature | nvidia bcm headNode gpu temperature | Celsius | GPU temperature. | 1.0.0 |
| | nvidia_bcm_headNode_nfs_client_rpcRetrans | nvidia_bcm_headNode_nfs_client_rpcRetrans | count | Returns NFS Client RPC Retrans. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_fileStale | nvidia_bcm_headNode_nfs_server_fileStale | count | Returns NFS Server File Stales. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_packets | nvidia_bcm_headNode_nfs_server_packets | count | Returns NFS Server Packets. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_replyHits | nvidia_bcm_headNode_nfs_server_replyHits | count | Returns NFS Server Reply Hits on the head node. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_replyMisses | nvidia_bcm_headNode_nfs_server_replyMisses | count | Returns NFS Server Reply Misses on the head node. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_rpcBadAuth | nvidia_bcm_headNode_nfs_server_rpcBadAuth | count | Returns NFS Server RPC BadAuth. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_rpcBadCalls | nvidia_bcm_headNode_nfs_server_rpcBadCalls | count | Returns NFS Server RPC BadCalls. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_rpcBadClnt | nvidia_bcm_headNode_nfs_server_rpcBadClnt | count | Returns NFS Server RPC BadClnt. | 4.0.0 |
| | nvidia_bcm_headNode_nfs_server_rpcXdrCall | nvidia_bcm_headNode_nfs_server_rpcXdrCall | count | Returns NFS Server RPC XDR Call. | 4.0.0 |
| NVIDIA BCM Physical Node | nvidia_bcm_physicalNode_healthStatus | nvidia bcm physicalNode healthStatus | NA | Health status (PASS or FAIL) of each component. | 1.0.0 |
| | nvidia_bcm_physicalNode_blockedProcesses | nvidia bcm physicalNode blockedProcesses | count | Blocked processes waiting for I/O. | 1.0.0 |
| | nvidia_bcm_physicalNode_systemCpuTime | nvidia bcm physicalNode systemCpuTime | jiffies | CPU time spent in system mode. | 1.0.0 |
| | nvidia_bcm_physicalNode_cpuWaitTime | nvidia bcm physicalNode cpuWaitTime | jiffies | CPU time spent in I/O wait mode. | 1.0.0 |
| | nvidia_bcm_physicalNode_errorsRecv | nvidia bcm physicalNode errorsRecv | errors per sec | Packets per second received with errors. | 1.0.0 |
| | nvidia_bcm_physicalNode_errorsSent | nvidia bcm physicalNode errorsSent | errors per sec | Packets per second sent with errors. | 1.0.0 |
| | nvidia_bcm_physicalNode_hardwareCorruptedMemory | nvidia bcm physicalNode hardwareCorruptedMemory | Bytes | Hardware-corrupted memory detected by ECC. | 1.0.0 |
| | nvidia_bcm_physicalNode_memoryFree | nvidia bcm physicalNode memoryFree | GB | Free system memory. | 1.0.0 |
| | nvidia_bcm_physicalNode_gpu_utilization | nvidia bcm physicalNode gpu utilization | % | GPU utilization percentage. | 1.0.0 |
| | nvidia_bcm_physicalNode_gpu_temperature | nvidia bcm physicalNode gpu temperature | Celsius | GPU temperature. | 1.0.0 |
| | nvidia_bcm_physicalNode_nfsmount_totalSize | nvidia bcm physicalNode nfsmount totalSize | GB | Monitors the total size of the NFS mount on the node. | 3.0.0 |
| | nvidia_bcm_physicalNode_nfsmount_usedSize | nvidia bcm physicalNode nfsmount usedSize | GB | Monitors the used size of the NFS mount on the node. | 3.0.0 |
| | nvidia_bcm_physicalNode_nfsmount_utilization | nvidia bcm physicalNode nfsmount utilization | % | Monitors the percentage utilization of the NFS mount on the node. | 3.0.0 |
| NVIDIA BCM Linux Server | nvidia_bcm_linuxServer_nfs_server_packets | nvidia_bcm_linuxServer_nfs_server_packets | count | Returns NFS Server Packets. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_replyHits | nvidia_bcm_linuxServer_nfs_server_replyHits | count | Returns NFS Server Reply Hits on the Linux server. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_replyMisses | nvidia_bcm_linuxServer_nfs_server_replyMisses | count | Returns NFS Server Reply Misses on the Linux server. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_rpcBadCalls | nvidia_bcm_linuxServer_nfs_server_rpcBadCalls | count | Returns NFS Server RPC BadCalls. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_rpcBadAuth | nvidia_bcm_linuxServer_nfs_server_rpcBadAuth | count | Returns NFS Server RPC BadAuth. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_rpcBadClnt | nvidia_bcm_linuxServer_nfs_server_rpcBadClnt | count | Returns NFS Server RPC BadClnt. | 5.0.0 |
| | nvidia_bcm_linuxServer_nfs_server_fileStale | nvidia_bcm_linuxServer_nfs_server_fileStale | count | Returns NFS Server File Stales. | 5.0.0 |
| | nvidia_bcm_linuxServer_memory_utilization | nvidia_bcm_linuxServer_memory_utilization | % | Monitors the percentage memory utilization of the Linux server. | 5.0.0 |
| | nvidia_bcm_linuxServer_disk_utilization | nvidia_bcm_linuxServer_disk_utilization | % | Monitors the percentage disk utilization of the Linux server. | 5.0.0 |
| | nvidia_bcm_linuxServer_cpu_utilization | nvidia_bcm_linuxServer_cpu_utilization | % | Monitors the percentage CPU utilization of the Linux server. | 5.0.0 |
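
Several of the status metrics above (nodeAvailabilityStatus and slurmDaemon_status) report numeric codes rather than labels. If you consume these metric values downstream, for example in a custom reporting or alert-enrichment script, you may want to translate the codes back into the labels documented in the table. The following is a minimal illustrative sketch, not part of the integration itself; the mappings are transcribed directly from the metric descriptions above, and the helper function name is hypothetical.

```python
# Illustrative decoder for the numeric status codes documented in the
# metrics table above. The mappings are copied from the metric
# descriptions; this script is not part of the OpsRamp integration.

NODE_AVAILABILITY_STATUS = {
    0: "UP", 1: "Down", 2: "Closed", 3: "installing",
    4: "installer_failed", 5: "installer_rebooting",
    6: "installer_callinginit", 7: "installer_unreachable",
    8: "installer_burning", 9: "burning", 10: "unknown",
    11: "opening", 12: "going_down", 13: "pending", 14: "no data",
}

SLURM_DAEMON_STATUS = {0: "allocated", 1: "idle", 2: "down"}


def decode_status(value: int, mapping: dict) -> str:
    """Return the human-readable label for a numeric status code."""
    return mapping.get(value, f"unrecognized code {value}")


if __name__ == "__main__":
    # Example: a node reporting nvidia_bcm_cluster_nodeAvailabilityStatus = 2
    print(decode_status(2, NODE_AVAILABILITY_STATUS))  # Closed
    # Example: a node reporting nvidia_bcm_cluster_slurmDaemon_status = 1
    print(decode_status(1, SLURM_DAEMON_STATUS))       # idle
```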

Default Monitoring Configurations

Bright Cluster Manager has default Global Device Management Policies, Global Templates, Global Monitors, and Global Metrics in OpsRamp. You can customize these default monitoring configurations for your business use cases by cloning the respective Global Templates and Global Device Management Policies. OpsRamp recommends doing this before installing the application to avoid noisy alerts and unwanted data.

  1. Default Global Device Management Policies

    OpsRamp has a Global Device Management Policy for each Native Type of NVIDIA Bright Cluster Manager. You can find these Device Management Policies at Setup > Resources > Device Management Policies; search with the suggested names in the global scope. Each Device Management Policy follows the naming convention below (a sketch reconstructing all three naming conventions appears after this list):

    {appName nativeType - version}

    Ex: nvidia-bright-cluster-manager NVIDIA BCM Head Node - 1 (i.e., appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)

  2. Default Global Templates

    OpsRamp has a Global Template for each Native Type of NVIDIA Bright Cluster Manager. You can find these templates at Setup > Monitoring > Templates; search with the suggested names in the global scope. Each template follows the naming convention below:

    {appName nativeType 'Template' - version}

    Ex: nvidia-bright-cluster-manager NVIDIA BCM Head Node Template - 1 (i.e., appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)

  3. Default Global Monitors

    OpsRamp has a Global Monitor for each Native Type that has monitoring support. You can find these monitors at Setup > Monitoring > Monitors; search with the suggested names in the global scope. Each Monitor follows the naming convention below:

    {monitorKey appName nativeType - version}

    Ex: NVIDIA BCM Head Node Monitor nvidia-bright-cluster-manager NVIDIA BCM Head Node - 1 (i.e., monitorKey = NVIDIA BCM Head Node Monitor, appName = nvidia-bright-cluster-manager, nativeType = NVIDIA BCM Head Node, version = 1)
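
The three naming conventions above differ only in which pieces they concatenate. As a quick illustration, the names from the examples can be reconstructed with plain string templates; this is a hypothetical helper for clarity, not an OpsRamp API.

```python
# Illustrative reconstruction of the default naming conventions described
# above. These are plain string templates, not an OpsRamp API.

def policy_name(app_name: str, native_type: str, version: int) -> str:
    # {appName nativeType - version}
    return f"{app_name} {native_type} - {version}"

def template_name(app_name: str, native_type: str, version: int) -> str:
    # {appName nativeType 'Template' - version}
    return f"{app_name} {native_type} Template - {version}"

def monitor_name(monitor_key: str, app_name: str,
                 native_type: str, version: int) -> str:
    # {monitorKey appName nativeType - version}
    return f"{monitor_key} {app_name} {native_type} - {version}"

print(template_name("nvidia-bright-cluster-manager", "NVIDIA BCM Head Node", 1))
# nvidia-bright-cluster-manager NVIDIA BCM Head Node Template - 1
```

Searching for these generated strings in the global scope of the respective Setup pages should surface the corresponding default policy, template, or monitor.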