Introduction

A Linux cluster is a group of Linux computers (nodes) and storage devices that work together and are managed as a single system. A traditional clustering configuration has two nodes connected to shared storage (typically a SAN). With Linux clustering, an application runs on one node, and clustering software is used to monitor its operation.

A Linux cluster provides faster processing speed, larger storage capacity, better data integrity, greater reliability and wider availability of resources.

Failover

Failover is the process by which a standby system takes over operations whenever a primary system, network, or database fails or terminates abnormally.

Failover Cluster

A failover cluster is a set of servers that work together to provide High Availability (HA) or Continuous Availability (CA). As mentioned earlier, if one of the servers goes down, another node in the cluster can take over its workload with minimal or no downtime. Some failover clusters use physical servers, whereas others involve virtual machines (VMs).

CA clusters allow users to keep accessing and working on services and applications without any timeouts (100% availability) in case of a server failure. HA clusters, on the other hand, may cause a short interruption in service, but the system recovers automatically with minimal downtime and no data loss.

A cluster is a set of two or more nodes (servers) that transmit data for processing through cables or a dedicated secure network. Other clustering technologies also enable load balancing, storage, and concurrent/parallel processing.

Linux Failover Cluster Monitoring

In a typical two-node configuration, Node 1 and Node 2 have common shared storage. Whenever one node goes down, the other one picks up from there. The two nodes share one virtual IP that all clients connect to.
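As a hedged illustration (the resource name, IP address, and netmask below are placeholders, not values from this document), such a cluster virtual IP is typically defined on a Pacemaker cluster as an IPaddr2 resource:

```shell
# Placeholder values: adjust the resource name, IP, and netmask for your network.
# Requires a running Pacemaker cluster; shown here only as a sketch.
pcs resource create ClusterVIP ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 cidr_netmask=24 op monitor interval=30s
```

Pacemaker then keeps the virtual IP on whichever node is active, so clients do not need to know which physical node currently serves them.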

Let us take a look at the two failover clusters, namely High Availability Failover Clusters and Continuous Availability Failover Clusters.

High Availability Failover Clusters

In a High Availability failover cluster, a set of servers shares data and resources in the system. All the nodes have access to the shared storage.

High Availability Clusters also include a monitoring connection that servers use to check the “heartbeat” or health of the other servers. At any time, at least one of the nodes in a cluster is active, while at least one is passive.
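As an illustrative sketch (not part of the product), the active/passive picture can be inspected on a Pacemaker node by parsing the output of `pcs status nodes`. The heredoc below stands in for live output so the snippet is self-contained:

```shell
# Illustrative sketch: counting online vs. standby nodes from
# `pcs status nodes`-style output. The heredoc is sample data; on a real
# cluster you would instead use: status=$(pcs status nodes)
status=$(cat <<'EOF'
Pacemaker Nodes:
 Online: node1
 Standby: node2
 Offline:
EOF
)
# Each matching line lists node names after the label, so NF-1 = node count.
online=$(printf '%s\n' "$status"  | awk '/Online:/  {print NF-1}')
standby=$(printf '%s\n' "$status" | awk '/Standby:/ {print NF-1}')
echo "online=$online standby=$standby"   # -> online=1 standby=1
```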

Continuous Availability Failover Clusters

This system consists of multiple systems that share a single copy of the operating system. Software commands issued on one system are also executed on the other systems. In the event of a failover, the user can verify critical data in a transaction.

There are a few failover cluster types, such as Windows Server Failover Cluster (WSFC), VMware failover clusters, SQL Server failover clusters, and Red Hat Linux failover clusters.

Hierarchy of Linux Cluster

Cluster
  -Nodes

Pre-Requisites

  1. OpsRamp Classic Gateway 12.0.1 and above
  2. Pre-requisites for Pacemaker:
    • Credentials: root, or a non-root user that is a member of the “haclient” group.
    • Cluster management: Pacemaker
    • Accessibility: All nodes within a cluster should be accessible with a single credential set.
    • For non-root users: Update the “~/.bashrc” file with the “pcs” command path on all cluster nodes.
      Ex: add export PATH=$PATH:/usr/sbin as a new line in the ~/.bashrc file.
  3. Pre-requisites for RGManager (non-Pacemaker):
    • Credentials: should provide access to both root and non-root users.

    • Cluster management: RGManager

    • Accessibility: All the nodes within a cluster should be accessible with a single credential set.

    • For non-root users: Add the following commands to the “/etc/sudoers” file to allow non-root users to execute them.

      “/usr/sbin/cman_tool nodes,/usr/sbin/cman_tool status,/usr/sbin/clustat -l,/sbin/service cman status,/sbin/service rgmanager status,/sbin/service corosync status,/usr/sbin/dmidecode -s system-uuid,/bin/cat /sys/class/dmi/id/product_serial”

      Note: A Linux cluster is usually configured with a virtual IP, commonly called the cluster virtual IP. Use this IP when adding configurations during installation of the integration.

    • If the cluster virtual IP is not configured, provide the IP address of a reachable node associated with the cluster.
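To make the sudoers requirement concrete, here is a sketch that generates the entry from the command list above. The account name "opsuser" is a placeholder (our assumption, not from this document); review the printed line and install it via visudo rather than editing /etc/sudoers by hand.

```shell
# Illustrative sketch: build the /etc/sudoers entry for a non-root monitoring
# user. MONITOR_USER is a placeholder account name.
MONITOR_USER="opsuser"
CMDS="/usr/sbin/cman_tool nodes,/usr/sbin/cman_tool status,/usr/sbin/clustat -l,/sbin/service cman status,/sbin/service rgmanager status,/sbin/service corosync status,/usr/sbin/dmidecode -s system-uuid,/bin/cat /sys/class/dmi/id/product_serial"
# NOPASSWD lets the monitoring user run exactly these commands without a
# password prompt, which unattended collection requires.
SUDOERS_LINE="$MONITOR_USER ALL=(root) NOPASSWD: $CMDS"
echo "$SUDOERS_LINE"
```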

Application migration

  1. Check the gateway version as a prerequisite step: Classic Gateway 12.0.1 and above.
    Notes:

    • You only have to follow these steps when you want to migrate from sdk 1.0 to sdk 2.0.
    • For a first-time installation, the steps below are not required.
  2. Disable all configurations associated with the sdk 1.0 adaptor integration application.

  3. Install and add the configuration to the sdk 2.0 application.
    Note: Refer to the Configure and install the integration and View the Linux Failover Cluster details sections of this document.

  4. Once all discoveries are completed with the sdk 2.0 application, follow any one of the approaches below.

    • Directly uninstall the sdk 1.0 adaptor application through the uninstall API, with skipDeleteResources=true in the POST request.

      End-Point: https://{{host}}/api/v2/tenants/{tenantId}/integrations/installed/{installedIntgId}

      Request Body:
          {
          "uninstallReason": "Test",
          "skipDeleteResources": true
          }


      (OR)

    • Delete the configurations one by one through the Delete adaptor config API, with the request parameter skipDeleteResources=true.

      End-Point: https://{{host}}/api/v2/tenants/{tenantId}/integrations/installed/config/{configId}?skipDeleteResources=true

    • Finally, uninstall the adaptor application through the API, with skipDeleteResources=true in the POST request.

      End-Point: https://{{host}}/api/v2/tenants/{tenantId}/integrations/installed/{installedIntgId}

      Request Body:
          {
          "uninstallReason": "Test",
          "skipDeleteResources": true
          }
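The uninstall call above can be sketched as a curl invocation. HOST, TOKEN, TENANT_ID, and INSTALLED_INTG_ID are placeholders you must supply, and the Bearer-token header is an assumption about how your API authentication is set up:

```shell
# Illustrative sketch: composing the uninstall request with
# skipDeleteResources=true. All values below are placeholders.
HOST="opsramp.example.com"; TOKEN="REPLACE_ME"
TENANT_ID="client_1"; INSTALLED_INTG_ID="INTG-0001"
BODY='{"uninstallReason": "Test", "skipDeleteResources": true}'
# Printed instead of executed here; remove the leading echo to send it.
echo curl -s -X POST \
  "https://$HOST/api/v2/tenants/$TENANT_ID/integrations/installed/$INSTALLED_INTG_ID" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "$BODY"
```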

Configure and install the integration

  1. From All Clients, select a client.
  2. Go to Setup > Integrations and Apps > Integrations.
  3. Click Manage Apps.
    Notes:
    • If there are already installed applications, it will redirect to the INSTALLED APPS page where all the installed applications are displayed.
    • If there are no installed applications, it will navigate to the ADD APP page.
  4. Click + ADD on the INSTALLED APPS page. The ADD APP page displays all the available applications along with the newly created application with the version.

    Note: You can also search for the application using the search option, or filter using the All Categories option.
  5. Click ADD in the linux-failover-cluster application.
  6. Select an existing registered profile, and click Next.
  7. In the Configurations page, click + ADD. The Add Configuration page appears.
  8. Enter the following BASIC INFORMATION:
| Object Name | Description |
| --- | --- |
| Name | Enter the name for the integration. |
| IP Address/Host Name | IP address/host name of the target. |
| Credentials | Select the credentials from the drop-down list. Note: Click + Add to create a credential. |
| Cluster Type | Select Pacemaker or RGManager from the Cluster Type drop-down list. |

Note: Select App Failure Notifications to be notified in case of an application failure, that is, a Connectivity Exception or Authentication Exception.

  9. In the RESOURCE TYPE section, select:
    • ALL: All the existing and future resources will be discovered.
    • SELECT: You can select one or multiple resources to be discovered.
  10. In the DISCOVERY SCHEDULE section, select Recurrence Pattern to add one of the following patterns:
    • Minutes
    • Hourly
    • Daily
    • Weekly
    • Monthly
  11. Click ADD.
    The configuration is saved and displayed on the page.

View the Linux Failover Cluster details

To view the resource information, go to Infrastructure > Resources > Cluster and click on your created cluster name.


View resource attributes

The discovered resource(s) are displayed under Attributes. On this page you get basic information about the resources, such as Resource Type, Native Resource Type, Resource Name, and IP Address.


View resource metrics

To confirm Linux Cluster monitoring, review the following:

  • Metric graphs: A graph is plotted for each metric that is enabled in the configuration.
  • Alerts: Alerts are generated for metrics that are configured as defined for the integration.

Supported Metrics

Resource Type: Cluster

Pacemaker

| Metric Name | Description | Display Name | Unit | Pacemaker / RGManager |
| --- | --- | --- | --- | --- |
| linux_cluster_nodes_status | Status of each node present in the Linux cluster: 0 - offline, 1 - online, 2 - standby | Cluster Node Status | | Both |
| linux_cluster_system_OS_Uptime | Time elapsed since last reboot, in minutes | System Uptime | m | Both |
| linux_cluster_system_cpu_Load | Monitors the system's last 1 min, 5 min, and 15 min load. Sends per-CPU-core load average. | System CPU Load | | Both |
| linux_cluster_system_cpu_Utilization | The percentage of elapsed time that the processor spends executing a non-idle thread (this does not include CPU steal time) | System CPU Utilization | % | Both |
| linux_cluster_system_memory_Usedspace | Physical and virtual memory usage in GB | System Memory Used Space | Gb | Both |
| linux_cluster_system_memory_Utilization | Physical and virtual memory usage in percentage | System Memory Utilization | % | Both |
| linux_cluster_system_cpu_Usage_Stats | Monitors CPU time, in percentage, spent in various program spaces. User - the processor time spent running user-space processes; System - the amount of time the CPU spent running the kernel; IOWait - the time the CPU spends idle while waiting for an I/O operation to complete; Idle - the time the processor spends idle; Steal - the time the virtual CPU has spent waiting for the hypervisor to service another virtual CPU on a different virtual machine; Kernel Time; Total Time | System CPU Usage Statistics | % | Both |
| linux_cluster_system_disk_Usedspace | Monitors disk used space in GB | System Disk UsedSpace | Gb | Both |
| linux_cluster_system_disk_Utilization | Monitors disk utilization in percentage | System Disk Utilization | % | Both |
| linux_cluster_system_disk_Inode_Utilization | Collects disk inode metrics for all physical disks in a server | System Disk Inode Utilization | % | Both |
| linux_cluster_system_disk_freespace | Monitors the free space usage in GB | System FreeDisk Usage | Gb | Both |
| linux_cluster_system_network_interface_Traffic_In | Monitors in-traffic of each interface for Linux devices | System Network In Traffic | Kbps | Both |
| linux_cluster_system_network_interface_Traffic_Out | Monitors out-traffic of each interface for Linux devices | System Network Out Traffic | Kbps | Both |
| linux_cluster_system_network_interface_Packets_In | Monitors in-packets of each interface for Linux devices | System Network In packets | packets/sec | Both |
| linux_cluster_system_network_interface_Packets_Out | Monitors out-packets of each interface for Linux devices | System Network Out packets | packets/sec | Both |
| linux_cluster_system_network_interface_Errors_In | Monitors network in-errors of each interface for Linux devices | System Network In Errors | Errors per Sec | Both |
| linux_cluster_system_network_interface_Errors_Out | Monitors network out-errors of each interface for Linux devices | System Network Out Errors | Errors per Sec | Both |
| linux_cluster_system_network_interface_discards_In | Monitors network in-discards of each interface for Linux devices | System Network In discards | psec | Both |
| linux_cluster_system_network_interface_discards_Out | Monitors network out-discards of each interface for Linux devices | System Network Out discards | psec | Both |
| linux_cluster_service_status_Pacemaker | Pacemaker High Availability Cluster Manager. The status representation is as follows: 0 - "failed", 1 - "active", 2 - "unknown" | Pacemaker Service Status | | Pacemaker |
| linux_cluster_service_status_Corosync | The Corosync Cluster Engine is a group communication system. The status representation is as follows: 0 - "failed", 1 - "active", 2 - "unknown" | Corosync Service Status | | Pacemaker |
| linux_cluster_service_status_PCSD | PCS GUI and remote configuration interface. The status representation is as follows: 0 - "failed", 1 - "active", 2 - "unknown" | PCSD Service Status | | Pacemaker |
| linux_cluster_Online_Nodes_Count | Online cluster nodes count | Online Nodes Count | count | Both |
| linux_cluster_Failover_Status | Provides details about the cluster failover status. The integer representation is as follows: 0 - cluster is running on the same node, 1 - a failover has happened | Cluster FailOver Status | | Both |
| linux_cluster_node_Health | This metric gives the percentage of online Linux nodes available within a cluster | Cluster Node Health Percentage | % | Both |
| linux_cluster_service_Status | Cluster services status. The status representation is as follows: 0 - disabled, 1 - blocked, 2 - failed, 3 - stopped, 4 - recovering, 5 - stopping, 6 - starting, 7 - started, 8 - unknown | Linux Cluster Service Status | | Both |
| linux_cluster_service_status_rgmanager | RGManager Service Status. The status representation is as follows: 0 - "failed", 1 - "active", 2 - "unknown" | RGManager Service Status | | RGManager |
| linux_cluster_service_status_CMAN | CMAN Service Status. The status representation is as follows: 0 - "failed", 1 - "active", 2 - "unknown" | CMAN Service Status | | RGManager |
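To make the 0/1/2 service-status convention above concrete, here is an illustrative sketch (the function name is ours, not the product's) mapping a service state string, such as the output of `systemctl is-active <service>`, to those codes:

```shell
# Illustrative sketch: map a service state string to the 0/1/2 codes used by
# the linux_cluster_service_status_* metrics (0 = failed, 1 = active,
# 2 = unknown). The function name is hypothetical.
service_status_code() {
  case "$1" in
    failed) echo 0 ;;
    active) echo 1 ;;
    *)      echo 2 ;;   # anything else is reported as unknown
  esac
}
# e.g. on a live node: service_status_code "$(systemctl is-active pacemaker)"
service_status_code active   # -> 1
```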

Resource Type: Server

| Metric Name | Description | Display Name | Unit | Pacemaker / RGManager |
| --- | --- | --- | --- | --- |
| linux_node_system_OS_Uptime | Time elapsed since last reboot, in minutes | System Uptime | m | Both |
| linux_node_system_cpu_Load | Monitors the system's last 1 min, 5 min, and 15 min load. Sends per-CPU-core load average. | System CPU Load | | Both |
| linux_node_system_cpu_Utilization | The percentage of elapsed time that the processor spends executing a non-idle thread (this does not include CPU steal time) | System CPU Utilization | % | Both |
| linux_node_system_memory_Usedspace | Physical and virtual memory usage in GB | System Memory Used Space | Gb | Both |
| linux_node_system_memory_Utilization | Physical and virtual memory usage in percentage | System Memory Utilization | % | Both |
| linux_node_system_cpu_Usage_Stats | Monitors CPU time, in percentage, spent in various program spaces. User - the processor time spent running user-space processes; System - the amount of time the CPU spent running the kernel; IOWait - the time the CPU spends idle while waiting for an I/O operation to complete; Idle - the time the processor spends idle; Steal - the time the virtual CPU has spent waiting for the hypervisor to service another virtual CPU on a different virtual machine; Kernel Time; Total Time | System CPU Usage Statistics | % | Both |
| linux_node_system_disk_Usedspace | Monitors disk used space in GB | System Disk UsedSpace | Gb | Both |
| linux_node_system_disk_Utilization | Monitors disk utilization in percentage | System Disk Utilization | % | Both |
| linux_node_system_disk_Inode_Utilization | Collects disk inode metrics for all physical disks in a server | System Disk Inode Utilization | % | Both |
| linux_node_system_disk_freespace | Monitors the free space usage in GB | System FreeDisk Usage | Gb | Both |
| linux_node_system_network_interface_Traffic_In | Monitors in-traffic of each interface for Linux devices | System Network In Traffic | Kbps | Both |
| linux_node_system_network_interface_Traffic_Out | Monitors out-traffic of each interface for Linux devices | System Network Out Traffic | Kbps | Both |
| linux_node_system_network_interface_Packets_In | Monitors in-packets of each interface for Linux devices | System Network In packets | packets/sec | Both |
| linux_node_system_network_interface_Packets_Out | Monitors out-packets of each interface for Linux devices | System Network Out packets | packets/sec | Both |
| linux_node_system_network_interface_Errors_In | Monitors network in-errors of each interface for Linux devices | System Network In Errors | Errors per Sec | Both |
| linux_node_system_network_interface_Errors_Out | Monitors network out-errors of each interface for Linux devices | System Network Out Errors | Errors per Sec | Both |
| linux_node_system_network_interface_discards_In | Monitors network in-discards of each interface for Linux devices | System Network In discards | psec | Both |
| linux_node_system_network_interface_discards_Out | Monitors network out-discards of each interface for Linux devices | System Network Out discards | psec | Both |

Default monitoring configurations

The Linux Failover Cluster application has default Global Device Management Policies, Global Templates, Global Monitors, and Global Metrics in OpsRamp. Users can customize these default monitoring configurations for their business use cases by cloning the respective Global Templates and Global Device Management Policies. OpsRamp recommends doing this before installing the application to avoid alert noise and unwanted data.

  1. Default Global Device Management Policies available

    OpsRamp has a Global Device Management Policy for each Native Type of Linux Failover Cluster. You can find those Device Management Policies at Setup -> Resources -> Device Management Policies; search with the suggested names in the global scope. Each Device Management Policy follows the naming convention below:

    {appName nativeType - version}

    Ex: linux-failover-cluster Linux Cluster - 1 (i.e, appName = linux-failover-cluster, nativeType = Linux Cluster, version = 1)

  2. Default Global Templates available

    OpsRamp has a Global Template for each Native Type of LINUX-FAILOVER-CLUSTER. You can find those templates at Setup -> Monitoring -> Templates; search with the suggested names in the global scope. Each template follows the naming convention below:

    {appName nativeType 'Template' - version}

    Ex: linux-failover-cluster Linux Cluster Template - 1 (i.e, appName = linux-failover-cluster, nativeType = Linux Cluster, version = 1)

  3. Default Global Monitors available

    OpsRamp has a Global Monitor for each Native Type that has monitoring support. You can find those monitors at Setup -> Monitoring -> Monitors; search with the suggested names in the global scope. Each monitor follows the naming convention below:

    {monitorKey appName nativeType - version}

    Example: Linux Failover Cluster Monitor linux-failover-cluster Linux Cluster 1 (i.e, monitorKey = Linux Failover Cluster Monitor, appName = linux-failover-cluster, nativeType = Linux Cluster, version = 1)

Risks, Limitations & Assumptions

  • The application can handle Critical/Recovery failure notifications for the two cases below when the user enables App Failure Notifications in the configuration:
    • Connectivity Exception
    • Authentication Exception
  • The application will not send any duplicate/repeated failure alert notification until the existing critical alert is recovered.
  • The macro replacement limitation (i.e., customization of the threshold-breach alert subject and description) also exists in sdk 2.0 applications.
  • The application cannot control monitoring pause/resume actions based on the above alerts.
  • Metrics can be used to monitor Linux Failover Cluster resources and can generate alerts based on the threshold values.
  • This application supports only the Classic gateway; it is not supported with the Cluster gateway.
  • Component-level thresholds can be configured at each resource level.
  • No support for showing the activity log and applied time.
  • No support for the option to get the Latest Snapshot metric.