Introduction

High availability enables your IT infrastructure to keep functioning even when some of its components fail. It plays a vital role when a severe disruption in services could lead to severe business impact.

It is a concept that entails eliminating single points of failure so that even if one component, such as a server, fails, the service remains available.

Failover

Failover is the process by which a standby takes over whenever a primary system, network, or database fails or terminates abnormally, so that operations can resume.

Failover Cluster

A failover cluster is a set of servers that work together to provide High Availability (HA) or Continuous Availability (CA). As mentioned earlier, if one of the servers goes down, another node in the cluster can take over its workload with minimal or no downtime. Some failover clusters use physical servers, whereas others involve virtual machines (VMs).

CA clusters allow users to keep accessing and working on services and applications without any timeouts (100% availability) in case of a server failure. HA clusters, on the other hand, may cause a brief interruption in service, but the system recovers automatically with minimal downtime and no data loss.

A cluster is a set of two or more nodes (servers) that transmit data for processing through cables or a dedicated secure network. Other clustering technologies also provide load balancing, storage, or concurrent/parallel processing.

If you look at the above image, Node 1 and Node 2 have common shared storage. Whenever one node goes down, the other one will pick up from there. These two nodes have one virtual IP that all other clients connect to.

Let us take a look at the two types of failover clusters, namely High Availability Failover Clusters and Continuous Availability Failover Clusters.

High Availability Failover Clusters

In a High Availability Failover Cluster, a set of servers shares data and resources in the system. All the nodes have access to the shared storage.

High availability clusters also include a monitoring connection that servers use to check the “heartbeat” or health of the other servers. At any time, at least one of the nodes in a cluster is active, while at least one is passive.
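The heartbeat check described above can be illustrated with a minimal Python sketch. It does not show how any particular cluster product is implemented; the node names, the 5-second timeout, and the `is_alive` and `choose_active` helpers are assumptions made for illustration. The active node keeps its role while its heartbeat is fresh; otherwise the first passive node with a fresh heartbeat takes over.

```python
from typing import Dict, List, Optional

HEARTBEAT_TIMEOUT = 5.0  # seconds; hypothetical value chosen for this sketch


def is_alive(last_heartbeat: float, now: float,
             timeout: float = HEARTBEAT_TIMEOUT) -> bool:
    """A node counts as healthy if its last heartbeat is recent enough."""
    return (now - last_heartbeat) <= timeout


def choose_active(current_active: str, heartbeats: Dict[str, float],
                  now: float, priority: List[str]) -> Optional[str]:
    """Return the node that should be active at time `now`.

    heartbeats maps node name -> timestamp of the last heartbeat received.
    priority lists the nodes in the order in which they should take over.
    """
    # Keep the current active node as long as its heartbeat is fresh.
    if current_active in heartbeats and is_alive(heartbeats[current_active], now):
        return current_active
    # Otherwise fail over to the first healthy node in priority order.
    for node in priority:
        if node in heartbeats and is_alive(heartbeats[node], now):
            return node
    return None  # no healthy node: the service is unavailable


# Node1 last reported 10 seconds ago, Node2 one second ago.
heartbeats = {"Node1": 100.0, "Node2": 109.0}
print(choose_active("Node1", heartbeats, now=110.0, priority=["Node1", "Node2"]))
# prints: Node2
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the decision logic easy to reason about and to test.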

Continuous Availability Failover Clusters

This system consists of multiple systems that share a single copy of a computer’s operating system. Software commands issued by one system are also executed on the other systems. In case of a failover, the user can check critical data in a transaction.

There are a few Failover Cluster types like Windows Server Failover Cluster (WSFC), VMware Failover Clusters, SQL Server Failover Clusters, and Red Hat Linux Failover Clusters.

Windows Server Failover Clustering (WSFC)

One of the powerful features of Windows Server is the ability to create Windows failover clusters. With Windows Server 2019, Windows Failover Clustering is more powerful than ever and can host many highly available resources for business-critical workloads.

The following are the types of Windows Server 2019 Failover Clustering:

  • Hyper-V Clustering
  • Clustering for File Services
  • Scale-Out File Server
  • Application Layer Clustering
  • Host Layer Clustering
  • Tiered Clustering

Each provides tremendous capabilities to ensure production workloads are resilient and highly available.

Windows Server 2019 Failover Clustering supports the new and demanding use cases with a combination of various cluster types and applications of various clustering technologies.

Windows Server Failover Clustering (WSFC) is a feature of the Windows Server platform for improving the high availability of clustered roles (formerly called clustered applications and services). For example, say there are two servers. They communicate through a series of heartbeat signals over a dedicated network.

Pre-requisites

  1. OpsRamp Classic Gateway 11.0 or above, or OpsRamp Cluster Gateway.
  2. Ensure that the “adapter integrations” add-on is enabled in the client configuration. Once it is enabled, the Windows Fail-over Cluster integration appears under Setup -> Integrations -> Adapter.
  3. PS remoting and WMI remoting must be enabled on each cluster node. If the configured user is a non-administrator, that user must have privileges for WMI remoting and for Windows Services.

The following are the specific pre-requisites for non-administrator/operator level users:

  • 1. Enable WMI Remoting

    To enable WMI remoting:

    1. Click Start and select Run.
    2. Enter wmimgmt.msc and click OK.
    3. Right-click WMI Control (Local) and select Properties.
    4. Click the Security tab.
    5. Expand Root.
    6. Select WMI and click Security.
    7. Add user and select the following permissions:
      • Execute methods
      • Enable account
      • Enable remoting
      • Read security

  • 2. Enable WMI Remoting – CPU, Disk, Network

    1. Click Start and select Run.
    2. Enter lusrmgr.msc and click OK.
    3. In the Groups folder, right click Performance Monitor Users and select Properties.
    4. Click Members of tab, and click Add.
    5. Add users.

  • 3. Enable Windows Service Monitoring

    1. Retrieve the user SID of the User Account from the monitored device.
    2. Open Command Prompt in Administrator mode.
    3. Run the below command to retrieve the user SID.
      Note: Replace UserName with the user name for the User account.
        wmic useraccount where name="UserName" get name,sid

        Example:
        wmic useraccount where name="apiuser" get name,sid

    4. Note down the SID.
      (Ex. S-1-0-10-200000-30000000000-4000000000-500)

    5. Retrieve the current SDDL for the SC Manager.

    6. Run the below command, which saves the current SDDL for the SC Manager to CurrentSDDL.txt.

        sc sdshow clussvc > CurrentSDDL.txt

    7. Edit CurrentSDDL.txt and copy the entire content.
      The SDDL will look like below:
        D:(A;;CC;;;AU)(A;;CCLCRPRC;;;IU)(A;;CCLCRPRC;;;SU)(A;;CCLCRPWPRC;;;SY)(A;;KA;;;BA)(A;;CC;;;AC)S:(AU;FA;KA;;;WD)(AU;OIIOFA;GA;;;WD)

    8. Update the SDDL:
      Frame a new SDDL snippet for the above SID
        (A;;CCLCRPWPRC;;; <SID of User> )

        Example:
        (A;;CCLCRPWPRC;;;S-1-0-10-200000-30000000000-4000000000-500)

    9. Place this snippet before “S:” of the original SDDL.
      The updated SDDL will look like this:
        D:(A;;CC;;;AU)(A;;CCLCRPRC;;;IU)(A;;CCLCRPRC;;;SU)(A;;CCLCRPWPRC;;;SY)(A;;KA;;;BA)(A;;CC;;;AC)(A;;CCLCRPWPRC;;;S-1-0-10-200000-30000000000-4000000000-500)S:(AU;FA;KA;;;WD)(AU;OIIOFA;GA;;;WD)

    10. Execute the below command with the updated SDDL:
        sc sdset clussvc D:(A;;CC;;;AU)(A;;CCLCRPRC;;;IU)(A;;CCLCRPRC;;;SU)(A;;CCLCRPWPRC;;;SY)(A;;KA;;;BA)(A;;CC;;;AC)(A;;CCLCRPWPRC;;;S-1-0-10-200000-30000000000-4000000000-500)S:(AU;FA;KA;;;WD)(AU;OIIOFA;GA;;;WD)
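
The SDDL edit in the steps above is plain string manipulation, so it can be sketched in Python to reduce copy-and-paste mistakes. This is only an illustration: the `add_ace_to_sddl` helper is hypothetical, and it assumes the SDDL has the `D:...S:...` shape shown above, so that the first occurrence of `S:` marks where the SACL starts. It does not run `sc` itself; it only builds the string to pass to `sc sdset`.

```python
def add_ace_to_sddl(sddl: str, sid: str, rights: str = "CCLCRPWPRC") -> str:
    """Insert an allow ACE for `sid` just before the "S:" part of an SDDL string.

    Assumes the "D:...S:..." layout shown in the steps above. If there is
    no "S:" part, the ACE is appended to the end of the string.
    """
    ace = f"(A;;{rights};;;{sid})"
    if ace in sddl:
        return sddl  # already present; avoid adding a duplicate entry
    sacl_start = sddl.find("S:")
    if sacl_start == -1:
        return sddl + ace
    return sddl[:sacl_start] + ace + sddl[sacl_start:]


current = ("D:(A;;CC;;;AU)(A;;CCLCRPRC;;;IU)(A;;CCLCRPRC;;;SU)"
           "(A;;CCLCRPWPRC;;;SY)(A;;KA;;;BA)(A;;CC;;;AC)"
           "S:(AU;FA;KA;;;WD)(AU;OIIOFA;GA;;;WD)")
sid = "S-1-0-10-200000-30000000000-4000000000-500"  # example SID from the steps above
print(add_ace_to_sddl(current, sid))
```

With the example SID, the printed string matches the updated SDDL shown in the steps above.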
    
        

  • 4. Open ports and add user in all nodes and cluster

    • The OpsRamp gateway must be able to access the cluster and its nodes.
    • Ports 5985 and 5986 must be open.
      Note: By default, WS-Man and PowerShell remoting use port 5985 for connections over HTTP and port 5986 for connections over HTTPS. The user must exist on all nodes and on the cluster.
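
Before installing the integration, it can help to confirm that the two ports are reachable from the gateway host. The following Python sketch only tests whether a TCP connection can be opened; it does not verify credentials or WinRM configuration. The `check_ports` helper and the 3-second timeout are assumptions made for illustration.

```python
import socket
from typing import Dict, Iterable

WINRM_PORTS = (5985, 5986)  # HTTP and HTTPS ports named in the note above


def check_ports(host: str, ports: Iterable[int] = WINRM_PORTS,
                timeout: float = 3.0) -> Dict[int, bool]:
    """Return {port: True/False} depending on whether a TCP connect succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

For example, `check_ports("10.1.1.1")` would be called once for the cluster address and once for each node; the address here is a placeholder.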

Install the integration

  1. From All Clients, select a client
  2. Go to Setup > Integrations > Integrations
  3. From Available Integrations, select Adapter > Windows Fail-over Cluster. The Install Windows Fail-over Cluster Integration popup appears.
    Note: Ensure that the Adapter add-on is enabled at the client and partner levels.
  4. Enter the following information:
    • Name: Name of the integration.
    • Upload Logo: Optional logo for the integration.
    • GateWay Profiles: Select a gateway management profile to associate with the client.
  5. Click Install. The Integration page displays the installed integration.

Note: For the classic gateway, a patch is required on top of gateway 11.0. Reach out to OpsRamp support for patch details.

Configure the integration

  1. In the CONFIGURATION section, click +Add.
  2. On Create Adapter Configuration, enter:
    • Name: Configuration name.
    • IP Address/Host Name: IP address or host name of the target.
    • Notification Alerts: Select TRUE or FALSE.
  3. From the Credentials section, select Custom and enter Username and Password.
    Note: These credentials are required to communicate with the target (cluster).
  4. From the Resource Types & Metrics section, select the metrics and configure the availability and alert conditions for Cluster and Server.
  5. In the Discovery Schedule section, configure how frequently the discovery action should trigger. Select Recurrence Pattern to add one of the following patterns:
    • Minutes
    • Hourly
    • Daily
    • Weekly
    • Monthly
  6. In the Monitoring Schedule section, configure how frequently the monitoring action should trigger.
  7. Click Save.

After saving the configuration, the resources are discovered and monitoring is done as specified in the configuration profile.

The configuration is saved and displayed on the page.

You can also trigger actions such as Discovery and Monitoring manually, or disable the configuration.

The discovered resource(s) are displayed in the Infrastructure page under “Cluster”, with Native Resource Type as Windows Failover Cluster.

The cluster nodes are displayed under Components.

Supported Metrics

Resource Type: Cluster

Metric names, descriptions, PS cmdlets, and sample output:

  • windows_cluster_node_state, windows_cluster_online_nodes_count, windows_cluster_node_health
    Description: Current state of each cluster node; count of online nodes (status Up); cluster health as the percentage of online nodes.
    PS cmdlet: Get-ClusterNode
    Sample output:
        PS C:\Users\Administrator> get-clusternode

        Name        State  Type
        WFCluster1  Up     Node
        WFCluster2  Up     Node

  • windows_cluster_group_state
    Description: State of the cluster group of the failover cluster. Possible values: 0-OFFLINE, 1-ONLINE.
    PS cmdlet: Get-ClusterGroup
    Sample output:
        PS C:\Users\Administrator> Get-ClusterGroup

        Name               OwnerNode   State
        Available Storage  WFCluster2  Online
        Cluster Group      WFCluster1  Online

  • windows_cluster_group_failover_status
    Description: Whenever the owner node, which hosts all the cluster services, goes down, another node automatically becomes the owner node. This metric indicates whether the current node was the owner node when the last failover happened. Possible values: 0-FALSE, 1-TRUE.
    PS cmdlet: NA
    Sample output: NA. This metric is calculated by comparing the owner node from the previous request, which is stored in cache, with the owner node from the current request.

  • windows_cluster_resource_state
    Description: State of the resources within the failover cluster. Possible values: 0-OFFLINE, 1-ONLINE.
    PS cmdlet: Get-ClusterResource
    Sample output:
        PS C:\Users\Administrator> Get-ClusterResource

        Name                  State   OwnerGroup         ResourceType
        Cluster Disk 2        Online  Available Storage  Physical Disk
        Cluster IP Address    Online  Cluster Group      IP Address
        Cluster Name          Online  Cluster Group      Network Name
        Storage Qos Resource  Online  Cluster Group      Storage QoS Policy Manager

  • Windows_cluster_system_disk_Usedspace
    Description: Monitors disk used space in GB.
  • Windows_cluster_system_disk_Utilization
    Description: Monitors disk utilization in percentage.
  • Windows_cluster_system_disk_Freespace
    Description: Monitors disk free space in GB.
  • Windows_cluster_system_network_interface_InTraffic
    Description: Monitors inbound traffic of each interface for Windows devices.
  • Windows_cluster_system_network_interface_OutTraffic
    Description: Monitors outbound traffic of each interface for Windows devices.
  • Windows_cluster_system_os_Uptime
    Description: Time elapsed since the last reboot, in minutes.
  • Windows_cluster_system_cpu_Load
    Description: Monitors the system's load over the last 1, 5, and 15 minutes. It sends the per-CPU-core load average.
  • Windows_cluster_system_cpu_Utilization
    Description: The percentage of elapsed time that the processor spends executing a non-idle thread (this does not include CPU steal time).
  • Windows_cluster_system_memory_Usedspace
    Description: Physical and virtual memory usage in GB.
  • Windows_cluster_system_memory_Utilization
    Description: Physical and virtual memory usage in percentage.
  • Windows_cluster_system_cpu_IdleTime
    Description: The percentage of time that the processor spends idle, waiting for an operation.
  • Windows_cluster_system_network_interface_InPackets
    Description: Monitors inbound packets of each interface for Windows devices.
  • Windows_cluster_system_network_interface_OutPackets
    Description: Monitors outbound packets of each interface for Windows devices.
  • Windows_cluster_system_network_interface_InErrors
    Description: Monitors inbound network errors of each interface for Windows devices.
  • Windows_cluster_system_network_interface_OutErrors
    Description: Monitors outbound network errors of each interface for Windows devices.
  • Windows_cluster_system_network_interface_InDiscords
    Description: Monitors inbound network discards of each interface for Windows devices.
  • Windows_cluster_system_network_interface_OutDiscords
    Description: Monitors outbound network discards of each interface for Windows devices.
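
Several of the cluster metrics above are derived values rather than raw cmdlet output. The Python sketch below shows one way such values could be computed. It is not the integration's actual code: the `node_metrics` and `failover_status` helpers are hypothetical, node states are assumed to arrive as (name, state) pairs already parsed from Get-ClusterNode, and the 0/1/2 result follows the instance values listed under Risks, Limitations & Assumptions. Treating a missing cached owner (first poll) as "no change" is also an assumption.

```python
from typing import List, Optional, Tuple


def node_metrics(nodes: List[Tuple[str, str]]) -> Tuple[int, float]:
    """Return (online node count, health percentage) from (name, state) pairs.

    A node counts as online when its state is "Up".
    """
    online = sum(1 for _name, state in nodes if state.strip().lower() == "up")
    health = (online / len(nodes)) * 100 if nodes else 0.0
    return online, health


def failover_status(previous_owner: Optional[str],
                    current_owner: Optional[str]) -> int:
    """Compare the cached owner node with the current one.

    0 = no change in owner node, 1 = owner node changed, 2 = no owner node.
    """
    if not current_owner:
        return 2
    # Assumption: with nothing cached yet, report "no change".
    if previous_owner is None or previous_owner == current_owner:
        return 0
    return 1


nodes = [("WFCluster1", "Up"), ("WFCluster2", "Up")]
print(node_metrics(nodes))                        # prints: (2, 100.0)
print(failover_status("WFCluster1", "WFCluster2"))  # prints: 1
```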

Resource Type: Server

Metric names, descriptions, PS cmdlets, and sample output:

  • windows_cluster_service_status
    Description: State of each node's Windows OS service named cluster service, which is responsible for the Windows failover cluster.
    PS cmdlet: Get-Service -Name "cluster service"
  • Windows_cluster_node_system_disk_Usedspace
    Description: Monitors disk used space in GB.
  • Windows_cluster_node_system_disk_Utilization
    Description: Monitors disk utilization in percentage.
  • Windows_cluster_node_system_disk_Freespace
    Description: Monitors disk free space in GB.
  • Windows_cluster_node_system_network_interface_InTraffic
    Description: Monitors inbound traffic of each interface for Windows devices.
  • Windows_cluster_node_system_network_interface_OutTraffic
    Description: Monitors outbound traffic of each interface for Windows devices.

  • windows_cluster_node_service_status
    Description: State of each node's Windows OS service named cluster service, which is responsible for the Windows failover cluster. Possible values: 0-STOPPED, 1-RUNNING.
    PS cmdlet: Get-Service -Name "cluster service"
    Sample output:
        PS C:\Users\Administrator> Get-Service -Name "cluster service"

        Status   Name     DisplayName
        Running  ClusSvc  cluster service

  • Windows_cluster_node_system_os_Uptime
    Description: Time elapsed since the last reboot, in minutes.
  • Windows_cluster_node_system_cpu_Load
    Description: Monitors the system's load over the last 1, 5, and 15 minutes. It sends the per-CPU-core load average.
  • Windows_cluster_node_system_cpu_Utilization
    Description: The percentage of elapsed time that the processor spends executing a non-idle thread (this does not include CPU steal time).
  • Windows_cluster_node_system_memory_Usedspace
    Description: Physical and virtual memory usage in GB.
  • Windows_cluster_node_system_memory_Utilization
    Description: Physical and virtual memory usage in percentage.
  • Windows_cluster_node_system_cpu_IdleTime
    Description: The percentage of time that the processor spends idle, waiting for an operation.
  • Windows_cluster_node_system_network_interface_InPackets
    Description: Monitors inbound packets of each interface for Windows devices.
  • Windows_cluster_node_system_network_interface_OutPackets
    Description: Monitors outbound packets of each interface for Windows devices.
  • Windows_cluster_node_system_network_interface_InErrors
    Description: Monitors inbound network errors of each interface for Windows devices.
  • Windows_cluster_node_system_network_interface_OutErrors
    Description: Monitors outbound network errors of each interface for Windows devices.
  • Windows_cluster_node_system_network_interface_InDiscords
    Description: Monitors inbound network discards of each interface for Windows devices.
  • Windows_cluster_node_system_network_interface_OutDiscords
    Description: Monitors outbound network discards of each interface for Windows devices.

Risks, Limitations & Assumptions

  • The possible instance values of the Windows_cluster_group_failover_status metric are:
    • 0 - if there is no change in OwnerNode.
    • 1 - if there is a change in OwnerNode.
    • 2 - If no OwnerNode.
  • The application can handle Critical/Recovery failure alert notifications for the following two cases when the user enables Notification Alerts in the configuration:
    • Connectivity Exception
    • Authentication Exception
  • You can use this SDK application with the cluster gateway without any further cluster gateway dependencies. The classic gateway, however, requires additional libraries to be packaged; they are needed to execute the PowerShell script that this integration approach relies on. As per plan, the functionality will be added to the classic gateway and shipped in the next gateway release.
  • For cluster object discovery and monitoring, the object whose Name equals Cluster Name in the Get-ClusterResource response is considered.
  • For ClusterGroup monitoring, the object whose Name is Cluster Group in the Get-ClusterGroup response is considered.
  • If you enable agent monitoring templates on the Cluster and Node resources, you might see duplicate metrics with different naming conventions.
  • If you enable the same thresholds on the additional OS-level monitoring metrics on both Cluster and Node, you might see two alerts with the same details under their respective metric names (for example, Windows_cluster_system_disk_Utilization and Windows_cluster_node_system_disk_Utilization).
  • Fetching a node's IP address returns multiple IP addresses, including local IPs as well as the actual IP. For example, if the actual node IP is 10.1.1.1, the response contains two IPs: one associated with the cluster (192.168.0.0) and the actual IP. To identify the actual node IP address in the list, it is assumed that the node IP address is in the same subnet as the cluster IP address; that is, if the cluster IP is 10.1.1.1, the node IPs will be 10.1.X.X.
  • OpsRamp lets you provide either the cluster IP address or the host name in the configuration; however, the host name option works only if host name resolution works.
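
The same-subnet assumption for node IP addresses described above can be sketched in Python with the standard `ipaddress` module. This is an illustration only, not the integration's code: the `pick_node_ip` helper is hypothetical, and the default /16 prefix length is an assumption inferred from the 10.1.X.X example; a different prefix may apply in a real network.

```python
import ipaddress
from typing import Iterable, Optional


def pick_node_ip(candidates: Iterable[str], cluster_ip: str,
                 prefix_len: int = 16) -> Optional[str]:
    """Return the first candidate IP in the same subnet as the cluster IP.

    prefix_len=16 mirrors the "10.1.X.X" example; it is an assumption.
    """
    # strict=False lets a host address such as 10.1.1.1 define the network.
    cluster_net = ipaddress.ip_network(f"{cluster_ip}/{prefix_len}", strict=False)
    for candidate in candidates:
        if ipaddress.ip_address(candidate) in cluster_net:
            return candidate
    return None  # no candidate shares the cluster subnet


# Hypothetical addresses: one local IP and one in the cluster's 10.1.X.X range.
print(pick_node_ip(["192.168.0.10", "10.1.4.7"], cluster_ip="10.1.1.1"))
# prints: 10.1.4.7
```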