Multi-node failover cluster Instance

SQL Server clustering describes a collection of two or more physical servers (nodes), connected via LAN, each of which host a SQL server instance and have the same access to shared storage. Clustering SQL servers provides high availability and protection from disasters when a server hosting the SQL Server instance fails.

Always on failover cluster instances

SQL Server Always On Failover Cluster Instances (FCIs) use Windows Server Failover Clustering (WSFC) to provide high availability at the server instance level. An FCI is a single instance of SQL Server that is installed across WSFC nodes to provide high availability for the entire installation of SQL Server.

Always On availability group

Always On availability groups feature a high availability and disaster recovery solution that provides an alternative to database mirroring, and maximizes the availability of a set of user databases for an enterprise. This supports a failover environment for a discrete set of user databases, known as availability databases.

The availability group supports a set of read-write primary databases and one to eight sets of corresponding secondary databases. Optionally, secondary databases can be made available for read-only access and/or some backup operations. Availability group fails over at the level of an availability replica. Failovers are not caused by database issues such as a database becoming suspect due to a loss of a data file, deletion of a database, or corruption of a transaction log.

Pre-requisites

  • The application would require classic gateway 12.0 and above.

  • There is no version dependency for cluster gateway.

  • Powershell cmdlets have the following prerequisites:

    • Windows domain User should be able to do powershell remoting
      Enable-PSRemoting -Force
    • Windows domain user should be added to “Remote Management users” group net localgroup “Remote Management Users” /add < user >
    • Windows domain user should be added to “Performance monitor users” group net localgroup “Performance monitor users” /add < user >
  • Granting Remote DCOM Rights

    • To grant users DCOM rights, log on to each monitored system and complete the following procedure:
      • Go to command prompt and enter dcomcnfg
      • Navigate to component services > computers > My computer and then right click and select Properties. Then go to the COM Security tab.
      • Under Access Permissions, go to edit limits and add the domain non-admin user and enable both local and remote access then click OK.
      • In Launch and Activation permissions, go to edit limits, add the domain non-admin user and check all boxes , and click OK.
  • Granting Remote WMI Rights

    • To give the user remote WMI rights, log on to each system to be monitored and complete the following procedure:
      • Go to “computer management” and select “WMI Control” under ‘Services and Applications’.
      • Right-click WMI control and go to properties. Then select the Security tab.
      • Select root and click Security.
      • Add the domain non-admin user, and check the boxes for execute methods, enable account, remote enable, and reas security.
      • Click Advanced, select added non-admin domain user and click edit. Then for “Applies” to select “namespace and sub namespaces” in the dropdown.
      • Click OK 3 times .
  • For monitoring services like windows cluster service and other mssql services (sql server, sql agent , sql browser etc.), windows domain users have to be a part of “Local Administrators group” as we are using win32_Service class for fetching the details.

  • Windows domain user should be granted read-only access to Cluster
    Grant-ClusterAccess -User <domain\user> -Readonly

SQL authorization permissions:

  • For monitoring some metrics, we are using JDBC. For JDBC connections we are supporting mixed authentication ( both NLTM and SQL server authentication)
  • The minimum privileges for SQL auth are:
    • server role: public - This server role is available by default
    • securables: We need to enable some securables for connecting to SQL instances,connecting to all databases, and also some read-only permissions like below:

 - CONNECT ANY DATABASE
      GRANT CONNECT ANY DATABASE to < user >
 - CONNECT SQL
      GRANT CONNECT SQL to < user >
 - SELECT ALL USER SECURABLES
      GRANT SELECT ALL USER SECURABLES to < user >
 - VIEW ANY DATABASE
      GRANT VIEW ANY DATABASE to < user >
 - VIEW ANY DEFINITION
      GRANT VIEW ANY DEFINITION to < user >
 - VIEW SERVER STATE
      GRANT VIEW SERVER STATE to < user >

Install the integration

  1. From All Clients, select a client.
  2. Go to Setup > Integrations > Integrations.
  3. From Available Integrations, select Adapter >MS SQL Cluster. The Install MS SQL Cluster Integration popup appears.
MS SQL Install
  1. Enter the following information:
    • Name: Name of the integration
    • Upload Logo: (Optional) logo for the integration.
    • GateWay Profiles: Select a gateway management profile to associate with the client.
  2. Click Install. The Integration page displays the installed integration.
MS SQL Install

Configure the integration

  1. In CONFIGURATION section, click +Add.
  2. On Create Adapter Configuration, enter:
    • Name: Configuration name.
    • IP Address/Host Name: IP address or host name of the target.
    • Notification Alerts: Select TRUE or FALSE.
      Notes:
      • By default FALSE is selected.
      • If you select TRUE, App will handle Critical/Recovery failure alert notifications for Connectivity, Authentication Exceptions.
    • MSSQL Cluster Type : Select AvailabilityGroup or FailoverClusterInstance from the dropdown.
    • SQL Server Authentication Mode: Select WindowsAuthentication or SqlServerAuthentication from the dropdown.
  3. From the Windows Cluster Credentials section, select Custom and enter Username and Password.
  4. From the SQL Server Credentials section, select Custom and enter Username and Password.
  5. From the Resource Types & Metrics section, select the metrics and configure for availability and alert conditions, for Cluster, Server, Group, Storage, & SQL Server.
  6. In the Discovery Schedule section, configure how frequently the discovery action should trigger. Select Recurrence Pattern to add one of the following patterns:
    • Minutes
    • Hourly
    • Daily
    • Weekly
    • Monthly
  7. In the Monitoring Schedule section, configure how frequently the monitoring action should trigger.
  8. Click Save.
MS SQL Configure

After saving the configuration, the resources are discovered and monitoring is done as specified in the configuration profile.

The configuration is saved and displayed on the page.

MS SQL Configure

You can perform the actions manually, like Discovery, Monitoring or Disable the configuration.

Resource information

  1. To view the resource information, go to Infrastructure > Resources > Cluster.
  2. Use the search function to find your resource and then click on it.

View resource attributes

The discovered resource(s) are displayed under Clusters. In this page you will get the basic information about the resources such as: Resource Type, Native Resource Type, Resource Name, IP Address etc.

MS SQL Configure

Note: The children cluster nodes are displayed under Components, click any children node and navigate to Metrics to view the individual metrics to the children node.

MS SQL Configure

View resource metrics

To confirm MS SQL Cluster monitoring, review the following:

  • Metric graphs: A graph is plotted for each metric that is enabled in the configuration.
  • Alerts: Alerts are generated for metrics that are configured as defined for integration.
MS SQL Configure

Supported metrics

Resource TypeMetric NamesDisplay NameUnit
Clusterwindows_cluster_node_state

State of all nodes of the cluster such as up or down. Possible values 0-DOWN,1-UP
Windows Cluster Node State
windows_cluster_online_nodes_count

Count of nodes which are in Online state
Windows Cluster Online Nodes Countcount
windows_cluster_node_health

Cluster health - percentage of online nodes
Windows Cluster Node Health%
Serverwindows_cluster_node_service_status

"State of each node's windows os service named cluster service which is responsible for windows failover cluster. Possible values 0-STOPPED,1-RUNNING 2-Start Pending, 3-Stop Pending, 4-Continue Pending, 5-Pause Pending, 6-Paused, 7-Unknown"
Windows Cluster Node Service Status
Windows_cluster_node_system_os_Uptime

Time lapsed since last reboot in minutes
System Uptimem
Windows_cluster_node_system_cpu_Load

Monitors the system's last 1min, 5min and 15min load. It sends per cpu core load average.
System CPU Load
Windows_cluster_node_system_cpu_Utilization

The percentage of elapsed time that the processor spends to execute a non-Idle thread(This doesn't includes CPU steal time)
System CPU Utilization%
Windows_cluster_node_system_memory_Usedspace

Physical and virtual memory usage in GB
System Memory Used SpaceGB
Windows_cluster_node_system_memory_Utilization

Physical and virtual memory usage in percentage.
System Memory Utilization%
Windows_cluster_node_system_cpu_IdleTime

Monitors cpu time in percentage spent in various program spaces. User - The processor time spent running user space processes System - The amount of time that the CPU spent running the kernel. IOWait - The time the CPU spends idle while waiting for an I/O operation to complete Idle - The time the processor spends idle Steal - The time virtual CPU has spent waiting for the hypervisor to service another virtual CPU running on a different virtual machine. Kernal Time Total Time
System CPU IdleTime%
Windows_cluster_node_system_network_interface_InTraffic

Monitors In traffic of each interface for windows Devices
System Network In TrafficKbps
Windows_cluster_node_system_network_interface_OutTraffic

Monitors Out traffic of each interface for windows Devices
System Network Out TrafficKbps
Windows_cluster_node_system_network_interface_InPackets

Monitors in Packets of each interface for windows Devices
System Network In packetspackets/sec
Windows_cluster_node_system_network_interface_OutPackets

Monitors Out packets of each interface for windows Devices
System Network out packetspackets/sec
Windows_cluster_node_system_network_interface_InErrors

Monitors network in errors of each interface for windows Devices
System Network In ErrorsErrors per Sec
Windows_cluster_node_system_network_interface_OutErrors

Monitors network out errors of each interface for windows Devices
System Network Out ErrorsErrors per Sec
Windows_cluster_node_system_network_interface_InDiscords

>Monitors Network in discards of each interface for windows Devices
System Network In discardspsec
Windows_cluster_node_system_network_interface_OutDiscords

Monitors network Out Discards of each interface for windows Devices
System Network Out discardspsec
windows_cluster_node_sqlservices_state

"Monitors sql services state. Possible values 0-STOPPED,1-RUNNING 2-Start Pending, 3-Stop Pending, 4-Continue Pending, 5-Pause Pending, 6-Paused, 7-Unknown"
SQL services state
windows_cluster_node_sqlprocesses_cpuUsage

Monitors sql processes cpu usage in seconds
SQL processes CPU usages
windows_cluster_node_sqlprocesses_handles

Monitors sql processes handles count
SQL processes handlescount
Groupwindows_mssql_cluster_group_running_status

"State of cluster group of the failover cluster. Possible values 0-OFFLINE,1-ONLINE 2-Initializing, 3-Failed, 4-Pending, 5-OnlinePending, 6-OfflinePending, 7-Unknown"
Windows MSSQL Cluster Group Running Status
windows_mssql_cluster_group_failover_status

Whenever the owner node which is hosting all the cluster services, goes down any other node becomes owner node automatically. This metric explains whether the current node was the owner node, when the last failover happened. Possible instance values, 0-if there is no change in OwnerNode, 1-If there is a change in OwnderNode, 2 If no OwnerNode
Windows MSSQL Cluster Group Failover Status
windows_mssql_cluster_group_listener_status

"Monitors mssql cluster group listeners status. Possible values 0-OFFLINE,1-ONLINE 2-Initializing, 3-Failed, 4-Pending, 5-OnlinePending, 6-OfflinePending, 7-Unknown"
Windows MSSQL cluster group listener status
windows_mssql_cluster_AG_instance_status

Monitors mssql availability group cluster instance status. Possible values 0-OFFLINE,1-ONLINE 2-Initializing, 3-Failed, 4-Pending, 5-OnlinePending, 6-OfflinePending, 7-Unknown. This metric is applicable only for AAG cluster.
windows MSSQL cluster availability group instance status
Storagewindows_mssql_cluster_disk_AvailabilityStatus

"Monitors system disk availability status. Possible values 0-OFFLINE,1-ONLINE 2-Initializing, 3-Failed, 4-Pending, 5-OnlinePending, 6-OfflinePending, 7-Unknown"
System Disk availability status
windows_mssql_cluster_disk_FailoverStatus

Whenever the owner node which is hosting storage disk, goes down any other node becomes owner node automatically. This metric explains whether the current node was the owner node, when the last failover happened. Possible instance values, 0-if there is no change in OwnerNode, 1-If there is a change in OwnderNode, 2 If no OwnerNode
System Disk Failover status
windows_mssql_cluster_disk_Usedspace

Monitors disk used space in GB
System Disk UsedSpaceGB
windows_mssql_cluster_disk_Utilization

Monitors disk utilization in percentage
System Disk Utilization%
windows_mssql_cluster_disk_Freespace

Monitors the Free Space usage in GB
System Disk FreeSpaceGB
SQL_SERVERwindows_mssql_cluster_instance_RunningStatus

"Monitors MSSQL failover cluster instance running status. Possible values 0-OFFLINE,1-ONLINE 2-Initializing, 3-Failed, 4-Pending, 5-OnlinePending, 6-OfflinePending, 7-Unknown"
MSSQL Failover Cluster Instance Running Status
windows_mssql_cluster_instance_FailoverStatus

Monitors MSSQL cluster instance failover status. Possible instance values, 0-if there is no change in OwnerNode, 1-If there is a change in OwnderNode, 2 If no OwnerNode. This metric is applicable only for FCI cluster.
MSSQL Failover Cluster Instance Failover Status
windows_mssql_cluster_avgLockWaitTime

The average amount of wait time (milliseconds) for each lock request that resulted in a wait.
Average Lock Wait Timems
windows_mssql_cluster_bufferCacheHitRatio

Percentage of pages that were found in the buffer pool without having to incur a read from disk.
Buffer Cache Hit Ration%
windows_mssql_cluster_sqlCacheMemory

Total amount of dynamic memory the server is using for the dynamic SQL cache
SQL Cache MemoryKB
windows_mssql_cluster_datafileSize

The cumulative size of all the data files in the database.
Data File SizeKB
windows_mssql_cluster_databasesCount

Show the count of total number of databases
Databases Countcount
windows_mssql_cluster_fullScans

Number of unrestricted full scans. These can either be base table or full index scans.
SQL Server Full Scanspsec
windows_mssql_cluster_databasepageReads

Number of physical database page reads issued.
Databases Page Readsrps
windows_mssql_cluster_databasepageWrites

Number of physical database page writes issued.
Database Page Writeswps
windows_mssql_cluster_serverUptime

Monitors the uptime (in days) of the database server
Server UptimeDays
windows_mssql_cluster_totalServerMemory

Total amount of dynamic memory the server is currently consuming
Total Server MemoryKB
windows_mssql_cluster_databaseTransactions

The number of transactions started for the database.
Database Transactionspsec
windows_mssql_cluster_databaseStatus

0 = ONLINE 1 = RESTORING 2 = RECOVERING SQL Server 2008 and later 3 = RECOVERY_PENDING SQL Server 2008 and later 4 = SUSPECT 5 = EMERGENCY SQL Server 2008 and later 6 = OFFLINE SQL Server 2008 and later 7 = COPYING Azure SQL Database Active Geo-Replication 10 = OFFLINE_SECONDARY Azure SQL Database Active Geo-Replication
Database Status
windows_mssql_cluster_databaseActiveUsers

Monitors the number of active user transactions per database
Database Active Userscount
windows_mssql_cluster_datafilesFreeSpace

Monitors datafiles free space regardless of auto-growth
Data Files Free Space%
windows_mssql_cluster_databaseFreeSpace

Monitors Database Freespace in Percentage
Database Free Space%
windows_mssql_cluster_dbLogCacheHitRatio

Monitors MSSQL database log cache-hit ratio
DB Log Cache Hit Ratio%
windows_mssql_cluster_dbLogfilesFreeSpace

Monitors MSSQL database LogFiles free space
DB Logfiles Free Space%
windows_mssql_cluster_minsSinceLastLogBackup

Monitors Database transaction log backup in minutes
Mins since Last Log Backupm
windows_mssql_cluster_minsSinceLastLogFullBackup

Monitors Database Backup status in Minutes since Last Full Backup
Mins since Last Log Full Backupm
windows_mssql_cluster_fileGroupFreespaceWithAvailableDisk

Monitor MSSQL DB Filegroup Freespace with Available Disk
File Group Free Space with Available Disk%
windows_mssql_cluster_dbFileGroupFreespace

Monitors MSSQL DB Filegroup Freespace
DB File Group Free Space%
windows_mssql_cluster_longRunningQueriesLast5minCount

Monitors MSSQL Database Long Running queries count Last 5 minutes
Long Running Queries Last 5min Countcount
windows_mssql_cluster_serverIOBusy

Monitors MSSQL DB Server IO Busy
Server IO Busy%
windows_mssql_cluster_dbWorkspaceMemory

Monitors MSSQL Database Workspace Memory in Percentage
DB Workspace Memory%
windows_mssql_cluster_dbInstanceCpuUtilization

Monitors MSSQL Database Instance CPU Utilization in percentage
DB Instance CPU Utilization%
windows_mssql_cluster_logfileFreespaceWithAvailableDisk

Monitors MSSQL DB Logfile Freespace with Available Disk in percentage
Logfile Free Space with Available Disk%
windows_mssql_cluster_instanceFreeConnections

Monitors MSSQL Database Instance Free Connections in percentage
Instance Free Connections%
windows_mssql_cluster_daysSinceLastFullBackup

MSSQL DB Backup Days Since Last Full Backup
Days since Last Full BackupDays
windows_mssql_cluster_daysSinceLastLogBackup

MSSQL DB Backup Days Since Last Log Backup
Days since Last Log BackupDays
windows_mssql_cluster_daysSinceLastDifferentialBackup

MSSQL DB Backup Days Since Last Differential Backup
Days since Last Differential BackupDays
windows_mssql_cluster_alwaysOnDBAvailabilitySyncHealth

MSSQL AlwaysOn DataBase Availability Synchronization Health - Below are the possible states:0 : NOT_HEALTHY 1 : PARTIALLY_HEALTHY 2 : HEALTHY. This metric is applicable only for AAG cluster.
AlwaysOn DB Avalability Sync Health
windows_mssql_cluster_alwaysOnAGSyncHealth

MSSQL AlwaysOn Availability Group Synchronization Health - Below are the possible values: 0 : NOT_HEALTHY 1 : PARTIALLY_HEALTHY 2 : HEALTHY. This metric is applicable only for AAG cluster.
AlwaysOn AG Sync Health
windows_mssql_cluster_alwaysOnListenerState

MSSQL AlwaysOn Listener State - Below are the possible values: 0 : OFFLINE 1 : ONLINE 2 : ONLINE_PENDING 3 : FAILED. This metric is applicable only for AAG cluster.
AlwaysOn Listener State
windows_mssql_cluster_cpuBusy

MSSQL CPU Busy
CPU Busymicrosec
windows_mssql_cluster_cpuIdle

MSSQL CPU IDLE
CPU Idlemicrosec
windows_mssql_cluster_cpuIOBusy

MSSQL CPU IoBusy
CPU IO Busymicrosec
windows_mssql_cluster_alwayson_database_replicaState

"It monitor the MSSQL Always On Database Replica synchronization state along with role description.Below are the possible states:

PRIMARY_NOT SYNCHRONIZING - 0 PRIMARY_SYNCHRONIZING - 1 PRIMARY_SYNCHRONIZED - 2 PRIMARY_REVERTING - 3 PRIMARY_INITIALIZING - 4 SECONDARY_NOT SYNCHRONIZING - 5 SECONDARY_SYNCHRONIZING - 6 SECONDARY_SYNCHRONIZED - 7 SECONDARY_REVERTING - 8 SECONDARY_INITIALIZING - 9 This metric is applicable only for AAG cluster."

MSSQL AlwaysOn availability Group Replica State

Supported product versions

  • The app is validated on Windows Server 2019 and Microsoft SQL Server 2019 (RTM) - 15.0.2000.5 (X64) for both MSSQL failover instance & availability group clusters.

Risks, Limitations & Assumptions

  • This sdk app only supports mssql-cluster which is built upon windows failover cluster.
  • You have to provide credentials in “SQL Server Credentials” section even though you opted for “WindowsAuthentication” as authentication mode while adding configuration.
  • For Cluster object discovery and monitoring implementation, we are considering the object which has “Name” equals to Cluster Name in Get-ClusterResource response
  • For ClusterGroup monitoring implementation, we are considering the object which has Name as Cluster Group in Get-ClusterGroup response All the failover status metrics possible instance values are 0 - if there is no change in OwnerNode, 1 - If there is a change in OwnerNode, 2 - If no OwnerNode
  • App can handle Critical/Recovery failure alert notifications for below 2 cases of when you enable Notification Alerts in configuration
    • Connectivity Exception
    • Authentication Exception
  • If you enable agent monitoring templates on the Cluster/Node resource, you might see the duplicate metrics with different naming conventions
  • While trying to fetch the node ip address we receive multiple node ips, which include many local ips and actual ips (example : lets say actual node ip is 10.1.1.1 when trying to fetch the details we will receive two ips one associated with custer(192.168.0.0) and other is the actual ip). to identify the actual node ip address from the list of ip addresses received we are assuming that node ip address is part of the same subnet of cluster ip address. meaning if cluster ip is 10.1.1.1 then node ips will be 10.1.X.X.
  • We have provided the provision to give Cluster Ip Address OR HostName in configuration, but hostName provision will work only if the host name resolution works.
  • Component level threshold configuration is not possible
  • Resource level metric threshold customization and frequency setting are not possible
  • Usability issues in app configuration page while adding/editing
  • Optional configuration parameters cannot be defined
  • App upgrade is manual process without version change