Alert Correlation

Machine learning (ML) is used to find repeated alert sequence patterns that can be correlated. You can specify an alert correlation policy that controls how alerts are correlated.

Approach to alert correlation

Alert correlation best practice recommends a four-step approach.

Step 1: Enable the correlation policy in observed mode

Create one policy in Observed mode. The policy performs ML correlation based on the alert metric sequence. If service groups are configured in the system, also add the service groups/device groups identical condition to the policy. If not, leave that setting empty.

Step 2: Observe the correlation

Full alert transparency is provided so you can see all of the alerts that were involved in an alert correlation sequence. Observe the alert sequence patterns and correlation results to determine if alerting accurately reflects the anomalies reported in your environment and supports recovery from fault conditions. When you are satisfied that alerts are accurately reported, you can fully enable alert correlation or fine-tune the alert correlation policy as needed.

Step 3: Fine-tune the correlation policy

If the observed alert correlation policy produces alert notifications that accurately reflect system fault conditions, you might not need to fine-tune the policy. This is usually the case, because out-of-the-box alert correlation typically handles alert sequences successfully. Some environments impose unique requirements on alert correlation, so you might need to fine-tune the correlation. Possible scenarios and solutions include:

Observe the alert sequence views. If existing data indicates that there are more sequences than ML discovered, add regex-based sequences to the training file.
If correlation should apply to alerts from a single device, such as CPU, memory, and disk alert sequences, configure the policy to require an identical resource.
For generic alerts, more detailed information might need to be extracted from the alert subject or description to refine the sequence pattern.
An example is an SNMP trap alert, which has an SNMP Trap metric that does not provide specific information about the problem. More specific problem information is embedded in the alert subject. Use the Alert Enrichment policy to extract problem area information from the subject or description. ML interprets the problem area sequence instead of the metric sequence.

Step 4: Fully enable alert correlation

If there are multiple correlation policies, give the more direct policies higher precedence. For example, an ML correlation policy that requires an identical resource, service group, or device group should rank above policies that use topology.

Switch the alert correlation policy mode from Observed to ON to fully enable alert correlation.

Alert correlation factors

Several factors affect alert correlation.

Co-occurrence

Co-occurrence clusters alerts based on the time they are received. The gap between adjacent alerts determines where a sequence pattern starts and ends, with a default gap of five minutes.
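
The gap rule can be illustrated with a minimal Python sketch. It is not the product's implementation: the function name `cluster_by_gap` and the choice to treat a gap of exactly five minutes as still belonging to the same sequence are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def cluster_by_gap(timestamps, max_gap=timedelta(minutes=5)):
    """Group timestamps so that adjacent alerts in a group are at most max_gap apart."""
    clusters = []
    for ts in sorted(timestamps):
        # Extend the current cluster while the gap to the previous alert is small enough.
        if clusters and ts - clusters[-1][-1] <= max_gap:
            clusters[-1].append(ts)
        else:
            # A larger gap ends the previous sequence and starts a new one.
            clusters.append([ts])
    return clusters

base = datetime(2024, 1, 1, 12, 0)
# Alerts received 0, 2, 6, 20, and 23 minutes after noon.
received = [base + timedelta(minutes=m) for m in (0, 2, 6, 20, 23)]
groups = cluster_by_gap(received)
print([len(g) for g in groups])  # [3, 2]
```

The 14-minute silence between the third and fourth alerts exceeds the gap, so the alerts fall into two candidate sequences.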

Sites

When you create a resource with site information, alert correlation automatically checks that correlated alerts are on the same site.

Problem area

The problem area can be extracted from the alert subject or description using Alert Enrichment. This overrides the default metric name setting in the alert. The updated problem area is subsequently used in ML sequence patterns.
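
As a rough illustration of what such an extraction does, the following Python sketch pulls a problem area out of an SNMP trap subject with a regular expression and falls back to the metric name when nothing matches. The pattern, the subject format, and the `enrich` helper are hypothetical; actual Alert Enrichment policies are configured in the UI, not in code.

```python
import re

# Hypothetical pattern: the word that follows "SNMP Trap:" names the problem.
PROBLEM_AREA_PATTERN = re.compile(r"SNMP Trap:\s*(?P<problem_area>\w+)")

def enrich(alert):
    """Return a copy of the alert with problemArea taken from the subject when possible."""
    enriched = dict(alert)
    match = PROBLEM_AREA_PATTERN.search(alert.get("subject", ""))
    # Without a match, the problem area stays at its default, the metric name.
    enriched["problemArea"] = match.group("problem_area") if match else alert["metric"]
    return enriched

trap = {"metric": "SNMP Trap", "subject": "SNMP Trap: linkDown on interface Gi0/1"}
cpu = {"metric": "cpu.utilization", "subject": "CPU utilization is high"}
print(enrich(trap)["problemArea"])  # linkDown
print(enrich(cpu)["problemArea"])   # cpu.utilization
```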

The Alert Enrichment policy is configurable in the UI when the Alert Enrichment add-on is added for the partner or client. After creating or updating Alert Enrichment policies, the ML model needs to be retrained.

Alert Enrichment applies only to new alerts, not to existing ones. Because it takes time for new data to be enriched and for new patterns to be inferred, the impact of enrichment is not immediately evident.

Alert sequencing

ML uses alert sequencing, which utilizes the alert problem area and component attributes of discovered sources. By default, the component is not taken into account for alert-level integrations.

A training file can be used to train the model with known sequences. The training file is needed only when you want to add alert sequences to those already learned or explicitly omit specific alert sequences.

Policy precedence order

When defining an alert correlation policy, consider the order of evaluation as specified by the precedence value. Any filter clauses must also be considered when specifying precedence.
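
The interaction between precedence and filter clauses can be sketched in Python. This is an illustrative model only: it assumes that a lower precedence number is evaluated first and that the first policy whose filter matches handles the alert. Neither assumption is confirmed by this page, and the `Policy` class and policy names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Policy:
    name: str
    precedence: int                  # assumption: lower number is evaluated first
    matches: Callable[[dict], bool]  # stands in for the policy's filter clauses

def first_matching_policy(policies, alert) -> Optional[Policy]:
    """Return the first policy, in precedence order, whose filter accepts the alert."""
    for policy in sorted(policies, key=lambda p: p.precedence):
        if policy.matches(alert):
            return policy
    return None

policies = [
    Policy("topology-based", 2, lambda a: True),
    Policy("same-resource", 1, lambda a: a.get("metric") in {"cpu", "memory", "disk"}),
]
print(first_matching_policy(policies, {"metric": "cpu"}).name)   # same-resource
print(first_matching_policy(policies, {"metric": "ping"}).name)  # topology-based
```

Under these assumptions, swapping the two precedence values would let the catch-all topology policy claim every alert, which is why a broad filter at an early precedence needs care.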

Summary of alert correlation mechanisms

Alert correlation incorporates several correlation mechanisms.

View Alert Correlation Policies

Ensure that you have selected a client from the ALL Clients list.
Go to Setup > Alerts > Alert Correlation.
You can select the number of alert correlation policies to display per page.

If one or more correlation policies are enabled for ML correlation, a Detected Alert Sequence Patterns option is provided on the Alert Correlation Policy page.

You can choose a client from the list of clients on the Alert Correlation Policy page to view the ML-detected alert sequence patterns for that client.

Each correlation policy contains the following information:

Attribute	Description
Policy Name	Name of the alert correlation policy.
Created By	Name of the user who created the policy.
Updated By	Name of the user who last modified the policy.
Processed Inferences	Number of inferences processed.
Precedence	Indicates the priority of the policy.
ML Status	Indicates the Machine Learning status.
Mode	You can select supported policy modes from the drop-down list.
Review Status	Displays the progress of the policy when the policy is configured with review mode. When you select the Review option under the Observed mode, the correlation policy runs on the alerts from the last 7 days and shows the results.

Policy modes

The following policy modes are supported:

Policy Mode	Description
ON	The policy drives automated actions on alerts.
OFF	The policy is inactive and does not affect alerts. You can use this mode to review a newly defined policy before choosing one of the other modes.
Recommend	The policy creates a recommendation for actions that you should take on the alert. Recommendations are based on learned patterns in historical alerts. The recommendation includes a link to take the action.
Observed	This mode lets you simulate a policy without affecting alerts. The policy creates an observed alert, which simulates the original alert. The observed alert shows the actions that would be taken on the original alert if the policy were in ON mode. The observed alert includes a link to the original alert.
Recommend and Observed modes apply to incident actions.

Filter criteria setting

This setting filters alerts that you do not want correlated with other alerts covered by the same policy.

Inference subject setting

By default, an inference uses the subject of the alert with the earliest created date. You can optionally specify a subject to override the default subject.

Below are the supported tokens in the Inference subject field:

Type	Tokens
Alert	subject metric component monitorName templateName currentState integrationName clientTechnology description templateDescription uniqueId problemArea
Resource	resource.aliasName resource.hostName resource.name resource.resourceName resource.ipAddress resource.resourceType resource.location.name
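
To show how such tokens might resolve against an alert, here is a small Python sketch. The `${token}` placeholder syntax is an assumption made for the example (this page lists the token names but not how they are written in the field), and `render_subject` is a hypothetical helper, not a product API.

```python
import re

# Assumed placeholder syntax: ${token}, where token may be dotted (resource.hostName).
TOKEN = re.compile(r"\$\{([A-Za-z.]+)\}")

def lookup(context, dotted):
    """Walk nested dictionaries along a dotted token name; return None if any part is missing."""
    value = context
    for part in dotted.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value

def render_subject(template, alert):
    """Replace each known token with its value; leave unknown tokens untouched."""
    def replace(match):
        value = lookup(alert, match.group(1))
        return match.group(0) if value is None else str(value)
    return TOKEN.sub(replace, template)

alert = {
    "metric": "cpu.utilization",
    "currentState": "Critical",
    "resource": {"hostName": "web-01", "location": {"name": "Austin"}},
}
subject = render_subject(
    "${currentState}: ${metric} on ${resource.hostName} (${resource.location.name})",
    alert,
)
print(subject)  # Critical: cpu.utilization on web-01 (Austin)
```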

Learned sequences

The correlation algorithm correlates alerts that occur near the same time and learns common alert sequences using historical data.

The continuous learning option causes the learning models to be continuously updated using recent data.

Trained sequences

Using the advanced option, you can train the alert correlation algorithm to correlate known alert sequences. A training file is used to provide training data.

Time-based sequences

Time-based sequences correlate alerts that occur in the same time interval. For example, you can use the within time window setting to correlate all alerts that occur within a specified time range, such as five minutes to four hours.
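
A minimal Python sketch of a fixed time window follows. It assumes the window is anchored at the first alert of each group and that the boundary is inclusive; how the product anchors the window is not stated here, so treat this as one possible interpretation.

```python
from datetime import datetime, timedelta

def group_within_window(timestamps, window=timedelta(minutes=5)):
    """Group timestamps that fall within `window` of the first alert in the group."""
    groups = []
    for ts in sorted(timestamps):
        # Compare against the first alert of the current group, not the previous alert.
        if groups and ts - groups[-1][0] <= window:
            groups[-1].append(ts)
        else:
            groups.append([ts])
    return groups

start = datetime(2024, 1, 1, 9, 0)
alerts = [start + timedelta(minutes=m) for m in (0, 3, 5, 7, 11)]
sizes = [len(g) for g in group_within_window(alerts)]
print(sizes)  # [3, 2]
```

In this example no two adjacent alerts are more than five minutes apart, so a gap-based co-occurrence rule with a five-minute gap would keep all five alerts in one sequence, whereas the fixed window closes the first group five minutes after it opens.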

Learning reinforcement

Learning reinforcement applies additional criteria in making correlation decisions on learned, trained, and time-based sequences.

Learning reinforcement can use topological relationships. Alerts that occur close in time and come from connected resources are usually related to the same underlying cause. For example, a failed switch can cause a cascade of alerts on downstream servers and applications. When deciding whether to correlate a sequence of alerts into an inference, a higher weight is applied to sequences whose associated resources are topologically related.
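
The idea of weighting by topology can be sketched in Python. The scoring formula below (a base weight plus a bonus proportional to the fraction of directly linked resource pairs) is invented for illustration; the actual weighting used by the correlation algorithm is not described on this page.

```python
from itertools import combinations

def topology_score(resources, links, base=1.0, bonus=1.0):
    """Return base plus a bonus scaled by the share of resource pairs that are directly linked."""
    unique = sorted(set(resources))
    pairs = list(combinations(unique, 2))
    if not pairs:
        return base
    linked = sum(1 for a, b in pairs if frozenset((a, b)) in links)
    return base + bonus * linked / len(pairs)

# Undirected links: a switch with two downstream servers.
links = {
    frozenset(("switch-1", "server-a")),
    frozenset(("switch-1", "server-b")),
}
connected = topology_score(["switch-1", "server-a", "server-b"], links)
unrelated = topology_score(["server-x", "server-y"], links)
print(connected > unrelated)  # True
```

A sequence whose alerts come from the switch and its servers scores higher than a sequence from two unconnected servers, so it is the stronger candidate for a single inference.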

Attribute similarity criteria can also be used to correlate sequences. Alerts can be related to the same underlying cause if they:

Occur at about the same time.
Have identical or similar attributes.

For example, application failure alerts can generate multiple alerts that have a similar subject.

Use the alert similarity setting to specify alert similarity criteria.
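
As an illustration of subject similarity, the following Python sketch compares alert subjects with the standard library's `difflib.SequenceMatcher`. The 0.8 threshold and the `subjects_similar` helper are assumptions for the example; the similarity measure behind the alert similarity setting is not specified here.

```python
from difflib import SequenceMatcher

def subjects_similar(a, b, threshold=0.8):
    """Return True when two subjects are at least `threshold` similar (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

first = "Application checkout-svc failed on node-1"
second = "Application checkout-svc failed on node-2"
other = "Disk usage above 90% on db-01"

print(subjects_similar(first, second))  # True
print(subjects_similar(first, other))   # False
```

The two application failure subjects differ by a single character and are treated as similar, while the disk alert is not.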