Monitor-generated events indicate the presence of device or resource anomalies. Alerting helps you interpret and act on events by providing a single aggregation and response system for all events. Events can also be produced by diagnostic and third-party tools.

Much of the work of interpreting and responding to an alert is automated. Alerting correlates related-cause alerts, automatically suppresses redundant alerts, notifies operators, and creates incident tickets for alerts that need attention.

The following figure shows the alert handling workflow:

Figure: Event Management

Alert terminology

The following terms are used in alert management:

ID: Sequential number that uniquely identifies an alert or inference. In the alert list, the ID field also indicates the alert state using color-coding.
Subject: Alert description summary, which includes metrics associated with the alert.
Description: Brief description of the alert source and cause. This might include metrics with threshold crossings, monitor description, device type, template name, group, site, service level, and component.
Source: Platform or monitoring tool that generated the alert.
Metric: Service name of a threshold-crossing alert.
First Alert Time: Time when monitoring started for a resource. An alert is generated to provide notification that monitoring started for the resource.
Alert Updated Time: Most recent alert time. Updated when an alert is unsuppressed manually or with the alert First Response policy.
Elapsed Time: Elapsed time since the first alert was generated.
Action/Status: Current alert status and most recent alert action.
Last Updated Time: Time when alert status was last updated.
Device Type: Device type associated with an alert.
Resource: Resource name associated with the alert.
Repeated Alerts: Number of duplicate alerts generated by the resource.
Partner: Name of the partner owning the resource.
Incident ID: Unique incident ID associated with the alert. Alerts are associated with incidents by:
  • Manually creating an incident.
  • Escalating an alert as an incident.
Entity Type: Category of the source that generated the alert:
  • Resource: Alerts originating from managed resources.
  • Service: Service mapped to the resources.
  • Integration: Alerts originated by monitoring the installed integrations.
  • Client: Alerts not generated by monitoring but which represent logical clustered alerts, for example, correlated or grouped inference alerts, or RCA alerts based on resource dependency mapping.

Alert types

There are two alert types:

  • Forecast alerts
  • Change detection alerts

Forecast alerts

When a monitor triggers a forecast alert, the alert is displayed on the Alerts page. The following example shows the forecast alert display:

Figure: Forecast Alerts

The details and updates on a forecast alert are displayed in the comments section:

Figure: Forecast Alert Details

Change detection alerts

Change detection helps monitor sudden changes in metric behavior, especially on metrics with an indefinite threshold. This feature utilizes machine learning with a sliding window where change is calculated as soon as a data point becomes available.

Assign the change detection parameters to the required monitors or metrics to detect changes and trigger alerts when a significant change is detected.

Key considerations include:

  • If a significant change is detected continuously, the new alert is appended to the existing alert.
  • If no change is detected after two hours, the alert is automatically resolved.
  • A minimum of four hours of real-time data is required to process change detection.
  • When a monitor triggers a change detection alert, the alert is displayed on the Alerts page.

A sliding window of two hours is used to calculate the change score percentile and determine if an alert should be triggered.
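The sliding-window idea above can be sketched as follows. This is only an illustration under stated assumptions: the scoring function (shift of the window mean relative to the spread of earlier data), the 95th-percentile cutoff, the minimum score, and all function names are hypothetical, because the platform's actual model is not described here. Only the two-hour window and the four-hour minimum come from the text.

```python
from statistics import mean, pstdev

WINDOW = 120        # two-hour sliding window, one sample per minute (from the text)
MIN_HISTORY = 240   # four hours of data required before scoring (from the text)


def change_score(history, window):
    """Hypothetical score: shift of the window mean in units of history spread."""
    spread = pstdev(history) or 1.0  # fall back to 1.0 when the history is flat
    return abs(mean(window) - mean(history)) / spread


def percentile_rank(value, past_values):
    """Percentage of past values that are less than or equal to value."""
    if not past_values:
        return 0.0
    return 100.0 * sum(v <= value for v in past_values) / len(past_values)


def detect_change(samples, cutoff=95.0, min_score=3.0):
    """Return True when the newest window looks like a significant change.

    samples holds one value per minute, oldest first. cutoff and min_score
    are assumed tuning values, not documented platform settings.
    """
    if len(samples) < MIN_HISTORY:
        return False  # not enough real-time data yet
    scores = []
    # Recompute the score each time a new data point extends the series.
    for end in range(MIN_HISTORY, len(samples) + 1):
        history = samples[:end - WINDOW]
        window = samples[end - WINDOW:end]
        scores.append(change_score(history, window))
    latest, earlier = scores[-1], scores[:-1]
    return latest >= min_score and percentile_rank(latest, earlier) >= cutoff
```

With exactly four hours of data there are no earlier scores to rank against, so this sketch reports no change until more points arrive.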

The following example shows the change detection alert display on the Alerts page:

Figure: Example Change Detection Alerts

Change detection is summarized on the Alert Details page:

Figure: Detailed Information on Change Detection Alerts

Alert lifecycle

The alert lifecycle describes alert status transitions, from Open status to Closed status, as a result of actions applied to the alert.

Alert action

The following actions can be applied to an alert:

Acknowledge: Acknowledge an alert upon receipt. In the Alerts list Action/Status column, click Acknowledge. After you acknowledge the alert, a comment displays the Acknowledged status and the user name.
Create Incident: Create a ticket for the alert. This action creates an incident, attaches the alert to the incident, assigns users, and sets the alert priority. The alert status changes to Ticketed and the incident ID is displayed in the Action/Status column.
Attach and Update Incident: Attach an alert to an existing ticket and update the ticket with the alert contents. This action is typically used to update an alert ticket with related alerts.
Attach Incident: Similar to Attach and Update Incident, this action maps an alert to an existing ticket but does not update the ticket with the alert contents.
Suppress: Suppress the alert and change the alert status to Suppressed. Use the Snooze setting to suppress an alert for a specific duration. If a repeated alert occurs while the alert is snoozed, the alert repeat count increments and the snooze duration is reset based on the repeated alert attributes.
Unacknowledge: Undo the previous Acknowledge action. You might want to unacknowledge an alert if an applied solution did not rectify the problem. The alert status changes to Ticketed if an incident ID is associated with the alert; otherwise, it changes to Open.
Unsuppress: Undo the previous Suppress action. The alert status changes to Ticketed if an incident ID is associated with the alert; otherwise, it changes to Open.
Run Process: You can define and run a process for an alert if the alert does not have a Suppressed status or if the alert is healed.
Close: Close an alert when the issue is solved and the alert is resolved. You can manually close only alerts that are in the OK state.

For correlated alerts, an action can be performed on the entire inference, but not on a single alert.

Alert status

Alert status describes a logical condition of an alert with respect to the alert lifecycle. Alert status should not be confused with alert state, which can be critical, warning, or OK.

Both automatic and manual alert actions can cause an alert status change, as shown in the following figure:

Figure: Alert status
Open: The initial alert status is Open.
Correlated: Alert correlation processing changes the alert status to Correlated. Alerts correlated to an inference have a Correlated status and subsequently inherit the inference alert status. Correlated alerts do not change status independently but transition with the associated inference alert status. Suppress and Acknowledge actions can be applied to an inference alert; the correlated alert logically inherits the inference alert status, but the alert itself retains a Correlated status. Therefore, you do not need to suppress a correlated alert, because Correlated is a final status for alerts that are part of an inference.
Ticketed: The Create Incident action transitions open alerts to a Ticketed status. A Ticketed alert retains a Ticketed status even if an Unacknowledge or Unsuppress action is applied.
Acknowledged: Acknowledged alerts are set to an Acknowledged status.
Suppressed: Suppressed alerts are set to a Suppressed status.
Closed: The Closed status is a final alert status. Alerts can be closed manually only when the alert is in the OK state.
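The status transitions described above can be modeled as a small state machine. The following is a simplified sketch, not the platform's implementation: the `Alert` class, the `apply_action` function, and the string values are hypothetical names chosen for illustration, and automatic transitions such as correlation are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    state: str = "Critical"            # alert state: Critical, Warning, or OK
    status: str = "Open"               # alert status within the lifecycle
    incident_id: Optional[str] = None  # set once an incident is created


def _restored_status(alert):
    # Undoing Acknowledge or Suppress returns the alert to Ticketed when an
    # incident ID is associated with it, and to Open otherwise.
    return "Ticketed" if alert.incident_id else "Open"


def apply_action(alert, action, incident_id=None):
    """Apply a manual action and update the alert status (hypothetical model)."""
    if alert.status == "Correlated":
        raise ValueError("apply the action to the inference, not a correlated alert")
    if alert.status == "Closed":
        raise ValueError("Closed is a final status")
    if action == "Acknowledge":
        alert.status = "Acknowledged"
    elif action == "Suppress":
        alert.status = "Suppressed"
    elif action == "Create Incident":
        alert.incident_id = incident_id
        alert.status = "Ticketed"
    elif action in ("Unacknowledge", "Unsuppress"):
        alert.status = _restored_status(alert)
    elif action == "Close":
        if alert.state != "OK":
            raise ValueError("only alerts in the OK state can be closed manually")
        alert.status = "Closed"
    else:
        raise ValueError(f"unknown action: {action}")
    return alert
```

For example, acknowledging and then unacknowledging an alert that already has an incident leaves it in the Ticketed status, which matches the rule that a Ticketed alert stays Ticketed after Unacknowledge or Unsuppress.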

You can monitor alert status in the Alert Details page comments section:

Figure: Track Actions on an Alert

When the problem that caused an alert no longer occurs, the alert is placed in the OK state. Alerts in the OK state are not displayed in the alert browser. If the same alert recurs after the alert has reached the OK state, a new alert is created. Otherwise, the repeat count is incremented for the existing alert.

Alert filters

The following filters can be applied to alerts:

Attribute Name | Description
Client | View the alerts of all or select clients.
Resource Origin | View alerts from all sources or a single source.
Sites | Select alerts specific to the site.
Resource Groups | Select alerts associated with a resource group.
Resource Type | Filter alerts by resource type. You can specify up to ten resource types.
Name | Filter alerts by resource name.
Source | Filter alerts by source, such as the available integrations.
Entity Type | Filter alerts by entity type:
  • Resource
  • Service
  • Client
  • Integration
Alert Type | Filter alerts by alert type:
  • Agent
  • Appliance
  • Change Detection
  • Forecast
  • Maintenance
  • Monitoring
  • Obsolete
  • Scheduled Maintenance
Metric | Filter alerts by metric name.
Priority | Filter alerts by priority, where P0 is the lowest priority and P5 is the highest priority.
Current Status | Filter alerts by status or state:
  • All Status
  • Critical
  • Observed; select the observed value
  • Warning
  • Ok
  • Info
Actions | View alerts by current status:
  • Acknowledged
  • Closed
  • Correlated
  • Open
  • Suppressed
  • Ticketed
Duration | Filter alerts that occurred within a specified duration. For example, if the duration is set to the last seven days and the alert timestamp is set to created time, alerts created within the last seven days are selected.
From Date | Specify the duration start time.
To Date | Specify the duration end time.
Alert Timestamps | Filter by alert timestamp.
Event Type | Filter by event type:
  • Alerts
  • RCA
  • Inference Type
Special Attribute | Filter on alerts triggered when a resource becomes unavailable.
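As an illustration of how the Duration and Alert Timestamps filters combine, the following sketch selects alerts whose chosen timestamp falls within the last N days. The dictionary layout, the `created_time` field name, and the `filter_by_duration` function are assumptions made for this example and do not reflect the platform's data model or API.

```python
from datetime import datetime, timedelta


def filter_by_duration(alerts, now, days=7, timestamp_field="created_time"):
    """Keep alerts whose selected timestamp lies within the last `days` days.

    alerts is a list of dicts; timestamp_field names the timestamp to compare
    (a hypothetical stand-in for the Alert Timestamps setting).
    """
    cutoff = now - timedelta(days=days)
    return [a for a in alerts if cutoff <= a[timestamp_field] <= now]


# Usage: with the duration set to the last seven days, only alert 1 is selected.
now = datetime(2024, 5, 10, 12, 0)
alerts = [
    {"id": 1, "created_time": datetime(2024, 5, 9, 8, 0)},
    {"id": 2, "created_time": datetime(2024, 4, 20, 8, 0)},
]
recent = filter_by_duration(alerts, now)
```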

Alert noise

Alert noise or alert floods might make it difficult to recognize and deal with more important alerts. The platform can reduce alert noise in the following ways:

  • Automatically ignore repeat alerts
  • Manually stop processing alerts temporarily

Automatically ignore repeat alerts

When the platform receives more than one alert with the same resource, metric, component, and state combination within one minute, the platform ignores the alerts after processing the first alert.
Ignoring the repeat alerts conserves resources. For example, there have been instances where outgoing email is blocked if the platform continues to process the noise. Once there are no repeat alerts with the same resource, metric, component, and state combination within a minute, the platform resumes normal alert processing.
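The repeat-alert rule can be sketched as a filter keyed on the resource, metric, component, and state combination. This is one possible reading of the behavior described above, not the platform's code: the `RepeatAlertFilter` class and the dictionary keys are hypothetical, and the sketch assumes that every repeat, even an ignored one, restarts the one-minute quiet period.

```python
class RepeatAlertFilter:
    """Ignore repeats of the same alert combination seen within the window."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self._last_seen = {}  # combination -> time (seconds) of the latest alert

    def should_process(self, alert, now):
        """Return True if the alert should be processed, False if ignored."""
        key = (alert["resource"], alert["metric"],
               alert["component"], alert["state"])
        last = self._last_seen.get(key)
        # Record every arrival so a continuing flood keeps being ignored
        # until a full quiet window has passed.
        self._last_seen[key] = now
        return last is None or now - last >= self.window_seconds
```

Because the key includes the state, a change from Warning to Critical on the same resource, metric, and component is treated as a different combination and is processed immediately.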

Manually stop processing alerts temporarily

When the platform receives an alert flood from a resource or metric, the operations team stops processing alerts for that resource or metric until the flooding has stopped. For example, flooding occurs when a single metric on a resource, or a resource as a whole, sends hundreds of alerts within a few minutes. This flood affects the processing of other alerts for the client, which might be important alerts. Because of the load it generates, a flood from one client might also affect alert processing for other clients. When flooding occurs, the operations team changes the server configuration to stop processing these alerts and informs the corresponding client of the issue. Once the issue is resolved, the operations team restores the configuration so that all alerts are processed again.

Alert repeat and occurrence history

Older alert occurrences are not always available because they are purged after 180 days.

  • On the Alerts page, the link in the Repeat column will not show the number of alerts indicated when the occurrences are older than 180 days.
  • On the Alert Details page, the Total Occurrences link will not show the number of alerts indicated when the occurrences are older than 180 days.

Machine learning concepts

Machine learning (ML) status shows the stage of machine learning in a policy, from analyzing alert sequences to suppressing alerts:

ML Status | Description
Insufficient data. The policy is temporarily disabled. | Due to insufficient data, the machine learning model cannot detect any alert sequences and the policy is temporarily disabled. The policy becomes active when the machine learning model has sufficient data.
Training ML model is queued. | When a policy is created or a CSV file is uploaded to a policy, the training can be queued. If another policy is in training, the new policy is queued. After training on the existing policy is completed, the status of the new policy moves to training initiated.
Training ML model is initiated. | After the previous stage is completed, training on the machine learning model is initiated. The status then moves to training started.
Training ML model is started. | After the previous stage is completed, training on the machine learning model is started. Training progress is shown on the progress bar.
Training ML model is in progress. | After the previous stage is completed, training on the ML model continues with the percentage of progress shown in the progress bar.
ML model training is complete. | Predictions commence, and the ML model detects alert sequences and suppresses alerts.
ML training encountered an error. | Contact support for assistance.

ML seasonality patterns

Alert seasonality patterns help detect alerts that recur on a seasonal schedule. These patterns are based upon unmodified alert information retrieved from existing data. OpsQ suppresses an alert that matches the seasonality pattern detected by ML. For example, if an alert is regularly triggered at 10 PM on Mondays, OpsQ suppresses the alert triggered at 10 PM on Mondays.

ML retrieves three months of alerts and groups alerts based on three attributes (resource, metric, and component). Seasonal patterns of alerts are studied for each group. Alert groups that have seasonal patterns are listed and displayed when a group is selected.

Ninety percent of the alerts in a group must be seasonal for the group to qualify as seasonal alerts.
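The grouping and the ninety percent rule can be illustrated with a deliberately simple stand-in for the ML model. In this sketch, an alert counts as "seasonal" when it falls in the group's most common weekday-and-hour slot; that definition, the 90-day lookback used for "three months", the `min_alerts` parameter, and all names are assumptions for illustration only.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

LOOKBACK = timedelta(days=90)  # roughly three months of alerts (from the text)
SEASONAL_FRACTION = 0.9        # ninety percent must be seasonal (from the text)


def seasonal_groups(alerts, now, min_alerts=4):
    """Map each seasonal (resource, metric, component) group to its slot.

    A slot is (weekday, hour), with Monday as weekday 0. min_alerts is an
    assumed guard so that tiny groups are not labeled seasonal.
    """
    groups = defaultdict(list)
    for a in alerts:
        if now - LOOKBACK <= a["time"] <= now:
            key = (a["resource"], a["metric"], a["component"])
            groups[key].append(a["time"])

    result = {}
    for key, times in groups.items():
        if len(times) < min_alerts:
            continue
        slots = Counter((t.weekday(), t.hour) for t in times)
        slot, count = slots.most_common(1)[0]
        # The group qualifies only when enough alerts share the dominant slot.
        if count / len(times) >= SEASONAL_FRACTION:
            result[key] = slot
    return result
```

A group that alerts at 10 PM on ten Mondays with one stray alert at another time still qualifies (10 of 11 is about 91 percent), while a group whose alerts are scattered across days and hours does not.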

Seasonality pattern graphs

Seasonality pattern graphs have the following attributes:

Resource, metric, and component: Attributes of alerts and the seasonality group that is selected. The seasonality groups are named using these three attributes.
Numbers (displayed horizontally): The timeline with dates in a month. Hover over the dates to view the date and time for each generated alert.
Grey lines: The past alerting time for a specific group (resource, metric, and component).
Orange lines: The predicted alerting time based on the learned alert seasonality.
Blue shaded area: Zoom in (or zoom out) on the blue line to examine time duration.

Seasonality groups in the graph refer to alert groups that have seasonal patterns.

The following graph provides the details of the seasonal patterns for alert data and describes when OpsQ suppresses alerts due to matching seasonal patterns:

Figure: Alert Seasonality Pattern

This is a sample recommendation by the OpsQ bot for a suppressed seasonal alert:

Figure: OpsQBot Recommendation for Seasonal Alert

The following graph shows the seasonal pattern of a suppressed alert:

Figure: Seasonal Alert Graph