Monitor-generated events indicate the presence of device or resource anomalies. Alerting helps you interpret and act on events by providing a single aggregation and response system for all events. Events can also be produced by diagnostic and third-party tools.

Much of the work of interpreting and responding to an alert is automated. Alerting correlates related-cause alerts, automatically suppresses redundant alerts, notifies operators, and creates incident tickets for alerts that need attention.

The following figure shows the alert handling workflow:

Event Management

Note: Information alerts do not trigger actions in alert policies. They are informational only and are treated as distinct from regular alerts.

Alert terminology

The following terms are used in alert management:

Term | Description
ID | Sequential number that uniquely identifies an alert or inference. In the alert list, the ID field also indicates the alert state using color-coding.
Subject | Alert description summary, which includes metrics associated with the alert.
Description | Brief description of the alert source and cause. This might include metrics with threshold crossings, monitor description, device type, template name, group, site, service level, and component.
Source | Platform or monitoring tool that generated the alert.
Metric | Service name of a threshold-crossing alert.
First Alert Time | Time when monitoring started for a resource. An alert is generated to provide notification that monitoring started for the resource.
Alert Updated Time | Most recent alert time. Updated when an alert is unsuppressed manually or with the alert First Response policy.
Elapsed Time | Elapsed time since the first alert was generated.
Action/Status | Current alert status and most recent alert action.
Last Updated Time | Time when the alert status was last updated.
Device Type | Device type associated with the alert.
Resource | Resource name associated with the alert.
Repeated Alerts | Number of duplicate alerts generated by the resource.
Partner | Name of the partner owning the resource.
Incident ID | Unique incident ID associated with the alert. Alerts are associated with incidents by:
  • Manually creating an incident.
  • Escalating an alert as an incident.
Entity Type | Category of the source that generated the alert:
  • Resource: Alerts originating from managed resources.
  • Service: Service mapped to the resources.
  • Integration: Alerts originated by monitoring the installed integrations.
  • Client: Alerts not generated by monitoring but that represent logically clustered alerts, for example, correlated or grouped inference alerts, or RCA alerts based on resource dependency mapping.
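
Taken together, these fields describe a simple alert record. The following Python sketch is illustrative only; the field names are assumptions, not the platform's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Alert:
    """Hypothetical alert record; field names are illustrative, not the platform schema."""
    alert_id: int                      # sequential, unique per alert or inference
    subject: str                       # summary including associated metrics
    description: str                   # source and cause of the alert
    source: str                        # platform or monitoring tool that generated it
    metric: str                        # service name of the threshold-crossing alert
    first_alert_time: datetime         # when monitoring started for the resource
    alert_updated_time: datetime       # most recent alert time
    device_type: str
    resource: str
    entity_type: str                   # "Resource", "Service", "Integration", or "Client"
    partner: str
    repeated_alerts: int = 0           # count of duplicate alerts from the resource
    incident_id: Optional[str] = None  # set when an incident is created or attached

    @property
    def elapsed_time(self) -> float:
        """Seconds elapsed since the first alert was generated."""
        return (datetime.utcnow() - self.first_alert_time).total_seconds()
```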

Alert types

There are two alert types:

  • Forecast alerts
  • Change detection alerts

Forecast alerts

When a monitor triggers a forecast alert, the alert is displayed on the Alerts page. The following example shows the forecast alert display:

Forecast Alerts

The details and updates on a forecast alert are displayed in the comments section:

Forecast Alert Details

Change detection alerts

Change detection helps monitor sudden changes in metric behavior, especially for metrics without a definite threshold. This feature uses machine learning with a sliding window, and the change is calculated as soon as a data point becomes available.

Assign the change detection parameters to the required monitors or metrics to detect changes and trigger alerts when a significant change is detected.

Key considerations include:

  • If a significant change continues to be detected, the alert is appended to the existing alert.
  • If no change is detected for two hours, the alert is automatically resolved.
  • A minimum of four hours of real-time data is required to process change detection. Hence, change detection graphs and widgets are populated only four hours after a change detection monitor is assigned to a resource.
  • When a monitor triggers a change detection alert, the alert is displayed on the Alerts page.

A sliding window of two hours is used to calculate the change score percentile and determine if an alert should be triggered.
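
As a rough illustration of this sliding-window idea (not the platform's actual algorithm), the following sketch scores each new data point against the preceding two-hour window; a simple z-score cutoff stands in for the change score percentile described above:

```python
from collections import deque
from statistics import mean, stdev

WINDOW_SECONDS = 2 * 60 * 60     # two-hour sliding window, per the description above
CHANGE_THRESHOLD = 3.0           # illustrative z-score cutoff; the real percentile logic may differ

class ChangeDetector:
    """Minimal sliding-window change detector; a sketch, not the platform implementation."""

    def __init__(self):
        self.window = deque()    # (timestamp, value) pairs within the window

    def add_point(self, timestamp: float, value: float) -> bool:
        """Return True if this data point represents a significant change."""
        # Drop points that have aged out of the two-hour window.
        while self.window and timestamp - self.window[0][0] > WINDOW_SECONDS:
            self.window.popleft()

        significant = False
        values = [v for _, v in self.window]
        if len(values) >= 10:    # require some history before scoring
            mu, sigma = mean(values), stdev(values)
            if sigma > 0 and abs(value - mu) / sigma >= CHANGE_THRESHOLD:
                significant = True

        self.window.append((timestamp, value))
        return significant
```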

The following example shows the change detection alert display on the Alerts page:

Example Change Detection Alerts

Change detection is summarized on the Alert Details page:

Detailed Information on Change Detection Alerts

Alert lifecycle

The alert lifecycle describes alert status transitions, from Open status to Closed status, as a result of actions applied to the alert.

Alert action

The following actions can be applied to an alert:

Action | Description
Acknowledge | A received alert needs to be acknowledged. After you acknowledge the alert, a comment is displayed as Acknowledged and includes the user name. From the Incident unique ID drop-down menu, click Acknowledge; a tick mark is displayed below the Incident ID.
Create Incident | A ticket can be created for the generated alert, assigning users and setting the priority. After an incident is created, the status of the alert changes to Ticketed and the incident ID is displayed in the Action/Status column.
Attach And Update Incident | Map an alert to an existing ticket or update the ticket with the alert contents. This action is generally used to update the same ticket with related alerts.
Attach Incident | Map an alert to an existing ticket without updating the ticket with the alert contents.
Suppress | Suppress the current alert and all duplicate alerts. A new alert of the same type is displayed as a fresh alert and not as a duplicate alert. The status of the alert changes to Suppressed. The Snooze setting suppresses alerts for a specified time interval. If a repeated alert occurs while the alert is snoozed, the alert repeat count increments and the snooze duration is reset based on the repeated alert attributes (see the sketch following this table). Note: If the alert payload has a source time older than the suppression time, the First Response recommendation or suppression is not applied.
Unacknowledge | Undo the Acknowledge action taken on an alert. For example, if a solution did not address a specific problem, unacknowledge the alert. The status of the alert changes to Open, or to Ticketed if an incident ID is associated with the alert.
Unsuppress | Undo a Suppress action taken on an alert. The status of the alert changes to Open, or to Ticketed if an incident ID is associated with the alert.
Run Process | Add process definitions to an alert and run them. The option does not appear for Suppressed and Heal alerts.
Close | Close an alert when the issue is solved and the alert is resolved. The alert state changes to OK.
Heal | Selecting the Heal action creates an OK alert with identical properties, effectively healing the alert manually. Heal applies to critical and warning alerts regardless of any action already taken on them. Heal can be applied to only one alert at a time; it cannot be performed on multiple alerts simultaneously.

For correlated alerts, an action can be performed on the entire inference, but not on a single alert.
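
The snooze behavior described for the Suppress action (the repeat count incrementing and the snooze duration resetting when a repeated alert arrives) can be sketched as follows. The class and attribute names are illustrative, not platform code:

```python
from datetime import datetime, timedelta

class SnoozedAlert:
    """Sketch of snooze handling for a suppressed alert; illustrative only."""

    def __init__(self, snooze_minutes: int):
        self.status = "Suppressed"
        self.repeat_count = 0
        self.snooze_minutes = snooze_minutes
        self.snooze_until = datetime.utcnow() + timedelta(minutes=snooze_minutes)

    def on_repeated_alert(self, received_at: datetime) -> None:
        """Handle a repeated alert that arrives while the alert is snoozed."""
        if received_at < self.snooze_until:
            # Increment the repeat count and reset the snooze duration,
            # as described for the Suppress action above.
            self.repeat_count += 1
            self.snooze_until = received_at + timedelta(minutes=self.snooze_minutes)
```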

Alert status

Alert status describes a logical condition of an alert within the alert lifecycle. Alert status should not be confused with alert state, which can be critical, warning, or OK.

Both automatic and manual alert actions can cause an alert status change, as shown in the following figure:

Alert status
Status | Description
Open | The initial alert status is Open.
Correlated | Alert correlation processing changes the alert status to Correlated. Alerts correlated to an inference have a Correlated status and subsequently inherit the inference alert status. Correlated alerts do not change status independently; they transition with the associated inference alert status. Suppress and Acknowledge actions can be applied to an inference alert, and the correlated alert logically inherits the associated inference alert status, but the alert itself retains a Correlated status. Therefore, you do not need to suppress a correlated alert; the Correlated status is a final status for alerts that are part of an inference.
Ticketed | The Create Incident action transitions open alerts to a Ticketed status. A Ticketed alert retains a Ticketed status even if an Unacknowledge or Unsuppress action is applied.
Acknowledged | Acknowledged alerts are set to an Acknowledged status.
Suppressed | Suppressed alerts are set to a Suppressed status.
Closed | The Closed status is a final alert status. Alerts can be closed manually only when the alert is in the OK state.

You can monitor alert status in the Alert Details page comments section:

Track Actions on an Alert

After the problem is resolved and no longer raises the alert condition, the alert is placed in the OK state. Alerts in the OK state are not displayed in the alert browser. If the same alert reoccurs while the existing alert is in the OK state, a new alert is created; otherwise, the repeat count of the existing alert is incremented.
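
As a rough summary of these transitions, the following sketch maps a (status, action) pair to the resulting status. It is simplified and not an exhaustive model of the platform's behavior:

```python
# Simplified mapping of (current status, action) -> next status,
# based on the lifecycle described above; not an exhaustive model.
TRANSITIONS = {
    ("Open", "Acknowledge"): "Acknowledged",
    ("Open", "Create Incident"): "Ticketed",
    ("Open", "Suppress"): "Suppressed",
    ("Open", "Correlate"): "Correlated",          # applied automatically by correlation processing
    ("Acknowledged", "Unacknowledge"): "Open",
    ("Suppressed", "Unsuppress"): "Open",
}

def next_status(current: str, action: str, has_incident: bool = False) -> str:
    """Return the resulting status for an action; Close is valid only in the OK state."""
    if action in ("Unacknowledge", "Unsuppress") and has_incident:
        return "Ticketed"                          # an associated incident keeps the alert Ticketed
    return TRANSITIONS.get((current, action), current)
```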

Alert filters

The following filters can be applied to alerts:

Attribute Name | Description
Client | View the alerts of all or selected clients.
Resource Origin | View alerts from all sources or a single source.
Sites | Select alerts specific to a site.
Resource Groups | Select alerts associated with a resource group.
Resource Type | Filter alerts by resource type. Multiple resource types can be specified, up to a maximum of ten.
Name | Filter alerts by resource name.
Source | Filter alerts by source, such as the available integrations.
Entity Type | Filter alerts by entity type:
  • Resource
  • Service
  • Client
  • Integration
Alert Type | Filter alerts by alert type:
  • Agent
  • Appliance
  • Change Detection
  • Forecast
  • Maintenance
  • Monitoring
  • Obsolete
  • Scheduled Maintenance
Metric | Filter alerts by metric name.
Priority | Filter alerts by priority, where P0 is the lowest priority and P5 is the highest priority.
Current Status | Filter alerts by status or state:
  • All Status
  • Critical
  • Observed; select the observed value
  • Warning
  • Ok
  • Info
Actions | View alerts by current status:
  • Acknowledged
  • Closed
  • Correlated
  • Open
  • Suppressed
  • Ticketed
Duration | Filter alerts that occurred within a specified duration. For example, if the duration is set to the last seven days and the alert timestamp is set to created time, alerts created within the last seven days are selected.
From Date | Specify the duration start time.
To Date | Specify the duration end time.
Alert Timestamps | Filter by alert timestamp.
Event Type | Filter by event type:
  • Alerts
  • RCA
  • Inference Type
Special Attribute | Filter alerts triggered when a resource becomes unavailable.
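
For illustration only, the following sketch shows how a few of these filter attributes might combine on the client side. The field names are assumptions, not the platform's API:

```python
from datetime import datetime
from typing import Iterable, Optional

def filter_alerts(alerts: Iterable[dict],
                  resource_type: Optional[str] = None,
                  metric: Optional[str] = None,
                  status: Optional[str] = None,
                  from_date: Optional[datetime] = None,
                  to_date: Optional[datetime] = None) -> list:
    """Illustrative filtering over a handful of the attributes listed above."""
    selected = []
    for alert in alerts:
        if resource_type and alert.get("resource_type") != resource_type:
            continue
        if metric and alert.get("metric") != metric:
            continue
        if status and alert.get("status") != status:
            continue
        created = alert.get("created_time")          # assumed datetime field
        if from_date and created and created < from_date:
            continue
        if to_date and created and created > to_date:
            continue
        selected.append(alert)
    return selected
```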

Native attributes filter criteria

The following native attributes can be used as filter criteria in the alert browser and in Alert Management Policies.

Native Attribute Name | Description
Agent Installed | Filter alerts from agent-monitored resources.
Alert Component | Component of a threshold-crossing alert.
Alert Description | Filter by a brief description of the alert source and cause. This might include metrics with threshold crossings, monitor description, device type, template name, group, site, service level, and component.
Alert Metric | Filter alerts by the service name of a threshold-crossing alert.
Alert Occurrence Frequency | Filter by the number of different alerts within a period of time.
Alert Problem Area | Filter alerts by the problem area, which is extracted from the alert metric.
Alert Type | Filter by the type of the alert.
Alert Subject | Filter by the alert description summary, which includes metrics associated with the alert.
Alert Source | Filter alerts by their source, which is the platform or monitoring tool that generated the alert.
Alert State | Filter alerts by status or state.
Alias name | Filter alerts by the alias name of the resource associated with the alert.
DNS Name | Filter alerts by the DNS name of the resource associated with the alert.
IP Address | Filter alerts by the IP address of the resource associated with the alert.
Model | Filter alerts by the model of the resource associated with the alert.
Application | Filter alerts by the application of the resource associated with the alert.
Operating system | Filter alerts by the operating system of the resource associated with the alert.
Resource Name | Filter alerts by the name of the resource associated with the alert.
Resource Group | Filter alerts by the name of the resource group associated with the alert.
Resource Type | Filter alerts by resource type. Multiple resource types can be specified, up to a maximum of ten.
Service Group | Filter alerts by the name of the service group of the resource associated with the alert.
Inference Alert | Filter alerts whose event type is inference.
Make | Filter alerts by the manufacturer of the resource.
Site | Filter alerts specific to a site.
Resource: Instance id | Filter alerts by the instance ID of the resource associated with the alert.
Resource: Resource IP | Filter alerts by the IP of the resource associated with the alert.
Resource: Resource Name | Filter alerts by the name of the resource associated with the alert.
Resource: Resource Type | Filter alerts by the type of the resource associated with the alert.
Resource: Resource Group | Filter alerts by the name of the resource group of the resource associated with the alert.
Alert: No. of Occurrence | Filter alerts by the number of repeated alerts.
Alert: Update Time | Filter alerts by the most recent alert time.
Alert: Healed Time | Filter alerts by the alert healed time.
Resource: AWS Account id | Filter alerts by the AWS account ID of the resource associated with the alert.
Resource: AWS Instance Name | Filter alerts by the AWS instance name of the resource associated with the alert.
Business Hours | Filter alerts by business hours, based on the user configuration.
Alert: Component | Filter alerts by the component of the alert.
Alert: Created Date | Filter alerts by the alert creation date.
Resource: DNS Name | Filter alerts by the DNS name of the resource associated with the alert.

Alert noise

Alert noise or alert floods might make it difficult to recognize and deal with more important alerts. The platform can reduce alert noise in the following ways:

  • Automatically ignore repeat alerts
  • Manually stop processing alerts temporarily

Automatically ignore repeat alerts

When the platform receives more than one alert with the same resource, metric, component, and state combination within one minute, the platform processes the first alert and ignores the rest.
Ignoring the repeat alerts conserves resources; outgoing email could be blocked if the platform continued to process the noise. When no repeat alerts with the same resource, metric, component, and state combination arrive within a minute, the platform resumes normal alert processing.
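
A minimal sketch of this de-duplication rule, assuming a simple in-memory cache keyed by resource, metric, component, and state (illustrative, not the platform's implementation):

```python
import time

DEDUP_WINDOW_SECONDS = 60   # ignore repeats within one minute, per the rule above

class RepeatAlertFilter:
    """Sketch of the repeat-alert suppression rule; illustrative only."""

    def __init__(self):
        self._last_seen = {}  # (resource, metric, component, state) -> last processed time

    def should_process(self, resource: str, metric: str, component: str, state: str) -> bool:
        """Return True if the alert should be processed, False if it is a repeat to ignore."""
        key = (resource, metric, component, state)
        now = time.monotonic()
        last = self._last_seen.get(key)
        if last is not None and now - last < DEDUP_WINDOW_SECONDS:
            return False          # repeat within a minute: ignore
        self._last_seen[key] = now  # first alert, or window elapsed: process normally
        return True
```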

Manually stop processing alerts temporarily

When the platform receives an alert flood from a resource or metric, the operations team stops processing that resource or metric until they confirm the flooding has stopped. For example, flooding occurs when a single metric on a resource, or a resource itself, sends hundreds of alerts within a few minutes. The flood affects the processing of other, possibly more important, alerts for the client. It can also affect alert processing for other clients because of the load generated by the noisy alerts. When flooding occurs, the operations team changes the server configuration to stop processing these alerts and informs the affected client of the issue. After the issue is resolved, the operations team restores the configuration so that all alerts are processed again.

Alert repeat and occurrence history

Older alert occurrences are not always available because they are purged after 180 days.

  • On the Alerts page, the link in the Repeat column does not show all of the alerts indicated by the count when the occurrences are older than 180 days.
  • On the Alert Details page, the Total Occurrences link does not show all of the alerts indicated by the count when the occurrences are older than 180 days.

Machine learning concepts

Machine learning (ML) status shows the stage of the machine learning model used by a policy to analyze alert sequences and suppress alerts:

ML Status | Description
Insufficient data | The policy is temporarily disabled. Due to insufficient data, the machine learning model cannot detect any alert sequences. The policy becomes active when the machine learning model has sufficient data.
Training ML model is queued | When a policy is created or a CSV file is uploaded to a policy, the training can be queued. If another policy is already in training, the new policy is queued. After training on the existing policy completes, the status of the new policy moves to training initiated.
Training ML model is initiated | Training on the machine learning model has been initiated. The status then moves to training started.
Training ML model is started | Training on the machine learning model has started. Training progress is shown on the progress bar.
Training ML model is in progress | Training on the ML model continues, with the percentage of progress shown on the progress bar.
ML model training is complete | Predictions commence; the ML model detects alert sequences and suppresses matching alerts.
ML training encountered an error | Contact support for assistance.

ML seasonality patterns

Alert seasonality patterns help detect seasonal patterns of alerts. These patterns are based on unmodified alert information retrieved from existing data. OpsQ suppresses an alert that matches a seasonality pattern detected by ML. For example, if an alert regularly occurs at 10 PM on Mondays, OpsQ suppresses the alert triggered at 10 PM on Mondays.

ML retrieves three months of alerts and groups alerts based on three attributes (resource, metric, and component). Seasonal patterns of alerts are studied for each group. Alert groups that have seasonal patterns are listed and displayed when a group is selected.

Ninety percent of the alerts must be seasonal to qualify as seasonal alerts.
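
As a rough sketch of the grouping and qualification step described above (not the actual ML pipeline), alerts can be grouped by resource, metric, and component, and a group can be treated as seasonal only if at least 90 percent of its alerts recur on the same slot. The weekday-and-hour slot used here is an assumption chosen to mirror the weekly example above:

```python
from collections import Counter, defaultdict

def seasonal_groups(alerts: list) -> list:
    """Return (resource, metric, component) groups whose alerts look seasonal.

    Sketch only: 'seasonal' here means at least 90% of a group's alerts fall in the
    same weekday-and-hour slot, approximating the weekly pattern in the example above.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = (alert["resource"], alert["metric"], alert["component"])
        groups[key].append(alert["created_time"])      # datetime of each occurrence

    seasonal = []
    for key, times in groups.items():
        slots = Counter((t.weekday(), t.hour) for t in times)
        most_common_count = slots.most_common(1)[0][1]
        if most_common_count / len(times) >= 0.9:       # the 90 percent rule
            seasonal.append(key)
    return seasonal
```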

Seasonality pattern graphs

Seasonality pattern graphs have the following attributes:

Attribute | Description
Resource, metric, and component | Attributes of alerts and the seasonality group that is selected. The seasonality groups are named using these three attributes.
Numbers (displayed horizontally) | The timeline with dates in a month. Hover over the dates to view the date and time for each generated alert.
Gray lines | The past alerting time for a specific group (resource, metric, and component).
Orange lines | The predicted alerting time based on the learned alert seasonality.
Blue shaded area | Zoom in (or zoom out) on the blue line to examine a time duration.

Seasonality groups in the graph refer to alert groups that have seasonal patterns.

The following graph provides the details of the seasonal patterns for alert data and describes when OpsQ suppresses alerts due to matching seasonal patterns:

Alert Seasonality Pattern

This is a sample recommendation by the OpsQ bot for a suppressed seasonal alert:

OpsQBot Recommendation for Seasonal Alert

The following graph shows the seasonal pattern of a suppressed alert:

Seasonal Alert Graph