Monitor-generated events indicate the presence of device or resource anomalies. Alerting helps you interpret and act on events by providing a single aggregation and response system for all events. Events can also be produced by diagnostic and third-party tools.
Much of the work of interpreting and responding to an alert is automated. Alerting correlates related-cause alerts, automatically suppresses redundant alerts, notifies operators, and creates incident tickets for alerts that need attention.
The following figure shows the alert handling workflow:
The following terms are used in alert management:
|ID||Sequential number that uniquely identifies an alert or inference. In the alert list, the ID field also indicates the alert state using color-coding.|
|Subject||Alert description summary, which includes metrics associated with the alert.|
|Description||Brief description of the alert source and cause. This might include metrics with threshold crossings, monitor description, device type, template name, group, site, service level, and component.|
|Source||Platform or monitoring tool that generated the alert.|
|Metric||Service name of a threshold-crossing alert.|
|First Alert Time||Time when monitoring started for a resource. An alert is generated to provide notification that monitoring started for the resource.|
|Alert Updated Time||Most recent alert time. Updated when an alert is unsuppressed manually or with the alert First Response policy.|
|Elapsed Time||Elapsed time since the first alert was generated.|
|Action/Status||Current alert status and most recent alert action.|
|Last Updated Time||Time when alert status was last updated.|
|Device Type||Device type associated with an alert.|
|Resource||Resource name associated with the alert.|
|Repeated Alerts||Number of duplicate alerts generated by the resource.|
|Partner||Name of the partner owning the resource.|
|Incident ID||Unique incident ID associated with the alert. Alerts are associated with incidents by:|
|Entity Type||Category of the source that generated the alert:|
There are two alert types:
- Forecast alerts
- Change detection alerts
When a monitor triggers a forecast alert, the alert is displayed on the Alerts page. The following example shows the forecast alert display:
The details and updates on a forecast alert are displayed in the comments section:
Change detection alerts
Change detection helps monitor sudden changes in metric behavior, especially on metrics with an indefinite threshold. This feature utilizes machine learning with a sliding window where change is calculated as soon as a data point becomes available.
Assign the change detection parameters to the required monitors or metrics to detect changes and trigger alerts when a significant change is detected.
Key considerations include:
- If a significant change is detected continuously, the new alert is appended to the existing alert.
- If no change is detected after two hours, the alert is automatically resolved.
- A minimum of four hours of real-time data is required to process change detection.
- When a monitor triggers a change detection alert, the alert is displayed on the Alerts page.
A sliding window of two hours is used to calculate the change score percentile and determine if an alert should be triggered.
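The window and minimum-history rules above can be illustrated with a minimal sketch. The source does not describe the machine learning model, so the scoring used here (absolute difference between the mean of the latest two-hour window and the mean of the preceding two-hour window) is only a stand-in assumption, as are the function names and the `threshold` value. Only the two-hour window and the four-hour minimum come from the text.

```python
from statistics import mean

WINDOW = 2 * 3600       # two-hour sliding window, from the text
MIN_HISTORY = 4 * 3600  # four hours of data required, from the text


def change_score(points, now):
    """Assumed score: mean shift between the latest window and the one before it.

    points is a list of (timestamp_seconds, value) pairs sorted by timestamp.
    """
    recent = [v for t, v in points if now - WINDOW < t <= now]
    previous = [v for t, v in points if now - 2 * WINDOW < t <= now - WINDOW]
    if not recent or not previous:
        return None
    return abs(mean(recent) - mean(previous))


def should_alert(points, threshold=0.95):
    """Return True when the latest score ranks at or above the assumed percentile."""
    if not points or points[-1][0] - points[0][0] < MIN_HISTORY:
        return False  # not enough history to process change detection
    latest = change_score(points, points[-1][0])
    if latest is None:
        return False
    # Recompute the score as each earlier data point became available.
    earlier = []
    for i in range(len(points) - 1):
        score = change_score(points[: i + 1], points[i][0])
        if score is not None:
            earlier.append(score)
    if not earlier:
        return False
    rank = sum(s < latest for s in earlier) / len(earlier)
    return rank >= threshold
```

For example, a metric sampled every ten minutes that holds steady and then jumps in the final hour yields a latest score above every earlier score, so `should_alert` returns True, while a flat series returns False.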
The following example shows the change detection alert display on the Alerts page:
Change detection is summarized on the Alert Details page:
The alert lifecycle describes alert status transitions, from Open status to Closed status, as a result of actions applied to the alert.
The following actions can be applied to an alert:
|Acknowledge an alert upon receipt. In the Alerts list Action/Status column, click Acknowledge. After acknowledging the alert, a comment displays the Acknowledged status and the user name.|
|You can create a ticket for the alert, which creates an incident, attaching the alert to the incident, and assigns users and sets the alert priority. This changes the alert status to Ticketed and the incident ID is displayed in the Action/Status column.|
|This action attaches an alert to an existing ticket and updates the ticket with the alert contents. This action is typically used to update an alert ticket with related alerts.|
|Similar to |
|The suppress action suppresses the alert and changes the alert status to Suppressed. Use the Snooze setting to suppress an alert for a specific time duration. If a repeated alert occurs when the alert is in the snoozed state, the alert repeat count increments and the snooze duration is reset based on the repeated alert attributes.|
|Undo the previous |
|Undo the previous |
|You can define and run a process for an alert if the alert does not have a Suppressed status or if the alert is healed.|
|Close an alert when an issue is solved and the alert is resolved. You can manually close an alert only when it is in the OK state.|
For correlated alerts, an action can be performed on the entire inference, but not on a single alert.
Alert status describes a logical condition of an alert with respect to the alert lifecycle. Alert status should not be confused with alert state, which can be critical, warning, or OK.
Both automatic and manual alert actions can cause an alert status change, as shown in the following figure:
|Open||The initial alert status is Open.|
|Correlated||Alert correlation processing changes the alert status to Correlated.|
Alerts correlated to an inference have a Correlated status and subsequently inherit the inference alert status. Correlated alerts do not change status independently but transition with the associated inference alert status.
A Ticketed alert retains a Ticketed status even if an
|Acknowledged||Acknowledged alerts are set to an Acknowledged status.|
|Suppressed||Suppressed alerts are set to a Suppressed status.|
|Closed||The Closed status is a final alert status. Alerts can be closed manually only when the alert is in the OK state.|
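The distinction between alert state and alert status, and the rule that only alerts in the OK state can be closed manually, can be modeled as two separate enumerations. This is an illustrative sketch only: the class and function names are hypothetical and do not correspond to a product API.

```python
from dataclasses import dataclass
from enum import Enum


class AlertState(Enum):
    """Severity-like condition of the alert."""
    CRITICAL = "critical"
    WARNING = "warning"
    OK = "ok"


class AlertStatus(Enum):
    """Position of the alert in the alert lifecycle."""
    OPEN = "Open"
    CORRELATED = "Correlated"
    TICKETED = "Ticketed"
    ACKNOWLEDGED = "Acknowledged"
    SUPPRESSED = "Suppressed"
    CLOSED = "Closed"


@dataclass
class Alert:
    state: AlertState
    status: AlertStatus = AlertStatus.OPEN  # the initial status is Open


def close_manually(alert: Alert) -> None:
    """Close the alert, rejecting the action unless the alert is in the OK state."""
    if alert.state is not AlertState.OK:
        raise ValueError("only alerts in the OK state can be closed manually")
    alert.status = AlertStatus.CLOSED
```

Keeping state and status as separate fields makes it explicit that an alert can, for example, be Acknowledged while still critical.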
You can monitor alert status in the Alert Details page comments section:
When the problem no longer occurs, the alert is placed in the OK state. Alerts in the OK state are not displayed in the alert browser. If the same alert recurs while the existing alert is in the OK state, a new alert is created. Otherwise, the repeat count of the existing alert is incremented.
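The recurrence rule for the OK state can be sketched as a small function. The `AlertRecord` type and `handle_recurrence` name are hypothetical, and the sketch assumes that "the same alert" means an alert with the same identifying key.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class AlertRecord:
    key: Tuple[str, str, str]  # assumed identity: (resource, metric, component)
    state: str                 # "critical", "warning", or "ok"
    repeat_count: int = 0


def handle_recurrence(existing: AlertRecord, new_state: str) -> AlertRecord:
    """Return the alert that should represent a recurrence of `existing`."""
    if existing.state == "ok":
        # The earlier alert has healed, so the recurrence becomes a new alert.
        return AlertRecord(key=existing.key, state=new_state)
    # The earlier alert is still active, so only its repeat count grows.
    existing.repeat_count += 1
    return existing
```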
The following filters can be applied to alerts:
|Client||View the alerts of all or selected clients.|
|Resource Origin||View alerts from all sources or a single source.|
|Sites||Select alerts specific to the site.|
|Resource Groups||Select alerts associated with a resource group.|
|Resource Type||Filter alerts by resource type. You can specify up to ten resource types.|
|Name||Filter alerts by resource name.|
|Source||Filter alerts by source, such as the available integrations.|
|Entity Type||Filter alerts by entity type:|
|Alert Type||Filter alerts by alert type:|
|Metric||Filter alerts by metric name.|
|Priority||Filter alerts by priority, where P0 is the lowest priority and P5 is the highest priority.|
|Current Status||Filter alerts by status or state:|
|Actions||View alerts by current status:|
|Duration||Filter alerts that occurred within a specified duration. For example, if the duration is set to the last seven days and the alert timestamp is set to created time, alerts created within the last seven days are selected.|
|From Date||Specify duration start time.|
|To Date||Specify duration end time.|
|Alert Timestamps||Filter by alert timestamp.|
|Event Type||Filter by event type:|
|Special Attribute||Filter on alerts triggered when a resource becomes unavailable.|
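The interaction between the Duration and Alert Timestamps filters can be illustrated with a short sketch. The dictionary keys `created_time` and `updated_time` and the function name are assumptions made for the example, not product field names.

```python
from datetime import datetime, timedelta, timezone


def filter_by_duration(alerts, timestamp_field, duration, now=None):
    """Select alerts whose chosen timestamp falls within `duration` before `now`."""
    now = now or datetime.now(timezone.utc)
    start = now - duration
    return [a for a in alerts if start <= a[timestamp_field] <= now]


now = datetime(2024, 3, 10, tzinfo=timezone.utc)
alerts = [
    {"id": 1, "created_time": now - timedelta(days=2), "updated_time": now - timedelta(days=1)},
    {"id": 2, "created_time": now - timedelta(days=10), "updated_time": now - timedelta(days=1)},
]

# Duration "last seven days" applied to the created time selects only alert 1.
created_last_week = filter_by_duration(alerts, "created_time", timedelta(days=7), now)
```

Applying the same duration to `updated_time` instead would select both alerts, which is why the chosen timestamp matters.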
Alert noise or alert floods might make it difficult to recognize and deal with more important alerts. The platform can reduce alert noise in the following ways:
- Automatically ignore repeat alerts
- Manually stop processing alerts temporarily
Automatically ignore repeat alerts
When the platform receives more than one alert with the same resource, metric, component, and state combination within one minute, the platform ignores the alerts after processing the first alert.
Ignoring repeat alerts conserves resources. For example, outgoing email has been blocked in some instances when the platform continued to process the noise. Once no repeat alerts with the same resource, metric, component, and state combination arrive within a minute, the platform resumes normal alert processing.
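The repeat-alert rule can be sketched as a small filter keyed on the resource, metric, component, and state combination. This is an illustration rather than the platform's implementation: the `Alert` and `RepeatFilter` names are hypothetical, and the sketch assumes the one-minute window is measured from the most recent alert with the same combination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    resource: str
    metric: str
    component: str
    state: str
    timestamp: float  # seconds since the epoch


class RepeatFilter:
    """Processes the first alert for a key and ignores repeats within the window."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._last_seen = {}  # key -> timestamp of the most recent alert

    def should_process(self, alert: Alert) -> bool:
        key = (alert.resource, alert.metric, alert.component, alert.state)
        last = self._last_seen.get(key)
        self._last_seen[key] = alert.timestamp
        # Process when the key is new or a full window has passed without repeats.
        return last is None or alert.timestamp - last >= self.window_seconds
```

Because the window restarts on every repeat, a continuous stream of identical alerts stays ignored until a quiet minute has passed.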
Manually stop processing alerts temporarily
When the platform receives an alert flood from a resource or metric, the operations team stops processing alerts for that resource or metric until the flooding has stopped. For example, flooding occurs when a single metric on a resource, or a resource as a whole, sends hundreds of alerts within a few minutes. The flood affects the processing of the client's other alerts, which might be important. Because of the load these noisy alerts generate, a flood from one client might also affect alert processing for other clients. When flooding occurs, the operations team changes the server configuration to stop processing these alerts and informs the affected client of the issue. Once the issue is resolved, the operations team restores the configuration so that all alerts are processed again.
Alert repeat and occurrence history
Older alert occurrences are not always available because they are purged after 180 days.
- On the Alerts page, the link in the Repeat column does not show the full number of alerts indicated when some occurrences are older than 180 days.
- On the Alert Details page, the Total Occurrences link does not show the full number of alerts indicated when some occurrences are older than 180 days.
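The gap between the indicated count and the occurrences that are still listed follows directly from the 180-day retention period. A minimal sketch, with a hypothetical function name:

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=180)  # occurrences older than this are purged


def available_occurrences(occurrence_times, now):
    """Return the occurrence timestamps that are still within the retention period."""
    cutoff = now - RETENTION
    return [t for t in occurrence_times if t >= cutoff]


now = datetime(2024, 7, 1, tzinfo=timezone.utc)
occurrences = [now - timedelta(days=d) for d in (1, 30, 179, 181, 400)]
still_listed = available_occurrences(occurrences, now)
```

Here the alert has repeated five times, but only three occurrences remain available to display.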
Machine learning concepts
Machine learning (ML) status shows the stage of machine learning in a policy, from analyzing alert sequences to suppressing alerts:
|Insufficient data. The policy is temporarily disabled. Due to insufficient data, the machine learning model cannot detect any alert sequences and the policy is temporarily disabled. The policy becomes active when the machine learning model has sufficient data.|
|Training ML model is queued. When a policy is created or a CSV file is uploaded to a policy, the training can be queued. If a policy is in training, the new policy is queued. After the training on the existing policy is completed, the status of the new policy moves to training initiated.|
|Training ML model is initiated. After completion, training on the machine learning model is initiated. The status then moves to training started.|
|Training ML model is started. After completion, training on the machine learning model is started. Training progress is shown on the progress bar.|
|Training ML model is in progress. After completion, training on the ML model continues with the percentage of progress shown in the progress bar.|
|ML model training is complete. Predictions begin, and the ML model detects alert sequences and suppresses alerts.|
|ML training encountered an error. Contact support for assistance.|
ML seasonality patterns
Alert seasonality patterns help detect alerts that recur on a seasonal schedule. These patterns are based on unmodified alert information retrieved from existing data. OpsQ suppresses an alert that matches a seasonality pattern detected by ML. For example, if an alert regularly occurs at 10 PM on Mondays, OpsQ suppresses the alert triggered at 10 PM on Mondays.
ML retrieves three months of alerts and groups alerts based on three attributes (resource, metric, and component). Seasonal patterns of alerts are studied for each group. Alert groups that have seasonal patterns are listed and displayed when a group is selected.
At least ninety percent of the alerts in a group must be seasonal for the group to qualify as seasonal.
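The grouping and the ninety-percent rule can be illustrated with a simplified sketch. The source does not describe how ML decides that an alert is seasonal, so this example substitutes a plain assumption: an alert counts as seasonal when it falls in the group's most common weekday-and-hour slot. The function name and the tuple layout of the input are also hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def seasonal_groups(alerts, min_fraction=0.9):
    """Group alerts by (resource, metric, component) and keep the seasonal groups.

    alerts is an iterable of (resource, metric, component, timestamp) tuples.
    Returns a dict mapping each seasonal group to its (weekday, hour) slot,
    where weekday 0 is Monday.
    """
    groups = defaultdict(list)
    for resource, metric, component, ts in alerts:
        groups[(resource, metric, component)].append(ts)

    seasonal = {}
    for key, stamps in groups.items():
        slots = Counter((ts.weekday(), ts.hour) for ts in stamps)
        slot, count = slots.most_common(1)[0]
        # Assumed rule: the dominant slot must cover the required share of alerts.
        if count / len(stamps) >= min_fraction:
            seasonal[key] = slot
    return seasonal
```

With this rule, a group whose alerts all fire at 10 PM on Mondays maps to the slot `(0, 22)`, while a group whose alerts are scattered across days and hours is not listed.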
Seasonality pattern graphs
Seasonality pattern graphs have the following attributes:
|Resource, metric, and component||Attributes of alerts and the seasonality group that is selected. The seasonality groups are named using these three attributes.|
|Numbers (displayed horizontally)||The timeline with dates in a month. Hover over the dates to view the date and time for each generated alert.|
|Grey lines||The past alerting time for a specific group (resource, metric, and component).|
|Orange lines||The predicted alerting time based on the learned alert seasonality.|
|Blue shaded area||Zoom in (or zoom out) on the blue line to examine time duration.|
Seasonality groups in the graph refer to alert groups that have seasonal patterns.
The following graph provides the details of the seasonal patterns for alert data and describes when OpsQ suppresses alerts due to matching seasonal patterns:
This is a sample recommendation by the OpsQ bot for a suppressed seasonal alert:
The following graph shows the seasonal pattern of a suppressed alert: