Alert noise is reaching an all-time high in IT organizations. The volume of these alerts from disparate tools and technologies has reached a point where they are greatly undermining the ability of IT organizations to properly manage, secure and optimize services and applications for users and customers. DEJ’s recent study on AIOps found that organizations currently spend $1.27 million annually on avoidable incident escalations that result from non-contextual and non-actionable alerts. This threatens the business on many fronts and detracts from enterprise IT credibility and viability.
Yet at the same time, nobody wants to slow down the progress of the technology transformation which is behind this alert chaos. Cloud-native services, real-time analytics and software-defined infrastructure are creating great momentum for IT to empower the business and help customers in new, meaningful ways. Therefore, we need to find cost-effective ways to manage this data and support business and customer initiatives. This is the mission of OpsRamp.
For most digital operations teams, filtering alerts and discerning which ones require immediate attention is an extremely manual process. To address this, machine learning algorithms have been built into solutions known as AIOps, which quickly filter and correlate alerts so that operators can focus on important issues much faster. In some cases, AIOps tools suggest courses of action or even, when policies dictate, fix a problem automatically.
For example, Zebra Technologies has used OpsRamp to reduce alert noise and, more importantly, free up staff time to address end-user performance issues. Tommy Eurick, manager of service delivery operations at Zebra, puts it this way: “When you have too many alerts, people get into the mindset of ignoring them and then we have issues that turn into problems. With OpsRamp, the alerts my team receives are more meaningful and actionable.”
OpsRamp recently conducted an analysis of customer data to find out exactly what results organizations can achieve with our modern IT operations management (ITOM) platform. The analysis focused on alert volume reduction from alert de-duplication, changes in alert severity and correlation, across a data set of 56.8M alerts spanning more than 100 customer environments over a six-month timeframe.
First, some quick definitions:
- De-duplication is the process of reducing or eliminating duplicate or redundant alerts.
- Changes in alert severity is the scenario where a single metric generates multiple alerts as it crosses successive thresholds, moving from warning to critical values or vice versa.
- ML-based correlation is the process of grouping related alerts into an inference based on pattern detection, learned patterns and shared attributes. Such alerts might originate from dependent infrastructure resources, as follows:
  - Scenario 1: Multiple metrics alerts across multiple resources
  - Scenario 2: Multiple metrics alerts in one resource
  - Scenario 3: Single metric alerts across multiple resources
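To make the definitions above concrete, here is a minimal Python sketch. It illustrates the ideas under stated assumptions and is not OpsRamp's implementation: the `Alert` fields, the function names and the rule that a "duplicate" is an alert repeating the last severity seen for the same resource and metric are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape, chosen only for illustration.
    resource: str
    metric: str
    severity: str   # e.g. "warning" or "critical"
    timestamp: int  # seconds since some reference point


def deduplicate(alerts):
    """Drop alerts that repeat the last severity seen for the same resource and metric."""
    last_severity = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.resource, alert.metric)
        if last_severity.get(key) != alert.severity:
            kept.append(alert)
            last_severity[key] = alert.severity
    return kept


def collapse_severity_changes(alerts):
    """Keep only the most recent severity for each (resource, metric) pair."""
    latest = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        latest[(alert.resource, alert.metric)] = alert
    return list(latest.values())


def correlation_scenario(alerts):
    """Return 1, 2 or 3 for the scenarios listed above, or None if none applies."""
    resources = {a.resource for a in alerts}
    metrics = {a.metric for a in alerts}
    if len(resources) > 1 and len(metrics) > 1:
        return 1  # multiple metrics across multiple resources
    if len(resources) == 1 and len(metrics) > 1:
        return 2  # multiple metrics in one resource
    if len(resources) > 1 and len(metrics) == 1:
        return 3  # single metric across multiple resources
    return None
```

Running `deduplicate` first and `collapse_severity_changes` second leaves one current alert per resource and metric, which is the kind of input a correlation step would then group.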
OpsRamp’s Alert Correlation Categories
- Clustering. Cluster events based on their attributes by analyzing similarities and correlating different alerts into one alert.
- Correlations. Analyze alert sequence patterns to correlate alerts and identify the root cause(s) for an incident. Topology maps allow you to identify the root cause alerts for an incident with the right situational context and impact analysis.
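As a rough illustration of attribute-based clustering, the sketch below groups alerts whose attributes overlap enough. It assumes each alert is a plain dictionary of string attributes and uses Jaccard similarity with a greedy single pass. Both are stand-ins chosen for simplicity, as the article does not describe the OpsQ algorithm, and the function names and threshold are hypothetical.

```python
def jaccard(a, b):
    """Share of (attribute, value) pairs that two alerts have in common."""
    items_a, items_b = set(a.items()), set(b.items())
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 1.0


def cluster_alerts(alerts, threshold=0.5):
    """Greedy clustering: an alert joins the first cluster whose first member is similar enough."""
    clusters = []
    for alert in alerts:
        for cluster in clusters:
            if jaccard(alert, cluster[0]) >= threshold:
                cluster.append(alert)
                break
        else:
            # No existing cluster matched, so this alert starts a new one.
            clusters.append([alert])
    return clusters
```

Each resulting cluster could then be presented to an operator as a single alert. A production system would likely weight attributes differently and account for time, which this sketch ignores.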
Here are the steps we took:
- Raw alerts from native and third-party sources are passed through OpsRamp’s alert enrichment model to eliminate duplicates.
- The normalized data is fed into the OpsQ Engine, which uses ML to recognize patterns in the data and draws on data from prior implementations and alert specs to derive correlation rules, train the model and create inferences.
- After the inferences are created, the alert volume reduction potential is calculated from the number of alerts sent through the model and the number of alerts that were correlated.
[Note: Variables influencing model performance include the number of alerts in the dataset, the timespan of the alerts and the density of the alert volume.]
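The final step can be sketched as a short calculation. This is one plausible reading, not the exact formula used in the analysis: it assumes both reductions are expressed as a share of the raw alert count so that they add up to the overall reduction. The function and parameter names are hypothetical.

```python
def reduction_stats(raw, after_dedup, after_correlation):
    """Return de-duplication, correlation and total reduction as percentages of raw alerts."""
    if raw <= 0:
        raise ValueError("raw alert count must be positive")
    if not (raw >= after_dedup >= after_correlation >= 0):
        raise ValueError("counts must not increase from one stage to the next")
    dedup_pct = 100 * (raw - after_dedup) / raw
    # Correlation reduction is measured on top of de-duplication, still relative to raw.
    correlation_pct = 100 * (after_dedup - after_correlation) / raw
    return {
        "dedup_pct": dedup_pct,
        "correlation_pct": correlation_pct,
        "total_pct": dedup_pct + correlation_pct,
    }
```

With made-up counts of 1,000 raw alerts, 600 after de-duplication and 100 after correlation, `reduction_stats(1000, 600, 100)` reports a 40% reduction from de-duplication, a further 50% from correlation and 90% overall.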
Alert stats for a single customer, 2020 year to date.
The de-duplication model combined with our advanced correlation model reduced the volume of raw alerts ingested by 92%. This is a compelling testament to the power of OpsRamp OpsQ’s AI and machine learning to address real pain points in modern IT environments. Reducing alert volume saves valuable staff time and allows personnel to focus on identifying and resolving incidents that affect important business services. The end goal of this technology is not just to correlate alerts and reduce the noise, but to discover which alerts are tied to critical business services and may be degrading them. This information gives IT teams the context to focus on critical issues first and to ensure they are delivering the best possible performance and availability to customers.
In the graph above, the percentage of alert volume reduction from de-duplication is shown in red, and the additional percentage from correlation on top of de-duplication is shown in blue. The entire bar for each customer shows the overall alert volume reduction potential from de-duplication and correlation combined.
All OpsRamp customers get alert de-duplication capabilities as part of the native event management solution. Customers can also choose to enable machine learning-based correlation, which on average delivers an additional 51% reduction in alert volume. OpsRamp also offers Observed Mode and Recommend Mode, features that allow users to build trust in machine learning.