Closer Look: OpsRamp Intelligent Alert Escalation (Part 1)

In this article:

  • How OpsRamp’s machine learning (OpsQ) can learn how your IT teams handle incidents to enable more accurate prioritization, routing and assignment;
  • Learn about the three components of intelligent alert escalation;
  • How your ITSM is the seed to begin building a decision tree for alert escalation.

Alert management is no longer a manageable task, given the growth in applications, cloud environments and point monitoring tools. Too much time is spent filtering and making sense of alert data and determining where to route incidents. All of these steps slow down critical issue identification and resolution.

In this article, I want to discuss a more sensible, modern way to deal with IT alerts, through machine learning intelligence and automation. The concept of intelligent alert escalations relies upon AIOps technology which can take the inefficient manual work out of alert management so that operators can focus on investigating and fixing important issues for the business without delay.  

Intelligent alert escalation incorporates the following five activities:

  • Create a set of policies for alert/incident routing;
  • Define conditions for incident creation;
  • Integrate with your IT service management platform;
  • Enable continuous learning in OpsRamp to train the decision-tree model;
  • Refocus staff workflow on resolving unknown events.

Intelligent alert escalation with OpsRamp can deliver the following benefits:

  • No more teams staring at screens of event streams for direction;
  • Shift workflow to focus on new incidents;
  • Use machine learning (OpsQ) to learn how your team handles incidents, enabling more accurate automation;
  • Delivering enriched incidents to improve investigation and resolution;
  • Integration with IT Service Management (ITSM) tools to save time;
  • Better business response and alignment;
 

Getting started

  1. Select the right policies. Most organizations have alert escalation policies, although they may have too many or ones which are not high priority. Start by creating a handful of critical alert escalation policies – no more than five or six – which relate to your greatest business service risks. 
    alert-escalation-policies-list

  2. Define the conditions for incident creation from alerts. OpsRamp machine learning algorithms drive the process from there, sending the alert to the correct operator who then can enrich the incident with tags and initiate the right incident resolution workflow. OpsRamp includes full roster support so alerts can be routed to the person who is currently on duty, eliminating time-consuming texts and calls to find an available expert. Note: this process takes place after OpsRamp has automatically de-duplicated, filtered and correlated the alerts to reduce noise. Typically, organizations see 60-80% reduction in alert volume through this process.

  3. Integrate with ITSM. OpsRamp provides out-of-the-box integrations with ServiceNow and other popular ITSM tools. Our APIs also allow IT teams to connect the platform with any other ITSM of their choice. To start, you can import historical ticket data from the ITSM system for context, which is then a seed to begin building a decision tree for alert escalation, prioritization and routing. What’s great is that through this integration, users don’t have to go back and forth between their service desk system and the IT incident management system. Incidents can be created in OpsRamp and then automatically populated as tickets in the ITSM tool. At each step of the incident workflow process, the ITSM system is automatically updated with the latest information and status. That way help desk employees, IT operations staff and SREs can be on the same page. And, the information flow goes both ways since this is a two-way integration enabling OpsQ to learn from actions taken within the ITSM system.

  4. Enable Continuous Learning to improve alert and incident management. As you begin to use OpsRamp’s alert escalation and routing features, the system can learn best practices and emulate them later. This can save a lot of time and effort especially with common, repeatable events. To illustrate how this works, consider a ticket that is first routed to the event management team, which then assigns a Network Level I priority. After investigation, that team decides to assign the incident to the SQL DBA L2 team, which then changes the ticket to high priority. The next time that a similar incident is created, it is routed directly to the DBA team and attaches the appropriate knowledge documents. This can bypass hours of triage for a known event. 

    To initiate the training, you must enable Continuous Learning in OpsRamp’s Alert Escalation Policies. You can prepare a training file, and notate which attributes are used (such as resource groups, cloud subscription or custom attributes) to drive the incident decision. You can also let OpsQ learn the field values from existing alert and incident data by providing an empty CSV file with only column names. OpsQ will then learn incident creation rules from the last three months of alert and incident data. The learning process runs every month to ensure the learned ML model matches the latest pattern in the data.

  5. Focus staff time on unknown events. When an event occurs that creates an incident and doesn't have a known outcome it would then be routed to the general incident queue in the ITSM. Event management teams will then assign, prioritize and route the incidents which will, in turn, drive the learning process within OpsRamp. This allows experts to focus energies on the tougher, more obscure issues.

In my next post, I will take a deep dive into alert correlations, highlighting the use of OpsRamp Observe and Recommend Mode to guide users down smart and safe automation pathways.

Related links:

Promo-OpsRamp-Offer-CTA


Recommended posts