In this article:
- How OpsRamp’s machine learning (OpsQ) can learn how your IT teams handle incidents to enable more accurate prioritization, routing and assignment;
- Learn about the three components of intelligent alert escalation;
- How your ITSM is the seed to begin building a decision tree for alert escalation.
- Select the right policies. Most organizations have alert escalation policies, although they may have too many or ones which are not high priority. Start by creating a handful of critical alert escalation policies – no more than five or six – which relate to your greatest business service risks.
- Define the conditions for incident creation from alerts. OpsRamp machine learning algorithms drive the process from there, sending the alert to the correct operator who then can enrich the incident with tags and initiate the right incident resolution workflow. OpsRamp includes full roster support so alerts can be routed to the person who is currently on duty, eliminating time-consuming texts and calls to find an available expert. Note: this process takes place after OpsRamp has automatically de-duplicated, filtered and correlated the alerts to reduce noise. Typically, organizations see 60-80% reduction in alert volume through this process.
- Integrate with ITSM. OpsRamp provides out-of-the-box integrations with ServiceNow and other popular ITSM tools. Our APIs also allow IT teams to connect the platform with any other ITSM of their choice. To start, you can import historical ticket data from the ITSM system for context, which is then a seed to begin building a decision tree for alert escalation, prioritization and routing. What’s great is that through this integration, users don’t have to go back and forth between their service desk system and the IT incident management system. Incidents can be created in OpsRamp and then automatically populated as tickets in the ITSM tool. At each step of the incident workflow process, the ITSM system is automatically updated with the latest information and status. That way help desk employees, IT operations staff and SREs can be on the same page. And, the information flow goes both ways since this is a two-way integration enabling OpsQ to learn from actions taken within the ITSM system.
- Enable Continuous Learning to improve alert and incident management. As you begin to use OpsRamp’s alert escalation and routing features, the system can learn best practices and emulate them later. This can save a lot of time and effort especially with common, repeatable events. To illustrate how this works, consider a ticket that is first routed to the event management team, which then assigns a Network Level I priority. After investigation, that team decides to assign the incident to the SQL DBA L2 team, which then changes the ticket to high priority. The next time that a similar incident is created, it is routed directly to the DBA team and attaches the appropriate knowledge documents. This can bypass hours of triage for a known event.
To initiate the training, you must enable Continuous Learning in OpsRamp’s Alert Escalation Policies. You can prepare a training file, and notate which attributes are used (such as resource groups, cloud subscription or custom attributes) to drive the incident decision. You can also let OpsQ learn the field values from existing alert and incident data by providing an empty CSV file with only column names. OpsQ will then learn incident creation rules from the last three months of alert and incident data. The learning process runs every month to ensure the learned ML model matches the latest pattern in the data.
- Focus staff time on unknown events. When an event occurs that creates an incident and doesn't have a known outcome it would then be routed to the general incident queue in the ITSM. Event management teams will then assign, prioritize and route the incidents which will, in turn, drive the learning process within OpsRamp. This allows experts to focus energies on the tougher, more obscure issues.
- TechTalk: How OpsRamp's Intelligent Escalations Enables Remote Workers
- Blog: Alert Integrations: Right Person, Right Message, Right Time
- Blog: Reduce Incident Downtime and Fix Outages Sooner with Policy-Based Alert Escalation Management