AI: Let’s Get Real

5 Min Read | December 27, 2019

This article originally appeared in Toolbox.

The hype around the potential of AI and machine learning tools in IT organizations has gone off the charts. It’s time to temper expectations and move toward a sensible adoption path.

“AIOps platform adoption —in particular, machine data, event, telemetry and log-management-oriented tools— continues to rise in support of monitoring and root cause analysis efforts. This is due to their ability to rapidly perform and support highly complex diagnostic tasks across multiple domains.” - Hype Cycle for IT Performance Analysis, 2018.

Artificial intelligence is not fairy dust sprinkled on a data center, despite the enthusiastic proclamations of many IT operations vendors today. A couple of years ago, AI was the bride atop the wedding cake. It was perfect, promising to make errors and out-of-control performance issues obsolete. It would make IT’s job so much easier and more enjoyable. But in most cases, the concept remains a beautiful fantasy. CIOs are now scrutinizing the practicalities behind AI and machine learning. Indeed, Gartner’s 2019 Hype Cycle for ITSM predicts that while AIOps will have transformational benefits in IT, it won’t go mainstream for at least another five years.

The most practical and safest goal for AI in IT is this: operators and other end users should use AI insights to support decision-making, not to relinquish it. IT people don’t fully trust the machines, nor should they. First, let’s revise the narrative: what we call AI is not truly artificial intelligence, but algorithms that are programmed to make decisions (this is why the original definition of “AIOps” was algorithmic intelligence for IT operations). Second, AI will never be 100% accurate. It needs context and human intelligence to confirm its analysis and recommendations.

Let’s dive in with some examples:

Triage

At my company, OpsRamp, we use AI-enhanced monitoring tools (AIOps) to help decide who should receive the alert when an issue is surfaced. This significantly cuts the manual time it would otherwise take to determine the source of the problem and then research who can best fix it now. The system suggests where to send the alert, based on historical events, and the operator makes the final call on where it goes. Why is this important? Without a check by an experienced IT operations professional, AI can mischaracterize an alert and escalate it to the wrong group. Those individuals, busy with other tasks, may choose to ignore the alert or delay acting upon it. The issue can then escalate or cause other, more severe problems in the environment.
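To make the triage flow concrete, here is a minimal sketch of "the system suggests, the operator decides." It is not how OpsRamp or any particular product works. The names `HISTORY`, `suggest_team`, and `route_alert` are hypothetical, and the suggestion logic (pick the team that most often handled this alert type in the past) is an assumption chosen for simplicity.

```python
from collections import Counter

# Hypothetical history of (alert_type, team_that_resolved_it) pairs.
HISTORY = [
    ("db_latency", "database"),
    ("db_latency", "database"),
    ("db_latency", "network"),
    ("disk_full", "storage"),
]


def suggest_team(alert_type, history):
    """Return (team, share_of_past_cases) for the most frequent resolver,
    or None when this alert type has never been seen."""
    teams = Counter(team for kind, team in history if kind == alert_type)
    if not teams:
        return None
    team, count = teams.most_common(1)[0]
    return team, count / sum(teams.values())


def route_alert(alert_type, history, operator_decides):
    """Compute a suggestion, then hand the final call to the operator.

    operator_decides is a callback standing in for the human review step;
    it receives the alert type and the suggestion and returns the team.
    """
    suggestion = suggest_team(alert_type, history)
    return operator_decides(alert_type, suggestion)


# The operator accepts the suggestion in one case and overrides it in another.
accepted = route_alert("db_latency", HISTORY, lambda alert, s: s[0])
overridden = route_alert("db_latency", HISTORY, lambda alert, s: "network")
```

The point of the callback is that no alert is routed without a human decision; the suggestion only shortens the research.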

Applying a fix

Let’s say an organization has AIOps in place to help with root cause analysis, an increasingly common and effective use of the technology. Perhaps a business service is running slowly. The system can automatically run a script to safely reboot a database and see if it clears up the problem: no harm done and potential time saved. Now let’s say that reboot doesn’t solve the problem and the AI system begins to move methodically through the next steps to solve the issue. That might result in the procurement of a new cloud resource to host the troubled workload, adding unneeded cost and sprawl if another solution was available. Worse, there may be interdependencies that the AI is not aware of: the automated action may end up causing more harm, such as taking down the ecommerce site or transferring a workload involving customer data to an infrastructure resource without the proper security protections.

The AI-human blend

To avoid the above scenarios, IT needs a blended approach to AI, incorporating human decision-making and confirmation of recommended courses of action. IT operators will need to learn how to set up an appropriate AI automation workflow, based on their knowledge of scenarios and the chain of events that occur in incident detection, management, and resolution. At a high level, here’s how the blended approach works:

Correlations and root cause analysis

AI tools are adept at making correlations by analyzing vast amounts of disparate data across the infrastructure, including external workloads in the cloud. They can deduplicate alerts to cut down the noise, find patterns between different datasets and investigate the most likely causes of an issue. Once the AI has done the dirty work, an operations expert can review the analysis and add her own research and knowledge to arrive at the most accurate root cause.
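As a rough illustration of the deduplication step, the sketch below collapses repeats of the same alert that arrive close together. It is a simple time-window rule, not the correlation logic of any real AIOps product. The alert fields (`ts`, `resource`, `check`) and the five-minute default window are assumptions made for the example.

```python
def deduplicate(alerts, window_seconds=300):
    """Keep an alert only if the same (resource, check) pair was not seen
    within window_seconds of its previous occurrence."""
    last_seen = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["resource"], alert["check"])
        previous = last_seen.get(key)
        if previous is None or alert["ts"] - previous > window_seconds:
            kept.append(alert)
        # Track every occurrence so a continuous storm stays collapsed.
        last_seen[key] = alert["ts"]
    return kept


# Hypothetical alert stream; ts is seconds since the start of the incident.
alerts = [
    {"ts": 0, "resource": "db1", "check": "cpu"},
    {"ts": 30, "resource": "web1", "check": "memory"},
    {"ts": 60, "resource": "db1", "check": "cpu"},
    {"ts": 120, "resource": "db1", "check": "cpu"},
    {"ts": 1000, "resource": "db1", "check": "cpu"},
]
quiet = deduplicate(alerts)
```

The operator still reviews what remains; the filter only reduces how much there is to read.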

Suggest and apply a fix

AI can highlight the best courses of action, but the operator first reviews them and adds his own knowledge before moving forward. The operator can choose to fix the problem himself or, in the case of low-risk actions, program the AI tool to handle it. Those tasks could include patch updates, handling password reset requests, or translating email requests into trouble tickets. In these examples, AI can save a few minutes per incident, which adds up measurably over the course of a week.
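One way to picture this split between low-risk automation and operator-approved actions is a small approval gate. This is a sketch under stated assumptions: the action names, the `LOW_RISK_ACTIONS` allowlist, and the `decide` function are hypothetical, and a real deployment would set the allowlist according to its own risk tolerance.

```python
# Hypothetical allowlist, based on the low-risk tasks named above.
LOW_RISK_ACTIONS = {"patch_update", "password_reset", "email_to_ticket"}


def decide(action, operator_approves):
    """Return how a recommended action should proceed.

    "auto"     -> on the allowlist, the tool may run it unattended
    "approved" -> not on the allowlist, the operator signed off
    "held"     -> not on the allowlist, the operator declined or deferred
    """
    if action in LOW_RISK_ACTIONS:
        return "auto"
    return "approved" if operator_approves(action) else "held"


# The operator callback stands in for a human review step.
routine = decide("password_reset", lambda action: False)
risky = decide("provision_cloud_resource", lambda action: False)
reviewed = decide("migrate_workload", lambda action: True)
```

Anything not explicitly listed defaults to human review, so adding a new kind of automated fix is a deliberate choice instead of an accident.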

IT operations leaders and data scientists will play an important role in formulating the right strategy for applying AI technologies to monitoring, managing and optimizing the environment. How much should the tools be allowed to do, unfettered? These decisions require balancing business opportunity against business risk, likely with many adjustments along the way.

AIOps won’t protect operators from occasionally making the wrong decision. The benefit is that ITOps teams have heightened visibility and insight to research issues and improve outcomes faster as the infrastructure evolves. Start by enabling AI to save time on the mundane tasks: research, correlation, filtering the noise and applying simple, routine fixes. Apply these expensive, high-octane tools to the areas of the organization where they can have the most impact in empowering users to make the best decisions when the stakes are highest.
