Rethinking Data for AIOps-Driven IT Operations

The following article first appeared in Datanami.

The rise of the hybrid enterprise is wreaking havoc on IT teams. IT service/operations management (ITSM/ITOM) is more challenging than ever, as infrastructure continues to become more complex, more involved, and more resource-intensive.

Emerging technologies such as cloud-native architectures, DevOps, and multi-cloud infrastructure are creating as many problems for IT teams as they are creating opportunities. Each new system, platform, or solution needs to be continuously monitored and managed, usually through a dedicated monitoring tool and reporting dashboard.

With event and alert volumes constantly increasing, IT teams simply don’t have the time or manpower to manually evaluate, correlate, and act on all that data. Worse, IT leaders are struggling to unify fragmented technologies, data, and toolsets across teams and domains, which increases time-to-resolution and makes it nearly impossible to keep services running at peak performance at all times.

Gartner predicts that the demands of digital business on ITOM will continue to intensify and evolve in the years to come. But the growth rate of data volume is outpacing companies’ ability to spend on technology and personnel, allowing potentially impactful insights to slip through the cracks and greatly devaluing IT operations’ role in the organization.

AIOps: The Obvious Solution to a Growing Challenge

These IT teams are turning to artificial intelligence for IT operations (AIOps) to help them sift through huge volumes of IT infrastructure data. From events and logs to telemetry data or other workload performance and availability information, IT leaders are seeking comprehensive solutions for automating many of the most time- and resource-intensive ITOps activities to free their teams to take a more strategic role in driving digital innovation and transformation.

Consolidating IT-related event management into a single engine makes it easier to correlate various data sources, while eliminating duplicate information from tools with redundant or overlapping capabilities and standardizing the data used in strategic decision making. The computational techniques and advanced algorithms help predict a range of incidents across infrastructure, common IT management tools, and enterprise processes — and resolve them autonomously.
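To make the idea of eliminating duplicate information concrete, here is a minimal sketch of event deduplication across overlapping tools. The Event fields, the tool names, and the 60-second suppression window are assumptions chosen for illustration, not features of any particular AIOps product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str     # monitoring tool that emitted the event (hypothetical names)
    host: str       # device the event refers to
    check: str      # condition that fired, e.g. "cpu_high"
    timestamp: int  # seconds since epoch


def deduplicate(events, window_seconds=60):
    """Suppress events that repeat the same (host, check) pair within
    window_seconds of the last kept event, whichever tool reported them."""
    last_kept = {}  # (host, check) -> timestamp of the last kept event
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.host, ev.check)
        previous = last_kept.get(key)
        if previous is None or ev.timestamp - previous >= window_seconds:
            kept.append(ev)
            last_kept[key] = ev.timestamp
    return kept


events = [
    Event("apm", "web-1", "cpu_high", 100),
    Event("npm", "web-1", "cpu_high", 110),  # same condition, second tool
    Event("apm", "web-1", "cpu_high", 200),  # outside the window, kept
    Event("npm", "db-1", "cpu_high", 105),
]
unique_events = deduplicate(events)
```

In this example, the second report for web-1 is dropped because it repeats a condition another tool had already raised ten seconds earlier.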

Find the Right Frequency to Eliminate Noise and Interference

All machine learning and AI-based technologies are driven by data, and lots of it. But making sure the system gets the right data — the cleanest and most relevant data — is the secret to successfully automating core ITSM activities.

Like a radio that can only produce crystal-clear sound when it’s receiving specific data, AIOps tools must also be configured to receive specific data signals from the right devices to produce a crystal-clear, dynamic snapshot of your entire environment. Bolt-on point tools, such as application performance monitoring (APM) and network performance monitoring (NPM) tools, collect limited, more general data that can contribute to more noise and interference, similar to the static heard when the radio dial is set a half-tick off the target station’s band.

AIOps tools must be configured to receive data from the highest priority endpoints from among the potentially thousands of devices, components, or customer touchpoints common to today’s sprawling IT environment. The platform should aggregate data from all those systems and endpoints, filter out unnecessary signals to help prioritize which data signals to pay the closest attention to, and employ inference models to validate it for accuracy and relevance.
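As a rough illustration of filtering for the highest-priority endpoints, the sketch below keeps only signals from endpoints ranked at or above a chosen priority. The endpoint names, the priority map, and the signal fields are all hypothetical; a real platform would derive them from its own inventory.

```python
# Hypothetical priority map: a lower number means a more critical endpoint.
ENDPOINT_PRIORITY = {
    "payment-gateway": 1,
    "web-frontend": 2,
    "build-agent": 5,
}


def select_signals(signals, max_priority=2):
    """Drop signals from unknown or low-priority endpoints and
    return the rest ordered from most to least critical."""
    ranked = [
        (ENDPOINT_PRIORITY[s["endpoint"]], s)
        for s in signals
        if s["endpoint"] in ENDPOINT_PRIORITY
    ]
    kept = [(priority, s) for priority, s in ranked if priority <= max_priority]
    return [s for _, s in sorted(kept, key=lambda pair: pair[0])]


signals = [
    {"endpoint": "build-agent", "metric": "cpu", "value": 0.97},
    {"endpoint": "web-frontend", "metric": "latency_ms", "value": 850},
    {"endpoint": "payment-gateway", "metric": "error_rate", "value": 0.04},
    {"endpoint": "printer-3", "metric": "toner", "value": 0.1},
]
selected = select_signals(signals)
```

Endpoints missing from the map are treated as noise here, which is a deliberate simplification; another reasonable choice would be to route them to a review queue.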

That way, the actionable insights are contextual and relevant to the existing environment. It is, in effect, tuning the radio to extract the best signals for the clearest output.

Turning the Dial: How to Tune the Radio

Tuning a radio is as simple as turning a dial. Want to listen to country music? Tune specifically to a country station’s frequency. Any variance in the tuning may result in static or pull in other signals that negatively impact the listener’s experience.

Tuning an AIOps platform is more involved than tuning a radio, but the principle is the same: the algorithms that power the solution must be “tuned” to align with specific operational and business objectives. For example, to proactively identify service disruptions and refocus IT’s efforts on restoring high priority services, an algorithm should be tuned specifically to forecasting and change detection alerts to identify capacity or performance degradation conditions as early as possible before there is any user impact.
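One simple way to picture change detection is to compare each new measurement against a rolling baseline of recent values. The sketch below flags the first point that deviates from that baseline by more than a chosen number of standard deviations. The baseline size, the threshold, and the latency figures are assumptions for illustration, and production systems typically use more robust methods.

```python
from statistics import mean, stdev


def detect_change(values, baseline_size=5, threshold=3.0):
    """Return the index of the first value that deviates from the mean of
    the preceding baseline_size values by more than threshold standard
    deviations, or None if no such value exists."""
    for i in range(baseline_size, len(values)):
        baseline = values[i - baseline_size:i]
        mu = mean(baseline)
        # Floor sigma so a perfectly flat baseline does not divide by zero.
        sigma = max(stdev(baseline), 1e-9)
        if abs(values[i] - mu) / sigma > threshold:
            return i
    return None


# Hypothetical response times in milliseconds; degradation begins at index 6.
latencies_ms = [100, 102, 99, 101, 100, 101, 140, 150]
first_change = detect_change(latencies_ms)
```

Lowering the threshold surfaces degradation earlier at the cost of more false alarms, which is exactly the kind of tuning decision that should follow from the business objective.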

Similarly, understanding how alerts relate to one another — whether an outage with a microservice is an isolated incident due to a bug or if it’ll take down other services with it — requires an algorithm designed to infer which of the alerts is valid and another rule dictating whether the alert can be automatically resolved or if specific users should be notified to address it manually.

The value of an AIOps platform is directly tied to its ability to ingest specific data and precisely execute particular commands. Building generic algorithms will always yield limited, generic results that provide little value or utility in advancing business goals.

Instead, algorithms should be adapted to specifically defined outcomes and purposes. This is a much likelier scenario with solutions featuring native instrumentation that parses data before it gets ingested by the algorithm and delivers only the right information, rather than requiring teams to manually retune the rules for each data stream individually.

Quality Signal + Proper Tuning = High-Fidelity “Sound”

The primary benefit of AIOps is that it autonomously contextualizes massive volumes of data, something human staff might otherwise lack the time, energy, or know-how to do. Keeping with the radio analogy, in which coordinating signal production with signal processing creates a quality sound output, AIOps takes the signals collected from endpoints, runs the data through purpose-built algorithms, and produces actionable insights as outputs that inform both automated and manual management activities.

AIOps tools can inject context into data by linking into other data types, such as change logs or user-generated documentation, rather than focusing on just a single stream. To do so, they must be able to take into account both operational and business considerations to deliver a deeper and more contextual view of IT incidents that enables:

Visualizing the upstream and downstream resources that comprise an IT service
Creating rules to correlate events based on their attributes
Correlating alerts that share the same cause
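
Correlating events by their attributes, as listed above, can be sketched as grouping alerts that share the same values for a chosen set of keys. The alert fields ("id", "service", "cause") and the sample values are hypothetical; which attributes matter depends on the environment.

```python
from collections import defaultdict


def correlate(alerts, keys=("service", "cause")):
    """Group alert IDs by the values of the given attributes, so alerts
    that share a service and a cause end up in the same group."""
    groups = defaultdict(list)
    for alert in alerts:
        group_key = tuple(alert.get(k) for k in keys)
        groups[group_key].append(alert["id"])
    return dict(groups)


alerts = [
    {"id": "a1", "service": "checkout", "cause": "db_timeout"},
    {"id": "a2", "service": "search", "cause": "disk_full"},
    {"id": "a3", "service": "checkout", "cause": "db_timeout"},
]
correlated = correlate(alerts)
```

Here the two checkout alerts collapse into a single group, so an operator sees one incident with two symptoms rather than two unrelated alerts.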

Consider a multi-server environment in which an organization runs two sets of servers, one in production and the other in development, hosting the same applications. Contextualizing alerts from each individual device is the only way to discern which come from production servers that need immediate attention and which can be dropped down the priority list.
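
The production-versus-development scenario can be illustrated with a small enrichment step that tags each alert with its environment and sorts production alerts first. The host names and the host-to-environment lookup are assumptions; in practice this context would likely come from a CMDB or similar inventory.

```python
# Hypothetical lookup from host name to environment.
HOST_ENVIRONMENT = {
    "app-prod-01": "production",
    "app-dev-01": "development",
}
# Lower rank sorts first; unknown environments sort last.
ENVIRONMENT_RANK = {"production": 0, "development": 1}


def prioritize(alerts):
    """Attach an "environment" field to each alert and return the
    alerts with production ahead of development."""
    enriched = [
        {**alert, "environment": HOST_ENVIRONMENT.get(alert["host"], "unknown")}
        for alert in alerts
    ]
    return sorted(
        enriched,
        key=lambda a: ENVIRONMENT_RANK.get(a["environment"], len(ENVIRONMENT_RANK)),
    )


incoming = [
    {"host": "app-dev-01", "message": "disk 91% full"},
    {"host": "app-prod-01", "message": "disk 91% full"},
]
ordered = prioritize(incoming)
```

The two alerts carry identical messages, and only the added environment context reveals which one deserves attention first.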

Configuring an AIOps platform to analyze both IT operational and business data makes it possible to detect patterns of behavior that not only help to keep essential services up and running, but also help uncover new opportunities to monetize data or services that may emerge from the added intelligence.

AIOps Thrives on Complexity

Perhaps the greatest advantage and promise of AIOps is its ability to ingest the ever-expanding universe of analytics, logs, metrics, events, and other data churned out by the typical modern enterprise. And the more data it consumes, the better the eventual outcome. We’re at an inflection point in the history of technology; at no time in history has technology been this complex, but luckily, modern technology is also capable of handling it.