Closer Look: Intelligent Infrastructure Monitoring

This blog is running in conjunction with the upcoming OpsRamp/451 Research webinar: The Four Stages of IT Operations Maturity.

In this article:

Why infrastructure monitoring tools need to adapt
The early issues with new, intelligent infrastructure monitoring
How intelligent infrastructure monitoring tools work

Infrastructure monitoring systems have been undergoing massive change in the past few years. Before the proliferation of cloud computing, IoT and edge computing, mobile apps and SaaS apps, the practice was fairly static, albeit traditionally quite siloed.

IT managers needed to get data on a set of standard metrics, from a handful of tools:

Application – Performance (like response time and error rates) and availability
Operating System – Utilization, OS service availability
Security – Patch compliance, intrusion attempts, MTTD, MTTR
Network – Bandwidth consumption, latency, throughput and availability
Storage – Capacity, IOPS, latency
Hardware – CPU, memory, disk usage

This was fairly simple to do, since most of this data came from systems inside the corporate firewall: systems that IT fully controlled and that changed infrequently.

Now enterprises run IT in the cloud while also maintaining internal data centers, and they may be running multiple cloud environments. Employees are fully engaged in shadow IT, which means that anywhere from 20% to 50% of data and applications are flying under the radar, unmanaged and unmonitored for performance and security risks. There is also far more data to discover and analyze.

Fortunately, new technology solutions have been introduced that apply machine learning to help control the chaos of dynamic and highly distributed IT assets. Modern infrastructure monitoring systems can look across many environments, technologies and toolsets to pinpoint critical problems and their root causes much faster.

The advent of machine-learning monitoring technologies

The industry started to see machine learning applied to monitoring around 2016, according to Nancy Gohring, Senior Research Analyst with 451 Research. In the video below, Gohring discusses the rise of intelligent monitoring as a core characteristic of the modern IT operations organization.

The 451 Research report from 2019, “The ITOps Evolution,” discusses the four stages of IT Operations: Traditional, Maturing, Modern and Future. As organizations progress to the Modern phase, adding AI and machine learning capabilities to their IT operations platform is an important enabler.

In those early days, the user experience for this type of software, often called AIOps, was not great, Gohring says. “It was difficult. The user had to choose the metric and sometimes even the algorithm for the machine learning to work with, and it was rife with error.” In the next phase of AI-driven software development, vendors applied machine learning to analyze all of their operational metrics. But this just created more alerts and information overload.

The picture finally became clearer when tools began grouping alert data into severity categories, such as Critical, Major or Minor, and connecting them to context-rich incidents. The latest step has been causation, Gohring says: “The systems began to look at things like topology to make sure there were correlations in the data other than time.”
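To make the idea of correlating on topology as well as time more concrete, here is a minimal Python sketch. It is an illustration only, not how any particular vendor's product works: the Alert fields, the TOPOLOGY map, the resource names and the 60-second window are all assumptions made up for this example. Alerts join an existing incident only when they are close in time and their resources are the same or directly connected.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    resource: str
    severity: str     # "Critical", "Major" or "Minor"
    timestamp: float  # seconds since some reference point

# Hypothetical topology: each resource maps to the resources it depends on.
TOPOLOGY = {
    "web-01": {"db-01"},
    "db-01": {"san-01"},
    "san-01": set(),
    "mail-01": set(),
}

SEVERITY_RANK = {"Critical": 0, "Major": 1, "Minor": 2}

def related(a, b, topology):
    """True if one resource directly depends on the other."""
    return b in topology.get(a, set()) or a in topology.get(b, set())

def group_into_incidents(alerts, topology, window=60.0):
    """Greedily attach each alert to the first incident that contains an
    alert close in time AND on the same or a directly connected resource."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for incident in incidents:
            if any(
                abs(alert.timestamp - other.timestamp) <= window
                and (alert.resource == other.resource
                     or related(alert.resource, other.resource, topology))
                for other in incident
            ):
                incident.append(alert)
                break
        else:
            # No existing incident matched, so start a new one.
            incidents.append([alert])
    return incidents

def incident_severity(incident):
    """An incident takes the most severe category among its alerts."""
    return min((a.severity for a in incident), key=SEVERITY_RANK.__getitem__)
```

In this sketch, alerts on san-01, db-01 and web-01 that arrive within the window would land in one incident because the resources form a dependency chain, while a simultaneous alert on the unconnected mail-01 would start its own. The greedy pass never merges two incidents after the fact, which a real tool would likely need to handle.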

“Intelligent monitoring tools will be relied upon to accurately pinpoint the cause of issues and then kick off the right automation to fix them,”
- Nancy Gohring, Senior Analyst, Application and Infrastructure Performance, 451 Research

Intelligent monitoring: a journey toward IT operations excellence

Being proactive is one of the top characteristics of high-performing IT operations organizations, according to a report by the Digital Enterprise Journal; the research states that 79% of top-performing organizations can detect performance issues proactively compared to 39% for other organizations.

Artificial intelligence and intelligent monitoring can help drive this by using algorithms or inference models to correlate, deduplicate or suppress alerts that share a point of origin or behave similarly. With intelligent infrastructure monitoring, organizations can filter machine data much faster, so duplicates are eliminated and users receive only the valid alerts.
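As a rough illustration of deduplication and suppression, the Python sketch below uses simple rules in place of the algorithms or inference models a real product would apply. The alert fields (resource, metric), the parent_of map and the resource names are hypothetical and chosen only for this example.

```python
def deduplicate(alerts):
    """Collapse alerts that share a fingerprint (resource, metric)
    into a single alert carrying an occurrence count."""
    merged = {}
    for alert in alerts:
        key = (alert["resource"], alert["metric"])
        if key in merged:
            merged[key]["count"] += 1
        else:
            merged[key] = {**alert, "count": 1}
    return list(merged.values())

def suppress(alerts, down_parents, parent_of):
    """Drop alerts whose parent resource is already known to be down,
    since they most likely share the parent's point of origin."""
    return [
        a for a in alerts
        if parent_of.get(a["resource"]) not in down_parents
    ]

# Hypothetical input: a switch outage plus the noise it causes downstream.
raw = [
    {"resource": "switch-01", "metric": "ping"},
    {"resource": "vm-01", "metric": "ping"},
    {"resource": "vm-01", "metric": "ping"},
    {"resource": "vm-02", "metric": "cpu"},
]
parent_of = {"vm-01": "switch-01"}

unique = deduplicate(raw)
actionable = suppress(unique, down_parents={"switch-01"}, parent_of=parent_of)
```

Here the two identical vm-01 alerts collapse into one, and that alert is then suppressed because its parent switch is already down. The operator is left with the switch outage and the unrelated vm-02 CPU alert.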

Operators can also prioritize incident resolution if the alert data is connected to specific business services, such as the ecommerce site. A newer development in the space is the notion of collective intelligence, says Gohring, in which SaaS vendors monitor data across their customer base to spot patterns that can inform more accurate suggestions by the system.
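Connecting alert data to business services can be as simple as a lookup, as the short Python sketch below suggests. The service names, priority numbers and incident fields are invented for illustration; in practice this mapping would come from a service catalog or CMDB rather than a hard-coded dictionary.

```python
# Hypothetical mapping from resources to the business services they support.
SERVICE_OF = {
    "web-01": "ecommerce site",
    "db-01": "ecommerce site",
    "wiki-01": "internal wiki",
}

# Hypothetical priorities: a lower number means fix it sooner.
SERVICE_PRIORITY = {"ecommerce site": 1, "internal wiki": 3}
DEFAULT_PRIORITY = 5  # resources not tied to any known service

def prioritize(incidents):
    """Order incidents so those affecting the most important
    business services come first."""
    def priority(incident):
        service = SERVICE_OF.get(incident["resource"])
        return SERVICE_PRIORITY.get(service, DEFAULT_PRIORITY)
    return sorted(incidents, key=priority)
```

With this mapping, an incident on db-01 would move ahead of one on wiki-01, because db-01 supports the ecommerce site. Incidents on unmapped resources fall to the back of the queue.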

The ROI from intelligent infrastructure monitoring depends upon advances in our understanding of automation best practices in IT environments. “Intelligent monitoring tools will be relied upon to accurately pinpoint the cause of issues and then kick off the right automation to fix them,” Gohring says.