In recent months, the major cloud and tech players have been unveiling their latest products and strategies to win enterprise dollars. One trend that’s catching fire is hybrid infrastructure management: the need to support a mix of on-premises, private cloud, and public cloud resources.

Today’s typical enterprise with a hybrid cloud infrastructure has more flexibility than ever to host applications and data where they best suit the business, whether that’s for performance, agility, compliance or cost reasons.

On the downside, hybrid (and multi-cloud) environments are more complex, resulting in:

  • More alerts and events in general;
  • More globally-distributed events, such as clickstreams, transactions and machine-generated events from mobile and IoT devices;
  • More complex events with cross-silo ramifications.

This complexity creates real operational problems. IT monitoring teams struggle with disconnected, technology-specific information that makes it hard to identify and resolve service issues. Manually correlating data across domains is time-consuming, and the delay can mean lost revenue if a major service goes down.

During a typical incident escalation scenario, when support calls the domain experts, few can agree on what’s wrong. Each team has its own take on what’s broken and points the finger elsewhere. As a result, service outages drag on and the business impact grows. A typical problem is cascading alerts: the first alert triggers several others, and the team faces a long investigation to separate the original alert from the noise and find the root cause. Cascading alerts may also signal a cascading failure; without rapid filtering and correlation of event data, that’s a potential business disaster.
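To make the cascading-alert problem concrete, here is a minimal Python sketch of one common correlation idea: if you know which services depend on which, an alerting service whose dependencies are all healthy is a likely starting point. The dependency map and service names are hypothetical, and real platforms use far richer topology and timing data than this.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each service lists the services it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "web": ["api"],
    "api": ["db", "cache"],
    "db": [],
    "cache": [],
}


def probable_root_causes(alerting: Set[str],
                         depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return alerting services that have no alerting dependency upstream."""

    def has_alerting_dependency(service: str) -> bool:
        # Walk direct and transitive dependencies, guarding against cycles.
        seen: Set[str] = set()
        stack = list(depends_on.get(service, []))
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting:
                return True
            stack.extend(depends_on.get(dep, []))
        return False

    return {s for s in alerting if not has_alerting_dependency(s)}


roots = probable_root_causes({"web", "api", "db"}, DEPENDS_ON)
```

With alerts firing on web, api, and db, only db has no alerting dependency, so it is the place to look first. The other two alerts are most likely downstream noise.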

Concurrent alerts create unnecessary noise as well. Let’s say you roll out a code update to different servers in a cluster and the update has a bug that causes alerts on each of the servers. Determining the cause of each server alert takes an inordinate amount of time, even though each alert is caused by the same issue.
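As a rough illustration of how concurrent alerts can be collapsed, the Python sketch below groups alerts whose messages match once host-specific numbers are stripped out. The `Alert` type, the `fingerprint` rule, and the sample messages are assumptions made for this example, not the behavior of any particular monitoring product.

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Alert(NamedTuple):
    host: str
    message: str


def fingerprint(message: str) -> str:
    """Normalize a message so per-host details (PIDs, counters) do not split a group."""
    return re.sub(r"\d+", "<n>", message.lower()).strip()


def group_concurrent(alerts: List[Alert]) -> Dict[str, List[str]]:
    """Map each message fingerprint to the hosts that raised it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for alert in alerts:
        groups[fingerprint(alert.message)].append(alert.host)
    return dict(groups)


# Hypothetical alerts after a buggy rollout to a three-server cluster.
alerts = [
    Alert("app-01", "NullPointerException in checkout (pid 4123)"),
    Alert("app-02", "NullPointerException in checkout (pid 977)"),
    Alert("app-03", "Disk usage at 91%"),
]
groups = group_concurrent(alerts)
```

Here the two checkout exceptions land in one group, so an operator investigates a single issue affecting two hosts instead of two separate alerts.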

Let’s look at how a modern Digital Operations Command Center framework can reduce the extra work for hybrid cloud management.

A typical alert management workflow includes the following:

  1. Correlation of alerts to a common cause;
  2. Triage and prioritization of alerts for resolution;
  3. Integration with IT service management (ITSM) tools for improved problem management, including the creation of incident tickets.
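The triage and ticketing steps can be sketched in a few lines of Python. This is only an illustration: the severity weights, the scoring rule (severity times number of affected services), and the ticket fields are all assumptions, and a real ITSM integration would use that tool's own API and schema.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed severity scale; tune to your own policy.
SEVERITY_WEIGHT = {"critical": 3, "warning": 2, "info": 1}


@dataclass
class CorrelatedIncident:
    cause: str                    # common cause found during correlation
    severity: str                 # one of the SEVERITY_WEIGHT keys
    affected_services: List[str]


def priority(incident: CorrelatedIncident) -> int:
    """Score an incident by severity weight times number of affected services."""
    return SEVERITY_WEIGHT[incident.severity] * len(incident.affected_services)


def triage(incidents: List[CorrelatedIncident]) -> List[CorrelatedIncident]:
    """Order incidents so the highest-priority one is handled first."""
    return sorted(incidents, key=priority, reverse=True)


def to_ticket(incident: CorrelatedIncident) -> Dict[str, object]:
    """Build a generic ticket payload (hypothetical fields, not a real ITSM schema)."""
    return {
        "summary": f"[{incident.severity.upper()}] {incident.cause}",
        "priority": priority(incident),
        "affected_services": incident.affected_services,
    }


incidents = [
    CorrelatedIncident("disk nearly full", "warning", ["reports"]),
    CorrelatedIncident("db primary down", "critical", ["web", "api"]),
]
tickets = [to_ticket(i) for i in triage(incidents)]
```

The database outage scores higher than the disk warning, so its ticket is created first.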

Under the Command Center framework, all these steps are handled automatically in a central location or platform, using AI and policy-based tools that also help determine the root cause(s) of cascading, cross-silo, and concurrent alerts. The modern IT operations system should enable:

  • Instrumentation and integrations for event collection and aggregation, including integration of many existing monitoring tools;
  • The filtering, aggregation and deduplication of alerts;
  • Event correlation and impact analysis without human interaction;
  • Fully-automated first response and auto-remediation where possible;
  • Automated escalation to a human operator when necessary.
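The last two capabilities, automated first response and escalation, amount to a simple policy: try a known fix, and hand off to a person if there is none or it fails. Here is a minimal Python sketch of that policy. The runbook table, the alert type names, and the `restart_service` placeholder are hypothetical; a real remediation would call your orchestration or configuration tooling.

```python
from typing import Callable, Dict, Tuple


def restart_service(target: str) -> bool:
    """Placeholder remediation that always reports success."""
    return True


# Hypothetical policy table: alert type -> automated remediation.
RUNBOOKS: Dict[str, Callable[[str], bool]] = {
    "service_unresponsive": restart_service,
}


def first_response(alert_type: str, target: str,
                   runbooks: Dict[str, Callable[[str], bool]] = RUNBOOKS
                   ) -> Tuple[str, str]:
    """Try the runbook for this alert type; escalate if none exists or it fails."""
    action = runbooks.get(alert_type)
    if action is None:
        return ("escalated", f"no runbook for {alert_type}")
    if action(target):
        return ("remediated", f"{action.__name__} succeeded on {target}")
    return ("escalated", f"{action.__name__} failed on {target}")


handled = first_response("service_unresponsive", "api-1")
handed_off = first_response("disk_failure", "db-1")
```

Keeping the policy in a plain lookup table makes it easy to review which alert types are trusted to automation and which always reach a human.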

Studies have shown that this AI-based approach to alert management can reduce the number of alerts requiring human interaction by 90%. For the alerts that still require human interaction, automated event creation and incident routing, coupled with analytics-driven root cause identification, lower mean time to resolution (MTTR) by more than 50%. This can have a profound impact on a company’s bottom line.
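To see what those percentages could mean in practice, here is a small worked calculation in Python. The starting volume of 1,000 alerts per day and the 60-minute baseline MTTR are made-up inputs for illustration; only the 90% and 50% figures come from the text above, and 50% is treated as a lower bound.

```python
def remaining_after_reduction(baseline: float, reduction_pct: int) -> float:
    """Apply a percentage reduction to a baseline value."""
    return baseline * (100 - reduction_pct) / 100


# Hypothetical baseline for a mid-sized operations team.
alerts_per_day = 1000
baseline_mttr_minutes = 60

alerts_needing_humans = remaining_after_reduction(alerts_per_day, 90)
mttr_upper_bound = remaining_after_reduction(baseline_mttr_minutes, 50)
```

Under these assumptions, operators would handle 100 alerts a day instead of 1,000, and each of those would take at most 30 minutes to resolve instead of 60.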

In my next post, I'll review a Digital Operations Command Center ROI model that may help you accelerate adoption of a modern framework for managing your hybrid cloud infrastructure.
