Incident Resolution for Remote Teams

This article originally appeared on DevOps.

In this article:

The ad hoc, reactive fire drill process of traditional incident management is no longer viable in today’s remote work landscape.
Adopting a problem-first approach allows IT to solve the most pressing user issues first.
Success involves better inter-departmental coordination and a new approach to metrics.

People working in IT support and incident management right now are faced with unusual difficulties supporting large remote workforces and managing unpredictable workloads. On Reddit, system admins and other IT pros are bemoaning the hiccups and hassles of working in isolation while trying to resolve issues and maintain high SLAs. You can’t go grab your indispensable SME for troubleshooting, because that person is also home and inundated with messages and alerts from many different tools.

Networks, VPNs and servers are being hammered extra hard right now by higher usage of SaaS apps and collaboration tools. User expectations are even higher than normal: The home-based IT environment is a lifeline to whatever remains of human interaction. IT organizations are expected to make applications and services work flawlessly, overnight. Undoubtedly, IT pros are working diligently to keep the wheels of work moving for their employers during a time of crisis. Yet working longer hours isn’t the answer. We need new ways of working, ones which may prove better than the current status quo for the long term.

In many organizations, the process of incident management and resolution begins with people looking at screens full of metrics. You see a spike in something, such as network I/O, and immediately begin to investigate it, assuming the worst. A worse scenario is when users call out problems on Twitter, Reddit or other popular tech communities, creating a damaging PR situation which IT and the business must swiftly respond to and manage. This ad hoc, reactive fire drill takes time, and the issue it chases may not be the most important one of the day. In software development, it’s akin to the bad practice of designing the architecture of a system first and then the user experience.

Given everything that’s at stake in the foreseeable future of stay-at-home mandates, SREs and IT support technicians need to flip the incident management process. Rather than beginning with the metrics and working backwards to uncover the problem, if there even is one, start by looking for the most important problems to solve. Then, discover the resources and components which support that end-user process and finally, analyze the metrics which are most relevant.

For example:

A B-to-C provider of a streaming video service is now seeing a 200% increase in daytime traffic. This is resulting in sluggish load times and glitches for users.
IT has tagged the components beneath that service, so IT operators know exactly which VMs, databases and load balancers come into play.
Next, the operations team can identify the most relevant metrics, for instance CPU idle time, to measure and take steps to correct.
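
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component names, the "video-streaming" tag and the metric names are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    kind: str          # e.g. "vm", "database", "load_balancer"
    tags: tuple = ()   # services this component supports


# Hypothetical mapping from component kind to the metrics worth checking first.
METRICS_BY_KIND = {
    "vm": ["cpu_idle_pct", "memory_used_pct"],
    "database": ["query_latency_ms", "connections_in_use"],
    "load_balancer": ["requests_per_sec", "backend_5xx_rate"],
}


def metrics_to_check(service_tag, inventory):
    """Start from the affected service, find its tagged components,
    then list the metrics relevant to each one."""
    plan = {}
    for component in inventory:
        if service_tag in component.tags:
            plan[component.name] = METRICS_BY_KIND.get(component.kind, [])
    return plan


# Made-up inventory for illustration only.
inventory = [
    Component("vm-01", "vm", ("video-streaming",)),
    Component("db-01", "database", ("video-streaming", "billing")),
    Component("lb-01", "load_balancer", ("video-streaming",)),
    Component("vm-02", "vm", ("billing",)),
]

plan = metrics_to_check("video-streaming", inventory)
for name, metrics in plan.items():
    print(name, metrics)
```

The point of the design is the direction of travel: the lookup begins with the user-facing service and ends at the metrics, so components that do not support the affected service (vm-02 here) never enter the investigation.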

The Problem-First Approach

It sounds so simple: flip the process. Yet it is not. IT operators and SREs are not accustomed to thinking in terms of which problems to solve or prevent. Being proactive or predictive hasn’t been the common workflow.

Here is how IT operations and DevOps teams can take a proactive approach to managing the availability and health of enterprise services.

Coordinate Across Departments

Identifying the most critical problems to solve right now is the hardest step and the one which will take the longest. Common problems these days are related to cost and scale: How do I right-size my environment to reduce costs or handle more workloads? Broaden your field of inputs, then narrow down to a specific issue, such as slow page loads on the financial reporting website. Product managers, account managers and business unit leads who are the closest to the customer experience can deliver feedback on the top issues affecting customer/user satisfaction. Teams should also review recent support tickets to identify common pain points.
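
As a rough sketch of the ticket review, the snippet below counts labels across recent tickets to surface recurring themes. The ticket structure and label names are hypothetical; a real export from a ticketing system would need its own parsing.

```python
from collections import Counter

# Made-up tickets for illustration; "labels" is an assumed field name.
tickets = [
    {"id": 101, "labels": ["slow-page-load", "financial-reporting"]},
    {"id": 102, "labels": ["vpn-drop"]},
    {"id": 103, "labels": ["slow-page-load"]},
    {"id": 104, "labels": ["slow-page-load", "vpn-drop"]},
]


def top_themes(tickets, n=2):
    """Return the n most frequent labels with their counts."""
    counts = Counter(label for t in tickets for label in t["labels"])
    return counts.most_common(n)


themes = top_themes(tickets)
print(themes)
```

A count like this is only a starting point for the conversation with product and account managers, not a replacement for it.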

Fine-Tune Your Approach to Metrics

The Utilization, Saturation and Errors (USE) Method is one way to approach the problem-first incident management process. As detailed by Brendan Gregg, a senior performance architect at Netflix, this methodology begins by posing questions, seeking answers and working backwards to the metrics. For each resource that you want to measure, identify three metrics: one for utilization, one for saturation and one for errors.

“The USE Method has made you aware of what you didn’t check: What were once unknown-unknowns are now known-unknowns,” Gregg explained.
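
A minimal sketch of a USE-style checklist might look like the following. The resource names, sample values and thresholds are assumptions chosen for illustration; they are not values prescribed by the USE Method.

```python
from dataclasses import dataclass


@dataclass
class UseCheck:
    resource: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: float   # amount of queued or waiting work
    errors: int         # error events seen in the sampling window


def findings(check, util_limit=0.8, sat_limit=0.0):
    """Flag which of the three USE dimensions needs attention.
    util_limit and sat_limit are hypothetical thresholds."""
    issues = []
    if check.errors > 0:
        issues.append("errors")
    if check.saturation > sat_limit:
        issues.append("saturation")
    if check.utilization >= util_limit:
        issues.append("utilization")
    return issues


# Made-up readings for three resources.
checks = [
    UseCheck("cpu", utilization=0.93, saturation=4.0, errors=0),
    UseCheck("disk", utilization=0.35, saturation=0.0, errors=0),
    UseCheck("network", utilization=0.60, saturation=0.0, errors=12),
]

report = {c.resource: findings(c) for c in checks}
for resource, issues in report.items():
    print(resource, issues or "ok")
```

Because every resource gets the same three questions, a resource with no entry in the report is visibly unchecked rather than silently assumed healthy.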

Standardize Workflows

Create a common process for incident management across all your teams. Without the advantage of having almost everyone in the same room to huddle together in an ad hoc fashion when big issues crop up, it’s imperative to institute clear steps and roles. Doing so will prevent the frustrations, confusion and oversights that needlessly delay resolution. Since most incidents are composed of multiple contributing factors, teams need to adopt a small number of user-friendly tools to document and organize the information. At our company, we’re now using tools such as Miro, an online whiteboard application, to replace our physical whiteboarding sessions. Of course, there’s also Zoom, Slack, Jira and a host of other cloud-based tools already in place at many organizations. Mandate which tools everyone should use, with some guidelines on how to use them.

Increase Automation

In some organizations, scaling requirements in response to demand have increased tenfold. Automation is playing a critical role now: moving routine tasks away from a web GUI and into scripts and configuration code, for example, is more scalable and aligns with modern tools such as Chef and Puppet. User tickets can be auto-generated, for instance, from emails and linked to code management systems like GitHub. Modern development and operations teams are also expanding automation in unit testing and provisioning.
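
To illustrate generating a ticket from an email, here is a small sketch using Python's standard email package. The ticket fields (title, reporter, description, status) are hypothetical, and pushing the result into a tracker or linking it to a repository is left out because that depends on the tool in use.

```python
from email import message_from_string
from email.policy import default


def ticket_from_email(raw):
    """Parse a raw email and return a dict with assumed ticket fields."""
    msg = message_from_string(raw, policy=default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content().strip() if body is not None else ""
    return {
        "title": str(msg["Subject"] or "(no subject)"),
        "reporter": str(msg["From"] or ""),
        "description": text,
        "status": "open",
    }


# Made-up message for illustration.
raw_email = (
    "From: user@example.com\n"
    "To: support@example.com\n"
    "Subject: Video player buffering\n"
    "\n"
    "Playback stalls every few seconds.\n"
)

ticket = ticket_from_email(raw_email)
print(ticket)
```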

Watch for Burnout

Whether because there’s more work or because of a need to fill the hours during long quarantine days, many software engineers, testers and architects are working longer days right now. Yet exhaustion and burnout can lead to errors and oversights along with low morale. It’s up to managers to make sure that employees are taking breaks, working reasonable days and having the time and energy to attend to personal needs.