An outage is simply an event during which your IT services are not available to all or some of your users. Recently, Amazon Web Services experienced an outage in its US-East region datacenter that took more than six hours to resolve. Close to 35 AWS services were affected during the outage, and prominent AWS customers like Buffer, IMDb, and Tinder all experienced service outages.
Let’s face it: outages are a fact of life in IT! IT systems are complex things with many moving parts, each of which can fail. Although an outage can have many underlying root causes, those root causes follow common, distinct patterns. You can think of these patterns as the “anatomy of the beast” behind outages.
Understanding the anatomy can help IT teams better prepare for, predict, prevent, detect, diagnose, and recover from outages. In this blog post, we will deconstruct the anatomy of an IT outage by detailing each pattern and describing its typical symptoms. In subsequent posts, we’ll discuss how IT teams can prepare for and respond to each outage pattern from a process and a technology standpoint.
Usually, there is a single point of failure associated with an outage: a database query that is running extremely slowly, a newly introduced software bug that brings down an application node, or a network switch failure. Outages resulting from multiple points of failure are extremely rare, if not non-existent, in practice. This is good news for IT practitioners, as the common case is tractable! There are three common patterns behind single-point-of-failure outages:
Outages Due To Capacity Constraints. The root cause behind capacity constraints is that your IT system is experiencing more demand for a resource than the amount of that resource you’ve provisioned. Users experience these outages as performance slowdowns that grow worse over time until the application becomes unusable. Capacity constraints can occur at any level in your IT stack:
- Servers that need more CPU or memory.
- Network links that are congested.
- Storage arrays that need more disk space.
- Databases that need more table space.
- Application connection pools that are too small.
- Applications that have too few threads to process work in parallel.
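To make the "demand versus provisioned amount" idea concrete, here is a minimal sketch of a capacity check. It assumes you already collect demand samples for a resource at equal intervals; the `Resource` type, its field names, and the straight-line projection are illustrative assumptions on our part, not a description of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    name: str             # hypothetical label, e.g. "db-tablespace"
    provisioned: float    # amount provisioned, in any consistent unit
    samples: List[float]  # observed demand at equal intervals, oldest first


def utilization(resource: Resource) -> float:
    """Latest observed demand as a fraction of provisioned capacity."""
    return resource.samples[-1] / resource.provisioned


def intervals_until_exhausted(resource: Resource) -> Optional[float]:
    """Estimate how many more sampling intervals remain before demand
    reaches provisioned capacity, using a straight line through the
    first and last samples (an assumed, deliberately crude model).
    Returns None when demand is flat or falling, or data is too sparse."""
    if len(resource.samples) < 2:
        return None
    growth = (resource.samples[-1] - resource.samples[0]) / (len(resource.samples) - 1)
    if growth <= 0:
        return None
    headroom = resource.provisioned - resource.samples[-1]
    return max(headroom, 0.0) / growth
```

A real forecast would likely need something better than a straight line (seasonality, bursts), but even this rough estimate shows how a slowly worsening capacity problem can be spotted before it becomes an outage.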
Outages Due To Component Failures. Component failure outages are due to a software or hardware component that’s not working as expected. The root causes behind hardware failures are varied, but their manifestation typically appears deterministic:
- A disk that has stopped responding.
- Memory modules that are self-reporting bad segments.
- A network interface on a switch that has gone cold.
- A UPS battery that is fully discharged.
The root causes behind software failures are also varied, but can be broadly seen as bugs in the code or flaws in software design. Software failures include both outright failures (“core dumps”, “stack traces”, or their equivalents) and extreme performance degradations that amount to a failure. These failures are “less deterministic” in practice because the failure may depend on a specific sequence of inputs:
- API calls that take extremely long, but only for specific inputs.
- UI screens that blank out only when the user navigates certain workflows.
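Because input-dependent failures hide inside healthy-looking averages, it can help to group measurements by input. The sketch below assumes you have a log of `(input_key, seconds)` pairs for an API; the function name `slow_inputs` and the "10x the typical latency" threshold are hypothetical choices for illustration only.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def slow_inputs(calls: Iterable[Tuple[str, float]], factor: float = 10.0) -> List[str]:
    """Return the input keys whose median latency exceeds `factor` times
    the median latency across all inputs (an assumed rule of thumb)."""
    by_input: Dict[str, List[float]] = defaultdict(list)
    for key, seconds in calls:
        by_input[key].append(seconds)
    if not by_input:
        return []
    # One representative latency per input, so a heavily used input
    # does not dominate the baseline.
    per_input = {key: median(values) for key, values in by_input.items()}
    baseline = median(per_input.values())
    return sorted(key for key, value in per_input.items() if value > factor * baseline)
```

Grouping by input is the point here: a single aggregate latency figure would mix the slow inputs in with the fast ones and could mask the failure entirely.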
Outages Due To Human Error. Human error failures result from software or hardware that’s not been configured or deployed correctly. These outages are related to human factors and are quite distinct from capacity-related or component-related outages. Here are some examples:
- Software deployed on a version of an OS that was known to be incompatible.
- Hardware deployed on an older version of firmware with known issues.
- An application that was configured to connect to the wrong database when it was deployed.
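Mistakes like these can often be caught by a simple check before deployment. The following is a minimal sketch under stated assumptions: the `Deployment` and `Policy` types, their fields, and the idea of keeping a list of known-incompatible OS versions and allowed database hosts per environment are all hypothetical, invented here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Deployment:
    environment: str    # e.g. "production" (hypothetical label)
    os_version: str
    database_host: str


@dataclass
class Policy:
    # OS versions known to be incompatible with the software.
    incompatible_os: Set[str] = field(default_factory=set)
    # Database hosts each environment is allowed to connect to.
    allowed_db_hosts: Dict[str, Set[str]] = field(default_factory=dict)


def check(deployment: Deployment, policy: Policy) -> List[str]:
    """Return a list of human-readable problems; empty means no known issue."""
    problems: List[str] = []
    if deployment.os_version in policy.incompatible_os:
        problems.append(f"OS version {deployment.os_version} is known to be incompatible")
    allowed = policy.allowed_db_hosts.get(deployment.environment, set())
    if deployment.database_host not in allowed:
        problems.append(
            f"database host {deployment.database_host} is not allowed "
            f"for environment {deployment.environment}"
        )
    return problems
```

A check like this only catches the mistakes someone has already written down, so it complements rather than replaces review and process.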
Given the three common outage patterns, IT practitioners can design processes and tools to minimize business impact from outages. The outage classification table below can help predict, detect, diagnose, and prevent outages across your IT environment:
|Type of Outage|Approaches to Predict|Approaches to Detect|Approaches to Diagnose|Approaches to Prevent|
|---|---|---|---|---|
|Capacity Related Outages| | | | |
|Component Failure Related Outages| | | | |
|Human Error Related Outages| | | | |