The following first appeared in Cloud Tech News.

In early June 2019, Google Cloud suffered a cascading set of faults that rendered multiple service regions unavailable for a number of hours.

A cloud outage by itself isn’t unprecedented; what made this one significant was the way it propagated through the very software that was designed to contain it. Moreover, engineers’ initial attempts to correct the issue were thwarted by the failure of that same software architecture. It was the interconnectedness and interdependencies of Google’s management components that turned a routine fault into a multi-region outage.

The Outage

To understand the situation more fully, here is a short summary of what happened. A maintenance event that would normally be routine triggered a reaction in GCP’s network control plane, exacerbated by a fault in that code which allowed it to stop other jobs elsewhere in Google’s infrastructure.

The distributed nature of a cloud platform means that clusters in one area are designed to be maintained independently of clusters in another; but if those management processes leak across regions, a failure spreads like a virus. And because these controllers are responsible for routing traffic throughout the cloud, as more of them were shut down, network capacity became that much more constrained, leading to still more failures. In technical terms:

  • Network control-plane jobs were stopped in multiple clusters across multiple regions at the same time
  • Packet loss and error rates increased simultaneously across multiple regions
  • As a result, key network routes became unavailable, increasing network latency for certain services
  • Tooling to troubleshoot the issue became unusable because tooling traffic competed with service traffic

Once the root cause was identified, remediating the failures required unraveling the management paths, taking the right processes offline in the right order, and applying the necessary fixes:

  • Google engineers first disabled the automation tool responsible for performing maintenance jobs
  • Server cluster management software was updated to stop accepting risky requests that could affect other clusters
  • Updates were made to store configurations locally, reducing recovery times by avoiding the latency caused by rebuilding systems automatically
  • Network fail-static conditions were extended to allow engineers more time to mitigate errors
  • Tooling was improved to communicate status to impacted customers even through network congestion

Ultimately, the fault did not lie in Google’s willingness or ability to address issues, but in a systemic problem with how the platform reacted to unforeseen events. In a real-time computing environment, there is no safe window during which management systems can be offline, and certainly none in which the system needed to apply a fix is itself broken by the very problem it is meant to fix.

So, what does this teach us about modern IT? It reinforces the case that operations need to be centralised across footprints and infrastructure, with a bias towards platforms over point tools. Here are three specific features you want in a modern IT operations strategy:

  1. A global view for local action. Having a global view of a distributed infrastructure is critical. Addressing issues in isolation risks propagating faults into areas not currently under test. You need points of observation across all regions simultaneously so you can aggregate management data and run unified analysis, rather than compounding disparate events. This also helps you build a global service performance model by understanding activity throughout your infrastructure. You must also consider how management tasks are carried out when systems become unresponsive: protect tooling traffic from service traffic, and don’t let operational data and system status/events share network capacity with the services they are meant to manage (a sketch of this kind of cross-region aggregation follows this list).
  2. The ability to see impact, impactfully. Your enterprise should design a global impact model. IT operations personnel need to understand topology-driven impact before making changes - even automated changes. Google's automation did not consult an impact model; it was straight policy-driven automation, which scaled the mistake automatically. It’s equally important to understand service dependencies via topology mapping, so that impact analysis takes into account the cascading effects of policy-driven automation (see the dependency-graph sketch after this list).
  3. Configuration stores you can count on. Lastly, “backing up the backups” by storing configuration data locally instead of relying on a distributed hierarchy can reduce service restoration time. Retrieving this data over the network during a fault increases recovery latency, because competing traffic restricts the bandwidth available for management tasks (a last-known-good cache sketch follows this list).
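
To make the first point concrete, here is a minimal sketch of polling every region at once and folding the results into a single view. The region names, endpoints, and health fields are hypothetical, and in practice the polling would run over a management channel kept separate from service traffic.

    # Minimal sketch: poll each region's management endpoint in parallel and
    # build one aggregated view. Region names and the /health URLs are
    # hypothetical placeholders, not a real GCP or vendor API.
    import concurrent.futures
    import json
    import urllib.request

    # Hypothetical out-of-band management endpoints, one per region.
    REGION_ENDPOINTS = {
        "region-a": "http://mgmt.region-a.internal/health",
        "region-b": "http://mgmt.region-b.internal/health",
        "region-c": "http://mgmt.region-c.internal/health",
    }

    def poll_region(name, url, timeout=2.0):
        """Fetch one region's health document; degrade gracefully on failure."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return name, json.loads(resp.read())
        except Exception as exc:  # unreachable regions still appear in the view
            return name, {"status": "unreachable", "error": str(exc)}

    def global_view():
        """Aggregate all regions into one dictionary for unified analysis."""
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(REGION_ENDPOINTS)) as pool:
            return dict(pool.map(lambda item: poll_region(*item), REGION_ENDPOINTS.items()))

    if __name__ == "__main__":
        view = global_view()
        degraded = [r for r, doc in view.items() if doc.get("status") != "ok"]
        print(f"{len(degraded)} of {len(view)} regions degraded: {degraded}")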
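
For the second point, a small dependency graph is enough to show what topology-driven impact analysis looks like before automation is allowed to act. The topology, component names, and the one-region threshold below are invented purely for illustration.

    # Minimal sketch: model service dependencies as a graph and compute the
    # "blast radius" of a change before policy-driven automation acts on it.
    from collections import defaultdict, deque

    # Edges point from a component to the services that depend on it.
    # This topology is invented for illustration.
    DEPENDENTS = defaultdict(list, {
        "net-control-plane/us-east": ["lb/us-east", "storage-api/us-east"],
        "lb/us-east": ["frontend/us-east"],
        "net-control-plane/us-west": ["lb/us-west"],
        "lb/us-west": ["frontend/us-west"],
    })

    def blast_radius(component):
        """Everything transitively affected if `component` is taken offline."""
        seen, queue = set(), deque([component])
        while queue:
            for dep in DEPENDENTS[queue.popleft()]:
                if dep not in seen:
                    seen.add(dep)
                    queue.append(dep)
        return seen

    def safe_to_automate(components, max_regions=1):
        """Block automated changes whose impact spans more than one region."""
        impacted = set(components).union(*(blast_radius(c) for c in components))
        regions = {name.rsplit("/", 1)[-1] for name in impacted}
        return len(regions) <= max_regions, impacted

    if __name__ == "__main__":
        ok, impacted = safe_to_automate(
            ["net-control-plane/us-east", "net-control-plane/us-west"]
        )
        print("proceed" if ok else f"needs human review, impact: {sorted(impacted)}")

A check this simple is only a stand-in for a real topology service, but it captures the idea: the automation refuses to act on its own the moment a change’s impact crosses a regional boundary.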
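
And for the third point, here is a sketch of a last-known-good local configuration cache: the central store is preferred when it is reachable, but restart and recovery never wait on the network. The paths and the stand-in fetch function are placeholders, not any particular product’s API.

    # Minimal sketch: keep a local, last-known-good copy of configuration so a
    # component can restart without waiting on a congested network.
    import json
    import os
    import tempfile

    # Stand-in for a path on local disk; a real service would pick its own location.
    LOCAL_CACHE = os.path.join(tempfile.gettempdir(), "myservice-config.json")

    def fetch_from_central_store():
        """Placeholder for a call to a central or distributed config service."""
        raise TimeoutError("central store unreachable during the incident")

    def save_local_copy(config, path=LOCAL_CACHE):
        """Write atomically so a crash never leaves a half-written cache."""
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
        with os.fdopen(fd, "w") as f:
            json.dump(config, f)
        os.replace(tmp, path)

    def load_config(path=LOCAL_CACHE):
        """Prefer the central store, but fall back to the local copy."""
        try:
            config = fetch_from_central_store()
            save_local_copy(config, path)
            return config
        except Exception:
            with open(path) as f:  # recovery does not depend on the network
                return json.load(f)

    if __name__ == "__main__":
        save_local_copy({"routes": "last-known-good"})  # seed the cache
        print(load_config())  # served locally while the central store is down

The design choice is simply that the cached copy is always good enough to restart with, and refreshing it from the central store is best-effort.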

This outage was essentially a gift to enterprise-class IT operations teams everywhere who think they are prepared for the inevitable service disruption. It has taught us all the value of building an IT ops management strategy that includes a singular, global view of dependencies and impacts. Every business thinks it wants to become Google. But this is one way you don't.
