How does Google serve up 3.5 billion searches every day and 1.2 trillion searches every year worldwide? If you’ve wondered what it takes, Site Reliability Engineering: How Google Runs Production Systems has all the answers. The book explains what it takes to operate hyperscale infrastructure for a global audience.
Google’s secret weapon for world domination is Site Reliability Engineering. While you might think that SRE is a fancy label for IT Ops, nothing is further from the truth. Benjamin Treynor Sloss heads a team of 4,000 Site Reliability Engineers at Google. Treynor (who has a software engineering background) established the Google SRE team in 2003.
Instead of recruiting traditional datacenter experts, Treynor hired software engineers with a strong background in systems engineering. SREs come from the same pool of engineers who write code for Google's services. Along with programming, SREs have deep knowledge of “UNIX system internals and networking (Layer 1 to Layer 3)”.
So who are SREs? Site Reliability Engineers are “focused on finding ways to improve the design and operation of systems to make them more scalable, more reliable, and more efficient.” SREs ensure Google's applications are available, reliable, and deliver the right experience for users.
Google's Site Reliability Engineering book offers valuable insights for enterprise IT teams. Our infographic shows how you can build reliable, scalable, and agile IT infrastructure, just like Google.
01. Unique Approach To Hiring
Google looks for a unique mix of software engineering and systems engineering expertise when hiring SREs. SREs are “the world’s most intense pit crew” for resolving complex IT infrastructure problems.
SREs spend at least 50% of their time on development, and ops work is capped at no more than 50% of their time. The emphasis on engineering enables SREs to build systems for managing large, complicated services.
02. Embrace Error Budgets
Product teams focus on delivering the latest features to users. IT operations teams want to deliver a reliable service. To resolve this conflict, Google uses error budgets to strike a balance between innovation and reliability.
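Product managers define the Service Level Objective (SLO), which is the target uptime for a particular service. The gap between that target and 100% is the error budget: the amount of unreliability the service is allowed to accumulate. If the measured uptime is above the SLO, product managers still have error budget to spend on launching new features. If the measured uptime falls below the SLO, the budget is spent and no features are introduced until system reliability improves.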
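The error-budget rule above comes down to a little arithmetic. The Python sketch below is illustrative only and is not Google's tooling: the function names and the example figures (a 99.9% SLO) are assumptions made up for this example. It treats the budget as the gap between the SLO and 100%, and allows launches only while measured uptime stays above the SLO.

```python
def error_budget(slo: float) -> float:
    """Total unreliability allowed by the SLO, as a fraction (hypothetical helper)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return 1.0 - slo


def budget_remaining(slo: float, measured_uptime: float) -> float:
    """Unspent error budget; a negative value means the budget is overspent."""
    allowed_downtime = error_budget(slo)
    actual_downtime = 1.0 - measured_uptime
    return allowed_downtime - actual_downtime


def can_launch(slo: float, measured_uptime: float) -> bool:
    """Allow feature launches only while measured uptime is above the SLO."""
    return measured_uptime > slo


if __name__ == "__main__":
    slo = 0.999  # assumed example target: 99.9% uptime
    for uptime in (0.9995, 0.998):
        print(uptime, budget_remaining(slo, uptime), can_launch(slo, uptime))
```

How to treat uptime exactly equal to the SLO is a design choice; this sketch blocks launches in that case because no budget is left to spend.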
03. Eliminate Toil
The only way to banish toil when running planet-scale systems is to focus on the engineering part of site reliability engineering. SREs write code to build and operate highly reliable services without adding headcount as those services grow. Engineering work allows Google to avoid repetitive tasks and “invent more and toil less.”
04. Automate Everything
Google has an ideological preference for “using machines to manage machines when possible.” Given the scale at which Google operates, it’s impossible to hand-hold infrastructure operations.
Automation allows SREs to deliver consistency and avoid errors, omissions, and reliability issues. Automated systems also detect and repair problems faster than a person could, which saves engineering time.
05. Blameless Postmortems
Blameless postmortems document the root cause of an outage so that the same failure does not recur. SREs organize activities like Postmortem of the Month, postmortem reading clubs, and Wheel of Misfortune. Teams work through the issues in a constructive environment without any finger-pointing.
06. Reliable Product Launches
How do you launch a product to millions of users without outages? Launch Coordination Engineering (LCE) teams in SRE help Google launch products quickly without sacrificing reliability. LCE teams help products meet “reliability standards and best practices” for a smooth launch.
Conclusion
Site Reliability Engineering is not just for companies operating at Google scale. Every IT practitioner should read the book to learn how Google instilled a remarkable culture across its site operations teams.