Book description
In 2016, Google's Site Reliability Engineering book ignited an industry discussion on what it means to run production services today, and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.
This new workbook not only combines practical examples from Google's experiences, but also provides case studies from Google's Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn't.
Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.
You'll learn:
- How to run reliable services in environments you don't completely control, like cloud
- Practical applications of how to create, monitor, and run your services via Service Level Objectives
- How to convert existing ops teams to SRE, including how to dig out of operational overload
- Methods for starting SRE from either greenfield or brownfield
Table of contents
- Foreword I
- Foreword II
- Preface
- 1. How SRE Relates to DevOps
- I. Foundations
- 2. Implementing SLOs
- 3. SLO Engineering Case Studies
- 4. Monitoring
- 5. Alerting on SLOs
- 6. Eliminating Toil
- What Is Toil?
- Measuring Toil
- Toil Taxonomy
- Toil Management Strategies
- Identify and Measure Toil
- Engineer Toil Out of the System
- Reject the Toil
- Use SLOs to Reduce Toil
- Start with Human-Backed Interfaces
- Provide Self-Service Methods
- Get Support from Management and Colleagues
- Promote Toil Reduction as a Feature
- Start Small and Then Improve
- Increase Uniformity
- Assess Risk Within Automation
- Automate Toil Response
- Use Open Source and Third-Party Tools
- Use Feedback to Improve
- Case Studies
- Case Study 1: Reducing Toil in the Datacenter with Automation
- Case Study 2: Decommissioning Filer-Backed Home Directories
- Conclusion
- 7. Simplicity
- II. Practices
- 8. On-Call
- 9. Incident Response
- 10. Postmortem Culture: Learning from Failure
- 11. Managing Load
- 12. Introducing Non-Abstract Large System Design
- 13. Data Processing Pipelines
- 14. Configuration Design and Best Practices
- 15. Configuration Specifics
- Configuration-Induced Toil
- Reducing Configuration-Induced Toil
- Critical Properties and Pitfalls of Configuration Systems
- Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem
- Pitfall 2: Designing Accidental or Ad Hoc Language Features
- Pitfall 3: Building Too Much Domain-Specific Optimization
- Pitfall 4: Interleaving “Configuration Evaluation” with “Side Effects”
- Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua
- Integrating a Configuration Language
- Integrating an Existing Application: Kubernetes
- Integrating Custom Applications (In-House Software)
- Effectively Operating a Configuration System
- When to Evaluate Configuration
- Guarding Against Abusive Configuration
- Conclusion
- 16. Canarying Releases
- Release Engineering Principles
- Balancing Release Velocity and Reliability
- What Is Canarying?
- Release Engineering and Canarying
- A Roll Forward Deployment Versus a Simple Canary Deployment
- Canary Implementation
- Selecting and Evaluating Metrics
- Dependencies and Isolation
- Canarying in Noninteractive Systems
- Requirements on Monitoring Data
- Related Concepts
- Conclusion
- III. Processes
- 17. Identifying and Recovering from Overload
- 18. SRE Engagement Model
- 19. SRE: Reaching Beyond Your Walls
- Truths We Hold to Be Self-Evident
- Reliability Is the Most Important Feature
- Your Users, Not Your Monitoring, Decide Your Reliability
- If You Run a Platform, Then Reliability Is a Partnership
- Everything Important Eventually Becomes a Platform
- When Your Customers Have a Hard Time, You Have to Slow Down
- You Will Need to Practice SRE with Your Customers
- How to: SRE with Your Customers
- Conclusion
- 20. SRE Team Lifecycles
- 21. Organizational Change Management in SRE
- Conclusion
- A. Example SLO Document
- B. Example Error Budget Policy
- C. Results of Postmortem Analysis
- Index
Product information
- Title: The Site Reliability Workbook
- Author(s):
- Release date: July 2018
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781492029502