Real-World SRE

This hands-on survival manual will give you the tools to confidently prepare for and respond to a system outage.

Key Features

  • Proven methods for keeping your website running
  • A survival guide for incident response
  • Written by an ex-Google SRE expert

Book Description

Real-World SRE is the go-to survival guide for the software developer in the middle of a catastrophic website failure. Site Reliability Engineering (SRE) has emerged on the frontline as businesses strive to maximize uptime. This book provides a step-by-step framework to follow when your website is down and the countdown is on to fix it.

Nat Welch has battle-hardened experience in reliability engineering at some of the biggest outage-sensitive companies on the internet. Arm yourself with his tried-and-tested methods for monitoring modern web services, setting up alerts, and evaluating your incident response.

Real-World SRE goes beyond just reacting to disaster: it covers the tools and strategies needed to safely test and release software, plan for long-term growth, and foresee future bottlenecks. It also helps you set up your own robust plan of action to see you through a company-wide website crisis.

The final chapter of Real-World SRE is dedicated to acing SRE interviews, whether you are seeking a first job or a valued promotion.

What you will learn

  • Monitor for approaching catastrophic failure
  • Alert your team to an outage emergency
  • Dissect your incident response strategies
  • Test automation tools and build your own software
  • Predict bottlenecks and fight for user experience
  • Eliminate the competition in an SRE interview

Who this book is for

Real-World SRE is aimed at software developers who are facing a website crisis or who want to improve the reliability of their company's software. Newcomers to Site Reliability Engineering who want to succeed in interviews will also find this book invaluable.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Real-World SRE
    1. Table of Contents
    2. Real-World SRE
      1. Why subscribe?
      2. PacktPub.com
    3. Contributors
      1. About the author
      2. About the reviewer
      3. Packt is Searching for Authors Like You
    4. Preface
      1. Who this book is for
      2. What this book covers
      3. To get the most out of this book
        1. Download the example code files
        2. Download the color images
        3. Conventions used
      4. Get in touch
        1. Reviews
    5. 1. Introduction
      1. A brief history
      2. What is SRE?
      3. What is in the book?
      4. SRE as a framework for new projects
      5. Summary
      6. References
    6. 2. Monitoring
      1. Why monitoring?
      2. Instrumenting an application
        1. What should we measure?
        2. A short introduction to SLIs, SLOs, and error budgets
          1. Service levels
        3. Error budgets
      3. Collecting and saving monitoring data
        1. Polling applications
          1. Nagios
          2. Prometheus
          3. Cacti
          4. Sensu
        2. Push applications
          1. StatsD
          2. Telegraf
          3. ELK
      4. Displaying monitoring information
        1. Arbitrary queries
        2. Graphs
        3. Dashboards
        4. Chatbots
      5. Managing and maintaining monitoring data
      6. Communicating about monitoring
        1. Do they even know there is monitoring?
      7. References and related reading
        1. Future reading
      8. Summary
    7. 3. Incident Response
      1. What is an incident?
      2. What is incident response?
      3. Alerting
        1. When do you alert?
        2. How do you alert?
          1. Alerting services
          2. What is in an alert?
        3. Who do you alert?
      4. Being on call
      5. Communication
        1. Incident Command System (ICS)
        2. Where do you communicate?
      6. Recovering the system
      7. Calling all clear
      8. Summary
    8. 4. Postmortems
      1. What is a postmortem?
      2. Why write a postmortem?
      3. When to write a postmortem document
      4. Carrying out incident analysis
      5. How to write a postmortem document
        1. Summary
        2. Impact
        3. Timeline
        4. Root cause
        5. Action items
          1. Postmortems without action items
        6. Appendix
      6. Blameless postmortems
      7. Holding a postmortem meeting
      8. Analyzing past postmortems
        1. MTTR and MTBF
        2. Alert fatigue
        3. Discussing past outages
      9. Summary
      10. References
    9. 5. Testing and Releasing
      1. Testing
        1. What do you test?
          1. Testing code
            1. Code reviews
            2. Unit, feature, and integration tests
            3. Unit tests
            4. Feature tests
            5. Integration tests
          2. Testing infrastructure
          3. Testing processes
      2. Releasing
        1. When to release
          1. Releasing to production
          2. Validating your release
        2. Rollbacks
      3. Automation
        1. Continuous everything
      4. Summary
    10. 6. Capacity Planning
      1. A quick introduction to business finance
      2. Why plan?
        1. Managing risk and managing expectations
      3. Defining a plan
        1. What is our current capacity?
        2. When are we going to run out of capacity?
        3. How should we change our capacity?
          1. State and concurrency
          2. Is your service limited by another service?
          3. Scaling for events
          4. Unpredictable growth–user-generated content
          5. Preplanned versus autoscaling
          6. Delivering
        4. Execute the plan
      4. Architecture–where performance changes come from
      5. Tech as a profit center and procurement
      6. Summary
    11. 7. Building Tools
      1. Finding projects
      2. Defining projects
        1. RDD
          1. Example
        2. Design documents
      3. Planning projects
        1. Example
        2. Retrospectives and standups
        3. Allocation
      4. Building projects
        1. Advice for writing code
        2. Separation of concerns
        3. Long-term work
          1. Example OKRs
        4. Notebooks
      5. Documenting and maintaining projects
      6. Summary
    12. 8. User Experience
      1. An introduction to design and UX
        1. Real-world interaction design
      2. User testing
        1. Picking an experience
        2. Designing the test
        3. Finding people to test
      3. Developer experience
      4. Experience of tools
      5. Performance budgets
      6. Security
        1. Authentication
        2. Authorization
        3. Risk profile
        4. Phishing
      7. ACM code of ethics
      8. Summary
      9. References
    13. 9. Networking Foundations
      1. The internet
      2. Sending an HTTP request
        1. DNS
          1. dig
        2. Ethernet and TCP/IP
          1. Ethernet
          2. IP
          3. CIDR notation
          4. ICMP
          5. UDP
          6. TCP
        3. HTTP
        4. curl and wget
      3. Tools for watching the network
        1. netstat
        2. nc
        3. tcpdump
      4. Summary
        1. References
    14. 10. Linux and Cloud Foundations
      1. Linux fundamentals
        1. Everything is a file
          1. Files, directories, and inodes
            1. Permissions
          2. Sockets
          3. Devices
          4. /proc
          5. Filesystem layout
        2. What is a process?
          1. Zombies
          2. Orphans
          3. What is nice?
        3. syscalls
          1. How to trace
          2. Watching processes
            1. Load averages
        4. Build your own
      2. Cloud fundamentals
        1. VMs
        2. Containers
        3. Load balancing
        4. Autoscaling
        5. Storage
        6. Queues and Pub/Sub
      3. Units of scale
      4. Example architecture interview
      5. Summary
      6. References
    15. Other Books You May Enjoy
      1. Leave a review - let other readers know what you think
    16. Index