Real-World SRE

Name: Real-World SRE
Author: Nat Welch
ISBN: 9781788628884

August 2018

Beginner to intermediate

340 pages

7h 30m

English

Packt Publishing

Table of Contents
Real-World SRE
Learn more from Packt subscription
Why subscribe?
PacktPub.com
Contributors
About the author
About the reviewer
Packt is Searching for Authors Like You
Preface
Who this book is for
What this book covers

To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
1. Introduction
A brief history
What is SRE?
What is in the book?
SRE as a framework for new projects
Summary
References
2. Monitoring
Why monitoring?
Instrumenting an application
What should we measure?
A short introduction to SLIs, SLOs, and error budgets
Service levels
Error budgets
Collecting and saving monitoring data
Polling applications
Nagios
Prometheus
Cacti
Sensu
Push applications
StatsD
Telegraf
ELK
Displaying monitoring information
Arbitrary queries
Graphs
Dashboards
Chatbots
Managing and maintaining monitoring data
Communicating about monitoring
Do they even know there is monitoring?
References and related reading
Future reading
Summary
3. Incident Response
What is an incident?
What is incident response?
Alerting
When do you alert?
How do you alert?
Alerting services
What is in an alert?
Who do you alert?
Being on call
Communication
Incident Command System (ICS)
Where do you communicate?
Recovering the system
Calling all clear
Summary
4. Postmortems
What is a postmortem?
Why write a postmortem?
When to write a postmortem document
Carrying out incident analysis
How to write a postmortem document
Summary
Impact
Timeline
Root cause
Action items
Postmortems without action items
Appendix
Blameless postmortems
Holding a postmortem meeting
Analyzing past postmortems
MTTR and MTBF
Alert fatigue
Discussing past outages
Summary
References
5. Testing and Releasing
Testing
What do you test?
Testing code
Code reviews
Unit, feature, and integration tests
Unit tests
Feature tests
Integration tests
Testing infrastructure
Testing processes
Releasing
When to release
Releasing to production
Validating your release
Rollbacks
Automation
Continuous everything
Summary
6. Capacity Planning
A quick introduction to business finance
Why plan?
Managing risk and managing expectations
Defining a plan
What is our current capacity?
When are we going to run out of capacity?
How should we change our capacity?
State and concurrency
Is your service limited by another service?
Scaling for events
Unpredictable growth–user-generated content
Preplanned versus autoscaling
Delivering
Execute the plan
Architecture–where performance changes come from
Tech as a profit center and procurement
Summary
7. Building Tools
Finding projects
Defining projects
RDD
Example
Design documents
Planning projects
Example
Retrospectives and standups
Allocation
Building projects
Advice for writing code
Separation of concerns
Long-term work
Example OKRs
Notebooks
Documenting and maintaining projects
Summary
8. User Experience
An introduction to design and UX
Real-world interaction design
User testing
Picking an experience
Designing the test
Finding people to test
Developer experience
Experience of tools
Performance budgets
Security
Authentication
Authorization
Risk profile
Phishing
ACM code of ethics
Summary
References
9. Networking Foundations
The internet
Sending an HTTP request
DNS
dig
Ethernet and TCP/IP
Ethernet
IP
CIDR notation
ICMP
UDP
TCP
HTTP
curl and wget
Tools for watching the network
netstat
nc
tcpdump
Summary
References
10. Linux and Cloud Foundations
Linux fundamentals
Everything is a file
Files, directories, and inodes
Permissions
Sockets
Devices
/proc
Filesystem layout
What is a process?
Zombies
Orphans
What is nice?
syscalls
How to trace
Watching processes
Load averages
Build your own
Cloud fundamentals
VMs
Containers
Load balancing
Autoscaling
Storage
Queues and Pub/Sub
Units of scale
Example architecture interview
Summary
References
Other Books You May Enjoy
Leave a review - let other readers know what you think
Index

Overview

Real-World SRE equips you with the essential tools and techniques to navigate the challenges of system outages and ensure reliable uptime. Authored by an industry expert with experience at leading outage-sensitive companies, this book provides a practical roadmap for troubleshooting and anticipating issues.

What this book will help me do

Implement effective monitoring strategies for early failure detection.
Develop resilient incident response plans to minimize downtime.
Leverage automated solutions for efficient software testing and deployments.
Analyze capacity and plan for future growth to avoid bottlenecks.
Excel in SRE interviews and advance your career in reliability engineering.

Author(s)

Nat Welch brings years of experience as a Site Reliability Engineer, including time at Google, where reliability is paramount. He has a knack for transforming complex scenarios into actionable insights, making this book a vital resource. Nat's methods are practical, drawn directly from his in-the-trenches expertise. His style is approachable and geared toward pragmatic solutions.

Who is it for?

This book is for developers, system administrators, and aspiring Site Reliability Engineers who want to improve their skill set for ensuring software uptime and handling system crises effectively. It's geared toward those with a foundational understanding of systems who wish to deepen their knowledge. Nat Welch's guidance is ideal for students and professionals looking to excel in SRE roles. Beginners curious about SRE practices will also find this book accessible and informative.
