Live online training

O'Reilly Infrastructure & Ops Superstream: SRE Edition

Topic: Web Ops & Performance
Sam Newman

About the Infrastructure & Ops Superstream Series: This four-part series of half-day online events covers the most challenging and promising topics facing those working in infrastructure and operations today: site reliability engineering, security, Kubernetes, and microservices.

Series schedule:

  • Event 1: SRE Edition - June 17, 2020
  • Event 2: Security Edition - September 23, 2020
  • Event 3: Kubernetes Edition - October 21, 2020
  • Event 4: Microservices Edition - November 18, 2020

NOTE: With today’s registration, you’ll be automatically signed up for all sessions in the Superstream series. We’ll continue to update this page. Check back to see speakers and sessions for later events.

Description: Site reliability engineering (SRE) is the practice of ensuring a company’s digital assets remain stable, performant, and resilient. Given that all companies rely on a digital presence these days, SRE is becoming increasingly vital to the engineering teams supporting the sites and, ultimately, the business.

In this edition of the O’Reilly Infrastructure & Ops Superstream Series, you’ll get an introduction to SRE concepts and best practices and learn how to put them to work in your organization.

What you'll learn and how you can apply it

  • Explore the fundamentals of site reliability engineering, including how to apply SRE within your organization
  • Understand best practices for service-level objectives (SLOs), service-level indicators (SLIs), and communications
  • Learn how to grow an SRE practice: where to start and how to iterate
  • Gain real-world insight from Pivotal’s experience of implementing SLIs and SLOs
  • Get pragmatic advice from the trenches on topics such as blameless postmortems and cross-functional collaboration, defining “done” and managing technical debt, and driving change and funding technical improvements

This Superstream is for you because...

  • You’re a developer new to or looking to enter an SRE role.
  • You help build the tools that improve deployment, shepherding code from developers into production and making sure it keeps running (or anything remotely related).
  • You want to become well-versed in the foundations and best practices of SRE.

Prerequisites

  • Come with your questions
  • Have a pen and paper handy to capture notes, insights, and inspiration

About your hosts

  • After spending time at multiple startups and 12 years at ThoughtWorks, Sam Newman is now an independent consultant. Specializing in microservices, cloud, and continuous delivery, Sam helps clients deliver software faster and more reliably through training and consulting. Sam is an experienced speaker who has spoken at conferences across the world and is the author of Building Microservices and Monolith to Microservices, both from O'Reilly. Sam is also chair of the O’Reilly Infrastructure & Ops Superstream Series.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Sam Newman - Introduction (10mins) - 9:00 AM PT | 12:00 PM ET | 5:00 PM UTC/GMT

  • Sam Newman welcomes you to the O’Reilly Infrastructure & Ops Superstream.

Meet the Expert: Liz Fong-Jones - Refining Systems Data Without Losing Fidelity (50mins) - 9:10 AM PT | 12:10 PM ET | 5:10 PM UTC/GMT

  • It isn’t feasible to run an observability infrastructure that’s the same size as your production infrastructure. Past a certain scale, the cost to collect, process, and save every log entry, every event, and every trace that your systems generate dramatically outweighs the benefits. Statistics can come to our rescue, enabling us to gather accurate, specific, and error-bounded data on our services’ top-level performance and inner workings. Liz Fong-Jones takes you through a three-R approach to data retention: reducing junk data, statistically reusing data points as samples, and recycling data into counters. Learn how to keep the context of the anomalous data flows and cases in your supported services while not allowing the volume of ordinary data to drown it out.

  • Liz Fong-Jones is a developer advocate, labor and ethics organizer, and site reliability engineer (SRE) with 16+ years of experience. She’s an advocate at Honeycomb for the SRE and observability communities and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights. She lives in Brooklyn with her wife Elly, metamours, and a Samoyed-golden retriever mix, and in San Francisco and Seattle with her other partners. She plays classical piano, leads an EVE Online alliance, and advocates for transgender rights as a board member of the National Center for Transgender Equality.

Eric Zielinski and Jason Patterson: So You Want to Be an SRE? (60mins) - 10:00 AM PT | 1:00 PM ET | 6:00 PM UTC/GMT

  • Eric Zielinski and Jason Patterson explain how to implement site reliability engineering in your organization. You’ll learn how to build a business case for SRE, the roles and responsibilities of the SRE, and how to grow and scale SRE best practices across your organization.

  • Eric Zielinski leads the cloud delivery organization at Nationwide, where his teams are responsible for cloud infrastructure, containers, security, and site reliability. With over 20 years of industry experience leading advanced infrastructure operations, engineering, and cybersecurity, Eric is in his 15th year at Nationwide, where he has worked across the company to deliver a portfolio of products and technologies including cloud platforms, security automation, self-service adoption, and DevOps transformation. He is a frequent speaker at conferences such as DockerCon, FIRST, FS-ISAC, and 614con. He holds a bachelor’s degree in information systems from Franklin University and several certifications, including GCCC, GMON, EnCE, and GCIH.

  • Jason Patterson works with developers, engineers, and technology leaders to flip the reliance put on manual processes. As AVP of site reliability engineering at Nationwide, he’s growing the team and injecting the role where it matters most. With experience developing and supporting applications for enterprises, academia, and startups, Jason has executed transformations across the country with the goal of improving the lives of developers.

  • Break (5mins)

Debbie Wood: How and Why We Lowered Our SLO and Other SRE Life Lessons (55mins) - 11:05 AM PT | 2:05 PM ET | 7:05 PM UTC/GMT

  • Debbie Wood shares real-world life lessons on successfully implementing SLIs and SLOs, offering an executive summary of a three-year journey spent iterating to a more mature understanding of SLIs as an accurate proxy for user pain and adequate SLOs as a faithful threshold for tolerable user disruption. Join in to learn 10 practical steps for bootstrapping these practices.

  • Deborah Wood is the product manager for cloud operations in Europe at Pivotal. Together with her team, she supports the production platform for the Pivotal Tracker application, advocates for the use of site reliability engineering, and supports Pivotal product teams in feedback and improvements to the UX of operations. The cloud ops team has also supported the multiregion, multifoundation platform on which Comic Relief’s Sport Relief campaign donations application ran in 2016 and 2018. Debbie also teaches and socializes SRE in-house.

  • Break (5mins)

Randy Shoup: Learning from Learnings—Anatomy of 3 Incidents (50mins) - 12:05 PM PT | 3:05 PM ET | 8:05 PM UTC/GMT

  • The best response to a system outage isn't "What did you do?" but "What did we learn?" Randy Shoup walks you through three system-wide outages—at Google, Stitch Fix, and WeWork—from incident to aftermath to recovery. You'll hear a few war stories from the trenches and learn a set of actionable suggestions for dealing with customers, engineering teams, and upper management.

  • Randy Shoup has spent more than two decades building distributed systems and high-performing teams as a senior technology leader at eBay, Google, and Stitch Fix. He coaches CTOs, advises companies, and generally makes a nuisance of himself wherever possible. Most recently, he was VP of engineering at WeWork in San Francisco. He’s particularly interested in the nexus between culture, technology, and organization.

Sam Newman - Closing Remarks (5mins) - 12:55 PM PT | 3:55 PM ET | 8:55 PM UTC/GMT