Site Reliability Engineering (SRE) and Observability Fundamentals

Published by O'Reilly Media, Inc.

Content level: Beginner

Learn how to prevent downtime, identify problems quickly, and maintain reliable systems

Course outcomes

  • Differentiate observability, monitoring, and reliability and learn the role of logs, metrics, and traces in system health
  • Understand how service level objectives (SLOs) and service level indicators (SLIs) support system performance and reliability goals
  • Explore essential tools and best practices for observability and reliability engineering
  • Identify key system metrics, interpret health reports, and understand how observability supports failure diagnosis
  • Understand key concepts like fault tolerance, redundancy, and incident response to minimize system downtime and improve uptime

Course description

Join expert Adora Nwodo to learn how to support the health and performance of software systems. You’ll cover key concepts such as system monitoring, reliability, and spotting and resolving issues, and explore how logs, metrics, and traces help track system health and how service level objectives set performance goals. You’ll discover practical methods companies use to prevent downtime, identify problems quickly, and maintain reliable systems, and you’ll come away able to apply these methods in real-world job roles.

What you’ll learn and how you can apply it

  • Understand the importance of system observability and reliability and how they help reduce downtime
  • Learn the difference between observability and monitoring and why both are essential for system health
  • Identify the role of logs, metrics, and traces in tracking system performance and troubleshooting issues
  • Understand how service level objectives and service level indicators are used to measure system performance and reliability
  • Recognize industry tools and how they support observability
  • Analyze system reports and performance metrics to identify potential issues before they impact users
  • Understand best practices for incident response and system recovery to reduce downtime and improve system reliability
  • Learn how system reliability efforts contribute to business success by improving uptime and user satisfaction

This live event is for you because...

  • You’re a software engineer, DevOps specialist, or site reliability engineer (SRE) who’s looking to strengthen your knowledge of system health and performance.
  • You work with cloud-based applications, distributed systems, or software that requires high availability and minimal downtime.
  • You want to become a key contributor to system reliability initiatives by learning how to track performance, reduce downtime, and support incident response efforts.

Prerequisites

  • Basic understanding of software development concepts such as applications, servers, and system performance
  • Familiarity with general IT or DevOps principles like system monitoring, uptime, and troubleshooting

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Introduction to observability and reliability (40 minutes)

  • Presentation: What are observability and reliability?; key differences between observability and monitoring; why observability and reliability are critical in modern systems; the happenings framework—observability through time; how time affects system health and performance analysis; observing happenings before, during, and after system events; how AI enhances observability and reliability; AIOps platforms (Moogsoft, BigPanda) for incident correlation
  • Group discussion: When have you faced an unexplained system slowdown?; Which of these concepts (observability, monitoring, or reliability) would have helped you solve it faster?
  • Break

Logs, metrics, and traces (40 minutes)

  • Presentation: Introduction to logs, metrics, and traces; their roles in the happenings framework; how they provide visibility across time; challenges in using them effectively
  • Group discussion: Which of these observability components would you use to spot the exact point of failure in a request?
  • Q&A
  • Break

Service level objectives and indicators (40 minutes)

  • Presentation: What are SLOs and SLIs?; difference between SLOs, SLIs, and service level agreements (SLAs); how to define SLOs and SLIs using the happenings framework; SLOs for time-based analysis of performance and availability
  • Group discussion: If you were managing a video streaming platform, what would be a reasonable SLO for buffering time?
  • Q&A
  • Break
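
As a rough illustration of how an SLI relates to an SLO, the sketch below computes an availability SLI from request counts and the share of the error budget that remains. It is not taken from the course materials: the function names, the request counts, and the 99.9% target are all hypothetical, and "good request" is assumed to mean whatever success criterion a team has agreed on.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that met the success criterion."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent.

    1.0 means the budget is untouched, 0.0 means it is exactly used up,
    and a negative value means the SLO has been missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # failure rate the SLO tolerates
    observed_failure = 1.0 - sli        # failure rate actually measured
    return 1.0 - observed_failure / allowed_failure


# Hypothetical numbers: 1,000,000 requests, 500 of them failed,
# measured against a 99.9% availability SLO.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
remaining = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.1%}")
```

With these made-up figures, about half of the error budget is still available. An SLA would typically be a separate contractual commitment layered on top of an internal target like this one.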

Tools for observability and reliability (40 minutes)

  • Presentation: What observability tools do and why they matter (Prometheus, Grafana, and OpenTelemetry); how they fit into the happenings framework; using observability tools to analyze happenings before, during, and after system events; AI-driven tools (Splunk, Dynatrace, Datadog); observability tool setup and use; data flow—from event to log, metric, or trace; role of dashboards and alerts in the happenings framework
  • Group discussion: Which of these tools (Prometheus, Grafana, OpenTelemetry) would be most useful for visualizing system activity over time?; Which do you think your team would benefit from using most, and why?
  • Q&A
  • Break

Incident response and failure recovery (50 minutes)

  • Presentation: Detection, diagnosis, response, and recovery; the role of the happenings framework; mapping incidents; AI in incident management
  • Group discussion: When responding to a live incident, which step is the most difficult: detection, diagnosis, response, or recovery?
  • Q&A
  • Break

Designing reliable systems (30 minutes)

  • Presentation: Redundancy, failover, and fault tolerance; strategies for designing for reliability; designing for resilience—how to observe and respond to key system happenings; design considerations for system uptime and availability
  • Group discussion: Which reliability strategy do you think has the biggest impact on system uptime: redundancy, failover, or fault tolerance?
  • Q&A
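
To give a feel for why redundancy affects uptime, here is a minimal sketch of the standard availability formula for replicas running in parallel. It is an illustration rather than course material, and it rests on a strong assumption: replicas fail independently and the service stays up as long as at least one replica is up. The function name and the 99% figure are hypothetical.

```python
def parallel_availability(replica_availability: float, replicas: int) -> float:
    """Availability of a service that is up while at least one replica is up.

    Assumes replica failures are independent, which real systems often
    violate (shared power, shared network, correlated deploys).
    """
    if not 0.0 <= replica_availability <= 1.0:
        raise ValueError("replica_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at the same time.
    all_down = (1.0 - replica_availability) ** replicas
    return 1.0 - all_down


# Hypothetical example: each replica is available 99% of the time.
for n in (1, 2, 3):
    print(n, "replica(s):", f"{parallel_availability(0.99, n):.6f}")
```

Under the independence assumption, a second 99% replica lifts the combined figure to roughly 99.99%. Correlated failures and imperfect failover usually make the real-world gain smaller.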

Your Instructor

  • Adora Nwodo

    Adora Nwodo is a multiple-award-winning engineering leader and published author working across cloud, AI, and developer experience. Her work focuses on building intelligent platforms, distributed systems, and scalable cloud infrastructure. She has held engineering and leadership roles at Stack Overflow and Microsoft, working at the intersection of platform engineering, cloud infrastructure, and organizational execution. She’s the author of seven books on cloud engineering, DevOps, and technology leadership and serves as executive director of NexaScale, a nonprofit whose mission is to expand access to technology education for African technologists.

Skill covered

Site Reliability Engineering (SRE)