Site Reliability Engineering (SRE) and Observability Fundamentals

Beginner

Learn how to prevent downtime, identify problems quickly, and maintain reliable systems

Course outcomes

Differentiate observability, monitoring, and reliability and learn the role of logs, metrics, and traces in system health
Understand how service level objectives (SLOs) and service level indicators (SLIs) support system performance and reliability goals
Explore essential tools and best practices for observability and reliability engineering
Identify key system metrics, interpret health reports, and understand how observability supports failure diagnosis
Understand key concepts like fault tolerance, redundancy, and incident response that help minimize system downtime

Course description

Join expert Adora Nwodo to learn how to support the health and performance of software systems. You’ll cover key concepts such as system monitoring, reliability, and spotting and resolving issues, and explore how logs, metrics, and traces help track system health and how service level objectives are used to set performance goals. You’ll discover practical methods companies use to prevent downtime, identify problems quickly, and maintain reliable systems, and you’ll come away with the knowledge to apply these methods in real-world job roles.

What you’ll learn and how you can apply it

Understand the importance of system observability and reliability and how they help reduce downtime
Learn the difference between observability and monitoring and why both are essential for system health
Identify the role of logs, metrics, and traces in tracking system performance and troubleshooting issues
Understand how service level objectives and service level indicators are used to measure system performance and reliability
Recognize industry tools and how they support observability
Analyze system reports and performance metrics to identify potential issues before they impact users
Understand best practices for incident response and system recovery to reduce downtime and improve system reliability
Learn how system reliability efforts contribute to business success by improving uptime and user satisfaction

This live event is for you because...

You’re a software engineer, DevOps specialist, or site reliability engineer (SRE) who’s looking to strengthen your knowledge of system health and performance.
You work with cloud-based applications, distributed systems, or software that requires high availability and minimal downtime.
You want to become a key contributor to system reliability initiatives by learning how to track performance, reduce downtime, and support incident response efforts.

Prerequisites

Basic understanding of software development concepts such as applications, servers, and system performance
Familiarity with general IT or DevOps principles like system monitoring, uptime, and troubleshooting

Recommended follow-up

Read Observability Engineering (book)
Listen to Cloud Native Observability (audiobook)
Explore Fundamentals of Observability with OpenTelemetry (on-demand course)

Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Introduction to observability and reliability (40 minutes)

Presentation: What are observability and reliability?; key differences between observability and monitoring; why observability and reliability are critical in modern systems; the happenings framework—observability through time; how time affects system health and performance analysis; observing happenings before, during, and after system events; how AI enhances observability and reliability; AIOps platforms (Moogsoft, BigPanda) for incident correlation
Group discussion: When have you faced an unexplained system slowdown?; Which of these concepts (observability, monitoring, or reliability) would have helped you solve it faster?
Break

Logs, metrics, and traces (40 minutes)

Presentation: Introduction to logs, metrics, and traces; their roles in the happenings framework; how they provide visibility across time; challenges in using them effectively
Group discussion: Which of these observability components would you use to spot the exact point of failure in a request?
Q&A
Break

Service level objectives and indicators (40 minutes)

Presentation: What are SLOs and SLIs?; difference between SLOs, SLIs, and service level agreements (SLAs); how to define SLOs and SLIs using the happenings framework; SLOs for time-based analysis of performance and availability
Group discussion: If you were managing a video streaming platform, what would be a reasonable SLO for buffering time?
Q&A
Break

Tools for observability and reliability (40 minutes)

Presentation: What observability tools do and why they matter (Prometheus, Grafana, and OpenTelemetry); how they fit into the happenings framework; using observability tools to analyze happenings before, during, and after system events; AI-driven tools (Splunk, Dynatrace, Datadog); observability tool setup and use; data flow—from event to log, metric, or trace; role of dashboards and alerts in the happenings framework
Group discussion: Which of these tools (Prometheus, Grafana, OpenTelemetry) would be most useful for visualizing system activity over time?; Which do you think your team would benefit from using most, and why?
Q&A
Break

Incident response and failure recovery (50 minutes)

Presentation: Detection, diagnosis, response, and recovery; the role of the happenings framework; mapping incidents; AI in incident management
Group discussion: When responding to a live incident, which step is the most difficult: detection, diagnosis, response, or recovery?
Q&A
Break

Designing reliable systems (30 minutes)

Presentation: Redundancy, failover, and fault tolerance; strategies for designing for reliability; designing for resilience—how to observe and respond to key system happenings; design considerations for system uptime and availability
Group discussion: Which reliability strategy do you think has the biggest impact on system uptime: redundancy, failover, or fault tolerance?
Q&A

Your Instructor

Adora Nwodo
Adora Nwodo is a multiple-award-winning engineering leader and published author working across cloud, AI, and developer experience. Her work focuses on building intelligent platforms, distributed systems, and scalable cloud infrastructure. She has held engineering and leadership roles at Stack Overflow and Microsoft, working at the intersection of platform engineering, cloud infrastructure, and organizational execution. She’s the author of seven books on cloud engineering, DevOps, and technology leadership and serves as executive director of NexaScale, a nonprofit whose mission is to expand access to technology education for African technologists.

Skill covered

Site Reliability Engineering (SRE)
