Spotlight on Cloud: Using Prometheus for Black Box Monitoring with Aaron Wieczorek
An interactive case study from the United States Digital Service
What happens when you rely on a third-party service and it goes down? And how do you even know it’s down until your own product stops working or hangs?
Join us for this edition of Spotlight on Cloud to learn how the United States Digital Service (USDS) made its systems more fault tolerant. Aaron Wieczorek, site reliability engineer at USDS, analyzes the challenges of proactively monitoring third-party services and details USDS’s black box monitoring solution, which uses modern open source tools like Prometheus and Grafana to provide monitoring, incident response, and root cause analysis for events outside of the team’s control.
O’Reilly Spotlight explores emerging business and technology topics and ideas through a series of one-hour interactive events. You’ll engage in a live conversation with experts, sharing your questions and ideas while hearing their unique perspectives, insights, fears, and predictions for the future.
In every edition of Spotlight on Cloud, you’ll learn about, discuss, and debate the complex, ever-evolving world of the cloud. Best of all, you’ll discover how successful companies have adopted and embraced this massive network of shared information and how you can follow their lead to transform your organization and prepare for the Next Economy.
What you'll learn and how you can apply it
- How USDS deployed a black box monitoring solution with modern open source tools like Prometheus and Grafana
- How USDS approached monitoring, incident response, and root cause analysis for events outside of the team’s control
- What actions to take when a third-party service you rely on goes down
This training course is for you because...
- You're a site reliability engineer, product manager, or DevOps practitioner who needs to monitor applications and make them fault-tolerant in the face of outages.
Prerequisites
- Come with your questions for Aaron Wieczorek
- Have a pen and paper handy to capture notes, insights, and inspiration
- Read “Working with Third Parties Shouldn’t Suck” (chapter 5 in Seeking SRE)
- Read Prometheus: Up & Running (book)
- Watch Practical monitoring with Prometheus and Grafana (recorded conference session)
- Read Distributed Systems Observability (report)
- Read Site Reliability Engineering (book)
- Read Practical Monitoring (book)
About your instructor
Aaron Wieczorek is a site reliability engineer on the headquarters team at the United States Digital Service. He works on hard technical and bureaucratic problems, building infrastructure as code and CI/CD pipelines, along with network and release engineering.
The timeframes are only estimates and may vary according to how the class is progressing.
Thursday, September 5, 2019, at 9:00am PT / 12:00pm ET
- Introduction and presentation (15 minutes)
- Interactive discussion and Q&A (45 minutes)