Chapter 10. Monitoring in Production
Whether you’re a product owner, machine learning (ML) engineer, or site reliability engineer (SRE), once agents hit production, you need to see what they’re doing and why. Shipping agentic systems is only the halfway point. The real challenge begins once your agents are operating in dynamic, unpredictable, high-stakes environments. Monitoring is how you learn from reality: how you catch failures before they escalate, identify regressions before users notice, and adapt systems in response to real-world signals.
Unlike traditional software, agents behave probabilistically. They depend on foundation models, chain together tools, and respond to unbounded user inputs. You can’t write exhaustive tests for every scenario. That’s why monitoring becomes the nervous system of your deployed agent infrastructure.
Monitoring isn’t just about detecting problems. It’s the backbone of a tight feedback loop that accelerates learning and iteration. Teams that monitor well learn faster, ship safer, and improve reliability with every deployment.
In this chapter, we focus on open source monitoring. While there are excellent commercial platforms like Arize AX, Langfuse, and WhyLabs, we’ll concentrate here on tooling you can self-host and extend freely. Our reference stack includes:
- OpenTelemetry: for instrumenting agent workflows
- Loki: for log aggregation and search
- Tempo: for distributed traces
- Grafana: for visualization, alerts, and dashboards
We’ll walk ...