Distributed Tracing in Practice
by Austin Parker, Daniel Spoonhower, Jonathan Mace, Ben Sigelman, Rebecca Isaacs
Foreword
Human beings have struggled to understand production software for exactly as long as human beings have had production software. We have these marvelously fast machines, but they don’t speak our language and—despite their speed and all of the hype about artificial intelligence—they are still entirely unreflective and opaque.
For many (many) decades, our efforts to understand production software ultimately boiled down to two types of telemetry data: log data and time series statistics. The time series data—also known as metrics—helped us understand that “something terrible” was happening inside our computers. If we were lucky, the logging data would help us understand specifically what that terrible thing was.
But then everything changed: our software needed more than just one computer. In fact, it needed thousands of them.
We broke the software into tiny, independently operated services and distributed those fragmented services across the planet, atomized among the millions of computers housed in massive datacenters. And with so many processes involved in every end-user request, the logs and statistics from individual machines told only a sliver of the story. It felt like we were flying blind.
I started working on distributed tracing in early 2005. At the time, I was a 25-year-old software engineer working—somewhat grudgingly, if I’m being candid—on a far-flung service within the Google AdWords backend infrastructure. Like the rest of the company, I was trying to write ...