Chapter 7. Building End-to-End Lineage

On July 27, 2004, a five-year-old startup by the name of Google was faced with a serious problem: its application was down.

For several hours, users across the United States, France, and Great Britain were unable to access the popular search engine. The then-700-person company and its millions of users were left in the dark as engineers struggled to fix the problem and find its root cause. By midday, a tedious and intensive process conducted by a few panicked engineers determined that the MyDoom virus was to blame.

By 2021, an outage of that length and scale would be considered rather anomalous, but in the mid-2000s these types of software outages weren’t uncommon. After leading teams through several of these experiences over the years, Benjamin Treynor Sloss, a Google engineering manager at the time, determined there had to be a better way to manage and prevent these dizzying fire drills, not just at Google but across the industry.

Inspired by his early career building data and IT infrastructure, Sloss codified his learnings as an entirely new discipline—site reliability engineering (SRE)—dedicated to optimizing the maintenance and operations of software systems (like Google’s search engine) with reliability in mind.

According to Sloss and others paving the way forward for the discipline, SRE was about automating away the need to worry about edge cases and unknown unknowns (like buggy code, server failures, and viruses). ...
