Chapter 6. Architecture of BigQuery

BigQuery aspires to scale to your datasets and run as fast as your business requires. The experience should seem like magic. The problem with things that appear to be “magic” is that when you encounter a problem, you don’t know how to even begin fixing it.

This chapter delves into the inner workings of BigQuery. We cover its high-level architecture and the Dremel query engine, and provide details on the storage metadata. How BigQuery handles security, availability, and disaster recovery is covered in Chapter 10. At the very least, this chapter might satisfy your curiosity. However, if something doesn’t behave the way you expect it to, this chapter can help you understand what is actually going on and how you can fix or work around the problem.

High-Level Architecture

BigQuery is a large-scale distributed system with hundreds of thousands of execution tasks in dozens of interrelated microservices in several availability zones across every Google Cloud region. This section presents a simplified view of how the high-level pieces fit together. Describing all of the components in detail might require its own book: we’d lose most of our readers by the time we got past the storage transcoder, and the rest would drop out long before we got to the stubby proxy (yes, that’s a real thing, and no, it isn’t as weird as it sounds).

Life of a Query Request

To understand how BigQuery is put together, let’s step through what happens ...
