Chapter 15. Query Optimization Service

Now we are ready to operationalize the insights in production. Data users have written the business logic that generates insights in the form of dashboards, ML models, and so on. The data transformation logic is written either as SQL queries or as big data programs (using frameworks such as Apache Spark, Apache Beam, and so on) implemented in Python, Java, Scala, etc. This chapter focuses on optimizing these queries and big data programs.
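
To make the two styles concrete, the following is a minimal PySpark sketch; the orders table and its order_date and amount columns are hypothetical. The same aggregation is expressed once as a SQL query and once as a DataFrame program.

    # The same transformation expressed two ways. The table name (orders)
    # and its columns are hypothetical examples.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("transform-example").getOrCreate()

    # Style 1: as a SQL query
    daily_revenue_sql = spark.sql("""
        SELECT order_date, SUM(amount) AS revenue
        FROM orders
        GROUP BY order_date
    """)

    # Style 2: the same logic as a DataFrame program
    daily_revenue_df = (
        spark.table("orders")
             .groupBy("order_date")
             .agg(F.sum("amount").alias("revenue"))
    )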

The difference between a well-tuned and a poorly tuned query is significant. For instance, based on real-world experience, it is not unusual for a deployed production query to run for over 4 hours when, after optimization, it could finish in less than 10 minutes. Long-running queries that run repeatedly are prime candidates for tuning.
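
As a sketch of how such candidates can be surfaced, the following snippet assumes the engine's query log has been loaded into a hypothetical query_history table with a normalized query fingerprint and a per-run runtime in seconds; the table, its columns, and the thresholds are all assumptions for illustration.

    # Surface tuning candidates: queries that run often and run long.
    # The query_history table and its columns are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("tuning-candidates").getOrCreate()

    candidates = (
        spark.table("query_history")
             .groupBy("query_fingerprint")   # hash of the normalized query text
             .agg(
                 F.count("*").alias("runs"),
                 F.avg("runtime_secs").alias("avg_runtime_secs"),
             )
             # e.g., at least 10 runs averaging over an hour each
             .where((F.col("runs") >= 10) & (F.col("avg_runtime_secs") > 3600))
             .orderBy(F.desc("avg_runtime_secs"))
    )
    candidates.show()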

Data users aren’t engineers, which leads to several pain points in query tuning. First, query engines like Hadoop, Spark, and Presto expose a plethora of configuration knobs. Understanding which knobs to tune and what impact they have is nontrivial for most data users and requires a deep understanding of the inner workings of the query engine. There is no silver bullet: the optimal knob values for a query vary with the data model, query type, cluster size, concurrent query load, and so on. Given the scale of the data, a brute-force approach of experimenting with different knob values is not feasible either.
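
To give a sense of what these knobs look like, here is a minimal Spark sketch. The settings shown are real Spark configuration properties, but the values are illustrative only, since the optimal values depend on the factors above.

    # A few representative Spark knobs. Values are illustrative, not
    # recommendations; optimal settings are workload-dependent.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
            .appName("knob-example")
            # Partitions used when shuffling data for joins and aggregations
            .config("spark.sql.shuffle.partitions", "400")
            # Memory allocated to each executor process
            .config("spark.executor.memory", "8g")
            # Tables below this size (bytes) are broadcast, avoiding a shuffle join
            .config("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))
            .getOrCreate()
    )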

Second, given the petabyte (PB) scale of data, writing queries optimized for distributed data processing best practices ...
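
One example of such a best practice, sketched below under the assumption of hypothetical orders and regions tables, is broadcasting a small dimension table in a join so that the large table is not shuffled across the cluster.

    # Broadcast-join sketch: ship the small table to every executor
    # instead of shuffling both sides. Table names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("broadcast-join-example").getOrCreate()

    orders = spark.table("orders")      # large fact table (PB scale)
    regions = spark.table("regions")    # small dimension table

    # Hint Spark to broadcast the small table, avoiding a shuffle of orders.
    joined = orders.join(broadcast(regions), on="region_id")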
