The Self-Service Data Roadmap

by Sandeep Uttamchandani
September 2020
Beginner to intermediate
284 pages
7h 40m
English
O'Reilly Media, Inc.

Chapter 15. Query Optimization Service

Now we are ready to operationalize the insights in production. Data users have written the business logic that generates insights in the form of dashboards, ML models, and so on. The data transformation logic is written either as SQL queries or in big data programming models (such as Apache Spark and Beam) implemented in Python, Java, Scala, and other languages. This chapter focuses on optimizing these queries and big data programs.

The difference between good and bad queries is significant. For instance, based on real-world experience, it is not unusual for a deployed production query to run for over 4 hours when, after optimization, it could run in less than 10 minutes. Long-running queries that run repeatedly are candidates for tuning.
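One way to make "long-running and run repeatedly" concrete is to rank queries by the total time they consume per week (average runtime multiplied by run count). The sketch below is a minimal illustration under assumed inputs: the `QueryStats` fields, the thresholds, and the sample numbers are hypothetical and do not come from the book or from any query engine's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryStats:
    name: str               # hypothetical query identifier
    avg_runtime_min: float  # average wall-clock runtime per run, in minutes
    runs_per_week: int      # how often the query is executed

    @property
    def weekly_cost_min(self) -> float:
        """Total minutes per week spent running this query."""
        return self.avg_runtime_min * self.runs_per_week


def tuning_candidates(stats, min_runtime_min=30.0, min_runs_per_week=2):
    """Return long-running, repeatedly run queries, most expensive first.

    Both thresholds are assumptions chosen for illustration.
    """
    eligible = [
        s for s in stats
        if s.avg_runtime_min >= min_runtime_min
        and s.runs_per_week >= min_runs_per_week
    ]
    return sorted(eligible, key=lambda s: s.weekly_cost_min, reverse=True)


def speedup(before_min, after_min):
    """Ratio of runtime before tuning to runtime after tuning."""
    return before_min / after_min


# Illustrative numbers only; not measurements from the book.
QUERY_LOG = [
    QueryStats("daily_revenue_rollup", 240.0, 7),
    QueryStats("adhoc_export", 300.0, 1),
    QueryStats("hourly_sessionize", 45.0, 168),
    QueryStats("lookup_refresh", 2.0, 1000),
]

if __name__ == "__main__":
    for s in tuning_candidates(QUERY_LOG):
        print(f"{s.name}: {s.weekly_cost_min:.0f} min/week")
    # The 4-hour versus 10-minute case from the text is a 24x speedup.
    print(f"speedup: {speedup(240, 10):.0f}x")
```

In this sample, the one-off export is excluded despite its long runtime, and the frequent but short lookup is excluded despite its run count. Ranking by weekly cost is only one possible heuristic; a real service might also weigh resource usage or downstream deadlines.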

Data users aren’t engineers, which leads to several pain points for query tuning. First, query engines like Hadoop, Spark, and Presto have a plethora of knobs. Understanding which knobs to tune and what impact they have is nontrivial for most data users and requires a deep understanding of the inner workings of the query engines. There are no silver bullets: the optimal knob values for a query vary based on data models, query types, cluster sizes, concurrent query load, and so on. Given the scale of data, a brute-force approach of experimenting with different knob values is not feasible either.
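A rough count shows why brute force does not scale. The sketch below assumes a small grid of five knobs with a handful of candidate values each; the knob names and values are hypothetical placeholders, not actual settings of Hadoop, Spark, or Presto. Even this toy grid yields hundreds of configurations, each of which would require a full query run to evaluate.

```python
from itertools import product
from math import prod

# Hypothetical knobs and candidate values, for illustration only.
KNOB_GRID = {
    "executor_memory_gb": [4, 8, 16, 32],
    "executor_cores": [2, 4, 8],
    "shuffle_partitions": [200, 400, 800, 1600],
    "broadcast_join_threshold_mb": [10, 50, 100],
    "dynamic_allocation": [True, False],
}


def grid_size(grid):
    """Number of distinct configurations in the full cross product."""
    return prod(len(values) for values in grid.values())


def brute_force_hours(grid, minutes_per_trial):
    """Hours needed to run one trial per configuration, back to back."""
    return grid_size(grid) * minutes_per_trial / 60


def configurations(grid):
    """Lazily yield every configuration as a dict of knob name to value."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


if __name__ == "__main__":
    print("configurations:", grid_size(KNOB_GRID))
    # Even at an optimistic 10 minutes per trial, the sweep takes days.
    print("hours at 10 min/trial:", brute_force_hours(KNOB_GRID, 10))
    # At the 4-hour runtime of an untuned query, it takes far longer.
    print("hours at 240 min/trial:", brute_force_hours(KNOB_GRID, 240))
```

The count grows multiplicatively with every added knob or value, and the best configuration can still shift with data size and concurrent load, so an exhaustive sweep would need repeating as conditions change.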

Second, given the petabyte (PB) scale of data, writing queries optimized for distributed data processing best practices ...

ISBN: 9781492075240