Skip to Content
High Performance Spark, 2nd Edition
book

High Performance Spark, 2nd Edition

by Holden Karau, Adi Polak, Rachel Warren
May 2026
Intermediate to advanced
350 pages
2h 50m
English
O'Reilly Media, Inc.
Content preview from High Performance Spark, 2nd Edition

Chapter 3. Joins (SQL and Core)

Joining data is an important part of many of our pipelines, and both Spark Core and SQL support the same fundamental types of joins. While Core and SQL support the same types they have very different execution options and performance. While joins are very common and powerful, they warrant special performance consideration as they may require large network transfers or even create datasets beyond our capability to handle.1 In core Spark it can be more important to think about the ordering of operations, since the DAG optimizer, unlike the SQL optimizer, isn’t able to re-order or push down filters.

To understand the performance of joins in Spark it’s important to distinguish two similar-sounding concepts, join types and join execution techniques. Join types impact the result and are things like left and right inner joins, outer joins, full-self-joins, ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Learning Spark, 2nd Edition

Learning Spark, 2nd Edition

Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee

Publisher Resources

ISBN: 9781098145842Errata Page