Chapter 3. Joins (SQL and Core)

Joining data is an important part of many of our pipelines, and both Spark Core and SQL support the same fundamental types of joins. Although Core and SQL support the same join types, they have very different execution options and performance characteristics. Joins are common and powerful, but they warrant special performance consideration because they may require large network transfers or even create datasets beyond our capability to handle. In Core Spark it can be more important to think about the ordering of operations, since the DAG optimizer, unlike the SQL optimizer, isn’t able to reorder operations or push down filters.
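To make the ordering point concrete, here is a minimal sketch in plain Python rather than Spark code. It models an inner join over in-memory lists of key/value pairs and compares filtering after the join with filtering before it. The helper `inner_join`, the predicate `is_high`, and the sample data are all hypothetical and exist only for illustration. The sketch assumes the filter depends on one side of the join only, which is the case where moving it ahead of the join is safe.

```python
from collections import defaultdict


def inner_join(left, right):
    """Hash-based inner join of two lists of (key, value) pairs (illustrative only)."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    # Emit one output record per matching (left value, right value) combination.
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]


def is_high(score):
    """Hypothetical predicate that only looks at the left-hand value."""
    return score > 0.8


# Hypothetical sample data: (id, score) and (id, region).
scores = [(1, 0.2), (1, 0.9), (2, 0.4), (3, 0.95), (3, 0.1)]
regions = [(1, "north"), (2, "south"), (3, "east")]

# Join first, filter afterwards: all five score records enter the join.
join_then_filter = [
    (key, (score, region))
    for key, (score, region) in inner_join(scores, regions)
    if is_high(score)
]

# Filter first, join afterwards: only the two surviving records enter the join.
high_scores = [(key, score) for key, score in scores if is_high(score)]
filter_then_join = inner_join(high_scores, regions)

print(join_then_filter == filter_then_join)  # both orderings give the same result
print(len(scores), len(high_scores))         # records entering the join in each case
```

Both orderings produce the same records, but the second sends fewer of them through the join. In a distributed setting that difference would show up as less data moved across the network, which is why the order in which we write the operations matters when no optimizer rearranges them for us.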

To understand the performance of joins in Spark, it’s important to distinguish two similar-sounding concepts: join types and join execution techniques. Join types determine the result and include inner joins, left and right outer joins, full outer joins, self-joins, ...
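As a rough illustration of how the join type alone changes the result, the following plain-Python sketch applies several join types to the same two inputs. It is not Spark's API: the `join` helper, its `how` labels, and the sample data are hypothetical, and each side is simplified to a dictionary with unique keys.

```python
def join(left, right, how="inner"):
    """Join two dicts of key -> value; a side with no match is filled with None."""
    if how == "inner":
        keys = left.keys() & right.keys()   # keys present on both sides
    elif how == "left_outer":
        keys = left.keys()                  # every key from the left side
    elif how == "right_outer":
        keys = right.keys()                 # every key from the right side
    elif how == "full_outer":
        keys = left.keys() | right.keys()   # every key from either side
    else:
        raise ValueError(f"unknown join type: {how}")
    return {key: (left.get(key), right.get(key)) for key in sorted(keys)}


# Hypothetical inputs that share only the key 2.
left = {1: "a", 2: "b"}
right = {2: "x", 3: "y"}

inner = join(left, right, "inner")
left_outer = join(left, right, "left_outer")
right_outer = join(left, right, "right_outer")
full_outer = join(left, right, "full_outer")

for name, result in [("inner", inner), ("left_outer", left_outer),
                     ("right_outer", right_outer), ("full_outer", full_outer)]:
    print(name, result)
```

The inputs are identical in every call, so the differences in output come entirely from the join type. How a system goes about computing any one of these results is a separate question from which result is requested.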
