Skip to Content
Hadoop Application Architectures
book

Hadoop Application Architectures

by Mark Grover, Ted Malaska, Jonathan Seidman, Gwen Shapira
July 2015
Intermediate to advanced
250 pages
10h 47m
English
O'Reilly Media, Inc.
Content preview from Hadoop Application Architectures

Appendix A. Joins in Impala

We provided an overview of Impala and how it works in Chapter 3. In this appendix, we look at how Impala plans and executes a distributed join query. At the time of this writing, Impala has two join strategies: broadcast joins and partitioned hash joins.

Broadcast Joins

The broadcast join is the first and the default join pattern of Impala. In a broadcast join Impala takes the smaller data set and distributes it to all the Impala daemons involved with the query plan. Once distributed, the participating Impala daemons will store the data set as an in-memory hash table. Then each Impala daemon will read the parts of the larger data set that are local to its node and use the in-memory hash table to find the rows that match between both tables, (i.e., perform a hash join). There is no need to read the entire large data set into memory, so Impala uses a 1 GB buffer to read the large table and perform the joining part by part.

Figures A-1 and A-2 show how this works. Figure A-1 shows how each daemon will cache the smaller data set. While this join strategy is simple, it requires that the join occur with at least one small table.

hdaa aain01
Figure A-1. Smaller table caching in broadcast joins

It’s important to note that:

  • The smaller data set is now taking up memory on every node. So if you have three nodes with 50 GB of Impala memory, the smaller data set in a broadcast ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.

Read now

Unlock full access

More than 5,000 organizations count on O’Reilly

AirBnbBlueOriginElectronic ArtsHomeDepotNasdaqRakutenTata Consultancy Services

QuotationMarkO’Reilly covers everything we've got, with content to help us build a world-class technology community, upgrade the capabilities and competencies of our teams, and improve overall team performance as well as their engagement.
Julian F.
Head of Cybersecurity
QuotationMarkI wanted to learn C and C++, but it didn't click for me until I picked up an O'Reilly book. When I went on the O’Reilly platform, I was astonished to find all the books there, plus live events and sandboxes so you could play around with the technology.
Addison B.
Field Engineer
QuotationMarkI’ve been on the O’Reilly platform for more than eight years. I use a couple of learning platforms, but I'm on O'Reilly more than anybody else. When you're there, you start learning. I'm never disappointed.
Amir M.
Data Platform Tech Lead
QuotationMarkI'm always learning. So when I got on to O'Reilly, I was like a kid in a candy store. There are playlists. There are answers. There's on-demand training. It's worth its weight in gold, in terms of what it allows me to do.
Mark W.
Embedded Software Engineer

You might also like

Apache Hadoop 3 Quick Start Guide

Apache Hadoop 3 Quick Start Guide

Hrishikesh Vijay Karambelkar
Architecting HBase Applications

Architecting HBase Applications

Jean-Marc Spaggiari, Kevin O'Dell

Publisher Resources

ISBN: 9781491910313Errata Page