Data Algorithms with Spark
by Mahmoud Parsian
April 2022
Content level: Intermediate to advanced
435 pages
9h 44m
English
O'Reilly Media, Inc.

Chapter 5. Partitioning Data

Partitioning is defined as “the act of dividing; separation by the creation of a boundary that divides or keeps apart.” Data partitioning is used in tools like Spark, Amazon Athena, and Google BigQuery to improve query execution performance. To scale out big data solutions, data is divided into partitions that can be managed, accessed, and executed separately and in parallel.

As discussed in previous chapters of this book, Spark splits data into smaller chunks, called partitions, and then processes these partitions in a parallel fashion (many partitions can be processed concurrently) using executors on the worker nodes. For example, if your input has 100 billion records, then Spark might split it into 10,000 partitions, where each partition will have about 10 million elements:

  • Total records: 100,000,000,000

  • Number of partitions: 10,000

  • Number of elements per partition: 10,000,000

  • Maximum possible parallelism: 10,000
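
The arithmetic behind these figures can be checked with a few lines of plain Python. This is only a sketch: the function name and the figure of 2,000 executor cores are hypothetical and do not come from the text. It assumes each partition is processed by one task and each task occupies one core, so the number of tasks running at once cannot exceed either the partition count or the core count.

```python
import math


def partition_summary(total_records, num_partitions, total_cores):
    """Summarize how evenly sized partitions map onto the available cores."""
    # Records per partition, rounded up so no record is left unassigned.
    records_per_partition = math.ceil(total_records / num_partitions)
    # Upper bound on parallelism: at most one running task per partition.
    max_parallelism = num_partitions
    # Actual concurrency is also capped by the cores available (assumed: 1 core per task).
    concurrent_tasks = min(num_partitions, total_cores)
    # Number of scheduling "waves" needed to work through every partition.
    waves = math.ceil(num_partitions / concurrent_tasks)
    return {
        "records_per_partition": records_per_partition,
        "max_parallelism": max_parallelism,
        "concurrent_tasks": concurrent_tasks,
        "waves": waves,
    }


# Figures from the example above; total_cores=2_000 is a hypothetical cluster size.
summary = partition_summary(
    total_records=100_000_000_000,
    num_partitions=10_000,
    total_cores=2_000,
)
print(summary)
```

In a real PySpark job, the partition count of an RDD can be read with `rdd.getNumPartitions()` rather than computed by hand.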

Note

By default, Spark implements hash-based partitioning with a HashPartitioner, which uses Java’s Object.hashCode() function.
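
To make the idea of hash-based partitioning concrete, the following sketch mimics it in plain Python. It is not Spark's own code: `java_string_hash` and `partition_for` are hypothetical helper names, the hash reproduces Java's `String.hashCode()` formula for string keys only (characters in the Basic Multilingual Plane), and the partition is assumed to be the hash code taken modulo the number of partitions, kept non-negative.

```python
def java_string_hash(s):
    """Reproduce Java's String.hashCode(): h = 31*h + char, with 32-bit overflow."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep only the low 32 bits
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return h - 0x100000000 if h >= 0x80000000 else h


def partition_for(key, num_partitions):
    """Assign a key to a partition index in the range [0, num_partitions)."""
    # Python's % already returns a non-negative result for a positive modulus,
    # so negative hash codes still map to a valid partition index.
    return java_string_hash(key) % num_partitions


# Equal keys always land in the same partition, which is what lets
# key-based operations gather all values for a key in one place.
for key in ["hello", "a", "hello"]:
    print(key, partition_for(key, 10))
```

Because the assignment depends only on the key and the partition count, changing the number of partitions changes where keys land.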

Partitioning data can improve manageability and scalability, reduce contention, and optimize performance. Suppose you have hourly temperature data for cities in all the countries in the world (7 continents and 195 countries), and the goal is to query and analyze data for a given continent, country, or set of countries. If you do not partition your data accordingly, for each query you’ll have to load, ...
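
The benefit of partitioning by continent and country can be illustrated with a small plain-Python sketch. Everything in it is hypothetical: the sample readings, the field names, and the helpers `partition_by` and `query` are invented for illustration. The point it shows is that once records are grouped by (continent, country), a query for one country only has to touch the matching group instead of every record.

```python
from collections import defaultdict


def partition_by(records, keys):
    """Group records into partitions identified by the values of the given keys."""
    partitions = defaultdict(list)
    for rec in records:
        part_key = tuple(rec[k] for k in keys)
        partitions[part_key].append(rec)
    return dict(partitions)


def query(partitions, continent, country=None):
    """Return matching rows and how many rows had to be read to find them."""
    hits, scanned = [], 0
    for (cont, ctry), rows in partitions.items():
        # Skip whole partitions that cannot contain matching rows.
        if cont != continent or (country is not None and ctry != country):
            continue
        scanned += len(rows)
        hits.extend(rows)
    return hits, scanned


# Hypothetical sample data: hourly temperature readings.
readings = [
    {"continent": "Europe", "country": "France", "city": "Paris", "hour": 0, "temp_c": 11.5},
    {"continent": "Europe", "country": "France", "city": "Lyon", "hour": 0, "temp_c": 12.0},
    {"continent": "Europe", "country": "Spain", "city": "Madrid", "hour": 0, "temp_c": 15.2},
    {"continent": "Asia", "country": "Japan", "city": "Tokyo", "hour": 0, "temp_c": 9.8},
]

partitions = partition_by(readings, ["continent", "country"])
hits, scanned = query(partitions, "Europe", "France")
print(f"matched {len(hits)} rows after reading {scanned} of {len(readings)}")
```

In PySpark, a comparable on-disk layout can be produced with `DataFrameWriter.partitionBy("continent", "country")`, which writes one directory per combination of partition values.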

ISBN: 9781492082378