Chapter 5. Partitioning Data

Partitioning is defined as “the act of dividing; separation by the creation of a boundary that divides or keeps apart.” Data partitioning is used in tools like Spark, Amazon Athena, and Google BigQuery to improve query execution performance. To scale out big data solutions, data is divided into partitions that can be managed, accessed, and processed separately and in parallel.

As discussed in previous chapters of this book, Spark splits data into smaller chunks, called partitions, and then processes these partitions in a parallel fashion (many partitions can be processed concurrently) using executors on the worker nodes. For example, if your input has 100 billion records, then Spark might split it into 10,000 partitions, where each partition will have about 10 million elements:

  • Total records: 100,000,000,000

  • Number of partitions: 10,000

  • Number of elements per partition: 10,000,000

  • Maximum possible parallelism: 10,000
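The arithmetic above can be sketched in a few lines of plain Python. This is an illustrative stand-in (the function name `make_partitions` is made up for this sketch, not a Spark API), scaled down from 100 billion records to 1,000 so it runs instantly; Spark performs the equivalent splitting internally across executors.

```python
# Sketch of the idea: split an input into roughly equal partitions
# that could then be processed independently and in parallel.
# (Numbers scaled down from the 100-billion-record example above.)

def make_partitions(records, num_partitions):
    """Split records into num_partitions equal-sized chunks."""
    size = len(records) // num_partitions
    return [records[i * size:(i + 1) * size] for i in range(num_partitions)]

data = list(range(1000))           # stand-in for 100 billion records
parts = make_partitions(data, 10)  # 10 partitions of 100 elements each

print(len(parts))      # number of partitions: 10
print(len(parts[0]))   # elements per partition: 100
```

With 10 partitions, up to 10 chunks can be processed concurrently, which mirrors the "maximum possible parallelism" figure in the list above.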

Note

By default, Spark implements hash-based partitioning with a HashPartitioner, which uses Java’s Object.hashCode() function.
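A minimal pure-Python sketch of what hash-based partitioning does: each key is assigned to the partition numbered `hash(key) % numPartitions`. Spark's `HashPartitioner` applies the same idea on the JVM using the key's `hashCode()`; the function below is illustrative, not a Spark API.

```python
# Hash partitioning in miniature: a key always lands in the same
# partition, and partition indices stay within [0, num_partitions).

def hash_partition(key, num_partitions):
    """Return the partition index for a key, HashPartitioner-style.

    Python's % operator yields a non-negative result for a positive
    modulus, so the index is always a valid partition number.
    """
    return hash(key) % num_partitions

# Identical keys always map to the same partition, so all records
# for a given key end up together -- the property that makes
# operations like groupByKey and reduceByKey work.
keys = ["France", "Brazil", "Japan", "France"]
assignments = {k: hash_partition(k, 4) for k in keys}
```

Note that Python randomizes string hashes per process (`PYTHONHASHSEED`), so the specific indices vary between runs, but within one run the mapping is deterministic, which is the property partitioning relies on.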

Partitioning data can improve manageability and scalability, reduce contention, and optimize performance. Suppose you have hourly temperature data for cities in every country in the world (7 continents and 195 countries), and the goal is to query and analyze data for a given continent, country, or set of countries. If you do not partition your data accordingly, for each query you’ll have to load, ...
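To make the benefit concrete, the sketch below mimics the Hive-style `key=value` directory layout that Spark produces when you write with `df.write.partitionBy("continent", "country")`. The sample rows are made up for illustration; the point is that a query restricted to one country only has to read the matching "directory" rather than the entire dataset.

```python
# Pure-Python sketch of a Hive-style partitioned layout, the scheme
# Spark's DataFrameWriter.partitionBy() uses on disk. Sample data
# is invented for illustration.
from collections import defaultdict

rows = [
    {"continent": "Europe", "country": "France", "city": "Paris", "temp": 21.5},
    {"continent": "Europe", "country": "Italy",  "city": "Rome",  "temp": 24.0},
    {"continent": "Asia",   "country": "Japan",  "city": "Tokyo", "temp": 19.2},
]

# Group rows under paths like "continent=Europe/country=France"
layout = defaultdict(list)
for row in rows:
    path = f"continent={row['continent']}/country={row['country']}"
    layout[path].append(row)

# A query for France reads only one partition directory,
# instead of scanning every record in the dataset.
france = layout["continent=Europe/country=France"]
```

In real Spark, a filter such as `WHERE country = 'France'` against data written this way triggers partition pruning: only the files under the matching directory are loaded.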
