Storing data as Parquet files

Parquet (https://parquet.apache.org/) is rapidly becoming the go-to data storage format in the world of big data because of the distinct advantages it offers:

  • It has a column-based representation of data. This is better represented in a picture, as follows:
    [Figure: Storing data as Parquet files]

    As you can see in the preceding figure, Parquet stores data in chunks of rows, say 100 rows each. In Parquet terms, these are called RowGroups. Each RowGroup contains chunks of columns (column chunks). A column chunk can hold more than a single unit of data for a particular column (as represented by the blue box in the first column). For example. ...
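
The RowGroup and column chunk layout described above can be mimicked with a small in-memory model. What follows is a minimal Scala sketch, not the Parquet library's API: the names RowGroupSketch, ColumnChunk, RowGroup, and toRowGroups are hypothetical, rows are simplified to maps of strings, and the encoding, compression, pages, and metadata that real Parquet files carry are left out.

```scala
// Toy model of Parquet's layout for illustration only; it writes no files.
object RowGroupSketch {
  // Simplifying assumption: a row maps each column name to a string value.
  type Row = Map[String, String]

  // All values of one column that fall inside a single row group.
  final case class ColumnChunk(column: String, values: Vector[String])

  // A horizontal slice of the data, stored column by column.
  final case class RowGroup(chunks: Vector[ColumnChunk])

  // Split the rows into slices of `rowsPerGroup`, then pivot each slice
  // so that the values of every column sit next to each other.
  // Assumes every row contains every name listed in `columns`.
  def toRowGroups(
      rows: Vector[Row],
      columns: Vector[String],
      rowsPerGroup: Int
  ): Vector[RowGroup] = {
    require(rowsPerGroup > 0, "rowsPerGroup must be positive")
    rows
      .grouped(rowsPerGroup)
      .map { slice =>
        RowGroup(columns.map(c => ColumnChunk(c, slice.map(row => row(c)))))
      }
      .toVector
  }
}
```

Calling RowGroupSketch.toRowGroups(rows, Vector("id", "name"), 100) on a few hundred rows would yield one RowGroup per 100 rows, each holding an "id" chunk and a "name" chunk. Keeping a column's values together is what lets a reader fetch a single column without touching the others.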
