Analyzing Parquet files using Spark
Parquet is a columnar data file format that is used extensively. In this recipe, we are going to take a look at how to access this data from Spark and then process it.
Getting ready
To perform this recipe, you should have Hadoop and Spark installed. You also need Scala installed; I am using Scala 2.11.0.
How to do it...
Spark supports accessing Parquet files through its SQL context, which you can use to both read and write Parquet files. In this recipe, we are going to take a look at how to read a Parquet file from HDFS and process it.
First of all, download the sample Parquet file, users.parquet, from https://github.com/deshpandetanmay/hadoop-real-world-cookbook/blob/master/data/users.parquet and store it in the /parquet path in HDFS.
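The following is a minimal sketch of the read-and-query step, assuming a Spark 1.x installation (to match Scala 2.11.0) and that users.parquet has been copied into /parquet on HDFS (for example, with hdfs dfs -put users.parquet /parquet/). The column names name and favorite_color follow the well-known Spark sample users.parquet file; adjust them to whatever schema your file actually has.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetAnalysis {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ParquetAnalysis")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Read the Parquet file into a DataFrame; the schema comes
    // from the Parquet file's own metadata, so nothing needs to
    // be declared up front.
    val users = sqlContext.read.parquet("/parquet/users.parquet")

    // Inspect the inferred schema and a few rows.
    users.printSchema()
    users.show()

    // Register the DataFrame as a temporary table so it can be
    // queried with SQL through the same SQL context.
    users.registerTempTable("users")
    val favorites = sqlContext.sql(
      "SELECT name, favorite_color FROM users")
    favorites.show()

    // Writing works through the same context: save the query
    // result back to HDFS as Parquet.
    favorites.write.parquet("/parquet/favorites")

    sc.stop()
  }
}

You can package this and run it with spark-submit, or paste the body into spark-shell, where sc (and, in Spark 1.x, sqlContext) are already created for you.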