Analytics with DataFrames

Let's learn how to create and use DataFrames for Big Data Analytics. For ease of understanding and quick examples, the pyspark shell is used for the code in this chapter. The data needed for the exercises in this chapter can be found at https://github.com/apache/spark/tree/master/examples/src/main/resources. You can always produce multiple data formats from a single input file: for example, once you read a .json file, you can write the data out in Parquet, ORC, or other formats, as the sketch below shows.
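The following is a minimal sketch of that idea, assuming a Spark 2.x pyspark shell where a SparkSession named spark is already available, and assuming the people.json file from the Spark examples resources directory has been copied to the working directory (file names and paths are illustrative):

```python
# Read a JSON file into a DataFrame (assumes people.json is in the current directory).
df = spark.read.json("people.json")

# Inspect the inferred schema and a few rows.
df.printSchema()
df.show()

# Write the same data back out in other columnar formats.
df.write.parquet("people.parquet")
df.write.orc("people.orc")
```

In an older Spark 1.x shell, the equivalent entry point would be sqlContext rather than spark, but the read/write API calls are the same.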

Note

All programs in this chapter are executed on the CDH 5.8 VM, except the programs in the DataFrame-based Spark-on-HBase connector section, which are executed on HDP 2.5. For other environments, file paths might change, but the concepts are the same ...
