Analytics with DataFrames
Let's learn how to create and use DataFrames for Big Data Analytics. For a quick and easy-to-follow introduction, the code in this chapter uses the pyspark shell. The data needed for the exercises in this chapter can be found at https://github.com/apache/spark/tree/master/examples/src/main/resources. You can always create multiple data formats by reading one type of data file; for example, once you read a .json file, you can write the data out in Parquet, ORC, or other formats.
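As a quick illustration, here is a minimal sketch of that round trip in the pyspark shell. It assumes the people.json file from the resources directory above has been copied to your working directory, and that the shell's entry point is spark (Spark 2.x); on Spark 1.x the entry point is sqlContext, and writing ORC may require Hive support.

# In the pyspark shell the SparkSession is already available as `spark`
# (on Spark 1.x use `sqlContext` instead).
df = spark.read.json("people.json")   # path is an assumption; adjust to your copy

df.show()                             # inspect the inferred schema and rows

# Write the same data back out in other formats.
df.write.parquet("people_parquet")
df.write.orc("people_orc")            # ORC output may require Hive support on Spark 1.x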
Note
All programs in this chapter are executed on the CDH 5.8 VM, except the programs in the DataFrame based Spark-on-HBase connector section, which are executed on HDP 2.5. In other environments, file paths might change, but the concepts are the same.