November 2016
Beginner to intermediate
941 pages
21h 55m
English
So far, we've seen how to load text data from the local filesystem and HDFS. Text files can contain either unstructured data (like a text document) or structured data (like a CSV file). For semi-structured data, such as files containing JSON objects, Spark has special routines that transform a file into a DataFrame, similar to the DataFrame in R and Python pandas. DataFrames closely resemble RDBMS tables in that each one has a defined schema.
To import JSON-compliant files, we first need a SQL context, which we obtain by creating a SQLContext object from the local Spark Context:
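To make the idea of semi-structured data concrete, here is a minimal plain-Python sketch (not Spark code, and not how Spark is implemented) of what giving JSON records a schema means. The records and field names are hypothetical. The schema is taken to be the union of all fields seen, and a field missing from a record becomes None, much as Spark's JSON reader fills absent fields with null.

```python
import json

# Hypothetical input: one JSON object per line, with fields that vary by record.
raw_lines = [
    '{"name": "Alice", "age": 30}',
    '{"name": "Bob", "city": "Rome"}',
]

records = [json.loads(line) for line in raw_lines]

# The "schema" in this sketch is just the sorted union of all keys seen.
columns = sorted({key for record in records for key in record})

# Each record becomes a fixed-width row; fields it lacks are filled with None.
rows = [tuple(record.get(col) for col in columns) for record in records]

print(columns)
for row in rows:
    print(row)
```

Every row now has the same columns, which is the property that lets the data be treated like a table.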
In: from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
Now, let's see the content of a small JSON file (it's ...