Data preprocessing in Spark
So far, we've seen how to load text data from the local filesystem and HDFS. Text files can contain either unstructured data (like a text document) or structured data (like a CSV file). As for semi-structured data, such as files containing JSON objects, Spark has special routines that can transform a file into a DataFrame, similar to the DataFrames in R and Python pandas. DataFrames closely resemble RDBMS tables, in that they have a defined schema.
JSON files and Spark DataFrames
To import JSON-compliant files, we first create a SQL context by instantiating a SQLContext object from the local Spark context:
In: from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
Now, let's see the content of a small JSON file (it's ...
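Once such a file is available, a minimal sketch of loading it into a DataFrame and inspecting the result might look like the following (the file name users.json and its contents are hypothetical, assuming one JSON object per line):

In: # Hypothetical file: each line holds one JSON object, e.g.
    # {"name": "Alice", "age": 34}
    df = sqlContext.read.json("users.json")
    df.printSchema()   # shows the inferred schema, much like an RDBMS table definition
    df.show()          # prints the rows of the DataFrame

Spark infers the schema (column names and types) directly from the JSON keys and values, so no explicit schema definition is needed for a quick inspection.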