Python: Real World Machine Learning

by Prateek Joshi, John Hearty, Bastiaan Sjardin, Luca Massaron, Alberto Boschetti
November 2016
Beginner to intermediate
941 pages
21h 55m
English
Packt Publishing

Data preprocessing in Spark

So far, we've seen how to load text data from the local filesystem and HDFS. Text files can contain either unstructured data (such as a text document) or structured data (such as a CSV file). For semi-structured data, such as files containing JSON objects, Spark provides special routines that transform a file into a DataFrame, similar to the DataFrame in R and Python pandas. DataFrames closely resemble RDBMS tables in that each one has a defined schema.
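To make the idea of semi-structured records gaining a schema more concrete, here is a minimal pure-Python sketch. It is not Spark's implementation and does not use Spark at all; it only illustrates, under simplifying assumptions, how JSON objects with differing keys can be aligned into table-like rows. The sample records and the helper names `infer_columns` and `to_rows` are hypothetical, introduced purely for illustration.

```python
import json

# Hypothetical sample data: one JSON object per line, with differing keys.
raw_lines = [
    '{"name": "Alice", "age": 30}',
    '{"name": "Bob", "city": "Rome"}',
    '{"name": "Carol", "age": 25, "city": "Oslo"}',
]


def infer_columns(records):
    """Return the sorted union of keys found across all records."""
    columns = set()
    for record in records:
        columns.update(record.keys())
    return sorted(columns)


def to_rows(records, columns):
    """Align every record to the same columns, filling gaps with None."""
    return [tuple(record.get(col) for col in columns) for record in records]


# Parse each line into a dict, then derive a shared set of columns.
records = [json.loads(line) for line in raw_lines]
columns = infer_columns(records)
rows = to_rows(records, columns)

print(columns)
for row in rows:
    print(row)
```

Each record ends up with the same columns, and fields missing from a given JSON object appear as `None`, much as missing values appear as nulls in a table with a fixed schema.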

JSON files and Spark DataFrames

To import JSON-compliant files, we first need a SQL context, which we obtain by creating a SQLContext object from the local Spark Context:

In: from pyspark.sql import SQLContext
# sc is the existing local SparkContext
sqlContext = SQLContext(sc)

Now, let's see the content of a small JSON file (it's ...

ISBN: 9781787123212