Data preprocessing in Spark

So far, we've seen how to load text data from the local filesystem and from HDFS. Text files can contain either unstructured data (like a text document) or structured data (like a CSV file). For semi-structured data, such as files containing JSON objects, Spark has special routines that transform a file into a DataFrame, similar to the data frame in R and the DataFrame in the Python package pandas. DataFrames closely resemble RDBMS tables, in that each one has a defined schema.
