Data preprocessing in Spark

So far, we've seen how to load text data from the local filesystem and from HDFS. Text files can contain either unstructured data (like a text document) or structured data (like a CSV file). For semi-structured data, such as files containing JSON objects, Spark has special routines that transform a file into a DataFrame, similar to the data frame in R and the DataFrame in the Python package pandas. DataFrames closely resemble RDBMS tables in that each one has a defined schema.

