CSV files and Spark DataFrames

We start by showing how to read CSV files and transform them into Spark DataFrames. Just follow the steps in this example:

  1. To import CSV-compliant files, we first need to create a SQL context by instantiating an SQLContext object from the local SparkContext:
In: from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
  2. For our example, we create a simple CSV file: a table with six rows and three columns, in which some attributes are missing (such as the gender attribute for the user with user_id=0):
In: data = """balance,gender,user_id
    10.0,,0
    1.0,M,1
    -0.5,F,2
    0.0,F,3
    5.0,,4
    3.0,M,5
    """
    with open("users.csv", "w") as output:
        output.write(data)
  3. Using the read.format method provided ...
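
Before handing the file to Spark, it can help to confirm that it has the shape described in the second step. The following is a minimal sketch in plain Python, using only the standard csv module; it does not involve Spark, and the variable names rows, columns, and missing_gender are introduced here purely for illustration:

```python
import csv

# Same contents as the users.csv file created in the second step.
data = """balance,gender,user_id
10.0,,0
1.0,M,1
-0.5,F,2
0.0,F,3
5.0,,4
3.0,M,5
"""

with open("users.csv", "w", newline="") as output:
    output.write(data)

# Read the file back; DictReader takes the column names from the first line.
with open("users.csv", newline="") as source:
    rows = list(csv.DictReader(source))

columns = list(rows[0].keys())

# The csv module returns missing fields as empty strings,
# so an empty gender marks a row with a missing attribute.
missing_gender = [row["user_id"] for row in rows if row["gender"] == ""]

print(columns)         # ['balance', 'gender', 'user_id']
print(len(rows))       # 6
print(missing_gender)  # ['0', '4']
```

The check confirms six data rows, three columns, and a missing gender for the users with user_id 0 and 4.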
