We start by showing how to read CSV files and turn them into Spark DataFrames. Follow the steps in this example:
- To import CSV-compliant files, we first need a SQL context, which we create by instantiating an SQLContext object from the local SparkContext:
In:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
- For our example, we created a simple CSV file: a table with six rows and three columns in which some attributes are missing (such as the gender attribute for the user with user_id=0):
In:
data = """balance,gender,user_id
10.0,,0
1.0,M,1
-0.5,F,2
0.0,F,3
5.0,,4
3.0,M,5
"""
with open("users.csv", "w") as output:
    output.write(data)
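Before handing the file to Spark, it can help to confirm that it really has the shape described above. The following is a minimal sketch that uses only Python's standard csv module, not Spark; the names rows and missing_gender are illustrative and not part of the original example. It rewrites the same users.csv so that it can run on its own, then reads the file back and lists the users whose gender field is empty.

```python
import csv

# Same content as the users.csv file written in the step above.
data = """balance,gender,user_id
10.0,,0
1.0,M,1
-0.5,F,2
0.0,F,3
5.0,,4
3.0,M,5
"""

with open("users.csv", "w") as output:
    output.write(data)

# Read the file back with the standard library (no Spark involved).
with open("users.csv", newline="") as source:
    rows = list(csv.DictReader(source))

# In this file, an empty string marks a missing attribute.
missing_gender = [row["user_id"] for row in rows if row["gender"] == ""]

print(len(rows), "rows,", len(rows[0]), "columns")
print("user_ids with missing gender:", missing_gender)
```

With the data shown, this should report six rows and three columns, with the gender attribute missing for user_id 0 and user_id 4. Note that csv.DictReader returns every field as a string, so the user_id values are printed as '0' and '4'.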
- Using the read.format method provided ...