July 2017
Intermediate to advanced
796 pages
18h 55m
English
Let's start with loading, parsing, and viewing simple flight data. First, download the NYC flights dataset as a CSV file from https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv. Now let's load and parse the dataset using the CSV reader API of PySpark:
# Creating DataFrame from data file in CSV format
df = spark.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load("data/nycflights13.csv")
This is pretty similar to reading the libsvm format. Now you can see the resulting DataFrame's structure as follows:
df.printSchema()
The output is as follows:

Now let's ...