Using DataFrames with SparkR
The following steps explore further DataFrame operations in SparkR by analyzing a New York flights dataset:
- As a first step, let's download the flights data and copy it to HDFS:
[cloudera@quickstart ~]$ wget https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv --no-check-certificate
[cloudera@quickstart ~]$ hadoop fs -put nycflights13.csv flights.csv
- Start the SparkR shell and create a DataFrame using the CSV data source. When R asks for a mirror while installing packages, pick an HTTP location near you:
[cloudera@quickstart ~]$ cd spark-2.0.0-bin-hadoop2.7/
[cloudera@quickstart spark-2.0.0-bin-hadoop2.7]$ bin/sparkR
> install.packages("magrittr", dependencies = TRUE)
> library(magrittr)
> flights <- read.df("flights.csv",source="csv", ...
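The remaining arguments of the `read.df` call are not shown in the listing. As a minimal sketch of how a CSV file is commonly loaded and inspected in the SparkR shell, the following assumes that the file has a header row and that schema inference is wanted. Both options, and the variable name `flights_sketch`, are illustrative assumptions and not necessarily what the original call uses.

```r
# Run inside the sparkR shell, where a Spark session already exists.
# Assumption: flights.csv has a header row. The header and inferSchema
# options belong to Spark's CSV data source and are chosen here for
# illustration only; the original call's arguments are not known.
flights_sketch <- read.df("flights.csv", source = "csv",
                          header = "true", inferSchema = "true")

printSchema(flights_sketch)   # column names and inferred types
head(flights_sketch, 5)       # first five rows as a local R data.frame
count(flights_sketch)         # number of rows in the DataFrame
```

Checking the schema right after loading is a cheap way to confirm that numeric columns were not read as strings, which is what happens when schema inference is left off.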