Using DataFrames with SparkR

The following steps explore further DataFrame operations in SparkR by analyzing a New York flights dataset:

  1. As a first step, let's download the flights data and copy it to HDFS:
    [cloudera@quickstart ~]$ wget https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv --no-check-certificate
    
    [cloudera@quickstart ~]$ hadoop fs -put nycflights13.csv flights.csv
    
  2. Start the SparkR shell and create a DataFrame using the CSV data source. When install.packages prompts for a CRAN mirror, choose an HTTP mirror near you:
    [cloudera@quickstart ~]$ cd spark-2.0.0-bin-hadoop2.7/
    [cloudera@quickstart spark-2.0.0-bin-hadoop2.7]$ bin/sparkR
    
    > install.packages("magrittr", dependencies = TRUE)
    > library(magrittr)
    
    > flights <- read.df("flights.csv",source="csv", ...
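
As a sanity check around step 1, it can help to confirm that the download succeeded before copying the file to HDFS. The following is a minimal sketch and not part of the original walkthrough: `check_csv` is a hypothetical helper name, and it assumes the file is a newline-terminated CSV with a single header row.

```shell
# Hypothetical helper (not from the original text): verify that a CSV file
# exists and is non-empty, then print its header line and data-row count.
# Assumes one header row and newline-terminated lines.
check_csv() {
  file=$1
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  echo "header: $(head -n 1 "$file")"
  echo "rows: $(( $(wc -l < "$file") - 1 ))"
}

# Example use with the commands from step 1 (requires a Hadoop client):
#   check_csv nycflights13.csv && hadoop fs -put nycflights13.csv flights.csv
#   hadoop fs -ls flights.csv
```

Because `flights.csv` is a relative path, `hadoop fs -ls flights.csv` looks for the uploaded file under the current user's HDFS home directory.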
