SparkR DataFrames

In this section, we try out some useful, commonly used operations. First, we try out the traditional R/dplyr operations and then show equivalent operations using the SparkR API:

> //Open the R shell and NOT SparkR shell  
> library(dplyr,warn.conflicts=FALSE)  //Load dplyr first 
//Perform a common, useful operation  
> iris %>%               
+   group_by(Species) %>% +   summarise(avg_length = mean(Sepal.Length),  
+             avg_width = mean(Sepal.Width)) %>% +   arrange(desc(avg_length)) 
Source: local data frame [3 x 3] 
     Species avg_length avg_width 
      (fctr)      (dbl)     (dbl) 
1  virginica      6.588     2.974 
2 versicolor      5.936     2.770 
3     setosa      5.006     3.428 
 
//Remove from R environment 
> detach("package:dplyr",unload=TRUE) 

This operation is very similar to the SQL group and is followed ...

Get Spark for Data Science now with the O’Reilly learning platform.

O’Reilly members experience live online training, plus books, videos, and digital content from nearly 200 publishers.