Chapter 5. Data Analysis on Spark

Data analytics at scale has been evolving rapidly. Various libraries and tools have been developed for data analysis, offering a rich set of algorithms. In parallel, distributed computing techniques have evolved to process huge datasets at scale. These two lines of development had to converge, and that was the primary intention behind the development of Spark.

The previous two chapters outlined the technology aspects of data science. They covered fundamentals of the DataFrame API, Datasets, and streaming data, and showed how the API facilitates data representation through DataFrames, an abstraction already familiar to R and Python users. After introducing this API, we saw how operating on datasets became easier than ever. ...
