Chapter 2. Getting Started with Apache Spark DataFrames

In this chapter, we will cover the following recipes:

  • Getting Apache Spark
  • Creating a DataFrame from CSV
  • Manipulating DataFrames
  • Creating a DataFrame from Scala case classes

Introduction

Apache Spark is a cluster computing platform that claims to run up to 100 times faster than Hadoop MapReduce in memory, and about 10 times faster on disk. In general terms, we can consider it a means to run our complex logic over massive amounts of data at blazing speed. The other good thing about Spark is that the programs we write are much smaller than the typical MapReduce classes we write for Hadoop. So, not only do our programs run faster, but they also take less time to write.
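To give a feel for that conciseness, here is a sketch of the classic word count, which takes dozens of lines as a Hadoop MapReduce job but only a handful in Spark. It assumes a `SparkContext` named `sc` is already available (as it is in the `spark-shell`) and that an illustrative file `input.txt` exists; both names are assumptions, not from this book.

```scala
// Word count in Spark: read a text file, split lines into words,
// pair each word with 1, and sum the counts per word.
val counts = sc.textFile("input.txt")
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Print a sample of the (word, count) pairs.
counts.take(10).foreach(println)
```

The same logic in Hadoop MapReduce requires separate mapper and reducer classes plus driver boilerplate, which is the size difference the paragraph above refers to.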

Spark has four major higher-level tools built on top of the Spark core engine: Spark SQL (for structured data and DataFrames), Spark Streaming, MLlib (machine learning), and GraphX (graph processing).
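Since the recipes in this chapter revolve around DataFrames, here is a minimal sketch of loading a CSV file into a DataFrame. It uses the `SparkSession` entry point from Spark SQL; the file name `people.csv` and the application name are placeholders for illustration only, and the CSV recipe later in this chapter covers the details.

```scala
import org.apache.spark.sql.SparkSession

// Build a local SparkSession (the entry point to Spark SQL).
val spark = SparkSession.builder()
  .appName("csv-example")   // hypothetical app name
  .master("local[*]")       // run locally, using all cores
  .getOrCreate()

// Read a CSV with a header row, letting Spark infer column types.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("people.csv")        // hypothetical input file

df.printSchema()  // inspect the inferred schema
df.show(5)        // preview the first few rows
```

Running this in the `spark-shell` (where a `SparkSession` already exists as `spark`) lets you skip the builder step entirely.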
