3

An Introduction to Apache Spark and Its APIs – DataFrame, Dataset, and Spark SQL

Apache Spark is written in Scala and has become the dominant distributed data processing framework because it can ingest, enrich, and prepare data at scale for analytical use cases. As a data engineer, you will eventually have to work with data volumes that cannot be processed on a single machine. This chapter teaches you how to use Spark and its various APIs to process that data on a cluster of machines.

In this chapter, we’re going to cover the following main topics:

  • Working with Apache Spark
  • Creating a Spark application using Scala
  • Understanding the Spark Dataset API
  • Understanding the Spark DataFrame API

Technical requirements

Please refer ...

