
Chapter 1. Introduction to Unified Analytics with Apache Spark

In this chapter, we'll chart the course of Apache Spark's short evolution: its genesis, inspiration, and adoption in the community as a de facto big data unified processing engine.

If you are familiar with its history and high-level concepts, you can skip this chapter.

Today, most big data practitioners—data engineers or data scientists—either use it at scale for unified analytics or are on course to use it. And with deep learning frameworks integrated as first-class citizens in Apache Spark 2.4, many machine learning developers are adopting it as well.

But first, let’s chart a brief history of distributed computing at scale.

The Genesis of Big Data and Distributed Computing at Google

When you think of scale today, you can't help but think of Google's search engine's ability to index and search the world's data on the internet at lightning speed. The name Google is synonymous with scale. In fact, Google is a deliberate misspelling of the mathematical term googol: a 1 followed by 100 zeros!

Neither traditional storage systems such as RDBMSs nor imperative ways of programming could handle the scale at which Google wanted to build and search the internet's indexed documents. This necessity led to the creation of the Google File System1 (GFS), MapReduce2 (MR), and Bigtable.3

While GFS provided a fault-tolerant, distributed file system across many commodity hardware servers in a cluster farm, Bigtable offered scalable storage of structured data across GFS, and MR introduced a new parallel programming paradigm, based on functional programming, for processing data at scale distributed over GFS and Bigtable.

In essence, your MR applications interact with a master, which sends the computation code (mappers and reducers) to where the data resides, favoring data locality and cluster rack affinity rather than bringing data to your application.

The workers aggregate and reduce the intermediate computations and send the final reduced results back to the master, which then returns them to your application. This significantly reduces network traffic and keeps most of the I/O local to disk.
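
To make the paradigm concrete, here is a minimal, purely local sketch of the map and reduce phases using the classic word count example. It runs in plain Python with no Hadoop cluster, and the function names are illustrative rather than part of any MapReduce API; in a real MR job, the framework performs the shuffle and runs the mappers and reducers on the nodes that hold the data.

from collections import defaultdict

# map phase: emit (word, 1) pairs for every word in an input record
def mapper(line):
    return [(word, 1) for word in line.split()]

# shuffle phase: group intermediate pairs by key (done by the framework in MR)
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# reduce phase: aggregate the list of values for each key
def reducer(key, values):
    return key, sum(values)

records = ["spark makes big data simple", "big data big insights"]
intermediate = [pair for line in records for pair in mapper(line)]
results = [reducer(k, v) for k, v in shuffle(intermediate).items()]
print(results)  # [('spark', 1), ('makes', 1), ('big', 3), ('data', 2), ...]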

Most of the work Google did was proprietary, yet the ideas expressed in the aforementioned three papers4 spurred innovation elsewhere in the open source community, especially at Yahoo!, which was dealing with similar big data scaling challenges for its search engine.

Hadoop at Yahoo!

The computational challenges and solutions expressed in Google's GFS and MR papers provided a blueprint for the Hadoop Distributed File System (HDFS)5, including a MapReduce implementation as a framework for distributed computing over data stored in HDFS. Donated to the Apache Software Foundation6 in April 2006, Apache Hadoop led to a proliferation of related components such as Hadoop Common, MapReduce, HDFS, and Hadoop YARN.7

Although Apache Hadoop had garnered a large open source community of contributors and widespread adoption beyond Yahoo!, and had spawned two open source-based commercial companies (Cloudera and Hortonworks, now merged), the MapReduce framework on HDFS had its shortcomings.

For one, it was hard to manage and administer, with cumbersome operational complexity. Second, its general batch-processing MapReduce API was verbose and required a lot of boilerplate setup code in Java, with brittle fault tolerance. Third, for large jobs over big batches of data, MR writes its intermediate computed results between mappers and reducers to local disk for each subsequent stage of the computation, and this repeated random disk I/O took its toll: large MR jobs could run for hours on end, or even days.

Figure 1-1. Intermittent iteration of reads and writes between map and reduce computations8

And finally, even though Hadoop MR was conducive to large-scale jobs for general batch processing, it fell short for combining other workloads such as machine learning, streaming, or interactive SQL-like queries.

To satisfy these new workloads, engineers developed bespoke systems such as Apache Hive, Apache Storm, Impala, Giraph, Drill, and Mahout, each with their own APIs and cluster configurations, further adding to Hadoop's operational complexity and to developers' steep learning curve.

Recall Alan Kay's adage: "Simple things should be simple, complex things should be possible." Was it possible, then, to make Hadoop and MR simpler and faster?

Spark’s Early Years at AMPLab

Researchers at UC Berkeley's AMPLab who had previously worked on Hadoop MapReduce entertained this question. They found that MR was inefficient for interactive or iterative computing jobs and a complex framework to learn.

So from the outset they embraced the idea of making Spark simpler, faster, and easier to use. This endeavor started in 2009 at the RAD Lab, which later became the AMPLab (and is now the RISELab).

Early papers published on Spark9 demonstrated that it was 10-20x faster than Hadoop MapReduce for certain jobs. Today, it’s orders of magnitude faster.10

The central thrust of the Spark project was to bring in ideas borrowed from Hadoop MapReduce but to enhance them: make the engine highly fault-tolerant and embarrassingly parallel; support in-memory storage for intermediate results between iterative and interactive map and reduce computations; offer easy and composable APIs in multiple languages as a programming model; and support other workloads in a unified manner. We will return to the idea of unification shortly, after we explain what Spark is.

Figure 1-2. Intermittent Spark operations of reads and writes between map and reduce computations in memory.11

By 2013, Spark had garnered widespread use within and outside the AMPLab, and some of its original creators and researchers—Matei Zaharia, Ali Ghodsi, Reynold Xin, Patrick Wendell, Ion Stoica, and Andy Konwinski—formed a company called Databricks and donated the Spark project to the Apache Software Foundation (ASF), a vendor-neutral non-profit organization.

Subsequently, Databricks and the community of open source developers worked to release Apache Spark 1.012 on May 11, 2014, under the governance of the ASF.

This first major release established the momentum for frequent future releases and contributions of notable features to Apache Spark from Databricks and from contributors at over 100 commercial vendors.

What is Apache Spark?

Since its first release, Apache Spark13 has become a lightning-fast unified engine designed for large-scale distributed data processing and machine learning on compute clusters, whether running on-premises in data centers or in the cloud.

It replaced Hadoop MapReduce and, with its in-memory storage for intermediate computations, is much faster. It incorporates libraries with composable APIs for machine learning (MLlib), interactive SQL queries (Spark SQL), stream processing of real-time data (Structured Streaming), and graph processing (GraphX).

But more importantly, it embraces (and continues to embrace) a design philosophy centered on four characteristics:

  1. Speed

  2. Ease of Use

  3. Modularity

  4. Extensibility

Speed

First, Spark's internal implementation benefits immensely from the hardware industry's huge strides in the price and performance of both CPUs and memory. Today's commodity servers come cheap, with hundreds of gigabytes of memory and multiple cores, and the underlying Unix-based operating systems take advantage of efficient multithreading and parallel processing.

Second, Spark builds its query computations as a directed acyclic graph (DAG); its DAG scheduler and Catalyst query optimizer construct an efficient computational graph that is decomposed into tasks and executed in parallel stages across the workers on the cluster. And third, its whole-stage code generation physical execution engine, Tungsten, generates compact code for execution. (We will cover Catalyst and whole-stage code generation in later chapters.)

With all the intermediate results retained in memory and its limited disk I/O, this gives Spark a huge performance boost.
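
You can see what the engine plans for a query yourself. The following Python sketch, which assumes a SparkSession named spark (as provided in the Spark shells), uses the DataFrame explain() method to print the parsed, analyzed, optimized (Catalyst), and physical (Tungsten) plans; the exact output varies by Spark version.

from pyspark.sql import functions as F

# a simple aggregation query over a generated DataFrame
df = spark.range(0, 1000)
query = (df.filter(F.col("id") % 2 == 0)
           .groupBy((F.col("id") % 10).alias("bucket"))
           .count())

# print the logical and physical plans instead of executing the query
query.explain(True)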

Ease of Use

Spark achieves simplicity by providing a fundamental abstraction of a simple logical data structure called a Resilient Distributed Dataset (RDD), upon which all other higher-level structured data abstractions are constructed. By providing a set of transformations and actions as operations, Spark offers a simple programming model that you can use to build parallel applications.

“Using this simple extension [model], Spark can capture a wide range of processing workloads that previously needed separate engines, including SQL, streaming, machine learning, and graph processing14” as library modules.
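
As a small illustration of that programming model, here is a Python sketch that builds an RDD, applies transformations (which are lazy), and triggers execution with actions; it assumes a SparkSession named spark.

# distribute a local collection as an RDD
rdd = spark.sparkContext.parallelize(range(1, 11))

# transformations are lazily evaluated: nothing executes yet
squares = rdd.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# actions trigger the actual parallel computation
print(evens.count())    # 5
print(evens.collect())  # [4, 16, 36, 64, 100]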

Modularity

Spark operations can be applied across many types of workloads and expressed in any of the supported programming languages: Scala, Java, Python, SQL, and R. Spark offers unified libraries with well-documented APIs that include the following core components: Spark SQL, Structured Streaming, Machine Learning (MLlib), and GraphX, combining all these workloads under one engine.

You can write a single Spark application that can do it all. No need for bespoke engines for disparate workloads; no need to learn separate APIs—with Spark, you get a general engine that unifies your analytics.

Extensibility

Spark focuses more on its fast, parallel, in-memory computation engine than on storage. That is, unlike Apache Hadoop, which includes both storage and compute, Spark decouples itself from storage and enables reading data from myriad sources, including Hadoop-supported formats. It does this by reading data into memory as a logical data structure, an RDD or a DataFrame (we will discuss these in later chapters).

As such, it offers the flexibility to read data stored in Apache Hadoop, Apache Cassandra, Apache HBase, MongoDB, Apache Hive, RDBMSs, and other diverse data sources. Its DataFrameReader and DataFrameWriter APIs can also be extended to read data from other sources, such as Apache Kafka, Kinesis, Azure Storage, and Amazon S3, into its logical data abstraction, on which it can operate.
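
For instance, the same read and write APIs work across storage systems and file formats. The following Python sketch assumes a SparkSession named spark; the paths are purely illustrative, and only built-in formats (JSON and Parquet) are shown, since connectors such as Kafka or Cassandra are added as separate packages.

# read JSON data from object storage into a DataFrame (path is illustrative)
events = spark.read.json("s3a://my-bucket/events/*.json")

# write the same data back out in a different format and location
(events.write
       .mode("overwrite")
       .parquet("hdfs://namenode:8020/warehouse/events_parquet"))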

The community of Spark developers maintains a list of third-party Spark packages15 as part of the growing Spark ecosystem. This rich ecosystem of packages includes Spark connectors, extended DataSource readers for external data sources, performance monitors, and more (see Figure 1-3).

This extensibility also allows the engine to run in multiple environments: on your laptop; in the cloud on Azure or AWS; inside Apache Hadoop YARN or Apache Mesos; and, more recently, on Kubernetes.

Why Unified Analytics?

While the notion of unification is not unique to Spark, it has been part of Spark's design philosophy and evolution. In November 2016, the ACM recognized Apache Spark and conferred on its original creators the prestigious ACM Award for their paper presenting Apache Spark as a unified engine for big data processing.16 The award-winning paper notes that Apache Spark replaces the bespoke batch processing, graph, stream, and query engines, such as Apache Storm, Impala, Dremel, and Pregel, with a unified stack of components that addresses diverse workloads under a single distributed, fast engine.

Today, with the amount of data companies have to process, grapple with, and analyze, data practitioners generally can't be burdened with learning siloed engines for specialized workloads. Rather, they want a single unified engine; they want to build their big data applications and data pipelines in a programming language they are familiar with; they prefer simple, composable APIs to program and build with; they want to incorporate their favorite ML toolkits; and they want to be productive.

Spark, with its unified stack of components, aspires to satisfy these requirements. Hence, a unified stack that performs unified analytics fits developers' needs, and many developers in the community would contend that it's hard not to think of Spark when you think of processing big data at scale in a distributed manner.

Apache Spark Components as a Unified Stack

Spark offers four distinct components as libraries for diverse workloads: Spark SQL, Spark MLlib, Spark Structured Streaming, and GraphX. Each component offers simple and composable APIs, and each is separate from Spark's core fault-tolerant engine, in that your Spark code written with these library APIs is translated into a DAG that is executed by the engine. So whether you write your Spark code using the provided Structured APIs in Java, R, Scala, SQL, or Python, the underlying code is decomposed into highly compact bytecode that is executed in the workers' JVMs across the cluster. Let's look at each of these components in more detail.

Figure 1-3. Apache Spark components and APIs stack

Spark SQL

This module works well with structured data, meaning you can read data stored in an RDBMS table or from files with structured data in CSV, text, JSON, Avro, ORC, or Parquet format and then construct permanent or temporary tables in Spark. Also, when using Spark's Structured APIs in Java, Python, Scala, or R, you can issue SQL-like queries against the data you have just read into a Spark DataFrame.17 To date, Spark SQL is ANSI SQL:2003 compliant.18

For example, in this Scala code snippet, you read a JSON file stored in an Amazon S3 bucket, create a temporary table, and issue a SQL-like query on the data read into memory as a Spark DataFrame.

Note

You can issue the same code snippet in Python, R or Java, and the generated bytecode will be identical, resulting in the same performance.

// read data off Amazon S3 bucket into a Spark DataFrame
spark.read.json("s3://apache_spark/data/committers.json")
  .createOrReplaceTempView("committers")
// issue an SQL query and return the result as a Spark DataFrame
val results = spark.sql("""SELECT name, org, module, release, num_commits
    FROM committers WHERE module = 'mllib' AND num_commits > 10
    ORDER BY num_commits DESC""")

Spark MLlib

Spark comes with MLlib, a library containing common machine learning (ML) algorithms. Since Spark's first release, the performance of this library component has improved significantly. Today, MLlib provides data scientists with a comprehensive list of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, along with high-level DataFrame-based APIs to build predictive models.

These APIs allow you to perform featurization (extracting or transforming features), build pipelines (for training and evaluation), and persist models (saving and reloading them) during deployment. Additional utilities include common linear algebra operations and statistics. MLlib also includes other low-level ML primitives, such as a generic gradient descent optimization. Using just the high-level DataFrame-based APIs, this Python code snippet encapsulates the basic operations a typical data scientist may perform when building a model.

Note

Starting with Apache Spark 1.6, the MLlib project was split between two packages: spark.mllib and spark.ml. The latter contains the DataFrame-based APIs, while the former contains the RDD-based APIs, which are now in maintenance mode. All new features and development by the community go into spark.ml.19

This code snippet is just an illustration of the ease with which you can build your models—it feels Pythonic. More extensive examples will be discussed in chapters 11 and 12.

from pyspark.ml.classification import LogisticRegression
...
# Load training and test data
training = spark.read.csv("s3://...")
test = spark.read.csv("s3://...")

# Configure the estimator
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Predict on the test data
lrModel.transform(test)
...

All these new enhancements in spark.ml are designed for developers' ease of use and for scaling training across a Spark cluster.

Spark Structured Streaming

Apache Spark 2.0 introduced an experimental continuous applications model20 and the Structured Streaming APIs,21 built atop the Spark SQL engine and DataFrame-based APIs. By Spark 2.2, Structured Streaming was generally available, meaning that developers could use it in their production environments.

Big data developers need to combine and react in real time to both static data and streaming data from engines like Apache Kafka and other streaming sources. The new model views a stream as a continually growing table, with new rows of data appended at the end; developers can treat it as a structured table and issue queries against it as they would a static table.

This new model obviated the old DStreams model of Spark's 1.x series, which we will discuss in more detail in Chapter 9. Underneath the Structured Streaming model, the Spark SQL core engine handles all aspects of fault tolerance and late-data semantics, allowing developers to focus on writing streaming applications with relative ease.

Furthermore, Spark 2.x extends its streaming data sources to include Apache Kafka, Kinesis, and HDFS-based or cloud storage.

For example, this code snippet shows the typical anatomy of a Structured Streaming application doing the streaming equivalent of the "Hello World" word count.

Note

Take note of the composability of the high-level domain-specific language (DSL) syntax of the DataFrame-based API's transformations and computations. It has the look and feel of a SQL-like aggregation, but done with the Python DataFrame API. We will expand on this in later chapters.

from pyspark.sql.functions import count

# read from a Kafka stream (connection options elided)
lines = (spark.readStream
  .format("kafka")
  .option("subscribe", "input")
  .load())

# perform transformation: cast the Kafka value to a string and count occurrences
wordCounts = (lines
  .selectExpr("CAST(value AS STRING) AS key")
  .groupBy("key")
  .agg(count("*").alias("value")))

# write back out to the stream
query = (wordCounts.writeStream
  .format("kafka")
  .option("topic", "output")
  .start())

GraphX

As the name suggests, GraphX is a library for manipulating graphs (e.g., social network graphs, routes and connection points, network topologies) and performing graph-parallel computations. Contributed to by many users in the community, it offers the standard graph algorithms for analysis, connections, and traversals, including PageRank, Connected Components, and Triangle Counting. This code snippet shows a simple example of how to join the vertices of a graph with another dataset using the GraphX APIs.

// a sketch: building the vertices, edges, and messages datasets is elided
val graph = Graph(vertices, edges)
val messages = spark.sparkContext.textFile("hdfs://...")
val graph2 = graph.joinVertices(messages) {
  (id, vertex, msg) => ...
}

GraphFrames

GraphFrames allows graph processing in Spark, with functionality similar to GraphX, except that most of it is built atop DataFrames, giving developers structured APIs. Additionally, it accords developers a few benefits:22

  • uniform APIs in Python, Java and Scala;

  • all algorithms available in GraphX are also available in Python and Java;

  • ability to express queries in Spark SQL and DataFrames relational operators; and

  • support for DataFrame data sources, allowing graphs to be saved and loaded in formats such as Parquet, JSON, and CSV

Because vertices and edges are represented as DataFrames, you can store arbitrary data with each vertex and edge. Also, GraphFrames seamlessly integrates with GraphX, meaning that you can convert your GraphFrames into an equivalent GraphX representation and vice versa.

For example, if you have a GraphFrame g with vertices (including a column age) and edges as DataFrames, you can express the following queries and switch to GraphX seamlessly.

val filterDF = g.vertices.filter("age > 35")
val graphRDD: Graph[Row, Row] = g.toGraphX
val gf: GraphFrame = GraphFrame.fromGraphX(graphRDD)
Note

Among all the Spark components, GraphX is the only one that is still based on the RDD APIs. However, Databricks and the community have contributed a DataFrame-based equivalent as an open source package called GraphFrames.23 24

Apache Spark’s Distributed Execution and Concepts

If you have read this far, you already know that Spark is a distributed data processing engine whose components work collaboratively on a cluster of machines. Before we explore programming Spark in the following chapters of this book, we need to understand how all the components of Spark's distributed architecture work together and communicate, and what deployment modes render it flexible and relatively easy to configure and deploy.

At a high level, a Spark application consists of a driver program that is responsible for orchestrating parallel operations on the Spark cluster. The driver accesses the distributed components in the cluster (the Spark Master, the Spark workers, and the cluster manager) through a SparkSession. Let's look at each of the individual components shown in Figure 1-4 and how they fit into the architecture.

Figure 1-4. Apache Spark components and architecture

Spark Session

In Spark 2.0, the SparkSession became a unified conduit to all Spark operations and data. Not only did it subsume previous entry points to Spark like the SparkContext, SQLContext, HiveContext, SparkConf, and StreamingContext, but it also made working with Spark simpler and easier.25

Through this unified conduit, you can set JVM runtime parameters, create DataFrames and Datasets, read from data sources, access catalog metadata, and issue Spark SQL queries. This is the essence of simplicity in Spark 2.x: a single unified entry point to all of Spark's functionality!

In a standalone Spark application, you create a SparkSession using one of the high-level APIs in the programming language of your choice, whereas in the Spark shells (more on this in the next chapter) it's created for you, and you can access it via a global variable called spark or sc.

For example, in a Spark application you create a single SparkSession per JVM and use it to perform a number of Spark operations, whereas in Spark 1.x you would have had to create individual contexts, introducing extra boilerplate code.

Note

Although in Spark 2.x the SparkSession subsumes all the other contexts, you can still access the individual contexts and their respective methods. In this way, the community maintained backward compatibility. That is, your old 1.x code with SparkContext or SQLContext will still work.

import org.apache.spark.sql.SparkSession

// build a SparkSession
val spark = SparkSession.builder()
  .appName("LearnSpark")
  .config("spark.sql.shuffle.partitions", 6)
  .getOrCreate()
...
val people = spark.read.json("...")
...
val resultsDF = spark.sql("SELECT city, pop, state, zip FROM zips_table")

Spark Driver

As the part of the Spark application responsible for instantiating a SparkSession, the Spark driver has multiple responsibilities: it communicates with the Spark Master (explained below), where the cluster manager runs; it requests resources from the Spark Master for the Spark workers' executor JVMs; and it transforms all the Spark operations into DAG computations, schedules them, and distributes their execution as tasks across the Spark workers.

Its interaction with the Spark Master and cluster manager is merely to obtain the Spark workers' resources: JVMs, CPU, memory, disk, and so on. Once the resources are allocated, it communicates directly with the Spark workers.

Spark Master and Cluster Manager

The Spark Master is a physical node on which the cluster manager runs as a long-running daemon. The cluster manager maintains the cluster of machines, and its daemon communicates with the respective worker daemons on the Spark worker nodes.

It fulfills requests from the Spark driver by launching the Spark workers' daemon processes and is responsible for managing them, as shown in Figure 1-4.

Spark Worker

A Spark worker is one of many physical nodes (machines) in the cluster. It is responsible for launching the executors on which Spark's tasks run. Typically, only a single worker runs per node in all of the various deployment modes.

Deployment Modes

An attractive feature of Spark is its support for myriad deployment modes in various popular environments. Because the Spark Master and cluster manager are agnostic to where they run (as long as they can manage the Spark workers' processes and fulfill resource requests), Spark can be deployed in some of the most popular environments, such as Apache Mesos, Apache Hadoop YARN, and Kubernetes, and can operate in different modes, as summarized in Table 1-1.

Table 1-1. Cheat sheet for Spark deployment modes a
Mode | Spark Driver | Spark Worker | Spark Executor | Spark Master/Cluster Manager
Local | Runs on a single JVM, e.g., on a laptop or a single node | Runs on the same JVM as the driver | Runs on the same JVM as the driver | Runs on the same host
Standalone | Can run on any node in the cluster | Runs its own JVM on each node in the cluster | Each worker in the cluster launches its own executor JVM | Can be allocated arbitrarily to the host where the master is started
YARN (client) | Runs on a client, not part of the cluster | YARN NodeManager | Container within YARN's NodeManager | YARN's Resource Manager works with YARN's Application Master to allocate containers on NodeManagers for the executors
YARN (cluster) | Runs within YARN's Application Master | Same as YARN client mode | Same as YARN client mode | Same as YARN client mode
Mesos (client) | Runs on a client, not part of the Mesos cluster | Runs on a Mesos Agent | Container within the Mesos Agent | Mesos Master
Mesos (cluster) | Runs within one of the Mesos Masters | Same as client mode | Same as client mode | Mesos Master
Kubernetes | Runs within a Kubernetes pod | Each worker runs within its own pod | Each executor runs within its own pod | Kubernetes Master

a https://www.kdnuggets.com/2016/09/7-steps-mastering-apache-spark.html
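
In practice, you pick the deployment mode when you submit your application with spark-submit. The following invocations are a sketch of how the same script can target different cluster managers; the host names, ports, and file names are illustrative, and real clusters usually need additional resource and image settings.

# local mode, using all cores on your machine
$ spark-submit --master local[*] my_app.py

# standalone cluster
$ spark-submit --master spark://spark-master:7077 my_app.py

# Hadoop YARN, cluster deploy mode
$ spark-submit --master yarn --deploy-mode cluster my_app.py

# Kubernetes
$ spark-submit --master k8s://https://k8s-apiserver:6443 --deploy-mode cluster my_app.py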

Distributed Data and Partitions

Actual physical data is distributed across an entire Spark cluster as partitions residing in either HDFS or cloud storage, and often more than one partition resides on a node in the cluster. The default partition size typically follows the underlying storage's block size (for example, 64 MB or 128 MB on HDFS, depending on the version) and may vary on cloud storage.

While the data is distributed as partitions across the physical cluster, Spark treats them as a high-level logical data abstraction, an RDD or DataFrame in memory, and each Spark worker's executor preferably reads the partition closest to it in the rack, observing data locality (see Figure 1-5).

Figure 1-5. Data distributed across physical machines as partitions

Partitioning allows for efficient parallelism. A distributed scheme of breaking up data into chunks, or partitions, allows Spark executors to process only the data that is close to them, minimizing network bandwidth. That is, each executor's core (or slot) is assigned its own data partition to work on (see Figure 1-6).

For example, this code snippet will read the physical data stored across the cluster into a minimum of 8 partitions, and each executor will get one or more partitions to read into its memory.

# read log files into an RDD with a minimum of 8 partitions
log_rdd = spark.sparkContext.textFile("path_to_log_files.txt", minPartitions=8)
# convert the RDD into a DataFrame (the schema definition is elided)
log_df = spark.createDataFrame(log_rdd, schema)

Or this code will create a DataFrame of 10,000 integers distributed over 8 partitions in memory.

# create a DataFrame of 10,000 integers distributed over 8 partitions
df = spark.range(0, 10000, 1, 8)
print(df.rdd.getNumPartitions())  # 8
Figure 1-6. Each Executor’s core gets a partition of data to work on

In Chapters 3 and 8, we will discuss how to tune and change the partitioning configuration for maximum parallelism based on how many cores (or slots) you have on your executors.
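
As a quick preview, and assuming a SparkSession named spark and an existing DataFrame df, this Python sketch shows the two knobs you will touch most often: explicitly repartitioning a DataFrame and setting the number of shuffle partitions.

# repartition a DataFrame, e.g., to match the total number of executor cores
df8 = df.repartition(8)
print(df8.rdd.getNumPartitions())  # 8

# control how many partitions Spark SQL uses after shuffles (joins, aggregations)
spark.conf.set("spark.sql.shuffle.partitions", "8")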

Developer’s Experience

Of all the developer's delights, none is more attractive than a set of composable APIs that make developers productive, that are easy to use, and that are intuitive and expressive. One of Apache Spark's appeals to developers has been its easy-to-use APIs for operating on small to large datasets, across languages: Scala, Java, Python, SQL, and R.26

One primary motivation behind Spark 2.x was to unify and simplify Spark by limiting the number of concepts that developers have to grapple with. Spark 2.x introduced higher-level abstraction APIs as domain-specific language constructs, which make programming Spark highly expressive and a pleasant developer experience. You express what you want the task or operation to compute, not how to compute it, and you let Spark ascertain how best to optimize your computation and do it for you. We will cover these Structured Spark APIs in Chapter 3, but first let's look at who Spark's developers are.

Who Uses Spark, and for What?

Not surprisingly, most developers who grapple with big data are data engineers, data scientists, or machine learning engineers. They are drawn to Spark because it is a unified engine for big data processing that allows them to build a range of applications using a single engine, with a familiar programming language.

More recently, because of the inclusion of deep learning frameworks (such as TensorFlow, Keras, and PyTorch) in Spark 2.4 as first-class citizens and the introduction of a gang scheduler to schedule tasks in a fault-tolerant manner suitable for distributed deep learning training, more ML engineers are drawn to Spark as well. Unsurprisingly, then, the typical use cases extend beyond data science and data engineering to deep learning and machine learning tasks.

Of course, many developers wear multiple hats and sometimes perform one, two, or all three of these kinds of tasks, especially at startups or in smaller engineering groups. Among all the tasks, however, data—massive amounts of data—is the foundation!

Data Science Tasks

As practitioners of a discipline that has come to prominence in the era of big data, data scientists use data to tell stories. But before they can narrate those stories, they have to explore the data, cleanse it, discover patterns, and build models to predict or suggest outcomes.

Some of these tasks require an academic background or knowledge in statistics, mathematics, computer science, and programming. Most data scientists are proficient with analytical tools like SQL, comfortable with libraries like NumPy and pandas (or tools like MATLAB), and conversant in programming languages like R and Python. But they also need to know how to wrangle or transform data, and how to use established classification, regression, or clustering algorithms for building models. Often their tasks are iterative, interactive, ad hoc, or experimental, designed to test their hypotheses.

Fortunately, Spark supports these different tools. With the release of Spark 2.x, MLlib offers a comprehensive set of machine learning algorithms to build model pipelines, using high-level estimators, transformers, and data featurizers; Spark SQL and the Spark shells provide quick, interactive, and ad hoc exploration of data; and Python, R, and popular external libraries work seamlessly with Spark.

Additionally, Spark enables data scientists to tackle large datasets and scale their model training and evaluation using a Spark cluster, improving accuracy by repeating, tuning, tweaking, and tracking experiments with open source platforms like MLflow.27

Often, after building their models, data scientists need to work with other team members who may be responsible for deploying the models. Or they may need to work closely with others to build and transform raw, dirty data into clean data that is easily consumable or usable by other data scientists. For example, a classification or clustering model does not exist in isolation; it works in conjunction with other components, like a web application or a streaming engine such as Apache Kafka, as part of a larger data pipeline. This pipeline is often built by data engineers.

Data Engineering Tasks

Data engineers have a strong understanding of software engineering principles and methodologies, and possess skills for building scalable data pipelines for a stated business use case. Data pipelines enable end-to-end transformations of raw data coming from myriad sources—data is cleansed so that it can be consumed downstream by data scientists, stored in the cloud or in NoSQL or RDBMS for report generation, or made accessible to data analysts via BI tools.

Spark 2.x introduced an evolutionary streaming model, called continuous applications, with Structured Streaming.28 With the Structured Streaming APIs, data engineers can build complex data pipelines that extract, transform, and load (ETL) data from both real-time and static data sources. We will discuss the Structured Streaming model in Chapter 9.

Data engineers use Spark because it provides a simple way to parallelize computations and hides all the complexity of distribution and fault tolerance. This leaves them free to focus on using high-level DataFrame-based APIs and DSL queries to do ETL, reading and combining data from multiple sources.
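
A typical ETL step might look like the following Python sketch: extract raw data, transform it with the DataFrame API, and load the result for downstream consumers. It assumes a SparkSession named spark; the paths and column names are purely illustrative.

from pyspark.sql import functions as F

# extract: read raw data from two sources
orders = spark.read.json("s3a://raw/orders/")
customers = spark.read.parquet("s3a://curated/customers/")

# transform: drop incomplete records and enrich orders with customer attributes
cleaned = (orders
    .dropna(subset=["order_id", "customer_id"])
    .withColumn("order_date", F.to_date("order_ts"))
    .join(customers, "customer_id"))

# load: persist the result for BI tools and data scientists
(cleaned.write
    .mode("append")
    .partitionBy("order_date")
    .parquet("s3a://warehouse/orders_enriched/"))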

With the performance improvements in Spark 2.x, due to the Catalyst optimizer for SQL and Tungsten for compact code generation,29 30 life for data engineers is much easier. They can choose to use any of the three Spark APIs—RDDs, DataFrames, or Datasets—that suit the task at hand and reap the benefits of Spark 2.x.31

Machine Learning or Deep Learning Tasks

Members of an organization's data science team will likely also embark on tasks to develop deep learning models. This is due to many factors. First, popular frameworks such as TensorFlow and PyTorch and advanced algorithms (to discover patterns and features in images, videos, audio, and speech) have become more widely available. Second, these frameworks and algorithms also provide support for traditional supervised machine learning approaches. And third, there is now an abundance of large, labeled datasets and pretrained models to work with.

Apache Spark 2.4 introduced a new gang scheduler, as part of Project Hydrogen,32 to accommodate the fault-tolerant needs of training and scheduling deep learning models in a distributed manner. Developers whose tasks demand deep learning techniques can thus use Spark alongside both deep learning and traditional machine learning techniques.
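
The gang scheduling surfaces in Spark 2.4 as barrier execution mode. This Python sketch shows the shape of the API, assuming a SparkSession named spark and enough free slots for all the barrier tasks to start together; a real deep learning integration would initialize its training framework where the comment indicates.

from pyspark import BarrierTaskContext

def train_partition(iterator):
    ctx = BarrierTaskContext.get()
    # ... a distributed training framework would be set up here ...
    ctx.barrier()            # all tasks in the stage wait here for one another
    yield ctx.partitionId()

rdd = spark.sparkContext.parallelize(range(8), 4)
print(rdd.barrier().mapPartitions(train_partition).collect())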

Whether you are a data engineer, data scientist, or machine learning engineer, you will find Spark used for the following use cases:

  • process in parallel large data sets distributed across a cluster;

  • perform ad hoc or interactive queries to explore and visualize data sets;

  • build, train, and evaluate machine learning models using MLlib and deep learning frameworks;

  • implement end-to-end data pipelines from myriad streams of data; and

  • analyze graph data sets and social networks.

Community Adoption and Expansion

Not surprisingly, Apache Spark struck a chord in the open source community, especially among data engineers and data scientists. Its four design characteristics mentioned above, and its inclusion as part of the vendor-neutral Apache Software Foundation, have fostered immense interest in the developer community.

Today, there are over 600 Apache Spark Meetup groups globally, with close to half a million members.33 Every week, someone somewhere in the world is giving a talk at a meetup or conference, leading a tutorial, or sharing a blog post on how to use Spark to build data pipelines, build ML models, or combine Spark with deep learning techniques. The Spark + AI Summit34 is the largest conference dedicated to the use of Spark for machine learning, data engineering, and data science across many verticals.

Since Spark's 1.0 release in May 2014, the community has produced many minor and major releases, the most recent major release being Spark 2.4 in 2018. This book focuses on Spark 2.x, but by the time of its publication the community will be close to releasing (or will have released) Spark 3.0.

Figure 1-7. The State of Apache Spark on GitHub35

Over the course of these releases, Spark has continued to attract contributors from across the globe and from numerous organizations. Today, Spark has close to 1,400 contributors, over 93 releases, 8,000 forks, and over 23,810 commits on GitHub, as Figure 1-7 shows. And we hope that when you finish this book, you will feel compelled to contribute too.

Now we can turn our attention to the fun of learning: where and how do you start using Spark? In the next chapter, we'll discuss how to download Spark and get started quickly in three simple steps.

1 http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

2 http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

3 http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf

4 https://bowenli86.github.io/2016/10/23/distributed%20system/data/Big-Data-and-Google-s-Three-Papers-I-GFS-and-MapReduce/

5 http://storageconference.us/2010/Papers/MSST/Shvachko.pdf

6 https://www.apache.org/

7 https://en.wikipedia.org/wiki/Apache_Hadoop

8 https://brookewenig.github.io/SparkOverview.html#/20

9 https://www.usenix.org/legacy/event/hotcloud10/tech/full_papers/Zaharia.pdf

10 https://spark.apache.org

11 https://brookewenig.github.io/SparkOverview.html#/20

12 https://spark.apache.org/releases/spark-release-1-0-0.html

13 https://spark.apache.org

14 https://cacm.acm.org/magazines/2016/11/209116-apache-spark/abstract

15 https://spark.apache.org/third-party-projects.html

16 https://cacm.acm.org/magazines/2016/11/209116-apache-spark/abstract

17 https://spark.apache.org/sql

18 https://en.wikipedia.org/wiki/SQL:2003

19 https://spark.apache.org/docs/latest/ml-guide.html

20 https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html

21 https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

22 https://databricks.com/blog/2016/03/03/introducing-graphframes.html

23 https://github.com/graphframes/graphframes

24 https://databricks.com/blog/2016/03/03/introducing-graphframes.html

25 https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html

26 https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

27 https://mlflow.org

28 https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html

29 https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

30 https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

31 https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

32 https://databricks.com/session/databricks-keynote-2

33 https://www.meetup.com/topics/apache-spark/

34 https://databricks.com/sparkaisummit

35 https://github.com/apache/spark
