Skip to Content
Hadoop: The Definitive Guide, 4th Edition
book

Hadoop: The Definitive Guide, 4th Edition

by Tom White
April 2015
Beginner to intermediate
754 pages
21h 39m
English
O'Reilly Media, Inc.
Content preview from Hadoop: The Definitive Guide, 4th Edition

Chapter 19. Spark

Apache Spark is a cluster computing framework for large-scale data processing. Unlike most of the other processing frameworks discussed in this book, Spark does not use MapReduce as an execution engine; instead, it uses its own distributed runtime for executing work on a cluster. However, Spark has many parallels with MapReduce, in terms of both API and runtime, as we will see in this chapter. Spark is closely integrated with Hadoop: it can run on YARN and works with Hadoop file formats and storage backends like HDFS.

Spark is best known for its ability to keep large working datasets in memory between jobs. This capability allows Spark to outperform the equivalent MapReduce workflow (by an order of magnitude or more in some cases[128]), where datasets are always loaded from disk. Two styles of application that benefit greatly from Spark’s processing model are iterative algorithms (where a function is applied to a dataset repeatedly until an exit condition is met) and interactive analysis (where a user issues a series of ad hoc exploratory queries on a dataset).

Even if you don’t need in-memory caching, Spark is very attractive for a couple of other reasons: its DAG engine and its user experience. Unlike MapReduce, Spark’s DAG engine can process arbitrary pipelines of operators and translate them into a single job for the user.

Spark’s user experience is also second to none, with a rich set of APIs for performing many common data processing tasks, such as joins. At ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Hadoop: The Definitive Guide, 3rd Edition

Hadoop: The Definitive Guide, 3rd Edition

Tom White
Kafka: The Definitive Guide

Kafka: The Definitive Guide

Neha Narkhede, Gwen Shapira, Todd Palino
Kafka: The Definitive Guide, 2nd Edition

Kafka: The Definitive Guide, 2nd Edition

Gwen Shapira, Todd Palino, Rajini Sivaram, Krit Petty

Publisher Resources

ISBN: 9781491901687Errata PageSupplemental Content