17.6 Spark

In this section, we’ll present an overview of Apache Spark. We’ll use the Python PySpark library and Spark’s functional-style filter/map/reduce capabilities to implement a simple word-count example that summarizes the word frequencies in Romeo and Juliet.
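To make the filter/map/reduce idea concrete before turning to Spark itself, here is a minimal plain-Python sketch of the same word-count pattern. It does not use PySpark; it relies only on Python's built-in `filter` and `map` and on `functools.reduce`. The sample sentence, the `STOP_WORDS` set and the `add_pair` helper are illustrative assumptions, not part of the book's example.

```python
from functools import reduce

# Hypothetical sample text; the section's actual example processes Romeo and Juliet.
text = "Romeo loves Juliet and Juliet loves Romeo"

# Assumed, illustrative stop-word list (words we do not want to count).
STOP_WORDS = {"and", "the", "a"}

# Split the text into lowercase words.
words = text.lower().split()

# filter: keep only the words that are not stop words.
kept = filter(lambda word: word not in STOP_WORDS, words)

# map: turn each word into a (word, 1) pair.
pairs = map(lambda word: (word, 1), kept)


def add_pair(counts, pair):
    """Fold one (word, n) pair into the running dictionary of totals."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts


# reduce: combine all the pairs into a single dictionary of word counts.
word_counts = reduce(add_pair, pairs, {})

# Display the words by descending count, then alphabetically.
for word, count in sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{word}: {count}")
```

The three steps stay separate on purpose: each stage takes a sequence and produces a new one, the style that lets a framework such as Spark distribute the work across a cluster.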

17.6.1 Spark Overview

When you process truly big data, performance is crucial. Hadoop is geared to disk-based batch processing—reading the data from disk, processing the data and writing the results back to disk. Many big-data applications demand better performance than is possible with disk-intensive operations. In particular, fast streaming applications that require either real-time or near-real-time processing won’t work in a disk-based architecture.

History

Spark was initially developed in 2009 ...
