Apache Spark

Apache Spark is an open-source framework for fast, large-scale data processing, with support for streaming, SQL, machine learning, and graph processing. The framework is implemented in Scala and supports programming in Java, Scala, and Python. Its performance can be up to 10 to 20 times that of the traditional Hadoop stack. Spark is a general-purpose framework that allows interactive programming in addition to streaming. Spark can work with Hadoop or run in standalone mode, and it supports Hadoop formats such as SequenceFiles and InputFormats. Supported data sources include local file systems, Hive, HBase, Cassandra, and Amazon S3, among others.

We will use Spark 1.2.0 for all the examples throughout this book.
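
To give a feel for the Python programming interface mentioned above, here is a minimal word-count sketch. It assumes a local Spark installation with PySpark available; the function names (tokenize, word_count, run_local_example), the application name, and the sample sentences are hypothetical and chosen only for illustration.

```python
from operator import add


def tokenize(line):
    """Split a line of text into lowercase words."""
    return line.lower().split()


def word_count(sc, lines):
    """Count words in `lines` using an existing SparkContext `sc`."""
    counts = (
        sc.parallelize(lines)            # distribute the list as an RDD
          .flatMap(tokenize)             # one record per word
          .map(lambda word: (word, 1))   # pair each word with a count of 1
          .reduceByKey(add)              # sum the counts for each word
    )
    return dict(counts.collect())        # bring the results back to the driver


def run_local_example():
    """Hypothetical driver; requires a local Spark installation with PySpark."""
    from pyspark import SparkContext     # imported here so the helpers above load without Spark

    sc = SparkContext("local[2]", "WordCountSketch")  # assumed: two local worker threads
    try:
        return word_count(sc, ["Spark is fast", "Spark is general purpose"])
    finally:
        sc.stop()                        # release the local Spark resources
```

Calling run_local_example() from a Python session with PySpark on the path should return a dictionary mapping each word to its count. The same chain of transformations can also be typed line by line into the interactive pyspark shell, where a SparkContext named sc is already provided.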

The following figure ...
