E. Installing Apache Spark

As mentioned in Chapter 8, “Hadoop YARN Applications,” Apache Spark is a fast, in-memory data processing engine. Spark differs from the classic MapReduce model in two ways. First, Spark holds intermediate results in memory rather than writing them to disk. Second, Spark supports more than just MapReduce functions, which greatly expands the set of analyses that can be run over data stored in HDFS. Spark also provides APIs in Scala, Java, and Python, and it has been fully integrated to run under YARN.
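The first difference can be illustrated with a small conceptual sketch. The code below is plain Python and does not use the Spark API; the function names `disk_staged` and `in_memory` are hypothetical and exist only for this illustration. It runs the same two-step computation twice: once writing the intermediate result to a temporary file between steps, as a chain of MapReduce jobs would, and once keeping it in memory, which is the approach Spark takes.

```python
import json
import os
import tempfile


def disk_staged(numbers):
    """MapReduce-style sketch: the first stage writes its output to disk."""
    with tempfile.TemporaryDirectory() as tmp:
        stage1_path = os.path.join(tmp, "stage1.json")
        # Stage 1: square every value and persist the intermediate result.
        with open(stage1_path, "w") as f:
            json.dump([n * n for n in numbers], f)
        # Stage 2: reload the intermediate result from disk, then aggregate.
        with open(stage1_path) as f:
            squares = json.load(f)
        return sum(s for s in squares if s % 2 == 0)


def in_memory(numbers):
    """Spark-style sketch: the intermediate result never leaves memory."""
    squares = [n * n for n in numbers]  # stays in memory between steps
    return sum(s for s in squares if s % 2 == 0)


if __name__ == "__main__":
    data = list(range(1, 11))
    # Both approaches compute the same answer; only the handling of the
    # intermediate list of squares differs.
    print(disk_staged(data), in_memory(data))
```

Both functions return the same value. The difference is the extra write and read of the intermediate data, a cost that grows with every additional stage in a multi-step analysis.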

As of this writing, Apache Spark has not been fully integrated into the Hortonworks HDP Hadoop distribution version 2.2.4. The next release will include Spark as a fully integrated Ambari and HDP component.

As demonstrated ...