September 2015
Beginner to intermediate
608 pages
13h 43m
English
The Spark project (https://spark.apache.org/) is a cluster computing framework that emphasizes low-latency job execution. It's a relatively recent project, growing out of UC Berkeley's AMPLab in 2009.
Although Spark can coexist with Hadoop (for example, by reading files stored on the Hadoop Distributed File System, HDFS), it targets much faster job execution by keeping much of the computation in memory. In contrast with Hadoop's two-stage MapReduce paradigm, which writes intermediate results to disk between iterations, Spark's in-memory model can run tens or hundreds of times faster for some applications, particularly those that make multiple passes over the data. ...
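To make the contrast concrete, here is a minimal sketch in plain Python. It does not use Spark or Hadoop. The function names (`step`, `run_via_disk`, `run_in_memory`) and the toy computation are hypothetical and chosen only for illustration. One loop writes its intermediate state to disk after every iteration and reads it back, loosely in the spirit of chained MapReduce jobs. The other keeps the state in memory throughout and assumes the data has already been loaded.

```python
import json
import os
import tempfile


def step(values):
    """One iteration of a toy computation: move each value halfway towards 10."""
    return [v + (10 - v) / 2 for v in values]


def run_via_disk(values, iterations, workdir):
    """Persist the state to disk between iterations and count file operations."""
    path = os.path.join(workdir, "state.json")
    disk_ops = 0
    with open(path, "w") as f:      # initial write of the input data
        json.dump(values, f)
    disk_ops += 1
    for _ in range(iterations):
        with open(path) as f:       # read the previous iteration's output
            current = json.load(f)
        disk_ops += 1
        with open(path, "w") as f:  # write this iteration's output
            json.dump(step(current), f)
        disk_ops += 1
    with open(path) as f:           # final read of the result
        result = json.load(f)
    disk_ops += 1
    return result, disk_ops


def run_in_memory(values, iterations):
    """Keep the state in memory for every iteration; no file operations."""
    current = list(values)
    for _ in range(iterations):
        current = step(current)
    return current, 0


if __name__ == "__main__":
    data = [0.0, 4.0, 20.0]
    with tempfile.TemporaryDirectory() as workdir:
        disk_result, disk_ops = run_via_disk(data, 5, workdir)
    mem_result, mem_ops = run_in_memory(data, 5)
    print(disk_result == mem_result, disk_ops, mem_ops)
```

Both functions compute the same result. The disk-backed version pays for two file operations per iteration, and that per-iteration cost is the kind of overhead an in-memory model avoids when an application makes many passes over the same data.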