Preface
I don’t like to think I have many regrets, but it’s hard to believe anything good came out of a particular lazy moment in 2011 when I was looking into how to best distribute tough discrete optimization problems over clusters of computers. My advisor explained this newfangled Apache Spark thing he had heard of, and I basically wrote off the concept as too good to be true and promptly got back to writing my undergrad thesis in MapReduce. Since then, Spark and I have both matured a bit, but only one of us has seen a meteoric rise that’s nearly impossible to avoid making “ignite” puns about. Cut to a few years later, and it has become crystal clear that Spark is something worth paying attention to.
Spark’s long lineage of predecessors, from MPI to MapReduce, makes it possible to write programs that take advantage of massive resources while abstracting away the nitty-gritty details of distributed systems. As much as data processing needs have motivated the development of these frameworks, in a way the field of big data has become so entwined with them that its scope is defined by what they can handle. Spark’s promise is to take this a little further—to make writing distributed programs feel like writing regular programs.
Spark is great at giving ETL pipelines huge boosts in performance and easing some of the pain that feeds the MapReduce programmer’s daily chant of despair (“why? whyyyyy?”) to the Apache Hadoop gods. But the exciting thing for ...