Construction progress on the Guiyang-Guangzhou high-speed railway. (source: Billyshanenunn on Wikimedia Commons)

I recently talked to Sean Suchter, co-founder and CEO of Pepperdata, about why Spark has become so popular and where it still presents challenges. Spark represents an evolution in how the computing field understands and addresses the challenge of handling big data (a term I will use casually without trying to define it; any definition you care to plug in will work for this discussion).

A revolution in how to handle big data started in 2004 with the MapReduce model, invented at Google and brought to the wider community in 2006 through Yahoo's open source Hadoop project. Yahoo, incidentally, is also where Suchter launched the first production implementation of Hadoop. (An essay about the development of MapReduce can be found in the O'Reilly book Beautiful Code.) Since that time, we've discovered several important things about large-scale data processing:

  • MapReduce's pipeline of mapping and possible partitioning, followed by a reduction, covers only a subset of potential applications in this area. For instance, mapping and reduction are often performed in sequences, which in Hadoop requires chaining multiple jobs together.
  • Despite the popularity of NoSQL database solutions, people still like joins and other powerful features uniquely offered by SQL. This has led to projects that essentially wrap MapReduce jobs in SQL-like statements.
  • Increases in computer main memory now allow caching of results and other in-memory operations on sets of data that were previously too large to consider processing in memory.
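
To make the first and third points concrete, here is a minimal sketch in plain Python. It uses neither the Hadoop nor the Spark API; the `map_reduce` helper and the sample `lines` are hypothetical names invented for illustration. It runs two map/reduce passes in sequence and keeps the first pass's result in memory so later steps can reuse it without recomputation.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(records, mapper, reducer):
    """Run one map -> group-by-key -> reduce pass, entirely in memory.

    Hypothetical helper for illustration; not a Hadoop or Spark API.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reduce(reducer, values) for key, values in groups.items()}


# Made-up sample input.
lines = ["spark and hadoop", "spark and sql", "hadoop"]

# Stage 1: count occurrences of each word.
word_counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda a, b: a + b,
)

# Stage 2: a second map/reduce pass over the stage 1 result,
# grouping words by how often they occur.
words_by_count = map_reduce(
    word_counts.items(),
    mapper=lambda item: [(item[1], [item[0]])],
    reducer=lambda a, b: a + b,
)

# The stage 1 result is still in memory, so another computation
# can reuse it without rereading or recounting the input.
total_words = sum(word_counts.values())
```

In this sketch the stages are ordinary function calls sharing an in-memory dictionary. In a job-chaining setup, each stage would instead be a separate job, with the intermediate result written to storage and read back between stages.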

All these developments are addressed by Spark. In our interview, Suchter discussed:

  • The place of Spark in the Hadoop world, as a processing engine.
  • Why Spark may well replace the core MapReduce model of Hadoop, thanks particularly to Spark's flexible pipelining and its caching and reuse of in-memory data.
  • The importance of Spark for streaming use cases.
  • Pepperdata's work on multitenant clusters, where the resource requirements of one job can disrupt others and monitoring is difficult, along with the new challenges introduced by Spark. (See our recent report Hadoop and Spark Performance for the Enterprise, which investigates this problem in more detail.)
  • Recent improvements in packaging and deployment options for Spark.
  • The importance of providing SQL to expand the number of users able to use Spark.
  • Basic tips to keep in mind when using Spark.

Spark continues to become more mainstream, and ideas and opinions about this new platform are in no short supply. Listen to the podcast to hear how Suchter, a Hadoop veteran, sees the technology fitting into (and also changing) the current ecosystem of distributed computing tools.

This post is a collaboration between O'Reilly and Pepperdata. See our statement of editorial independence.

 
