Apache Spark’s journey from academia to industry

In this O'Reilly Data Show Podcast: Ion Stoica talks about the rise of Apache Spark and Apache Mesos.

By Ben Lorica
December 31, 2014
A cluster of center pivot irrigation fields. (source: Soil Science)

Three projects from UC Berkeley’s AMPLab have been keenly adopted by industry: Apache Mesos, Apache Spark, and Tachyon. As an early user, I’ve enjoyed watching Spark go from an academic lab to the most active open source project in big data. In my recent travels, I’ve met Spark users from companies of all sizes and from many industries. I’ve also spoken with companies that came of age before Spark was available or mature enough, and many are replacing homegrown tools with Spark. (Full disclosure: I’m an advisor to Databricks, a start-up commercializing Apache Spark.)

A few months ago, I spoke with UC Berkeley Professor and Databricks CEO Ion Stoica about the early days of Spark and the Berkeley Data Analytics Stack. Ion noted that by the time his students began work on Spark and Mesos, his experience at his other start-up Conviva had already informed some of the design choices:

“Actually, this story started back in 2009, and it started with a different project, Mesos. So, this was a class project in a class I taught in the spring of 2009. And that was to build a cluster management system, to be able to support multiple cluster computing frameworks like Hadoop, at that time, MPI and others. To share the same cluster as the data in the cluster. Pretty soon after that, we thought about what to build on top of Mesos, and that was Spark. Initially, we wanted to demonstrate that it was actually easier to build a new framework from scratch on top of Mesos, and of course we wanted it to be also special. So, we targeted workloads for which Hadoop at that time was not good enough. Hadoop was targeting batch computation. So, we targeted interactive queries and iterative computation, like machine learning.

“I also co-founded a company, Conviva. It was in the area of video management, and one of its products was an analytics tool. And as a part of that product, one feature was ad hoc queries. And, at that time, you know … we were using MySQL. MySQL was not good enough. I saw first hand the limitation of the existing technologies, especially on the open source side. And finally, you’re pretty anchored in seeing the trends and the problems in the industry around us, because we are funded at Berkeley by many companies, like Facebook, Yahoo and so forth…so, after the initial building…batch jobs, on top of Hadoop, they were looking for the next level; you want to have something faster.”

It’s one thing to build something as an academic project, where papers and conference presentations are the standard metrics. Successful open source projects involve great developers coming together to tackle real problems, and great timing is usually an important and underappreciated factor. Ion explained:

“There are many components. And if you look back, you can always revise history. Especially if you had success. First of all, we had a fantastic group of students. Matei, the creator of Spark and others who did Mesos. And then another great group of different students who contributed and built different modules on top of Spark, and made Spark what it is today, which is really a platform. So, that’s one: the students.

“The other one was a great collaboration with the industry. We are seeing first hand what the problems are, challenges, so you’re pretty anchored in reality.

“The third thing is, we are early. In some sense, we started very early to look at big data, we started as early as 2006, 2007 starting to look at big data problems. We had a little bit of a first-mover advantage, at least in the academic space. So, all this together, plus the fact that the first releases of these tools, in particular Spark, was like 2000 lines of code…very small, so tractable.”

UC Berkeley has had many successful open source projects in the past — BSD and Postgres are prominent examples. I asked Ion if early on they thought Spark would be embraced so enthusiastically by industry:

“Absolutely not. We wanted to have some good, interesting research projects; we wanted to make it as real as possible, but in no way could we have anticipated the adoption and the enthusiasm of people and of the community around what we’ve built.”

You can listen to our entire interview in the SoundCloud player above, in our SoundCloud stream, or on iTunes.
