The state of machine learning in Apache Spark
The O’Reilly Data Show Podcast: Ion Stoica and Matei Zaharia explore the rich ecosystem of analytic tools around Apache Spark.
The O’Reilly Data Show Podcast: Ion Stoica and Matei Zaharia explore the rich ecosystem of analytic tools around Apache Spark.
In this episode of the Data Show, we look back to a recent conversation I had at the Spark Summit in San Francisco with Ion Stoica (UC Berkeley professor and executive chairman of Databricks) and Matei Zaharia (assistant professor at Stanford and chief technologist of Databricks). Stoica and Zaharia were core members of UC Berkeley’s AMPLab, which originated Apache Spark, Apache Mesos, and Alluxio.
We began our conversation by discussing recent academic research that would be of interest to the Apache Spark community (Stoica leads the RISE Lab at UC Berkeley, Zaharia is part of Stanford’s DAWN Project). The bulk of our conversation centered around machine learning. Like many in the audience, I was first attracted to Spark because it simultaneously allowed me to scale machine learning algorithms to large data sets while providing reasonable latency.
Here is a partial list of the items we discussed:
[Full disclosure: I am an advisor to Databricks.]
Related resources: