Kay Ousterhout

Making Sense of Spark Performance

Date: This event took place live on April 1, 2015

Presented by: Kay Ousterhout

Duration: Approximately 60 minutes.

Cost: Free

Questions? Please send email to


Hosted By: Ben Lorica

There has been significant work dedicated to improving the performance of big-data systems like Spark, but comparatively little effort has been spent systematically analyzing the performance bottlenecks of these systems. In this talk, I'll take a deep dive into Spark's performance on two benchmarks (TPC-DS and the Big Data Benchmark from UC Berkeley) and one production workload and demonstrate that many commonly held beliefs about performance bottlenecks do not hold. In particular, I'll demonstrate that CPU (and not I/O) is often the bottleneck, that improving network performance can reduce job completion time by a median of at most 4%, and that the causes of most stragglers can be identified and fixed. I'll also demo how the open-source tools I developed can be used to understand the performance of other Spark jobs.

About Kay Ousterhout

Kay Ousterhout is a Spark committer and PMC member and a PhD student at UC Berkeley. In the Spark project, Kay is a maintainer of the scheduler, and her work on Spark has focused on improving scheduler performance. At UC Berkeley, Kay's research centers on understanding and improving the performance of large-scale analytics frameworks.

Twitter: @kayousterhout

About Ben Lorica

Ben Lorica is the Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media, Inc. He has applied Business Intelligence, Data Mining, Machine Learning, and Statistical Analysis in a variety of settings, including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services. He is an advisor to Databricks.

Twitter: @bigdata
