Fast Data Processing with Spark 2 - Third Edition by Krishna Sankar

Spark v2.0 and beyond

Spark v2.0 and beyond has been the catalyst for a renaissance in data science! Datasets, DataFrames, ML pipelines, and new and improved algorithms in MLlib have paved the way for data wrangling at scale. I think version 2.0 marks the point where Spark turned into a mature framework. It could handle huge workloads, both in the number of machines and in the volume of data. The community update at Spark Summit 2015 in San Francisco included a slide that showed the power of Spark:

  • The largest cluster: 8,000 nodes (Tencent)
  • The largest single job: 1 petabyte and more (Alibaba and Tencent)
  • The longest-running job: 1 petabyte and more, for a week (Alibaba)
  • The top streaming intake: 1 terabyte/hour (Janelia Farm)
  • The largest shuffle: 1 ...
