Machine Learning with Spark - Second Edition by Nick Pentreath, Manpreet Singh Ghotra, Rajdeep Dua

Spark 1.6 to 2.0

As of Spark 2.0, the DataFrame-based API (the spark.ml package) is the primary machine learning API.

The RDD-based API (the spark.mllib package) is entering maintenance mode. The MLlib guide (http://spark.apache.org/docs/2.0.0/ml-guide.html) provides more details.

The following are the new features introduced in Spark 2.0:

  • ML persistence: The DataFrame-based API supports saving and loading ML models and Pipelines in Scala, Java, Python, and R
  • MLlib in R: In this release, SparkR offers MLlib APIs for generalized linear models, naive Bayes, k-means clustering, and survival regression
  • Python: PySpark in 2.0 supports new MLlib algorithms, including LDA, Generalized Linear Regression, and Gaussian Mixture Model, among others

Algorithms added to the DataFrame-based API include GMM, Bisecting K-Means clustering, ...