
Apache Spark 2.x Machine Learning Cookbook by Shuen Mei, Broderick Hall, Meenakshi Rajendran, Siamak Amirghodsi

How to do it...

  1. For the Naive Bayes exercise, we use the well-known Iris dataset (iris.data), which is available from the UCI Machine Learning Repository. The dataset was introduced by R. A. Fisher in 1936. It is a multivariate dataset of flower measurements, with each sample labeled as one of three species.

In short, we use the four measurement columns to classify each flower as one of three Iris species (Iris Setosa, Iris Versicolor, or Iris Virginica).

We can download the data from here:

https://archive.ics.uci.edu/ml/datasets/Iris/

The columns are defined as follows:

    • Sepal length in cm
    • Sepal width in cm
    • Petal length in cm
    • Petal width in cm
    • Class:
      • Iris Setosa => replace it with 0
      • Iris Versicolor => replace it with 1
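The column layout and class encoding above can be sketched in a few lines of plain Python. This is a minimal illustration, not the recipe's Spark code: the names `LABEL_TO_INDEX`, `parse_iris_row`, and `parse_iris` are hypothetical, the sample rows only illustrate the file's comma-separated format, and the code 2 for Iris Virginica is an assumption, since the list above gives only the first two codes.

```python
import csv
import io

# Class label strings as they appear in iris.data, mapped to numeric codes.
# Codes 0 and 1 follow the list above; 2 for Iris-virginica is an assumption.
LABEL_TO_INDEX = {
    "Iris-setosa": 0,
    "Iris-versicolor": 1,
    "Iris-virginica": 2,  # assumed, not stated in the text above
}


def parse_iris_row(row):
    """Turn one CSV row into (label_code, [four measurements in cm])."""
    # Columns 0-3: sepal length, sepal width, petal length, petal width.
    features = [float(value) for value in row[:4]]
    # Column 4: the species name, replaced by its numeric code.
    label = LABEL_TO_INDEX[row[4].strip()]
    return label, features


def parse_iris(text):
    """Parse the full file contents, skipping blank lines."""
    reader = csv.reader(io.StringIO(text))
    return [parse_iris_row(row) for row in reader if row]


# Illustrative rows in the iris.data format, one per class.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

if __name__ == "__main__":
    for label, features in parse_iris(SAMPLE):
        print(label, features)
```

Keeping the label as the first element of each pair mirrors the common "label plus feature vector" shape that classifiers expect, but the exact structure a given library needs may differ.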
