How it works...

Let's draw a decision tree for the tennis example we created in this recipe:

This model has a depth of three levels. Which attribute to select depends upon how we can maximize information gain. We compute this by measuring the purity of the split. Purity means that irrespective of whether the certainty of playing tennis is increasing, the given dataset will be considered positive or negative. In this example, this equates to whether the chances of play are increasing or the chances of not playing are increasing.

Purity is measured using entropy. Entropy is a measure of disorder in a system. In this context, it is easier to ...

Get Apache Spark 2.x Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.