Chapter 4. Predicting Forest Cover with Decision Trees
Prediction is very difficult, especially if it’s about the future.
Niels Bohr
In the late 19th century, the English scientist Sir Francis Galton was busy measuring things like peas and people. He found that large peas (and people) had larger-than-average offspring. This isn’t surprising. However, the offspring were, on average, smaller than their parents. In terms of people: the child of a 7-foot-tall basketball player is likely to be taller than the global average, but still more likely than not to be less than 7 feet tall.
As almost a side effect of his study, Galton plotted child versus parent size and noticed there was a roughly linear relationship between the two. Large parent peas had children that were large, but slightly smaller than their parents; small parents had children that were small, but generally a bit larger than their parents. The line’s slope was therefore positive but less than 1, and Galton described this phenomenon as we do today, as regression to the mean.
Although perhaps not perceived this way at the time, this line was, to me, an early example of a predictive model. The line links the two values and suggests that the value of one tells us a lot about the value of the other. Given the size of a new pea, this relationship could yield a more accurate estimate of its offspring’s size than simply assuming the offspring would be like the parent or like every other pea.
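The idea can be sketched in a few lines of code. The following is a minimal illustration on synthetic data, not Galton’s actual measurements: parent sizes, child sizes that revert partway toward the mean, and a fitted line whose slope comes out positive but less than 1. All numbers here (mean of 100, reversion factor of 0.6, noise levels) are invented for the sake of the example.

```python
import numpy as np

# Hypothetical synthetic data exhibiting regression to the mean.
rng = np.random.default_rng(42)
mean_size = 100.0

parent = mean_size + rng.normal(0.0, 10.0, size=500)
# Children revert partway toward the mean (factor 0.6), plus noise.
child = mean_size + 0.6 * (parent - mean_size) + rng.normal(0.0, 5.0, size=500)

# Fit the same kind of line Galton plotted: child size versus parent size.
slope, intercept = np.polyfit(parent, child, deg=1)
print(f"slope = {slope:.2f}")  # positive, but less than 1

# Use the line as a predictive model: an unusually large parent
# yields a child predicted to be above average, yet smaller than the parent.
big_parent = 130.0
predicted_child = slope * big_parent + intercept
print(f"predicted child size = {predicted_child:.1f}")
```

The fitted slope lands near the reversion factor, so the prediction for a 130-unit parent falls between the population mean and the parent’s own size, which is exactly the regression-to-the-mean pattern described above.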
Fast Forward to Regression
More than a century ...