CHAPTER 25
Forests

In Chapter 24, we introduced decision trees: a useful method for capturing and modelling non-linear data-generating processes, whose results can be easily read and interpreted by human analysts. The notion of a decision tree can be extended into a more powerful concept, the ensemble of trees, known as a random forest; see Breiman (2001). Random forests give rise to a powerful estimation method, owing both to the numerical ease with which a fit is obtained and to their explanatory power. They also address one of the main weaknesses of decision trees, overfitting, by using the statistical concept of bootstrapping. In this chapter, building upon the previous chapter on trees, we introduce the theoretical foundations of random forests and demonstrate an implementation.

25.1 BOOTSTRAP

Let us start with the concept of bootstrapping as it is used in statistics; see Efron (1992). This will give us the intuition for how it can be used to improve the performance of decision trees. Bootstrapping denotes estimation and testing based on random sampling with replacement. It is particularly useful when we need to estimate the sampling distribution of a particular measure or statistic. From that estimated distribution we can obtain significance levels for sample statistics and test various hypotheses on the given sample. The bootstrap is a general procedure and an alternative to asymptotic methods which ...
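To make this concrete, the following is a minimal sketch in q of a bootstrap percentile interval for the sample mean. The toy sample x, the number of replicates B, and the helper resample are illustrative choices introduced here, not taken from the book's code; the key fact used is that in q, n?list with a positive n draws n items from the list with replacement.

x:1000?10f                        / toy sample: 1000 uniform draws on [0;10)
resample:{count[x]?x}             / one with-replacement resample of x
B:5000                            / number of bootstrap replicates
bmeans:avg each resample each B#enlist x   / bootstrap distribution of the mean
s:asc bmeans                      / sort the replicated means
(s floor 0.025*B;s floor 0.975*B) / 95% percentile interval for the mean

The same pattern applies to any statistic: replace avg with, say, med or var to obtain the bootstrap distribution of the median or the variance without deriving its sampling distribution analytically.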
