Random forests offer the benefit of built-in cross-validation because individual trees are trained on bootstrapped versions of the training data. As a result, each tree uses on average only about two-thirds of the available observations. To see why, consider that a bootstrap sample has the same size, n, as the original sample, and each observation has the same probability, 1/n, of being selected on any given draw. Hence, the probability of not entering a bootstrap sample at all is (1 − 1/n)^n, which converges (quickly) to 1/e ≈ 0.368, or roughly one-third.
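As a quick sanity check of this limit, the following minimal simulation draws repeated bootstrap samples and measures the share of observations left out; the sample size and number of draws are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n, n_draws = 1_000, 10_000  # illustrative sample size and repetitions

oob_shares = []
for _ in range(n_draws):
    # Bootstrap: draw n indices with replacement from {0, ..., n-1}
    sample = rng.integers(0, n, size=n)
    # Share of observations that never appear in this bootstrap sample
    oob_shares.append(1 - np.unique(sample).size / n)

print(f"Simulated OOB share: {np.mean(oob_shares):.4f}")  # ~0.368
print(f"Theoretical 1/e:     {np.exp(-1):.4f}")           # 0.3679
```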
The remaining observations, roughly one-third, that are not included in the bootstrap sample used to grow a bagged tree are called out-of-bag (OOB) observations, and they can serve as a validation set. Just as with cross-validation, ...
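For illustration, here is a minimal sketch of obtaining such an OOB validation estimate with scikit-learn's oob_score option; the synthetic dataset and hyperparameters are assumptions for the example, not taken from the text:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic classification data
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# With oob_score=True, each observation is scored only by the trees
# whose bootstrap samples excluded it, giving a validation-style
# accuracy estimate without a separate hold-out set.
rf = RandomForestClassifier(n_estimators=200, bootstrap=True,
                            oob_score=True, random_state=0)
rf.fit(X, y)

print(f"OOB accuracy estimate: {rf.oob_score_:.3f}")
```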