In this chapter we will look at the random forest machine learning algorithm. It is a wonderful algorithm: effective on a wide range of data sets, while having relatively few parameters to tune. It is a decision-tree ensemble algorithm (as is GBM, which we look at in the next chapter).
I start with a brief look at basic decision trees, then at how random forest is different, and then go through the optional parameters that H2O's implementation offers. Then I apply random forest to each of the three data sets: first out-of-the-box, with all defaults, then using a tuning process to find the best single model I can. Each of the subsequent three chapters will follow this same pattern. Because this is the first of those four chapters, it also introduces grids, a great tool to aid in tuning. The results of all models are summarized at the end of the book, in Chapter 11.
The goal of the tuning process is to improve on the default settings. But the H2O implementations tend to have good defaults that adapt to the characteristics of your data, so I quickly reach the point of diminishing returns. Keep in mind how much time and effort a given increase in model accuracy is worth to you. Maybe your day is better spent on feature engineering than on tuning? Maybe $1000 would be better spent on additional data (whether buying data sets, or running your own surveys) than on buying 500 node-hours on EC2 to run grids?
Throughout this book, but particularly in the next few chapters, I’ve deliberately shown ...