Cross-validation

Given that the goal of predictive analytics is to build generalizable models that predict well for data yet unobserved, we should ideally be testing our models on data unseen, and check our predictions against the observed outcomes. The problem with that, of course, is that we don't know the outcomes of data unseen—that's why we want a predictive model. We do, however, have a trick up our sleeve, called the validation set approach.

The validation set approach is a technique to evaluate a model's ability to perform well on an independent dataset. But instead of waiting to get our hands on a completely new dataset, we simulate a new dataset with the one we already have.

The main idea is that we can split our dataset into two ...

Get Data Analysis with R - Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.