Cross-validation and model selection

In the previous example, we validated our approach by withholding 30% of the data during training and testing on that held-out subset. This approach is not particularly rigorous: the exact result changes depending on the random train-test split. Furthermore, if we tested several different hyperparameters (or different models) against the same test set to choose the best one, we would unwittingly choose the model that best reflects the specific rows in that test set, rather than the population as a whole.

This can be overcome with cross-validation. We have already encountered cross-validation in Chapter 4, Parallel Collections and Futures. In that chapter, we used random subsample cross-validation, where we created the train-test split ...
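As a rough illustration of the idea, random subsample cross-validation can be sketched in plain Scala: draw several independent random train-test splits and average the score across them, so that no single split dominates the result. This is a minimal sketch, not the code from Chapter 4. The object `RandomSubsampleCV`, the functions `trainTestSplit` and `crossValidate`, and the `score` callback are hypothetical names introduced here, and the splits are expressed as row indices rather than as any particular data structure.

```scala
import scala.util.Random

// Hypothetical helper object, introduced for illustration only.
object RandomSubsampleCV {

  /** Shuffle the row indices 0 until nRows and hold out `testFraction`
    * of them as the test set. Returns (trainIndices, testIndices).
    */
  def trainTestSplit(
      nRows: Int,
      testFraction: Double,
      rng: Random
  ): (Vector[Int], Vector[Int]) = {
    val shuffled = rng.shuffle((0 until nRows).toVector)
    val nTest = math.round(nRows * testFraction).toInt
    val (test, train) = shuffled.splitAt(nTest)
    (train, test)
  }

  /** Draw `nSplits` independent random splits, evaluate `score` on each
    * one, and return the mean score. `score` is assumed to train a model
    * on the training rows and return a metric computed on the test rows.
    */
  def crossValidate(
      nRows: Int,
      testFraction: Double,
      nSplits: Int,
      seed: Long
  )(score: (Vector[Int], Vector[Int]) => Double): Double = {
    val rng = new Random(seed)
    val scores = Vector.fill(nSplits) {
      val (train, test) = trainTestSplit(nRows, testFraction, rng)
      score(train, test)
    }
    scores.sum / nSplits
  }
}
```

With a 30% hold-out, a call might look like `RandomSubsampleCV.crossValidate(nRows, 0.3, nSplits = 10, seed = 0L)(score)`, where `score` fits the model on the training indices and evaluates it on the test indices. Averaging over splits reduces the dependence on any one random split, but it does not by itself address the model selection problem described above.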

Scala: Guide for Data Science Professionals by Pascal Bugnion, Arun Manivannan, Patrick R. Nicolas
