Chapter 39. Hyperparameters and Model Validation
In the previous chapter, we saw the basic recipe for applying a supervised machine learning model:
- Choose a class of model.
- Choose model hyperparameters.
- Fit the model to the training data.
- Use the model to predict labels for new data.
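As a quick illustration of these four steps, here is a minimal sketch using scikit-learn; the Gaussian naive Bayes classifier and the variable names are illustrative assumptions rather than the choices this chapter focuses on:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Load an example dataset: feature matrix X and label vector y
iris = load_iris()
X, y = iris.data, iris.target

# Steps 1 and 2: choose a class of model and its hyperparameters
# (GaussianNB is a simple choice with essentially no hyperparameters to set)
model = GaussianNB()

# Step 3: fit the model to the training data
model.fit(X, y)

# Step 4: use the model to predict labels
# (here we simply reuse a few rows of X to show the call)
y_pred = model.predict(X[:5])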
The first two pieces of this—the choice of model and choice of hyperparameters—are perhaps the most important part of using these tools and techniques effectively. In order to make informed choices, we need a way to validate that our model and our hyperparameters are a good fit to the data. While this may sound simple, there are some pitfalls that you must avoid to do this effectively.
Thinking About Model Validation
In principle, model validation is very simple: after choosing a model and its hyperparameters, we can estimate how effective it is by applying it to some of the training data and comparing the predictions to the known values.
This section will first show a naive approach to model validation and why it fails, before exploring the use of holdout sets and cross-validation for more robust model evaluation.
Model Validation the Wrong Way
Let’s start with the naive approach to validation using the Iris dataset, which we saw in the previous chapter. We will start by loading the data:
In[1]: from sklearn.datasets import load_iris
       iris = load_iris()
       X = iris.data
       y = iris.target
Next, we choose a model and hyperparameters. Here we'll use a k-nearest neighbors classifier with n_neighbors=1; that is, the label of an unknown point is taken to be the label of its single closest training point.
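As a minimal sketch of how this naive evaluation might proceed (the variable names are illustrative, and X and y are assumed to come from the cell above), we can instantiate the classifier, fit it to the full dataset, and then measure its accuracy on that same data:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Choose the model class and hyperparameter: a 1-nearest-neighbor classifier
model = KNeighborsClassifier(n_neighbors=1)

# Fit the model, then predict labels for the very same data it was trained on
model.fit(X, y)
y_pred = model.predict(X)

# For a 1-NN model each training point is its own nearest neighbor,
# so this reports an essentially perfect score (1.0 on the Iris data)
accuracy_score(y, y_pred)

This apparently perfect score is misleading: the model has simply memorized the training data, which is exactly the pitfall that holdout sets and cross-validation are designed to expose.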