Chapter 10. Classification
Classification is a supervised learning mechanism for labeling a sample based on the features. Supervised learning means that we have labels for classification or numbers for regression that the algorithm should learn.
We will look at various classification models in this chapter. Sklearn implements many common and useful models. We will also see some that are not in sklearn, but implement the sklearn interface. Because they follow the same interface, it is easy to try different families of models and see how well they perform.
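As a rough illustration of why a shared interface is convenient, the sketch below loops over three sklearn classifiers and treats them identically. The synthetic dataset and the particular choice of models are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three different model families, all exposing the same methods.
models = [
    LogisticRegression(),
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
]

scores = {}
for model in models:
    model.fit(X_train, y_train)
    # Key each result by the class name of the model.
    scores[type(model).__name__] = model.score(X_test, y_test)

print(scores)
```

Swapping in another classifier that follows the sklearn interface should only require adding it to the list.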
In sklearn, we create a model instance and call the .fit method on it with the training data and training labels. We can then call the .predict method (or the .predict_proba or .predict_log_proba methods) on the fitted model. To evaluate the model, we call the .score method with testing data and testing labels.
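The workflow just described might look like the following minimal sketch. LogisticRegression and the synthetic data from make_classification are assumptions chosen for illustration; any sklearn classifier that supports probability estimates could take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Create a model instance and fit it on the training data and labels.
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict with the fitted model.
preds = model.predict(X_test)                # predicted class per sample
probs = model.predict_proba(X_test)          # probability per class
log_probs = model.predict_log_proba(X_test)  # log of those probabilities

# Evaluate on the held-out testing data and labels.
accuracy = model.score(X_test, y_test)
print(accuracy)
```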
The bigger challenge is usually arranging data in a form that will work with sklearn. The data (X) should be an (m by n) numpy array (or pandas DataFrame) with m rows of sample data, each with n features (columns). The label (y) is a vector (or pandas series) of size m with a value (class) for each sample.
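To make the expected shapes concrete, here is a small sketch that builds X and y from a pandas DataFrame. The column names and values are hypothetical toy data, with m = 4 samples and n = 2 features.

```python
import pandas as pd

# Hypothetical toy data: 4 samples, 2 feature columns, 1 label column.
df = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0, 180.0],
        "weight_kg": [50.0, 60.0, 70.0, 80.0],
        "label": [0, 0, 1, 1],
    }
)

X = df[["height_cm", "weight_kg"]]  # DataFrame with m rows and n columns
y = df["label"]                     # Series with one class value per row

# The same data as numpy arrays, if those are preferred.
X_arr = X.to_numpy()
y_arr = y.to_numpy()

print(X.shape, y.shape)
```

Selecting the features with a list of column names keeps X two-dimensional even when there is only one feature.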
The .score method returns the mean accuracy, which by itself might not be sufficient to evaluate a classifier. We will see other evaluation metrics.
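One way to see why accuracy alone can mislead is to score a baseline that always predicts the majority class on imbalanced labels. The sketch below uses made-up data (95 negatives, 5 positives) and sklearn's DummyClassifier. Recall appears here only as one example of an alternative metric.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up imbalanced labels: 95 negatives and 5 positives.
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # the features are irrelevant to this baseline

# A baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
preds = baseline.predict(X)

acc = baseline.score(X, y)           # mean accuracy: 0.95
same_acc = accuracy_score(y, preds)  # the same quantity, computed explicitly
pos_recall = recall_score(y, preds)  # 0.0, since no positive sample is found

print(acc, pos_recall)
```

The baseline reaches 95% accuracy without identifying a single positive sample.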
We will look at many models and discuss their efficiency, the preprocessing techniques they require, how to prevent overfitting, and whether the model supports intuitive interpretation ...