Chapter 10. Classification

Classification is a supervised learning mechanism for labeling a sample based on its features. Supervised learning means that we have known targets for the algorithm to learn from: labels for classification or numbers for regression.

We will look at various classification models in this chapter. Sklearn implements many common and useful models. We will also see some that are not in sklearn, but implement the sklearn interface. Because they follow the same interface, it is easy to try different families of models and see how well they perform.
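Because the interface is shared, swapping model families can be as simple as looping over instances. The following is a minimal sketch, not a recommendation of these particular models: the iris dataset, the three classifiers, the split ratio, and the random_state values are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed example data: the iris dataset bundled with sklearn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Three different model families, all exposing the same methods.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Identical calls work for every model in the dictionary.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Any other estimator that follows the sklearn interface could be added to the dictionary without changing the loop.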

In sklearn, we create a model instance and call the .fit method on it with the training data and training labels. We can then call the .predict method (or the .predict_proba or .predict_log_proba methods) on the fitted model. To evaluate the model, we call the .score method with the testing data and testing labels.
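The workflow just described can be sketched as follows. This is an illustrative example under stated assumptions: the iris dataset, LogisticRegression as the model, the 70/30 split, and the max_iter setting are choices made here for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed example data: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Create a model instance and fit it on the training data and labels.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict with the fitted model.
labels = model.predict(X_test)                 # one class label per sample
probs = model.predict_proba(X_test)            # one probability per class
log_probs = model.predict_log_proba(X_test)    # log of those probabilities

# Evaluate on the testing data and testing labels.
accuracy = model.score(X_test, y_test)
print(labels[:5], probs[0], accuracy)
```

Note that .predict_proba and .predict_log_proba are not available on every classifier; LogisticRegression happens to provide both.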

The bigger challenge is usually arranging data in a form that will work with sklearn. The data (X) should be an (m by n) numpy array (or pandas DataFrame) with m rows of sample data, each with n features (columns). The label (y) is a vector (or pandas Series) of size m with a value (class) for each sample.
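To make the expected shapes concrete, here is a small sketch using made-up numbers. The values and variable names are hypothetical; only the shapes matter.

```python
import numpy as np

# Hypothetical toy data: m = 4 samples (rows), n = 2 features (columns).
X = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [6.2, 3.4],
    [5.9, 3.0],
])

# One class label per sample, so y has length m.
y = np.array([0, 0, 1, 1])

m, n = X.shape
print(X.shape)  # (4, 2)
print(y.shape)  # (4,)

# The number of labels must match the number of rows in X.
assert y.shape == (m,)

# A single sample is a 1D array of n values; reshape it to
# (1, n) so that it is still an "m by n" array with m = 1.
one_sample = X[0].reshape(1, -1)
print(one_sample.shape)  # (1, 2)
```

The reshape step matters in practice because sklearn estimators expect two-dimensional input even when predicting for a single sample.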

The .score method returns the mean accuracy, which by itself might not be sufficient to evaluate a classifier. We will see other evaluation metrics.
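A short sketch of why accuracy alone can mislead: on imbalanced data, a model that always predicts the majority class scores well while finding none of the minority class. The 95/5 class split below is fabricated for illustration, and DummyClassifier stands in for a poor model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant to this baseline

# A baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

# .score reports mean accuracy, which looks good here.
accuracy = baseline.score(X, y)

# Recall for the positive class shows that no positives were found.
recall = recall_score(y, baseline.predict(X))

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Recall is only one of several alternatives; which metric is appropriate depends on the cost of each kind of error.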

We will look at many models and discuss their efficiency, the preprocessing techniques they require, how to prevent overfitting, and whether the model supports intuitive interpretation ...
