Chapter 2. Evaluation Metrics

Evaluation metrics are tied to machine learning tasks. There are different metrics for the tasks of classification, regression, ranking, clustering, topic modeling, etc. Some metrics, such as precision-recall, are useful for multiple tasks. Classification, regression, and ranking are examples of supervised learning, which constitutes a majority of machine learning applications. We’ll focus on metrics for supervised learning models in this report.

Classification Metrics

Classification is about predicting class labels given input data. In binary classification, there are two possible output classes. In multiclass classification, there are more than two possible classes. I’ll focus on binary classification here, but all of the metrics can be extended to the multiclass scenario.

An example of binary classification is spam detection, where the input data could include the email text and metadata (sender, sending time), and the output label is either “spam” or “not spam.” (See Figure 2-1.) Sometimes, people use generic names for the two classes: “positive” and “negative,” or “class 1” and “class 0.”

There are many ways of measuring classification performance. Accuracy, the confusion matrix, log-loss, and AUC are some of the most popular metrics. Precision-recall is also widely used; I’ll explain it in “Ranking Metrics.”
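To make the four metrics named above concrete before they are discussed in detail, here is a minimal sketch in plain Python. It assumes the standard textbook definitions, not any particular library's API. The function names, the toy labels and scores, and the 0.5 decision threshold are all illustrative assumptions; label 1 stands for "spam" (positive) and 0 for "not spam" (negative).

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels.

    y_prob holds the predicted probability of class 1. Probabilities are
    clipped to [eps, 1 - eps] so that log(0) never occurs.
    """
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

def auc(y_true, y_score):
    """Area under the ROC curve, computed by pairwise comparison.

    This is the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for ps in pos:
        for ns in neg:
            if ps > ns:
                wins += 1.0
            elif ps == ns:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (made up for illustration): true labels and predicted spam probabilities.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]

# Turn probabilities into hard labels with an assumed threshold of 0.5.
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print("confusion (tp, fp, fn, tn):", confusion_counts(y_true, y_pred))
print("accuracy:", accuracy(y_true, y_pred))
print("log-loss:", log_loss(y_true, y_prob))
print("AUC:", auc(y_true, y_prob))
```

Note that accuracy and the confusion matrix operate on hard class labels, while log-loss and AUC operate on the predicted scores, so the latter two do not depend on the chosen threshold.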

Figure 2-1. Email spam detection is ...
