This report on evaluating machine learning models arose out of a sense of need. The content was first published as a series of six technical posts on the Dato Machine Learning Blog. I was the editor of the blog, and I needed something to publish for the next day. Dato builds machine learning tools that help users build intelligent data products. In our conversations with the community, we sometimes ran into confusion over terminology. For example, people would ask for cross-validation as a feature, when what they really meant was hyperparameter tuning, a feature we already had. So I thought, “Aha! I’ll just quickly explain what these concepts mean and point folks to the relevant sections in the user guide.”

So I sat down to write a blog post to explain cross-validation, hold-out datasets, and hyperparameter tuning. After the first two paragraphs, however, I realized that it would take a lot more than a single blog post. The three terms sit at different depths in the concept hierarchy of machine learning model evaluation. Cross-validation and hold-out validation are ways of chopping up a dataset in order to measure the model’s performance on “unseen” data. Hyperparameter tuning, on the other hand, is a more “meta” process of model selection. But why does the model need “unseen” data, and what’s meta about hyperparameters? In order to explain all of that, I needed to start from the basics. First, I needed to explain the high-level concepts and how they fit together. Only then could I dive into each one in detail.
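As a rough illustration of how the three terms relate, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the function names, the toy one-parameter ridge model, and the synthetic data are hypothetical and do not come from the report or from any particular library. Hold-out validation and cross-validation appear as two ways of splitting indices, while hyperparameter tuning is an outer loop that uses one of them to compare candidate settings.

```python
import random


def holdout_split(n, val_fraction=0.25, seed=0):
    """Hold-out validation: one random split into (train, validation) indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = int(n * val_fraction)
    return idx[n_val:], idx[:n_val]


def kfold_splits(n, k=5, seed=0):
    """k-fold cross-validation: yield k (train, validation) index pairs.

    Each data point lands in the validation set exactly once.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def fit_ridge_1d(xs, ys, lam):
    """Toy model (hypothetical): slope w of y ~ w * x with L2 penalty lam."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(w, xs, ys):
    """Mean squared error of the one-parameter model on (xs, ys)."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cv_score(xs, ys, lam, k=5):
    """Average validation error of one hyperparameter value across k folds."""
    scores = []
    for train, val in kfold_splits(len(xs), k):
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(mse(w, [xs[i] for i in val], [ys[i] for i in val]))
    return sum(scores) / len(scores)


def tune(xs, ys, candidates, k=5):
    """Hyperparameter tuning: the 'meta' loop that picks the candidate
    with the lowest cross-validated error."""
    return min(candidates, key=lambda lam: cv_score(xs, ys, lam, k))


# Synthetic data, made up purely for this sketch.
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]

candidates = [0.0, 0.1, 1.0, 10.0]
best_lam = tune(xs, ys, candidates)
```

The point of the sketch is the layering: `holdout_split` and `kfold_splits` only decide which rows count as “unseen,” whereas `tune` sits one level above and calls the evaluation procedure once per candidate setting.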

Machine learning is a child of statistics, computer science, and mathematical optimization. Along the way, it took inspiration from information theory, neuroscience, theoretical physics, and many other fields. Machine learning papers are often full of impenetrable mathematics and technical jargon. To make matters worse, sometimes the same methods were invented multiple times in different fields, under different names. The result is a new language that is unfamiliar even to experts in any one of the originating fields.

As a field, machine learning is relatively young. Large-scale applications of machine learning only started to appear in the last two decades. This aided the development of data science as a profession. Data science today is like the Wild West: there is endless opportunity and excitement, but also a lot of chaos and confusion. Certain helpful tips are known to only a few.

Clearly, more clarity is needed. But a single report cannot possibly cover all of the worthy topics in machine learning. I am not covering problem formulation or feature engineering, which many people consider to be the most difficult and crucial tasks in applied machine learning. Problem formulation is the process of matching a dataset and a desired output to a well-understood machine learning task. This is often trickier than it sounds. Feature engineering is also extremely important. Having good features can make a big difference in the quality of the machine learning models, even more so than the choice of the model itself. Feature engineering takes knowledge, experience, and ingenuity. We will save that topic for another time.

This report focuses on model evaluation. It is for folks who are starting out with data science and applied machine learning. Some seasoned practitioners may also benefit from the latter half of the report, which focuses on hyperparameter tuning and A/B testing. I certainly learned a lot from writing it, especially about how difficult it is to do A/B testing right. I hope it will help many others build measurably better machine learning models!

This report includes new text and illustrations not found in the original blog posts. In Chapter 1, Orientation, there is a clearer explanation of the landscape of offline versus online evaluations, with new diagrams to illustrate the concepts. In Chapter 2, Evaluation Metrics, there’s a revised and clarified discussion of the statistical bootstrap. I added cautionary notes about the difference between training objectives and validation metrics, interpreting metrics when the data is skewed (which always happens in the real world), and nested hyperparameter tuning. Lastly, I added pointers to various software packages that implement some of these procedures. (Soft plugs for GraphLab Create, the library built by Dato, my employer.)

I’m grateful to have been given the opportunity to put it all together into a single report. Blogs do not go through the rigorous process of academic peer review. But my coworkers and the community of readers have made many helpful comments along the way. A big thank you to Antoine Atallah for illuminating discussions on A/B testing. Chris DuBois, Brian Kent, and Andrew Bruce provided careful reviews of some of the drafts. Ping Wang and Toby Roseman found bugs in the examples for classification metrics. Joe McCarthy provided many thoughtful comments, and Peter Rudenko shared a number of new papers on hyperparameter tuning. All the awesome infographics were done by Eric Wolfe and Mark Enomoto; all the average-looking ones were done by me.

If you notice any errors or glaring omissions, please let me know: Better an errata than never!

Last but not least, without the cheerful support of Ben Lorica and Shannon Cutt at O’Reilly, this report would not have materialized. Thank you!
