The preoccupation with test error in applied machine learning

Measure your model’s business impact, not just its accuracy.

By Patrick Hall
May 31, 2016
A close-up of water and soil being splashed by the impact of a single raindrop. (source: U.S. Department of Agriculture on Wikimedia Commons)

“Predictive accuracy on test sets is the criterion for how good the model is.”

– Leo Breiman, Statistical Modeling: The Two Cultures, 2001

The quote above may be one of the most important observations, from one of the most important papers, in data science. So forgive me because I am not worthy, but I propose a reinterpretation of this philosophy for the commercial practice of applied machine learning in 2016. The technology exists now, be it purchased or built in-house, to directly measure the monetary value that a machine learning model is generating. This monetary value should be the criterion for selecting and deploying a commercial machine learning model, not its performance on old, static test data sets.

Simply measuring error on static, standard, or simulated test data belies the true difficulty of making accurate decisions about unknowable future events. In the worst cases, I’ve seen organizations choose models purely based on hype, or the shiny appeal of novelty (often buttressed by a blog post or whitepaper with impressive test data performance). Even in some of the best cases, organizations choose machine learning models based on the results of carefully staged bakeoffs between different techniques on static training and test data from recent past exercises. Even though test error is often our best guess at how well a model will perform in the real world, it can be an overly optimistic estimate of on-line performance. Focusing too narrowly on test accuracy on static, old data can be the business equivalent of overfitting the public leaderboard in a Kaggle competition.

A few problems with test error

Models can over-fit test data, not just training data. An over-fit model learns about the noise and idiosyncrasies in specific data samples, and not about the generalizable knowledge represented by those data samples. Typically, over-fitting means learning too much about a specific training data set. To prevent over-fitting training data, a separate sample of data is often used to test models, usually called test data. If a model accurately predicts the phenomena of interest in the training data, but not in the test data, then the model has probably over-fit the training data. The usual solution to over-fitting training data is to fix something about the modeling process, retrain the model on the same training data, and retest the model’s performance on the same test data. Unfortunately, this is a subtle violation of the scientific method and can lead to over-fitting the test data and overly optimistic test error measurements.
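The effect of reusing one test set can be reproduced in a toy simulation. The sketch below is an illustration rather than anything from a real project: it assumes labels that are pure coin flips, so no model can truly beat 50% accuracy, and it stands in for "fix something and retest" by trying many random candidate models. The function name and its parameters are hypothetical.

```python
import random

def best_of_k_test_accuracy(n_test=200, n_candidates=500, seed=0):
    """Pick the best of many coin-flip 'models' using one reused test set."""
    rng = random.Random(seed)
    # Labels are pure noise, so the true accuracy of every candidate is 50%.
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]
    fresh_labels = [rng.randint(0, 1) for _ in range(n_test)]

    best_preds, best_acc = None, -1.0
    for _ in range(n_candidates):
        # Each candidate stands in for one round of "tweak and retest".
        preds = [rng.randint(0, 1) for _ in range(n_test)]
        acc = sum(p == y for p, y in zip(preds, test_labels)) / n_test
        if acc > best_acc:
            best_preds, best_acc = preds, acc

    # Score the selected candidate on labels it was never selected against.
    fresh_acc = sum(p == y for p, y in zip(best_preds, fresh_labels)) / n_test
    return best_acc, fresh_acc

reused_acc, fresh_acc = best_of_k_test_accuracy()
print(f"accuracy on reused test set: {reused_acc:.3f}")
print(f"accuracy on fresh data:      {fresh_acc:.3f}")
```

The accuracy on the reused test set comes out well above 50% purely because of selection, while the same candidate scored on fresh labels should land near chance.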

Changing experimental hypotheses based on ad-hoc exploratory data analysis is a common and accepted practice in data science. Yet, most guarantees about the validity of statistical inference require that a hypothesis about a data set is chosen before that data set is explored. We often use our human intellect or intuition to improve our modeling processes based on new discoveries in training data. Moreover, we often iterate through cycles of discovery and improvement before deploying a final machine learning model. Once on-line, even the most adaptive machine learning models do not yet have the ability to adjust themselves to new phenomena in new data as well as practitioners can adapt modeling processes during model training. Furthermore, we often do not incorporate the effects of our multiple comparisons and our own confirmation biases into test error measurements.

For a more in-depth discussion of preserving validity in adaptive data analysis, check out this approachable post by Moritz Hardt on the Google Research Blog. For a longer examination of the subtle ways over-fitting can hamstring machine learning models, see John Langford’s excellent KDNuggets post on over-fitting.

Even if we can avoid over-fitting test data, everything in our world continually changes. People’s behavior changes over time. Markets change over time. Promotions are offered and then expire, competitors enter and then leave markets, and consumer tastes are always evolving (or devolving). Machine learning models are often trained on static snapshots of data that reflect market conditions when the snapshot was captured. Often they are tested on static snapshots of data taken at a later date. While techniques like cross-validation allow a machine learning model to be trained and tested on different views and angles of these static snapshots, even cross-validated test error estimates will still be too optimistic if market conditions arise that were not represented in the training or test data. While adaptive machine learning models are gaining commercial acceptance and certainly could lessen concerns over on-line model accuracy degrading over time, in my experience, they are a small minority of commercially deployed models. Also, the initial parameters or rules of these adaptive models usually must be specified somehow, often in an exercise that involves static, old test data. Moreover, the accuracy of adaptive machine learning models in industry is still discussed in terms of mathematical or statistical error measurements.
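To make the snapshot problem concrete, here is a minimal simulated sketch. Everything in it is an assumption made for illustration: a single feature, a one-threshold "model," 10% label noise, and a market shift modeled as the true decision boundary moving from 0.5 to 0.7. All function names are hypothetical.

```python
import random

def make_snapshot(rng, n, boundary, noise=0.1):
    """Simulate (x, y) rows where y = 1 when x exceeds a market-dependent boundary."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > boundary)
        if rng.random() < noise:  # flip a fraction of labels at random
            y = 1 - y
        rows.append((x, y))
    return rows

def accuracy(threshold, rows):
    """Fraction of rows where the threshold rule matches the label."""
    return sum(int(x > threshold) == y for x, y in rows) / len(rows)

def fit_threshold(rows):
    """Choose the grid threshold with the highest accuracy on the given rows."""
    grid = [i / 100 for i in range(101)]
    return max(grid, key=lambda t: accuracy(t, rows))

def drift_demo(seed=0):
    rng = random.Random(seed)
    train = make_snapshot(rng, 2000, boundary=0.5)
    static_test = make_snapshot(rng, 2000, boundary=0.5)  # same conditions as training
    drifted = make_snapshot(rng, 2000, boundary=0.7)      # conditions after the shift
    threshold = fit_threshold(train)
    return accuracy(threshold, static_test), accuracy(threshold, drifted)

static_acc, drifted_acc = drift_demo()
print(f"static test accuracy:  {static_acc:.3f}")
print(f"post-drift accuracy:   {drifted_acc:.3f}")
```

The static test set shares the training snapshot's conditions, so its accuracy says nothing about the shifted data, where the same threshold performs noticeably worse.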

Beyond traditional test error

This gets us to the core of the problem with test error in commercial applications of machine learning. Test error is just an estimate of the on-line predictive accuracy of a model on new data if the market conditions represented by the test data don’t change. It is only a proxy for the measurement we usually care about in industry: does a model make money? Since the technology is available to measure the business value of a model directly, we should probably just start measuring how much money a model is making and keeping track of which models make the most money.

Testing whether a recommendation model actually leads to more purchases has been a best practice for years. The real criterion for a new recommendation approach is to pass an on-line test against a currently deployed model. While the on-line test probably measures the accuracy of the machine learning model behind the recommendations, it’s really measuring the change in revenue caused by the new model. This mindset and technology need to be employed more often in predictive modeling and applied machine learning. For an excellent introduction to on-line tests and deployment techniques for applied machine learning, including A/B testing and multi-armed bandits, check out Alice Zheng’s report Evaluating Machine Learning Models.
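As a rough sketch of what measuring the revenue change in an on-line test can look like, the snippet below compares revenue per visitor between a deployed model (control) and a new model (treatment) and attaches a percentile bootstrap interval to the difference. The revenue logs are made up, the function name is hypothetical, and the bootstrap is just one reasonable choice of analysis, not a method prescribed by the report mentioned above.

```python
import random
from statistics import fmean

def bootstrap_revenue_lift(control, treatment, n_boot=1000, seed=0):
    """Return (lift, lo, hi): treatment-minus-control revenue per visitor
    with an approximate 95% percentile bootstrap interval."""
    rng = random.Random(seed)
    lift = fmean(treatment) - fmean(control)
    diffs = []
    for _ in range(n_boot):
        # Resample each arm with replacement and recompute the difference.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(fmean(t) - fmean(c))
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    return lift, lo, hi

# Hypothetical per-visitor revenue: most visitors buy nothing, some spend 10.
control = [10.0] * 100 + [0.0] * 900
treatment = [10.0] * 200 + [0.0] * 800

lift, lo, hi = bootstrap_revenue_lift(control, treatment)
print(f"lift per visitor: {lift:.2f} (approx. 95% interval {lo:.2f} to {hi:.2f})")
```

If the interval sits clearly above zero, the new model is plausibly earning more per visitor; the quantity being judged is revenue, not prediction error.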

While many analytically advanced and mature organizations are moving toward such capabilities for their predictive modeling and machine learning operations, I still hear a lot of discussions about the accuracy of some machine learning algorithm on some static, old test set. Though it’s certainly much easier said than done, a more fruitful discussion probably revolves around closing the technological feedback loops that would enable an organization to directly monitor the revenue change caused by its machine learning models. Once the wiring is in place, the classic model bakeoff so many organizations use to choose which machine learning model to deploy can be moved on-line, and organizations can use business impact to select the best model for deployment, not just test error.
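One way an on-line bakeoff might be wired up is with a simple multi-armed bandit that routes traffic toward whichever model is earning more. The sketch below uses an epsilon-greedy rule on simulated revenue; the model names, purchase probabilities, and order value are invented for illustration, and a production system would need real revenue attribution in place of the simulated functions.

```python
import random

def epsilon_greedy_bakeoff(revenue_fns, n_rounds=5000, epsilon=0.1, seed=0):
    """Route each round of traffic to one model, favoring the highest
    observed average revenue and exploring at random with probability epsilon."""
    rng = random.Random(seed)
    names = list(revenue_fns)
    pulls = {name: 0 for name in names}
    totals = {name: 0.0 for name in names}
    for _ in range(n_rounds):
        untried = [n for n in names if pulls[n] == 0]
        if untried:
            choice = untried[0]                  # try every model at least once
        elif rng.random() < epsilon:
            choice = rng.choice(names)           # explore
        else:
            choice = max(names, key=lambda n: totals[n] / pulls[n])  # exploit
        totals[choice] += revenue_fns[choice](rng)
        pulls[choice] += 1
    return pulls, totals

def simulated_model(purchase_prob, order_value=10.0):
    """Hypothetical stand-in for a deployed model: revenue from one visitor."""
    return lambda rng: order_value if rng.random() < purchase_prob else 0.0

models = {
    "champion": simulated_model(0.2),
    "challenger": simulated_model(0.5),
}
pulls, totals = epsilon_greedy_bakeoff(models)
for name in models:
    print(name, pulls[name], round(totals[name] / pulls[name], 2))
```

Unlike a fixed A/B split, this rule shifts traffic as evidence accumulates, which limits revenue lost to the weaker model while the comparison is still running.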

If your organization is advanced enough to deploy adaptive machine learning models, consider evaluating what type of adaptive model to use and how it is initialized in an on-line setting and, again, measure the model’s business impact, not just its on-line accuracy. Of course, models selected by on-line business impact will still probably require monitoring, augmentation with business rules, retraining, and eventually replacement. But at least we will be measuring what we really care about and not some potentially problematic proxy.
