Get notified when our free report “Evaluating Machine Learning Models: A beginner’s guide to key concepts and pitfalls” is available for download. Editor’s note: This is an excerpt of “Evaluating Machine Learning Models,” by Alice Zheng.

This report on evaluating machine learning models arose out of a sense of need. The content was first published as a series of six technical posts on the Dato Machine Learning Blog. I was the editor of the blog, and I needed something to publish for the next day. Dato builds machine learning tools that help users build intelligent data products. In our conversations with the community, we sometimes ran into a confusion in terminology. For example, people would ask for cross validation as a feature, when what they really meant was hyperparameter tuning, a feature we already had. So, I thought, “Aha! I’ll just quickly explain what these concepts mean and point folks to the relevant sections in the user guide.”

I sat down to write a blog post to explain cross validation, hold-out data sets, and hyperparameter tuning. After the first two paragraphs, however, I realized that it would take a lot more than a single blog post. The three terms sit at different depths in the concept hierarchy of machine learning model evaluation. Cross validation and hold-out validation are ways of chopping up a data set in order to measure the model’s performance on “unseen” data. Hyperparameter tuning, on the other hand, is a more “meta” process of model selection. But why does the model need “unseen” data, and what’s meta about hyperparameters? In order to explain all of that, I needed to start from the basics. First, I needed to explain the high-level concepts and how they fit together. Only then could I dive into each one in detail.

Machine learning is a child of statistics, computer science, and mathematical optimization. Along the way, it took inspiration from information theory, neural science, theoretical physics, and many other fields. Machine learning papers are often full of impenetrable mathematics and technical jargon. To make matters worse, sometimes the same methods were invented multiple times in different fields, under different names. The result is a new language that is unfamiliar even to experts in one of the originating fields.

As a field, machine learning is relatively young. Large-scale applications of machine learning only started to appear in the last two decades. This aided the development of data science as a profession. Data science today is like the Wild West: there is endless opportunity and excitement, but also a lot of chaos and confusion. Certain helpful tips are known to only a few.

Clearly, more clarity is needed. But a single report cannot possibly cover all of the worthy topics in machine learning. I am not covering problem formulation or feature engineering, which many people consider to be the most difficult and crucial tasks in applied machine learning. Problem formulation is the process of matching a data set and a desired output to a well-understood machine learning task. This is often more tricky than it sounds. Fortunately, many applications today are already matched to a machine learning task. For instance, it is easy to recognize when a problem should be handled using a personalized recommender. Also, companies like Dato are actively building high-level machine learning toolkits that ease or eliminate the burden of picking the right machine learning model out of hundreds.

Feature engineering is also extremely important. Having good features can make a big difference in the quality of the machine learning models, even more so than the choice of the model itself. Feature engineering takes knowledge, experience, and ingenuity. We will save that topic for another time.

This report focuses on model evaluation. It is for folks who are starting out with data science and applied machine learning. Some seasoned practitioners may also benefit from the latter half of the report, which focuses on hyperparameter tuning and A/B testing. I certainly learned a lot from writing it, especially about how difficult it is to do A/B testing right. I hope it will help many others build measurably better machine learning models.

Alice Zheng will be part of the Data Science Summit and Dato Conference in July — a non-profit event jointly organized by Intel, Comcast, Pandora, Dato, Cloudera, and O’Reilly Media — in San Francisco. Visit the conference website for more information on the program. Use the discount code OREILLY20 and get 20% off either one or both days of the conference.