Chapter 11. Data Leakage

In “Leakage in Data Mining: Formulation, Detection, and Avoidance,” Shachar Kaufman et al. (2012) identify data leakage as one of the top 10 most common problems in data science. In my experience, it should rank even higher: if you have trained enough real-life models, it’s unlikely you haven’t encountered it.

This chapter discusses data leakage, some of its symptoms, and what you can do about it.

What Is Data Leakage?

As the name suggests, data leakage occurs when some of the data used for training a model isn’t available when you deploy your model into production, creating subpar predictive performance in the latter stage. This usually happens when you train a model using data or metadata that:

  • Won’t be available at the prediction stage

  • Is correlated with the outcome you want to predict

  • Creates unrealistically high test-sample predictive performance

The last item explains why leakage is a source of concern and frustration for data scientists: when you train a model, absent any data or model drift, you expect the predictive performance on the test sample to carry over to the real world once you deploy the model in production. This won’t be the case if you have data leakage, and you, along with your stakeholders and the company, will suffer a big disappointment.
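To make the definition concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the data-generating process, the names x (a legitimate feature), z (a leaky feature that is only recorded after the outcome), and the choice to fill the missing z with its training mean in production. It uses plain least squares with NumPy rather than any particular modeling library.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible


def simulate(n):
    """Draw n observations from a hypothetical data-generating process."""
    x = rng.normal(size=n)                 # legitimate feature, known before prediction
    y = 2.0 * x + rng.normal(size=n)       # outcome to predict
    z = y + rng.normal(scale=0.1, size=n)  # leaky feature, only recorded after the outcome
    return x, z, y


def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients to a 2D feature matrix."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


def r2(y, y_hat):
    """Coefficient of determination."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)


x_tr, z_tr, y_tr = simulate(2000)  # training sample
x_te, z_te, y_te = simulate(2000)  # test sample (z is still present here)
x_pr, _, y_pr = simulate(2000)     # production: z has not been recorded yet

leaky = fit_ols(np.column_stack([x_tr, z_tr]), y_tr)  # trained with the leaky feature
clean = fit_ols(x_tr.reshape(-1, 1), y_tr)            # trained without it

r2_test_leaky = r2(y_te, predict(leaky, np.column_stack([x_te, z_te])))

# Assumption: in production the unavailable z is filled with its training mean.
z_fill = np.full_like(x_pr, z_tr.mean())
r2_prod_leaky = r2(y_pr, predict(leaky, np.column_stack([x_pr, z_fill])))

r2_test_clean = r2(y_te, predict(clean, x_te.reshape(-1, 1)))
r2_prod_clean = r2(y_pr, predict(clean, x_pr.reshape(-1, 1)))

print(f"leaky model: test R2 = {r2_test_leaky:.3f}, production R2 = {r2_prod_leaky:.3f}")
print(f"clean model: test R2 = {r2_test_clean:.3f}, production R2 = {r2_prod_clean:.3f}")
```

Under these assumptions, the model trained with z looks almost perfect on the test sample, but its performance collapses once z is no longer available. The model trained without z scores lower on the test sample, yet that score is roughly what you get in production.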

Let’s go through several examples to clarify this definition.

Outcome Is Also a Feature

This is a trivial example, but it serves as a benchmark for more realistic ones. ...
