August 2018
Intermediate to advanced
378 pages
9h 9m
English
Data leakage is where a feature used to train the model has values that could not exist if the model was used in production. It occurs most frequently in time series data. For example, in our churn use case in Chapter 4, Training Deep Prediction Models, there were a number of categorical variables in the data that indicated customer segmentation. A data modeler may assume that these are good predictor variables, but it is not known how and when these variables were set. They could be based on customer' spend, which means that if they are used in the prediction algorithm, there is a circular reference—an external process calculates the segment based on the spend and then this variable is used to predict spend!
When extracting ...