Chapter 2: Data Collection, Sampling, and Preprocessing


Data is a key ingredient of any analytical exercise. It is therefore important to thoroughly consider and list all data sources that are potentially relevant before starting the analysis. Large experiments, as well as broad experience in different fields, indicate that when it comes to data, bigger is better (see de Fortuny, Martens, & Provost, 2013). However, real-life data can be (and typically is) dirty because of inconsistencies, incompleteness, duplication, merging problems, and many other issues. Throughout the analytical modeling steps, various data-filtering mechanisms will therefore be applied to clean up the data and reduce it to a manageable and relevant size. Worth mentioning here is the garbage in, garbage out (GIGO) principle, which essentially states that messy data will yield messy analytical models. It is thus of utmost importance that every data preprocessing step is carefully justified, carried out, validated, and documented before proceeding with further analysis. Even a slight mistake can make the data unusable for further analysis and invalidate the results. In what follows, we elaborate on the most important data preprocessing steps that should be considered during an analytical modeling exercise to build a fraud detection model. But first, let us have a closer look at what data to gather.
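
To make two of the problems named above concrete (incompleteness and duplication), a first-pass data quality check might look like the following minimal sketch. The function name `profile_records`, the field names, and the toy claims data are all hypothetical and chosen only for illustration; a real fraud data set would need checks tailored to its own schema.

```python
from collections import Counter


def profile_records(records, key_fields):
    """Count missing values per field and duplicated keys in a list of dicts.

    Illustrative only: a value counts as "missing" if it is None or an
    empty string, and two records count as duplicates if they agree on
    every field listed in key_fields.
    """
    missing = Counter()
    for rec in records:
        for field, value in rec.items():
            if value is None or value == "":
                missing[field] += 1

    # Build one key tuple per record and keep the keys seen more than once.
    key_counts = Counter(tuple(rec.get(f) for f in key_fields) for rec in records)
    duplicate_keys = {key: n for key, n in key_counts.items() if n > 1}

    return {
        "n_records": len(records),
        "missing": dict(missing),
        "duplicate_keys": duplicate_keys,
    }


# Hypothetical toy data: one repeated claim and two kinds of missing value.
claims = [
    {"claim_id": 1, "customer": "A", "amount": 120.0},
    {"claim_id": 2, "customer": "B", "amount": None},
    {"claim_id": 2, "customer": "B", "amount": None},
    {"claim_id": 3, "customer": "", "amount": 75.5},
]

report = profile_records(claims, key_fields=["claim_id"])
print(report)
```

Writing the counts to a report, rather than silently dropping rows, keeps each cleaning decision visible so that it can be justified and documented before any record is removed.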

Types of Data Sources

Data can originate from a ...

