Chapter 33. Rethinking the “Get the Data” Step
Phil Bangayan
My key responsibility as a principal data scientist is creating accurate models, which involves getting appropriate data. Getting data occurs early in the data science process that was taught to me and to all aspiring data scientists, both today and as far back as the late 1990s in the form of CRISP-DM (Cross-Industry Standard Process for Data Mining). After practicing on both the client and vendor sides, I have learned that this step receives insufficient attention. That neglect exposes data scientists to traps: not understanding where the data comes from, misusing data collected for a different purpose, or using proxy data in a possibly unethical manner.
The data science process I learned is similar to the one documented by Joe Blitzstein and Hanspeter Pfister at Harvard: (1) ask an interesting question, (2) get the data, (3) explore the data, (4) model the data, and (5) communicate and visualize the results. CRISP-DM, a similar process dating back to 1997 and prominent in customer relationship management, includes the following steps: (1) business understanding, (2) data understanding, (3) data preparation, (4) modeling, (5) evaluation, and (6) deployment. In both of these frameworks, getting the data is the second step and affects every step that follows. Having the wrong data at ...