Chapter 9. Considerations for the Data Scientist

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data in structured, semi-structured, and unstructured forms.

Data scientists live at the intersection of science and statistics. They care about the fifth “V” of the “five Vs” of data science: value. The results of their work usually drive business decisions and predictive problem solving. For example, a data scientist might analyze a broad range of structured and unstructured customer data from multiple sources to predict when a customer is at risk of churning.

Pradeep Reddy, a solutions architect at Qubole, says, “If I’m a telecom provider, if I can actually predict a customer who will churn three months from now, I could take some corrective actions.” He continues: “I could then send him a flier or offer a promotion in an attempt to retain him or her.”

It’s important in such cases for data scientists to identify the “white spaces” in their data. For example, if you depend on geolocation data to power an application that tracks consumer behavior, your coverage depends on whether each consumer has opted out of geolocation. Suppose you are collecting this data from customers’ smartphones: you’re fine until they power down. Then, you don’t know where they went. What they bought. Where they ate lunch. There’s a hole in your data. That’s when you turn to third-party ...
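
One way to make such white spaces visible is to look for unusually long silences between consecutive location pings from a device. The following is a minimal sketch in Python, not a method described by the authors: the function name `find_gaps`, the 30-minute threshold, and the sample timestamps are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_interval=timedelta(minutes=30)):
    """Return (start, end) pairs where consecutive pings are more than
    max_interval apart. The 30-minute default is an arbitrary assumption."""
    ordered = sorted(timestamps)
    gaps = []
    # Compare each ping with the one that follows it.
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > max_interval:
            gaps.append((earlier, later))
    return gaps


# Hypothetical pings from one customer's smartphone (made-up data).
pings = [
    datetime(2024, 1, 15, 9, 0),
    datetime(2024, 1, 15, 9, 10),
    datetime(2024, 1, 15, 9, 20),
    datetime(2024, 1, 15, 13, 5),   # first ping after the phone was off
    datetime(2024, 1, 15, 13, 15),
]

gaps = find_gaps(pings)
for start, end in gaps:
    print(f"No location data between {start} and {end}")
```

Each reported interval marks a period in which the customer's behavior is unknown, which is the kind of hole that would need to be filled from another source or accounted for in the analysis.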
