Chapter 1. Introduction
Data is the new oil. There has been exponential growth in the amount of structured, semi-structured, and unstructured data collected within enterprises. Insights extracted from data are becoming a valuable differentiator for enterprises in every industry vertical, and machine learning (ML) models are used both in product features and to improve business processes.
Enterprises today are data-rich, but insights-poor. Gartner predicts that 80% of analytics insights will not deliver business outcomes through 2022. Another study highlights that 87% of data projects never make it to production deployment. Sculley et al. from Google show that less than 5% of the effort of implementing ML in production is spent on the actual ML algorithms (as illustrated in Figure 1-1). The remaining 95% of the effort is spent on data engineering related to discovering, collecting, and preparing data, as well as building and deploying the models in production.
While an enormous amount of data is being collected within data lakes, it may not be consistent, interpretable, accurate, timely, standardized, or sufficient. Data scientists spend a significant amount of time on engineering activities such as aligning systems for data collection, defining metadata, wrangling data to feed ML algorithms, and deploying pipelines and models at scale. These activities fall outside their core insight-extracting skills and are bottlenecked by dependency on data engineers and platform IT ...