Chapter 1. Introduction

Data is the new oil. There has been exponential growth in the amount of structured, semi-structured, and unstructured data collected within enterprises. Insights extracted from data are becoming a valuable differentiator for enterprises in every industry vertical, and machine learning (ML) models now power product features as well as improvements to business processes.

Enterprises today are data-rich, but insights-poor. Gartner predicts that 80% of analytics insights will not deliver business outcomes through 2022. Another study highlights that 87% of data projects never make it to production deployment. Sculley et al. from Google show that less than 5% of the effort of implementing ML in production is spent on the actual ML algorithms (as illustrated in Figure 1-1). The remaining 95% of the effort is spent on data engineering related to discovering, collecting, and preparing data, as well as building and deploying the models in production.

While an enormous amount of data is being collected within data lakes, it may not be consistent, interpretable, accurate, timely, standardized, or sufficient. Data scientists spend a significant amount of time on engineering activities related to aligning systems for data collection, defining metadata, wrangling data to feed ML algorithms, deploying pipelines and models at scale, and so on. These activities fall outside their core insight-extracting skills and are bottlenecked by dependencies on data engineers and platform IT engineers, who typically lack the necessary business context. The engineering complexity limits data accessibility to data analysts and scientists rather than democratizing it for a growing number of data citizens in product management, marketing, finance, engineering, and so on. While there are plenty of books on advances in ML programming and deep dives into specific data technologies, little has been written about the operational patterns of data engineering required to develop a self-service platform that supports a wide spectrum of data users.

Figure 1-1. The study by Sculley et al. analyzed the time spent getting ML models to production: ML code accounted for 5% of the effort, while the remaining 95% went to the boxes related to data engineering activities.

Several enterprises have identified the need to automate and make the journey from data to insight self-service. Google’s TensorFlow Extended (TFX), Uber’s Michelangelo, and Facebook’s FBLearner Flow are examples of self-service platforms for developing ML insights. There is no silver bullet strategy that can be adopted universally. Each enterprise is unique in terms of existing technology building blocks, dataset quality, types of use cases supported, processes, and people skills. For instance, creating a self-service platform for a handful of expert data scientists developing ML models using clean datasets is very different from a platform supporting heterogeneous data users using datasets of varying quality with homegrown tools for ingestion, scheduling, and other building blocks.

Despite significant investments in data technologies, there are three reasons, based on my experience, why self-service data platform initiatives either fail to take off or lose steam midway during execution:

Real pain points of data users getting lost in translation
Data users and data platform engineers speak different languages. Data engineers do not have the context of the business problem and the pain points encountered in the journey map. Data users do not understand the limitations and realities of big data technologies. This leads to finger-pointing and problems being thrown over the wall between teams, without a durable solution.
Adopting “shiny” new technology for the sake of technology
Given the plethora of solutions, teams often invest in the next “shiny” technology without clearly understanding the issues slowing down the journey map of extracting insights. Oftentimes, enterprises end up investing in technology for the sake of technology, without reducing the overall time to insight.
Tackling too much during the transformation process
Multiple capabilities make a platform self-service. Teams often aim to work on all aspects concurrently, which is analogous to boiling the ocean. Instead, developing self-service data platforms should be like developing self-driving cars, which have different levels of self-driving capabilities that vary in level of automation and implementation complexity.

Journey Map from Raw Data to Insights

Traditionally, a data warehouse aggregated data from transactional databases and generated retrospective batch reports. Warehousing solutions were typically packaged and sold by a single vendor with integrated features for metadata cataloging, query scheduling, ingestion connectors, and so on. The query engine and data storage were coupled together with limited interoperability choices. In the big data era today, the data platform is a patchwork of different datastores, frameworks, and processing engines supporting a wide range of data properties and insight types. There are many technology choices across on-premise, cloud, or hybrid deployments, and the decoupling of storage and compute has enabled mixing and matching of datastores, processing engines, and management frameworks. The mantra in the big data era is using the “right tool for the right job” depending on data type, use case requirements, sophistication of the data users, and interoperability with deployed technologies. Table 1-1 highlights the key differences.

Table 1-1. The key differences in extracting insights from traditional data warehouses compared to the modern big data era
  | Extracting insights in the data warehousing era | Extracting insights in the big data era
Data formats | Structured data | Structured, semi-structured, and unstructured data
Data characteristics | High-volume data | 4 Vs of data: volume, velocity, variety, and veracity
Cataloging data | Defined at the time of aggregating data | Defined at the time of reading data
Freshness of insights | Insights are mainly retrospective (e.g., what happened in the business last week) | Insights are a combination of retrospective, interactive, real-time, and predictive
Query processing approach | Query processor and data storage coupled together as a single solution | Decoupled query processing and data storage
Data services | Integrated as a unified solution | Mix-and-match, allowing many permutations for selecting the right tool for the job

The journey map for developing any insight can be divided into four key phases: discover, prep, build, and operationalize (as shown in Figure 1-2). To illustrate the journey map, consider the example of building a real-time business insights dashboard that tracks revenue, marketing campaign performance, customer sign-ups and attrition, and so on. The dashboard also includes an ML forecasting model for revenue across different geographic locations.

Figure 1-2. The journey map for extracting insights from raw data.

Discover

Any insights project starts with discovering available datasets and artifacts, as well as collecting any additional data required for developing the insight. Data discovery is complex because knowledge does not scale well within the enterprise. Data teams typically start small with team knowledge that is easily accessible and reliable. As data grows and teams scale, silos are created across business lines, leading to no single source of truth. Data users today need to effectively navigate a sea of data resources of varying quality, complexity, relevance, and trustworthiness. In the example of the real-time business dashboard and revenue forecasting model, the starting point for data users is to understand the metadata of commonly used datasets, namely customer profiles, login logs, billing datasets, pricing and promotions, and so on.

Discovering a dataset’s metadata details

The first milestone is understanding the metadata properties, such as where the data originated, how the data attributes were generated, and so on. Metadata also plays a key role in determining the quality and reliability of the data. For instance, if the model is built using a table that is not populated correctly or has bugs in its data pipelines, the resulting model will be incorrect and unreliable. Data users start with team knowledge available from other users, which can be outdated and unreliable. Gathering and correlating metadata requires access to datastores, ingestion frameworks, schedulers, metadata catalogs, compliance frameworks, and so on. There is no standardized format to track the metadata of a dataset as it is collected and transformed. The time taken to complete this milestone is tracked by the metric time to interpret.
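
To make this concrete, here is a minimal sketch of the kind of record a metadata catalog might maintain for the billing dataset. The field names are illustrative assumptions, not a standard schema.

    from dataclasses import dataclass, field
    from datetime import datetime

    # Illustrative sketch only: field names are assumptions, not a standard schema.
    @dataclass
    class DatasetMetadata:
        name: str                       # e.g., "billing.invoices"
        source_system: str              # where the data originated
        owner_team: str                 # who to contact with questions
        last_updated: datetime          # freshness of the latest partition
        lineage: list = field(default_factory=list)          # upstream tables and pipelines
        attribute_docs: dict = field(default_factory=dict)   # how each attribute is derived
        team_tags: list = field(default_factory=list)        # tribal knowledge and caveats

    billing = DatasetMetadata(
        name="billing.invoices",
        source_system="postgres://billing-db",
        owner_team="payments-eng",
        last_updated=datetime(2021, 3, 1),
        lineage=["raw.billing_events", "etl.invoice_rollup"],
        attribute_docs={"amount_usd": "sum of line items, converted to USD at invoice time"},
        team_tags=["trial customers have amount_usd = NULL"],
    )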

Searching available datasets and artifacts

With the ability to understand a dataset’s metadata details, the next milestone is to find all the relevant datasets and artifacts, namely views, files, streams, events, metrics, dashboards, ETLs, and ad hoc queries. In a typical enterprise, there are thousands to millions of datasets. As an extreme example, Google has 26 billion datasets. Depending on the scale, data users can take days or weeks to identify the relevant details. Today, search relies heavily on the team knowledge of data users and on reaching out to application developers. The available datasets and artifacts are continuously evolving and need to be continuously refreshed. The time taken to complete this milestone is tracked by the metric time to find.

Reusing or creating features for ML models

Continuing the example, developing the revenue forecasting model requires training on historic revenue values by market, product line, and so on. Attributes like revenue that are an input to the ML model are referred to as features. An attribute can be used as a feature if historic values are available. In the process of building ML models, data scientists iterate on feature combinations to generate the most accurate model. Data scientists spend 60% of their time creating training datasets to generate features for ML models. Reusing existing features can radically reduce the time to develop ML models. The time taken to complete this milestone is tracked by the metric time to featurize.
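
As a hedged illustration, the following sketch derives simple historic features (previous month and a rolling average) from monthly revenue using pandas; the column names and feature choices are assumptions made for the forecasting example.

    import pandas as pd

    # Hypothetical input: one row per market and month with revenue.
    revenue = pd.DataFrame({
        "market": ["US", "US", "US", "EU", "EU", "EU"],
        "month": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01"] * 2),
        "revenue": [120.0, 135.0, 150.0, 80.0, 78.0, 90.0],
    })

    # Derive historic features per market: previous month's revenue and a
    # 3-month rolling mean. These become reusable features for the forecasting model.
    revenue = revenue.sort_values(["market", "month"])
    revenue["revenue_prev_month"] = revenue.groupby("market")["revenue"].shift(1)
    revenue["revenue_3mo_avg"] = (
        revenue.groupby("market")["revenue"]
        .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
    )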

Aggregating missing data

For creating the business dashboard, the identified datasets (such as customer activity and billing records) need to be joined to generate the insight of retention risk. Datasets sitting across different application silos often need to be moved into a centralized repository like a data lake. Moving data involves orchestrating the data movement across heterogeneous systems, verifying data correctness, and adapting to any schema or configuration changes that occur on the data source. Once the insights are deployed in production, the data movement is an ongoing task and needs to be managed as part of the pipeline. The time taken to complete this milestone is tracked by the metric time to data availability.
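
A minimal sketch of one such movement task, assuming a hypothetical billing database and lake path, might look like the following; a data movement service automates this, monitors it, and adapts to source changes on an ongoing basis.

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical connection string, table, and lake path.
    source = create_engine("postgresql://user:pass@billing-db:5432/billing")

    # Pull the source table and land it in the lake as Parquet.
    df = pd.read_sql("SELECT * FROM invoices", source)
    df.to_parquet("/data-lake/raw/billing/invoices_2021-03-01.parquet")

    # Minimal correctness check: compare row counts between source and the lake copy.
    source_count = pd.read_sql("SELECT COUNT(*) AS n FROM invoices", source)["n"][0]
    assert len(df) == source_count, "row count mismatch between source and lake copy"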

Managing clickstream events

In the business dashboard, assume we want to analyze the most time-consuming workflows within the application. This requires analyzing the customer’s activity in terms of clicks, views, and related context, such as previous application pages, the visitor’s device type, and so on. To track the activity, data users may leverage existing instrumentation within the product that records the activity or add additional instrumentation to record clicks on specific widgets, like buttons. Clickstream data needs to be aggregated, filtered, and enriched before it can be consumed for generating insights. For instance, bot-generated traffic needs to be filtered out of raw events. Handling a high volume of stream events is extremely challenging, especially in near real-time use cases such as targeted personalization. The time taken to complete this milestone of collecting, analyzing, and aggregating behavioral data is tracked by the metric time to click metrics.
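
As a rough sketch, the following Python snippet filters out bot traffic and keeps the enriched context needed for each click event; the event fields and bot heuristics are illustrative assumptions.

    # Minimal clickstream enrichment sketch: drop bot traffic, keep dashboard context.
    BOT_MARKERS = ("bot", "spider", "crawler")

    def is_bot(event):
        agent = event.get("user_agent", "").lower()
        return any(marker in agent for marker in BOT_MARKERS)

    def enrich(event):
        return {
            "visitor_id": event["visitor_id"],
            "widget": event.get("widget"),           # e.g., the button that was clicked
            "page": event.get("page"),
            "previous_page": event.get("referrer"),
            "device_type": event.get("device_type"),
            "timestamp": event["timestamp"],
        }

    def process(raw_events):
        return [enrich(e) for e in raw_events if not is_bot(e)]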

Prep

The preparation phase focuses on getting the data ready for building the actual business logic to extract insights. Preparation is an iterative, time-intensive task that includes aggregating, cleaning, standardizing, transforming, and denormalizing data. It involves multiple tools and frameworks. The preparation phase also needs to ensure data governance in order to meet regulatory compliance requirements.

Managing aggregated data within a central repository

Continuing with the example, the data required for the business dashboard and forecasting model is now aggregated within a central repository (commonly referred to as a data lake). The business dashboard needs to combine historic batch data as well as streaming behavioral data events. The data needs to be efficiently persisted with respect to data models and on-disk format. Similar to traditional data management, data users need to ensure access control, backup, versioning, ACID properties for concurrent data updates, and so on. The time taken to complete this milestone is tracked by the metric time to data lake management.
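
The following is a minimal sketch, assuming a hypothetical lake layout, of persisting aggregated data in a columnar on-disk format partitioned for efficient reads; ACID guarantees for concurrent updates typically require an additional table format layered on top of the lake.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lake-management-sketch").getOrCreate()

    # Hypothetical layout: read raw billing events, then persist them in a columnar
    # format partitioned by date so downstream queries read only the partitions they need.
    raw = spark.read.json("/data-lake/raw/billing_events/")
    (raw.write
        .mode("append")
        .partitionBy("billing_date")
        .parquet("/data-lake/curated/billing/"))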

Structuring, cleaning, enriching, and validating data

With the data now aggregated in the lake, we need to make sure that the data is in the right form. For instance, assume the records in the billing dataset have a null billing value for trial customers. As a part of the structuring, the nulls will be explicitly converted to zeroes. Similarly, there can be outliers in the usage of select customers that need to be excluded to prevent skewing the overall engagement analysis. These activities are referred to as data wrangling. Applying wrangling transformations requires writing idiosyncratic scripts in programming languages such as Python, Perl, and R, or engaging in tedious manual editing. Given the growing volume, velocity, and variety of the data, data users need low-level coding skills to apply these transformations at scale in an efficient, reliable, and recurring fashion; the transformations are not one-time but need to be applied reliably in an ongoing fashion. The time taken to complete this milestone is tracked by the metric time to wrangle.
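
For the billing example above, a minimal wrangling sketch in pandas (with assumed paths and column names) could look like this:

    import pandas as pd

    billing = pd.read_parquet("/data-lake/curated/billing/")   # hypothetical path

    # Structuring: trial customers have null billing amounts; make the zero explicit.
    billing["amount_usd"] = billing["amount_usd"].fillna(0.0)

    # Cleaning: drop extreme usage outliers (here, above the 99th percentile) so a
    # handful of accounts does not skew the engagement analysis. The threshold is
    # an illustrative choice, not a rule.
    cutoff = billing["monthly_usage"].quantile(0.99)
    billing = billing[billing["monthly_usage"] <= cutoff]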

Ensuring data rights compliance

Assume that the customer has not given consent to use their behavioral data for generating insights. Data users need to understand which customers’ data can be used for which use cases. Compliance is a balancing act between better serving the customer experience with insights and ensuring the data is used in accordance with the customer’s directives. There are no simple heuristics that can be universally applied to solving this problem. Data users want an easy way to locate all the available data for a given use case, without having to worry about compliance violations. There is no single identifier for tracking applicable customer data across the silos. The time taken to complete this milestone is tracked by the metric time to comply.
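
As an illustrative sketch only (the consent model and field names are assumptions), a use case might filter records to those customers who have opted in:

    # Only use behavioral data from customers who have consented to the given use case.
    def allowed_for_use_case(record, use_case):
        return use_case in record.get("consented_use_cases", set())

    events = [
        {"customer_id": 1, "consented_use_cases": {"analytics", "personalization"}},
        {"customer_id": 2, "consented_use_cases": set()},
    ]

    usable = [e for e in events if allowed_for_use_case(e, "analytics")]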

Build

During the build phase, the focus is on writing the actual logic required for extracting the insight. The following are the key milestones for this phase.

Deciding the best approach for accessing and analyzing data

A starting point to the build phase is deciding on a strategy for writing and executing the insights logic. Data in the lake can be persisted as objects or stored in specialized serving layers, namely key-value stores, graph databases, document stores, and so on. Data users need to decide whether to leverage the native APIs and query dialects of the datastores, and decide on the query engine for the processing logic. For instance, short, interactive queries are run on Presto clusters, while long-running batch processes are run on Hive or Spark. Ideally, the transformation logic should be agnostic to these choices and should not change when data is moved to a different polyglot store or when a different query engine is deployed. The time taken to complete this milestone is tracked by the metric time to virtualize.

Writing transformation logic

The actual logic for the dashboard or model insight is written following either an Extract-Transform-Load (ETL), Extract-Load-Transform (ELT), or streaming analysis pattern. Business logic needs to be translated into code that is performant and scalable, as well as easy to manage as requirements change. The logic needs to be monitored for availability, quality, and change management. The time taken to complete this milestone is tracked by the metric time to transform.
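
A hedged sketch of the batch ETL pattern for the dashboard, using Spark with assumed paths and column names, is shown below:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dashboard-etl-sketch").getOrCreate()

    # Extract: read the curated billing data from the lake (hypothetical path and schema).
    billing = spark.read.parquet("/data-lake/curated/billing/")

    # Transform: compute revenue per market and product line for the dashboard.
    revenue = (billing
               .groupBy("market", "product_line")
               .agg(F.sum("amount_usd").alias("revenue")))

    # Load: persist the result to a serving table read by the dashboard.
    revenue.write.mode("overwrite").parquet("/data-lake/serving/revenue_by_market/")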

Training the models

For the revenue forecasting example, an ML model needs to be trained using historic revenue values. With growing dataset sizes and complicated deep learning models, training can take days or weeks. Training is run on a farm of servers consisting of a combination of CPUs and specialized hardware such as GPUs. Training is iterative, applying hundreds of permutations of model parameter and hyperparameter values to find the best model. Model training is not one-time; models need to be retrained as data properties change. The time taken to complete this milestone is tracked by the metric time to train.
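
As a small, hedged example of this iterative search over hyperparameter values, the following uses scikit-learn on synthetic data; production training distributes a far larger search across many machines.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import GridSearchCV

    # Hypothetical training data: historic feature values (X) and revenue (y).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=500)

    # Iterate over a small grid of hyperparameter values; real searches explore
    # hundreds of permutations.
    search = GridSearchCV(
        GradientBoostingRegressor(),
        param_grid={"n_estimators": [100, 300], "learning_rate": [0.05, 0.1], "max_depth": [2, 3]},
        cv=3,
        scoring="neg_mean_absolute_error",
    )
    search.fit(X, y)
    best_model = search.best_estimator_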

Continuously integrating ML model changes

Assume in the business dashboard example that there is a change in the definition of how active subscribers are calculated. ML model pipelines are continuously evolving with changes to source schemas, feature logic, dependent datasets, data processing configurations, and model algorithms. Similar to traditional software engineering, ML models are constantly updated, with multiple changes made daily across teams. To integrate the changes, the data, code, and configuration associated with ML pipelines are tracked. Changes are verified by deploying in a test environment and using production data. The time taken to complete this milestone is tracked by the metric time to integrate.
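
One way such verification might be expressed, sketched here with stand-in training and holdout steps, is an automated test that gates every pipeline change on a quality threshold; the metric and threshold are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error

    # Stand-ins for the pipeline's real training and holdout-loading steps.
    def train_candidate_model():
        X = np.arange(100).reshape(-1, 1).astype(float)
        y = 2.0 * X.ravel()
        return LinearRegression().fit(X, y)

    def load_holdout_sample():
        X = np.arange(100, 120).reshape(-1, 1).astype(float)
        return X, 2.0 * X.ravel()

    def test_forecast_model_quality():
        model = train_candidate_model()
        X_holdout, y_holdout = load_holdout_sample()
        error = mean_absolute_error(y_holdout, model.predict(X_holdout))
        assert error < 5_000, f"candidate model regressed: MAE={error:.0f}"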

A/B testing of insights

Consider a different example of an ML model that forecasts home prices for end customers. Assume there are two equally accurate models developed for this insight—which one is better? A growing practice within most enterprises is deploying multiple models and presenting them to different sets of customers. Based on behavioral data of customer usage, the goal is to select the better model. A/B testing—also known as bucket testing, split testing, or controlled experiment—is becoming a standard approach to making data-driven decisions. It is critical to integrate A/B testing as a part of the data platform to ensure consistent metrics definitions are applied across ML models, business reporting, and experimentation. Configuring A/B testing experiments correctly is nontrivial: there must be no imbalance that would result in a statistically significant difference in a metric of interest across the variant populations, and customers must not be exposed to interactions between variants of different experiments. The time taken to complete this milestone is tracked by the metric time to A/B test.
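
As a simplified sketch of the statistical comparison behind an A/B test (with made-up conversion counts), a two-proportion z-test checks whether the observed difference between variants is significant:

    from statsmodels.stats.proportion import proportions_ztest

    # Hypothetical experiment results: conversions and visitors per variant.
    conversions = [530, 584]      # variant A, variant B
    visitors = [10_000, 10_000]

    # Two-proportion z-test: is the difference in conversion rate statistically
    # significant, or could it be explained by chance?
    z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
    if p_value < 0.05:
        print("Difference is statistically significant; prefer the better variant.")
    else:
        print("No significant difference detected; keep collecting data.")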

Operationalize

In the operationalize phase of the journey map, the insight is deployed in production. This phase is ongoing for as long as the insight is actively used in production.

Verifying and optimizing queries

Continuing the example of the business dashboard and revenue forecasting model, data users have written the data transformation logic either as SQL queries or big data programming models (such as Apache Spark or Beam) implemented in Python, Java, Scala, and so on. The difference between good and bad queries is quite significant; based on actual experiences, a query running for a few hours can be tuned to complete in minutes. Data users need to understand the multitude of knobs in query engines such as Hadoop, Spark, and Presto. Understanding which knobs to tune and their impact is nontrivial for most data users and requires a deep understanding of the inner workings of the query engines. There are no silver bullets—the optimal knob values for the query vary based on data models, query types, cluster sizes, concurrent query load, and so on. As such, query optimization is an ongoing activity. The time taken to complete this milestone is tracked by the metric time to optimize.
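
For example, a few commonly tuned Spark settings are shown below; the values are illustrative starting points rather than recommendations, since the optimal settings depend on the workload.

    from pyspark.sql import SparkSession

    # A few of the many knobs that affect query runtime. Good values depend on data
    # model, query shape, cluster size, and concurrent load.
    spark = (SparkSession.builder
             .appName("tuned-dashboard-query")
             .config("spark.sql.shuffle.partitions", "400")       # parallelism of shuffle stages
             .config("spark.executor.memory", "8g")               # memory per executor
             .config("spark.sql.autoBroadcastJoinThreshold",      # broadcast small join sides
                     64 * 1024 * 1024)
             .getOrCreate())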

Orchestrating pipelines

The queries associated with the business dashboard and forecasting pipelines need to be scheduled. What is the optimal time to run the pipeline? How do we ensure the dependencies are correctly handled? Orchestration is a balancing act of ensuring pipeline service level agreements (SLAs) and efficient utilization of the underlying resources. Pipelines invoke services across ingestion, preparation, transformation, training, and deployment. Data users need to monitor and debug pipelines for correctness, robustness, and timeliness across these services, which is nontrivial. Orchestration of pipelines is multitenant, supporting multiple teams and business use cases. The time taken to complete this milestone is tracked by the metric time to orchestrate.
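
A minimal orchestration sketch using Apache Airflow (with placeholder task callables and an assumed nightly schedule) might express the dashboard pipeline and its dependencies as follows:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Task callables are placeholders for the real ingestion/transformation/training steps.
    def ingest_billing(): ...
    def build_dashboard_tables(): ...
    def retrain_forecast_model(): ...

    with DAG(
        dag_id="revenue_dashboard_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="0 2 * * *",   # run nightly at 2 a.m.
        catchup=False,
    ) as dag:
        ingest = PythonOperator(task_id="ingest_billing", python_callable=ingest_billing)
        build = PythonOperator(task_id="build_dashboard_tables", python_callable=build_dashboard_tables)
        retrain = PythonOperator(task_id="retrain_forecast_model", python_callable=retrain_forecast_model)

        # Dependencies: ingest before building tables, build before retraining.
        ingest >> build >> retrain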

Deploying the ML models

The forecasting model is deployed in production such that it can be called by different programs to get the predicted value. Deploying the model is not a one-time task; ML models are periodically updated based on retraining. Data users rely on nonstandardized, homegrown deployment scripts that must be customized to support a wide range of ML model types, ML libraries and tools, model formats, and deployment endpoints (such as IoT devices, mobile, browser, and web API). There are no standardized frameworks to monitor the performance of models and scale automatically based on load. The time taken to complete this milestone is tracked by the metric time to deploy.
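
As a hedged sketch of wrapping a trained model behind a web API (assuming a serialized model file and Flask), deployment might start from something like this; in production the endpoint would be containerized, monitored, and redeployed whenever the model is retrained.

    from flask import Flask, request, jsonify
    import joblib

    app = Flask(__name__)
    model = joblib.load("forecast_model.joblib")   # hypothetical serialized model

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expects a JSON body such as {"features": [[...feature values...]]}.
        features = request.get_json()["features"]
        prediction = model.predict(features).tolist()
        return jsonify({"forecast": prediction})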

Monitoring the quality of the insights

As the business dashboard is used daily, consider an example where it shows an incorrect value for a specific day. Several things can go wrong and lead to quality issues: uncoordinated source schema changes, changes in data element properties, ingestion issues, source and target systems with out-of-sync data, processing failures, incorrect business definitions for generating metrics, and many more. Data users need to analyze data attributes for anomalies and debug the root cause of detected quality issues. Data users rely on one-off checks that are not scalable with large volumes of data flowing across multiple systems. The goal is not just to detect data quality issues, but also to avoid mixing low-quality data records with the rest of the dataset partitions. The time taken to complete this milestone is tracked by the metric time to insight quality.
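
A minimal sketch of such checks, with assumed column names and expectations, might quarantine a partition before it reaches the dashboard:

    import pandas as pd

    revenue = pd.read_parquet("/data-lake/serving/revenue_by_market/")   # hypothetical path

    # Illustrative checks that catch common failure modes before the dashboard refreshes.
    checks = {
        "non_empty_partition": len(revenue) > 0,
        "no_null_revenue": revenue["revenue"].notna().all(),
        "no_negative_revenue": (revenue["revenue"] >= 0).all(),
        "expected_markets_present": {"US", "EU"}.issubset(set(revenue["market"])),
    }

    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        raise ValueError(f"Quality checks failed, quarantining partition: {failed}")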

Continuous cost monitoring

We now have insights deployed in production with continuous monitoring to ensure quality. The last piece of the operationalize phase is cost management. Cost management is especially critical in the cloud, where costs grow linearly with usage under the pay-as-you-go model (in contrast to the traditional buy-up-front, fixed-cost model). With data democratization, where data users can self-serve the journey to extract insights, there is a possibility of significantly wasted resources and unbounded costs. A single bad query running on high-end GPUs can accumulate thousands of dollars in a matter of hours, typically to the surprise of the data users. Data users need to answer questions such as: a) what is the dollar amount spent per application? b) which team is projected to spend more than its allocated budget? c) are there opportunities to reduce the spend without affecting performance and availability? and d) are the allocated resources appropriately utilized? The time taken to complete this milestone is tracked by the metric time to optimize cost.
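
As a back-of-the-envelope sketch of cost attribution (assuming a hypothetical scan-based price of $5 per TB), spend per team can be rolled up from a query log; real cost services also account for compute hours, storage, and data transfer.

    # Illustrative scan-priced cost rollup; the price and log fields are assumptions.
    PRICE_PER_TB_SCANNED = 5.00

    query_log = [
        {"team": "marketing", "query_id": "q1", "tb_scanned": 1.8},
        {"team": "marketing", "query_id": "q2", "tb_scanned": 0.2},
        {"team": "finance",   "query_id": "q3", "tb_scanned": 4.5},
    ]

    spend_by_team = {}
    for q in query_log:
        spend_by_team[q["team"]] = (
            spend_by_team.get(q["team"], 0.0) + q["tb_scanned"] * PRICE_PER_TB_SCANNED
        )

    print(spend_by_team)   # {'marketing': 10.0, 'finance': 22.5}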

Overall, in each phase of the journey today, data users spend a significant percentage of their time on data engineering tasks such as moving data, understanding data lineage, searching data artifacts, and so on. The ideal nirvana for data users is a self-service data platform that simplifies and automates these tasks encountered during the day-to-day journey.

Defining Your Time-to-Insight Scorecard

Time to insight is the overall metric that measures the time it takes to complete the entire journey from raw data into insights. In the example of developing the business dashboard and revenue forecasting model, time to insight represents the total number of days, weeks, or months to complete the journey map phases. Based on my experience managing data platforms, I have divided the journey map into 18 key milestones, as described in the previous section. Associated with each milestone is a metric such that the overall time to insight is a summation of the individual milestone metrics.

Each enterprise differs in their pain points related to the journey map. For instance, in the example of developing the business dashboard, an enterprise may spend a majority of time in time to interpret and time to find due to multiple silos and lack of documentation, while an enterprise in a regulated vertical may have time to comply as a key pain point in its journey map. In general, enterprises vary in their pain points due to differences in maturity of the existing process, technology, datasets, skills of the data team, industry vertical, and so on. To evaluate the current status of the data platform, we use a time-to-insight scorecard, as shown in Figure 1-3. The goal of the exercise is to determine the milestones that are the most time-consuming in the overall journey map.

Figure 1-3. Scorecard for the time-to-insight metric as a sum of individual milestone metrics within the journey map.

Each chapter in the rest of the book corresponds to a metric in the scorecard and describes the design patterns to make them self-service. Following is a brief summary of the metrics:

Time to interpret
Associated with the milestone of understanding a dataset’s metadata details before using it for developing insights. Incorrect assumptions about the dataset often lead to incorrect insights. The existing value of the metric depends on the process for defining, extracting, and aggregating technical metadata, operational metadata, and team knowledge. To minimize time to interpret and make it self-service, Chapter 2 covers implementation patterns for a metadata catalog service that extracts metadata by crawling sources, tracks lineage of derived datasets, and aggregates team knowledge in the form of tags, validation rules, and so on.
Time to find
Associated with the milestone of searching related datasets and artifacts. A high time to find leads to teams choosing to reinvent the wheel by developing clones of data pipelines, dashboards, and models within the enterprise, leading to multiple sources of truth. The existing value of the metric depends on the existing process to index, rank, and access control datasets and artifacts. In most enterprises, these processes are either ad hoc or have manual dependencies on the data platform team. To minimize time to find and make it self-service, Chapter 3 covers implementation patterns for a search service.
Time to featurize
Associated with the milestone of managing features for training ML models. Data scientists spend 60% of their time creating training datasets for ML models. The existing value of the metric depends on the process for feature computation and feature serving. To minimize time to featurize and make it self-service, Chapter 4 covers implementation patterns of a feature store service.
Time to data availability
Associated with the milestone of moving data across the silos. Data users spend 16% of their time moving data. The existing value of the metric depends on the process for connecting to heterogeneous data sources, data copying and verification, and adapting to any schema or configuration changes that occur on the data sources. To minimize time to data availability and make it self-service, Chapter 5 covers implementation patterns of a data movement service.
Time to click metrics
Associated with the milestone of collecting, managing, and analyzing clickstream data events. The existing value of the metric depends on the process of creating instrumentation beacons, aggregating events, enrichment by filtering, and ID stitching. To minimize time to click metrics and make it self-service, Chapter 6 covers implementation patterns of a clickstream service.
Time to data lake management
Associated with the milestone of managing data in a central repository. The existing value of the metric depends on the process of managing primitive data life cycle tasks, ensuring consistency of data updates, and managing batching and streaming data together. To minimize time to data lake management and make it self-service, Chapter 7 covers implementation patterns of a data lake management service.
Time to wrangle
Associated with the milestone of structuring, cleaning, enriching, and validating data. The existing value of the metric depends on the process of identifying data curation requirements for a dataset, building transformations to curate data at scale, and operational monitoring for correctness. To minimize time to wrangle and make it self-service, Chapter 8 covers implementation patterns of a data wrangling service.
Time to comply
Associated with the milestone of ensuring data rights compliance. The existing value of the metric depends on the process for tracking customer data across the application silos, executing customer data rights requests, and ensuring the use cases only use the data that has been consented to by the customers. To minimize time to comply and make it self-service, Chapter 9 covers implementation patterns of a data rights governance service.
Time to virtualize
Associated with the milestone of selecting the approach to build and analyze data. The existing value of the metric depends on the process to formulate queries for accessing data residing in polyglot datastores, queries to join data across the datastores, and processing queries at production scale. To minimize time to virtualize and make it self-service, Chapter 10 covers implementation patterns of a data virtualization service.
Time to transform
Associated with the milestone of implementing the transformation logic in data and ML pipelines. The transformation can be batch, near real-time, or real-time. The existing value of the metric depends on the process to define, execute, and operate transformation logic. To minimize time to transform and make it self-service, Chapter 11 covers implementation patterns of a data transformation service.
Time to train
Associated with the milestone of training ML models. The existing value of the metric depends on the process for orchestrating training, tuning of model parameters, and continuous retraining for new data samples. To minimize time to train and make it self-service, Chapter 12 covers implementation patterns of a model training service.
Time to integrate
Associated with the milestone of integrating code, data, and configuration changes in ML pipelines. The existing value of the metric depends on the process for tracking iterations of ML pipelines, creating reproducible packages, and validating the pipeline changes for correctness. To minimize time to integrate and make it self-service, Chapter 13 covers implementation patterns of a continuous integration service for ML pipelines.
Time to A/B test
Associated with the milestone of A/B testing. The existing value of the metric depends on the process for designing an online experiment, executing at scale (including metrics analysis), and continuously optimizing the experiment. To minimize time to A/B test and make it self-service, Chapter 14 covers implementation patterns of an A/B testing service as a part of the data platform.
Time to optimize
Associated with the milestone of optimizing queries and big data programs. The existing value of the metric depends on the process for aggregating monitoring statistics, analyzing the monitored data, and invoking corrective actions based on the analysis. To minimize time to optimize and make it self-service, Chapter 15 covers implementation patterns of a query optimization service.
Time to orchestrate
Associated with the milestone of orchestrating pipelines in production. The existing value of the metric depends on the process for designing job dependencies, getting them efficiently executed on available hardware resources, and monitoring their quality and availability, especially for SLA-bound production pipelines. To minimize time to orchestrate and make it self-service, Chapter 16 covers implementation patterns of a pipeline orchestration service.
Time to deploy
Associated with the milestone of deploying insights in production. The existing value of the metric depends on the process to package the insights, scale them in the form of model endpoints, and monitor model drift. To minimize time to deploy and make it self-service, Chapter 17 covers implementation patterns of a model deploy service.
Time to insight quality
Associated with the milestone of ensuring correctness of the generated insights. The existing value of the metric depends on the process to verify accuracy of data, profile data properties for anomalies, and proactively prevent low-quality data records from polluting the data lake. To minimize time to insight quality and make it self-service, Chapter 18 covers implementation patterns of a quality observability service.
Time to optimize cost
Associated with the milestone of minimizing costs, especially while running in the cloud. The existing value of the metric depends on the process to select cost-effective cloud services, configuring and operating the services, and applying cost optimization on an ongoing basis. To minimize time to optimize cost and make it self-service, Chapter 19 covers implementation patterns of a cost management service.

The end result of this analysis is populating the scorecard corresponding to the current state of the data platform (similar to Figure 1-4). Each metric is color-coded based on whether the tasks associated with the metric can be completed on the order of hours, days, or weeks. A metric that takes on the order of weeks typically represents tasks that today are executed in an ad hoc fashion using manual, nonstandard scripts and programs, and/or tasks requiring coordination between data users and data platform teams. Such metrics represent opportunities where the enterprise needs to invest in making the associated tasks self-service for data users.

The complexity associated with each of the scorecard metrics will differ between enterprises. For instance, in a startup with a handful of datasets and data team members, time to find and time to interpret can be accomplished in a matter of hours when relying solely on team knowledge, even though the process is ad hoc. Instead, the most time may be spent in data wrangling or tracking the quality of the insights, given the poor quality of available data. Further, enterprises vary in the requirements associated with each service in the data platform. For instance, an enterprise deploying only offline trained ML models once a quarter (instead of online continuous training) may not prioritize improving the time to train metric even if it takes a number of weeks.

Figure 1-4. Example scorecard representing the current state of an enterprise’s data platform.

Build Your Self-Service Data Roadmap

The first step in developing the self-service data roadmap is defining the scorecard for the current state of the data platform, as described in the previous section. The scorecard helps shortlist the metrics that are currently slowing down the journey from raw data to insights. Each metric in the scorecard can be at a different level of self-service, and prioritized for automation in the roadmap based on the degree to which it slows down the overall time to insight.

As mentioned earlier, each chapter covers design patterns to make the corresponding metric self-service. We treat self-service as having multiple levels, analogous to different levels of self-driving cars that vary in terms of the levels of human intervention required to operate them (as illustrated in Figure 1-5). For instance, a level-2 self-driving car accelerates, steers, and brakes by itself under driver supervision, while level 5 is fully automated and requires no human supervision.

Figure 1-5. Different levels of automation in a self-driving car (from DZone).

Enterprises need to systematically plan the roadmap for improving the level of automation for each of the shortlisted metrics. The design patterns in each chapter are organized like Maslow’s hierarchy of needs; the bottom level of the pyramid indicates the starting pattern to implement and is followed by two more levels, each building on the previous one. The entire pyramid within a given chapter represents what it takes to make the corresponding metric self-service, as shown in Figure 1-6.

Figure 1-6. Maslow’s hierarchy of task automation followed in each chapter.

In summary, this book is based on experience in implementing self-service data platforms across multiple enterprises. To derive maximum value from the book, I encourage readers to apply the following approach to executing their self-service roadmap:

  1. Start by defining the current scorecard.

  2. Identify two or three metrics that are most significantly slowing down the journey map based on surveys of the data users, and perform technical analysis of how the tasks are currently being implemented. Note that the importance of these metrics varies for each enterprise based on their current processes, data user skills, technology building blocks, data properties, and use case requirements.

  3. For each of the metrics, start with the Maslow’s hierarchy of patterns to implement. Each chapter is dedicated to a metric and covers patterns with increasing levels of automation. Instead of recommending specific technologies that will soon become outdated in the fast-paced big data evolution, the book focuses on implementation patterns and provides examples of existing technologies available on-premise as well as in the cloud.

  4. Follow a phased crawl, walk, run strategy with a focus on doubling down on shortlisted metrics every quarter and making them self-service.

Finally, the book attempts to bring together the perspectives of both data users and data platform engineers. Creating a common understanding of requirements is critical in developing a pragmatic roadmap that intersects what is possible and what is feasible given the timeframe and resources available.

Ready! Set! Go!
