The industrialization of analytics
Analytic Ops—DevOps for data science—makes data analysis into a continually evolving process to meet business needs.
Industrialization, by definition, implies automation. It lets you do more with less. Just as a farmer can now plow a field with a tractor in a couple of hours instead of days with a horse, organizations can potentially plow through vast fields of data with advanced algorithms. Perhaps a better analogy is the factory—a manufacturing plant where the deliverables are insights. Imagine, for example, an assembly line that allows you to collect data, sort it, classify it, and prepare it for modeling, analysis, and insight generation. Is that where we are headed? Yes. And is it necessary? Also yes.
Here’s why. To expand their access to big data’s volume, velocity, and variety—and make it work to their advantage—organizations need three things that industrialization bakes in: process, structure, and transparency. If you really want to get value from your data and run your organization like a well-oiled machine, you have to be able to scale. Yet the ability to scale is one of big data’s biggest conundrums, and industrialization is the answer: it is defined by its transformative ability to scale, and scaling almost always means automating what has traditionally been done by hand. Think assembly line.
An assembly line approach is based on defining a set of processes that support analytics. It’s a collaborative approach that requires cross-functional alignment and a commitment from the C-suite to drive participation. But how do you automate the process of gleaning insights from data?
Let’s look at how industrialization happens in manufacturing, where these processes were originally developed. Manufacturing managers have insisted on quality controls and process refinement for decades. If our industry is going to industrialize analytics, we need to apply the same types of quality-control measures to the analytics and the operations they power. Any solution you build should take into account the following:
- Data management: This involves the creation of analytic data sets by data scientists in a manner that captures lineage, provides appropriate governance, and avoids the dreaded data swamp of unrecognizable assets. It also includes documentation, notes, code, data samples, and a change log as well as checks and balances to ensure that the assets are ready for consumption.
- Development: This refers to modeling tools that are integrated into a single workbench with visualization and interfaces designed for data exploration. It also includes knowledge management to store information about the models you are building.
- Deployment: This is where the production model is created that will later be used for operations. It requires model management—such as maintaining version history and training data sets for auditing—and model promotion processes, with an emphasis on efficiency and controlled execution. Data platforms offer many options for analytical processing, but this approach must ensure that business logic stays intact if the model is deployed on another platform.
- Maintenance: Operational systems are the bookends of the process. You source data at the beginning from your operational system, and the analysis is the end deliverable consumed by the application or operational process. Because of the operational dependencies inherent in these processes, strict rules of the road should be in place, including operational logging of all scoring activities and a process to log irregularities when model drift occurs.
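To make the maintenance controls above concrete, here is a minimal sketch of an operational scoring monitor that logs every scoring event against a model version and flags drift when recent scores deviate from a training-time baseline. All names, thresholds, and the drift statistic (a simple rolling-mean comparison) are illustrative assumptions, not a prescribed implementation:

```python
import statistics

class ScoringMonitor:
    """Hypothetical sketch: log all scoring activity and watch for model drift."""

    def __init__(self, model_version, baseline_mean, drift_tolerance=0.2, window=100):
        self.model_version = model_version
        self.baseline_mean = baseline_mean      # mean score observed on training data
        self.drift_tolerance = drift_tolerance  # allowed deviation from that baseline
        self.window = window                    # number of recent scores to compare
        self.scores = []                        # recent score values for the drift check
        self.log = []                           # operational log of all scoring activity

    def record(self, record_id, score):
        """Log one scoring event, then check the recent window for drift."""
        self.scores.append(score)
        self.log.append({"model": self.model_version, "id": record_id, "score": score})
        recent = self.scores[-self.window:]
        drifted = abs(statistics.mean(recent) - self.baseline_mean) > self.drift_tolerance
        if drifted:
            # Log the irregularity so operations can trigger retraining or review.
            self.log.append({"model": self.model_version, "event": "drift_detected"})
        return drifted

monitor = ScoringMonitor(model_version="churn-v2", baseline_mean=0.5)
monitor.record("cust-001", 0.48)          # close to baseline: no drift
drift = monitor.record("cust-002", 0.95)  # window mean jumps past tolerance: drift flagged
```

In a real deployment the log would go to durable storage and the drift test would use a proper statistical measure (for example, population stability index), but the structure—versioned model, audited scoring log, automated irregularity check—is the point.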
As the availability of data continues to explode and the tools to analyze it proliferate, companies will continue to seek the power that big data sets promise because where there is data, there are insights, and where there are insights, there is value. But in order to get there, we need to embed the principles of industrialization into the process.
When these processes are designed and implemented as a whole—not piecemeal—“industrialization” starts to occur, and analytics are driven and sustained over time. This is Analytic Ops—in other words, DevOps for data science. Once a certain velocity is achieved through the industrialization of analytics, companies can ultimately lower costs, speed innovation, and bring new capabilities to market.
This post is a collaboration between O’Reilly and Think Big, a Teradata Company. View our statement of editorial independence.