Chapter 1. Introduction
Over the past three decades, as an enterprise CIO and a provider of third-party enterprise software, I’ve witnessed firsthand a long series of large-scale information technology transformations, including client/server, Web 1.0, Web 2.0, the cloud, and Big Data. One of the most important but underappreciated of these transformations is the astonishing emergence of DevOps.
DevOps, the ultimate pragmatic evolution of Agile methods, has enabled digital-native companies such as Amazon and Google to devour entire industries through rapid feature velocity and a relentless pace of change; it is one of the key tools being used to realize Marc Andreessen’s prediction that “Software Is Eating the World.” Traditional enterprises, intent on competing with digital-native internet companies, have already begun to adopt DevOps at scale. While running software and data engineering at the Novartis Institutes for BioMedical Research, I introduced DevOps into the organization, and the impact was dramatic.
Fundamental changes such as the adoption of DevOps tend to be embraced by large enterprises only after new technologies have matured to the point where the benefits are broadly understood, the cost and lock-in of legacy/incumbent enterprise vendors become untenable, and core standards emerge through a critical mass of adoption. We are witnessing the beginning of another fundamental change in enterprise tech, called “DataOps,” which will allow enterprises to rapidly and repeatedly engineer mission-ready data from all of the data sources across the enterprise.
DevOps and DataOps
Much like DevOps in the enterprise, the emergence of enterprise DataOps mimics the data management practices that large internet companies have refined over the past 10 years. Employees of large internet companies treat their company’s data as a corporate asset, and leaders at traditional companies have recently developed the same appetite to take advantage of data to compete. But most large enterprises are unprepared, often because of behavioral norms (like territorial data hoarding) and because they lag in their technical capabilities (often stuck with cumbersome extract, transform, and load [ETL] and master data management [MDM] systems). The necessity of DataOps has emerged as individuals in large traditional enterprises realize that they should be using all the data generated in their company as a strategic asset to make better decisions every day. Ultimately, DataOps is as much about changing people’s relationship to data as it is about technology infrastructure and process.
The engineering framework that DevOps created is great preparation for DataOps. For most enterprises, many of which have adopted some form of DevOps for their IT teams, the ability to deliver high-quality, comprehensive, and trusted analytics from data spread across many silos is what will allow them to move quickly and compete over the next 20 years or more. Just as internet companies needed DevOps to provide a high-quality, consistent framework for feature development, enterprises need a high-quality, consistent framework for rapid data engineering and analytic development.
The Catalyst for DataOps: “Data Debt”
DataOps is the logical consequence of three key trends in the enterprise:
- Multibillion-dollar business process automation initiatives over the past 30-plus years that started with back-office system automation (accounting, finance, manufacturing, etc.) and swept through the front office (sales, marketing, etc.) in the 1990s and 2000s, creating hundreds, even thousands, of data silos within large enterprises.
- The competitive pressure of digital-native companies in traditional industries.
- The opportunity presented by the “democratization of analytics,” driven by new products and companies that enabled broad use of analytic/visualization tools such as Spotfire, Tableau, and BusinessObjects.
For traditional Global 2000 enterprises intent on competing with digital natives, these trends have combined to create a major gap between the intensifying demand for analytics among empowered frontline people and the organization’s ability to manage the “data exhaust” from all the silos created by business process automation.
Bridging this gap has been promised before, starting with data warehousing in the 1990s, data lakes in the 2000s, and decades of other data integration promises from the large enterprise tech vendors. Despite the promises of single-vendor data hegemony by the likes of SAP, Oracle, Teradata, and IBM, most large enterprises still face the grim reality of intensely fractured data environments. The cost of the resulting data heterogeneity is what we call “data debt.”
Data debt stems naturally from the way that companies do business. Lines of businesses want control and rapid access to their mission-critical data, so they procure their own applications, creating data silos. Managers move talented personnel from project to project, so the data systems owners turn over often. The high historical rate of failure for business intelligence and analytics projects makes companies rightfully wary of game-changing and “boil the ocean” projects that were epitomized by MDM in the 1990s.
Paying Down the Data Debt
Companies often acquire data debt when they run their business as a loosely connected portfolio, with the lines of business making “free rider” decisions about data management. When companies try to create leverage and synergy across their businesses, they recognize their data debt problem and work overtime to fix it. We’ve passed a tipping point: large companies can no longer treat the management of their data as optional, subject to the whims of line-of-business managers and their willingness to fund central data initiatives. Instead, it’s finally time for enterprises to tackle their data debt as a strategic competitive imperative. As my friend Tom Davenport describes in his book Competing on Analytics, the organizations that can make better decisions faster are going to survive and thrive. Great decision making and analytics require great unified data, the central answer to the classic garbage in/garbage out problem.
For organizations that recognize the severity of their data debt problem and determine to tackle it as a strategic imperative, DataOps enables them to pay down their data debt by rapidly and continuously delivering high-quality, unified data at scale from a wide variety of enterprise data sources.
From Data Debt to Data Asset
By building their data infrastructure from scratch with legions of talented engineers, digital-native, data-driven companies like Facebook, Amazon, Netflix, and Google have avoided data debt by managing their data as an asset from day one. Their examples of treating data as a competitive asset have provided a model for savvy leaders at traditional companies who are taking on digital transformation while dealing with massive legacy data debt. These leaders now understand that managing their data proactively as an asset is the first, foundational step for their digital transformation—it cannot be a “nice to have” driven by corporate IT. Even for managers who aren’t excited by the possibility of competing with data, the threat of a traditional competitor using their data more effectively or disruption from data-driven, digital-native upstarts requires that they take proactive steps and begin managing their data seriously.
DataOps to Drive Repeatability and Value
Most enterprises have the capability to find, shape, and deploy data for any given idiosyncratic use case, and there is an abundance of analyst-oriented tools for “wrangling” data from great companies such as Trifacta and Alteryx. Many of the industry-leading executives I work with have commissioned and benefitted from one-and-done analytics or data integration projects. These idiosyncratic approaches to managing data are necessary but not sufficient to solve their broader data debt problem and to enable these companies to compete on analytics.
Next-level leaders who recognize the threat of digital natives are looking to use data aggressively and iteratively to create new value every day as new data becomes available. The biggest challenge in enterprise data is achieving repeatability and scale: being able to find, shape, and deploy data reliably and with confidence. And, much like unstructured content on the web, structured data changes over time. The right implementation of DataOps enables your analytics to adapt as more data becomes available and existing data is enhanced.
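To make the idea of repeatability concrete, here is a minimal sketch in Python, using hypothetical file and field names, of a pipeline step that rebuilds its output from scratch on every run. Because the step is deterministic, it can safely be re-run whenever new extracts land or existing data is corrected, and the same inputs always yield the same result.

```python
# A minimal sketch of a repeatable pipeline step, assuming hypothetical
# JSON-lines extracts in a landing directory. The output is rebuilt from
# scratch on every run, so re-running as data arrives or is corrected is safe.
import json
from pathlib import Path

EXTRACT_DIR = Path("extracts")           # hypothetical landing directory
OUTPUT = Path("daily_order_totals.json")

def rebuild(extract_dir: Path, output: Path) -> None:
    totals: dict[str, float] = {}
    # Sorted iteration keeps the step deterministic across runs.
    for extract in sorted(extract_dir.glob("*.jsonl")):
        with extract.open() as f:
            for line in f:
                order = json.loads(line)
                day = order["order_date"]          # assumed field names
                totals[day] = totals.get(day, 0.0) + float(order["amount"])
    # Sorted keys make repeated outputs directly comparable and diffable.
    output.write_text(json.dumps(totals, sort_keys=True, indent=2))

if __name__ == "__main__":
    rebuild(EXTRACT_DIR, OUTPUT)
```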
Organizing by Logical Entity
DataOps is the framework that will allow these enterprises to begin their journey toward treating their data as an asset and paying down their data debt. The required changes in human behavior and process are as important as, if not more important than, any bright, shiny new technology. In the best projects I’ve been involved with, the participants realize that their first goal is to organize their data along their key logical business entities, examples of which include the following:
- Customers
- Suppliers
- Products
- Research
- Facilities
- Employees
- Parts
Of course, every enterprise and industry has its own collection of key entities. Banks might be interested in entities that enable fraud detection; agricultural firms might care more about climate and crop data. But for every enterprise, understanding these logical entities across many sources of data is key to ensuring reliable analytics. Many DataOps projects begin with a single entity for a single use case and then expand; this approach ties the data engineering work to ROI, whether from selling more products or from saving money, by delivering unified, clean data for a given entity to analytics and decision making.
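As a toy illustration of starting with a single entity for a single use case, the following Python sketch (with hypothetical extract files and field names) merges customer records from two silos and deduplicates them on a normalized email address. Real projects would rely on far more robust entity resolution and matching; the point is simply how narrow a useful first step can be.

```python
# A minimal sketch, assuming two hypothetical CSV extracts, of unifying a
# single logical entity (Customers) across two silos for one analytics use case.
import csv
from pathlib import Path

def load_customers(path: Path, email_field: str) -> dict[str, dict]:
    """Read one silo's extract and key its records by normalized email."""
    customers: dict[str, dict] = {}
    with path.open(newline="") as f:
        for record in csv.DictReader(f):
            email = record.get(email_field, "").strip().lower()
            if email:
                # Keep the raw record, plus a normalized key and source lineage.
                customers[email] = {**record, "email": email, "source": path.stem}
    return customers

# Hypothetical extracts: the CRM silo calls the field "email",
# the billing silo calls it "contact_email".
crm = load_customers(Path("crm_customers.csv"), "email")
billing = load_customers(Path("billing_customers.csv"), "contact_email")

# Merge: for duplicate emails the CRM record wins; billing-only customers are added.
unified = {**billing, **crm}
print(f"{len(crm)} CRM + {len(billing)} billing -> {len(unified)} unified customers")
```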
For each of these key entities, any chief data officer should be able to answer the following fundamental questions:
- What data do we have?
- Where does our data come from?
- Where is our data consumed?
To ensure clean, unified data for these core entities, a key component of DataOps infrastructure is a system of reference that maps a company’s data to its core logical entities. This unified system of reference should consist of unified attributes constructed from the raw physical attributes across source systems. The core capabilities of DataOps technologies and processes are managing the pathways from raw physical attributes to those unified attributes, tracking changes to the underlying data, and applying the common operations that shape the data into production readiness for the authoritative system of reference.
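The sketch below illustrates the idea in Python, using hypothetical system, table, and column names: each unified attribute of a Customer entity is declared in terms of the raw physical attributes it is constructed from, and the same mapping can be queried to answer the lineage questions above.

```python
# A minimal sketch, with hypothetical system and column names, of a system of
# reference: unified attributes of the Customer entity are declared in terms
# of the raw physical attributes they are built from, and the mapping itself
# answers "what data do we have?" and "where does our data come from?"
from dataclasses import dataclass

@dataclass(frozen=True)
class RawAttribute:
    system: str   # e.g., "crm", "erp" (hypothetical source systems)
    table: str
    column: str

# Unified attribute -> the raw physical attributes it is constructed from.
SYSTEM_OF_REFERENCE: dict[str, list[RawAttribute]] = {
    "customer_name": [
        RawAttribute("crm", "accounts", "account_name"),
        RawAttribute("erp", "kunden", "kunde_name"),
    ],
    "customer_country": [
        RawAttribute("crm", "accounts", "billing_country"),
        RawAttribute("erp", "kunden", "land_code"),
    ],
}

def sources_of(unified_attribute: str) -> list[str]:
    """Answer 'where does our data come from?' for one unified attribute."""
    return [
        f"{raw.system}.{raw.table}.{raw.column}"
        for raw in SYSTEM_OF_REFERENCE.get(unified_attribute, [])
    ]

print(sources_of("customer_name"))
# ['crm.accounts.account_name', 'erp.kunden.kunde_name']
```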
This book goes into much more detail on DataOps and the practical steps enterprises have taken, and should take, to pay down their own data debt, including behavioral, process, and technology changes. It traces the development of DataOps and its roots in DevOps, best practices for building a DataOps ecosystem, and real-world examples. I’m excited to be a part of this generational change, one that I truly believe will be a key to success for enterprises over the next decade as they strive to compete with their new digital-native competitors.
The challenge for large enterprises is that if they don’t adopt DataOps quickly, they run the risk of being left in the proverbial competitive dust.