Chapter 4. Operational Data Is the New Oil

Operational data and its potential to produce new business value apply to all types of organizations in all sectors of society and the economy. All domains generate and consume operational data. The ability of organizations to consume enough of the right kinds of operational data, generate valuable insights from that data, and then take correct and timely action determines their ability to achieve and sustain new heights of success in this current age of digital business.

Data has been declared the new oil so often that the saying has become a cliché. But a cliché is no less true for being one: operational data, like oil, is an unrefined resource that ultimately enables businesses to extract and create value in the form of derivatives.

Oil is the foundation for more than 6,000 products including “dishwashing liquid, solar panels, food preservatives, eyeglasses, DVDs, children’s toys, tires, and heart valves.”1 The value of operational data, like oil, depends largely on refinement and production processes. Indeed, raw data has little value. Rather, it is the information and insights gleaned through careful processing and analysis of the data that produces value.

We already see traditional industries embracing operational data to grow existing business and create new lines of revenue. Those firms that have successfully plotted a path to become a data-driven business are capitalizing on this new strategic capability. For example, the grocery retailer Kroger not only cites insights as the primary driver of phenomenal growth (14.1% in 2020, aided by a 116% jump in online sales),2 but it is also entering the insights business by monetizing its data:

Seeking to leverage its scale and significant insights on customers, the company is seeking to transform its business model with alternative revenue, where it plans to monetize its rich data and make the argument that it can provide CPG [companies] with a superior ROI on ad/marketing dollars (in addition to trade spend) versus traditional channels.3

Incumbent architectures, business practices, and skill sets will not deliver value from operational data because they are static and designed for an enterprise architecture that no longer reflects the actual anatomy of enterprise systems. Static businesses simply cannot adapt and keep up with their customers and competitors. For the CIO, the challenge lies not only in managing and scaling existing business data architectures but also in putting in place the technologies, tools, and teams needed to operate an operational data practice at scale.

In this chapter, we explore the pervasive challenge organizations face in becoming more innovative and treating operational data as a first-class business asset. Operational data, together with corresponding business practices and technology skill sets, enables an organization to govern an operational data platform with discipline and intentionality, just as it governs corporate finance, compliance, and risk. This chapter establishes the primary shifts in approach, the architectural ramifications, and the impact on investment in technical skills as compared to traditional businesses, that is, businesses designed to support a static line of business versus those designed to innovate, and how to make progress toward the latter.

Operational Data Platform(s)

An enterprise data practice needs a platform or group of platforms to scale its consumption of operational data (a.k.a. telemetry). This platform should have a flexible architecture for processing data in the right location in the right way and for providing a consistent framework against which a data governance model can be executed.

New Sources of Data

The first step to successfully designing an insights platform is to understand which new types of data are defined as operational data and their sources. All applications, the environments in which they run, and the physical resources used to support them have potential operational data sources.

Applications or application stacks include the app code and everything the app needs to function. Any traditional or modern web application, for example, comprises the app code itself as well as an underlying web server, operating system, and, potentially, hypervisor. The application usually also includes management and orchestration systems. Collect logs and metrics from all these systems to widen the surface area of analytics and correlate otherwise unrelatable data.

Environments point to systems and services unique to a particular cloud or colocation substrate. Container services offered on many public clouds, for example, come with companion operational visibility services designed to report specialized data unique to that environment. Collect this data to determine whether remediation should focus on a given tenant or on shifting work to an alternate cloud provider.

Physical resources describe a variety of potential operational data sources that may be overlooked because they are closely tied to physical infrastructure such as space, power, and cooling. We can, for example, correlate a power surge on a server with an application failure that disrupted a set of customer purchases.

The most effective data platforms will strive to achieve full visibility. In fact, a lack of visibility across the entire IT stack results in missing data, which is the number one challenge reported by IT experts in obtaining the insights they need.4

At a minimum, operational data can be collected from existing logs, events, and traces used to monitor and troubleshoot the environment. Building on this base, organizations should consider their potential blind spots and work toward gradually illuminating those gaps in pursuit of full visibility through comprehensive data collection.

After achieving a clear understanding of all the potential types of operational data, the next step is creating an inventory of the components that make up a given application/service/digital experience, both internal and external. The output of this step is a map of the communications mesh among all components. Then, by grouping components according to any service frameworks that may be used to aggregate telemetry, a reduced and refined view of the best data sources can be created. With this view, combined with the results of the blind-spot research, an organization has built the proper foundation for a data and observability strategy. That strategy should guide the development of a new insights platform and inform plans for enhancing the platform in line with the organization’s goals and capabilities.5
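
To make the inventory and grouping steps concrete, the following minimal Python sketch assumes a hypothetical inventory format and communications mesh; it groups components by the framework that can aggregate their telemetry and flags mesh edges that reference components missing from the inventory, a likely blind spot.

    from collections import defaultdict

    # Hypothetical inventory of components for one digital experience.
    # Field names and groupings are illustrative, not a prescribed schema.
    components = [
        {"name": "web-frontend",  "framework": "kubernetes", "telemetry": ["logs", "traces"]},
        {"name": "cart-service",  "framework": "kubernetes", "telemetry": ["logs", "metrics", "traces"]},
        {"name": "payments-saas", "framework": "external",   "telemetry": ["vendor-api"]},
        {"name": "db-primary",    "framework": "vm",         "telemetry": ["logs", "metrics"]},
    ]

    # Observed communications mesh: which components talk to which.
    mesh = [("web-frontend", "cart-service"),
            ("cart-service", "payments-saas"),
            ("cart-service", "db-primary"),
            ("cart-service", "inventory-service")]

    # Group components by the framework that can aggregate their telemetry,
    # yielding a reduced, refined view of candidate data sources.
    by_framework = defaultdict(list)
    for c in components:
        by_framework[c["framework"]].append(c["name"])

    # Flag potential blind spots: mesh edges whose endpoints are not in the inventory.
    known = {c["name"] for c in components}
    blind_spots = [edge for edge in mesh if not set(edge) <= known]

    print(dict(by_framework))
    print("unmapped edges:", blind_spots)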

Note

The composition of the application itself changes as an organization adopts a new digital architecture. The microservices and cloud-native runtime components in use can easily number in the thousands for a single workload, with many more connections to third-party software-as-a-service (SaaS) services than traditional applications, which are typically designed with far fewer components and very few, if any, third-party SaaS interfaces. Each of these runtime components is a viable operational data source. Identifying the full set of components, their telemetry sources, and which data sets are most valuable is a central focus of the SRE teams, the primary actors described in Chapter 6.

Other common blind spots in the IT stack, found in both traditional and cloud-native/microservices architectures, include the compute layer, nonproxy vantage points, and third-party services. Investigating new telemetry sources across these three areas gives IT teams a starting point for determining which untapped data sources make the most sense to pursue and in what order.

An example of a compute layer blind spot is main server processor trace data. In this case, the data is available but is not being collected. These sources typically provide a voluminous processor execution branch history that needs to be filtered but is nonetheless ready for consumption.6

Numerous nonproxy vantage points exist, from the top of the digital stack to the bottom. The term nonproxy is meaningful because proxies are usually placed inline with traffic flows, giving them various levels of visibility depending on how they are designed. The proxy vantage point is a natural go-to source for operational data because it already exists to perform other critical functions such as traffic management and security, leaving the nonproxy vantage points as the more likely blind spots. Examples of nonproxy vantage points include the following:

  • Packet filters that may be used to implement or optimize a proxy but, in and of themselves, do not proxy network connections and therefore offer unique lines of sight.

  • The new set of data, control, and management paths described in Chapter 2—DPUs—where infrastructure processing is offloaded from main processors to alternate processing centers. This includes field-programmable gate array (FPGA), GPU, or other auxiliary compute complexes.7

  • Code instrumented natively in an application or service. The purpose of instrumentation is to track a typical user’s path through a business flow as it traverses the various components and services that make up that flow.

Another common blind spot is third-party components. By subscribing to the telemetry APIs of third parties, an organization taps an otherwise invisible source of information and increases the accuracy and overall value of the insights it can generate. An ecommerce example is payments processing. Digital payment services are commonly consumed as a third-party SaaS component. In addition to integrating the component itself for completing orders, the companion telemetry service, also exposed through an API, should be consumed so that this data source can be streamed into the insights platform. Another common type of third-party telemetry source is provided by public cloud services through certain APIs they expose to their tenants.
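
As an illustration of consuming a companion telemetry API, the sketch below polls a hypothetical payments provider using the Python requests library; the URL, authentication scheme, and response fields are assumptions for the example, not any real provider's interface.

    import requests

    # Hypothetical third-party payments telemetry endpoint (illustrative only).
    TELEMETRY_URL = "https://api.payments.example.com/v1/telemetry/events"

    def fetch_payment_telemetry(api_key: str, since_iso: str) -> list[dict]:
        """Pull recent telemetry events for streaming into the insights platform."""
        resp = requests.get(
            TELEMETRY_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            params={"since": since_iso},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json().get("events", [])

    # Each event would then be normalized and forwarded to the platform's ingestion
    # endpoint alongside internally collected telemetry.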

The proliferation of APIs and their suitability for light yet effective operational data streaming open the opportunity for standardization in collection architecture. Language-agnostic data formats such as JavaScript Object Notation (JSON) unify the formatting of data to be serialized, and technologies such as Protobuf unify the approach to serializing structured data streamed into time-series data store(s) designed to ingest and hold this information. Interestingly, a new technique addressing the challenge of data ingestion at scale produces multivariate time-series data, a more compact form that can be processed 25 times more efficiently than unidimensional streams. Rapid advancements like this are easily adopted by insights platforms with flexible architectures, which accommodate improvements while maintaining certain standards to keep the efficiency and cost equation balanced, producing increasing value from the platform for the business over time.8
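
The compaction idea behind multivariate time-series data can be illustrated with plain JSON (in practice, a technology such as Protobuf would handle serialization); the timestamps, hosts, and metric names below are illustrative.

    import json

    # Univariate style: one point per metric, with timestamp and attributes repeated.
    univariate = [
        {"ts": 1700000000, "host": "edge-01", "metric": "cpu_util",    "value": 0.71},
        {"ts": 1700000000, "host": "edge-01", "metric": "mem_util",    "value": 0.58},
        {"ts": 1700000000, "host": "edge-01", "metric": "req_latency", "value": 0.012},
    ]

    # Multivariate style: metrics sharing a timestamp and attributes are grouped
    # into a single data point, eliminating the repeated context.
    multivariate = {
        "ts": 1700000000,
        "host": "edge-01",
        "values": {"cpu_util": 0.71, "mem_util": 0.58, "req_latency": 0.012},
    }

    print(len(json.dumps(univariate)), "bytes vs", len(json.dumps(multivariate)), "bytes")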

Organizations are attracted to free and open source software for building new capabilities, and the collaboration among users speeds up progress for all participants. In the space of enterprise operational data collection, OpenTelemetry, introduced in Chapter 2, is the current leader. Formed through the merger of two earlier, related projects, this incubating project within the CNCF leads the way. The open community’s free and open libraries, APIs, tools, and software development kits (SDKs) simplify and accelerate IT implementation of a common framework for instrumenting and collecting operational data. Once implemented within an enterprise, the APIs used to connect data sources to their destination data stores are standardized, further driving the ability to automate data collection.
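
A minimal sketch of OpenTelemetry tracing in Python (it assumes the opentelemetry-sdk package is installed); the console exporter stands in for an OTLP exporter pointed at the organization's chosen data store, and the span and attribute names are illustrative.

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    # Configure the SDK: a tracer provider plus an exporter for collected spans.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("checkout-service")

    # Instrument a business flow: the span records timing and attributes that the
    # insights platform can later correlate with other telemetry.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.items", 3)
        # ... application logic runs here ...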

The most effective data platforms will employ a flexible architecture for instrumenting systems and collecting operational data. By prioritizing standard formats and APIs while liberally accepting data collectors and translators for various vendor-specific formats and serialization techniques, IT teams can converge on increasingly standard formats and APIs over time. This approach is not new. What is new with respect to operational data is that ingesting such a wide variety of data in various formats requires a decision about how to expose this data both to human SRE teams that need to quickly troubleshoot and remediate ailing systems and to machines that apply predefined analytics models to generate new insights or remediate issues through automation. Observability for humans and analytics for machines thus place distinct requirements on operational data, best determined by SRE and data science professionals, respectively.

New sources of data, such as the application delivery and security services discussed in Chapter 3, require a new operational model for consumption, processing, analysis, and management. IT groups that build a platform to capture all types of data for human and machine processing ready their organizations to proceed to the next phase of transformation: data processing. The correct architecture for processing will take into consideration the location where data is generated, the types of insights that can be drawn from a given data set, the availability of processing and storage, and the relative cost of storing, moving, and processing data at various locations. The correct operating model will take into consideration the volume of data, speed of processing, and principles for decision making. We explore both of these and how they relate in the next section.

Having established the blueprint for operational data sources mapped to a given application (a.k.a. workload or set of workloads comprising a digital experience), attention can move on to the key characteristics of a data processing engine for the operational data platform.

Data Pipeline and Practices

As more types of operational data emerge as critical to the digital enterprise architecture, most enterprises will not be able to build the storage, processing, security, and privacy controls for all that data at a global scale. Further, technology leaders will find it challenging to expose all of that data to the right systems, processes, and individuals within their organization, and to use it, in a compliant fashion. It is for these reasons that a data and insights platform is needed. This is akin to traditional data consolidation efforts and the use of business intelligence platforms for business and customer-focused data. Similar efforts are required today for operational data, to enable analysis that uncovers otherwise missing insights and produces business value.

How is business value derived from collected data? It depends on human talent skilled in the various aspects of data management, just as software engineering talent was the key to extracting transactional business value out of the line-of-business systems architected and coded to meet the previous generation of business needs. In fact, code artifacts such as algorithms, mobile device apps, and data models become types of data that fall under the governance and management of the newly formed data team. By treating code as a type of institutional data, an IT team starts to show the signs of driving new business value from a data-first mindset. Those code artifacts can be revised, deployed, and expunged properly as raw materials used to fuel the digital business.

This is similar to the approach taken by DevOps when architecting a development pipeline. Data pipelines, such as the one described in Figure 4-1, require similar processes; thus, many of the practices common to DevOps and SRE operations regarding the use of tooling to deliver business outcomes faster can be applied to DataOps. DataOps is a relatively young practice but, like DevOps and SRE, promises to transform traditional processes into modern, more efficient ways of working.

Figure 4-1. A typical data pipeline with automation to enable real-time operations

At the platform layer, an effective architecture accommodates the requirements for data acquisition, protection, management, processing, and exposure. In contrast to being a relatively static transactional asset (e.g., customer profiles and history) with engineering resources assigned to maintain a system or systems, data becomes a dynamic raw material that merits its own engineering resources assigned to curate, search, analyze, process, and enrich it over time to solve problems and discover insights. This naturally creates tension as competition rises between traditional software engineers and data-focused engineers with skills across data design, data curation, and data science. Investing in data talent shifts the organization toward becoming more innovative and better able to leverage its data assets.
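
As a sketch only (not the pipeline shown in Figure 4-1), several of the requirements named above can be modeled as composable stages over a stream of records; the record fields and stage logic are placeholders.

    from typing import Iterable

    Record = dict

    def acquire() -> Iterable[Record]:
        # Acquisition: pull or receive raw operational records.
        yield {"source": "edge-01", "metric": "req_latency", "value": 0.012, "user": "alice"}

    def protect(records: Iterable[Record]) -> Iterable[Record]:
        # Protection: drop or mask fields subject to privacy rules before processing.
        for r in records:
            yield {k: v for k, v in r.items() if k != "user"}

    def process(records: Iterable[Record]) -> Iterable[Record]:
        # Processing: derive simple signals that downstream analytics can use.
        for r in records:
            yield {**r, "slow": r["value"] > 0.25}

    def expose(records: Iterable[Record]) -> None:
        # Exposure: publish to dashboards, APIs, or downstream consumers.
        for r in records:
            print(r)

    expose(process(protect(acquire())))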

In terms of the design for the new data and insights platform, a composite architecture takes into consideration the location of data, the types of insights that can be gained from a given data set, the availability of processing and storage, and the relative cost of storing, moving, and processing data at various locations. Unlike traditional customer and business data, which is typically consolidated in a central location, operational data is likely to be more distributed.

For example, some of the data will be processed at the edge, using AI/ML models designed for real-time or near real-time decisions. About 35% of organizations expect edge computing to support real-time data processing and analysis, where responses within 20 milliseconds are critical.10 This is often a requirement in the manufacturing and healthcare industries. A subset of this data will be aggregated for processing and analytics suited to queries that use other types of AI/ML models to discover a different set of insights in line with the needs of that higher aggregation point. Ultimately, the longest-held data will be stored in the most centralized locations, which specialize in analysis at the highest order of scope. A good example today is ServiceNow, which offers a platform for operational information.

Identifying what data needs to be processed where, which data needs to be stored, and what types of analysis to perform on which data sets are all questions that are answered in this area of the new digital enterprise architecture. ML models should be deployed where the insights they produce can be of best use, either locally or centrally. The factors that dictate this are as follows:

  • Where can the collected data be stored?

  • Where is the data model for processing this data stored, and is it available locally?

  • How long does that data need to be stored before it is processed?

  • What type of processing is needed?

  • Where is the processing capacity located with respect to the data storage location?

For example, in a video call, the local device is the most likely location for operational data about the quality of the experience to be generated. Given that the right type of processing capability and storage also exists on the device, the ML needed to detect when bitrate adjustments are necessary to preserve the experience is best run on the device itself. Given an adjustment interval of 10 seconds, even though the flow of operational data is constant, the local device needs to store only 10 seconds’ worth of that data at a time while running the local ML, after which it can be expunged. Further, only one reference data set needs to be sent upstream, and only if an adjustment was needed in a given 10-second period; otherwise, nothing is sent.
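
A minimal sketch of that on-device loop follows; read_sample, adjust_bitrate, and send_upstream are hypothetical callbacks supplied by the device platform, and a simple packet-loss threshold stands in for the actual ML model.

    import time
    from collections import deque

    WINDOW_SECONDS = 10  # the adjustment interval from the example above

    def predict_degradation(samples: list) -> bool:
        """Stand-in for the on-device ML model; here, a packet-loss threshold."""
        return sum(s["packet_loss"] for s in samples) / len(samples) > 0.05

    def run_local_loop(read_sample, adjust_bitrate, send_upstream):
        """Keep only the last 10 seconds of telemetry, act locally, report only on change."""
        window = deque()
        while True:
            now = time.time()
            window.append({"ts": now, **read_sample()})
            # Expunge samples older than the adjustment interval.
            while window and now - window[0]["ts"] > WINDOW_SECONDS:
                window.popleft()
            if predict_degradation(list(window)):
                adjust_bitrate()
                send_upstream(list(window))  # forward a reference data set only when acted upon
            time.sleep(1)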

Once a base design is established by answering the preceding questions at the local device level, the process can be repeated at higher levels of aggregation, producing the appropriate layering of data storage, processing, service adjustments, and operational data forwarding. For a video call originating or terminating on a smartphone, the next level of aggregation might be a single cell tower. At this level, issues affecting the experience of all users connected to that tower become useful, such as failures to initiate a call or unintended disconnects. By applying this reasoning all the way to the centralized computing location (usually a metro or regional data center), needed data is stored in appropriate silos and/or intermingled with appropriate confluences of data from various sources to serve each predefined purpose, and unneeded data is expunged at each point. This layered approach produces targeted insights efficiently because the architecture considers the intended uses of each data set across each layer and at each point in the user experience. The architecture delivers purposed analysis and treatment of data, which, in turn, ensures that the appropriate business value is derived.

Across the architecture and through each layer, the data, the ML models, and the resulting insights are treated as managed objects—like code—with versions, actions, and value being derived continually. They have a lifecycle similar to application code: created according to predetermined requirements, deployed to specific locations, and executed under specified conditions to achieve certain results. Adjustments to the data collected, the way it is analyzed, insights gained, and resulting actions taken perpetuate the data and data model lifecycle driven by business requirements, much like application code iterates in a virtuous cycle of improvement.

Beyond the automated detection and remediation of issues, treating data and data models as code increases an organization’s ability to make the most of the insights it discovers. Continuing the video example, the data model used to detect degrading user experience on a local device can be managed like code: kept in a central location, version controlled, updated, and pushed out to devices when appropriate. The next step in this evolution would be to aggregate adjustment data to a central or semi-central location so that a higher-order ML model can detect opportunities to adjust the local ML model (and what to change), automating the update of the local model itself. In this way, intelligent use of data, data models, and processing is adaptable, a key tenet of the digital enterprise architecture.
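
In code form, managing the local model as a versioned artifact might look like the following sketch; the registry_client interface (latest, download) and the model fields are hypothetical.

    # Illustrative sketch of treating an on-device model as versioned, managed code.
    LOCAL_MODEL = {"name": "qoe-adjuster", "version": "1.3.0"}

    def check_for_update(registry_client, local_model: dict) -> dict:
        """Ask a central model registry whether a newer version should be deployed."""
        latest = registry_client.latest(local_model["name"])  # e.g., {"version": "1.4.0"}
        if latest["version"] != local_model["version"]:
            # Download and return the new model; the caller swaps it in under the
            # same lifecycle controls (versioning, rollback) used for code.
            return registry_client.download(local_model["name"], latest["version"])
        return local_model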

While the tension between static, transactional business and adaptable, data-driven digital business certainly manifests in decisions about investing in new engineering talent, data engineering should be treated as an expansion of skills and an opportunity for growth rather than a distraction. Organizations can and should encourage and invest in data-related skills development for their engineering staff. Engineering efficiencies gained from optimizing maintenance of existing line-of-business applications should be used as leverage to shift learning and assignment of engineering talent toward data capture, management, and governance for the organization.

As expertise is gained in operating and refining the operational data pipeline, the pattern of finding the right data, developing the algorithms, training them, and assessing and tuning the outcomes becomes more natural. Further, catalogs of available models and associated capabilities are increasingly available as a service from third-party providers, giving organizations ever better options for the most common enterprise AI and ML needs. The transformation to data and algorithms over code will accelerate as applications become more dynamic, more microservices based, more dependent on services, and more global in nature. Therefore, the combined approach of building internal experience and leveraging advancements from industry providers and open communities is recommended.

Data Privacy and Sovereignty

As data about everything becomes more valuable, society is institutionalizing its protection, leading to governance structures that can adjust and tune to changing needs. Governance is evolving to include security, privacy, sovereignty, algorithms, data models, usage, derivative uses, and cascading responsibilities. All of these facets of operational data governance require an organization to become more adaptable.

The regulatory environment, sovereignty rules, and privacy protections, along with the compliance demands of specialized data, will be an overarching driver of how data is managed, where it is stored, how it is processed, and who (or which machines) can access it. New cases are emerging because machine-generated data formerly kept in isolation is now being directly shared, aggregated, or otherwise accessed from outside its system. In the next few years, almost all data will be under some kind of compliance regime to help minimize exposure for customers and companies.

One of the biggest challenges of using data today, particularly structured data, is the all-or-nothing approach: someone either is trusted to see all the raw data or has no access to it.11 One emerging solution to this tension is differential privacy, which provides access to partial or aggregated views of data in such a way that the persons represented in that data are not identifiable. Some start-ups are already using this concept to provide a new level of privacy in critical areas like healthcare and financial services.12
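
As one concrete example of the technique (not drawn from the sources cited here), the Laplace mechanism releases an aggregate such as a count with calibrated noise so that no individual's presence can be inferred; the epsilon value and count below are illustrative.

    import numpy as np

    def dp_count(true_count: int, epsilon: float) -> float:
        """Release a count under epsilon-differential privacy via the Laplace mechanism.

        A counting query has sensitivity 1 (adding or removing one person changes
        the count by at most 1), so Laplace noise with scale 1/epsilon suffices.
        """
        return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

    # Example: share how many patients matched a cohort query without revealing
    # whether any particular individual is in the data set.
    print(dp_count(true_count=1342, epsilon=0.5))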

Granularity of control over access to data is required to support the governance requirements and mitigate the risk associated with accessing data. This granularity is measured in two dimensions: scope and use. Scope is the precision of the data set; more granular means smaller (by field versus row in a table, for example). Use is the role of the user, the type of access, and associated conditions. For example, the same user may have multiple roles, triggering the need to access data for different purposes, and each purpose may have associated constraints such as time windows bounding the sanctioned access. The level of granularity required will increase over time, driven by ongoing cases of data exposure abuses and tightening of regulatory constraints in response to them.
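
A sketch of how the two dimensions might be encoded as an access check follows; the roles, fields, and time windows are hypothetical.

    from dataclasses import dataclass
    from datetime import datetime, time

    @dataclass
    class Policy:
        role: str                # use: who is asking
        access: str              # use: type of access, e.g., "read"
        allowed_fields: set      # scope: field-level rather than row- or table-level
        window: tuple            # use condition: sanctioned hours (start, end)

    POLICIES = [
        Policy("sre", "read", {"latency_ms", "error_rate"}, (time(0, 0), time(23, 59))),
        Policy("analyst", "read", {"error_rate"}, (time(9, 0), time(17, 0))),
    ]

    def authorize(role: str, access: str, field: str, at: datetime) -> bool:
        for p in POLICIES:
            if (p.role == role and p.access == access and field in p.allowed_fields
                    and p.window[0] <= at.time() <= p.window[1]):
                return True
        return False

    print(authorize("analyst", "read", "error_rate", datetime(2022, 6, 1, 10, 30)))  # True
    print(authorize("analyst", "read", "latency_ms", datetime(2022, 6, 1, 10, 30)))  # False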

Data Governance Evolves

An organization’s ability to manage and govern data will be the key to its ability to modernize business with a digital enterprise architecture. This requires a much broader governance approach than has been used previously, when humans incorporated governance into the business processes they performed. Governance now needs to be built into the digital enterprise architecture and business practices.

Most (80%) organizations say data governance is important to enabling business outcomes.13 Despite this, less than half (43%) either have a data governance program or have implemented a strategy that is considered immature.14 Factors that stand in the way of data governance practices are familiar: cost, lack of executive sponsorship, little-to-no business participation, and a lack of prioritization. But the reality is that a digital business depends on operational data. Foot traffic and patterns at physical locations once provided businesses with the insights they needed to make decisions and drive growth. The digital equivalent is operational data. The dependency of a digital business on that data requires viewing data governance as a mission-critical business function, analogous to fiduciary controls governing finance and testing governing code quality.

Data governance requires a framework capable of supporting a data operations practice and enforcing policies that govern access and usage of data while complying with data sovereignty and privacy requirements. Figure 4-2 shows a simple data governance framework that meets these requirements.

Figure 4-2. A simple data governance framework

Executing on such a framework, even a simple one, will be challenging without employing AI, ML, and automation because of the volume of data ingested, the complexity of the analytics applied, and the speed at which responsive actions need to be taken.

Traditionally, organizations have used human interaction as the governance mechanism for data. Digital businesses rely on data to drive decisions for both the business and operations, which means digital governance must be incorporated into the infrastructure and the development cycle so that data management actions are automated. This requires every component across the entire architecture to be capable of executing a governance action. That capability must be designed into every component and applied everywhere, and it must be transparent, self-regulating, and easy for owners to modify. Businesses in more regulated sectors of the economy have a head start because they are already urged via mandates to adopt organizational structures and processes entirely focused on data access and use.15
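
One way to give every component a built-in, modifiable governance hook is to wrap its data-emitting path with a policy check, as in this sketch; the policy function and record fields are hypothetical stand-ins for a call to a central policy engine.

    import functools

    def governed(policy):
        """Decorator that applies a governance decision before any data is emitted."""
        def wrap(emit):
            @functools.wraps(emit)
            def inner(record: dict):
                decision = policy(record)  # e.g., "allow", "mask", or "drop"
                if decision == "drop":
                    return None
                if decision == "mask":
                    record = {k: ("***" if k == "user_id" else v) for k, v in record.items()}
                return emit(record)
            return inner
        return wrap

    def example_policy(record):
        return "mask" if "user_id" in record else "allow"

    @governed(example_policy)
    def emit_to_pipeline(record):
        print("forwarding:", record)

    emit_to_pipeline({"user_id": "abc123", "latency_ms": 42})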

Conclusion

Establishing an enterprise data practice is essential for deriving new business value from insights found in all types of data, as highlighted recently by the emergence of operational data within the increasingly digital enterprise. The journey is typically gradual as real-world constraints of time, budget, and skilled resources slow down the use of available technology, whether from vendors or open source.

The successful building of a data practice depends upon designing an insights platform using an architecture that is based on standards, with flexibility that allows individual pieces to upgrade at a pace in line with any constraints. The three basic elements of the insights platform are data collection, data processing, and data governance.

In parallel, it is paramount for an organization’s business practices and technology skill sets to mature in lockstep so that as operational visibility, volume of insights, and quality of insights increase, the business processes and operational procedures also become less static and more dynamic. As the entire organization becomes more familiar with the new mode of operation, IT shifts from a supporter role to a strategic enabler of transformation.

1 “Uses for Oil,” Canadian Association of Petroleum Producers, accessed May 30, 2022, https://oreil.ly/D2R54.

2 Motley Fool Transcribing, “Kroger (KR) Q3 2020 Earnings Call Transcript,” December 3, 2020, https://oreil.ly/dprcw.

3 Russell Redman, “Kroger Banks on Burgeoning Sources of Revenue,” Supermarket News, October 31, 2018, https://oreil.ly/JDd3Q.

4 “The State of Application Strategy in 2022,” F5, April 12, 2022, https://oreil.ly/LH0Yj.

5 Bradley Barth, “Uncontrolled API ‘Sprawl’ Creates Unique Visibility and Asset Management Challenges,” SC Media, November 5, 2021, https://oreil.ly/Ks9i1.

6 Juhi Batra, “Collecting Processor Trace in Intel System Debugger,” Intel, accessed May 30, 2022, https://oreil.ly/vbPJH.

7 “GPU Trace,” NVIDIA Developer, accessed May 30, 2022, https://oreil.ly/SQAKu.

8 Laurent Quérel, “Multivariate Metrics—Benchmark,” GitHub, July 23, 2021, https://oreil.ly/lEBYP.

9 Thomas H. Davenport and DJ Patil, “Data Scientist: The Sexiest Job of the 21st Century,” Harvard Business Review, October 2012, https://oreil.ly/LdwKt.

10 F5, “The State of Application Strategy in 2022.”

11 Adrian Bridgwater, “The 13 Types of Data,” Forbes, July 15, 2008, https://oreil.ly/KvwID.

12 “5 Top Emerging Data Privacy Startups,” StartUs Insights, accessed May 30, 2022, https://oreil.ly/YHqZj.

13 Heather Devane, “This Is Why Your Data Governance Strategy Is Failing,” Immuta, April 8, 2021, https://oreil.ly/UHfvY.

14 Ataccama, “Data: Nearly 8 in 10 Businesses Struggle with Data Quality, and Excel Is Still a Roadblock,” Cision PR Newswire, April 7, 2021, https://oreil.ly/o35qo.

15 Immuta and 451 Research, “DataOps Dilemma: Survey Reveals Gap in the Data Supply Chain,” Immuta, August 2021, https://oreil.ly/i4lIi.
