Chapter 1. Finding Signals in the Midst of Noise

Data generated by connected devices and other new sources has transformed logistics and commerce. It has transformed maintenance of all kinds, in virtually all verticals. It is fueling an ongoing revolution in sales and marketing. It is one of several intersecting factors that have completely transformed data management. To understand the import and ramifications of this transformation, it is helpful to have a sense of what analytics are and how they work. After all, even if we treat data management as an end unto itself, the creation, preservation, and maintenance of data are always adjunct to other purposes. Analysis is just one of these purposes—albeit one of outsized importance.

The Lives of Analytics

Data is exponentially more useful when it is joined together with other useful units of data to form new combinations. Analytics draws its power from this Lego-like network effect. We create analytics by fusing different units of data into larger combinations called models: the star or snowflake schemas that link facts to dimensions in data warehouse architecture are models. Fundamentally, an analytic model is a representation of some slice of the business and its world: for example, sales of Product N in Region X during Period Y to customers with Z1, Z2, and Z3 attributes. In fact, queries like this one are the raison d’être of data warehouse architecture. The answer to this query already “lives” in the data that populates the warehouse’s fact and dimension tables; the data warehouse performs operations that join facts to dimensions in real time, creating an analytic model that “answers” the query.
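
A minimal sketch can make the join of facts to dimensions concrete. It uses Python's built-in sqlite3 module as a stand-in for a warehouse engine; the table names, column names, the sales_total helper, and every row of sample data are hypothetical, invented only to illustrate the shape of such a query.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by three dimensions.
# All names and rows are illustrative assumptions, not real data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_region   (region_id   INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, segment TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER, region_id INTEGER, customer_id INTEGER,
    sale_date TEXT, amount REAL
);
INSERT INTO dim_product  VALUES (1, 'Product N'), (2, 'Product M');
INSERT INTO dim_region   VALUES (1, 'Region X'), (2, 'Region W');
INSERT INTO dim_customer VALUES (1, 'Z1'), (2, 'other');
INSERT INTO fact_sales VALUES
    (1, 1, 1, '2021-01-15', 100.0),
    (1, 1, 1, '2021-02-10',  50.0),
    (1, 1, 2, '2021-01-20',  75.0),
    (1, 2, 1, '2021-01-25',  40.0),
    (2, 1, 1, '2021-01-30',  60.0),
    (1, 1, 1, '2021-07-01',  90.0);
""")

def sales_total(conn, product, region, segment, start, end):
    """Join the fact table to its dimensions and sum the matching sales."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(f.amount), 0.0)
        FROM fact_sales f
        JOIN dim_product  p ON p.product_id  = f.product_id
        JOIN dim_region   r ON r.region_id   = f.region_id
        JOIN dim_customer c ON c.customer_id = f.customer_id
        WHERE p.name = ? AND r.name = ? AND c.segment = ?
          AND f.sale_date BETWEEN ? AND ?
        """,
        (product, region, segment, start, end),
    ).fetchone()
    return row[0]

# Sales of Product N in Region X during Q1 to customers in segment Z1.
total = sales_total(conn, "Product N", "Region X", "Z1",
                    "2021-01-01", "2021-03-31")
print(total)
```

The query can only filter on attributes that already exist as dimension columns. Asking about a customer attribute that was never modeled would require a schema change or a trip back to the source data.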

This example also gets at something else: in data warehouse architecture, the role or function of analytics is to answer questions. But today’s cutting-edge analytic practices invert this arrangement: they seek to ask questions. What if Z1, Z2, and Z3 attributes are unknown? What if, in fact, their corresponding dimensions don’t even exist in the data warehouse? No conceivable star or snowflake schema can link facts to dimensions that do not yet exist. So the business analyst or data scientist must go to source data—assuming it is available—to answer these and similar questions.

Analytics as a Site of Rapid and Ongoing Transformation

Innovation in analytics is not just a function of fusing Lego-like blocks of data together to create larger ensembles of models. Recent analytic innovation is characterized by the intersection of three distinct trends: first, the capacity to cost-effectively collect, store, and process more and different types of data; second, the mainstream uptake of ML and especially of advanced ML techniques; and third, the application of this data (of different types and sizes) and of these advanced ML techniques to new problems that involve asking new kinds of questions.

Two decades ago, the data warehouse constituted the analytic center of gravity of the average enterprise: all business-critical data was vectored into it.

It was good to be the king. The warehouse and its constraints dictated the use of a dominant technology to store and manage data: the relational database, or relational database management system (RDBMS). And this dependence on the RDBMS dictated the use of a domain-specific language—SQL—for accessing and manipulating data. These same constraints helped to formalize the use of a set of techniques—starting with extract, transform, and load (ETL)—for engineering the data used to populate the warehouse.

But the data that businesses generate and expect to mine for insights is no longer of a single dominant type (i.e., relational data) and no longer conforms to a single set of general characteristics. Business data no longer “lives” in the same places (local databases and data warehouses, spreadsheets, network storage, etc.) but is distributed, dispersed across on-premises systems, off-premises cloud applications and storage services, mobile devices, the web, and so on.

What is more, business analysts, BI developers, and DBAs are no longer the only people who work with data, nor is BI itself the only—or even the primary—practice area for data work. BI and data warehouse people compete with data scientists, data engineers, ML engineers, and other specialists for access to more data, to fresher data, and to data of different types.

This gets at another major change—one that has to do with the role human intelligence now assumes in both the analysis of data and the production of analytics. The upshot is that human intelligence now allocates a large (and growing) proportion of analysis to machine intelligence, which makes it possible to automate not only the task of analysis itself but also that of decision making, along with data preparation, data enrichment, and visualization of results.

Human intelligence is likewise training machine intelligence to replicate itself—that is, to produce its own analytic models. (This is the focus of ML in general and especially of deep learning, reinforcement learning, and other advanced ML techniques.) The growth of ML is a reprise of a familiar story arc. Until relatively recently, and with a few notable exceptions, human “computers,” not machines, performed most mathematical calculations. As the complexity of the calculations involved and (especially) the scale at which they needed to be calculated increased, machine computers at first complemented and then ultimately replaced human computers. The same is happening with analytics, whereby organizations are replacing human analysts with automated analytic technologies and human-directed analysis with machine-directed analytics. In the present, the bulk of analysis is already performed by machines; in the future, almost all analytics will be produced by machines, too.

Analytic practices are also changing. The BI practice area is now complemented by new practice areas such as data science and ML/artificial intelligence (AI) development. The people and machines who work with data no longer expect to use a single means of access—an ODBC interface—and a single common language (SQL) to access, manipulate, and query data. And analytics as such is no longer the remit of a single practice area or a single domain: the data warehouse and BI; data science and its products; ML engineering and its products, etc. Rather, almost all applications and services will incorporate analytic capabilities, with the result that the consumption of analytics will, in a sense, become commoditized.

Lastly, the batch ingest model that was ideal for the data warehouse is unsuitable for emerging data warehouse use cases, to say nothing of data science, ML/AI engineering, and other analytic practices. Real-time data warehousing is not in any sense new, of course; what is new, however, is the expectation that data should be as fresh, as close to real time as possible. The upshot is that data must now be ingested as it arrives: as it pulses, as it streams; as it trickles, dribbles, or deluges.
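
As a rough illustration of ingesting data as it arrives rather than in batches, the sketch below updates a running mean per container after every incoming event. The event shape (a container ID paired with a temperature reading) and the ingest_stream function are assumptions made for this example; a production pipeline would normally sit on top of a streaming platform rather than a Python generator.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (container_id, temperature_reading).
Event = Tuple[str, float]

def ingest_stream(events: Iterable[Event]) -> Iterator[Dict[str, float]]:
    """Yield the running mean temperature per container after each event.

    Nothing waits for a batch window to close: every event updates the
    aggregate immediately, so downstream consumers always see fresh values.
    """
    counts: Dict[str, int] = defaultdict(int)
    means: Dict[str, float] = defaultdict(float)
    for container_id, reading in events:
        counts[container_id] += 1
        # Incremental mean update; avoids storing the full event history.
        means[container_id] += (reading - means[container_id]) / counts[container_id]
        yield dict(means)

# Illustrative events only.
events = [("c1", 10.0), ("c1", 14.0), ("c2", 5.0)]
snapshots = list(ingest_stream(events))
print(snapshots[-1])
```

Because the function accepts any iterable, the same code works whether events trickle in one at a time or arrive in a burst.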

Diagnosing the Present, Predicting the Future

Most BI work consists of combining customer, product, sales, and similar data into multidimensional views. The warehouse is still the killer app for asking questions of this kind. But access to data of diverse shapes and sizes permits businesses to ask new, different, more ambitious questions—questions that involve discovering as-yet-unknown relationships between bits and pieces of data.

Consider the twenty-first-century cargo ship. Like other modes of commercial transport—railcars, tractor trailers, and aircraft—the cargo ship now bristles with sensors of different types: temperature sensors; sensors that record the frequency and impact of bumps or jostles; sensors that measure motion; sensors that detect chemicals and gases, such as those correlated with cargo spoilage. These sensors generate enormous volumes of data, a small subset of which gets transmitted back to the shipping company, sometimes in real time. This data is a potential treasure trove for business.

Raw sensor data is of limited use in data warehouse-driven analytic development, where modelers and business analysts construct analytic views grounded in known relationships in available data. But the data generated by sensors lets an organization ask questions that have a definite inductive quality: they’re attempts to reason backward from effects to causes, attempts to discover unknown relationships that permit businesses to diagnose problems in the present, attempts to make predictions about the future and to take action. For example, in a ship carrying, say, bananas and mangos from Puerto Quetzal, Guatemala, to Seattle, Washington, is there a possible relationship between prolonged jostling or bumping during loading, transport, and unloading and higher rates of spoilage? In other words, is it possible to correlate the frequency and severity of impact with the rate of fruit spoilage? What about air quality—or, more precisely, what about the mix of gases in the air in the shipping containers that house the cargo of bananas and mangos? Can a combination of these and other factors be correlated with spoilage? Also, is it possible to detect early warnings of spoilage—for example, in the presence of certain noxious chemicals or gases? If so, what could a shipping company do to prevent this?

Taking Action, Maximizing Outcomes

Answering these questions depends on the availability and mainstream uptake of advanced statistical tools and techniques. These are questions that involve modeling a finite set of interactions among known variables in a slice of the world—namely, a shipping container stowed in a cargo ship traversing a weeks-long voyage from a tropical to a temperate climate—in which not all variables can be anticipated, let alone modeled. Data scientists and other skilled people use statistical techniques to determine, first, if a question has a statistically significant “answer” and, second, how to interpret—how to use—this “answer.” The strongest “answers” are usually products of combinations of different factors. So bumps and impacts alone aren’t strongly correlated with higher-than-average rates of spoilage. However, in combination with other factors (e.g., modest fluctuations in temperature, anomalously high proportions of certain gases in the atmosphere), the case for a correlation seems much stronger. Once the data science team decides that these relationships are statistically significant, it refocuses its efforts on another problem: what can the business do with this knowledge?
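
One way to picture the reasoning above is a small experiment on synthetic data. The sketch below computes a Pearson correlation and estimates significance with a permutation test, which is one of several possible techniques and not necessarily the one a given team would choose. The data is generated, not measured: spoilage is deliberately constructed to depend on the interaction of two made-up factors (an impact score and an ethylene level), so the numbers it prints say nothing about real cargo.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def permutation_p_value(xs, ys, trials=1000, seed=0):
    """Estimate how often shuffled data correlates as strongly as observed."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any real pairing between xs and ys
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of zero.
    return (hits + 1) / (trials + 1)

# Synthetic voyages (NOT real measurements). By construction, spoilage is
# driven by the product of the two factors plus a little random noise.
rng = random.Random(42)
impacts = [rng.random() for _ in range(200)]
ethylene = [rng.random() for _ in range(200)]
spoilage = [i * e + rng.gauss(0, 0.05) for i, e in zip(impacts, ethylene)]
combined = [i * e for i, e in zip(impacts, ethylene)]

r_impacts = pearson(impacts, spoilage)
r_combined = pearson(combined, spoilage)
p_combined = permutation_p_value(combined, spoilage)

print(f"impacts alone:    r = {r_impacts:.2f}")
print(f"combined factors: r = {r_combined:.2f}, p ~ {p_combined:.4f}")
```

A low p-value only says the pairing is unlikely to be chance. Deciding what the relationship means, and whether it is worth acting on, remains the interpretive step described above.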

This is one potential use for AI engineering. Not “AI” in the mode of artificial general intelligence, which is analogous to human cognition in its self-reflective dimension. AI engineering is the application of ML to identify and diagnose problems; the use of automated rules engines to trigger interventions on the basis of a diagnosis; and, finally, constant monitoring by an AI feedback loop to measure the effectiveness of interventions and to self-correct if necessary.
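
The three parts of that definition (diagnose, trigger, monitor) can be sketched in a few lines. Everything here is a hypothetical stand-in: the diagnose function is a hand-written formula where a trained ML model would sit, and the Rule class, field names, thresholds, and action names are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List

Reading = Dict[str, float]

def diagnose(reading: Reading) -> float:
    """Stand-in for a trained ML model: returns a spoilage-risk score in [0, 1].

    The weights and scaling constants are arbitrary assumptions.
    """
    score = 0.5 * reading["ethylene_ppm"] / 10.0 + 0.5 * reading["impacts"] / 20.0
    return max(0.0, min(1.0, score))

@dataclass
class Rule:
    name: str
    threshold: float  # fire when risk is at or above this value
    action: str

def trigger(rules: List[Rule], risk: float) -> List[str]:
    """Rules engine: return the actions whose thresholds the risk has reached."""
    return [rule.action for rule in rules if risk >= rule.threshold]

def feedback(rule: Rule, risk_before: float, risk_after: float,
             step: float = 0.05) -> float:
    """Feedback loop: if an intervention did not lower risk, fire earlier next time."""
    if risk_after >= risk_before:
        rule.threshold = max(0.0, rule.threshold - step)
    return rule.threshold

rules = [
    Rule("ventilate", 0.6, "increase_ventilation"),
    Rule("escalate", 0.9, "alert_operator"),
]

risk = diagnose({"ethylene_ppm": 8.0, "impacts": 16.0})
actions = trigger(rules, risk)
print(risk, actions)

# Suppose a later reading shows risk rose despite the intervention.
new_threshold = feedback(rules[0], risk_before=risk, risk_after=0.85)
print(new_threshold)
```

Lowering a threshold is only one possible form of self-correction. A fuller system might retrain the model or retire an ineffective rule instead.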

Businesses want to use AI to increase productivity, to accelerate processes, to optimize outcomes, to forestall reversals, and, moreover, to deliver wholly new products and services. The cost-economics and the colocality of ML, AI, data science, and other services in the cloud are ideal for the AI engineering use case: cloud’s primary storage layer is inexpensive and (from a subscriber’s point of view) practically unlimited; cloud’s elastic character permits an organization to grow or reduce the storage, compute, and network resources it consumes; the availability of tightly integrated cloud development services—which in most cases also accommodate the tools, libraries, and techniques that data scientists, ML engineers, data engineers, and other technicians prefer to use—is still another selling point, as is the emergence of full-featured, cloud-centered data integration services.

And because so much data originates in the cloud, the cloud is a logical locus of data ingestion and integration. The cloud comprises a highly integrated—and, in the case of homogeneous (vendor-specific) deployments, vertically integrated—site for data science and AI engineering.

Putting It All Together

This is a world in which the data warehouse has an increasingly vital role to play: not as a kind of subaltern to otherwise dominant data science and AI engineering practices, but rather as the privileged destination for the analytic products—the insights—of data scientific and ML development.

Think about the twenty-first-century cargo ship, bristling with sensors amidst its Tetris-like assortment of shipping containers. Assume that its onboard sensors record a higher-than-usual number of severe impacts during loading. Assume, too, that sensors record variations in temperature and the presence of a specific gas (say, ethylene) in several onboard containers. Given the correlation between each of these factors and a definite rate of fruit spoilage, what does this mean for the projected value of the ship’s cargo? This is, of course, a question that the data scientist could answer; it is also, however, just the kind of question that a business decision maker might want to ask on a recurring basis. And so the data science team, working in tandem with a business analyst, adds a new dimension to the warehouse to incorporate this analytic “fact.” The upshot is that businesspeople can now pose questions and make projections about spoilage based on events that (just five years ago) could be modeled only imprecisely. For example, what if the ship were to dock at the Port of Los Angeles? Would getting the mangos and bananas to market earlier significantly arrest the rate of spoilage?

This is the vital role of the data warehouse in the context of modern data management and a cluster of complementary analytic practices. Owing to a combination of factors, the focus of these practices—and of data management itself—has shifted to the cloud. This is the biggest change by far. Now as ever, the data warehouse in the cloud is positioned as the go-to engine for day-to-day and strategic business decision making. It is no longer the center of its own planetary system, however.
