Without data you’re just another person with an opinion.
William Edwards Deming
Data-drivenness is about building tools, abilities, and, most crucially, a culture that acts on data. This chapter will outline what sets data-driven organizations apart. I start with some initial prerequisites about data collection and access. I then contrast reporting and alerting versus analyses in some detail because it is such an important distinction. There are many different types of forward-looking analysis, varying in degrees of sophistication. Thus, I spend some time going over those types, describing them in terms of “levels of analytics” and “analytics maturity,” in particular, discussing the hallmarks of an analytically mature organization. What does that look like?
Let us start us on the way to answering our first question: what does it mean for an organization to be data-driven?
Prerequisite #1: An organization must be collecting data.
Data undoubtedly is a key ingredient. Of course, it can’t just be any data; it has to be the right data. The dataset has to be relevant to the question at hand. It also has to be timely, accurate, clean, unbiased; and perhaps most importantly, it has to be trustworthy.
This is a tall order. Data is always dirtier than you imagine. There can be subtle hidden biases that can sway your conclusions, and cleaning and massaging data can be a tough, time-consuming, and expensive operation. I often hear that data scientists spend 80% of their time obtaining, cleaning, and preparing data, and only 20% of their time building models, analyzing, visualizing, and drawing conclusions from that data (for example, http://bit.ly/nyt-janitor and http://bit.ly/im-data-sci). In my experience, this is entirely plausible. In the next chapter, I’ll cover aspects of data quality in much more detail.
Even if you do have quality data, and even if you have a lot of quality data, you will only get so far and; despite the hype that you might hear, it does not make you data-driven. Some people, especially certain big data vendors and service providers, pimp big data as a panacea: if you collect everything, somewhere in there are diamonds (or golden nuggets or needles or one of many other metaphors) that will make any company successful. The hard truth is that data alone is not enough. A small amount of clean, trustworthy data can be far more valuable than petabytes of junk.
Having accurate, timely, and relevant data, though, is not sufficient to count as data-driven. It must also be:
The data must be in a form that can be joined to other enterprise data when necessary. There are many options, such as relational databases, NoSQL stores, or Hadoop. Use the right tool for the job. For instance, for a long while, the financial analysts at Warby Parker were using Excel to compute the key metrics reported to senior management. They sucked down huge amounts of raw data from different sources and ran VLOOKUPS (an Excel function to find cross-references in the data) to join all that data to get a top-level look at the numbers. This worked well initially, but as the company’s sales and customer base were scaling rapidly, the data got larger and larger, the Excel file approached 300 MB, their computers maxed out their RAM, and the VLOOKUPS would take 10 hours or more, frequently crash, and had to be restarted. They had stretched the tool and approach as far as they could go. Excel had been an appropriate tool, but company hypergrowth changed that. The mechanics of getting those numbers became a huge time-sink for the analysts and a source of stress as to whether they would get their numbers or have to wait another 10 hours to rerun those VLOOKUPS. It turned them from analysts into Microsoft data engineers. My team helped to bring that whole dataset into a MySQL relational database. We wrote queries to crunch the numbers for them, allowing them to focus on analysis, trends, and presentation of that data—a far better use of their time. Now that they have better tools and more time, they are producing deeper, richer analyses.
There must be a data-sharing culture within the organization so that data can be joined, such as combining customers’ clickstream with their transactional history. Imagine a patient admitted to a hospital ER, receiving treatment and then being released with a requirement to attend an outpatient clinic for additional treatment and checkups. The patient is going to receive worse customer service and more importantly worse care if the hospital and clinic don’t share data—when, where, and why was he admitted, what issues did he present, what treatment did he receive, etc. From the healthcare providers’ perspective, their analysts are going to find it hard or impossible to analyze and improve the process and care if they don’t have a coherent, accurate picture of patient flow, diagnostic processes, and complete longitudinal data of those patients. So, siloed data is always going to inhibit the scope of what can be achieved. When more data is available to more parts of a system, the whole is greater than the sum of the parts.
There must be appropriate tools to query and slice and dice the data. All reporting and analysis requires filtering, grouping, and aggregating data to reduce the large amounts of raw data into a smaller set of higher-level numbers that help our brains comprehend what is happening in a business. I need to be able to see trends or understand differences among customer segments. Analysts have to have tools that allow them to compute those metrics relatively easily.
(All of these topics will be covered in greater detail in later chapters.)
OK, so now we have data and it is accessible. Is this sufficient? No, not yet. You need people with the right skills to use that data. That can mean the mechanics of filtering and aggregating data, such as through a query language or Excel macros, but it also means people who design and choose the appropriate metrics to extract and track (this topic is covered in Chapter 6). Those metrics might be resubscription rates (for subscription services such as Netflix or the Wall Street Journal), lifetime values, or growth metrics, but someone needs to decide upon them and someone (else) needs to create some process to provide those values.
So, for an organization to be data-driven, there have to be humans in the loop, humans who ask the right questions of the data, humans who have the skills to extract the right data and metrics, and humans who use that data to inform next steps. In short, data alone is not going to save your organization.
Let’s suppose you have an analytical group with access to accurate data. The group extracts sales data and generates a report proudly claiming that the company’s bookings grew 5.2% from April to May (Figure 1-1).
This is certainly now sounding more like a data-driven company. However, this is still deeply insufficient. Certainly it is good that they are tracking these metrics. The CFO and CEO will definitely be interested in those numbers. What, however, does this value of 5.2% really tell you? Very little, in fact. There are many possible reasons why the company sales grew by this amount:
“As orgs grow larger and complex, people at the top depend less on firsthand experience, and more on heavily processed data.” — John Gardner
John Maeda (@johnmaeda) August 16, 2014
Ding, ding, ding! Web-app server #14 has just averaged more than 98% CPU utilization for the last five minutes.
Alerts are essentially reports about what is happening right now. They typically provide very specific data with well-designed metrics. But like reports, they don’t tell you why you are seeing a spike in CPU utilization, and they don’t tell you what to do, right now, to rectify the problem. As such, like reports, they lack this crucial context. There is no causal explanation. This is the point at which performance engineers or system administrators dive into the production logs to ascertain what is going on, why it is happening, and what the options are to fixing it: rollback some code, spin up some more servers, reconfigure the load balancer, etc.
Figure 1-2 shows an example of server load over time. There is some variability, but most of the day is spent with a run queue of about 0.5 or fewer. At 1 a.m., load starts to spike, shooting up to more than 5, a tenfold increase over “normal," in the space of 30 minutes. It seems highly unusual. What’s going on? Maybe someone should fix that, but how?
In this case, it is just the weekly backups running. This happens every Thursday at 1 a.m.—perfectly normal, nothing to see here. This makes the point that there is great data here and a good metric that’s presented clearly. But the context—that it is caused by backups, that it happens on a particular schedule and this 1 a.m. time slot is expected, and that the server can handle this load without problems—is all lacking.
Reporting and alerting are necessary but not sufficient characteristics of being data-driven. We should, however, not understate the importance of both these activities. Reporting especially is a highly valuable component of a data-driven organization. You can’t have an effective one without it. However, the reverse is not true: there are many organizations that focus on reporting and may have little to no real (objective) analysis. For one thing, reporting may be driven by legal requirement and responsibilities, such as Sarbanes-Oxley compliance and generating earnings reports to shareholders, and not from an internal cultural drive to improve the business.
Reporting tells you what happened in the past. It also provides a baseline from which to observe changes and trends. It can be interesting and can keep some investors and shareholders happy, but it is a fundamentally backward view of the world. To be data-driven, you have to go beyond that. To be forward-looking and engage in analysis, dig in, and find out why the numbers are changing and, where appropriate, make testable predictions or run experiments to gather more data that will shed light on why.
Let’s be more explicit and compare and contrast the two. Here is one set of definitions:
“The process of organizing data into informational summaries in order to monitor how different areas of a business are performing”1
“Transforming data assets into competitive insights that will drive business decisions and actions using people, processes and technologies”2
Reporting says what happened—we hit a peak of 63,000 simultaneous visitors on the website on Thursday at 10:03 a.m. It tends to provide a very specific scope.
Analysis says why it happened—the company was mentioned in a piece on the TV newsmagazine show 60 Minutes at 10:01 a.m.—and should recommend what the organization can or should do to generate more or less of the same.
Reporting is matter of fact, descriptive. Analysis, on the other hand, is prescriptive.
In Table 1-1, we summarize the differences between the two. Hopefully it is now clear why analysis and being data-driven is such a powerful facet or cultural component of a business. This is what can drive a business in new directions or to greater levels of efficiency.
|Raise questions||Answer questions|
|Data → information||Data + information → insights|
|Reports, dashboards, alerts||Findings, recommendations, predictions|
|No context||Context + storytelling|
A) What happened?
B) What is happening now?
C) What will happen?
D) How and why did it happen?
Modeling, experimental design
E) What’s the next best action?
F) What’s the best/worst that can happen?
Prediction, optimization, simulation
Here we can see insight-driving activities in the bottom row. As I noted earlier, reporting (A) and alerting (B) are simply not data-driven: they state what happened in the past or that something unusual or undesirable is happening now; there is no explanation of why it is happening or why it did happen, and no recommendations as how to resolve or reproduce the situation. Digging down to understand causal factors through models or experiments (D) is a precursor to data-drivenness. Only by understanding why something happened can you formulate a plan or set of recommendations (E). E) and F) are truly data-driven but if and only if the information is acted upon—explained in detail below.
(C is a danger zone: it can be easy enough to extend a perceived trend out to the future—in Excel, click “Chart" and then “Add trendline”—that is, extrapolate outside the current’s data range and make a naïve prediction. Even making a sensible choice about a functional form for the model, there are many reasons why that prediction may be misleading or plain wrong. To have confidence in those predictions, you should strive to have a causal model. Types of analysis is covered in Chapter 5.)
There are several types of activities that truly data-driven organizations engage in:
A true data-driven organization will be doing at least one of these things, something forward-looking where data is a first-class citizen.
OK, now we have an organization that has high-quality data and skilled analysts who are engaged in these forward-looking activities. Surely, that makes it data-driven!
Unfortunately, not necessarily. Like a tree falling in a forest with no one to hear it, if analysts are putting out analyses but no one takes notice, if they don’t influence decision makers’ decisions, which are still based on gut and opinion, it is not data-driven. Analytics has to inform and influence the influencers.
Dykes talks about this in terms of an “analytics value chain” (see Figure 1-3). Data has to drive reports, which should lead to deeper dives and analysis. Those analyses have to get to the decision maker who incorporates those into her decision-making process. This step is key to being data-driven. An organization needs that data and that analysis to drive a decision that changes strategy or tactics and makes an ultimate impact to the organization in some manner. Technology and training can do the first part: enable analysts to run analyses and write up their findings. However, it is the culture that sets up the mindset and process to take notice of those findings, trust them, and act upon them.
Finally, we get to the crux of what being data-driven means. A data-driven organization will use the data as critical evidence to help inform and influence strategy. There will be an evidence-based culture in which data can be trusted and the analysis is highly relevant, informative, and used to determine next steps.
Therein lies the challenge. If your organization is making gut decisions, how do you make a case for more data-driven decision making? It is not easy and it is not quick, so don’t expect radical changes overnight, but everyone in an organization can contribute significantly to such an improvement. In this book, we will examine a number of ways an organization’s culture should be driving to be more data-driven.
In 2009, Jim Davis, the senior vice president and chief marketing officer of SAS Institute, declared that there are eight levels of analytics:5
(You can see that this is also where Davenport’s framework and Table 1-2 is based. You can easily map the first four levels to the upper information row and the second set of four to the lower insight row of Table 1-2.)
I like the general concept and the labels. However, in the way that both Davis (2009) and Davenport and Harris (2007) present their ideas, especially with the big sweeping upward arrow, it is too easy to interpret this as a progression or some sort of hierarchy, kind of like a video game where you progress to each next level only after having conquered the previous level.
This pseudo-progression is often labeled as analytics maturity. If you do a Google image search for “analytics maturity,” you will see what I mean; that many BI vendors and practitioners present this as set of stepping stones with unidirectional arrows pointing from one level to the next. Analytics is not like that: it cuts across levels within an analysis, and different parts of the organization can be engaged in analyses of differing degrees of sophistication at any one time. Ron Shevlin7 makes some great points:
From a capabilities perspective, there’s no reason why a firm couldn’t forecast something likes sales (“level” 6) without knowing where exactly the “problem” with sales is (“level” 3)...How could I, as a manager, address the question of “what actions are needed now?” without some understanding of “what if these trends continue?” and “what will happen next?” (“levels” 6 and 7)?
The correct way, I think, to interpret this is to think of the maximum level that the organization is engaged in is positively correlated with the level of commitment, investment, and utility of analytics and, as Davenport and Harris argue, the analytical competitiveness. For instance, if you have an operations research team of Ph.D.s dedicated to optimizing your global supply chain, you are clearly heavily invested in data and analytics. If your organization only gets to alerts and query drilldowns, you have a lower investment and are less data-driven.
The underlying implication is that more sophisticated analytics are better; they make and organization more competitive. Is that true? In a fascinating study,8 MIT Sloan Management Review collaborated with IBM Institute for Business Value to survey 3,000 managers and analysts across 30 industries about their use of and beliefs about the value of analytics.
One survey question asked about the organization’s competitive position where the possible responses were:
Those organizations that chose answers 1 or 4 were deemed top and lower performers, respectively. Interestingly, compared to lower performers, top performers were:
There are certainly complicating factors in the methodology. There may be significant survivor bias, and there will likely be correlation of top performers with organization size (we know that revenue of these organizations ranged from less than $500 million to more than $10 billion). For instance, perhaps only larger, more successful organizations have the bandwidth and resources to develop sophisticated operations research departments that can develop and run supply chain simulation models. However, there was broad agreement that better and more sophisticated analytics drove business value.
The authors identified three levels of analytics capability: aspirational, experienced, and transformed. These are summarized in Table 1-3.
|Use analytics to...||Justify actions||Guide actions||Prescribe actions|
|Use rigorous approaches to make decisions||Rarely||Sometimes||Mostly|
|Ability to capture, aggregate, and analyze or share information and insights||Limited||Moderate||High|
Compared to aspirational organizations, transformed organizations were:
Again, there is a complicated tangle of cause and effect and biases here, but there is an association between competitive advantage, relative to industry peers, and analytics sophistication.
So, what is stopping organizations from adopting analytics widely? Two of the top three answers were lack of understanding of how to use analytics and lack of skills internally in the line of business (see Figure 1-5).
Those are something that everyone, including every analyst, can help drive. For instance, analysts can help “level up” their skill set and can be more proactive about doing good work and communicating its value to their managers and to the business. They can do more research to dig up case studies of how other organizations tackled similar business problems through analytics. Managers of the data engineers can assign resource to data integration and quality so that data is trusted. Senior managers can promote or demand greater sharing of data and designate clearer ownership and stewardship of data, such as appointing a chief analytics officer or chief data officer (covered in Chapter 11). Everyone has a role.
We’ll dive into these issues in more detail in the following chapters. More concretely, I’ll examine the needs in terms the foundation: the raw and aggregated data itself and its quality (Chapters 2 and 3). Next, I’ll examine the analyst organization itself: the types of analysts, the skills they should possess, and how they should be structured (Chapter 4). I’ll cover aspects of data analysis (Chapter 5), metric design (Chapter 6), and storytelling with data (Chapter 7). Chapter 8 covers A/B testing. Crucially, we’ll cover the culture and decision-making process that are the hallmarks of a great data-driven organization (Chapters 9 and 10). I’ll discuss what’s needed to manage the culture and change management, especially in larger organizations: a data-driven C-suite. In particular, I’ll cover three new executive positions: chief data officer, chief digital officer, and chief analytics officer (Chapter 11). Chapter 12 covers ethics and privacy issues of data and how an organization that respects its users’ data may in fact limit what its analysts may do with the data. I’ll end with some general conclusions (Chapter 13).
2 Faria, M., “Acting on Analytics: How to Build a Data-Driven Enterprise.” BrightTALK, September 11, 2013.
3 Davenport, T.H., J. G. Harris, and R. Morison. Competing on Analytics. Boston: Harvard Business Press, 2010.
4 Bosker, B., “Google Exec Marissa Mayer Explains Why There Aren’t More Girl Geeks.” The Huffington Post, July 6, 2011.
6 Despite appearing two years earlier, Davenport and Harris’ source is cited as “adapted from a graphic produced by SAS.”