“Getting Analytics Right” is, admittedly, a big promise in the big data era. But given all of the opportunity and value at stake, how can we aspire to anything less? Getting analytics right is especially important considering the kinds of simple-to-ask yet difficult-to-answer questions that linger within today’s enterprises. On the one hand, there are customer data questions like: “Which customer segments have the highest loyalty rates?” or “Which of my sales prospects is most likely to convert to a customer?” On the other hand are sourcing questions like: “Are we getting the best possible price and terms for everything we buy?” and “What’s our total spend for each supplier across all business units?”

With the kind of internal and external data now available to enterprises, these questions seem eminently answerable through a process as simple and logical as:

  1. Ask the question

  2. Define the analytic

  3. Locate, organize, and analyze the data

  4. Answer the question

  5. Repeat

Except that the process rarely goes that way.

In fact, a recent Forbes Insights/Teradata survey of 316 executives at large global companies found that 47% “do not think that their companies’ big data and analytics capabilities are above par or best of breed.” Given that “90% of organizations report medium to high levels of investment in big data analytics,” the executives’ self-criticism raises the question: why, with so many urgent questions to answer with analytics every day, are so many companies still falling short of becoming truly data-driven?

In this chapter, we’ll explore the gap between the potential of big data analytics in the enterprise and where it actually falls short, and uncover some of the underlying problems and their solutions.

Analytics Projects Often Start in the Wrong Place

Many analytics projects start with a look at some primary data sources and an inference about what kinds of insights those sources can provide. In other words, they take the available sources as a constraint, and then go from there. As an example, let’s take the sourcing price and terms question mentioned earlier: “Are we getting the best possible price and terms for everything we buy?” A procurement analyst may only have easy access to audited data at the “head” of the distribution—e.g., from the enterprise’s largest suppliers. The problem is, price variance may in fact be driven by smaller suppliers in the long tail.

Running a spend analytics project like this skips a crucial step. Analysis must start with the business questions you’re trying to answer and only then move into the data. Leading with your data limits the number and type of problems you can solve to the data you perceive to be available. Stepping back and leading with your questions, by contrast, liberates you from such constraints, allowing your imagination to run wild about what you could learn about customers, vendors, employees, and so on.

Analytics Projects End Too Soon

Through software, services, or a combination of both, most analytics projects can arrive at answers to the questions your team is asking. The procurement analyst may indeed be able to gather and cobble together enough long-tail data to optimize spend in one category, but a successful analytics project shouldn’t stop with the delivery of its specific answers. A successful analytics project should build a framework for answering questions repeatedly—in this case, optimizing spend across all categories. For all the software and services money they’re spending, businesses should expect every analytics project to arm them with the knowledge and infrastructure to ask, analyze, and answer future questions with greater efficiency and independence.

Analytics Projects Take Too Long…and Still Fall Short

Despite improved methods and technologies, many analytics projects still get gummed up in complex data preparation, cleaning, and integration efforts. Conventional industry wisdom holds that 80% of analytics time is spent preparing the data, and only 20% actually analyzing it. In the big data era, that wisdom holds truer than ever. Massive reserves of enterprise data are scattered across variable formats and hundreds of disparate silos. Consider, in our spend analysis example, the many hundreds or thousands of supplier sources that could be scattered throughout a multinational manufacturing conglomerate. Then imagine integrating this information for analysis through manual methods—and the kind of preparation delays standing between you and the answer to your optimization questions.

Worse than delays, preparation problems can significantly diminish the quality and accuracy of the answers, with incomplete data risking incorrect insights and decisions. Faced with a long, arduous integration process, analysts may be compelled to take what they can (e.g., audited spend data from the largest suppliers)—leaving the rest for another day, and leaving the questions without the benefit of the full variety of relevant data.

Human-Machine Analytics Solutions

So what can businesses do when they are awash in data and have the tools to analyze it, but are continuously frustrated by incomplete, late, or useless answers to critical business questions?

We can create human-machine analytics solutions designed specifically to get businesses more and better answers, faster, and continuously. Fortunately, a range of analytics solutions are emerging to give businesses some real options. These solutions should feature:

  1. Speed/Quantity—Get more answers faster, by spending less time preparing data and more time analyzing it.

  2. Quality—Get better answers to questions, by finding and using more relevant data in analysis—not just what’s most obvious or familiar.

  3. Repeatability—Answer questions continuously, by leaving customers with a reusable analytic infrastructure.

Data preparation platforms from the likes of Informatica, OpenRefine, and Tamr have evolved over the last few years, becoming faster, nimbler, and more lightweight than traditional ETL and MDM solutions. These automated platforms help businesses embrace—not avoid—data variety, by quickly pulling data from many more sources than was historically possible. As a result, businesses get faster and better answers to their questions, since so much valuable information resides in “long-tail” data. To ensure both speed and quality of preparation and analysis, we need solutions that pair machine-driven platforms for discovering, organizing, and unifying long-tail data with the advice of business domain and data science experts.
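The record matching at the core of such preparation can be sketched in a few lines. The following is a minimal, illustrative version using only the Python standard library; the supplier names and the similarity threshold are invented for the example, and real preparation platforms use far more sophisticated matching than a single string-similarity score:

```python
# Toy sketch of supplier-name matching, the kind of task data
# preparation platforms automate at scale. Names are invented.
from difflib import SequenceMatcher

suppliers = [
    "Acme Widgets Inc.",
    "ACME Widgets, Incorporated",
    "Baxter Metal Co",
    "Baxter Metals Company",
    "Acme Widgets",
]

def similar(a, b, threshold=0.7):
    """Treat two supplier names as the same entity if their
    case-insensitive similarity ratio clears the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Greedy clustering: assign each name to the first cluster whose
# representative (first member) it matches, else start a new cluster.
clusters = []
for name in suppliers:
    for cluster in clusters:
        if similar(name, cluster[0]):
            cluster.append(name)
            break
    else:
        clusters.append([name])

for cluster in clusters:
    print(cluster)
```

Even this toy version shows why automation matters: with thousands of long-tail suppliers, each spelled a dozen ways across silos, manual matching quickly becomes the 80% of the project that crowds out analysis.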

Cataloging software like Enigma, Socrata, and Tamr can identify much more of the data relevant for analysis. The success of the question-first approach recommended here of course depends on whether you can actually find the data you need to answer your questions. That’s a formidable challenge for enterprises in the big data era: IDC estimates that 90% of big data is “dark data”—data that has been processed and stored but is hard to find and rarely used for analytics. This is an enormous opportunity for tech companies to build software that quickly and easily locates and inventories all enterprise data relevant for analysis—regardless of type, platform, or source.

Finally, we need to build persistent and reusable data engineering infrastructures that allow businesses to answer questions continuously, even as new data sources are added, and as data changes. A business can do everything right—from starting with the question, to identifying and unifying all available data, to reaching a strong, analytically-fueled answer—and it can still fall short of optimizing its data and analytic investment if it hasn’t built an infrastructure that enables repeatable analytics, preventing the user from having to start from scratch.

Question-First, Data-Second Approach

With the help of a question-first, data-second approach, fueled by cataloging and preparation software, businesses can create a “virtuous analytics cycle” that produces more and better answers faster and continuously (Figure P-1).

Figure P-1. The question-first, data-second approach (image credit: Jason Bailey)

In the question-first, data-second approach, users:

  • Ask the question to be answered and identify the analytics needed to answer it, e.g.,

    • Question: Am I getting the best price for every widget I buy?

    • Analytic: Total spend for each widget supplier across all business units (BUs)

  • Find all relevant data available to answer the question

    • Catalog data for thousands of widget suppliers across dozens of internal divisions/BUs.

    • Enrich with external sources like Dun & Bradstreet.

  • Organize the data for analysis, with speed and accuracy

    • Use data preparation software to automate deduplication across all suppliers and unify schema.

  • Analyze the organized data through a combination of automation and expert guidance

    • Run the unified data through a tool like Tableau—in this case a visual analysis that identifies opportunities to bundle widget spend across BUs.

    • Identify suppliers for negotiation and negotiate potential savings.

  • Answer questions continuously, through infrastructures that are reusable—even as the data changes

    • Run the same analytics for other widget categories, or for the same category as the data and sources change.
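The analytic at the center of this cycle can be made concrete with a toy version, assuming supplier names have already been deduplicated in the preparation step. The records and dollar figures below are invented purely for illustration:

```python
# Toy version of the spend analytic: total spend per widget supplier
# across all business units (BUs). Records are invented.
from collections import defaultdict

spend_records = [
    # (business_unit, supplier, spend_usd)
    ("BU-North", "Acme Widgets", 120_000),
    ("BU-South", "Acme Widgets", 80_000),
    ("BU-North", "Baxter Metals", 45_000),
    ("BU-East", "Acme Widgets", 60_000),
    ("BU-East", "Baxter Metals", 30_000),
]

# Total spend per supplier across all BUs -- the figure no single
# business unit can see on its own.
total_spend = defaultdict(int)
bus_per_supplier = defaultdict(set)
for bu, supplier, spend in spend_records:
    total_spend[supplier] += spend
    bus_per_supplier[supplier].add(bu)

# Suppliers bought from by more than one BU are candidates for
# bundled negotiation.
bundling_candidates = sorted(
    s for s, bus in bus_per_supplier.items() if len(bus) > 1
)

print(dict(total_spend))
print(bundling_candidates)
```

In practice this aggregation would run over the unified, deduplicated dataset produced by the cataloging and preparation steps, and rerunning it as sources change is exactly the repeatability the reusable infrastructure is meant to provide.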

As the Forbes/Teradata survey on “The State Of Big Data Analytics” implies, businesses and analytics providers collectively have a substantial gap to close between being “analytics-invested” and “data-driven.” Following a question-first, data-second approach can help us close this gap.
