Chapter 1. The Need for a Unifying Data Strategy

Imagine yourself as a data strategy consultant, supporting executives with a spectrum of problems across diverse industries. In some cases, deadlines are not being met, and you are brought in to understand why. In other cases, the executives have a vision of how they want to change the world and want your thought partnership on rapidly designing, testing, and building a prototype to present at a global conference. As you work with executives to solve various problems, you begin to see patterns of what works and what doesn’t in the world of data, innovation, and AI, and you begin to wonder why.

Ultimately, your role involves identifying the root causes of innovation bottlenecks and offering actionable recommendations to help organizations overcome these obstacles and achieve their objectives. If there were a set of principles and guidelines to make innovation outcomes more effective and reliable, that would enable you and your clients to be more successful.

A unifying data strategy approaches innovation through the lens of a single question: what is the minimal amount of collaborative effort with data that creates maximum business value? It doesn’t require or recommend any specific technology, but it does require you to think about data from a holistic perspective so that you can unify teams around a common language, understanding, and way of working together.

Your Quest for Data-Driven Breakthroughs Begins

You’ve been hired by John, the CEO of a cutting-edge biopharma company, which has just secured significant venture capital to develop a groundbreaking new therapy that potentially cures a disease that kills millions of people a year.

“The clock is ticking,” says John. “With each passing day, we risk falling behind in the race to develop a life-saving therapy, with billions of dollars in contracts at stake. I’ve promised our investors we are going to be data driven in everything that we do. The livelihoods of hundreds of employees are at risk if we don’t deliver, and, most importantly, millions of people are desperate for a cure.”

John wants an assessment of the company’s most significant data problems and recommendations for a quick and effective solution. The company has an exceptional data team of PhDs in data science and biology from prestigious universities, yet it is struggling with drug discovery. The data team perpetually battles data issues, leaving itself and the R&D teams that depend on it to spend time putting out fires, despite continued investments to increase the team size.

The pressure is palpable. John is scheduled for a presentation in a few months to the organization’s funders and board of directors, and they are expecting to see a plan on how to address the situation and achieve the outcomes they are anticipating. Colossal pharmaceutical companies are scrambling to capitalize on novel technologies for previously untreatable diseases, and billions of dollars in contracts are hinging on the new therapy moving forward to the next stage of clinical trials with a fixed deadline. Every day lost to data problems jeopardizes the financial future of the organization. Hundreds of people’s livelihoods and the families they support are on the line. Everyone in the organization is working 12+ hours a day, knowing that if they succeed, they will be part of the team that changed the world and helped save millions of lives.

You ask John to name the one thing that matters most in defining what success looks like. You call this one thing a North Star, and it will help you assess whether people are focusing their efforts in alignment with the CEO’s vision. John confidently speaks about data-driven decision making to accelerate research and says that machine learning has the potential to save years and millions of dollars in R&D costs. The North Star is stated as: we will have the most advanced data-driven capabilities for drug discovery. Your impression is that data will drive R&D, and you believe in the CEO’s vision.

However, the North Star definition starts shifting as John makes comments about how the organization’s culture is R&D led, and it becomes clear that he has a nebulous understanding of how data science works. John fumbles with his words, clearly uncomfortable that his statements aren’t holding up to much scrutiny. Not a big deal, you think. You reassure yourself and the CEO that together you can make the North Star definition clear and succinct.

There Are Usually Multiple, Conflicting North Stars

While interviewing VPs and their subordinate directors about the North Star, you uncover striking discrepancies in their perspective on the importance of data science in guiding their work. The organization’s culture is indeed R&D led, but the CEO is saying the North Star is to be completely data-driven. The R&D team is focused on running biology experiments and doesn’t have any data management expertise.

The data science and data engineering teams are entirely different parts of the organization, primarily used to support R&D by fixing data problems and handling data requests across the organization. R&D are the experts in their field, not data scientists. What does data driven even mean if R&D are the ones making decisions based on their intuition?

The way the data teams view what problems data science will address and the strategy of how data science will be used to make data-driven decisions in R&D deviate significantly from the CEO’s way of thinking about the North Star. The more people you ask what the North Star is, the more unrecognizable it becomes across departments and levels of the organization. When you question other executives about these disparate views, they dismiss the North Star as some aspirational and unrealistic phrase rather than the operational foundation for their goals and work.

The Good, the Bad, and the Ugly of Data Problems

Digging deeper, you find business leaders making expensive decisions, investing in software and hardware that creates, curates, and disseminates data, only to find data teams saying that the data is mostly worthless because deeper and more significant problems are being ignored. The CEO is completely oblivious to the continuous data corruption plaguing the supposedly data-driven organization, lulled into a false sense of well-being by costly cloud data storage and compute bills. As the saying goes, garbage in, garbage out (GIGO).

A VP privately tells you the biggest problem to solve is that the scientists are all working with data stored in their emails, PowerPoints, Excel spreadsheets, and comma-separated values (CSV) files in SharePoint. No one can see each other’s work or learn from each other. The VP is considering a cloud company’s consulting pitch for an enterprise data lake solution, complete with a knowledge graph, data catalog, and a host of other expensive enterprise tools that will cost millions of dollars over several years as part of a digital transformation project. The VP is told by these trusted experts that the company’s data will be totally under control, and they will be able to get the insights leaders want.

Except for one problem: it almost never works out the way the data solution was sold. This usually has less to do with data and more to do with your organization’s strategy, or rather the lack of an effective unifying data strategy around how people in very different domains and with very different perspectives need to understand each other and work together.

The problem boils down to the process of converting abstract business information to concrete results with the minimal amount of risk of errors. The business team says what they want, expecting a top-down progression as shown in Figure 1-1. If a dev and data team translates this without error, it is a successful data/code implementation that accurately represents a product to serve operational needs.

In the world of data management, the true challenge isn’t technology but the human factor; people operate within their own unique silos, skewing perspectives. Bridging these gaps is crucial. Business leaders often think in top-down, solution-centric terms, prioritizing immediate problems like, “We need technology X for problem Y,” rather than delving into root causes, such as, “Why are our costs in Area Z so high? And how do we prevent the problem from occurring in the first place? And what else is being impacted by the problem?” This focus can solve immediate issues for a single unit but neglects the organization’s overall health.

Conversely, data teams offer a bottom-up view anchored in logistical and technical realities. When projects simply get handed off to a data team for execution once the problem and solution have already been decided, perspective clashes occur, derailing timelines and budgets. The remedy is straightforward yet demanding: align these perspectives before taking action. Clarify ambiguities, bridge knowledge gaps, and root out blind spots. By doing so, you’ll develop a unified roadmap, aligning what the business wants with what it actually needs, and ultimately finding the best solution.

This way of thinking necessitates thinking about the problems of translating between the worlds of business and data as being in two distinct categories:

Top-down problems

Strategies and tools are covered by the methodology in Chapters 4, 6, and 7.

Bottom-up problems

The methodology’s tools and strategies address bottom-up approaches in Chapters 9 and 10.

Additionally, you will learn that what makes JSON Schema exceptionally useful is that it has two core functions: validation, which is well suited for top-down business/data translation problems, and annotation extraction, which is useful for bottom-up translation problem solving. JSON Schema is also human and machine readable, making it the ideal open source technology for your organization to implement a unifying strategy.

Figure 1-1. Unifying is about creating alignment and understanding of concepts as they flow between business and data teams to meet different requirements by minimizing ambiguity, knowledge gaps, and blind spots. While the example in this section describes a top-down direction, the next section, “The Problem with Problems,” explores what a bottom-up approach looks like. The goal of unifying is alignment in both directions.

Typically, these large-scale enterprise projects take years to successfully implement. People leave, and processes are implemented that people either work around or never learn. There is resistance to doing things in new ways and learning new complex software. Taxonomist processes can become bottlenecks, database architectures are debated, business priorities and competitive threats change, and meanwhile, mountains of new, messy data begin to collect at a faster pace than the data management project tasked with taming it can handle. In five years, the new management team that replaced the old one in a reorg goes through it all again with a fresh budget, believing that they will have a better solution because of some new technology paradigm and trend, but they never achieve the Nirvana-like data state everyone craves.

Note

In the context of JSON Schema, validation and annotation extraction serve distinct but complementary roles. Validation is the process of ensuring that a given JSON document adheres to the rules and constraints defined in the schema, such as data types or required properties. This helps in maintaining data integrity and consistency. On the other hand, annotation extraction involves pulling out additional metadata or descriptive information that the schema attaches to the JSON document, such as field descriptions or default values. While these annotations do not impact the validation process, they provide extra context that can be used for generating documentation, tool tips in a user interface, or other supplementary functionalities. Together, validation and annotation extraction contribute to both the robustness and the usability of JSON-based data structures. You will learn more about JSON and JSON Schema from a technical perspective in Chapters 2 and 5, and cover validation and annotations in Chapter 8.
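As a rough illustration of the two roles described in this note, the following Python sketch hand-rolls a tiny subset of JSON Schema behavior using only the standard library. It is not a conforming JSON Schema implementation: it understands only the `type`, `required`, and `properties` keywords, and the schema, the field names, and the helper functions `validate` and `extract_annotations` are hypothetical names chosen for this example. In practice you would likely rely on an established validator library instead.

```python
import json

# Hypothetical schema for an experiment record; the field names are
# assumptions made for this illustration only.
SCHEMA = json.loads("""
{
  "type": "object",
  "required": ["sample_id", "assay"],
  "properties": {
    "sample_id": {
      "type": "string",
      "description": "Unique identifier assigned when the sample is registered."
    },
    "assay": {
      "type": "string",
      "description": "Name of the experiment that was run on the sample."
    },
    "replicates": {
      "type": "integer",
      "description": "Number of times the assay was repeated.",
      "default": 1
    }
  }
}
""")

# Maps the JSON Schema type names used above to Python types.
PYTHON_TYPES = {"object": dict, "string": str, "integer": int}


def validate(instance, schema):
    """Return a list of error messages; an empty list means the instance passed.

    Toy subset only: checks "type", "required", and "properties".
    """
    errors = []
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(instance, PYTHON_TYPES[expected])
        # bool is a subclass of int in Python, but JSON booleans are not integers.
        if expected == "integer" and isinstance(instance, bool):
            ok = False
        if not ok:
            return [f"expected {expected}, got {type(instance).__name__}"]
    if isinstance(instance, dict):
        for name in schema.get("required", []):
            if name not in instance:
                errors.append(f"missing required property: {name}")
        for name, subschema in schema.get("properties", {}).items():
            if name in instance:
                errors.extend(f"{name}: {e}" for e in validate(instance[name], subschema))
    return errors


def extract_annotations(schema):
    """Collect the descriptive keywords the schema attaches to each property."""
    return {
        name: {k: sub[k] for k in ("description", "default") if k in sub}
        for name, sub in schema.get("properties", {}).items()
    }


good = {"sample_id": "S-001", "assay": "binding", "replicates": 3}
bad = {"sample_id": 42}

print(validate(good, SCHEMA))   # no errors expected
print(validate(bad, SCHEMA))    # one missing property, one wrong type
print(extract_annotations(SCHEMA)["replicates"])
```

Validation answers whether a record can be trusted, while the extracted annotations carry the plain-language meaning of each field. Note that the annotations are read from the schema without affecting whether a record passes.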

If teams are not collaborating well, and if leaders and employees are not aligned in their data strategy, why would implementing an extremely complicated enterprise solution go smoothly, quickly, or successfully? That’s what unifying is about—hence, the title of this book. It is a data strategy that focuses on the most effective and simplest way to begin data-centric projects: getting people aligned before hitting the gas pedal.

Tip

Going faster in the wrong direction isn’t progress. Your teams need to know where they are going, why their efforts are important to the goal, and how they can work well together.

The Problem with Problems

You are excited that your work can help save lives, you are inspired by people’s passion, and you believe in the company’s understanding of the value of data. You hear the stakeholders’ accounts of what the biggest problems are and begin pouring your creativity and ingenuity into searching for a solution. You conduct interviews, create road maps, and begin building a prototype, thinking you’ve created a truly amazing thing that the company will celebrate.

Except you find out after months of building your alpha version that stakeholders didn’t tell you about some other problem that only surfaced when lower-level employees (who weren’t part of your interviews) started using the application, and it totally invalidates the approach of your solution. You suddenly have to rethink everything from scratch, and all of your previous work was wasted. Welcome to innovation.

Agile is a popular methodology emphasizing flexible and iterative approaches to product development and project management. Being Agile means getting feedback as quickly as possible about what fails and why, which is the most important feedback you can get. By identifying less successful ideas more rapidly, you can conserve time and effort, thereby accelerating the discovery of effective solutions.

In order to get feedback as quickly as possible, you decide that instead of building a new coded prototype, you are going to design a paper prototype, drawing out the solution on a piece of large paper with a marker and asking people to click on the paper buttons, moving to different pieces of paper to represent different screens. Everyone loves it, and you feel like a hero. Hooray! You build another prototype now that you have something validated—you’ve succeeded!

Then, as you are going through the final stages of validation, a stakeholder suggests that you present your solution to the medical advisory board. You are informed of a legal requirement around a commonly used word which totally invalidates a major set of features in your previous work that you spent a ton of time testing.

No amount of effort at being Agile will give you what you need if you are focused on the wrong problem—or don’t even know what problem you are trying to solve.

This is what the problem with problems is: How do you know which problems are the right problems to solve? This is especially true when dealing with organizations that have siloed perspectives and nonholistic interests. Leaders have budgets, head counts, and reputations to protect. Their problems are the most important to them.

A critical error that organizational leaders often make is not knowing what type of problem they are trying to solve. If they attempt to solve a problem in a top-down manner, diving into solutions without truly grasping the problem, the efforts can be misguided. Rushing into solutions without proper comprehension—and merely being Agile and iterating quickly—does little if anything to guarantee success.

Figure 1-2 shows where unifying can create alignment in problem solving. The depth of understanding and choosing the “right” problems and the right “way” to solve them makes all the difference. Determining which problems are the right ones involves understanding these problems in the context of bridging the worlds of conceptual (business) and technical (data) language and their operational outcomes (symptoms).

Figure 1-2. Unifying is about giving data champions the capability to understand problems and translating them between business and data team perspectives, whether they require top-down problem-solving approaches or bottom-up problem-solving ones. The phrase “the problem with problems” serves as a reminder of this principle, highlighting the pitfalls of a hasty approach and the benefits of a thorough understanding.

Effective problem solving starts with deep understanding. This involves recognizing how problems are connected across organizational networks and across conceptual, technical, or operational realms. Leaders need to understand the problem with problems, because if they try to address issues from a top-down approach when what is needed is a bottom-up approach, they are just reacting to symptoms and not addressing the root causes.

Tip

Know what type of problem you are solving, concrete or abstract, before making a decision about how to solve it.

If an organization operates in an Agile way to accelerate development before understanding the problem they are working on solving, what to prioritize, and why, then teams are building the wrong things faster. The problem with problems is the foundational problem to solve. Let’s examine what you can do to tackle it.

Unifying Concepts: The Key to Innovation

Concept-first design asks people to explain in plain language the business logic they use to achieve their goals, the problems they have, and with whom and how they collaborate. That business logic is translated into a simple pseudocode structure—simple enough for anyone to read, but structured enough that it can be used as a rough guideline for building systems. In short, before getting into designing or building, you ask people to describe what key concepts are important to the tasks they accomplish at work and why.
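One way to picture such a pseudocode structure is sketched below in Python. The `Concept` shape, its fields, the `describe` helper, and the example values are all hypothetical, invented for this illustration rather than taken from the methodology. The point is only that a plain-language description can be captured in a form that both a business reader and a developer can follow.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A plain-language description of one business concept (hypothetical shape)."""
    name: str
    meaning: str                                   # what people mean by the word
    used_by: list = field(default_factory=list)    # teams that rely on the concept
    purpose: str = ""                              # why the concept matters
    rules: list = field(default_factory=list)      # business logic, in plain language


def describe(concept):
    """Render the concept as text that a nontechnical reader can review."""
    lines = [
        f"Concept: {concept.name}",
        f"  Means: {concept.meaning}",
        f"  Used by: {', '.join(concept.used_by)}",
        f"  Why: {concept.purpose}",
    ]
    lines += [f"  Rule: {rule}" for rule in concept.rules]
    return "\n".join(lines)


# Illustrative values only; not drawn from a real organization.
sample = Concept(
    name="Sample",
    meaning="A physical specimen registered for an experiment",
    used_by=["R&D", "Data science"],
    purpose="Links experimental results back to their source material",
    rules=["Every sample has exactly one identifier"],
)

print(describe(sample))
```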

Vital knowledge often exists in people’s heads without a shared map to help align understanding and decision making. The only way to see how differences in understanding of concepts and language may put effective collaboration at risk is to take the fuzzy, implicit map in people’s heads, which teams believe they are aligned on, and turn it into a focused, external conceptual map.

This process involves assessing and comparing three key aspects of how information is managed and utilized in your organization:

  1. The purpose and design of operational concepts used in business processes

  2. Data structures and how they represent concepts utilized in business processes

  3. Methods of communication to gain a comprehensive understanding of how concepts are conveyed and decisions are made

By integrating these three aspects into a single map, you can visualize the connections between people, problems, objectives, and outcomes. This map helps to identify and fill knowledge gaps at an early stage. Without this map, individuals may be making costly and significant decisions without a shared understanding of their team’s current situation, their intended destination, or the strategies to get there. Making decisions to build things without a cohesive and comprehensive map is an easy way to fall into the trap of building the wrong thing faster.
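A minimal sketch of how such a map can expose gaps, written in Python with invented data: each concept is listed with the business process it serves, the data structure that represents it, and the channel through which it is communicated. The concept names, the field names, and the `find_gaps` helper are assumptions made for illustration. Any aspect left empty is a knowledge gap to resolve before building.

```python
# The three aspects compared for each concept, matching the list above.
ASPECTS = ("business_process", "data_structure", "communication")

# Hypothetical conceptual map; None marks something nobody could answer yet.
concept_map = {
    "sample": {
        "business_process": "assay planning",
        "data_structure": "samples table",
        "communication": "weekly R&D review",
    },
    "assay_result": {
        "business_process": "candidate selection",
        "data_structure": None,   # for example, results kept in emailed spreadsheets
        "communication": "email",
    },
    "candidate": {
        "business_process": "candidate selection",
        "data_structure": None,
        "communication": None,
    },
}


def find_gaps(mapping):
    """Return {concept: [missing aspects]} for concepts with any unfilled aspect."""
    gaps = {}
    for concept, aspects in mapping.items():
        missing = [a for a in ASPECTS if not aspects.get(a)]
        if missing:
            gaps[concept] = missing
    return gaps


for concept, missing in find_gaps(concept_map).items():
    print(f"{concept}: missing {', '.join(missing)}")
```

Even a map this small makes it visible that two concepts feed the same business process without any agreed data structure behind them.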

Creating a unified structural map of logic, goals, problems, and success metrics before designing, building, and testing software can save significant time and money, making your software development more efficient and cost-effective.

Tip

Remember, the goal is to find faults as soon as possible. Staying at the conceptual level will enable you to move faster.

Your focus is on capturing and defining the fundamental ideas, business logic, and objectives that underpin the system or application being developed. The aim is to ensure that all stakeholders are unified in their shared understanding of the core concepts and that the design is aligned with the intended purpose and desired outcomes.

The benefits of concept-first design are:

Conceptual clarity

Key concepts, ideas, and principles are defined, including business logic, goals, and the problems that the system is intended to address.

Early alignment

Defining and clarifying concepts early in the development process prevents miscommunications that could have required costly reworking later in the development process.

Holistic perspective

Emphasis is placed on how information connects, flows, and is used outside of human operational silos.

User-centric focus

A strong emphasis is placed on understanding the needs and goals of end users. The design process is centered around user experiences to create solutions that are intuitive, effective, and satisfying for users.

Tip

Create well-thought-out and purposeful solutions by starting with a clear understanding of the fundamental concepts and goals that drive the system or application. Start testing the concepts you and other stakeholders believe in, getting feedback in an Agile and iterative way. The more complex projects are, the more a deep understanding of the underlying concepts is critical to the success of the project.

When one of the authors of this book, Ron Itelman, stumbled upon concept-first design and successfully used it, the results were shocking. Because all of the points of conceptual conflict and alignment were identified first, leaders were able to get teams on board with a single operational model. Nothing was designed or built until teams across silos could agree to what concepts meant, how they flowed, and what business logic they supported. The focus was simply on one question: can we agree to a set of concepts, how they are used, who uses them, and why? This is different from waterfall approaches, where everything is meticulously planned out to be built on a schedule.

Building and implementing the new system went smoothly; there weren’t any friction points, and it was designed, built, and tested in three months. The system was so efficient at creating high volumes of high-quality, rich, and meaningful data that the private equity firm that bought the company paid a premium for the data alone. That experience led Ron to further research and develop an innovation strategy based on the foundation of unifying concepts across organizational networks.

What a Unifying Data Strategy Will Do for Agile

The root cause of any problem that a company faces is not technology. It is the problem with problems—that leaders and teams are not unified in their language, understanding, or efforts to prioritize and solve the problems that prevent them from achieving goals that drive revenue, reduce costs, and create value.

Note

Historically, Agile was a set of principles resulting from frustrations with the highly structured way that software development contracts were written; there was no room to deviate from the agreed-upon work. This was nearly impossible, because as software development work began, unexpected problems, needs, and perspectives emerged. If the developers only focused on the requirements, they would be delivering something that required going through lengthy contract negotiations. The developers and the stakeholders were often separated. The original Agile principles were about creating communication, iterating, and adapting quickly in order to learn what works and what doesn’t, and delivering something of value in a modular versus monolithic way.

Many other books, strategies, and frameworks have been created to formalize Agile in their own unique ways, which this book does not cover. This book is the result of years of research into the top pain points of collaborating with data, and it proposes a data management strategy that aligns with the original Agile principles.

Traditionally, teams work in their functional areas, but data is holistic, belonging to the entire organization. This is a primary reason why data teams often struggle with the traditional ways that Agile strategies are implemented in organizations. A unifying data strategy enables zooming out to identify and solve company-wide challenges of becoming data-centric in addition to zooming in to operate like a traditional Agile team, using a data-centric lens to drive business value.

Defining Being Agile

No mechanism in nature or technology is more pervasive than the mechanism of feedback.

Bernard Friedland (Control System Design, Prentice Hall, 2005)

If you ask 1,000 people what Agile is or how it works, you will get 1,000 perspectives, ranging from loose interpretations such as “Agile means you don’t need requirements and have a quick meeting every day for 15 minutes to talk about where you are blocked” and “Agile just means figuring out what works and throwing out what doesn’t,” to highly structured ones like “you need to measure everything in story points and measure velocity in spreadsheets, analyzing productivity every two weeks” and “you need to get the team certified and have dedicated Agile management experts.”

For the purposes of this book, we will simplify what Agile means, and what an Agile data strategy means. This book defines the three primary ways of being Agile as follows:

Remove ambiguity

Deeply understand problems, goals, and people, and identify what you know and don’t know. This knowledge usually comes from conversations with customers, stakeholders, your competitors’ customers, and UX research.

Rapidly iterate

Test what you can to validate or invalidate assumptions. This can be A/B testing or prototyping, whether with Styrofoam or with code. The goal of iterating is to get feedback, which might be qualitative (surveys and conversations that help explain where data cannot) or quantitative (numerical data from observations), as quickly and directly as possible to remove ambiguity.

Adapt attention to value

Aim to make progress rather than spending too much time setting priorities and getting stuck in debates. Adaptive attention means removing distractions and being willing to shift focus. Attention should always be aligned with what actions will yield the most business value and best results.

Agile Theater

If you do not change direction, you may end up where you are heading.

Lao Tzu, Chinese Taoist philosopher, 5th century BC

When interviewing product managers, engineers, designers, and managers on challenges around being Agile, the conversations almost universally revolve around measuring velocity, missing deadlines, and shifting requirements as new information is gained. Thinking of velocity as productivity can be a trap; building the wrong things faster doesn’t equal success. Productivity for the sake of productivity doesn’t create value. Having stand-ups for the sake of stand-ups isn’t progress. If your Agile teams aren’t removing ambiguity, rapidly iterating, and adapting attention to what drives business impact and value, then your Agile processes are at risk of being mostly ritualistic theater.

In organizations, responsibility is distributed among various stakeholders. Agile stakeholders collaborate to ensure products, processes, and services are not only reliable and functional, but that they also generate sustainable value. However, measuring success solely through velocity points—an Agile metric that quantifies the amount of work a team can tackle during a single sprint—can cause unintended consequences. This approach can incentivize engineers to manipulate how they assign and complete points, leading to a defensive, self-protective culture known as cover your ass (CYA) that does not create business value.

Overemphasis on velocity metrics can encourage individuals to exploit the system, focusing more on inflating their stats—such as how many story points they’ve completed—rather than on genuine innovation. But the true goal of innovation isn’t about completing the most story points. It’s about creating business value, positively impacting colleagues, and serving customers effectively.

Agile, Waterfall, and Unifying

Unifying sets itself apart. Rather than thinking of it as just another methodology, envision unifying as the “tuning fork” of project management. By introducing it before design or construction begins, it ensures that whether you opt for Agile or the waterfall methodology, your approach is fine-tuned for utmost efficiency and alignment. Think of it as continually adjusting a musical instrument to hit the right notes; unifying continuously calibrates direction and purpose to ensure harmony in execution, as shown in Table 1-1.

Unifying aligns at the conceptual level to minimize the costly risks of misunderstandings and mistakes. It forefronts collaboration and innovation, acting as a tool to determine the optimal methodology—be it Agile or waterfall—ensuring the highest chances of project success.

Table 1-1. The benefits of Agile and waterfall methodologies, and how unifying complements them
Waterfall | Agile | With unifying
Defined milestones and measures of progress from a macro perspective. | Flexible to changes and improving goal setting at the micro perspective. | Continuously calibrates perspectives, with a focus on quantifiably knowing the right direction before taking action.
Segments of delivery, such as factory or supply-chain processes, can be optimized independently. | Faster delivery, incremental development, and regular releases. | Focuses on removing and reducing unnecessary effort by minimizing ambiguity, knowledge gaps, and blind spots.
Allows segments to move autonomously, creating independence and integrity for individual unit responsibilities. | Getting regular feedback by engaging stakeholders and end users throughout the development process, Agile ensures that the product evolves based on actual user needs. | Serves as a translator between technical teams and business stakeholders using JSON Schema, ensuring a unified vision across both parties, thereby preventing costly misalignments.

Defining a Unifying Data Strategy Approach

For analytics, data science, and machine learning to enable enhanced decision making, they require high-quality data that represents business situations and outcomes accurately. Too often, data scientists are expected to comb through mountains of poor-quality data and miraculously extract insights when they should have been involved in the decision-making processes that produced the data.

Business leaders don’t know what they don’t know and are reluctant to take time that they don’t have to understand data science concepts. Meanwhile, data scientists frequently lack an understanding of the collaboration and business dynamics of finance, design, product management, and software development, isolated as they are in enigmatic realms of mathematical jargon and data tools.

In this book we will explore the innovation challenges faced by organizations and the importance of embracing a data-centric innovation strategy. The lack of a shared understanding of goals, and the communication gaps between business leaders and data scientists, contribute to difficulties and may lead to investments in inadequate technology.

Overcoming obstacles around data requires fostering a data-centric culture that emphasizes collaboration, gathering high-quality data, and the strategic involvement of data and data science teams. Understanding the problems around data helps organizations transform into truly data-driven powerhouses, ensuring that they can unlock the full potential of their data and achieve their desired outcomes.

A unifying data strategy approach uses the three characteristics of being Agile—removing ambiguity, rapidly iterating, and adapting attention to value—and adds a few key factors:

Holistic connected perspective

For data to be valuable, it needs to be connected across the organization. Data doesn’t belong to any one team; people come and go, teams are reorganized and shifted, but the data belongs to the organization. A unifying data strategy maps together the various functional business units, the data they collect, and why the data is important.

Information flows

For data to be usable, it must be able to flow and be combined and transformed. Understanding how data flows between teams with differing language, mental models, and problems is critical.

Minimal viable data (MVD)

Data-driven decision making requires the right quality and quantity of data. The minimal viable data (MVD) is the smallest possible amount of high-quality data necessary to make reliable predictions and drive value. Simply having a large volume of data isn’t sufficient if the quality is poor: garbage in, garbage out (GIGO).
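One way to picture the first two factors together is as a small data map that records which unit collects each dataset, why it matters, and who consumes it. The sketch below is illustrative only: the `DatasetEntry` structure, the dataset names, and the unit names are assumptions made for this example, not a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One row of a hypothetical organizational data map."""
    name: str
    collected_by: str                 # business unit that produces the data
    purpose: str                      # why the data is important
    consumers: list[str] = field(default_factory=list)  # units that use it


# Hypothetical entries for a small ecommerce business.
data_map = [
    DatasetEntry("orders", "sales",
                 "track what sold, when, and at what price",
                 ["finance", "marketing"]),
    DatasetEntry("campaign_spend", "marketing",
                 "measure the return on each channel",
                 ["finance"]),
    DatasetEntry("support_tickets", "customer service",
                 "spot product and delivery problems"),
]


def flows(entries):
    """Return (producer, consumer, dataset) triples describing information flows."""
    return [(e.collected_by, c, e.name) for e in entries for c in e.consumers]


def unconnected(entries):
    """Return the names of datasets that no other unit consumes yet."""
    return [e.name for e in entries if not e.consumers]


for producer, consumer, dataset in flows(data_map):
    print(f"{producer} -> {consumer}: {dataset}")
print("Not yet connected:", unconnected(data_map))
```

Even a map this small makes gaps visible: a dataset with no consumers is either not yet connected to the rest of the organization or not worth collecting.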

Understanding the Phrase Being Data Driven

Almost all organizations say they want to be data driven in order to maximize efficiency and optimize decision making. But what does this actually entail?

Efficiency is the ability to achieve a desired outcome with minimal waste of time, resources, or effort. When an organization is aligned, it naturally streamlines processes and eliminates redundant efforts. Teams can work seamlessly, reducing the risk of miscommunications, mistakes, and misunderstandings. This results in faster project completion times and more efficient use of resources.

Effectiveness is the ability to achieve desired outcomes with superior degrees of accuracy or quality. Alignment can improve effectiveness by ensuring that everyone is harmoniously working toward goals and objectives in a focused way.

An organization is data driven when it masters producing and analyzing data, enabling decision makers to learn faster, make higher-quality predictions, and identify problems and opportunities sooner. In other words, decision makers still make the decisions and balance information with experience, but they do so with the highest-quality data available to them.

This is the opposite of HIPPO, which stands for highest paid person’s opinion. HIPPO decisions are made when someone’s opinion is prioritized because of their seniority rather than because data has been collected to evaluate whether opinions and assumptions are valid. Because people are reluctant to challenge the HIPPO, these decisions tend to be tested only after time and money have been spent and things have gone wrong. Worse, poor HIPPO decisions often end up being blamed on the teams who were expected to defer to senior employees.

To Be Data Driven, Be Data Centric

Paradoxically, data is the most undervalued and deglamorized aspect of AI…We define, identify, and present empirical evidence on data cascades—compounding events causing negative, downstream effects from data issues—triggered by conventional AI/ML practices that undervalue data quality. Data cascades are pervasive (92% prevalence), invisible, delayed, but often avoidable.

Google Research, “‘Everyone Wants to Do the Model Work, Not the Data Work’: Data Cascades in High-Stakes AI,” 2021.

Higher-quality data can contribute far more value than higher volumes of lower-quality data. Good data is usually better than big data because its quality, relevance, and accuracy minimize irrelevant, incorrect, or misleading information, and it requires fewer resources to process.

Good data makes it easier to communicate insights to stakeholders and facilitate data-driven decision making, reducing risks and creating true value that is reusable across the organization. When an organization values data as a critical resource and invests in the continuous production and maintenance of high-quality, feature-rich data in order to maximize operational efficiencies and effectiveness, then the organization can say that it is data driven.

Certainly, using poor-quality data and asking data teams to find value without including them in business decisions and innovation efforts is not being data driven. Unfortunately, that is the norm for many teams, and many organizations operate under the false assumption that they are data driven when they are not.

Here is a real-world example. A multibillion-dollar global company invested a significant amount of money in its Salesforce system. Sales teams were required to log potential customer interest in a service. The analytics showed that sales teams had a 30% closing rate, meaning that roughly one out of three conversations resulted in a sale. However, upon further examination, that number was revealed to be completely meaningless!

Salespeople don’t want to document every sale they lose, so some entered data only when they were confident they could make a sale, and some simply avoided the task entirely. When sales were won or lost, few if any salespeople logged the reason, other than leaving a generic checkbox on its default selection. From a business leadership perspective, the organization had invested heavily in Salesforce and had a lot of data. But after several years of collecting it, when leaders tried to predict who would buy which service, so they could decide whom to focus on or what products to develop, they ended up with nothing usable from their hefty investment.
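A little arithmetic shows why a closing rate computed from selectively logged data says so little. The figures below are hypothetical, chosen only to reproduce a 30% reported rate; they are not taken from the company described above.

```python
def closing_rate(won: int, total: int) -> float:
    """Fraction of opportunities that ended in a sale."""
    if total <= 0:
        raise ValueError("total must be positive")
    return won / total


# Hypothetical figures for one sales team over a quarter.
actual_conversations = 100   # every conversation that really took place
actual_wins = 12             # sales that were actually closed

# Assumption: only opportunities the salesperson felt confident about
# were entered, while wins were always logged.
logged_conversations = 40
logged_wins = 12

reported_rate = closing_rate(logged_wins, logged_conversations)
true_rate = closing_rate(actual_wins, actual_conversations)

print(f"Reported closing rate: {reported_rate:.0%}")
print(f"True closing rate:     {true_rate:.0%}")
```

Under these assumptions the dashboard reports 30% while the true rate is 12%. The gap comes entirely from which conversations were recorded, which is why no amount of downstream analysis can correct it.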

In order to be data driven, companies must start by being data centric: examining everything in their operational business processes to make sure that data really does live at the heart of their activities. In the previous example, this would entail having a designer optimize the Salesforce UI for intuitiveness and ensure that information could be leveraged effectively. To make sure that data capture is accurate, managers need to be trained in how to incentivize teams to capture data and how to get immediate value from accurate data.

Tip

Data must always be thought of as the center of the universe.

Here’s a checklist you can use to evaluate how data centric your organization is:

  • How is data collection from employees managed?

  • Is there someone responsible for owning data quality and a data-centric strategy?

  • What investments have been made into training managers and teams about data?

  • Are experiments and innovation efforts funded in order to learn how to create value in exchange for data collection?

  • How granular is the data collected, and how is data quality measured?

Bottlenecks Preventing Teams from Being Data Driven

100% of customers are people. 100% of employees are people. If you don’t understand people, you don’t understand business.

Simon Sinek, Start with Why (Portfolio, 2009)

Over the years, the authors of this book have researched the top pain points and bottlenecks that data teams face. The hard truth is that business leaders usually have little understanding of the complexities of managing data, little desire to be put in situations where they do not feel like experts, and little appreciation for technology. What leaders care about is increasing profits, reducing costs, and growing as fast as possible. After that, they care about reducing risks. If you are lucky, an organization may have a philanthropic or employee development culture.

Business leaders want their reports, dashboards, and insights, but they have no idea that the person they are asking to generate a report often has to dig through data swamps to find a thousand CSV files, not knowing who created them or why, or even what the table headers mean.

Let’s imagine a business intelligence analyst gets instructions to generate a quarterly sales projection report. This report allows the finance team to make sure the company is operationally healthy, can pay salaries and other expenses, and can prepare reports to the SEC. Inaccurate projections can have legal consequences and huge impacts on the share price.

First, the business intelligence analyst must meet with stakeholders, who almost never know what data exists or how it is generated. The analyst gets access to sensitive financial data stored in CSV, Excel, JSON, and other formats, and digs through datasets with commonly used terms such as Revenue, unaware that different teams may calculate revenue in slightly different ways. For example, a sales team might define revenue as the total sale to a client, while an accountant might define it as the sale minus the salesperson’s commission.
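The two definitions of revenue can be written down side by side to show how far apart they drift on the same records. This is a minimal sketch: the `Sale` structure, the client names, and the amounts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sale:
    client: str
    amount: float           # total billed to the client
    commission_rate: float  # salesperson's share, e.g. 0.10 for 10%


def revenue_sales_view(sales):
    """Revenue as the sales team might define it: the total sale to the client."""
    return sum(s.amount for s in sales)


def revenue_accounting_view(sales):
    """Revenue as an accountant might define it: the sale minus commission."""
    return sum(s.amount * (1 - s.commission_rate) for s in sales)


# Hypothetical records; both teams would label the result "Revenue".
sales = [
    Sale("Acme", 10_000.0, 0.10),
    Sale("Globex", 5_000.0, 0.20),
]

sales_total = revenue_sales_view(sales)
accounting_total = revenue_accounting_view(sales)

print(f"Sales team revenue: {sales_total:,.2f}")
print(f"Accounting revenue: {accounting_total:,.2f}")
print(f"Difference:         {sales_total - accounting_total:,.2f}")
```

With these two records the figures differ by 2,000 on a total of 15,000, even though both columns would carry the same header in a spreadsheet.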

After digging through the data swamp, the analyst needs to set up more meetings to verify these meanings and ask which dataset is the source of truth. If the analyst is unlucky, they will not uncover the different meanings—only for the finance team to discover months later that something has gone wrong and that important and expensive financial decisions were made on the basis of the analyst’s faulty report. The finance team is now blaming the analyst. Welcome to the experience of people in data.

This Book’s Project: Intelligence.AI Coffee Beans

To demonstrate the application of a unifying data strategy, we are going to work with a sample problem and datasets and explore strategies for working effectively with data from a technical and business perspective. Intelligence.AI is the company the authors of this book founded. We will present it as a fictitious company selling premium coffee beans from around the world with a humorous flair and inspiring artwork on the coffee bags, as shown in Figure 1-3. The company is small, but it is building an online presence and wants to be data driven.

You are the CEO of Intelligence.AI. You need to decide which marketing channels are most effective in driving sales and acquiring new customers. Your marketing lead has allocated budgets across social media, email campaigns, and in-store promotions, but lacks insights into the return on investment (ROI) for each channel. You want to understand the factors influencing customer acquisition in order to optimize your budget spend and to make informed decisions on inventory management and pricing. To address these challenges, Intelligence.AI decides to adopt a unifying data strategy approach to analyze the available data and provide actionable insights.

Figure 1-3. Products from the fictitious Intelligence.AI virtual store. This store will provide examples of applying a unifying data strategy to a small ecommerce startup, including inventory management, pricing, customer acquisition, design, and copywriting. Images prompted using Midjourney (5/11/2023). Left image: “a teddy bear passionately singing into a microphone hyperrealistic precision 8K, and 8K hyperrealistic.” Right image: “a dog and cat cuddling together cute, gorgeous, loveable.”

The datasets at your disposal encompass sales, marketing, and customer service. You plan to A/B test coffee bag designs using annotated labels to describe the concepts featured in the designs (e.g., teddy bear, dog, cat). One effective way to work with data is by using JSON, one of the most widely used data formats today. JSON is a universal language that is both easy to read and incredibly powerful for data-driven applications. You will learn more about JSON in Chapter 2, and throughout the book you’ll be unifying your coffee business’s datasets.
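As a first taste, the annotated design labels might be captured in JSON along these lines. The field names (`design_id`, `variant`, `labels`) and the identifiers are assumptions made for illustration; the sketch uses Python's standard `json` module to read the records and group designs by label.

```python
import json

# Hypothetical annotation records for two coffee bag designs in an A/B test.
design_records_json = """
[
  {"design_id": "bag-001", "variant": "A", "labels": ["teddy bear", "microphone"]},
  {"design_id": "bag-002", "variant": "B", "labels": ["dog", "cat"]}
]
"""

designs = json.loads(design_records_json)

# Index the designs by label so a concept can be looked up directly.
designs_by_label = {}
for design in designs:
    for label in design["labels"]:
        designs_by_label.setdefault(label, []).append(design["design_id"])

print(json.dumps(designs_by_label, indent=2, sort_keys=True))
```

Because the labels are plain text in a shared structure, the same records can be read by a designer, a marketer, and a script without translation.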

Summary

Across most organizations, there are conflicting perspectives on the North Star, or guiding principle, especially regarding being truly data centric. While leaders may emphasize a data-driven approach, data problems abound, often resulting in costly investments that don’t address the fundamental problems: how to think of data from a holistic perspective and how to identify the right problems to solve. Being data driven requires a real commitment to creating, curating, and disseminating high-quality data and supporting a data-centric culture.

Concept-first design involves translating business logic into simple pseudocode structures, which clarifies key concepts and aligns stakeholders, preventing miscommunications and costly rework.

A unifying data strategy enables you to quickly identify, address, and learn from failures. To implement a unifying data strategy, focus on removing ambiguity, rapidly iterating, and adapting attention to value creation. The costs of not having a unifying data strategy are bottlenecks, data swamps, and inconsistent use of language, all of which hinder data-driven decision making.

In Chapter 2, we will delve into the world of JSON, a popular data format that is easy to read and powerful for data-driven applications. Understanding JSON is crucial for implementing a unifying data strategy, as it provides a universal language for structuring and exchanging data. In Chapters 3 and 4, we’ll explore how to connect your unifying data strategy to your code.
