Chapter 1. Big Data

The number of companies building data architectures has exploded in the 2020s. That growth is unlikely to slow down anytime soon, in large part because more data is available than ever before: from social media, Internet of Things (IoT) devices, homegrown applications, and third-party software, to name just a few sources. According to a 2023 BCG study, “the volume of data generated approximately doubled from 2018 to 2021 to about 84 ZB, a rate of growth that is expected to continue.” The researchers “estimate that the volume of data generated will rise at a compound annual growth rate (CAGR) of 21% from 2021 to 2024, reaching 149 ZB.” Companies know that they can save millions of dollars and increase revenue by gathering this data and using it to analyze the past and present and make predictions about the future—but to do that, they need a way to store all that data.

Throughout the business world, the rush is on to build data architectures as quickly as possible. Those architectures need to be ready to handle any future data—no matter its size, speed, or type—and to maintain its accuracy. And those of us who work with data architectures need a clear understanding of how they work and what the options are. That’s where this book comes in. I have seen firsthand the result of not properly understanding data architecture concepts. One company I know of built a data architecture at a cost of $100 million over two years, only to discover that the architecture used the wrong technology, was too difficult to use, and was not flexible enough to handle certain types of data. It had to be scrapped and rebuilt from scratch. Don’t let this happen to you!

It’s all about getting the right information to the right people at the right time in the right format. To do that, you need a data architecture to ingest, store, transform, and model the data (big data processing) so it can be accurately and easily used. You need an architecture that allows any end user, even one with very little technical knowledge, to analyze the data and generate reports and dashboards, instead of relying on people in IT with deep technical knowledge to do it for them.

This chapter begins by introducing big data and some of its fundamental ideas. I then discuss how companies are using their data, with an emphasis on business intelligence and how this usage grows as a company’s data architecture matures.

What Is Big Data, and How Can It Help You?

Even though the term big is used in big data, it’s not just about the size of the data. It’s also about all the data, big or small, within your company and all the data outside your company that would be helpful to you. The data can be in any format and can be collected with any degree of regularity. So the best way to define big data is to think of it as all data, no matter its size (volume), speed (velocity), or type (variety). In addition to those criteria, there are three more factors you can use to describe data: veracity, variability, and value. Together, they’re commonly known as the “six Vs” of big data, as shown in Figure 1-1.

Figure 1-1. The six Vs of big data (source: The Cloud Data Lake by Rukmani Gopalan [O’Reilly, 2023]).

Let’s take a closer look at each one:

Volume

Volume is the sheer amount of data generated and stored. This can be anywhere from terabytes to petabytes of data, and it can come from a wide range of sources including social media, ecommerce transactions, scientific experiments, sensor data from IoT devices, and much more. For example, data from an order entry system might amount to a couple of terabytes a day, while IoT devices can stream millions of events per minute and generate hundreds of terabytes of data a day.

Variety

Variety refers to the wide range of data sources and formats. These can be further broken down into structured data (from relational databases), semi-structured data (such as logs and CSV, XML, and JSON formats), unstructured data (like emails, documents, and PDFs), and binary data (images, audio, video). For example, data from an order entry system would be structured data because it comes from a relational database, while data from an IoT device would likely be in JSON format.
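
To make the contrast concrete, here is a minimal Python sketch that handles both kinds of records: a structured CSV export from an order entry system and a semi-structured JSON event from an IoT device. The field names and values are made up for illustration:

```python
import csv
import io
import json

# Structured data: rows exported from a relational order entry system.
orders_csv = "order_id,customer_id,amount\n1001,42,19.99\n1002,7,250.00\n"
orders = list(csv.DictReader(io.StringIO(orders_csv)))

# Semi-structured data: a JSON event from an IoT device. Fields can vary
# from event to event, so read optional ones defensively with .get().
iot_event = json.loads('{"device_id": "sensor-9", "temp_c": 21.5}')
humidity = iot_event.get("humidity")  # may be absent on some devices

print(orders[0]["amount"], iot_event["temp_c"], humidity)
```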

Velocity

Velocity refers to the speed at which data is generated and processed. Collecting data infrequently is often called batch processing; for example, each night the orders for the day are collected and processed. Data can also be collected very frequently or even in real time, especially if it’s generated at a high velocity such as data from social media, IoT devices, and mobile applications.
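
As a rough sketch of the difference, batch processing aggregates an accumulated set of records on a schedule, while stream processing updates results as each event arrives. The data and functions below are invented for illustration:

```python
from datetime import date

# Batch: once a night, aggregate all of the day's accumulated orders at once.
def nightly_batch(orders):
    return sum(order["amount"] for order in orders)

# Streaming: update the result the moment each event arrives.
def on_event(event, running_total):
    return running_total + event["amount"]

orders = [{"amount": 19.99}, {"amount": 250.00}]
print(date.today(), "batch total:", nightly_batch(orders))

total = 0.0
for event in orders:  # in production, events would arrive from a message broker
    total = on_event(event, total)
print("streaming running total:", total)
```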

Veracity

Veracity is about the accuracy and reliability of data. Big data comes from a huge variety of sources, and unreliable or incomplete sources can damage the quality of the data. For example, suppose an IoT device, such as an outdoor security camera pointed at your driveway, sends you a text message whenever it detects a person. Environmental factors, such as bad weather, can cause the camera to falsely detect a person, corrupting the data. Thus, the data needs to be validated when it is received.
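
That validation can be as simple as rejecting events with missing or implausible fields before they land in your data store. Here is a minimal sketch; the event schema and confidence threshold are hypothetical:

```python
def is_valid(event):
    """Reject IoT events with missing or implausible fields."""
    required = {"device_id", "timestamp", "person_detected", "confidence"}
    if not required.issubset(event):
        return False
    # Treat low-confidence detections as noise (e.g., weather artifacts).
    return 0.0 <= event["confidence"] <= 1.0 and event["confidence"] >= 0.8

event = {
    "device_id": "cam-1",
    "timestamp": "2023-05-01T08:00:00Z",
    "person_detected": True,
    "confidence": 0.42,
}
print(is_valid(event))  # False: likely a false detection, so quarantine it
```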

Variability

Variability refers to the consistency (or inconsistency) of data in terms of its format, quality, and meaning. Processing and analyzing structured, semi-structured, and unstructured data formats require different tools and techniques. For example, the type, frequency, and quality of sensor data from IoT devices can vary greatly. Temperature and humidity sensors might generate data points at regular intervals, while motion sensors might generate data only when they detect motion.

Value

Value, the most important V, relates to the usefulness and relevance of data. Companies use big data to gain insights and make decisions that can lead to business value, such as increased efficiency, cost savings, or new revenue streams. For example, by analyzing customer data, organizations can better understand their customers’ behaviors, preferences, and needs. They can use this information to develop better targeted marketing campaigns, improve customer experiences, and drive sales.

Collecting big data allows companies to gain insights that help them make better business decisions. Predictive analytics is a type of data analysis that involves using statistical algorithms and machine learning to analyze historical data and make predictions about future events and trends. This allows businesses to be proactive, not just reactive.
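
As a toy illustration, a predictive model might fit a trend line to historical monthly sales and extrapolate it forward. This sketch assumes scikit-learn is installed, and the sales figures are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Historical data: month number -> monthly sales in $K (hypothetical figures).
months = np.array([[1], [2], [3], [4], [5], [6]])
sales = np.array([100.0, 110.0, 108.0, 121.0, 129.0, 135.0])

model = LinearRegression().fit(months, sales)
forecast = model.predict(np.array([[7], [8]]))  # project the next two months
print(forecast.round(1))
```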

You’ll hear many companies calling data “the new oil,” because it has become an incredibly valuable resource in today’s digital economy, much like oil was in the industrial economy. Data is like oil in a number of ways:

  • It’s a raw material that needs to be extracted, refined, and processed in order to be useful. In the case of data, that involves collecting, storing, and analyzing it in order to gain insights that can drive business decisions.

  • It’s incredibly valuable. Companies that collect and analyze large amounts of data can use it to improve their products and services, make better business decisions, and gain a competitive advantage.

  • It can be used in a variety of ways. For example, if you use data to train machine learning algorithms, you can then use those algorithms to automate tasks, identify patterns, and make predictions.

  • It’s a powerful resource with a transformative effect on society. The widespread use of oil powered the growth of industries and enabled new technologies, while data has led to advances in fields like artificial intelligence, machine learning, and predictive analytics.

  • It can be a source of power and influence, thanks to all of the preceding factors.

For example, you can use big data to generate reports and dashboards that tell you where sales are lagging and take steps “after the fact” to improve those sales. You can also use machine learning to predict where sales will drop in the future and take proactive steps to prevent that drop. This is called business intelligence (BI): the process of collecting, analyzing, and using data to help businesses make more informed decisions.

As Figure 1-2 shows, I can collect data from new sources, such as IoT devices, web logs, and social media, as well as older sources, such as line-of-business, enterprise resource planning (ERP), and customer relationship management (CRM) applications. This data can be in multiple formats, such as CSV files, JSON files, and Parquet files. It can come over in batches, say once an hour, or it can be streamed in multiple times a second (this is called real-time streaming).

Figure 1-2. Big data processing (source: The Cloud Data Lake by Rukmani Gopalan [O’Reilly, 2023]).
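
To give a rough idea of what ingesting those formats might look like in code, each one maps to its own reader in a library like pandas. The file names below are placeholders, and reading Parquet also requires a library such as pyarrow:

```python
import pandas as pd

# Each source format has its own reader; the file names are placeholders.
orders = pd.read_csv("orders.csv")                  # structured ERP export
events = pd.read_json("events.json", lines=True)    # one JSON event per line
metrics = pd.read_parquet("metrics.parquet")        # columnar data lake format

print(len(orders), len(events), len(metrics))
```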

It’s important for a company to understand where it is in its journey to use data compared to other companies. This is called data maturity, and the next section walks through the stages of the data maturity journey so you can understand where your company stands.

Data Maturity

You may have heard many in the IT industry use the term digital transformation, which refers to how companies embed technologies across their business to drive fundamental change in the way they get value out of data and in how they operate and deliver value to customers. The process involves shifting away from traditional, manual, or paper-based processes to digital ones, leveraging the power of technology to improve efficiency, productivity, and innovation. A big part of this transformation is usually using data to improve a company’s business, which could mean creating a customer 360 profile to improve customer experience or using machine learning to improve the speed and accuracy of manufacturing lines.

This digital transformation can be broken into four stages, called the enterprise data maturity stages, illustrated in Figure 1-3. While this term is used widely in the IT industry, I have my own take on what those stages look like. They describe the level of development and sophistication an organization has reached in managing, utilizing, and deriving value from its data. This model is a way to assess an organization’s data management capabilities and readiness for advanced analytics, artificial intelligence, and other data-driven initiatives. Each stage represents a step forward in leveraging data for business value and decision making. The remainder of this section describes each stage.

Figure 1-3. Enterprise data maturity stages

Stage 1: Reactive

In the first stage, a company has data scattered all over, likely in a bunch of Excel spreadsheets and/or desktop databases on many different filesystems, being emailed all over the place. Data architects call this a spreadmart (short for “spreadsheet data mart”): an informal, decentralized collection of data often found within an organization that uses spreadsheets to store, manage, and analyze data. Individuals or teams typically create and maintain spreadmarts independently of the organization’s centralized data management system or official data warehouse. Spreadmarts suffer from data inconsistency, lack of governance, limited scalability, and inefficiency (since they often result in a lot of duplicated effort).

Stage 2: Informative

Companies reach the second maturity stage when they start to centralize their data, making analysis and reporting much easier. Stages 1 and 2 are for historical reporting, or seeing trends and patterns from the past, so Figure 1-3 calls them the “rearview mirror.” In these stages, you are reacting to what’s already happened.

At stage 2, the solution built to gather the data is usually not very scalable. Generally, the size and types of data it can handle are limited, and it can ingest data only infrequently (every night, for example). Most companies are at stage 2, especially if their infrastructure is still on-prem.1

Stage 3: Predictive

By stage 3, companies have moved to the cloud and have built a system that can handle larger quantities of data, different types of data, and data that is ingested more frequently (hourly or streaming). They have also improved their decision making by incorporating machine learning (advanced analytics) to make decisions in real time. For example, while a user is in an online bookstore, the system might recommend additional books on the checkout page based on the user’s prior purchases.

Stage 4: Transformative

Finally, at stage 4, the company has built a solution that can handle any data, no matter its size, speed, or type. It is easy to onboard new data with a shortened lead time because the architecture can handle it and has the infrastructure capacity to support it. This is a solution that lets nontechnical end users easily create reports and dashboards with the tools of their choice.

Stages 3 and 4 are the focus of this book. In particular, when end users are doing their own reporting, this activity is called self-service business intelligence, which is the subject of the next section.

Self-Service Business Intelligence

For many years, if an end user within an organization needed a report or dashboard, they had to gather all their requirements (the source data needed, plus a description of what the report or dashboard should look like), fill out an IT request form, and wait. IT then built the report, which involved extracting the data, loading it into the data warehouse, building a data model, and finally creating the report or dashboard. The end user would review it and either approve it or request changes. This often resulted in a long queue of requests, and IT ended up becoming a huge bottleneck. It took days, weeks, or even months for end users to get value out of the data. This process is now called “traditional BI,” because in recent years something better has developed: self-service BI.

The goal of any data architecture solution you build should be to make it quick and easy for any end user, no matter what their technical skills are, to query the data and to create reports and dashboards. They should not have to get IT involved to perform any of those tasks—they should be able to do it all on their own.

This goal requires more up-front work; IT will have to contact all the end users to find out what data they need, then build the data architecture with their needs in mind. But it will be well worth it for the time savings in creating the reports. This approach eliminates the queue and the back-and-forth with IT, whose team members generally have little understanding of the data. Instead, the end user, who knows the data best, accesses the data directly, prepares it, builds the data model, creates the reports, and validates that the reports are correct. This workflow is much more productive.

Creating that easy-to-consume data solution results in self-service BI. Creating a report should be as easy as dragging fields around in a workspace. End users shouldn’t have to understand how to join data from different tables or worry about a report running too slowly. When you are creating a data solution, always be asking: How easy will it be for people to build their own reports?

Summary

In this chapter, you learned what big data is and how it can help you and your organization make better business decisions, especially when combined with machine learning. You saw how to describe big data using the six Vs, and you learned what data maturity means and how to identify its stages. Finally, you learned the difference between traditional and self-service BI, where the goal is for everyone to be able to use the data to create reports and identify insights quickly and easily.

Let me now give you an idea of what to expect in the following chapters. In Chapter 2, I will go into what a data architecture is and provide a high-level overview of how the types of data architectures have changed over the years. Chapter 3 is where I show you how to conduct an architecture design session to help determine the best data architecture to use.

Part II, “Common Data Architecture Concepts,” gets into more detail about various architectures. In Chapter 4, I cover what a data warehouse is and what it is not, as well as why you would want to use one. I’ll discuss the “top-down approach,” ask if the relational data warehouse is dead, and cover ways to populate a data warehouse. Chapter 5 describes what a data lake is and why you would want to use one. It also discusses the bottom-up approach and then dives into data lake design and when to use multiple data lakes.

Chapter 6 is about common data architecture concepts related to data stores, including data marts, operational data stores, master data management, and data virtualization. Chapter 7 covers common data architecture concepts related to design, including OLTP versus OLAP, operational versus analytical data, SMP versus MPP, Lambda architecture, Kappa architecture, and polyglot persistence. Chapter 8 is all about data modeling, including relational and dimensional modeling, the Kimball versus Inmon debate, the common data model, and data vaults. And in Chapter 9, you will read about data ingestion, with sections on ETL versus ELT, reverse ETL, batch versus real-time processing, and data governance.

Part III focuses on specific data architectures. Chapter 10 describes the modern data warehouse and the five stages of building one. Chapter 11 covers the data fabric architecture and its use cases. Chapter 12 goes over the data lakehouse architecture and the trade-offs of not using a relational data warehouse.

Chapters 13 and 14 are both about data mesh architectures—there’s a lot to talk about! Chapter 13 focuses on the data mesh’s decentralized approach and the four principles of a data mesh, and it describes what data domains and data products are. Chapter 14 gets into the concerns and challenges of building a data mesh and tackles some common myths of data mesh. It’ll help you check if you are ready to adopt a data mesh. It finishes with what the future of the data mesh might look like.

Chapter 15 looks at why projects succeed and why they fail, and it describes the team organization you’ll need for building a data architecture. Finally, Chapter 16 is a discussion of open source, the benefits of the cloud, the major cloud providers, being multi-cloud, and software frameworks.

Now I’m about to revolutionize your data world. Are you ready?

1 Being on-prem, short for on-premises, refers to an organization’s hosting and managing its IT infrastructure—such as servers, storage, and networking equipment—within its own physical facilities, usually called data centers. This contrasts with cloud-based services, where these resources are hosted and managed by third-party providers such as Azure, Amazon Web Services (AWS), or Google Cloud Platform (GCP) in remote data centers. I’ll discuss the benefits of moving from on-prem to the cloud in Chapter 16, but for now, know that transitioning from on-prem servers to cloud is a huge part of most enterprises’ digital transformations.
