Chapter 4. The Relational Data Warehouse

By the mid-2000s, I had used relational databases for years, but I had never been exposed to relational data warehouses. I was working as a database administrator (DBA) for a company that used an accounting software package to manage its financial transactions. The reporting from the package was limited and slow, so the company was looking to improve performance, create dashboards to slice and dice the data, and combine its financial data with data from a homegrown application to get a better understanding of the business.

My employer hired a consulting company to build this thing called a “relational data warehouse”—and, in a decision that changed the course of my career, asked me to help. We generated dashboards that saved the end users a ton of time and added business insights they’d never had before. When I saw the excitement on their faces, I knew I’d found my new passion. I changed my career to focus on data warehousing and never looked back.

What Is a Relational Data Warehouse?

A relational data warehouse (RDW) is where you centrally store and manage large volumes of structured data copied from multiple data sources to be used for historical and trend analysis reporting so that your company can make better business decisions. It is called relational because it is based on the relational model, a widely used approach to data representation and organization for databases. In the relational model, data is organized into tables (also known as relations, hence the name). These tables consist of rows and columns, where each row represents an entity (such as a customer or product) and each column represents an attribute of that entity (like name, price, or quantity). It is called a data warehouse because it collects, stores, and manages massive volumes of structured data from various sources, such as transactional databases, application systems, and external data feeds.

Not all data warehouses are based on the relational model. Non-relational data warehouses include types like columnar, NoSQL, and graph data warehouses. However, relational data warehouses are much more popular and widely adopted, primarily because relational databases have been the dominant data management paradigm for decades. The relational model is well suited for structured data, which is commonly found in business applications. It is also popular due to the widespread use of SQL, which has been the standard language for relational data warehouses for many years.

An RDW acts as a central repository for many subject areas and contains the single version of truth (SVOT). A critical concept in data warehousing, the SVOT refers to the practice of creating a unified, consistent view of an organization’s data. It means that all the data within the data warehouse is stored in a standardized, structured format and represents a single, accurate version of the information. This ensures that all users have access to the same information, eliminating discrepancies, inconsistencies, and data silos. This improves decision making, collaboration, and efficiency across the organization. It also reduces the risk of errors and misunderstandings that can arise from working with disparate, inconsistent data sources.

Imagine you don’t have a data warehouse and are generating reports directly from multiple source systems, and maybe even some Excel files. If a report viewer questions the accuracy of the data, what can you tell them? The “truth” can be spread out over so many source systems that it’s difficult to trace where the data came from. In addition, some reports will give different results for the same data—for example, if two reports use complex logic to pull the data from multiple sources and the logic is updated incorrectly (or not at all). Having all the data in one central location means that the data warehouse is the single source of truth; any questions about the report data can be answered by the data warehouse. Maintaining a SVOT is essential for organizations looking to harness the full potential of their data.

If a data warehouse (DW) is used by the entire company, it’s often called an enterprise data warehouse (EDW). This is a more comprehensive and robust version of a data warehouse, designed to support the needs of the entire organization. While a standard DW might support a few business units, with many DWs throughout the organization, the EDW uses a wider range of data sources and data types to support all business units. The EDW provides a single, unified view of all of the organization’s data.

Figure 4-1 illustrates a major reason to have a data warehouse. The diagram on the left shows how challenging it is to run a report using data from multiple applications when you do not have a data warehouse. Each department runs a report that collects data from all the databases associated with each application. So many queries are being run that you’re bound to have performance problems and incorrect data. It’s a mess. The diagram on the right shows that, with all application data copied into the EDW, it becomes very easy for each department to run a report without compromising performance.

Figure 4-1. Before and after an enterprise data warehouse

Typically, to build a data warehouse, you will create data pipelines that perform three steps, called extract, transform, and load (ETL):

  1. The pipeline extracts data from the source systems, such as databases and flat files.

  2. The extracted data is then transformed or manipulated to fit the target systems’ requirements (in this case, to fit a data warehouse). This can involve cleaning, filtering, aggregating, or combining data from multiple sources.

  3. The transformed data is loaded into the data warehouse. A DBA can make the database and field names more meaningful, making it easier and faster for end users to create reports.
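As a minimal sketch of these three steps, here is a toy pipeline using Python’s built-in sqlite3 module as a stand-in for both the source system and the warehouse (the table and field names are made up for illustration):

```python
import sqlite3

# Hypothetical source database with a cryptic operational table.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE T116 (id INTEGER, RAP16 TEXT, amt REAL)")
src.executemany("INSERT INTO T116 VALUES (?, ?, ?)",
                [(1, " ca ", 100.0), (2, "NY", None), (3, "ny", 50.0)])

# Hypothetical data warehouse with meaningful names.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE Customer (CustomerID INTEGER, State TEXT, SalesAmount REAL)")

# 1. Extract: pull the rows from the source system.
rows = src.execute("SELECT id, RAP16, amt FROM T116").fetchall()

# 2. Transform: clean up state codes and filter out incomplete records.
clean = [(cid, state.strip().upper(), amt)
         for cid, state, amt in rows if amt is not None]

# 3. Load: write the transformed rows into the warehouse.
dw.executemany("INSERT INTO Customer VALUES (?, ?, ?)", clean)

print(dw.execute("SELECT * FROM Customer ORDER BY CustomerID").fetchall())
# → [(1, 'CA', 100.0), (3, 'NY', 50.0)]
```

A real pipeline would use an ETL tool or orchestration framework rather than hand-written scripts, but the shape of the work is the same.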

What a Data Warehouse Is Not

Now that you know what a data warehouse is, let’s clarify its purpose by looking at solutions that should not be considered a data warehouse (though I have seen people do so many times):

DW prefix

A data warehouse is not just a copy of a source database from an operational system with DW added to the name. For example, say you were to copy a database called Finance containing 50 operational tables and call the copy DW_Finance, then use those 50 tables to build your reports. This would result in a data warehouse designed for operational data when instead you need it designed for analytical data. With analytical data, you have better read performance and can create data models to make it easier for end users to build reports. (I’ll explain more in the next section.)

Views with unions

A data warehouse is not a copy of multiple tables from various source systems unioned together in a SQL view. (Unioning is done via the SQL UNION statement, which combines the results of two or more SELECT statements into a single result set.) For example, if you copied data from three source systems that each contain customers, you’d end up with three tables in the data warehouse called CustomerSource1, CustomerSource2, and CustomerSource3. So you’d need to create a view called CustomerView that is a SELECT statement unioning the tables CustomerSource1, CustomerSource2, and CustomerSource3. You’d repeat this process for other tables, such as products and orders.
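Here is a toy illustration of that anti-pattern, using Python’s sqlite3 with the table names from the example above:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Three per-source copies of the same customer entity -- the anti-pattern.
for i in (1, 2, 3):
    db.execute(f"CREATE TABLE CustomerSource{i} (CustomerID INTEGER, Name TEXT)")
    db.execute(f"INSERT INTO CustomerSource{i} VALUES ({i}, 'Customer {i}')")

# Stitching the copies together with a UNION view.
db.execute("""
    CREATE VIEW CustomerView AS
    SELECT CustomerID, Name FROM CustomerSource1
    UNION
    SELECT CustomerID, Name FROM CustomerSource2
    UNION
    SELECT CustomerID, Name FROM CustomerSource3
""")

print(db.execute("SELECT * FROM CustomerView ORDER BY CustomerID").fetchall())
# → [(1, 'Customer 1'), (2, 'Customer 2'), (3, 'Customer 3')]
```

The view works, but every query now scans three tables, duplicates across sources are not reconciled, and each new source means editing every view.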

Instead, the data from the three tables should be copied into one table in the data warehouse, which requires the extra work of creating a data model that fits all three tables. You would likely want to use master data management (MDM), explained in Chapter 6, at this point to prevent duplicates and improve accessibility and performance.

Dumping ground

A data warehouse is not a dumping ground for tables. Many times, this practice arises when a company does not have a DW and an end user wants to create a report from a subset of data from a couple of source systems. To help them out quickly, a person from IT creates a DW without much thought, copying the data from those two source systems into the DW. Then other end users see the benefit the first end user got, and they want additional data from those same source systems and a few others to create their own reports. So once again the IT person quickly copies the requested data into the DW. This process repeats over and over until the DW becomes a jumbled mess of databases and tables.

So many DWs start out as one-off solutions for a couple of users, then morph into full-blown but poorly designed DWs for the entire company. There is a better way.

Instead, when that first end-user request comes in, assess your company’s reporting needs. Find out if the request is really a one-off or if it should be the start of building an EDW. If it should, this is your chance to show senior leaders why your company needs a DW. Be adamant that you need enough up-front time to design a DW that can support many data sources and end users. (Use “Why Use a Relational Data Warehouse?” to support your case.)

The Top-Down Approach

In an RDW, you will do a lot of work up front to get the data to where you can use it to create reports. Doing all this work beforehand is a design and implementation methodology referred to as a top-down approach. This approach works well for historical-type reporting, in which you’re trying to determine what happened (descriptive analytics) and why it happened (diagnostic analytics). In the top-down approach, you first establish the overall planning, design, and architecture of the data warehouse and then develop specific components. This method emphasizes the importance of defining an enterprise-wide vision and understanding the organization’s strategic goals and information requirements before diving into the development of the data warehouse.

Descriptive analytics and diagnostic analytics are two important types of data analysis that are commonly used in business. Descriptive analytics involves analyzing data to describe past or current events, often through the use of summary statistics or data visualizations. This type of analysis is used to understand what has happened in the past and to identify patterns or trends in the data that can help with decision making.

Diagnostic analytics is used to investigate the causes of past events, typically by examining relationships between different variables or factors. This type of analysis can identify the root causes of problems or diagnose issues that may be affecting business performance.

Suppose a company wants to analyze sales data from the past year. Descriptive analytics would involve calculating summary statistics such as total sales revenue, average sales per day, and sales by product category to understand what happened. Diagnostic analytics, in contrast, would examine relationships between factors (such as sales and marketing spend, or seasonality and customer demographics) to understand why sales fluctuated throughout the year. By combining both approaches, companies can gain a deeper understanding of their data and make more informed decisions.
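The descriptive half of that example can be sketched in a few lines of Python (the sales figures are invented):

```python
# Hypothetical daily sales: (product_category, revenue) for four days.
sales = [
    ("Electronics", 1200.0), ("Clothing", 300.0),
    ("Electronics", 900.0), ("Clothing", 450.0),
]

# Descriptive analytics: summarize what happened.
total = sum(rev for _, rev in sales)
avg_per_day = total / len(sales)

by_category = {}
for cat, rev in sales:
    by_category[cat] = by_category.get(cat, 0.0) + rev

print(total, avg_per_day, by_category)
# → 2850.0 712.5 {'Electronics': 2100.0, 'Clothing': 750.0}
```

Diagnostic analytics would then correlate these figures against other variables (marketing spend, seasonality) to explain the swings.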

Figure 4-2 shows the architecture of a typical RDW. ETL is used to ingest data from multiple sources into the RDW, where reporting and other analytics can be performed.

Figure 4-2. The architecture of a data warehouse

The top-down approach typically involves the following steps:

1. Formulate some hypotheses up front.
Start with a clear understanding of corporate strategy. Then make sure you know what questions you want to ask of the data.
2. Define the business requirements.
Identify the organization’s goals, objectives, and key performance indicators (KPIs). Gather and analyze the information needs of various departments and users. You can also think of this step as defining your reporting requirements.
3. Design the data warehouse architecture.
Based on the business requirements, create a high-level architecture for the data warehouse, including its structure, data models, and data integration processes. These will be your technical requirements.
4. Develop the data model.
Design a detailed data model for the data warehouse, taking into account the relationships between various data entities and the granularity of the data.
5. Build the architecture.
Develop the appropriate databases, schemas, tables, and fields for the data warehouse. This is the previously described approach called schema-on-write.
6. Develop ETL.
Develop the ETL processes to extract data from various source systems, transform it into the desired format, and load it into the data warehouse.
7. Develop and deploy BI tools and applications.
Implement BI tools and applications that allow users to access, analyze, and report on the data stored in the data warehouse.
8. Test and refine the data warehouse.
Perform testing to ensure data quality, performance, and reliability. Make any necessary adjustments to optimize the system.
9. Maintain and expand the data warehouse.
As the organization’s needs evolve, update and expand the data warehouse accordingly.

The top-down approach has some advantages, such as a comprehensive view of the organization’s data needs, better data consistency, and improved governance. However, it can also be time-consuming and resource intensive, taking longer to deliver value than the bottom-up approach used by the data lake, described in Chapter 5. The modern data warehouse architecture, described in Chapter 10, combines both the top-down and bottom-up approaches.

Why Use a Relational Data Warehouse?

Having an RDW makes it much easier to build any kind of BI solution, since the solution can pull data solely from the RDW without complex logic to pull data from multiple source systems. It also won’t have to clean or join the data, because the RDW will have already done that. A BI solution built from the RDW could be a data mart (which contains a subset of the RDW data for a specific group of people, as explained in Chapter 6), could aggregate data to make queries and reports faster, and could even be used within Microsoft Excel. The bottom line is that with an RDW, you already have a solid foundation to build upon.

Let’s look in detail at some of the major benefits you can get from using an RDW:

Reduce stress on the production system

You may have seen this problem before: an angry call from an end user complaining that inserting orders via the order entry application is taking forever. You look into it, and it turns out that another end user is running a report via the order entry application that is hogging all the resources on the server where the application resides. This situation is especially common when end users are allowed to create ad hoc queries and they come up with poorly written SQL.

By copying the order entry application database to a DW and optimizing it, you can have all reports and ad hoc queries go against the DW and avoid this problem altogether, especially if the end user needs to run a report that goes against multiple application databases.

Optimize for read access

Application databases are going to be optimized to support all the CRUD operations equally, so the reading of data will not be as fast as it could be. The data warehouse, on the other hand, is a write-once, read-many type of system, meaning it will be used mainly for the reading of data. Therefore, it can be optimized for read access, especially for the time-consuming sequential disk scans that frequently occur when reports or queries are run. There are many database techniques that can be used to speed up read access in a DW, some to the detriment of write access, which we are not concerned about.

Integrate multiple sources of data

The ability to integrate many sources of data in order to create more useful reports is one of the more important reasons to build a DW. Locating all the data in one spot instead of having it spread out over various databases not only makes report building easier but greatly improves reporting performance.

Run accurate historical reports

Without a DW, end users of applications usually run all their reports on a particular day each month (usually the last). They then save them to disk so they have copies that they can refer to in the future. For example, the user wants to look at a report from a few months ago that lists customer sales by state. However, one customer has recently moved to a different state. If the user runs a current report, it would incorrectly show that customer’s sales in their new state instead of their old state (since their record in the database has been updated to their new state). Hence, the user must look back at a saved older report instead of running a current report.

A DW can take care of this by keeping track of when a customer moves (via tracking customer location history with start and end dates) as well as any other fields that need to be tracked (for example, employer or income). Now the user can run a report today but ask it to pull data as of some date in the past, and the report will be accurate. Moreover, saving report files each month is no longer required.
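One common way to implement this tracking is to give each customer row start and end dates, so a report can be run “as of” any past date. Here is a sketch with hypothetical data, using Python’s sqlite3:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE CustomerHistory (
        CustomerID INTEGER, State TEXT,
        StartDate TEXT, EndDate TEXT  -- '9999-12-31' marks the current row
    )
""")
# Customer 1 lived in NY, then moved to CA on 2024-06-01.
db.executemany("INSERT INTO CustomerHistory VALUES (?, ?, ?, ?)", [
    (1, "NY", "2020-01-01", "2024-05-31"),
    (1, "CA", "2024-06-01", "9999-12-31"),
])

def state_as_of(as_of_date):
    # Pick the row whose date range covers the requested reporting date.
    return db.execute(
        "SELECT State FROM CustomerHistory "
        "WHERE CustomerID = 1 AND ? BETWEEN StartDate AND EndDate",
        (as_of_date,)).fetchone()[0]

print(state_as_of("2024-03-15"))  # → NY (an old report stays accurate)
print(state_as_of("2024-07-01"))  # → CA
```

This pattern is often called a type 2 slowly changing dimension; ISO-formatted date strings make the BETWEEN comparison work correctly.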

Restructure and rename tables

Many application databases have table and field names that are very difficult to understand, especially older ERP and CRM products (think table names such as T116 and field names like RAP16). In the data warehouse, you can copy the data from those source tables into something much easier to understand (for example, Customer instead of T116). You can also likely come up with a better data model for all the tables. End users will be able to create reports much more easily when they don’t have to translate cryptic table and field names.

Protect against application upgrades

Imagine you don’t have a DW and users instead create reports against an application database. Everything is running fine, and then all of a sudden, many reports start giving errors. It turns out the application went through an upgrade, installing a new version that renamed a bunch of tables and fields. So now you must go through each and every report, out of hundreds, and rename the changed tables and fields. That could take months, resulting in a lot of upset end users. Even after that, any reports that got missed might still give errors.

A DW can protect you against this. After an application upgrade, only the ETL that copies data from the application databases to the DW needs to be updated—a quick task. The reports do not have to be changed. End users won’t see any new data until the ETL is fixed, but their reports don’t have errors.

Reduce security concerns

Without a DW, your team would need to give each end user security access to each application database they needed to use for reporting purposes. There could be dozens; the process of providing access could take weeks, and sometimes they still might not have access to everything they need. With a DW, each end user needs access only to the appropriate tables, and providing this is much faster and easier.

Keep historical data

Many production systems limit the amount of historical data they keep (for example, data from the last three years). They do this to save space and improve performance and, in some cases, to comply with regulations. Older data is usually purged yearly or monthly. On the other hand, a DW can hold all of the history, so you never have to worry about running a report for older years and not finding any data.

Master data management (MDM)

As you collect data from multiple source systems, many times you will need to use MDM to remove duplicate records for such things as customers, products, and assets. (See Chapter 6 for a more detailed explanation of MDM.) The DW is the perfect place to perform MDM. Also, many of the MDM tools allow you to create hierarchies (for example, Company → Department → Employee), adding more value to mastering the data.

Improve data quality by plugging holes in source systems

You will find that a lot of the data you get from the various source systems needs to be cleaned, despite what the owners of the applications say (I have heard them say “Our data is clean” many times, only to be proved wrong). For example, an order entry application may require a customer’s birth date, and if the person entering the data does not know the customer’s birth date, they might enter a date in the future or a date more than 100 years old just to be able to complete the order. Or maybe the application does not check the accuracy of the two digits entered for a state code. There are always dozens of “holes” in the source system. You can not only clean the data in the DW, but also notify the people who maintain the applications of the holes in their systems so they can fix them. In this way, you help prevent the entry of bad data in the future.

Eliminate IT involvement in report creation

This goes back to the self-service BI mentioned in Chapter 3: building a proper DW will remove the need to get IT involved with building reports and leave this task in the hands of the end user. Without the bottleneck of limited IT resources, reports and dashboards can be built sooner. And IT will be thankful they can work on more interesting projects than creating reports!

Drawbacks to Using a Relational Data Warehouse

There are always trade-offs, and here are the drawbacks to consider when building an RDW:

Complexity

DWs can be complex and time-consuming to design, build, and maintain. The specialized skills and resources required can increase costs.

High costs

Implementing a DW can be expensive, requiring significant investments in hardware, software, and personnel. Ongoing maintenance and upgrades can also add to the cost.

Data integration challenges

Integrating data from various sources can be challenging, as it may involve dealing with different data formats, structures, and quality issues. This can result in spending time and effort on data cleaning and preprocessing. In addition, certain data, such as streaming data from IoT devices, is too challenging to ingest into an RDW and so the potential insights from this information are lost.

Time-consuming data transformation

For data to be loaded into a DW, it may need to be transformed to conform to the warehouse’s data model. This process can be time-consuming, and errors in data transformation can lead to inaccurate analysis.

Data latency

Because DWs are designed to handle large volumes of data, they can be slower to process than other types of databases. This can result in data latency, where the data in the warehouse is not up-to-date with the most recent changes to the source databases.

Maintenance window

With an RDW, you usually need a maintenance window. Loading and cleaning data is very resource intensive, and if users are trying to run reports at the same time, they will experience very slow performance. So users must be locked out of the warehouse while maintenance is going on, preventing 24/7 access. If any problems occur during the maintenance window, such as a failed ETL job, you may have to extend the maintenance window. If users try to run reports and are still locked out, you’ll have upset users who can’t perform their jobs.

Limited flexibility

DWs are designed to support specific types of analysis, which can limit their flexibility for other types of data processing or analysis. Additional tools or systems may need to be integrated with the warehouse to meet specific needs.

Security and privacy concerns

Storing large amounts of sensitive data in a centralized location can increase the risk of data breaches and privacy violations, necessitating strong security measures.

Populating a Data Warehouse

Because the source tables that are fed into a data warehouse change over time, the DW needs to reflect those changes. This sounds simple enough, but there are many decisions to make: how often to extract (or pull) the data, what extract method to use, how to physically extract the data, and how to determine which data has changed since the last extraction. I’ll briefly discuss each of these.

How Often to Extract the Data

How often you need to update the DW depends largely on how often the source systems are updated and how timely the end user needs the reporting to be. Often end users don’t want to see data for the current day, preferring to get all data through the end of the prior day. In this case, you can run your jobs to extract the data from the source systems via ETL tools each night after the source system databases have finished updating, creating a nightly maintenance window to do all the data transfer. If the end users require updates during the day, then a more frequent extract, say hourly, will be required.

One thing to consider is the size of the data for each extract. If it is very large, updating the DW may take too long, so you might want to split the update into smaller chunks and do more frequent extracts and updates (for example, hourly instead of daily). Also, it may take too long to transfer large amounts of data from the source systems to the data warehouse, especially if the source data is on-prem and you don’t have a large pipeline from the source system to the internet. This is another reason why you may want to go from a large nightly transfer to smaller hourly transfers during the day.

Extraction Methods

There are two methods for extracting data from source systems. Let’s look at each:

Full extraction

In a full extraction, all the data is extracted completely from one or more tables in the source system. This works best for smaller tables. Because this extraction reflects all the data currently available on the source system, there is no need to keep track of changes, making this method very easy to build. The source data is provided as is, and you don’t need any additional information (for example, timestamps).

Incremental extraction

In incremental extraction, you’re pulling only the data that has changed since a specified time (such as the last extraction or the end of a fiscal period) instead of the whole table. This works best for large tables, and it works only when it’s possible to identify all the changed information (discussed below).

With most source systems, you’ll use a combination of these two methods.

Whether you’re doing a full or an incremental extraction, there are two ways to extract the data: online and offline.

In online extraction, the extraction process can connect directly to the source system to access the source tables, or it may connect to an intermediate system that stores changes to the data in a preconfigured manner (for example, in transaction logs or change tables).

However, direct access to the source system is not always available. In such cases, the data is staged outside the original source system and is created by an extraction routine originating from the source system (for example, a mainframe performs an extract routine on a table and deposits the data in a folder in a filesystem). The extracted data is usually placed in a flat file that is in a defined, generic format (for example, CSV or JSON).

How to Determine What Data Has Changed Since the Last Extraction

Unfortunately, it can be difficult for many source systems to identify the recently modified data and do an incremental extract. Following are several techniques for identifying recently modified data and implementing incremental extraction from source systems. These techniques can work in conjunction with the data extraction methods discussed. Some techniques are based on the characteristics of the source systems; others may require modifications to the source systems. The source system’s owners should carefully evaluate any technique prior to implementation:

Timestamps

Timestamps are the preferred option and the easiest to implement. The tables in some operational systems have timestamp columns that record the date and time a given row was last modified, making it easy to identify the latest data. In relational databases, this column is often given the data type timestamp or datetime, along with a name like Timestamp or LastModified. Ideally, the source application populates this column. If it doesn’t, you can set up the relational database to default the column to the current date when a record is saved, or you can add database triggers to populate it.
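A sketch of a timestamp-based incremental extract, using Python’s sqlite3 and a hypothetical Orders table: the extract pulls only rows modified after the previous run’s high-water mark.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("""
    CREATE TABLE Orders (
        OrderID INTEGER, Amount REAL,
        LastModified TEXT DEFAULT (datetime('now'))  -- set when a row is saved
    )
""")
src.executemany(
    "INSERT INTO Orders (OrderID, Amount, LastModified) VALUES (?, ?, ?)", [
        (1, 10.0, "2024-01-01 09:00:00"),
        (2, 20.0, "2024-01-02 09:00:00"),
        (3, 30.0, "2024-01-03 09:00:00"),
    ])

# The warehouse remembers the high-water mark of the previous extract run.
last_extracted = "2024-01-01 12:00:00"

# Incremental extract: only rows modified since the last run.
changed = src.execute(
    "SELECT OrderID, Amount FROM Orders WHERE LastModified > ?",
    (last_extracted,)).fetchall()
print(changed)  # → [(2, 20.0), (3, 30.0)]
```

After each run, the warehouse saves the maximum LastModified value it saw as the new high-water mark.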

Change data capture

Most relational databases support change data capture (CDC), which records the INSERTs, UPDATEs, and DELETEs applied to database tables and, based on the database’s transaction log, makes available a record of what changed, where, and when. If you need near-real-time data warehousing, where changes to the source system are reflected in the data warehouse within a few seconds, CDC can be the key enabling technology.

Partitioning

Some source systems use range partitioning, in which the source tables are partitioned along a date key, which makes it easy to identify new data. For example, if you are extracting from an orders table partitioned by day, it is easy to identify the current or previous day’s data.

Database triggers

You can add a trigger for INSERT, UPDATE, and DELETE on a single table and have those triggers write the information about the record change to a “change table.” This is similar to change data capture, so use CDC if your database product supports it; otherwise, use triggers.
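A sketch of trigger-based change tracking, again in Python’s sqlite3 with hypothetical table names:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Customer (CustomerID INTEGER, State TEXT)")
db.execute("CREATE TABLE CustomerChanges (CustomerID INTEGER, Operation TEXT)")

# One trigger per operation, each writing a row to the change table.
db.executescript("""
    CREATE TRIGGER trg_ins AFTER INSERT ON Customer BEGIN
        INSERT INTO CustomerChanges VALUES (NEW.CustomerID, 'INSERT');
    END;
    CREATE TRIGGER trg_upd AFTER UPDATE ON Customer BEGIN
        INSERT INTO CustomerChanges VALUES (NEW.CustomerID, 'UPDATE');
    END;
    CREATE TRIGGER trg_del AFTER DELETE ON Customer BEGIN
        INSERT INTO CustomerChanges VALUES (OLD.CustomerID, 'DELETE');
    END;
""")

db.execute("INSERT INTO Customer VALUES (1, 'NY')")
db.execute("UPDATE Customer SET State = 'CA' WHERE CustomerID = 1")
db.execute("DELETE FROM Customer WHERE CustomerID = 1")

print(db.execute("SELECT * FROM CustomerChanges").fetchall())
# → [(1, 'INSERT'), (1, 'UPDATE'), (1, 'DELETE')]
```

The extract job then reads (and clears) the change table instead of scanning the source table.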

MERGE statement

The least preferable option is to do a full extraction from the source system to a staging area in the DW, then compare that table with a previous full extract using a MERGE statement to identify the changed data. You will need to compare all source fields with all destination fields (or use a hash function). This approach likely won’t have a significant impact on the source system, but it can place a considerable burden on the DW, particularly if the data volumes are large, so treat it as a last resort when no other technique is possible.
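Not every database product supports MERGE, but the comparison itself can be sketched with the hash-function variant mentioned above: hash every row of today’s full extract and compare it against the hashes saved from the previous extract (the data is invented; detecting deletes would need an extra pass over the previous keys):

```python
import hashlib

def row_hash(row):
    # Hash all fields so one comparison covers the whole row.
    return hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()

# Hashes saved from the previous full extract, keyed by primary key.
previous = {1: row_hash((1, "NY", 100.0)),
            2: row_hash((2, "CA", 200.0))}

current_extract = [(1, "NY", 100.0),   # unchanged
                   (2, "CA", 250.0),   # updated
                   (3, "TX", 50.0)]    # new

# Keep only rows whose hash differs from (or is missing in) the previous run.
changed = [r for r in current_extract
           if row_hash(r) != previous.get(r[0])]
print(changed)  # → [(2, 'CA', 250.0), (3, 'TX', 50.0)]
```

Storing one hash per row rather than every field keeps the comparison cheap, but the full extract itself is still the expensive part.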

The Death of the Relational Data Warehouse Has Been Greatly Exaggerated

Around the early 2010s, people in IT started questioning whether the relational data warehouse was needed anymore, asking, “Is the relational data warehouse dead?” Many people understood this as asking if businesses still need DWs. They do, as this chapter points out. But the question is really about the data warehouse architecture—can you just use a data lake, or should you use both a data lake and an RDW?

When data lakes first appeared, they were built on Apache Hadoop technology, and it was largely Hadoop vendors pronouncing the RDW dead. “Just put all your data in the data lake and get rid of your RDW,” they advised. As mentioned in Chapter 2, the projects that attempted to do that all failed.

For many years, I had felt that RDWs would always be needed because data lakes were all Hadoop based and there were just too many limitations. But once solutions like Delta Lake (see Chapter 12) became available and data lakes began using better, easier-to-use products than Hadoop (see Chapter 16), I started to see some use cases where a solution could work without an RDW. That type of solution is a data lakehouse architecture, which will be covered in Chapter 12.

However, there are still plenty of use cases where an RDW is needed. And while data lake technologies will continue to improve, thereby reducing or eliminating the concerns about bypassing an RDW (see Chapter 12), we will never completely do away with RDWs. I think there are three reasons for this. First, it’s still harder to report off a data lake than from a DW. Second, RDWs continue to meet the information needs of users and provide value. Third, many people use, depend on, and trust DWs and don’t want to replace them with data lakes.

Data lakes offer a rich source of data for data scientists and self-service data consumers (“power users”), and they serve the needs of analytics and big data well. But not all data and information workers want to become power users. The majority continue to need well-integrated, systematically cleansed, easy-to-access relational data that includes a historical log that captures how things have evolved or progressed over a period of time. These people are best served with a data warehouse.

Summary

This chapter covered the first widely used technology solution to centralize data from multiple sources and report on it: the relational data warehouse. The RDW revolutionized the way businesses and organizations manage their data by providing a centralized repository for data storage and retrieval, enabling more efficient data management and analysis. With the ability to store and organize data in a structured manner, RDWs allow users to generate complex queries and reports quickly and easily, providing valuable insights and supporting critical decision making.

Today, the relational data warehouse remains a fundamental component of many data architectures and can be seen in a wide range of industries, from finance and healthcare to retail and manufacturing. The following chapter discusses the next technology to become a major factor in centralizing and reporting on data: the data lake.
