Chapter 10. Data Warehouse

Organizations today are making their decisions based on data. A data warehouse, also known as an enterprise data warehouse (EDW), is a large collective store of data used to make such data-driven decisions, which makes it one of the centerpieces of an organization’s data infrastructure. One of the most common applications of Hadoop is as a complement to an EDW architecture, often referred to as data warehouse offload or data warehouse optimization.

This data warehouse offload with Hadoop encompasses several areas, including:

ETL/ELT

Extract-transform-load (ETL) refers to the process in which data is extracted from a source system, transformed, and then loaded into a target system for further processing and analysis. The transformations may include applying business rules, combining the data with other data sources, or validating and rejecting records based on some criteria. Another common processing pattern is extract-load-transform (ELT). In the ELT model, data is first loaded into the target system, generally into a set of temporary tables, and then transformed there. We’ll get into more detail shortly on ETL and ELT processing in the traditional data warehouse world, and the benefits that Hadoop provides in offloading this processing from existing systems.

Data archiving

Since traditional databases are often difficult and expensive to scale as data volumes grow, it’s common to move historical data to external archival systems such as tape. The ...
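
To make the ETL/ELT distinction described earlier concrete, here is a minimal sketch in Python. It is not how a Hadoop-based pipeline would be built; it uses an in-memory SQLite database as a stand-in for the target system. The sample records, the table names (orders_etl, orders_staging, orders_elt), and the validation rule (reject negative amounts) are all assumptions made up for illustration.

```python
import sqlite3

# Hypothetical source records; field names and values are illustrative only.
SOURCE_ROWS = [
    {"order_id": 1, "amount": "19.99", "country": "us"},
    {"order_id": 2, "amount": "-5.00", "country": "de"},  # invalid: negative amount
    {"order_id": 3, "amount": "42.50", "country": "fr"},
]


def run_etl(conn, rows):
    """ETL: transform and validate outside the target, then load the result."""
    conn.execute(
        "CREATE TABLE orders_etl (order_id INTEGER, amount REAL, country TEXT)"
    )
    cleaned = []
    for row in rows:
        amount = float(row["amount"])
        if amount < 0:  # assumed validation rule: reject negative amounts
            continue
        cleaned.append((row["order_id"], amount, row["country"].upper()))
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def run_elt(conn, rows):
    """ELT: load raw data into a temporary table, then transform it with SQL."""
    conn.execute(
        "CREATE TEMP TABLE orders_staging (order_id INTEGER, amount TEXT, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders_staging VALUES (?, ?, ?)",
        [(r["order_id"], r["amount"], r["country"]) for r in rows],
    )
    # The transformation runs inside the target system itself.
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT order_id,
               CAST(amount AS REAL) AS amount,
               UPPER(country) AS country
        FROM orders_staging
        WHERE CAST(amount AS REAL) >= 0
        """
    )


def fetch(conn, table):
    """Return all rows of one of the demo tables, ordered by order_id."""
    query = f"SELECT order_id, amount, country FROM {table} ORDER BY order_id"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
run_etl(conn, SOURCE_ROWS)
run_elt(conn, SOURCE_ROWS)
etl_result = fetch(conn, "orders_etl")
elt_result = fetch(conn, "orders_elt")
print(etl_result)
print(elt_result)
```

Both paths end with the same cleaned rows in the target. The difference is where the transformation work happens: in application code before the load (ETL), or inside the target system after the raw data has landed in a temporary table (ELT).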
