Building the Data Warehouse

Name: Building the Data Warehouse
Author: W. H. Inmon
ISBN: 9780764599446

by W. H. Inmon

October 2005

Beginner to intermediate

576 pages

13h 35m

English

Wiley

Copyright
Credits
About the Author
Preface for the Second Edition
Acknowledgments
1. Evolution of Decision Support Systems
1.1. The Evolution
1.1.1. The Advent of DASD
1.1.2. PC/4GL Technology
1.1.3. Enter the Extract Program
1.1.4. The Spider Web
1.2. Problems with the Naturally Evolving Architecture
1.2.1. Lack of Data Credibility
1.2.2. Problems with Productivity
1.2.3. From Data to Information
1.2.4. A Change in Approach
1.2.5. The Architected Environment
1.2.6. Data Integration in the Architected Environment
1.2.7. Who Is the User?
1.3. The Development Life Cycle
1.4. Patterns of Hardware Utilization
1.5. Setting the Stage for Re-engineering
1.6. Monitoring the Data Warehouse Environment
1.7. Summary
2. The Data Warehouse Environment
2.1. The Structure of the Data Warehouse
2.2. Subject Orientation
2.3. Day 1 to Day n Phenomenon
2.4. Granularity
2.4.1. The Benefits of Granularity
2.4.2. An Example of Granularity
2.4.3. Dual Levels of Granularity
2.5. Exploration and Data Mining
2.6. Living Sample Database
2.7. Partitioning as a Design Approach
2.7.1. Partitioning of Data
2.8. Structuring Data in the Data Warehouse
2.9. Auditing and the Data Warehouse
2.10. Data Homogeneity and Heterogeneity
2.11. Purging Warehouse Data
2.12. Reporting and the Architected Environment
2.13. The Operational Window of Opportunity
2.14. Incorrect Data in the Data Warehouse
2.15. Summary
3. The Data Warehouse and Design
3.1. Beginning with Operational Data
3.2. Process and Data Models and the Architected Environment
3.3. The Data Warehouse and Data Models
3.3.1. The Data Warehouse Data Model
3.3.2. The Midlevel Data Model
3.3.3. The Physical Data Model
3.4. The Data Model and Iterative Development
3.5. Normalization and Denormalization
3.5.1. Snapshots in the Data Warehouse
3.6. Metadata
3.6.1. Managing Reference Tables in a Data Warehouse
3.7. Cyclicity of Data — The Wrinkle of Time
3.8. Complexity of Transformation and Integration
3.9. Triggering the Data Warehouse Record
3.9.1. Events
3.9.2. Components of the Snapshot
3.9.3. Some Examples
3.10. Profile Records
3.11. Managing Volume
3.12. Creating Multiple Profile Records
3.13. Going from the Data Warehouse to the Operational Environment
3.14. Direct Operational Access of Data Warehouse Data
3.15. Indirect Access of Data Warehouse Data
3.15.1. An Airline Commission Calculation System
3.15.2. A Retail Personalization System
3.15.3. Credit Scoring
3.16. Indirect Use of Data Warehouse Data
3.17. Star Joins
3.18. Supporting the ODS
3.19. Requirements and the Zachman Framework
3.20. Summary
4. Granularity in the Data Warehouse
4.1. Raw Estimates
4.2. Input to the Planning Process
4.3. Data in Overflow
4.3.1. Overflow Storage
4.4. What the Levels of Granularity Will Be
4.5. Some Feedback Loop Techniques
4.6. Levels of Granularity — Banking Environment
4.7. Feeding the Data Marts
4.8. Summary
5. The Data Warehouse and Technology
5.1. Managing Large Amounts of Data
5.2. Managing Multiple Media
5.3. Indexing and Monitoring Data
5.4. Interfaces to Many Technologies
5.5. Programmer or Designer Control of Data Placement
5.6. Parallel Storage and Management of Data
5.6.1. Metadata Management
5.7. Language Interface
5.8. Efficient Loading of Data
5.9. Efficient Index Utilization
5.10. Compaction of Data
5.11. Compound Keys
5.12. Variable-Length Data
5.13. Lock Management
5.14. Index-Only Processing
5.15. Fast Restore
5.16. Other Technological Features
5.17. DBMS Types and the Data Warehouse
5.18. Changing DBMS Technology
5.19. Multidimensional DBMS and the Data Warehouse
5.20. Data Warehousing across Multiple Storage Media
5.21. The Role of Metadata in the Data Warehouse Environment
5.22. Context and Content
5.22.1. Three Types of Contextual Information
5.22.2. Capturing and Managing Contextual Information
5.22.3. Looking at the Past
5.23. Refreshing the Data Warehouse
5.24. Testing
5.25. Summary

6. The Distributed Data Warehouse
6.1. Types of Distributed Data Warehouses
6.1.1. Local and Global Data Warehouses
6.1.1.1. The Local Data Warehouse
6.1.1.2. The Global Data Warehouse
6.1.1.3. Intersection of Global and Local Data
6.1.1.4. Redundancy
6.1.1.5. Access of Local and Global Data
6.1.2. The Technologically Distributed Data Warehouse
6.1.3. The Independently Evolving Distributed Data Warehouse
6.2. The Nature of the Development Efforts
6.2.1. Completely Unrelated Warehouses
6.3. Distributed Data Warehouse Development
6.3.1. Coordinating Development across Distributed Locations
6.3.2. The Corporate Data Model — Distributed
6.3.3. Metadata in the Distributed Warehouse
6.4. Building the Warehouse on Multiple Levels
6.5. Multiple Groups Building the Current Level of Detail
6.5.1. Different Requirements at Different Levels
6.5.2. Other Types of Detailed Data
6.5.3. Metadata
6.6. Multiple Platforms for Common Detail Data
6.7. Summary
7. Executive Information Systems and the Data Warehouse
7.1. EIS — The Promise
7.2. A Simple Example
7.3. Drill-Down Analysis
7.4. Supporting the Drill-Down Process
7.5. The Data Warehouse as a Basis for EIS
7.6. Where to Turn
7.7. Event Mapping
7.8. Detailed Data and EIS
7.9. Keeping Only Summary Data in the EIS
7.10. Summary
8. External Data and the Data Warehouse
8.1. External Data in the Data Warehouse
8.2. Metadata and External Data
8.3. Storing External Data
8.4. Different Components of External Data
8.5. Modeling and External Data
8.6. Secondary Reports
8.7. Archiving External Data
8.8. Comparing Internal Data to External Data
8.9. Summary
9. Migration to the Architected Environment
9.1. A Migration Plan
9.2. The Feedback Loop
9.3. Strategic Considerations
9.4. Methodology and Migration
9.5. A Data-Driven Development Methodology
9.5.1. Data-Driven Methodology
9.5.2. System Development Life Cycles
9.5.3. A Philosophical Observation
9.6. Summary
10. The Data Warehouse and the Web
10.1. Supporting the eBusiness Environment
10.2. Moving Data from the Web to the Data Warehouse
10.3. Moving Data from the Data Warehouse to the Web
10.4. Web Support
10.5. Summary
11. Unstructured Data and the Data Warehouse
11.1. Integrating the Two Worlds
11.1.1. Text — The Common Link
11.1.2. A Fundamental Mismatch
11.1.3. Matching Text across the Environments
11.1.4. A Probabilistic Match
11.1.5. Matching All the Information
11.2. A Themed Match
11.2.1. Industrially Recognized Themes
11.2.2. Naturally Occurring Themes
11.2.3. Linkage through Themes and Themed Words
11.2.4. Linkage through Abstraction and Metadata
11.3. A Two-Tiered Data Warehouse
11.3.1. Dividing the Unstructured Data Warehouse
11.3.2. Documents in the Unstructured Data Warehouse
11.3.3. Visualizing Unstructured Data
11.4. A Self-Organizing Map (SOM)
11.4.1. The Unstructured Data Warehouse
11.4.2. Volumes of Data and the Unstructured Data Warehouse
11.5. Fitting the Two Environments Together
11.6. Summary
12. The Really Large Data Warehouse
12.1. Why the Rapid Growth?
12.2. The Impact of Large Volumes of Data
12.2.1. Basic Data-Management Activities
12.2.2. The Cost of Storage
12.2.3. The Real Costs of Storage
12.2.4. The Usage Pattern of Data in the Face of Large Volumes
12.2.5. A Simple Calculation
12.2.6. Two Classes of Data
12.2.7. Implications of Separating Data into Two Classes
12.3. Disk Storage in the Face of Data Separation
12.3.1. Near-Line Storage
12.3.2. Access Speed and Disk Storage
12.3.3. Archival Storage
12.3.4. Implications of Transparency
12.4. Moving Data from One Environment to Another
12.4.1. The CMSM Approach
12.4.2. A Data Warehouse Usage Monitor
12.4.3. The Extension of the Data Warehouse across Different Storage Media
12.5. Inverting the Data Warehouse
12.6. Total Cost
12.7. Maximum Capacity
12.8. Summary
13. The Relational and the Multidimensional Model as a Basis for Database Design
13.1. The Relational Model
13.2. The Multidimensional Model
13.3. Snowflake Structures
13.4. Differences between the Models
13.4.1. The Roots of the Differences
13.4.2. Reshaping Relational Data
13.4.3. Indirect Access and Direct Access of Data
13.4.4. Servicing Future Unknown Needs
13.4.5. Servicing the Need to Change Gracefully
13.5. Independent Data Marts
13.6. Building Independent Data Marts
13.7. Summary
14. Data Warehouse Advanced Topics
14.1. End-User Requirements and the Data Warehouse
14.1.1. The Data Warehouse and the Data Model
14.1.2. The Relational Foundation
14.1.3. The Data Warehouse and Statistical Processing
14.2. Resource Contention in the Data Warehouse
14.2.1. The Exploration Warehouse
14.2.2. The Data Mining Warehouse
14.2.3. Freezing the Exploration Warehouse
14.2.4. External Data and the Exploration Warehouse
14.3. Data Marts and Data Warehouses in the Same Processor
14.4. The Life Cycle of Data
14.4.1. Mapping the Life Cycle to the Data Warehouse Environment
14.5. Testing and the Data Warehouse
14.6. Tracing the Flow of Data through the Data Warehouse
14.6.1. Data Velocity in the Data Warehouse
14.6.2. "Pushing" and "Pulling" Data
14.7. Data Warehouse and the Web-Based eBusiness Environment
14.7.1. The Interface between the Two Environments
14.7.2. The Granularity Manager
14.7.3. Profile Records
14.7.4. The ODS, Profile Records, and Performance
14.8. The Financial Data Warehouse
14.9. The System of Record
14.10. A Brief History of Architecture — Evolving to the Corporate Information Factory
14.10.1. Evolving from the CIF
14.10.2. Obstacles
14.11. CIF — Into the Future
14.11.1. Analytics
14.11.2. ERP/SAP
14.11.3. Unstructured Data
14.11.4. Volumes of Data
14.12. Summary
15. Cost-Justification and Return on Investment for a Data Warehouse
15.1. Copying the Competition
15.2. The Macro Level of Cost-Justification
15.3. A Micro Level Cost-Justification
15.4. Information from the Legacy Environment
15.4.1. The Cost of New Information
15.4.2. Gathering Information with a Data Warehouse
15.4.3. Comparing the Costs
15.4.4. Building the Data Warehouse
15.4.5. A Complete Picture
15.4.6. Information Frustration
15.5. The Time Value of Data
15.5.1. The Speed of Information
15.6. Integrated Information
15.6.1. The Value of Historical Data
15.6.2. Historical Data and CRM
15.7. Summary
16. The Data Warehouse and the ODS
16.1. Complementary Structures
16.1.1. Updates in the ODS
16.1.2. Historical Data and the ODS
16.1.3. Profile Records
16.2. Different Classes of ODS
16.3. Database Design — A Hybrid Approach
16.4. Drawn to Proportion
16.5. Transaction Integrity in the ODS
16.6. Time Slicing the ODS Day
16.7. Multiple ODS
16.8. ODS and the Web Environment
16.9. An Example of an ODS
16.10. Summary
17. Corporate Information Compliance and Data Warehousing
17.1. Two Basic Activities
17.2. Financial Compliance
17.2.1. The "What"
17.2.2. The "Why"
17.3. Auditing Corporate Communications
17.4. Summary
18. The End-User Community
18.1. The Farmer
18.2. The Explorer
18.3. The Miner
18.4. The Tourist
18.5. The Community
18.6. Different Types of Data
18.7. Cost-Justification and ROI Analysis
18.8. Summary
19. Data Warehouse Design Review Checklist
19.1. When to Do a Design Review
19.2. Who Should Be in the Design Review?
19.3. What Should the Agenda Be?
19.4. The Results
19.5. Administering the Review
19.6. A Typical Data Warehouse Design Review
19.7. Summary
Glossary
References
Articles
Books
White Papers

Content preview from Building the Data Warehouse

Chapter 4. Granularity in the Data Warehouse

The single most important design issue facing the data warehouse developer is determining the proper level of granularity of the data that will reside in the data warehouse. When the level of granularity is properly set, the remaining aspects of design and implementation flow smoothly; when it is not properly set, every other aspect is awkward.

Granularity is also important to the warehouse architect because it affects all the environments that depend on the warehouse for data. Granularity affects how efficiently data can be shipped to the different environments and determines the types of analysis that can be done.

The primary issue with granularity is setting it at the right level: neither too high nor too low.

The trade-off in choosing the right level of granularity (as discussed in Chapter 2) centers on managing the volume of data: when data is stored at too fine a level of detail, it becomes so voluminous that it is unusable. In addition, if there is to be a truly large amount of data, consideration must be given to putting the inactive portion of the data into overflow storage.

Raw Estimates

The starting point for determining the appropriate level of granularity is to make a raw estimate of the number of rows of data and the amount of DASD (direct access storage device) space that the data warehouse will require. Admittedly, even in the best of circumstances, only an estimate can be made. ...
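
As a rough illustration of such a raw estimate, the sketch below multiplies guessed row counts by guessed row sizes to get a low-to-high range for total rows and storage. It is a minimal sketch under stated assumptions, not the book's procedure: the table names, row counts, byte sizes, and the names TableEstimate and raw_estimate are all hypothetical, and index space and DBMS overhead are ignored.

```python
from dataclasses import dataclass


@dataclass
class TableEstimate:
    """Rough sizing inputs for one warehouse table (all values are guesses)."""
    name: str
    min_rows: int       # fewest rows expected over the planning horizon
    max_rows: int       # most rows expected over the planning horizon
    min_row_bytes: int  # smallest plausible row size in bytes
    max_row_bytes: int  # largest plausible row size in bytes

    def storage_range(self):
        """Return (low, high) raw storage in bytes, ignoring indexes and overhead."""
        return (self.min_rows * self.min_row_bytes,
                self.max_rows * self.max_row_bytes)


def raw_estimate(tables):
    """Sum row counts and storage across tables into low/high totals."""
    low_rows = sum(t.min_rows for t in tables)
    high_rows = sum(t.max_rows for t in tables)
    low_bytes = sum(t.storage_range()[0] for t in tables)
    high_bytes = sum(t.storage_range()[1] for t in tables)
    return {"rows": (low_rows, high_rows), "bytes": (low_bytes, high_bytes)}


# Hypothetical tables and figures, for illustration only.
tables = [
    TableEstimate("customer", 50_000, 100_000, 200, 400),
    TableEstimate("account_activity", 5_000_000, 20_000_000, 80, 150),
]
estimate = raw_estimate(tables)
print(estimate)
```

Carrying a low and a high bound, rather than a single number, reflects the point that only an estimate can be made; an order-of-magnitude range is usually enough to start reasoning about granularity.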
