Data Architecture: A Primer for the Data Scientist

by W.H. Inmon, Daniel Linstedt

November 2014

Beginner to intermediate

378 pages

10h 33m

English

Morgan Kaufmann

Cover
Title page
Table of Contents
Copyright
Dedication
Preface
About the Authors
1.1: Corporate Data
Abstract; The Totality of Data Across the Corporation; Dividing Unstructured Data; Business Relevancy; Big Data; The Great Divide; The Continental Divide; The Complete Picture
1.2: The Data Infrastructure
Abstract; Two Types of Repetitive Data; Repetitive Structured Data; Repetitive Big Data; The Two Infrastructures; What’s being Optimized?; Comparing the Two Infrastructures
1.3: The “Great Divide”
Abstract; Classifying Corporate Data; The “Great Divide”; Repetitive Unstructured Data; Nonrepetitive Unstructured Data; Different Worlds

1.4: Demographics of Corporate Data
Abstract
1.5: Corporate Data Analysis
Abstract
1.6: The Life Cycle of Data – Understanding Data Over Time
Abstract
1.7: A Brief History of Data
Abstract; Paper Tape and Punch Cards; Magnetic Tapes; Disk Storage; Database Management System; Coupled Processors; Online Transaction Processing; Data Warehouse; Parallel Data Management; Data Vault; Big Data; The Great Divide
2.1: A Brief History of Big Data
Abstract; An Analogy – Taking the High Ground; Taking the High Ground; Standardization with the 360; Online Transaction Processing; Enter Teradata and Massively Parallel Processing; Then Came Hadoop and Big Data; IBM and Hadoop; Holding the High Ground
2.2: What is Big Data?
Abstract; Another Definition; Large Volumes; Inexpensive Storage; The Roman Census Approach; Unstructured Data; Data in Big Data; Context in Repetitive Data; Nonrepetitive Data; Context in Nonrepetitive Data
2.3: Parallel Processing
Abstract
2.4: Unstructured Data
Abstract; Textual Information Everywhere; Decisions Based on Structured Data; The Business Value Proposition; Repetitive and Nonrepetitive Unstructured Information; Ease of Analysis; Contextualization; Some Approaches to Contextualization; MapReduce; Manual Analysis
2.5: Contextualizing Repetitive Unstructured Data
Abstract; Parsing Repetitive Unstructured Data; Recasting the Output Data
2.6: Textual Disambiguation
Abstract; From Narrative into an Analytical Database; Input into Textual Disambiguation; Mapping; Input/Output; Document Fracturing/Named Value Processing; Preprocessing a Document; Emails – A Special Case; Spreadsheets; Report Decompilation
2.7: Taxonomies
Abstract; Data Models and Taxonomies; Applicability of Taxonomies; What is a Taxonomy?; Taxonomies in Multiple Languages; Dynamics of Taxonomies and Textual Disambiguation; Taxonomies and Textual Disambiguation – Separate Technologies; Different Types of Taxonomies; Taxonomies – Maintenance Over Time
3.1: A Brief History of Data Warehouse
Abstract; Early Applications; Online Applications; Extract Programs; 4GL Technology; Personal Computers; Spreadsheets; Integrity of Data; Spider-Web Systems; The Maintenance Backlog; The Data Warehouse; To an Architected Environment; To the CIF; DW 2.0
3.2: Integrated Corporate Data
Abstract; Many Applications; Looking Across the Corporation; More Than One Analyst; ETL Technology; The Challenges of Integration; The Benefits of a Data Warehouse; The Granular Perspective
3.3: Historical Data
Abstract
3.4: Data Marts
Abstract; Granular Data; Relational Database Design; The Data Mart; Key Performance Indicators; The Dimensional Model; Combining the Data Warehouse and Data Marts
3.5: The Operational Data Store
Abstract; Online Transaction Processing on Integrated Data; The Operational Data Store; ODS and the Data Warehouse; ODS Classes; External Updates into the ODS; The ODS/Data Warehouse Interface
3.6: What a Data Warehouse is Not
Abstract; A Simple Data Warehouse Architecture; Online High-Performance Transaction Processing in the Data Warehouse; Integrity of Data; The Data Warehouse Workload; Statistical Processing from the Data Warehouse; The Frequency of Statistical Processing; The Exploration Warehouse
4.1: Introduction to Data Vault
Abstract; Data Vault 2.0 Modeling; Data Vault 2.0 Methodology Defined; Data Vault 2.0 Architecture; Data Vault 2.0 Implementation; Business Benefits of Data Vault 2.0; Data Vault 1.0
4.2: Introduction to Data Vault Modeling
Abstract; A Data Vault Model Concept; Data Vault Model Defined; Components of a Data Vault Model; Data Vault and Data Warehousing; Translating to Data Vault Modeling; Data Restructure; Basic Rules of Data Vault Modeling; Why We Need Many-to-Many Link Structures; Hash keys Instead of Sequence Numbers
4.3: Introduction to Data Vault Architecture
Abstract; Data Vault 2.0 Architecture; How NoSQL Fits into the Architecture; Data Vault 2.0 Architecture Objectives; Data Vault 2.0 Modeling Objective; Hard and Soft Business Rules; Managed SSBI and the Architecture
4.4: Introduction to Data Vault Methodology
Abstract; Data Vault 2.0 Methodology Overview; CMMI and Data Vault 2.0 Methodology; CMMI Versus Agility; Project Management Practices and SDLC Versus CMMI and Agile; Six Sigma and Data Vault 2.0 Methodology; Total Quality Management
4.5: Introduction to Data Vault Implementation
Abstract; Implementation Overview; The Importance of Patterns; Reengineering and Big Data; Virtualize Our Data Marts; Managed Self-Service BI
5.1: The Operational Environment – A Short History
Abstract; Commercial Uses of the Computer; The First Applications; Ed Yourdon and the Structured Revolution; System Development Life Cycle; Disk Technology; Enter the Database Management System; Response Time and Availability; Corporate Computing Today
5.2: The Standard Work Unit
Abstract; Elements of Response Time; An Hourglass Analogy; The Racetrack Analogy; Your Vehicle Runs as Fast as the Vehicle in Front of It; The Standard Work Unit; The Service Level Agreement
5.3: Data Modeling for the Structured Environment
Abstract; The Purpose of the Road Map; Granular Data Only; The Entity Relationship Diagram; The DIS; Physical Database Design; Relating the Different Levels of the Data Model; An Example of the Linkage; Generic Data Models; Operational Data Models and Data Warehouse Data Models
5.4: Metadata
Abstract; Typical Metadata; The Repository; Using Metadata; Analytical Uses of Metadata; Looking at Multiple Systems; The Lineage of Data; Comparing Existing Systems to Proposed Systems
5.5: Data Governance of Structured Data
Abstract; A Corporate Activity; Motivations for Data Governance; Repairing Data; Granular, Detailed Data; Documentation; Data Stewardship
6.1: A Brief History of Data Architecture
Abstract
6.2: Big Data/Existing Systems Interface
Abstract; The Big Data/Existing Systems Interface; The Repetitive Raw Big Data/Existing Systems Interface; Exception-Based Data; The Nonrepetitive Raw Big Data/Existing Systems Interface; Into the Existing Systems Environment; The “Context-Enriched” Big Data Environment; Analyzing Structured Data/Unstructured Data Together
6.3: The Data Warehouse/Operational Environment Interface
Abstract; The Operational/Data Warehouse Interface; The Classical ETL Interface; The Operational Data Store/ETL Interface; The Staging Area; Changed Data Capture; Inline Transformation; ELT Processing
6.4: Data Architecture – A High-Level Perspective
Abstract; A High-Level Perspective; Redundancy; The System of Record; Different Communities
7.1: Repetitive Analytics – Some Basics
Abstract; Different Kinds of Analysis; Looking for Patterns; Heuristic Processing; The Sandbox; The “Normal” Profile; Distillation, Filtering; Subsetting Data; Filtering Data; Repetitive Data and Context; Linking Repetitive Records; Log Tape Records; Analyzing Points of Data; Data Over Time
7.2: Analyzing Repetitive Data
Abstract; Log Data; Active/Passive Indexing of Data; Summary/Detailed Data; Metadata in Big Data; Linking Data
7.3: Repetitive Analysis
Abstract; Internal, External Data; Universal Identifiers; Security; Filtering, Distillation; Archiving Results; Metrics
8.1: Nonrepetitive Data
Abstract; Inline Contextualization; Taxonomy/Ontology Processing; Custom Variables; Homographic Resolution; Acronym Resolution; Negation Analysis; Numeric Tagging; Date Tagging; Date Standardization; List Processing; Associative Word Processing; Stop Word Processing; Word Stemming; Document Metadata; Document Classification; Proximity Analysis; Functional Sequencing within Textual ETL; Internal Referential Integrity; Preprocessing, Postprocessing
8.2: Mapping
Abstract
8.3: Analytics from Nonrepetitive Data
Abstract; Call Center Information; Medical Records
9.1: Operational Analytics
Abstract; Transaction Response Time
10.1: Operational Analytics
Abstract
11.1: Personal Analytics
Abstract
12.1: A Composite Data Architecture
Abstract
Glossary
Index

Overview

Today, the world is trying to create and educate data scientists because of the phenomenon of Big Data, and everyone is looking deeply into this technology. But no one is looking at the larger architectural picture of how Big Data needs to fit within existing systems (data warehousing systems). Taking a look at the larger picture into which Big Data fits gives the data scientist the necessary context for how the pieces of the puzzle should fit together. Most references on Big Data look at only one tiny part of a much larger whole. Until the data gathered can be put into an existing framework or architecture, it can’t be used to its full potential. Data Architecture: A Primer for the Data Scientist addresses the larger architectural picture of how Big Data fits with the existing information infrastructure, an essential topic for the data scientist.

Drawing upon years of practical experience and using numerous examples and an easy-to-understand framework, W.H. Inmon and Daniel Linstedt define the importance of data architecture and how it can be used effectively to harness Big Data within existing systems. You’ll be able to:

Turn textual information into a form that can be analyzed by standard tools
Make the connection between analytics and Big Data
Understand how Big Data fits within an existing systems environment
Conduct analytics on repetitive and non-repetitive data

Discusses the often-overlooked value in Big Data, non-repetitive data, and why there is significant business value in using it
Shows how to turn textual information into a form that can be analyzed by standard tools
Explains how Big Data fits within an existing systems environment
Presents new opportunities that are afforded by the advent of Big Data
Demystifies the murky waters of repetitive and non-repetitive data in Big Data

Publisher Resources

ISBN: 9780128020449
