Data Architecture: A Primer for the Data Scientist

by W.H. Inmon, Daniel Linstedt

November 2014

Beginner to intermediate

378 pages

10h 33m

English

Morgan Kaufmann

Cover
Title page
Table of Contents
Copyright
Dedication
Preface
About the Authors
1.1: Corporate Data
Abstract; The Totality of Data Across the Corporation; Dividing Unstructured Data; Business Relevancy; Big Data; The Great Divide; The Continental Divide; The Complete Picture
1.2: The Data Infrastructure
Abstract; Two Types of Repetitive Data; Repetitive Structured Data; Repetitive Big Data; The Two Infrastructures; What’s being Optimized?; Comparing the Two Infrastructures
1.3: The “Great Divide”
Abstract; Classifying Corporate Data; The “Great Divide”; Repetitive Unstructured Data; Nonrepetitive Unstructured Data; Different Worlds
1.4: Demographics of Corporate Data
Abstract
1.5: Corporate Data Analysis
Abstract
1.6: The Life Cycle of Data – Understanding Data Over Time
Abstract
1.7: A Brief History of Data
Abstract; Paper Tape and Punch Cards; Magnetic Tapes; Disk Storage; Database Management System; Coupled Processors; Online Transaction Processing; Data Warehouse; Parallel Data Management; Data Vault; Big Data; The Great Divide
2.1: A Brief History of Big Data
Abstract; An Analogy – Taking the High Ground; Taking the High Ground; Standardization with the 360; Online Transaction Processing; Enter Teradata and Massively Parallel Processing; Then Came Hadoop and Big Data; IBM and Hadoop; Holding the High Ground
2.2: What is Big Data?
Abstract; Another Definition; Large Volumes; Inexpensive Storage; The Roman Census Approach; Unstructured Data; Data in Big Data; Context in Repetitive Data; Nonrepetitive Data; Context in Nonrepetitive Data
2.3: Parallel Processing
Abstract
2.4: Unstructured Data
Abstract; Textual Information Everywhere; Decisions Based on Structured Data; The Business Value Proposition; Repetitive and Nonrepetitive Unstructured Information; Ease of Analysis; Contextualization; Some Approaches to Contextualization; MapReduce; Manual Analysis
2.5: Contextualizing Repetitive Unstructured Data
Abstract; Parsing Repetitive Unstructured Data; Recasting the Output Data
2.6: Textual Disambiguation
Abstract; From Narrative into an Analytical Database; Input into Textual Disambiguation; Mapping; Input/Output; Document Fracturing/Named Value Processing; Preprocessing a Document; Emails – A Special Case; Spreadsheets; Report Decompilation
2.7: Taxonomies
Abstract; Data Models and Taxonomies; Applicability of Taxonomies; What is a Taxonomy?; Taxonomies in Multiple Languages; Dynamics of Taxonomies and Textual Disambiguation; Taxonomies and Textual Disambiguation – Separate Technologies; Different Types of Taxonomies; Taxonomies – Maintenance Over Time
3.1: A Brief History of Data Warehouse
Abstract; Early Applications; Online Applications; Extract Programs; 4GL Technology; Personal Computers; Spreadsheets; Integrity of Data; Spider-Web Systems; The Maintenance Backlog; The Data Warehouse; To an Architected Environment; To the CIF; DW 2.0
3.2: Integrated Corporate Data
Abstract; Many Applications; Looking Across the Corporation; More Than One Analyst; ETL Technology; The Challenges of Integration; The Benefits of a Data Warehouse; The Granular Perspective
3.3: Historical Data
Abstract
3.4: Data Marts
Abstract; Granular Data; Relational Database Design; The Data Mart; Key Performance Indicators; The Dimensional Model; Combining the Data Warehouse and Data Marts
3.5: The Operational Data Store
Abstract; Online Transaction Processing on Integrated Data; The Operational Data Store; ODS and the Data Warehouse; ODS Classes; External Updates into the ODS; The ODS/Data Warehouse Interface
3.6: What a Data Warehouse is Not
Abstract; A Simple Data Warehouse Architecture; Online High-Performance Transaction Processing in the Data Warehouse; Integrity of Data; The Data Warehouse Workload; Statistical Processing from the Data Warehouse; The Frequency of Statistical Processing; The Exploration Warehouse
4.1: Introduction to Data Vault
Abstract; Data Vault 2.0 Modeling; Data Vault 2.0 Methodology Defined; Data Vault 2.0 Architecture; Data Vault 2.0 Implementation; Business Benefits of Data Vault 2.0; Data Vault 1.0
4.2: Introduction to Data Vault Modeling
Abstract; A Data Vault Model Concept; Data Vault Model Defined; Components of a Data Vault Model; Data Vault and Data Warehousing; Translating to Data Vault Modeling; Data Restructure; Basic Rules of Data Vault Modeling; Why We Need Many-to-Many Link Structures; Hash keys Instead of Sequence Numbers
4.3: Introduction to Data Vault Architecture
Abstract; Data Vault 2.0 Architecture; How NoSQL Fits into the Architecture; Data Vault 2.0 Architecture Objectives; Data Vault 2.0 Modeling Objective; Hard and Soft Business Rules; Managed SSBI and the Architecture
4.4: Introduction to Data Vault Methodology
Abstract; Data Vault 2.0 Methodology Overview; CMMI and Data Vault 2.0 Methodology; CMMI Versus Agility; Project Management Practices and SDLC Versus CMMI and Agile; Six Sigma and Data Vault 2.0 Methodology; Total Quality Management
4.5: Introduction to Data Vault Implementation
Abstract; Implementation Overview; The Importance of Patterns; Reengineering and Big Data; Virtualize Our Data Marts; Managed Self-Service BI
5.1: The Operational Environment – A Short History
Abstract; Commercial Uses of the Computer; The First Applications; Ed Yourdon and the Structured Revolution; System Development Life Cycle; Disk Technology; Enter the Database Management System; Response Time and Availability; Corporate Computing Today
5.2: The Standard Work Unit
Abstract; Elements of Response Time; An Hourglass Analogy; The Racetrack Analogy; Your Vehicle Runs as Fast as the Vehicle in Front of It; The Standard Work Unit; The Service Level Agreement
5.3: Data Modeling for the Structured Environment
Abstract; The Purpose of the Road Map; Granular Data Only; The Entity Relationship Diagram; The DIS; Physical Database Design; Relating the Different Levels of the Data Model; An Example of the Linkage; Generic Data Models; Operational Data Models and Data Warehouse Data Models
5.4: Metadata
Abstract; Typical Metadata; The Repository; Using Metadata; Analytical Uses of Metadata; Looking at Multiple Systems; The Lineage of Data; Comparing Existing Systems to Proposed Systems
5.5: Data Governance of Structured Data
Abstract; A Corporate Activity; Motivations for Data Governance; Repairing Data; Granular, Detailed Data; Documentation; Data Stewardship
6.1: A Brief History of Data Architecture
Abstract
6.2: Big Data/Existing Systems Interface
Abstract; The Big Data/Existing Systems Interface; The Repetitive Raw Big Data/Existing Systems Interface; Exception-Based Data; The Nonrepetitive Raw Big Data/Existing Systems Interface; Into the Existing Systems Environment; The “Context-Enriched” Big Data Environment; Analyzing Structured Data/Unstructured Data Together
6.3: The Data Warehouse/Operational Environment Interface
Abstract; The Operational/Data Warehouse Interface; The Classical ETL Interface; The Operational Data Store/ETL Interface; The Staging Area; Changed Data Capture; Inline Transformation; ELT Processing
6.4: Data Architecture – A High-Level Perspective
Abstract; A High-Level Perspective; Redundancy; The System of Record; Different Communities
7.1: Repetitive Analytics – Some Basics
Abstract; Different Kinds of Analysis; Looking for Patterns; Heuristic Processing; The Sandbox; The “Normal” Profile; Distillation, Filtering; Subsetting Data; Filtering Data; Repetitive Data and Context; Linking Repetitive Records; Log Tape Records; Analyzing Points of Data; Data Over Time
7.2: Analyzing Repetitive Data
Abstract; Log Data; Active/Passive Indexing of Data; Summary/Detailed Data; Metadata in Big Data; Linking Data
7.3: Repetitive Analysis
Abstract; Internal, External Data; Universal Identifiers; Security; Filtering, Distillation; Archiving Results; Metrics
8.1: Nonrepetitive Data
Abstract; Inline Contextualization; Taxonomy/Ontology Processing; Custom Variables; Homographic Resolution; Acronym Resolution; Negation Analysis; Numeric Tagging; Date Tagging; Date Standardization; List Processing; Associative Word Processing; Stop Word Processing; Word Stemming; Document Metadata; Document Classification; Proximity Analysis; Functional Sequencing within Textual ETL; Internal Referential Integrity; Preprocessing, Postprocessing
8.2: Mapping
Abstract
8.3: Analytics from Nonrepetitive Data
Abstract; Call Center Information; Medical Records
9.1: Operational Analytics
Abstract; Transaction Response Time
10.1: Operational Analytics
Abstract
11.1: Personal Analytics
Abstract
12.1: A Composite Data Architecture
Abstract
Glossary
Index

Content preview from Data Architecture: A Primer for the Data Scientist

1.3

The “Great Divide”

Abstract

Corporate data consists of structured data and unstructured data. Unstructured data, in turn, consists of repetitive and nonrepetitive data. The separation between repetitive and nonrepetitive data can be called the “great divide”. Repetitive Big Data centers on Hadoop, where most activity consists of data management functions for very large volumes of data. Nonrepetitive data is organized around textual disambiguation, which includes functions such as subdocument processing, inline contextualization, taxonomical resolution, acronym resolution, standardization, stop word processing, homographic resolution, and proximity resolution, among others.

Keywords

corporate data

Hadoop

Big Data

textual disambiguation ...

Publisher Resources

ISBN: 9780128020449
