Data Architecture: A Primer for the Data Scientist, 2nd Edition

ISBN: 9780128169179

by W.H. Inmon, Daniel Linstedt, Mary Levins

April 2019

Beginner to intermediate

431 pages

11h 32m

English

Academic Press

Cover image
Title page
Table of Contents
Copyright
Dedication
Chapter 1.1: An Introduction to Data Architecture
Abstract; Subdividing Data; Repetitive/Nonrepetitive Unstructured Data; The Great Divide of Data; Textual/Nontextual Data; The Different Forms of Data; Business Value
Chapter 1.2: The Data Infrastructure
Abstract; Two Types of Repetitive Data; Repetitive Structured Data; Repetitive Big Data; The Two Infrastructures; What's Being Optimized?; Comparing the Two Infrastructures
Chapter 1.3: The “Great Divide”
Abstract; Classifying Corporate Data; The “Great Divide”; Repetitive Unstructured Data; Nonrepetitive Unstructured Data; Different Worlds
Chapter 1.4: Demographics of Corporate Data
Abstract
Chapter 1.5: Corporate Data Analysis
Abstract

Chapter 1.6: The Life Cycle of Data: Understanding Data Over Time
Abstract
Chapter 1.7: A Brief History of Data
Abstract; Paper Tape and Punch Cards; Magnetic Tapes; Disk Storage; Data Base Management System (DBMS); Coupled Processors; Online Transaction Processing; Data Warehouse; Parallel Data Management; Data Vault; Big Data; The Great Divide
Chapter 2.1: The End-State Architecture—The “World Map”
Abstract; Architectural Components; Different Kinds of Data in the End State Architecture; Shaping the Data Through Models; Where Is the Data Warehouse?; Where Different Types of Questions Are Answered Across the End State Architecture; Data in the Data Lake; Metadata in the End State Architecture; Networked Metadata; An Evolutionary Experience; The Data Lake Architecture
Chapter 3.1: Transformations in the End-State Architecture
Abstract; Redundant Data; Transformations; Customizing Data; Transforming Text; Transforming Application Data; Transforming Data Into a Customized State; Transforming Data Into Bulk Storage; Transforming Data Generated Automatically; Transforming Bulk Data; Transformation and Redundancy
Chapter 4.1: A Brief History of Big Data
Abstract; An Analogy: Taking the High Ground; Taking the High Ground; Standardization With the 360; Online Transaction Processing; Enter Teradata and MPP Processing; Then Came Hadoop and Big Data; IBM and Hadoop; Holding the High Ground
Chapter 4.2: What Is Big Data?
Abstract; Another Definition; Large Volumes; Inexpensive Storage; The Roman Census Approach; Unstructured Data; Data in Big Data; Context in Repetitive Data; Nonrepetitive Data; Context in Nonrepetitive Data
Chapter 4.3: Parallel Processing
Abstract
Chapter 4.4: Unstructured Data
Abstract; Textual Information: Everywhere; Decisions Based on Structured Data; The Business Value Proposition; Repetitive and Nonrepetitive Unstructured Information; Ease of Analysis; Contextualization; Some Approaches to Contextualization; Map Reduce; Manual Analysis
Chapter 4.5: Contextualizing Repetitive Unstructured Data
Abstract; Parsing Repetitive Unstructured Data; Recasting the Output Data
Chapter 4.6: Textual Disambiguation
Abstract; From Narrative Into an Analytical Data Base; Input Into Textual Disambiguation; Mapping; Input/Output; Document Fracturing/Named Value Processing; Preprocessing a Document; E-mails: A Special Case; Spreadsheets; Report Decompilation
Chapter 4.7: Taxonomies
Abstract; Data Models/Taxonomies; Applicability of Taxonomies; What Is a Taxonomy?; Taxonomies in Multiple Languages; Commercial or Private Taxonomies?; Dynamics of Taxonomies and Textual Disambiguation; Taxonomies and Textual Disambiguation: Separate Technologies; Different Types of Taxonomies; Taxonomies: Maintenance Over Time
Chapter 5.1: The Siloed Application Environment
Abstract; The Challenge of Siloed Applications; Building Siloed Applications; What Does a Siloed Application Look Like?; Current Valued Data; Minimal Historical Data; High Availability; Overlap Between Siloed Applications; Frozen Business Requirements; Dismantling Siloed Applications
Chapter 6.1: Introduction to Data Vault 2.0
Abstract; Data Vault Origins and Background; What Is Data Vault 2.0 Modeling?; How Is Data Vault 2.0 Methodology Defined?; Why Do We Need a Data Vault 2.0 Architecture?; Where Does Data Vault 2.0 Implementation Fit?; What Are the Business Benefits of Data Vault 2.0?; What Is Data Vault 1.0?
Chapter 6.2: Introduction to Data Vault Modeling
Abstract; What Is a Data Vault Model Concept?; Data Vault Model Defined; Components of a Data Vault Model; What Makes Business Keys So Interesting?; What Does This Have to Do With Data Vault and Data Warehousing?; How Does This Translate to Data Vault Modeling?; Why Restructure the Data From the Staging Area?; What Are the Basic Rules of the Data Vault Model?; Why Do We Need Many to Many Link Structures?; Primary Key Options for Data Vault 2.0
Chapter 6.3: Introduction to Data Vault Architecture
Abstract; What Is a Data Vault 2.0 Architecture?; How Does NoSQL Fit in to the Architecture?; What Are the Objectives of the Data Vault 2.0 Architecture?; What Is the Objective of the Data Vault 2.0 Model?; What Are Hard and Soft Business Rules?; How Does Managed Self Service BI Fit in the Architecture?
Chapter 6.4: Introduction to Data Vault Methodology
Abstract; Data Vault 2.0 Methodology Overview; How Does CMMI Contribute to the Methodology?; If CMMI Is So Great, Why Should We Care About Agility Then?; Why Include PMP, SDLC If CMMI and Agile Should Be All That's Needed?; So Then, What Does Six Sigma Contribute to the Data Vault 2 Methodology?; Where Does TQM (Total Quality Management) Fit in to All of This?
Chapter 6.5: Introduction to Data Vault Implementation
Abstract; Implementation Overview; What's So Important About Patterns?; Why Does Reengineering Happen Because of Big Data?; Why Do We Need to Virtualize Our Data Marts?; What Is Managed Self-Service BI?
Chapter 7.1: The Operational Environment: A Short History
Abstract; Commercial Uses of the Computer; The First Applications; Ed Yourdon and the Structured Revolution; The SDLC; Disk Technology; Enter the DBMS; Response Time and Availability; Corporate Computing Today
Chapter 7.2: The Standard Work Unit
Abstract; Elements of Response Time; An Hourglass Analogy; The Racetrack Analogy; Your Vehicle Runs as Fast as the Vehicle in Front of It; The Standard Work Unit; The SLA
Chapter 7.3: Data Modeling for the Structured Environment
Abstract; The Purpose of the Roadmap; Granular Data Only; The ERD; The Dis; Physical Data Base Design; Relating the Different Levels of the Data Model; An Example of the Linkage; Generic Data Models; Operational Data Models/Data Warehouse Data Models
Chapter 8.1: A Brief History of Data Architecture
Abstract
Chapter 8.2: Big Data/Existing System Interface
Abstract; The Big Data/Existing Systems Interface; The Repetitive Raw Big Data/Existing Systems Interface; Exception Based Data; The Nonrepetitive Raw Big Data/Existing Systems Interface; Into the Existing Systems Environment; The “Context Enriched” Big Data Environment; Analyzing Structured Data/Unstructured Data Together
Chapter 8.3: The Data Warehouse/Operational Environment Interface
Abstract; The Operational/Data Warehouse Interface; The Classical ETL Interface; The ODS and the ETL Interface; The Staging Area; Changed Data Capture; Inline Transformation; ELT Processing
Chapter 8.4: Data Architecture: A High-Level Perspective
Abstract; A High Level Perspective; Redundancy; The System of Record; Different Types of Questions; Different Communities
Chapter 9.1: Repetitive Analytics: Some Basics
Abstract; Different Kinds of Analysis; Looking for Patterns; Heuristic Processing; Freezing Data; The Sandbox; The “Normal” Profile; Distillation, Filtering; Subsetting Data; Bias of the Sample; Filtering Data; Repetitive Data and Context; Linking Repetitive Records; Log Tape Records; Analyzing Points of Data; Outliers; Data Over Time
Chapter 9.2: Analyzing Repetitive Data
Abstract; Log Data; Active/Passive Indexing of Data; Summary/Detailed Data; Metadata in Big Data; Linking Data
Chapter 9.3: Repetitive Analysis
Abstract; Internal, External Data; Universal Identifiers; Security; Filtering, Distillation; Archiving Results; Metrics
Chapter 10.1: Nonrepetitive Data
Abstract; Inline Contextualization; Taxonomy/Ontology Processing; Custom Variables; Homographic Resolution; Acronym Resolution; Negation Analysis; Numeric Tagging; Date Tagging; Date Standardization; List Processing; Associative Word Processing; Stop Word Processing; Word Stemming; Document Metadata; Document Classification; Proximity Analysis; Functional Sequencing Within Textual ETL; Internal Referential Integrity; Preprocessing, Postprocessing
Chapter 10.2: Mapping
Abstract
Chapter 10.3: Analytics From Nonrepetitive Data
Abstract; Call Center Information; Medical Records
Chapter 11.1: Operational Analytics: Response Time
Abstract; Transaction Response Time
Chapter 12.1: Operational Analytics
Abstract; Different Perspectives of Data; Data Marts; The Operational Data Store: ODS
Chapter 13.1: Personal Analytics
Abstract
Chapter 14.1: Data Models Across the End-State Architecture
Abstract; The Different Data Models; Functional Decomposition and Data Flow Diagrams; The Corporate Data Model; The Star Join/Dimensional Data Model; Taxonomies/Ontologies; The Selective Subdivision of Data; Proactive/Reactive Data Models
Chapter 15.1: The System of Record
Abstract; The End User Cycle of Awareness; The System of Record; The System of Record in the End State Architecture; The Role of Age in the System of Record; A Simple Example; The Flow of Data in the System of Record; Other Data Than the System of Record; Is Data Updated in the System of Record?; Detailed and Summary Data in the System of Record; Auditing Data and the System of Record; Text and the System of Record
Chapter 16.1: Business Value and the End-State Architecture
Abstract; The Evolution of the End State Architecture; What is Meant by “Business Value”; Tactical Business Value/Strategic Business Value; Volume of Data Versus Business Value; The “Million in One” Syndrome; Where Business Value Occurs; Data Relevancy Over Time; Where Tactical Decisions Are Made
Chapter 17.1: Managing Text
Abstract; The Challenge of Text; The Challenge of Context; The Processing Components of Textual ETL; Secondary Analysis; Visualization; Merging Text Based Data and Structured Data
Chapter 18.1: An Introduction to Data Visualizations
Abstract; Introduction to Data Visualizations: Overview; Purpose and Context; Visualization: A Science and an Art; Visualization Framework; Step 1: Define; Step 2: Data; Step 3: Design; Step 4: Distribute; Data Visualization Tools and Software; Summary
Glossary
Index

Content preview from Data Architecture: A Primer for the Data Scientist, 2nd Edition

Chapter 10.1

Nonrepetitive Data

Abstract

Nonrepetitive analytics begins with the contextualization of the nonrepetitive data. Unlike repetitive data, the context of nonrepetitive data is difficult to determine. The context of nonrepetitive big data is determined by textual disambiguation. In textual disambiguation, there are algorithms that relate to stop word resolution, stemming, homographic resolution, inline contextualization, taxonomy/ontology resolution, custom variable resolution, acronym resolution, and so forth. Nonrepetitive analytics is very relevant to business value. Some typical forms of nonrepetitive analytics include the analysis of medical records, warranty analysis, insurance claim analysis, and call center analysis.
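
As a rough illustration of two of the algorithms the abstract names, stop word resolution and acronym resolution, the sketch below strips low-value words from a sentence and expands known acronyms. The stop word list, the acronym table, and every function name here are hypothetical and chosen only for the example. They are not taken from the book, and real textual disambiguation involves far more than this.

```python
import re

# Hypothetical stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "was", "to", "for", "and"}

# Hypothetical acronym table for a medical-records setting.
ACRONYMS = {"bp": "blood pressure", "hr": "heart rate"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little analytical meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def resolve_acronyms(tokens):
    """Replace known acronyms with their expanded form."""
    return [ACRONYMS.get(t, t) for t in tokens]


def contextualize(text):
    """Apply the steps in order: tokenize, remove stop words, expand acronyms."""
    return resolve_acronyms(remove_stop_words(tokenize(text)))


print(contextualize("The BP of the patient was elevated"))
# ['blood pressure', 'patient', 'elevated']
```

The order of the steps is a design choice for this sketch: stop words are removed before acronym expansion so that an expanded phrase is never filtered out by accident.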
