Getting Data Right

Name: Getting Data Right
Author: Shannon Cutt
ISBN: 9781491935316

by Shannon Cutt

September 2015

Beginner to intermediate

52 pages

1h 51m

English

O'Reilly Media, Inc.

Introduction
1. The Solution: Data Curation at Scale
    Three Generations of Data Integration Systems
    Five Tenets for Success
    Tenet 1: Data Curation Is Never Done
    Tenet 2: A PhD in AI Can’t be a Requirement for Success
    Tenet 3: Fully Automatic Data Curation Is Not Likely to Be Successful
    Tenet 4: Data Curation Must Fit into the Enterprise Ecosystem
    Tenet 5: A Scheme for “Finding” Data Sources Must Be Present
2. An Alternative Approach to Data Management
    Centralized Planning Approaches
    Common Information
    Information Chaos
    What Is to Be Done?
    Take a Federal Approach to Data Management
    Use All the New Tools at Your Disposal
    Don’t Model, Catalog
    Cataloging Tools
    Keep Everything Simple and Straightforward
    Use an Ecological Approach
3. Pragmatic Challenges in Building Data Cleaning Systems
    Data Cleaning Challenges
    1. Scale
    2. Human in the Loop
    3. Expressing and Discovering Quality Constraints
    4. Heterogeneity and Interaction of Quality Rules
    5. Data and Constraints Decoupling and Interplay
    6. Data Variety
    7. Iterative by Nature, Not Design
    Building Adoptable Data Cleaning Solutions
4. Understanding Data Science: An Emerging Discipline for Data-Intensive Discovery
    Data Science: A New Discovery Paradigm That Will Transform Our World
    Significance of DIA and Data Science
    Illustrious Histories: The Origins of Data Science
    What Could Possibly Go Wrong?
    Do We Understand Data Science?
    Cornerstone of a New Discovery Paradigm
    Data Science: A Perspective
    Understanding Data Science from Practice
    Methodology to Better Understand DIA
    DIA Processes
    Characteristics of Large-Scale DIA Use Cases
    Looking Into a Use Case
    Research for an Emerging Discipline
    Acknowledgment
5. From DevOps to DataOps
    Why It’s Time to Embrace “DataOps” as a New Discipline
    From DevOps to DataOps
    Defining DataOps
    Changing the Fundamental Infrastructure
    DataOps Methodology
    Integrating DataOps into Your Organization
    The Four Processes of DataOps
    Data Engineering
    Data Integration
    Data Quality
    Data Security
    Better Information, Analytics, and Decisions
6. Data Unification Brings Out the Best in Installed Data Management Strategies
    Positioning ETL and MDM
    Extract, Transform, and Load
    Master Data Management
    Clustering to Meet the Rising Data Tide
    Embracing Data Variety with Data Unification
    Data Unification Is Additive
    Data Unification and Master Data Management
    Data Unification and ETL
    Changing Infrastructure
    Probabilistic Approach to Data Unification

Overview

Over the last 20 years, companies have invested roughly $3 trillion to $4 trillion in enterprise software. These investments have focused primarily on developing and deploying single systems, applications, functions, and geographies targeted at automating and optimizing key business processes. Companies are now investing heavily in big data analytics ($44 billion in 2014 alone) in an effort to analyze all of the data generated by their process automation systems. But they are quickly realizing that one of their key bottlenecks is Data Variety: the siloed nature of data that results naturally from the proliferation of internal and external sources.

The problem of big data variety has crept up from the bottom, and the cost of variety is appreciated only when companies attempt to ask simple questions across many business silos (divisions, geographies, functions, etc.). Current top-down, deterministic data unification approaches (such as ETL, ELT, and MDM) were simply not designed to scale to the variety of hundreds, thousands, or even tens of thousands of data silos.

Download this free eBook to learn about the fundamental challenges that Data Variety poses to enterprises looking to maximize the value of their existing investments, and how new approaches promise to help organizations embrace and leverage the fundamental diversity of data. Readers will also find best practices for designing bottom-up and probabilistic methods for finding and managing data; principles for doing data science at scale in the big data era; techniques for preparing and unifying data in ways that complement existing systems; guidance on optimizing data warehousing; and ways to use “DataOps” to automate large-scale integration.

Publisher Resources

ISBN: 9781491935361

Errata Page
