Getting Data Right

Book Description

Over the last 20 years, companies have invested roughly $3-4 trillion in enterprise software. These investments have focused primarily on developing and deploying single systems, applications, functions, and geographies aimed at automating and optimizing key business processes. Companies are now investing heavily in big data analytics ($44 billion in 2014 alone) in an effort to begin analyzing all of the data generated by their process automation systems. But companies are quickly realizing that one of their key bottlenecks is Data Variety: the siloed nature of data that results naturally from the proliferation of internal and external sources.

The problem of big data variety has crept up from the bottom, and the cost of variety is only appreciated when companies attempt to ask simple questions across many business silos (divisions, geographies, functions, etc.). Current top-down, deterministic data unification approaches (such as ETL, ELT, and MDM) were simply not designed to scale to the variety of hundreds, thousands, or even tens of thousands of data silos.

Download this free eBook to learn about the fundamental challenges that Data Variety poses to enterprises looking to maximize the value of their existing investments, and how new approaches promise to help organizations embrace and leverage the fundamental diversity of data. Readers will also find best practices for designing bottom-up and probabilistic methods for finding and managing data; principles for doing data science at scale in the big data era; guidance on preparing and unifying data in ways that complement existing systems and on optimizing data warehousing; and an explanation of how to use “data ops” to automate large-scale integration.

Table of Contents

  1. Introduction
  2. 1. The Solution: Data Curation at Scale
    1. Three Generations of Data Integration Systems
    2. Five Tenets for Success
      1. Tenet 1: Data Curation Is Never Done
      2. Tenet 2: A PhD in AI Can’t Be a Requirement for Success
      3. Tenet 3: Fully Automatic Data Curation Is Not Likely to Be Successful
      4. Tenet 4: Data Curation Must Fit into the Enterprise Ecosystem
      5. Tenet 5: A Scheme for “Finding” Data Sources Must Be Present
  3. 2. An Alternative Approach to Data Management
    1. Centralized Planning Approaches
    2. Common Information
    3. Information Chaos
    4. What Is to Be Done?
    5. Take a Federal Approach to Data Management
    6. Use All the New Tools at Your Disposal
    7. Don’t Model, Catalog
      1. Cataloging Tools
    8. Keep Everything Simple and Straightforward
    9. Use an Ecological Approach
  4. 3. Pragmatic Challenges in Building Data Cleaning Systems
    1. Data Cleaning Challenges
      1. 1. Scale
      2. 2. Human in the Loop
      3. 3. Expressing and Discovering Quality Constraints
      4. 4. Heterogeneity and Interaction of Quality Rules
      5. 5. Data and Constraints Decoupling and Interplay
      6. 6. Data Variety
      7. 7. Iterative by Nature, Not Design
    2. Building Adoptable Data Cleaning Solutions
  5. 4. Understanding Data Science: An Emerging Discipline for Data-Intensive Discovery
    1. Data Science: A New Discovery Paradigm That Will Transform Our World
      1. Significance of DIA and Data Science
      2. Illustrious Histories: The Origins of Data Science
      3. What Could Possibly Go Wrong?
      4. Do We Understand Data Science?
      5. Cornerstone of a New Discovery Paradigm
    2. Data Science: A Perspective
    3. Understanding Data Science from Practice
      1. Methodology to Better Understand DIA
      2. DIA Processes
      3. Characteristics of Large-Scale DIA Use Cases
      4. Looking Into a Use Case
    4. Research for an Emerging Discipline
      1. Acknowledgment
  6. 5. From DevOps to DataOps
    1. Why It’s Time to Embrace “DataOps” as a New Discipline
    2. From DevOps to DataOps
    3. Defining DataOps
    4. Changing the Fundamental Infrastructure
    5. DataOps Methodology
    6. Integrating DataOps into Your Organization
    7. The Four Processes of DataOps
      1. Data Engineering
      2. Data Integration
      3. Data Quality
      4. Data Security
    8. Better Information, Analytics, and Decisions
  7. 6. Data Unification Brings Out the Best in Installed Data Management Strategies
    1. Positioning ETL and MDM
      1. Extract, Transform, and Load
      2. Master Data Management
    2. Clustering to Meet the Rising Data Tide
    3. Embracing Data Variety with Data Unification
    4. Data Unification Is Additive
      1. Data Unification and Master Data Management
      2. Data Unification and ETL
      3. Changing Infrastructure
    5. Probabilistic Approach to Data Unification