Getting Data Right by Shannon Cutt

Introduction

Companies have invested an estimated $3–4 trillion in IT over the last 20-plus years, most of it directed at developing and deploying single-vendor applications to automate and optimize key business processes. And what has been the result of all of this disparate activity? Data silos, schema proliferation, and radical data heterogeneity.

With companies now investing heavily in big data analytics, this entropy is making analysis considerably more complex. The complexity is most visible when companies attempt to ask “simple” questions of data that is spread across many business silos (divisions, geographies, or functions). Questions as simple as “Are we getting the best price for everything we buy?” often go unanswered because, on their own, top-down, deterministic data unification approaches aren’t prepared to scale to the variety of hundreds, thousands, or tens of thousands of data silos.

The diversity and mutability of enterprise data and semantics should lead CDOs to explore—as a complement to deterministic systems—a new bottom-up, probabilistic approach that connects data across the organization and exploits big data variety. In managing data, we should look for solutions that find siloed data and connect it into a unified view. “Getting Data Right” means embracing variety and transforming it from a roadblock into ROI. Throughout this report, you’ll learn how to question conventional assumptions, and explore alternative approaches to managing big data in the enterprise. Here’s a summary of the topics we’ll cover:

Chapter 1, The Solution: Data Curation at Scale

Michael Stonebraker, 2015 A.M. Turing Award winner, argues that it’s impractical to try to meet today’s data integration demands with yesterday’s data integration approaches. Dr. Stonebraker reviews three generations of data integration products, and how they have evolved. He explores new third-generation products that deliver a vital missing layer in the data integration “stack”—data curation at scale. Dr. Stonebraker also highlights five key tenets of a system that can effectively handle data curation at scale.

Chapter 2, An Alternative Approach to Data Management

In this chapter, Tom Davenport, author of Competing on Analytics and Big Data at Work (Harvard Business Review Press), proposes an alternative approach to data management. Many of the centralized planning and architectural initiatives created throughout the 60 years or so that organizations have been managing data in electronic form were never completed or fully implemented because of their complexity. Davenport describes five approaches to realistic, effective data management in today’s enterprise.

Chapter 3, Pragmatic Challenges in Building Data Cleaning Systems

Ihab Ilyas of the University of Waterloo points to “dirty, inconsistent data” (now the norm in today’s enterprise) as the reason we need new solutions for quality data analytics and retrieval on large-scale databases. Dr. Ilyas approaches this issue as a theoretical and engineering problem, and breaks it down into several pragmatic challenges. He explores a series of principles that will help enterprises develop and deploy data cleaning solutions at scale.

Chapter 4, Understanding Data Science: An Emerging Discipline for Data-Intensive Discovery

Michael Brodie, research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory, is devoted to understanding data science as an emerging discipline for data-intensive analytics. He explores data science as a basis for the Fourth Paradigm of engineering and scientific discovery. Given the potential risks and rewards of data-intensive analysis and its breadth of application, Dr. Brodie argues that it’s imperative we get it right. In this chapter, he summarizes his analysis of more than 30 large-scale use cases of data science, and reveals a body of principles and techniques with which to measure and improve the correctness, completeness, and efficiency of data-intensive analysis.

Chapter 5, From DevOps to DataOps

Tamr Cofounder and CEO Andy Palmer argues in support of “DataOps” as a new discipline, echoing the emergence of “DevOps,” which has improved the velocity, quality, predictability, and scale of software engineering and deployment. Palmer defines and explains DataOps, and offers specific recommendations for integrating it into today’s enterprises.

Chapter 6, Data Unification Brings Out the Best in Installed Data Management Strategies

Former Informatica CTO James Markarian looks at current data management techniques such as extract, transform, and load (ETL); master data management (MDM); and data lakes. While these technologies can provide a unique and significant handle on data, Markarian argues that they are still challenged in terms of speed and scalability. Markarian explores adding data unification as a frontend strategy to quicken the feed of highly organized data. He also reviews how data unification works with installed data management solutions, allowing businesses to embrace data volume and variety for more productive data analysis.
