Chapter 1. The Solution: Data Curation at Scale

Integrating data sources isn’t a new challenge. But the challenge has intensified in both importance and difficulty, as the volume and variety of usable data—and enterprises’ ambitious plans for analyzing and applying it—have increased. As a result, trying to meet today’s data integration demands with yesterday’s data integration approaches is impractical.

In this chapter, we look at the three generations of data integration products and how they have evolved, focusing on the new third-generation products that deliver a vital missing layer in the data integration “stack”: data curation at scale. We then examine five key tenets of an effective system for data curation at scale.

Three Generations of Data Integration Systems

Data integration systems emerged to enable business analysts to access converged datasets directly for analyses and applications.

First-generation data integration systems—data warehouses—arrived on the scene in the 1990s. Major retailers took the lead, assembling customer-facing data (e.g., item sales, products, customers) in data stores and mining it to make better purchasing decisions. For example, pet rocks might be out of favor while Barbie dolls might be “in.” With this intelligence, retailers could discount the pet rocks and tie up the Barbie doll factory with a big order. Data warehouses typically paid for themselves within a year through better buying decisions.

First-generation data integration systems were termed ETL (extract, transform, and load) products. They were used to assemble data from various sources (usually fewer than 20) into the warehouse. But enterprises underestimated the “T” part of the process—specifically, the cost of the data curation (mostly, data cleaning) required to get heterogeneous data into the proper format for querying and analysis. Hence, the typical data warehouse project ran substantially over budget and finished late because of the data integration difficulties inherent in these early systems.

This led to a second generation of ETL systems, in which the major ETL products were extended with data cleaning modules and additional adapters for ingesting other kinds of data. In effect, the ETL tools grew into data curation tools.

Data curation involves five key tasks:

  1. Ingesting data sources

  2. Cleaning errors from the data (–99 often means null)

  3. Transforming attributes into other ones (for example, euros to dollars)

  4. Performing schema integration to connect disparate data sources

  5. Performing entity consolidation to remove duplicates
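
To make the five tasks concrete, here is a deliberately small sketch of such a pipeline in Python. Everything in it is a hypothetical illustration rather than any product’s implementation: the file name, column names, sentinel value, and exchange rate are all invented.

    import csv

    EUR_TO_USD = 1.08  # hypothetical fixed exchange rate, for illustration only

    def ingest(path):
        """Task 1: ingest a source (here, a CSV file) into a list of records."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def clean(record):
        """Task 2: clean errors, e.g., treat the sentinel -99 as null."""
        return {k: (None if v in ("-99", "") else v) for k, v in record.items()}

    def transform(record):
        """Task 3: transform attributes, e.g., convert euros to dollars."""
        if record.get("price_eur") is not None:
            record["price_usd"] = round(float(record["price_eur"]) * EUR_TO_USD, 2)
        return record

    def integrate_schema(record, mapping):
        """Task 4: map source-specific column names onto a common schema."""
        return {mapping.get(k, k): v for k, v in record.items()}

    def consolidate(records, key):
        """Task 5: entity consolidation -- keep one record per key value."""
        unique = {}
        for r in records:
            unique.setdefault(r[key], r)  # naive rule: first record wins
        return list(unique.values())

    # Hypothetical usage for one source whose "cust_name" column maps to "customer".
    mapping_a = {"cust_name": "customer"}
    records = [integrate_schema(transform(clean(r)), mapping_a)
               for r in ingest("source_a.csv")]
    deduped = consolidate(records, key="customer")

The hard part at scale is not writing such steps once, but repeating them across hundreds or thousands of heterogeneous sources.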

In general, these second-generation data curation systems followed the architecture of their first-generation predecessors: they were toolkits oriented toward professional programmers (in other words, programmer productivity tools).

While many of these are still in use today, second-generation data curation tools have two substantial weaknesses:

Scalability

Enterprises want to curate “the long tail” of enterprise data. They have several thousand data sources, everything from company budgets in the CFO’s spreadsheets to peripheral operational systems. There is “business intelligence gold” in the long tail, and enterprises wish to capture it—for example, for cross-selling of enterprise products. Furthermore, the rise of public data on the Web is leading business analysts to want to curate additional data sources. Data on everything from the weather to customs records to real estate transactions to political campaign contributions is readily available. However, in order to capture long-tail enterprise data as well as public data, curation tools must be able to deal with hundreds to thousands of data sources rather than the tens of data sources most second-generation tools are equipped to handle.

Architecture

Second-generation tools typically are designed for central IT departments. A professional programmer will not know the answers to many of the data curation questions that arise. For example, are “rubber gloves” the same thing as “latex hand protectors”? Is an “ICU50” the same kind of object as an “ICU”? Only businesspeople in line-of-business organizations can answer these kinds of questions. However, businesspeople are usually not in the same organizations as the programmers running data curation projects. As such, second-generation systems are not architected to take advantage of the humans best able to provide curation help.

These weaknesses led to a third generation of data curation products, which we term scalable data curation systems. Any data curation system should be capable of performing the five tasks noted earlier. However, first- and second-generation ETL products will only scale to a small number of data sources, because of the amount of human intervention required.

To scale to hundreds or even thousands of data sources, a new approach is needed—one that:

  1. Uses statistics and machine learning to make automatic decisions wherever possible

  2. Asks a human expert for help only when necessary

Instead of an architecture in which a human controls the process with computer assistance, we must move to one in which the computer runs an automatic process and asks a human for help only when required. It’s also important that this process ask the right human: the data creator or owner (a business expert), not the data wrangler (a programmer).
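
A minimal sketch of that inversion of control might look like the following in Python. The scoring function, the 0.9 confidence threshold, and the queue are assumptions for illustration, not a description of any particular product.

    from dataclasses import dataclass, field
    from typing import Callable, List, Tuple

    @dataclass
    class CurationEngine:
        """Machine-driven, human-guided: decide automatically when confident,
        and queue only the uncertain cases for a business expert."""
        score_match: Callable[[dict, dict], float]  # assumed: returns P(records match)
        threshold: float = 0.9                      # hypothetical confidence cutoff
        expert_queue: List[Tuple[dict, dict]] = field(default_factory=list)

        def decide(self, record_a: dict, record_b: dict) -> str:
            p = self.score_match(record_a, record_b)
            if p >= self.threshold:
                return "match"        # automatic decision
            if p <= 1 - self.threshold:
                return "no-match"     # automatic decision
            self.expert_queue.append((record_a, record_b))
            return "ask-expert"       # only this case reaches a human

Raising the threshold sends more questions to experts and yields higher accuracy; lowering it does the opposite. That knob is the accuracy trade-off discussed next.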

Obviously, enterprises differ in the required accuracy of curation, so third-generation systems must allow an enterprise to make trade-offs between accuracy and the amount of human involvement. In addition, third-generation systems must contain a crowdsourcing component that makes it efficient for business experts to assist with curation decisions. Unlike Amazon’s Mechanical Turk, however, a data curation crowdsourcing model must be able to accommodate a hierarchy of experts inside an enterprise as well as various kinds of expertise. Therefore, we call this component an expert sourcing system to distinguish it from the more primitive crowdsourcing systems.
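
One way to picture the expert sourcing component is as a router that sends each question to the most appropriate, least-loaded expert and escalates up the hierarchy only when necessary. The sketch below is purely illustrative; the fields and the load cap are assumptions, not a description of Tamr or any other product.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Expert:
        name: str
        domains: set          # areas of expertise, e.g., {"procurement", "genomics"}
        seniority: int        # position in the hierarchy (higher = more senior)
        open_questions: int = 0
        max_load: int = 20    # hypothetical cap so no single expert is overloaded

    def route_question(domain: str, experts: List[Expert]) -> Optional[Expert]:
        """Prefer the least-senior, least-loaded expert who knows the domain;
        more senior experts are reached only when juniors are saturated."""
        candidates = [e for e in experts
                      if domain in e.domains and e.open_questions < e.max_load]
        if not candidates:
            return None  # nobody available: hold the question or widen the search
        chosen = min(candidates, key=lambda e: (e.seniority, e.open_questions))
        chosen.open_questions += 1
        return chosen

The essential contrast with generic crowdsourcing is that questions go to named people with known expertise inside the enterprise, not to an anonymous crowd.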

In short, a third-generation data curation product is an automated system with an expert sourcing component. Tamr is an early example of this third generation of systems.

Third-generation systems can coexist with the second-generation systems already in place: the older tools can curate the first few tens of data sources into a composite result, which a third-generation system can then curate together with the “long tail” of remaining sources. Table 1-1 illustrates the key characteristics of the three types of curation systems.

Table 1-1. Evolution of three generations of data integration systems

First generation (1990s)
  Approach: ETL
  Target data environment(s): Data warehouses
  Users: IT/programmers
  Integration philosophy: Top-down/rules-based/IT-driven
  Architecture: Programmer productivity tools (task automation)
  Scalability (# of data sources): 10s

Second generation (2000s)
  Approach: ETL + data curation
  Target data environment(s): Data warehouses or data marts
  Users: IT/programmers
  Integration philosophy: Top-down/rules-based/IT-driven
  Architecture: Programmer productivity tools (task automation with machine assistance)
  Scalability (# of data sources): 10s to 100s

Third generation (2010s)
  Approach: Scalable data curation
  Target data environment(s): Data lakes and self-service data analytics
  Users: Data scientists, data stewards, data owners, business analysts
  Integration philosophy: Bottom-up/demand-based/business-driven
  Architecture: Machine-driven, human-guided process
  Scalability (# of data sources): 100s to 1,000s+

To summarize: ETL systems arose to deal with the transformation challenges in early data warehouses. They evolved into second-generation data curation systems with an expanded scope of offerings. Third-generation data curation systems, which have a very different architecture, were created to address the enterprise’s need for data source scalability.

Five Tenets for Success

Third-generation scalable data curation systems provide the architecture, automated workflow, interfaces, and APIs for data curation at scale. Beyond this basic foundation, however, are five tenets that are desirable in any third-generation system.

Tenet 1: Data Curation Is Never Done

Business analysts and data scientists have an insatiable appetite for more data. This was brought home to me about a decade ago during a visit to a beer company in Milwaukee. They had a fairly standard data warehouse of sales of beer by distributor, time period, brand, and so on. I visited during a year when El Niño was forecast to disrupt winter weather in the US. Specifically, it was forecast to be wetter than normal on the West Coast and warmer than normal in New England. I asked the business analysts: “Are beer sales correlated with either temperature or precipitation?” They replied, “We don’t know, but that is a question we would like to ask.” However, temperature and precipitation data were not in the data warehouse, so the question could not be answered.

The demand from warehouse users to correlate more and more data elements for business value leads to additional data curation tasks. Moreover, whenever a company makes an acquisition, it creates a data curation problem (digesting the acquired company’s data). Lastly, the treasure trove of public data on the Web (such as temperature and precipitation data) is largely untapped, leading to more curation challenges.

Even without new data sources, the collection of existing data sources is rarely static. Insertions and deletions in these sources generate a pipeline of incremental updates to a data curation system. Between the requirements of new data sources and updates to existing ones, it is obvious that data curation is never done, ensuring that any project in this area will effectively continue indefinitely. Realize this and plan accordingly.

One obvious consequence of this tenet concerns consultants. If you hire an outside service to perform data curation for you, then you will have to rehire them for each additional task. This will give the consultants a guided tour through your wallet over time. In my opinion, you are much better off developing in-house curation competence over time.

Tenet 2: A PhD in AI Can’t Be a Requirement for Success

Any third-generation system will use statistics and machine learning to make automatic or semiautomatic curation decisions. Inevitably, it will use sophisticated techniques such as T-tests, regression, predictive modeling, data clustering, and classification. Many of these techniques will entail training data to set internal parameters. Several will also generate recall and/or precision estimates.
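
As a reminder of what those recall and precision estimates mean, the arithmetic is simple once you have a labeled holdout set; the predictions and labels below are, of course, made up for illustration.

    def precision_recall(predicted, actual):
        """Precision = correct matches / predicted matches;
        recall = correct matches / actual matches."""
        true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
        pred_pos = sum(1 for p in predicted if p)
        actual_pos = sum(1 for a in actual if a)
        precision = true_pos / pred_pos if pred_pos else 0.0
        recall = true_pos / actual_pos if actual_pos else 0.0
        return precision, recall

    # Hypothetical: the model flags three of four pairs as matches; two are right.
    print(precision_recall([True, True, True, False], [True, False, True, True]))
    # -> (0.666..., 0.666...)

Tenet 2 simply says that figures like these should be computed and used inside the product, not pushed onto the user.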

These are all techniques understood by data scientists. However, there will be a shortage of such people for the foreseeable future, until colleges and universities begin producing substantially more of them than they do at present. Also, it is not obvious that one can “retread” a business analyst into a data scientist. A business analyst only needs to understand the output of SQL aggregates; in contrast, a data scientist is typically familiar with statistics and various modeling techniques.

As a result, most enterprises will be lacking in data science expertise. Therefore, any third-generation data curation product must use these techniques internally, but not expose them in the user interface. Mere mortals must be able to use scalable data curation products.

Tenet 3: Fully Automatic Data Curation Is Not Likely to Be Successful

Some data curation products expect to run fully automatically. In other words, they translate input data sets into output without human intervention. Fully automatic operation is very unlikely to be successful in an enterprise, for a variety of reasons. First, there are curation decisions that simply cannot be made automatically. For example, consider two records, one stating that restaurant X is at location Y while the second states that restaurant Z is at location Y. This could be a case where one restaurant went out of business and got replaced by a second one, or the location could be a food court. There is no good way to know which record is correct without human guidance.

Second, there are cases where data curation must have high reliability. Certainly, consolidating medical records should not create errors. In such cases, one wants a human to check all (or maybe just some) of the automatic decisions. Third, there are situations where specialized knowledge is required for data curation. For example, in a genomics application one might have two terms: ICU50 and ICE50. An automatic system might suggest that these are the same thing, since the lexical distance between the terms is low; however, only a human genomics specialist can make this determination.
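
To see why the ICU50/ICE50 case defeats a purely lexical matcher, consider a quick check using Python’s standard library (not any product’s actual matching logic):

    from difflib import SequenceMatcher

    def lexical_similarity(a: str, b: str) -> float:
        """Similarity ratio in [0, 1]; 1.0 means identical strings."""
        return SequenceMatcher(None, a, b).ratio()

    print(lexical_similarity("ICU50", "ICE50"))  # 0.8 -- only one character differs

The two terms are lexically almost identical, so an automatic system would be tempted to merge them; only a genomics specialist can say whether they actually denote the same thing.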

For all of these reasons, any third-generation data curation system must be able to ask the right human expert for input when it is unsure of the answer. The system must also avoid overloading the experts that are involved.

Tenet 4: Data Curation Must Fit into the Enterprise Ecosystem

Every enterprise has a computing infrastructure in place. This includes a collection of database management systems storing enterprise data, a collection of application servers and networking systems, and a set of installed tools and applications. Any new data curation system must fit into this existing infrastructure. For example, it must be able to extract data from corporate databases, use legacy data cleaning tools, and export data to legacy data systems. Hence, an open environment is required wherein callouts are available to existing systems. In addition, adapters to common input and export formats are a requirement. Do not use a curation system that is a closed “black box.”
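
In code, an “open environment” usually comes down to a small connector contract that each existing system implements. The sketch below is one hypothetical way to express it; the class names and the CSV example are illustrative, not an existing API.

    import csv
    from abc import ABC, abstractmethod
    from typing import Iterable

    class Connector(ABC):
        """Minimal contract so existing databases, cleaning tools, and
        downstream systems can plug into the curation system."""

        @abstractmethod
        def read(self) -> Iterable[dict]:
            """Extract records from an existing system (database, file, API)."""

        @abstractmethod
        def write(self, records: Iterable[dict]) -> None:
            """Export curated records back to a legacy system or format."""

    class CsvConnector(Connector):
        """One concrete adapter; others would wrap corporate databases, queues, etc."""

        def __init__(self, path: str):
            self.path = path

        def read(self) -> Iterable[dict]:
            with open(self.path, newline="") as f:
                yield from csv.DictReader(f)

        def write(self, records: Iterable[dict]) -> None:
            rows = list(records)
            if not rows:
                return
            with open(self.path, "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=rows[0].keys())
                writer.writeheader()
                writer.writerows(rows)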

Tenet 5: A Scheme for “Finding” Data Sources Must Be Present

A typical question to ask CIOs is, “How many operational data systems do you have?” In all likelihood, they do not know. The enterprise is a sea of such data systems, linked by a hodgepodge of connectors. Moreover, there are all sorts of personal datasets, spreadsheets, and databases, as well as datasets imported from public web-oriented sources. Clearly, CIOs should have a mechanism for identifying data resources that they wish to have curated. Such a system must contain a data source catalog with information on a CIO’s data resources, as well as a query system for accessing this catalog. Lastly, an “enterprise crawler” is required to search a corporate intranet and locate relevant data sources. Collectively, these components constitute a scheme for “finding” enterprise data sources.
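
A minimal version of such a catalog is just a queryable registry of source descriptions that an enterprise crawler populates as it discovers data. The fields, the example entry, and the crawler comment below are all hypothetical.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class DataSource:
        name: str
        location: str   # e.g., a database connection string, file share path, or URL
        owner: str      # the businessperson who can answer questions about the data
        kind: str       # "database", "spreadsheet", "public web", ...
        tags: set = field(default_factory=set)

    class Catalog:
        def __init__(self):
            self._sources: List[DataSource] = []

        def register(self, source: DataSource) -> None:
            self._sources.append(source)

        def query(self, **criteria) -> List[DataSource]:
            """Return sources matching every given attribute, e.g., kind='spreadsheet'."""
            return [s for s in self._sources
                    if all(getattr(s, k, None) == v for k, v in criteria.items())]

    # An enterprise crawler (not shown) would walk the intranet and call
    # catalog.register() for each spreadsheet, database, or feed it finds.
    catalog = Catalog()
    catalog.register(DataSource("FY budget", r"\\cfo-share\budget.xlsx",
                                "CFO office", "spreadsheet"))
    print([s.name for s in catalog.query(kind="spreadsheet")])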

Taken together, these five tenets indicate the characteristics of a good third-generation data curation system. If you are in the market for such a product, then look for systems with these features.
