How human-machine collaboration has automated the data catalog

To document enterprise data, machines must learn from the explicit feedback and implicit signals people leave behind.

By Aaron Kalb

March 9, 2016

"Distillation de l'eau à la cornue" (source: Fondo Antiguo de la Universidad de Sevilla on Flickr)

The debate between advocates of artificial intelligence (AI) and defenders of human-centric approaches presents a false dichotomy. Machines can certainly help solve the problems facing humans, but they can rarely do so alone. To be most effective, machines must learn from people and about people. Creating and implementing accurate AI systems requires the input of human knowledge.

This doesn’t mean we can’t reap the efficiency benefits promised by automation and AI. Human input can be gathered without investing significant human time and effort. In other words, it’s possible for a computer to answer questions about people without directly asking questions of people. For example, Google learns which Web pages people like by observing which pages they link to. When posting the links that feed the PageRank algorithm, these online content producers aren’t intentionally talking to Google’s computers; they’re talking to their own human audiences. Google simply eavesdrops, much as a baby learns her native language by overhearing many adults talking to each other for their own reasons. Learning from people’s natural patterns and passive signals is one of the most efficient ways for computers to gain useful, applicable knowledge.

The role of the eavesdropping intelligent computer

Inside an organization, there are numerous sources from which an intelligent computer eavesdropper might learn how data analysts talk to their databases, and how they ought to do so:

Query logs

Many of those “conversations” are stored in query logs. A single record in a query log might show that the user jdoe wrote a query selecting a few columns from a customer table joined with a transaction table, filtered by date. That event suggests that jdoe knows about and is interested in those two tables, and that those tables can be joined in that way. If many distinct users write queries against one of those tables, the table is probably important in the organization. That information is useful to a new hire trying to get up to speed on the data environment, as well as to a steward trying to prioritize data documentation efforts. If jdoe writes a disproportionate number of queries against the transaction table, she might be an expert on it. And if many queries executed against the transaction table contained a date filter (and if all those that didn’t took hundreds of hours to run and often wound up getting canceled), that’s a good sign that future queries ought to include such a filter. The logs contain a wealth of knowledge about what’s important (and to whom), who has expertise, and how data should be optimally filtered, joined, and used, if you know how to read the signs.
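
As a rough illustration of reading those signs, the sketch below tallies three of the signals described above: how many distinct users query each table, who queries it most, and how often queries against it include a date filter. It assumes the log has already been parsed into simple records; the record layout, the sample data, and the function name are hypothetical, and real query logs would first need SQL parsing.

```python
from collections import Counter, defaultdict

# Hypothetical, pre-parsed log records:
# (user, tables referenced by the query, whether the query had a date filter)
QUERY_LOG = [
    ("jdoe", {"customer", "transaction"}, True),
    ("jdoe", {"transaction"}, True),
    ("jdoe", {"transaction"}, True),
    ("asmith", {"transaction"}, True),
    ("asmith", {"customer"}, False),
    ("blee", {"transaction"}, False),
]


def summarize_tables(log):
    """Return per-table popularity, most frequent user, and date-filter rate."""
    users = defaultdict(set)         # table -> distinct users who queried it
    per_user = defaultdict(Counter)  # table -> query count per user
    total = Counter()                # table -> total queries
    filtered = Counter()             # table -> queries with a date filter
    for user, tables, has_date_filter in log:
        for table in tables:
            users[table].add(user)
            per_user[table][user] += 1
            total[table] += 1
            filtered[table] += int(has_date_filter)
    return {
        table: {
            "distinct_users": len(users[table]),
            # The heaviest user is only a candidate expert, not a verdict.
            "likely_expert": per_user[table].most_common(1)[0][0],
            "date_filter_rate": filtered[table] / total[table],
        }
        for table in total
    }


summary = summarize_tables(QUERY_LOG)
print(summary["transaction"])
```

In this toy data, the transaction table is queried by three distinct users, most often by jdoe, and usually with a date filter. Those are the kinds of facts a catalog could surface as importance, expertise, and usage hints.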

BI tools

Another record of “conversations” that humans have with their data (and each other) can be found in BI tools. If I make a chart in a Tableau workbook where the y-axis is labeled “revenue” and I calculate that measure using the SUM of the amt column in the transaction table, I’m effectively offering my definition of revenue. A computer can compare that definition to other axes labeled “revenue” and employ a variety of techniques to assess whether those definitions are logically equivalent.
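
One very simple technique for comparing such definitions is to normalize the text of each calculation and group the ones that come out identical. The sketch below does only that. It is a crude textual check, not a test of logical equivalence, and the workbook names and the refund column are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical "revenue" definitions collected from chart axes in BI workbooks.
REVENUE_DEFINITIONS = {
    "workbook_a": "SUM(transaction.amt)",
    "workbook_b": "sum( transaction.amt )",
    "workbook_c": "SUM(transaction.amt) - SUM(transaction.refund)",
}


def canonical(expr):
    """Strip whitespace and lowercase the expression.

    Assumes the expression contains no case-sensitive string literals.
    """
    return re.sub(r"\s+", "", expr).lower()


def group_equivalent(definitions):
    """Group sources whose expressions are textually identical after cleanup."""
    groups = defaultdict(list)
    for source, expr in definitions.items():
        groups[canonical(expr)].append(source)
    return dict(groups)


for expr, sources in group_equivalent(REVENUE_DEFINITIONS).items():
    print(expr, "<-", sources)
```

Here the first two workbooks fall into one group while the third, which subtracts refunds, stands apart. A disagreement like that is worth surfacing to a human.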

Lineage

Without much human input, a computer can draw a graph of the lineage or provenance of all data assets in an organization, from base tables to derived reports, with all the ETL scripts and SQL CREATE statements in between. That graph can be used to amplify human effort. For instance, since corruption flows downstream, a single data quality flag raised on an important source table could propagate a quality warning down onto thousands of tables and report metrics.
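
The propagation step can be sketched as a breadth-first walk over the lineage graph. The graph below is a hypothetical example, written as a mapping from each asset to the assets derived directly from it. In practice the edges would be extracted from ETL scripts and CREATE statements.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets built directly from it.
LINEAGE = {
    "raw.transaction": ["stg.transaction_clean"],
    "raw.customer": ["mart.customer_ltv"],
    "stg.transaction_clean": ["mart.daily_revenue", "mart.customer_ltv"],
    "mart.daily_revenue": ["report.revenue_dashboard"],
    "mart.customer_ltv": [],
}


def downstream(graph, source):
    """Return every asset reachable from `source` by following lineage edges."""
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:  # also guards against cycles
                seen.add(child)
                queue.append(child)
    return seen


# One flag raised by a human on a source table yields warnings on all of these.
warned = downstream(LINEAGE, "raw.transaction")
print(sorted(warned))
```

A single human judgment about raw.transaction reaches four derived assets in this toy graph. The same walk works unchanged on a graph with thousands of nodes.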

Natural language corpora

Inscrutable field names can present a major challenge for analytics organizations. What does bin stand for in cmplt_bin? How about in is_bin or bin_nbr? By scanning written documentation in internal wikis or BI tools, a computer can construct a language model with likely candidates: “bin” could be a synonym for “bucket” in an A/B test, or an abbreviation for “binary,” or an acronym for “Buy It Now.” After learning its vocabulary from the “adults” (to extend our baby metaphor), the computer can generate a disambiguation engine based on collocation and context clues. Orthographic rules and Natural Language Processing techniques can all be brought to bear on the corpora that already exist in organizational text.
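
A toy version of context-based disambiguation might look like the sketch below. The cue words for each sense are hand-written stand-ins for what a language model would learn from internal documentation, and the neighboring field names are hypothetical, so treat the whole thing as an illustration of the idea rather than a working engine.

```python
# Hypothetical cue words per sense; a real system would learn these from
# wikis and BI tool descriptions rather than hard-code them.
SENSE_CUES = {
    "bucket": {"test", "experiment", "variant"},
    "binary": {"flag", "bool", "boolean", "indicator"},
    "Buy It Now": {"price", "auction", "listing"},
}


def guess_sense(field_name, sibling_fields):
    """Guess what "bin" means from tokens in the field and its neighbors.

    Returns (sense, score), or None when no cue word appears at all.
    """
    tokens = set()
    for name in [field_name, *sibling_fields]:
        tokens.update(name.lower().split("_"))
    tokens.discard("bin")
    scores = {sense: len(cues & tokens) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None  # no evidence: better to ask a human than to guess
    return best, scores[best]


print(guess_sense("bin_nbr", ["auction_id", "listing_price"]))
print(guess_sense("cmplt_bin", ["order_id"]))
```

Returning nothing when the context offers no clues is deliberate: an unanswered question can be routed to a person, while a confident wrong answer undermines trust in the catalog.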

Data values

Data values themselves can also provide strong signals. Nine-digit strings with a certain profile appear likely to be Social Security Numbers, particularly in fields with names like soc_sec_num, ssn, or scl_scrty_nbr. By following lines of lineage or joins (as discussed above), those values can be traced to other fields with less obvious labels. So, a sensitivity flag placed on one of them can be propagated to the rest, enhancing the security of the entire data set.
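
As a minimal sketch of this kind of value profiling, the code below combines two signals: the share of a column's values that look like a U.S. Social Security Number, and whether the column name hints at one. The value pattern excludes number ranges that are not issued (area 000, 666, or 900 to 999; group 00; serial 0000). The name pattern and the sample values are assumptions made for illustration.

```python
import re

# Nine digits, optionally hyphenated, excluding ranges that are never issued.
SSN_VALUE = re.compile(r"^(?!000|666|9\d\d)\d{3}-?(?!00)\d{2}-?(?!0000)\d{4}$")

# Hypothetical name hints covering spellings like soc_sec_num or scl_scrty_nbr.
NAME_HINT = re.compile(r"ssn|soc.*sec|scl.*scrty", re.IGNORECASE)


def ssn_signal(column_name, values):
    """Return the share of SSN-like values and whether the name is suggestive."""
    values = [str(v) for v in values]
    matches = sum(1 for v in values if SSN_VALUE.match(v))
    return {
        "match_rate": matches / len(values) if values else 0.0,
        "name_hint": bool(NAME_HINT.search(column_name)),
    }


# Made-up sample values, not real identifiers.
print(ssn_signal("scl_scrty_nbr", ["123-45-6789", "234567890"]))
print(ssn_signal("cust_ref", ["123-45-6789", "234567890", "000-12-3456", "hello"]))
```

Neither signal is conclusive on its own. A column with a high match rate but an unremarkable name is the sort of candidate that following lineage or joins back to a labeled field can help confirm.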

The role of the knowledgeable human trainer

For every piece of documentation, whether a possible sensitivity classification, a plain-English translation of a field name, or a mapping from a calculation (e.g., SUM(amt)) to a metric such as “total revenue,” the computer can offer a best guess with a confidence score, which a knowledgeable human can then confirm or correct. Such confirmations not only boost the trust another person can place in that particular annotation, but also teach the computer, making the machine more confident in its future guesses.
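
One simple way to picture this feedback loop is a per-rule tally of confirmations and rejections, turned into a confidence with Laplace smoothing. This is just one possible scheme, chosen here for brevity. The class and its names are hypothetical.

```python
class GuessTracker:
    """Tracks human verdicts on one kind of machine guess,
    for example "a column named amt that is summed means revenue"."""

    def __init__(self):
        self.confirmed = 0
        self.rejected = 0

    def record(self, was_correct):
        """Store one human verdict on a guess made by this rule."""
        if was_correct:
            self.confirmed += 1
        else:
            self.rejected += 1

    @property
    def confidence(self):
        """Laplace-smoothed share of confirmed guesses (0.5 with no feedback)."""
        return (self.confirmed + 1) / (self.confirmed + self.rejected + 2)


tracker = GuessTracker()
print(tracker.confidence)  # no feedback yet, so the rule is a coin flip

for verdict in [True, True, True, False]:
    tracker.record(verdict)
print(tracker.confidence)
```

Each confirmation nudges the confidence up and each rejection pulls it down, so the expert's time spent on one annotation carries over to every later guess made by the same rule.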

Many modern data-driven organizations are scrambling today to generate a data catalog—a comprehensive repository of all data assets in an organization with information on their quality and provenance, and how they should be used. Attempts to build such an artifact using only human labor are rarely completed given resource constraints, while fully automated projects—with machines crawling only the data itself with no human input—often yield inaccurate and untrusted results.

Conclusion

By distilling insight out of the implicit signals left behind by humans in query logs, BI tools, wikis, lineage, data values, and their various connections, machines can automatically populate a data catalog with important information gleaned from learning all about a data environment, how it works, and how it can and should be used. And by eliciting explicit feedback from knowledgeable humans, the computer can improve the breadth and accuracy of that data catalog over time.

Machines guess, experts confirm, machines learn, guesses get better, and all humans profit. When humans and computers collaborate, the effort required of us humans is minimized, while the benefits offered to us are maximized. It’s the best of both worlds.

Aaron Kalb will be diving more into this discussion on modern data cataloging in a talk at the upcoming Strata + Hadoop World conference in San Jose, CA, March 29-31, 2016.
