Series of lava rock pools situated just off the southern end of Lake Turkana, northern Kenya's Jade Sea (source: Thomas Kujawa on Wikimedia Commons)

In the olden days of data science, one of the rallying cries was the democratization of data. No longer were data owners at the mercy of enterprise data warehouses (EDWs) and extract, transform, load (ETL) jobs, where data had to be transformed into a specific schema (“schema on write”) before it could be stored in the enterprise data warehouse and made available for use in reporting and analytics. This data was often most naturally expressed as nested structures (e.g., a base record with two array-typed attributes), but warehouses were usually based on the relational model. Thus, the data needed to be pulled apart and “normalized” into flat relational tables in first normal form. Once stored in the warehouse, recovering the data’s natural structure required several expensive relational joins. Or, for the most common or business-critical applications, the data was “de-normalized,” in which formerly nested structures were reunited, but in a flat relational form with a lot of redundancy.
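To make the contrast concrete, here is a minimal sketch of that round trip: a base record with two array-typed attributes is normalized into flat, first-normal-form tables, and recovering its natural shape then requires a query per nested attribute. The table and field names are hypothetical, chosen only for illustration.

```python
import sqlite3

# A naturally nested record: a base record with two array-typed attributes.
# (Field names are hypothetical, for illustration only.)
order = {
    "order_id": 1,
    "customer": "acme",
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    "tags": ["rush", "gift"],
}

con = sqlite3.connect(":memory:")
cur = con.cursor()

# "Schema on write": the nested record must be pulled apart into flat,
# first-normal-form tables before it can be stored.
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
cur.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")
cur.execute("CREATE TABLE order_tags (order_id INTEGER, tag TEXT)")

cur.execute("INSERT INTO orders VALUES (?, ?)",
            (order["order_id"], order["customer"]))
cur.executemany("INSERT INTO order_items VALUES (?, ?, ?)",
                [(order["order_id"], i["sku"], i["qty"]) for i in order["items"]])
cur.executemany("INSERT INTO order_tags VALUES (?, ?)",
                [(order["order_id"], t) for t in order["tags"]])

def read_order(cur, order_id):
    """Recovering the natural structure costs one lookup per nested attribute."""
    customer, = cur.execute(
        "SELECT customer FROM orders WHERE order_id = ?", (order_id,)).fetchone()
    items = [{"sku": s, "qty": q} for s, q in cur.execute(
        "SELECT sku, qty FROM order_items WHERE order_id = ? ORDER BY rowid",
        (order_id,))]
    tags = [t for t, in cur.execute(
        "SELECT tag FROM order_tags WHERE order_id = ? ORDER BY rowid",
        (order_id,))]
    return {"order_id": order_id, "customer": customer,
            "items": items, "tags": tags}
```

With three flat tables standing in for one record, `read_order` must touch all of them just to hand the data back in the form its owner wrote it.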

This is the context in which big data and the data lake arose. No single schema was imposed. Anyone could store their data in the data lake, in any structure (or no consistent structure). Naturally nested data was no longer stripped apart into artificially flat structures. Data owners no longer had to wait for the IT department to write ETL jobs before they could access and query their data. In place of the tyranny of schema on write, schema on read was born. Users could store their data in any schema, which would be discovered at the time of reading the data. Data storage was no longer the exclusive province of the DBAs and the IT departments. Data from multiple previously siloed teams could be stored in the same repository.

Where are we today? Data lakes have ballooned. The same data, and aggregations of the same data, are often present redundantly—often many times redundant, as the same interesting data set is saved to the data lake by multiple teams, unknown to each other. Further, data scientists seeking to integrate data from multiple silos are unable to identify where the data resides in the lake. Once found, diverse data sets are very hard to integrate, since the data typically contains no documentation on the semantics of its attributes. Attributes on which data sets would be joined (e.g., customer billing ID) have been given different names by different teams. The rule of thumb is that data scientists spend 70% of their time finding, interpreting, and cleaning data, and only 30% actually analyzing it. Schema on read offers no help in these tasks, because data gives up none of its secrets until actually read, and even when read has no documentation beyond attribute names, which may be inscrutable, vacuous, or even misleading.

Enter data governance. Traditionally, data governance is much more akin to EDWs than data lakes—formal management and definition, controlled vocabularies, access control, standardization, regulatory compliance, expiration policies. In the terms of a recent Harvard Business Review article, “What’s Your Data Strategy?”, by Leandro DalleMule and Thomas H. Davenport, (traditional) data governance does a good job at the important reactive (“defensive”) elements of data management—“identifying, standardizing, and governing authoritative data sources, such as fundamental customer and supplier information or sales data, in a ‘single source of truth’”—but is less well-suited to proactive (“offensive”) efforts. In contrast, proactive strategies “focus on activities that generate customer insights (data analysis and modeling, for example) or integrate disparate customer and market data to support managerial decision-making.”

Today’s data governance retains some of its traditional reactive roots. But increasingly in the big data arena, proactive data governance is saving the democratized data lake from itself.

At Comcast, for instance, Kafka topics are associated with Apache Avro schemas that include non-trivial documentation on every attribute and use common subschemas to capture commonly used data (such as error logs). These schemas follow the data through its streaming journey, often being enriched and transformed, until the data finds its resting place in the data lake. “Schema on read” using Avro files thus includes rich documentation and common structures and naming conventions. More accurately, a data lake of Avro data can be characterized as “schema on write,” with the following distinctions from traditional schema on write: 1) nested structures instead of flat relations; 2) schemas defined by data owners, not DBAs; and 3) multiple schemas not only supported but encouraged. Further, the data includes its own documentation.
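To illustrate the idea (not Comcast's actual schemas), here is a sketch of an Avro record schema in this spirit: a `doc` string on every attribute, and a common error-log subschema that other topics could reuse by name. All names and documentation strings below are hypothetical.

```python
import json

# Illustrative Avro schema: every field carries a non-trivial "doc", and the
# nested "ErrorLog" record is the kind of common subschema shared across
# topics. Names here are invented for the example.
schema = json.loads("""
{
  "type": "record",
  "name": "DeviceEvent",
  "doc": "One telemetry event emitted by a customer device.",
  "fields": [
    {"name": "device_id", "type": "string",
     "doc": "Globally unique device identifier; a join key across topics."},
    {"name": "event_ts", "type": "long",
     "doc": "Event time in milliseconds since the Unix epoch, UTC."},
    {"name": "errors",
     "doc": "Errors observed while handling the event.",
     "type": {"type": "array", "items": {
        "type": "record", "name": "ErrorLog",
        "doc": "Common error-log subschema reused by many topics.",
        "fields": [
          {"name": "code", "type": "int", "doc": "Numeric error code."},
          {"name": "message", "type": "string", "doc": "Human-readable detail."}
        ]}}}
  ]
}
""")

# Because documentation travels with the schema, a reader can recover the
# meaning of every attribute directly from the data's own metadata:
docs = {f["name"]: f["doc"] for f in schema["fields"]}
```

Since Avro embeds the schema alongside the data, this documentation rides with every file into the lake rather than living in a wiki that drifts out of date.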

At Comcast, we store schemas and metadata on Kafka topics, data lake objects, and connecting/enriching processes in Apache Atlas. Atlas provides data and lineage discovery via SQL-like, free-text, and graph queries. Our system thus enables data scientists to find data of interest, understand it (via extensive attribute-level documentation), and join it (via commonly named attributes). In addition, by storing the connecting/enriching processes we provide data lineage. A data producer can answer the question: “Where are the derivatives of my original data, and who transformed them along the way?” A data scientist can answer the question: “How has the data changed in its journey from ingest to where I’m viewing it in the data lake?”
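As a sketch of what discovery looks like in practice: Atlas exposes a REST search endpoint that accepts SQL-like DSL queries. The snippet below only builds such a request; the host, entity type, and attribute names are assumptions for illustration, not our production values.

```python
from urllib.parse import urlencode

def atlas_dsl_search_url(base_url: str, dsl_query: str) -> str:
    """Build a GET URL for Atlas's DSL search endpoint (assumed v2 REST API)."""
    return f"{base_url}/api/atlas/v2/search/dsl?" + urlencode({"query": dsl_query})

# Hypothetical example: find registered schemas that expose the commonly
# named customer billing ID attribute, so diverse data sets can be joined.
url = atlas_dsl_search_url(
    "http://atlas.example.com:21000",
    'from avro_schema where name = "customer_billing_id"',
)
# An HTTP client (e.g., urllib.request, with the cluster's authentication)
# would then issue the GET and walk the matching entities and their lineage.
```

The same endpoint style serves the lineage questions above: once an entity is found, its upstream and downstream processes can be traversed through Atlas's graph of registered connecting/enriching jobs.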

Proactive data governance transforms schema on read to schemas on write, enabling both flexibility and common semantics.

This post is a collaboration between O'Reilly and Qubole. See our statement of editorial independence.
