Spanning the stages of data analytics

Sharing metadata is key to managing the data pipeline

By Andy Oram
November 3, 2015
Pont du Gard, Roman Empire, October 2007 (source: Emanuele on Wikimedia Commons)

Sites that run analytics tend to divide tasks into stages: ingestion, cleansing or preparation, analysis, and storage. But these stages are not isolated — each stage depends on results generated by other stages, outside the simple pipeline that carries data from one stage to the next.

A metadata store can tie the analytics stages together so that each stage informs the others. (In this article, I will not consider visualization or reporting, which take place at a different level of processing.)

Analysis, cleansing, ingestion — each informs the others

The analysis stage can produce statistics about corrupted or erroneous data, which can feed back into improving the cleansing stage. For instance, if a country code is supposed to be two letters but the United States is routinely coded as “USA,” the analysis stage can produce a rule that the cleansing stage uses to convert “USA” to “US.”
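
To make this concrete, here is a minimal sketch in Python of how such a feedback rule might look. The field name and the rule table are invented for illustration; they are not part of any particular product.

```python
# A minimal sketch: the analysis stage emits a substitution rule after
# spotting a recurring anomaly, and the cleansing stage applies it.
# The field name and rule table are invented for illustration.

# Rule produced by the analysis stage: values observed for a field that
# should contain a two-letter country code, mapped to their corrections.
country_code_rules = {"USA": "US", "U.S.": "US", "UK": "GB"}

def cleanse_record(record: dict) -> dict:
    """Apply analysis-derived substitution rules during cleansing."""
    code = record.get("country_code")
    if code in country_code_rules:
        record["country_code"] = country_code_rules[code]
    return record

print(cleanse_record({"name": "Acme", "country_code": "USA"}))
# -> {'name': 'Acme', 'country_code': 'US'}
```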

The cleansing stage can identify problems with data sources, and that information can improve the ingestion stage. For instance, if cleansing detects that a certain source has switched its fourth and fifth fields (say, city and state), the ingestion stage can be informed of the change so that it starts tagging the fields in the new manner.
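
That hand-off could be expressed as a per-source field-order map that the cleansing stage updates and the ingestion stage consults. The sketch below assumes an invented source name and field layout.

```python
# A hypothetical field-order map, keyed by source. The cleansing stage
# updates the map when it detects that a source has reordered its fields;
# the ingestion stage consults it before tagging positional values.

DEFAULT_ORDER = ["id", "name", "street", "city", "state"]

# Override recorded after cleansing noticed that "example_feed" now
# sends state before city.
field_order_by_source = {"example_feed": ["id", "name", "street", "state", "city"]}

def ingest_row(source: str, raw_values: list) -> dict:
    """Tag raw positional values with field names for the given source."""
    order = field_order_by_source.get(source, DEFAULT_ORDER)
    return dict(zip(order, raw_values))

print(ingest_row("example_feed", [1, "Acme", "1 Main St", "NC", "Durham"]))
# -> {'id': 1, 'name': 'Acme', 'street': '1 Main St', 'state': 'NC', 'city': 'Durham'}
```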

The ingestion stage can record sources of data and other provenance information that can be used for access control during the analytics and storage stages. For instance, if your license restricts the use of data from Memorial Hospital to researchers who have signed acceptable use policies, the ingestion stage can tag all data from Memorial Hospital and later stages can look for that tag.
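
One way to carry such a tag forward is sketched below; the source name, policy label, and helper functions are hypothetical.

```python
# A sketch of provenance tagging: ingestion attaches a source tag to each
# record, and later stages check the tag against a (hypothetical) policy.

RESTRICTED_SOURCES = {"memorial_hospital": "requires_signed_aup"}

def ingest(record: dict, source: str) -> dict:
    # Tag every record with its source so downstream stages can enforce policy.
    record["_provenance"] = source
    return record

def can_analyze(record: dict, user_agreements: set) -> bool:
    # The analysis stage checks whether the user satisfies the source's policy.
    policy = RESTRICTED_SOURCES.get(record.get("_provenance"))
    return policy is None or policy in user_agreements

row = ingest({"patient_id": 42}, "memorial_hospital")
print(can_analyze(row, user_agreements={"requires_signed_aup"}))  # True
print(can_analyze(row, user_agreements=set()))                    # False
```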

Automation is critical to each stage

It’s important to make the most of the connections, or lineage, between the processing stages, especially because modern data environments are characterized by data sets that grow too fast, and are too complex, to handle manually. In past eras, for example, a user could notice that an input source had switched the city and state fields and pass that tip on to a programmer, who would manually alter the ingestion tools. But if you’re adding new data sets each month and handling thousands of fields, it’s simply inefficient to bother programmers with routine changes — automation is crucial.

Types of metadata

Each organization can find unique ways to use one stage of data analytics to improve another. Interventions at each of these stages have at least one thing in common: they require metadata about the data. Metadata may be:

  • provided by the source, such as the names of columns in a table.
  • derived from the data or its context, such as a filename, a timestamp that notes when the data was ingested, a field size, or a source’s network address.
  • generated as a by-product by a stage of processing (e.g., when the cleansing stage tallies the number of incorrect fields found).
  • added deliberately by a stage of processing, as when the ingestion or cleansing stage adds an access control field.
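
A single field’s metadata record might accumulate entries from all four of these origins. The sketch below is purely illustrative; the key names and values are invented.

```python
# A sketch of how one field's metadata record might accumulate entries
# from each of the four origins listed above. Keys are invented.
from datetime import datetime, timezone

field_metadata = {
    # Provided by the source: the column name from the incoming table.
    "name": "country_code",
    # Derived from the data or its context at ingestion time.
    "ingested_at": datetime.now(timezone.utc).isoformat(),
    "source_file": "customers.csv",
    # Generated as a by-product of cleansing: tally of bad values found.
    "invalid_value_count": 17,
    # Added deliberately by a processing stage: an access-control tag.
    "access_tag": "public",
}

print(field_metadata)
```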

Ben Sharma, CEO of Zaloni, an enterprise data management company, recommends that organizations focus on three types of metadata:

Business metadata: Names and descriptions of data fields, and business rules that make sense to non-technical business users.

Operational metadata: Source and target locations of data, how many records were rejected during data preparation or a job run, and the success or failure of that run itself.

Technical metadata: The data’s type and format (text, images, etc.), and the structure or schema.
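
These three categories could be captured, for one hypothetical data set, in a structure like the following; the locations, rules, and field names are made up for illustration.

```python
# A sketch of the three metadata categories for one hypothetical data set.
dataset_metadata = {
    "business": {
        "description": "Customer master list maintained by the sales team",
        "rules": ["country_code must be a two-letter ISO code"],
    },
    "operational": {
        "source_location": "s3://example-bucket/raw/customers/",
        "target_location": "s3://example-bucket/clean/customers/",
        "records_rejected": 17,
        "last_run_status": "succeeded",
    },
    "technical": {
        "format": "csv",
        "schema": {"customer_id": "int", "name": "string", "country_code": "string"},
    },
}

print(dataset_metadata["operational"]["last_run_status"])
```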

Metadata can also be used to meet regulatory and contractual requirements (access control is one common example).

Tools for handling metadata

You can use the same tools and data stores to handle metadata that you use for the data itself. For instance, for every field or column of data ingested, you can create a row in a metadata table. This row would contain whatever information the stages of processing need to know (such as the length of the column, access control information, or the number of corrupted or incorrect values found).
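
For example, a minimal version of such a metadata table, using SQLite and invented column names, might look like this:

```python
# A minimal sketch using SQLite: one metadata row per ingested column.
# Table and column names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE field_metadata (
        dataset        TEXT,
        field_name     TEXT,
        field_length   INTEGER,
        access_tag     TEXT,
        corrupt_values INTEGER
    )
""")
conn.execute(
    "INSERT INTO field_metadata VALUES (?, ?, ?, ?, ?)",
    ("customers", "country_code", 2, "public", 17),
)
for row in conn.execute("SELECT * FROM field_metadata"):
    print(row)
```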

Each stage of processing can also send a user an alert when something goes wrong. Metadata is critically important when a user or programmer needs to investigate the cause of a problem. The metadata can store statistics about each stage, and help identify where the processing went off track, as well as identify the field responsible for the problem.
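
A simple way to surface such alerts is to record per-stage statistics and check them against a threshold; the stage names, counts, and threshold below are assumptions for illustration, not prescriptions.

```python
# A sketch of per-stage statistics with a simple alert threshold.
import logging

logging.basicConfig(level=logging.WARNING)

stage_stats = [
    {"stage": "ingestion", "records_in": 10_000, "records_out": 10_000},
    {"stage": "cleansing", "records_in": 10_000, "records_out": 8_200},
    {"stage": "analysis",  "records_in": 8_200,  "records_out": 8_200},
]

REJECTION_ALERT_THRESHOLD = 0.10  # alert if a stage drops more than 10% of records

for s in stage_stats:
    dropped = (s["records_in"] - s["records_out"]) / s["records_in"]
    if dropped > REJECTION_ALERT_THRESHOLD:
        logging.warning("Stage %s dropped %.0f%% of records", s["stage"], dropped * 100)
```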

In a new O’Reilly report, Managing the Data Lake: Moving to Big Data Analysis, we examine solutions for managing your data lake, covering topics such as data acquisition and ingestion, metadata (cataloging), access control, and more. Download the free report for more information.

This post is a collaboration between O’Reilly and Zaloni. See our statement of editorial independence.
