Managing file metadata using HCatalog

Organizing data in specific directories based on content and source provides the foundation for a well-managed Data Lake. In addition to file location, a managed Data Lake should capture the key attributes and structure of each file. For example, for the sales table being ingested into the Data Lake at data/stage/salesdb01/sales, the attributes are as follows:

  • Structure of the file: For example, fixed-length, delimited, XML, JSON, sequence file, or columnar (RC)
  • Fields/columns in the data file: For example, fiscal quarter, $amount
  • Data types of the fields: For example, integer, string, and double
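
As a rough illustration of the three attributes above, the following Python sketch records them for the sales table and renders them as a Hive-style CREATE EXTERNAL TABLE statement, the kind of DDL that HCatalog works with. The FileMetadata class, the to_ddl helper, the column names fiscal_quarter and amount, the comma delimiter, and the leading slash on the location are all assumptions made for this example and do not come from the book.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FileMetadata:
    """Key attributes of one ingested dataset (hypothetical helper class)."""
    table: str
    location: str                    # directory that holds the data files
    structure: str                   # "delimited", "sequence", or "rc"
    columns: List[Tuple[str, str]]   # (column name, data type) pairs
    delimiter: str = ","             # assumed; only used for delimited files


# Map a structure label to the matching Hive STORED AS keyword.
_STORAGE = {"delimited": "TEXTFILE", "sequence": "SEQUENCEFILE", "rc": "RCFILE"}


def to_ddl(meta: FileMetadata) -> str:
    """Render the metadata as a Hive-style CREATE EXTERNAL TABLE statement."""
    cols = ", ".join(f"{name} {dtype.upper()}" for name, dtype in meta.columns)
    parts = [f"CREATE EXTERNAL TABLE {meta.table} ({cols})"]
    if meta.structure == "delimited":
        parts.append(
            f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{meta.delimiter}'"
        )
    parts.append(f"STORED AS {_STORAGE[meta.structure]}")
    parts.append(f"LOCATION '{meta.location}'")
    return "\n".join(parts) + ";"


# Illustrative values for the sales table; the column names are assumed.
sales = FileMetadata(
    table="sales",
    location="/data/stage/salesdb01/sales",
    structure="delimited",
    columns=[("fiscal_quarter", "string"), ("amount", "double")],
)
ddl = to_ddl(sales)
print(ddl)
```

Keeping the structure, columns, and types in one record next to the location means the table definition can be regenerated whenever the file layout changes, instead of being maintained by hand.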

Apache HCatalog provides a table management system for the HDFS-based filesystem. It provides the ...
