Managing file metadata using HCatalog

Organizing data in specific directories based on content and source provides the foundation for a well-managed Data Lake. In addition to file location, a managed Data Lake should capture the key attributes and structure of each file. For example, for the sales table being ingested into the Data Lake in data/stage/salesdb01/sales, the attributes are as follows:

  • Structure of the file: For example, fixed length, delimited, XML, JSON, sequence, and columnar (RC)
  • Fields/columns in the data file: For example, fiscal quarter, $amount
  • Data types of the fields: For example, integer, string, and double
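
The attributes listed above can be pictured as one small metadata record per ingested file. The following Python sketch is only an illustration of that idea and is not part of HCatalog or any HDInsight API; the class names `Column` and `FileMetadata`, the `summary` helper, the column names `fiscal_quarter` and `amount`, and the choice of a delimited file are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    """One field/column in the data file (hypothetical helper type)."""
    name: str       # column name, e.g. "fiscal_quarter"
    data_type: str  # data type, e.g. "integer", "string", "double"


@dataclass
class FileMetadata:
    """Key attributes a managed Data Lake might record for a file."""
    location: str   # directory holding the file in the Data Lake
    structure: str  # e.g. "fixed length", "delimited", "XML", "JSON"
    columns: List[Column] = field(default_factory=list)

    def summary(self) -> str:
        # Render the record as a single human-readable line.
        cols = ", ".join(f"{c.name}:{c.data_type}" for c in self.columns)
        return f"{self.location} [{self.structure}] ({cols})"


# Metadata for the sales table; the structure and column names are assumed.
sales = FileMetadata(
    location="data/stage/salesdb01/sales",
    structure="delimited",
    columns=[
        Column("fiscal_quarter", "string"),
        Column("amount", "double"),
    ],
)

print(sales.summary())
```

Keeping location, structure, and column types together in one record is what lets downstream tools read the file without guessing its layout.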

Apache HCatalog provides a table management system for the HDFS-based filesystem. It provides the ...
