Managing file metadata using HCatalog
Organizing data in specific directories based on content and source provides the foundation for a well-managed Data Lake. In addition to file location, a managed Data Lake should capture the key attributes and structure of each file. For example, for the sales table being ingested into the Data Lake in data/stage/salesdb01/sales, the attributes are as follows:
- Structure of the file: For example, fixed length, delimited, XML, JSON, sequence, or columnar (RC)
- Fields/columns in the data file: For example, fiscal quarter, $amount
- Data types of the fields: For example, integer, string, and double
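The three attributes listed above can be pictured as a small metadata record kept alongside the file location. The following Python sketch is only an illustration, not part of any HCatalog API: the `FileMetadata` class, the `to_hive_ddl` helper, the column names `fiscal_quarter` and `amount`, their types, and the comma delimiter are all assumptions made for the example. It records the attributes for the sales data and renders them as a Hive-style `CREATE EXTERNAL TABLE` statement, which is one common way such metadata ends up registered in a table catalog.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FileMetadata:
    """Hypothetical record of the attributes captured for one ingested dataset."""
    location: str                  # directory that holds the data files
    file_format: str               # e.g. "delimited", "json", "sequence"
    columns: List[Tuple[str, str]] = field(default_factory=list)  # (name, type)
    delimiter: str = ","           # only meaningful for delimited files

    def to_hive_ddl(self, table: str) -> str:
        """Render the metadata as a Hive-style external table definition."""
        if self.file_format != "delimited":
            # This sketch only covers delimited text files.
            raise ValueError(f"unsupported format: {self.file_format}")
        cols = ", ".join(f"{name} {dtype.upper()}" for name, dtype in self.columns)
        return (
            f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
            f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{self.delimiter}' "
            f"STORED AS TEXTFILE LOCATION '{self.location}'"
        )


# Assumed attributes for the sales data; column names and types are illustrative.
sales_meta = FileMetadata(
    location="data/stage/salesdb01/sales",
    file_format="delimited",
    columns=[("fiscal_quarter", "string"), ("amount", "double")],
)

ddl = sales_meta.to_hive_ddl("sales")
print(ddl)
```

Keeping the structure, columns, and types in one record means the same description can drive both table registration and validation of incoming files.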
Apache HCatalog provides a table management system for the HDFS-based filesystem. It provides the ...