Managing file metadata using HCatalog
Organizing data in specific directories based on content and source provides the foundation for a well-managed Data Lake. In addition to file location, a managed Data Lake should capture the key attributes and structure of each file. For example, for the sales table being ingested into the Data Lake at data/stage/salesdb01/sales, the attributes are as follows:
- Structure of the file: For example, fixed-length, delimited, XML, JSON, sequence, or columnar (RC)
- Fields/columns in the data file: For example, fiscal quarter, $amount
- Data types of the fields: For example, integer, string, and double
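The three attributes listed above can be pictured as a single metadata record per data set. The following is a minimal Python sketch for illustration only; it is not HCatalog's API. The class `FileMetadata`, its field names, and the column names `fiscal_quarter` and `amount` with their types are assumptions made for this example, loosely based on the sales table mentioned earlier.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FileMetadata:
    """Hypothetical record of the key attributes of one data set in the lake."""
    table: str
    location: str      # directory that holds the files
    file_format: str   # e.g. "delimited", "JSON", "columnar (RC)"
    # Each column is a (name, data type) pair.
    columns: List[Tuple[str, str]] = field(default_factory=list)

    def describe(self) -> str:
        """Return a one-line, human-readable summary of the record."""
        cols = ", ".join(f"{name} {dtype}" for name, dtype in self.columns)
        return f"{self.table} [{self.file_format}] at {self.location}: {cols}"


# Assumed values for the sales table; the format and columns are illustrative.
sales = FileMetadata(
    table="sales",
    location="data/stage/salesdb01/sales",
    file_format="delimited",
    columns=[("fiscal_quarter", "string"), ("amount", "double")],
)

print(sales.describe())
```

Keeping structure, columns, and data types together with the location is the kind of information a table management layer records, so that consumers do not have to infer it from the files themselves.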
Apache HCatalog provides a table management system for HDFS-based filesystems. It provides the ...