2. Architecture of Data Lakes

In this chapter, we define the most important features of data lake systems and, from there, outline an architecture for these systems. Our vision for a data lake system is based on a generic and extensible architecture with a unified data model that facilitates ingestion, storage and metadata management across heterogeneous data sources.

We also introduce a real-life data lake system called Constance, which supports sophisticated metadata management over raw data extracted from heterogeneous data sources. Constance discovers, extracts and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities. With embedded query rewriting engines that support structured and semi-structured data, Constance provides users with a unified interface for query processing and data exploration.
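To make the idea of extracting and summarizing structural metadata more concrete, the following is a minimal sketch in Python. It is not Constance's implementation; the function names `extract_structure` and `summarize`, the path notation (`a.b`, `a[]`), and the sample records are all assumptions introduced purely for illustration. The sketch walks semi-structured records, collects the path and type of every leaf value, and merges the results into one summary that exposes type conflicts between sources.

```python
from collections import defaultdict


def extract_structure(record, prefix=""):
    """Yield (path, type_name) pairs for every leaf value in a nested record."""
    if isinstance(record, dict):
        for key, value in record.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from extract_structure(value, path)
    elif isinstance(record, list):
        # All list elements share one path, marked with "[]".
        for item in record:
            yield from extract_structure(item, f"{prefix}[]")
    else:
        yield prefix, type(record).__name__


def summarize(records):
    """Merge per-record structures into a summary: path -> sorted observed types."""
    summary = defaultdict(set)
    for record in records:
        for path, type_name in extract_structure(record):
            summary[path].add(type_name)
    return {path: sorted(types) for path, types in summary.items()}


# Hypothetical records from two heterogeneous sources.
records = [
    {"id": 1, "name": "sensor-a", "readings": [{"t": 0.5}]},
    {"id": "2", "name": "sensor-b", "location": {"lat": 50.8}},
]
schema_summary = summarize(records)
```

In this sketch, a path that maps to more than one type (here `id`, which appears as both an integer and a string) marks the kind of ambiguity that semantic annotation would then need to resolve.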

2.1. Introduction

Big Data has undoubtedly become one of the most important challenges in database research. An unprecedented volume, a large variety and a high velocity of data need to be captured, stored and processed to provide us with knowledge. In the Big Data era, the trend of Data Democratization brings in a wider range of users and, at the same time, a higher diversity of data and more complex requirements for integrating, accessing and analyzing these data.

However, compared to other Big Data features such as “Volume” and “Velocity” (sometimes also including “Veracity” and “Value”), “Variety” ...
