5 A Use Case of Data Lake Metadata Management

Governing a data lake that holds a large volume of heterogeneous data requires metadata management; without it, the data lake turns into a data swamp that is invisible, incomprehensible and inaccessible to users. In this chapter, we present a use case of data lake metadata management applied to the health-care field, which is particularly known for its heterogeneous data sources.

We first present a data lake definition that is more detailed than the one given in the chapter dedicated to that topic, together with its underlying data lake architecture, on which we based the design of the metadata model. Second, we present a metadata classification that identifies the essential attributes adapted to the use case. Third, we introduce a conceptual metadata model that covers (i) structured, (ii) semi-structured and (iii) unstructured data, whether raw or processed. Fourth, we validate our proposal with an implementation of the conceptual model in two DBMSs (one relational database and one NoSQL database).

5.1. Context

The University Hospital Center (UHC) of Toulouse is the largest hospital center in the south of France. Approximately 4,000 doctors and 12,000 hospital staff handle more than 280,000 stays and 850,000 consultations per year. The information system of the hospital stores all the patient data, including medical images, biological results, textual hospital reports, PMSI (Programme de médicalisation ...
