Chapter Eight: Data Lake Maintenance

Initially, data lakes will not be well organized or maintained with a broader audience in mind. When data sources are first loaded into a data lake, their structure is much the same as it was before they arrived in the lake (Figure 8.1). These structures can be hard to understand and query, but they do not need much cleaning at this stage. Most of the cleaning takes place when the data warehouse is created.

Data lake maintenance focuses on three areas:

  • SQL
  • Extracting and loading of data sources
  • Performance

Figure 8.1 Source data being loaded with SQL into a data lake.

These maintenance activities can be expensive when data is extracted and loaded with custom scripts. Custom scripts require in-depth knowledge of the data: where it originates, the details of the source APIs, and the data structures inside it. They also carry a high likelihood of having to write new code and maintain existing code whenever a data source updates. We recommend avoiding manual extract and load; use tools like Fivetran or Stitch, which automatically handle data source updates, so that data engineers can focus on more critical tasks.
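To make the maintenance cost concrete, here is a minimal sketch of a hand-written load step. It assumes the source records arrive as Python dictionaries and uses an in-memory SQLite database as a stand-in for the lake; the function name `load_raw`, the table name `raw_customers`, and the sample fields are all hypothetical. It is not how Fivetran or Stitch work internally. It only illustrates the kind of code a team ends up owning when a source adds a field.

```python
import sqlite3


def load_raw(conn, table, records):
    """Load source records as-is into `table`.

    Returns the list of columns that had to be added because the
    source introduced fields the table did not have yet.
    Assumption: `table` and the field names come from trusted code,
    since they are interpolated into the SQL text.
    """
    # Collect field names in first-seen order across all records.
    fields = []
    for rec in records:
        for key in rec:
            if key not in fields:
                fields.append(key)
    if not fields:
        return []

    cur = conn.cursor()
    # Keep the source structure: one untyped TEXT column per field.
    col_defs = ", ".join(f'"{f}" TEXT' for f in fields)
    cur.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')

    # Detect schema drift: fields present in the source but not in the table.
    existing = {row[1] for row in cur.execute(f'PRAGMA table_info("{table}")')}
    added = [f for f in fields if f not in existing]
    for f in added:
        cur.execute(f'ALTER TABLE "{table}" ADD COLUMN "{f}" TEXT')

    col_list = ", ".join(f'"{f}"' for f in fields)
    placeholders = ", ".join("?" for _ in fields)
    cur.executemany(
        f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})',
        [tuple(rec.get(f) for f in fields) for rec in records],
    )
    conn.commit()
    return added


if __name__ == "__main__":
    lake = sqlite3.connect(":memory:")
    # First load: the table is created from the fields seen in the source.
    print(load_raw(lake, "raw_customers", [{"id": "1", "email": "a@example.com"}]))
    # Later the source starts sending a "plan" field; the script must cope.
    print(load_raw(lake, "raw_customers",
                   [{"id": "2", "email": "b@example.com", "plan": "pro"}]))
```

Even this small sketch handles only one kind of change, a new field. Renamed fields, changed types, API pagination, and authentication would each need more custom code, which is the ongoing cost that managed extract-and-load tools are meant to absorb.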

Why SQL?

While the data could be loaded into the lake in a variety of formats, we recommend SQL (Figure 8.1). It's the standard language for relational database management systems (which is what a Data Warehouse is ...
