Chapter 8. Considerations for the Data Engineer

The data engineer is the person on the data team responsible for collecting the data, storing it, processing it in batch or real time, and serving it through an API so that data scientists can easily explore it to build their experiments and models. Data engineers are often the first people called on when things break and the last to get credit when things go well, yet they are arguably the most critical part of operationalizing the data lake.

Data engineers are primarily responsible for managing the volume, variety, velocity, and veracity of the data in the data lake. When these “four Vs” are properly managed, data scientists and data analysts can more easily find value in the data, the fifth V.

As organizations embrace the data-driven model, data engineers are under constant pressure to learn more technologies and apply them to their development processes. Today they must understand distributed computing, data warehousing technologies, and the difference between transactional modeling and analytic modeling, as in OLTP versus OLAP systems.

To add to these many hats, data engineers also must be quality engineers. They need to monitor the quality of their data, the shape of their data, and their metadata statistics so that they can quickly spot a problem and correct it.
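As a rough illustration of this kind of monitoring, the sketch below profiles a batch of records (row count, column set, and per-column null fraction) and compares the profile against a baseline. It is a minimal example under stated assumptions, not a method taken from this chapter: the function names `profile` and `find_problems`, the threshold values, and the sample records are all hypothetical.

```python
def profile(rows):
    """Summarize a batch of dict records: row count, columns, null fraction."""
    columns = sorted({key for row in rows for key in row})
    total = len(rows)
    null_fraction = {
        # A key that is absent from a row is counted as null for that row.
        col: (sum(1 for row in rows if row.get(col) is None) / total) if total else 0.0
        for col in columns
    }
    return {"row_count": total, "columns": columns, "null_fraction": null_fraction}


def find_problems(baseline, current, max_null_increase=0.10, max_row_drop=0.50):
    """Compare two profiles and describe changes that exceed the thresholds.

    The thresholds are illustrative defaults, not recommended values.
    """
    problems = []

    # Shape check: columns that disappeared or appeared since the baseline.
    missing = set(baseline["columns"]) - set(current["columns"])
    added = set(current["columns"]) - set(baseline["columns"])
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if added:
        problems.append(f"unexpected columns: {sorted(added)}")

    # Volume check: a sharp drop in row count often signals an upstream failure.
    if current["row_count"] < baseline["row_count"] * (1 - max_row_drop):
        problems.append(
            f"row count fell from {baseline['row_count']} to {current['row_count']}"
        )

    # Quality check: null fraction per column, compared with the baseline.
    for col, fraction in current["null_fraction"].items():
        before = baseline["null_fraction"].get(col)
        if before is not None and fraction - before > max_null_increase:
            problems.append(
                f"null fraction for {col!r} rose from {before:.2f} to {fraction:.2f}"
            )
    return problems


# Hypothetical sample batches, for illustration only.
yesterday = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 12.5},
    {"id": 3, "amount": 7.0},
    {"id": 4, "amount": 3.2},
]
today = [
    {"id": 5, "amount": None},
    {"id": 6, "amount": None},
    {"id": 7, "amount": 4.0},
    {"id": 8},
]

for problem in find_problems(profile(yesterday), profile(today)):
    print(problem)
```

In practice the baseline profile would likely be stored alongside the dataset's other metadata, so that each new batch can be checked as it lands rather than after a downstream consumer reports a problem.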

The upshot: today, data engineers need to master almost an entire development methodology to ...

