Chapter FiveChoosing an Engine for the Data Lake

Lakes typically run on a different database engine than transactional source databases. As mentioned, transactional databases such as PostgreSQL or MySQL are optimized for doing the massive number of relatively simple queries that are required to support a working product.

Data lakes, however, are usually hosted on what is called a data warehouse engine, which is built for more complex analytical queries on larger volumes of data.

Think of transactional databases as a scooter, and data warehouse engines as a semi‐truck (Figure 5.1). They're both valuable vehicles, but clearly have different jobs. It is impractical to deliver the mail door to door with a semi‐truck, just as it is impractical to serve thousands of small requests to different places per second from a data warehouse engine. The semi‐truck is suited to deliver many tons of packages across the country to smaller distribution hubs, much like a data warehouse engine is suited to performing big analytical queries to a BI tool.

image

Figure 5.1

Get The Informed Company now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.