Chapter 10. Data Virtualization Service

With the data ready, we can now start writing the processing logic that generates insights. Three trends in big data deployments need to be taken into account to design this logic effectively. First, datasets follow polyglot data models. For instance, graph data is best persisted and queried in a graph database; other models include key-value, wide-column, document, and so on. Polyglot persistence applies both to lake data and to application transactional data. Second, decoupling query engines from data storage allows different query engines to run queries on data persisted in the lake. For instance, short, interactive queries run on Presto clusters, whereas long-running batch processes run on Hive or Spark. Typically, multiple processing clusters are configured for different combinations of query workloads, so selecting the right cluster type is key. Third, for a growing number of use cases such as real-time BI, data in the lake is joined with application sources in real time. As insights generation becomes increasingly real-time, there is a need to combine historic data in the lake with real-time data in application datastores.
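To make the second trend concrete, the following is a minimal sketch of how a query might be routed to a cluster type based on its workload. It is illustrative only: the `Query` class, the `pick_engine` function, the runtime estimate, and the 60-second threshold are all hypothetical names and values chosen for this example, not part of any particular query engine's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """A query plus hypothetical workload hints supplied by the caller."""
    sql: str
    interactive: bool          # True for short, ad hoc queries
    est_runtime_seconds: int   # assumed estimate, not measured by an engine


def pick_engine(query: Query, interactive_limit_seconds: int = 60) -> str:
    """Return the name of the cluster type a query should be sent to.

    Short interactive queries go to a Presto cluster; everything else is
    treated as a long-running batch job and goes to Spark. The threshold
    is an assumption made for illustration.
    """
    if query.interactive and query.est_runtime_seconds <= interactive_limit_seconds:
        return "presto"
    return "spark"


if __name__ == "__main__":
    dashboard = Query("SELECT count(*) FROM orders", True, 5)
    nightly = Query("INSERT INTO daily_rollup SELECT ...", False, 3600)
    print(pick_engine(dashboard))
    print(pick_engine(nightly))
```

In practice the routing decision would likely depend on more signals than these two, such as data size, concurrency, and cluster load; the sketch only shows the shape of the choice that data users otherwise have to make by hand.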

Given these trends, data users need to keep up with the changing technology landscape, gain expertise in evolving data models and query engines, and efficiently join data across silos. This leads to a few ...

The Self-Service Data Roadmap by Sandeep Uttamchandani
