Chapter 10. Data Virtualization Service
With the data ready, we can now start writing the processing logic for generating insights. Three trends in big data deployments need to be taken into account to design the processing logic effectively. The first is the polyglot data models associated with the datasets. For instance, graph data is best persisted and queried in a graph database; other models include key-value, wide-column, and document. Polyglot persistence applies to both lake data and application transactional data. The second is the decoupling of query engines from data storage, which allows different query engines to run queries on data persisted in the lake. For instance, short, interactive queries run on Presto clusters, whereas long-running batch processes run on Hive or Spark. Typically, multiple processing clusters are configured for different combinations of query workloads, so selecting the right cluster type is key. The third is that, for a growing number of use cases such as real-time BI, the data in the lake is joined with the application sources in real time. As insights generation becomes increasingly real-time, there is a need to combine historic data in the lake with real-time data in application datastores.
Given these trends, data users need to keep up with the changing technology landscape, gain expertise in evolving data models and query engines, and efficiently join data across silos. This leads to a few ...
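To make the second trend concrete, the following is a minimal sketch of routing a query to a cluster type based on how long it is expected to run. It is an illustration only: the `Query` class, the `choose_engine` function, the 60-second threshold, and the idea that a runtime estimate is available up front are all assumptions, not something the chapter prescribes. The label `"spark"` stands in for the batch side, where Hive would be an equally valid choice.

```python
from dataclasses import dataclass

# Hypothetical cutoff: queries expected to finish within this many seconds
# are treated as interactive. The source text gives no specific number.
INTERACTIVE_MAX_SECONDS = 60.0


@dataclass(frozen=True)
class Query:
    """A query plus a runtime estimate (assumed to come from a planner or past runs)."""
    sql: str
    estimated_seconds: float


def choose_engine(query: Query) -> str:
    """Pick a cluster type: 'presto' for short interactive queries, 'spark' for long batch jobs."""
    if query.estimated_seconds < 0:
        raise ValueError("estimated_seconds must be non-negative")
    if query.estimated_seconds <= INTERACTIVE_MAX_SECONDS:
        return "presto"
    return "spark"


if __name__ == "__main__":
    dashboard = Query("SELECT count(*) FROM orders WHERE day = current_date", 3.0)
    nightly = Query("INSERT INTO daily_rollup SELECT ... FROM events", 5400.0)
    print(choose_engine(dashboard))
    print(choose_engine(nightly))
```

In practice the routing decision would likely weigh more than estimated runtime, for example data size, concurrency, and cluster load; a single threshold is used here only to keep the idea visible.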