Chapter 4. Advanced Architectural Components
In the previous chapter, we discussed the basic architectural components that are found in most DV Systems. In this chapter, we will discuss some advanced components that may not be available in all systems. These advanced components either improve performance or add functionality to the DV System. It is important for DV System users to understand these components and how they work since, in many cases, these components can be parameterized to adjust their functionality based on user preferences.
Caching
A core tenet of data virtualization is that it does not require moving data to new locations. Instead, data is left at its original source, and the DV System extracts it on the fly as needed. One disadvantage of this approach is that a dataset may need to be sent over the network multiple times from the underlying data sources to the DV Engine if it is repeatedly accessed by a series of queries. Although we explained in Chapter 2 that faster networks can often prevent this data transfer from becoming a bottleneck, too many unnecessary data transfers can nonetheless overload a network. Furthermore, as we will see in Chapter 5, when underlying data sources make their data available via slow interfaces such as Java Database Connectivity (JDBC), data transfer bottlenecks still very much exist today. Therefore, repeatedly transferring the same datasets is not only inefficient in terms of cost, CPU, and power consumption, but it can also degrade the performance of the DV System.
This disadvantage is alleviated via the principle of caching, in which recently accessed data remains temporarily stored in memory or on disk at the DV Engine and is available for subsequent requests.
There are many different ways to cache data, and some DV Systems use multiple different caching approaches. In this section we will discuss the following five approaches:
- Query cache
- Block/partition cache
- Database table cache
- Automated pre-computation based cache
- Materialized view caching
Most DV Systems will not implement all five approaches, but rather will devote their local storage resources to the subset of these approaches that is most appropriate for the architecture of that system.
In each case, the cache functions as a (temporary) local copy of some subset of data that would otherwise require communication with remote data sources. Over time, data may be updated at the remote data sources, thereby rendering the data in the cache stale. Most analysis tasks can tolerate a bounded degree of staleness in the cache; therefore, many DV Systems include parameters that allow users to specify what degree of staleness is acceptable. Nonetheless, an effort must be made to keep the degree of staleness in the cache within the specified bounds. One of four approaches is typically used to limit staleness:
- The DV System automatically removes data from the cache that has been present there beyond the degree of acceptable staleness.
- The DV System sends queries to the underlying data sources on a regular basis to check the status of data and, if anything has changed, to refresh the cache. The worst-case staleness of the cache is thus bounded by the frequency of these checks and refreshes.
- The DV System processes the log records of the underlying data source to detect source data changes that may affect the current data in the DV System cache.
- The underlying data source may notify the DV System directly if the underlying data has changed.
The first option is the easiest to implement, but it may remove data from the cache unnecessarily (i.e., even if it had not been updated at the underlying system). The second option is also fairly straightforward to implement, but it consumes the most processing resources due to the repeated communication between the DV System and underlying data source, along with the cost of performing the repeated checks at the underlying data source. The third and fourth options are more efficient, but they require a deeper integration between the DV System and the underlying data source. The third requires the DV System to know how to read the log records of the underlying data source (each data source may use a different format for its log records), and the fourth requires change notification functionality in the underlying data source—a feature not supported by all data source systems.
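As a concrete, hypothetical illustration of the second option, a DV System might periodically issue a lightweight probe against the source and refresh the cached copy only when the returned values change. The table and column names below are illustrative, and the sketch assumes the source table exposes a modification timestamp, which many sources do not; in that case a row count or checksum probe might be used instead.

-- Hedged sketch of refresh option 2: a periodic staleness probe.
-- Assumes a hypothetical orders table with a last_modified column.
SELECT max(last_modified) AS source_version,
       count(*)           AS source_rows
FROM source_schema.orders;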
When caches run out of space, they need to remove data from the cache to make room for new data. Most caches use some form of the least recently used (LRU) algorithm to decide which data to remove. Data that has been in the cache for a long time without being used is therefore removed before actively used data.
Query Cache
Many DV Systems implement some form of a query cache. Once a query has been executed, the query results are stored either in memory or on some form of a fast disk. If, in the future, the same query is submitted to the DV System, then the system can reuse the previously stored results rather than wasting resources to process the exact same query again.
Some queries cannot be cached. For example, if the query contains nondeterministic SQL functions such as now() or rand(), the same query may return different results every time it is run. Therefore, there is no benefit to caching the query results of previous runs.
Query caches must enforce security rules to ensure that a user without access to run the original query cannot see the results of the cached query either. In practice, this is done by sending the query to the Query Parser described in Chapter 3, since this is the component that typically enforces the security rules. The parser is also helpful in checking to see if two queries are identical even if they are expressed slightly differently.
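For example, the following two queries (over hypothetical customers and orders tables) differ in formatting, keyword casing, and aliasing, but the parser can recognize that they are semantically identical and serve both from a single cached result:

-- Two differently written but semantically identical queries; a query cache
-- that canonicalizes queries through the parser can reuse one stored result.
SELECT c.region, sum(o.amount)
FROM customers c JOIN orders o ON c.id = o.customer_id
GROUP BY c.region;

select customers.region, sum(orders.amount)
from customers join orders on customers.id = orders.customer_id
group by customers.region;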
Of the cache refresh options we discussed previously, the second option is typically prohibitively expensive for query caches. This is because in order to check if the cache needs to be refreshed, the query needs to be run again. If the frequency of the checks is higher than the frequency of repeated queries submitted to the DV System, the cost of maintaining the query cache may be higher than the processing reduction benefits it provides. Therefore, query caches are typically implemented via one of the other three refresh options discussed.
Block/Partition Cache
Many data source systems partition the data they store. For example, if the data source system runs over a cluster of disks, the data will be partitioned across those disks. Even if the system runs on a single machine that has a single disk, the system may choose to partition data across the storage of that machine to accelerate queries over the stored data—such as dividing data by zip code so that queries that access a particular zip code can quickly find the relevant data. File formats such as Parquet and ORC, table formats such as Iceberg and Delta, and underlying database systems such as PostgreSQL, Teradata, or Oracle all support data partitioning.
A partition cache stores entire source partitions at the DV System. For example, the partition associated with the zip code 01776 may be cached at the DV System, while other zip code partitions are not.
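To make this concrete, consider a hypothetical table partitioned by zip code, shown here in Hive-style DDL (the exact partitioning syntax varies by system). A query that filters on a single zip code touches only that partition, which is exactly the unit a partition cache keeps locally:

-- Hypothetical partitioned table: each zip code's rows live in their own partition.
CREATE TABLE sales_by_zip (
    sale_id BIGINT,
    amount  DECIMAL(10,2)
)
PARTITIONED BY (zip_code STRING);

-- Only the 01776 partition must be read from the source (or served from the
-- partition cache) to answer this query.
SELECT sum(amount)
FROM sales_by_zip
WHERE zip_code = '01776';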
For DV Systems that incorporate a local DV processing engine, a partition cache can significantly accelerate query processing. However, this is only the case when the DV platform receives mostly raw data from the underlying data storage (i.e., entire partitions of raw data) instead of pushing down the query processing to the storage system. Thus, push-based DV Systems (see Chapter 3) generally do not include a partition cache.
A partition cache does not necessarily need to use the same notion of a partition as the underlying data source system. As discussed in Chapter 3, many DV Systems subdivide source partitions into smaller chunks (often called splits), and the partition cache may choose to operate at the granularity of a split instead of a complete source partition.
With regard to the cache refresh options we discussed previously, the second option is far more feasible for partition caches than for query caches. This is because in many cases, it is easy to immediately determine if a partition has been updated or removed at the data source without issuing a separate query. For example, if the source data is a raw data file in a data lake, the DV System can perform this check directly using file metadata.
Access to the data in the partition cache needs to use the same authorization mechanisms as the underlying data.
In general, the trade-off between a partition cache and a query cache is as follows: data in a partition cache has a higher probability of being utilized in future query processing than data in a query cache. This is because this data is raw and is usable by many different types of future queries. On the other hand, the partition cache saves less work for the DV platform relative to a query cache, since the DV platform still has to process the rest of the query once the raw data is received. In contrast, data in a query cache already contains the final results of query processing.
Database Table Cache
In theory, a database table cache is equivalent to a partition cache in which all partitions from a table are cached or none of them are. However, in practice, partition caches and table caches are typically used in different scenarios. As mentioned in the previous section, partition caches cannot be used for underlying data source systems that do not partition data. Even if they do partition data, if these partitions are not exposed to external systems in a manner where keeping track of and extracting individual partitions is easy, the utility of partition caches is reduced significantly. In such scenarios, table caches are more feasible since it is always possible to keep track of what tables exist (and extract data from any given table) for any underlying data source system that incorporates the concept of tables in its local model of data storage. Thus, table caching becomes the primary mechanism to cache raw data when partition caching is not possible. The main downside of table caching relative to partition caching is that it is less memory-efficient. Even if only part of a table is repeatedly accessed, the entire table must be kept in cache.
In practice, partition caching is commonly used in DV Systems when working with data stored using open data formats in a data lake. Table caching is more common when working over complex data management systems such as SQL Server, Db2, or MySQL.
Another difference (in practice) between table caching and partition caching is that while the partition cache is typically stored in the DV System directly, the table cache may be stored on an external storage system or a data lake. This is because the table cache typically requires more space since entire tables must be cached. One might ask: “If the table cache is stored on an external storage system, how is the cache helpful at all? The DV System should simply read the data from the original data source directly since the original data source is also an external system.” The answer to this question is that in many cases, the external storage system used for the table cache is physically closer to the DV System than the data source system, or is otherwise faster to access and read data from than the data source system. This is especially true if the data source system has an ongoing operational workload that may slow down read requests by the DV System. Furthermore, offloading these read requests to a different external system allows the operational system to devote more resources to its normal operational workload.
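As a rough sketch of this pattern, a DV System might materialize an entire source table into an external data lake catalog using a CREATE TABLE AS SELECT statement. The three-part catalog, schema, and table names below are hypothetical and follow the style used by some DV engines:

-- Copy the full table from the operational source into storage that is
-- closer to (and faster for) the DV Engine than the source system itself.
CREATE TABLE lake_catalog.cache_schema.customers AS
SELECT * FROM sqlserver_catalog.dbo.customers;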
Similar to the other caching approaches we have discussed, the security rules used for the original tables must also be applied to the replica of the data that is cached.
Automated Pre-Computation Based Cache
The automated pre-computation based cache is a generalization of the query cache. As previously described, the query cache only caches the results of a previous query submitted to the DV System. In contrast, the automated pre-computation based cache attempts to cache a larger result set than that which was returned to any particular query, so even queries that have never been submitted to the system can still be answered from the data in the cache. The DV platform analyzes the log of previous queries that have been executed by the system, and then attempts to extrapolate based on the pattern of these previous queries to prepare data for future queries by widening the scope of these previous queries. For example, a query such as:
SELECT sum(sales) FROM sock_table WHERE date = '01/01/1970'
would result in a single value, the sales of socks on January 1, 1970. This query can be widened to include more aggregations (aside from only sum(sales)) and more dates (aside from only 01/01/1970) as follows:
SELECT date, sum(sales), min(sales), max(sales), avg(sales),
       sum(inventory), min(inventory), max(inventory), avg(inventory)
FROM sock_table
GROUP BY date
The results of this widened query are stored in the automated pre-computation based cache, and future queries that have a similar structure to the original query can potentially be redirected to this cache. The staleness problems of the query cache that we discussed also apply to the automated pre-computation based cache, so the results need to be updated on an ongoing basis in the same way as for any query cache.
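For instance, assuming the widened result is stored in the cache as a table named precomputed_sock_stats (a hypothetical name, with one row per date and hypothetical column names such as max_sales), a query the system has never seen before, such as:

SELECT max(sales) FROM sock_table WHERE date = '03/15/1970'

can be rewritten by the DV Engine to read the cached table instead of the underlying source:

SELECT max_sales FROM precomputed_sock_stats WHERE date = '03/15/1970'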
Materialized View Caching
Both the query cache and the automated pre-computation based cache are fully automated. The user of the DV System does not control what is stored in the cache, when data is removed, or which queries are redirected to these caches. In contrast, a materialized view cache is fully controlled by the DV System user. If the system user is aware of a type of query that will be repeatedly sent to the system, a materialized view can be created to accelerate the performance of this query. The user controls which materialized views are created and when they can be removed. Meanwhile, the DV System redirects queries to these materialized views if they contain the relevant data for any given input query.
These materialized views may be stored in the DV System, the data source system, or even an external data lake. Similar to the other caching approaches we have discussed, the equivalent security rules used for the raw data must also be used for the materialized view cache.
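As a sketch (the table, view, and column names here are hypothetical), a user who knows that per-region daily sales totals will be queried repeatedly might create:

CREATE MATERIALIZED VIEW daily_sales_by_region AS
SELECT region, sale_date, sum(amount) AS total_sales
FROM orders
GROUP BY region, sale_date;

Subsequent queries that aggregate orders by region and date can then be answered from this materialized view rather than by re-reading the underlying source, provided the DV System's query rewriter recognizes the match.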
DV Engine–Initiated Writes to Underlying Data Sources
In general, the primary use case for DV Systems is for querying, analyzing, and processing data from the original data sources that are being virtualized. In other words, data generally flows one way through the system: from the data source through the DV Engine to the client. Nonetheless, some DV Systems support bidirectional data flow where clients also can write or update data in the underlying data sources.
There are several challenges that come into play when supporting bidirectional data flow.
First, in some cases, the underlying data source does not give the DV System access to all data stored locally, but instead provides the system with access to a view of the data. For example, an underlying data source in Europe may contain private or personally identifiable information (PII) that is regulated by the General Data Protection Regulation (GDPR), which prevents it from being transferred to locations with lower levels of privacy protection. If the DV Engine is located in one of those locations, the underlying data source will only give it a view that does not include the private information. If the DV Engine wants to insert new entries into this view, it will not be able to include personal information since the view itself does not include personal information. Such writes will be rejected when they are forwarded to the original data source if some of the attributes of this personal data are required fields in the database.
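A simplified, hypothetical illustration of this problem (all table, view, and column names are made up):

-- The European source exposes only a PII-free view to the DV Engine.
CREATE VIEW customers_eu_safe AS
SELECT customer_id, country, signup_date
FROM customers;    -- the base table also contains name, email, and other PII

-- An insert routed through the view cannot supply the omitted PII columns.
-- If, say, email is declared NOT NULL in the base table, the source rejects
-- the write when the DV Engine forwards it.
INSERT INTO customers_eu_safe (customer_id, country, signup_date)
VALUES (42, 'DE', '2024-01-15');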
In general, writing back to views is a well-known problem in the field of database systems and is in fact impossible in most nontrivial use cases. So when the DV System is working over a view, bidirectional data flow is often impossible.
Second, as we discussed previously and in Chapter 3, one of the primary advantages of pull-based DV Systems is that they require a much shallower knowledge of the API of each system they need to work with. As long as the DV System knows how to extract raw data from that system, it can perform its needed functionality. Supporting bidirectional flow gets rid of much of this advantage, since the DV System will need to acquire a deeper knowledge of the API in order to express writes back to the underlying data source in a way that upholds the isolation, atomicity, and consistency guarantees of the underlying system.
Third, in many cases, writes may cause consistency problems between the cache in the DV System and the data in the underlying data source.
In general, most modern DV Systems do not support bidirectional data flow. However, push-based systems are more likely to support bidirectional data flow than pull-based systems since the second challenge is not relevant to push-based systems.
Multiregion (and/or Multicloud) DV Systems
We have discussed at length that DV Systems are designed to run over many different types of data sources. Until now, however, we have not discussed the physical location of those data sources. In practice, it is rare for all of the data sources for a nontrivial DV System deployment to be physically located in the same data center. Instead, they are often located in different data centers and often in different geographical regions (and countries).
There are many reasons why an organization may have data sources in different regions. Examples include:
- The organization has recently migrated some of its applications to the cloud. However, the rest of the applications are still running on-premises.
- An organization acquires or merges with another organization that does business in a different part of the world and whose datasets are kept close to where the application does business.
- An organization that does business globally chooses to keep data close to where it is most often accessed from for performance reasons.
- Regulatory requirements, such as data sovereignty rules, force the organization to keep certain datasets in certain regions.
- One cloud is more cost-efficient for the access patterns for a particular application, but a different cloud is cheaper for the access patterns for a different application within the same organization.
When data sources exist in different regions, additional complexities are added to the architecture of DV Systems, especially pull-based systems. This is because pulling data across different geographical regions can be more expensive in terms of latency and money (because of egress charges) than pulling data from sources within the same data center. Furthermore, there may be regulatory issues around the movement of data from data sources with data sovereignty requirements to the DV System for data processing.
Multiregion DV Architecture
The fundamental systems architecture principle used to circumvent the issues discussed in the previous section is that the DV System cannot run in only a single region. Rather, components of the DV System must run within the same region as each potential data source.
When the DV Engine needs to pull data out of a data source, it pulls it out into the component of the DV System running near that source. If data needs to be extracted out of a data source in Germany, it is sent to DV Engine code running in Germany; if out of a data source in Japan, it is sent to DV Engine code running in Japan. The DV Engine code running at each region therefore processes data sent to it from that region. Figure 4-1 outlines this architecture.
This figure shows a DV platform with data sources in two regions and a data consumer issuing a data processing request at one of those regions. There are three potential scenarios for how the query is processed depending on what data is accessed by that request:
- If the request only accesses data within the same region as where the request is initiated, the request is simply processed locally using the standard DV Engine architecture: the query is parsed, optimized, and planned locally and then processed using the pull-based or push-based DV Engine architecture (depending on the design of the DV System being used). All data transfers are local.
- If the request only accesses data from a remote region (e.g., it is submitted to region 1 but only accesses data from region 2), the request is forwarded in its entirety to region 2. Some parsing was already performed in region 1 (in order to determine which data sources were being accessed), but the rest of the parsing, optimizing, and planning will be done in region 2, along with the execution of the request. If the results of the request are allowed to be sent to the consumer in region 1 (e.g., no data sovereignty rules will be broken, and the result set is not too large), then they can be sent there. Otherwise, the consumer will have to access the results directly from region 2 (e.g., via a VPN).
- If the request accesses data from multiple regions, then instances of the DV Engine in both regions must work together to process the request. In this case the region to which the request was submitted performs the initial query parsing, optimization, and planning. During this planning process, data within a remote region is treated as a giant unified data source. This process is very natural for push-based systems. Just as a normal input request is divided into subrequests—one for each underlying data source—so too a subrequest will be generated that processes data at the combined data source. All data processing that needs to happen on any data within that region is combined into that subrequest and sent to that region. The DV Engine at that region receives that subrequest and further divides it into smaller subrequests—one for each data source that needs to be accessed at that region.
However, the process is a little less natural for pull-based systems since they initially must function like a push-based system, dividing a request into subrequests for each region and sending the subrequest to each region. It is only after the subrequest arrives at that region that the system can revert back to its normal pull-based processing.
The region to which the processing request was submitted is typically also the one that combines the results of the subrequests and performs any further necessary processing after combining them. However, in some cases a different region is chosen (e.g., if the intermediate result set from that region is much larger than the intermediate results from any other region). Care must be taken to ensure that data sovereignty rules are not violated when sending data across regions at this later stage of query processing (see more later in this section).
We described in Chapters 2 and 3 that query optimization in DV Systems is complex, and much more so for push-based systems, since the optimizer needs to reason about cost equations and intermediate result set sizes of remote processing. Since the multiregion DV architecture must always use push-based processing when sending work to remote regions, some of this optimization complexity also applies here. However, since the code running at remote regions is part of the same DV System, the optimizer within the DV Engine has detailed knowledge about how the different query processing options will be run there and can estimate their anticipated costs. This makes query optimization easier relative to optimizers for standard push-based DV Engines running over foreign data sources that act as a black box relative to the DV Engine.
One might argue that if every pull-based system must deal with the complexities of push-based processing for multiregion deployments, then one of the major arguments in favor of the pull-based architecture is eliminated. In truth, there is some validity to this argument. However, it is important to note that the complexities of push-based processing in the context of multiregion deployments are much less significant than the complexities of push-based processing in general. First, the query optimization problem is less significant, as described in the previous paragraph. Second, as we described previously, the complexities of push-based processing in general center around the need to have an intimate knowledge of the API of every possible type of data source system, along with detailed reasoning about the query processing algorithms that are run there. In contrast, the push-based processing in the context of multiregion deployments is much simpler. The only system to which the subrequest is being submitted is another instance of that same DV Engine. There is no additional API knowledge that needs to be acquired in order to generate the subrequest, and the query processing algorithms are already known since the remote regions run the same fundamental DV Engine code. While the general process of dividing a request into several subrequests needs to be added to the pull-based system in order to support multiregion deployments, this new code is a small fraction of the code that needs to exist for generic push-based systems.
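To illustrate this division of work (with hypothetical table names and a simplified plan), a request submitted to region 1 that joins local data with data stored in region 2 might be split as follows:

-- Original request, submitted to region 1:
SELECT e.department, sum(s.amount)
FROM region2_sales s
JOIN region1_employees e ON s.employee_id = e.employee_id
GROUP BY e.department;

-- Subrequest pushed to the DV Engine instance running in region 2, which
-- pre-aggregates locally to shrink the intermediate result (this rewrite is
-- valid assuming employee_id is a key of region1_employees):
SELECT employee_id, sum(amount) AS amount
FROM region2_sales
GROUP BY employee_id;

-- Region 1 then joins the returned intermediate result with region1_employees
-- and computes the final per-department totals.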
Upholding data sovereignty rules can be tricky when an input request needs to access data from more than one region. These rules are typically stated in terms of the raw data inside the data source. After a subrequest has processed the raw data, what remains is a derived, intermediate dataset. In many cases, it is unclear whether this newly generated intermediate dataset has transformed the raw data to the point where it is now allowed to be transferred to a different region. One approach to avoiding the complexity of trying to determine this is to define logical views (or materialized views) inside a region with data sovereignty rules. These views restrict access to only the data that is allowed to be transferred outside of that region. For example, if a dataset includes PII data that regulatory requirements prevent from leaving a particular region, a view can be created that either hides the sensitive data, aggregates it, or anonymizes it. If the remote processing request can express its subrequest in terms of these safe views (instead of over the raw data), then it is guaranteed that intermediate result sets are safe to transfer across regions.
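A minimal sketch of such a safe view, assuming a hypothetical orders_with_pii table that contains customer-level PII inside the regulated region:

-- Defined inside the regulated region; exposes only aggregated,
-- non-identifying data, so subrequests written against this view produce
-- intermediate results that may leave the region.
CREATE VIEW regional_sales_safe AS
SELECT product_category, order_date,
       count(*)    AS num_orders,
       sum(amount) AS total_amount
FROM orders_with_pii
GROUP BY product_category, order_date;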