Chapter 1. Introduction to Data Virtualization and Data Lakes
As long as humankind has existed, knowledge has been spread out across the world. In ancient times, one had to go to Delphi to see the oracle, to Egypt for the construction knowledge necessary to build the pyramids, and to Babylon for the irrigation science to build the Hanging Gardens. If someone had a simple question, such as, “Were the engineering practices used in building the pyramids and Hanging Gardens similar?” no answer could be obtained in less than a decade. The person would have to spend years learning the languages used in Babylon and Egypt in order to converse with the locals who knew the history of these structures; learn enough about the domains to express their questions in such a way that the answers would reveal whether the engineering practices were similar; and then travel to these locations and figure out who were the correct locals to ask.
In modern times, data is no less spread out than it used to be. Important knowledge—capable of answering many of the world’s most important questions—remains dispersed across the entire world. Even within a single organization, data is typically spread across many different locations. Data generally starts off being located where it was generated, and a large amount of energy is needed to overcome the inertia to move it to a new location. The more locations data is generated in, the more locations data is found in.
Although it no longer takes a decade to answer basic questions, it still takes a surprisingly long time. This is true even though we are now able to transfer data across the world at the speed of light: an enormous amount of information can be accessed from halfway across the earth in just a few seconds. Furthermore, computers can now consume and process billions of bytes of data in less than a second. Answering almost any question should be nearly instantaneous, yet in practice it is not. Why does it take so long to answer basic questions—even today?
The answer to this question is that many of the obstacles to answering a question about the pyramids and Hanging Gardens in ancient times still exist today. Language was a barrier then. It is still a barrier today. Learning enough about the semantics of the domain was a barrier then. It is still a barrier today. Figuring out who to ask (i.e., where to find the needed data) was a barrier then. It is still a barrier today. So while travel and processing are each billions of times faster now than they were then, that speedup only benefits us to the extent that those parts of analyzing data are no longer the bottleneck. Rather, it is these other barriers that prevent us from efficiently getting answers to the questions we have about datasets.
Language is a barrier that goes beyond the fact that one dataset may be in English, another in Chinese, and another in Greek. Even if they are all in English, the computing systems that store the data may require requests to be posed in different languages in order to extract data from, or answer questions about, these datasets. One system may have an SQL interface, another GraphQL, and a third system may support only text search. The client who wishes to pose a question to these differing systems needs to learn the language that each system supports as its interface. Further, the client needs to understand enough about the semantics of how data is stored in a source system to pose a question coherently. Is data stored in tables, a graph, or flat files? If tables, what do the rows and columns correspond to? What are the types of each column (integers, strings, text)? Does a column in one table refer to the same real-world entity as a column in a different table? Furthermore, where can I even find a dataset related to my question? I know there is a useful dataset in Spain. Is there also one in Belgium? Egypt? Japan? How do I discover what is out there and what I have access to? And how do I request permission to access something I do not currently have access to?
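To make this concrete, the following minimal Python sketch expresses the same logical question—which customers placed orders after a given date—in three different interface languages. The table names, field names, and query syntax shown are purely illustrative; in practice, each request would be submitted through the corresponding system's own driver or endpoint.

```python
# The same logical question, expressed in three different interface languages.
# Table, field, and index names below are hypothetical.

# 1. A relational source with an SQL interface.
sql_query = """
    SELECT c.name
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE o.order_date >= DATE '2024-01-01'
"""

# 2. A source that exposes a GraphQL interface.
graphql_query = """
    query {
      customers(filter: {orders: {orderDate: {gte: "2024-01-01"}}}) {
        name
      }
    }
"""

# 3. A source that supports only text search (a keyword query string).
text_search_query = 'customers orders after:"2024-01-01"'

for label, q in [("SQL", sql_query),
                 ("GraphQL", graphql_query),
                 ("Text search", text_search_query)]:
    print(f"--- {label} ---\n{q.strip()}\n")
```

A client without a DV System must know all three dialects (and which system speaks which) before the question can even be asked.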
The goal of data virtualization (DV) is to eliminate or alleviate these other barriers. A DV System creates a central interface in which data can be accessed no matter where it is located, no matter how it is stored, and no matter how it is organized. The system does not physically move the data to a central location. Rather, the data exists there virtually. A user of the system is given the impression that all data is in one place, even though in reality it may be spread across the world. Furthermore, the user is presented with information about what datasets exist, how they are organized, and enough of the semantic details of each dataset to be able to formulate queries over them. The user can then issue commands that access any dataset virtualized by the system without needing to know any of the physical details regarding where data is located, which systems are being used to store it, and how the data is compressed or organized in storage.
The convenience of a properly functioning DV System is obvious. We will see in Chapter 2 that there are many challenges in building such systems, and indeed many DV Systems fall significantly short of the promise we have described. Nonetheless, when they work as intended (as we illustrate in Chapter 6), they are extremely powerful and can dramatically broaden the scope of a data analysis task while also accelerating the analysis process in a variety of ways, including by reducing the human effort involved in bringing together datasets for analysis.
A Quick Overview of Data Virtualization System Architecture
Figure 1-1 shows a high-level architecture of a DV System, the core software that implements data virtualization, which we’ll discuss in detail in Chapter 3. The system itself is fairly lightweight—it does not store any data locally except for a cache of recently accessed data and query results (see Chapter 4). Instead, it is configured to access a set of underlying data sources that are virtualized by the system such that a client is given the impression that these data sources are local to the DV System, even though in reality they are separate systems that are potentially geographically dispersed and distant from the DV System. In Figure 1-1, three underlying data sources are shown, two of which are separate data management systems that may contain their own interfaces and query access capabilities for datasets stored within. The third is a data lake, which may store datasets as files in a distributed filesystem.
The existence of these underlying data source systems is registered in a catalog that is viewable by clients of the DV System. The catalog usually also contains information regarding the semantics of the datasets stored in the underlying data systems—what exactly the data in these datasets represents, how it was collected, what real-world entities it refers to, and how the datasets are organized (in tables, graphs, or hierarchical structures)—as well as the schema of those structures. All of this information allows clients to be aware of what datasets exist, what important metadata these datasets have (including statistical information about the data contained within them), and how to express coherent requests to access them.
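As an illustration of the kind of information a catalog records, here is a small, hypothetical sketch in Python. Real DV Systems maintain this metadata in their own catalog services; the field names, source systems, locations, and statistics below are assumptions made purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    """Hypothetical metadata a DV catalog might keep for one virtualized dataset."""
    name: str                # logical name clients use in queries
    source_system: str       # which underlying system holds the data
    location: str            # where that system lives
    layout: str              # table, graph, or hierarchical/file layout
    schema: dict             # column/field names mapped to types
    row_count_estimate: int  # statistics that help clients and the engine
    description: str = ""    # what the data means and how it was collected

catalog = [
    CatalogEntry(
        name="sales.orders",
        source_system="PostgreSQL (operational database)",
        location="eu-datacenter-1",
        layout="table",
        schema={"order_id": "bigint", "customer_id": "bigint",
                "order_date": "date", "total": "decimal(10,2)"},
        row_count_estimate=120_000_000,
        description="One row per customer order, collected by the order service.",
    ),
    CatalogEntry(
        name="web.clickstream",
        source_system="data lake (Parquet files)",
        location="s3://example-lake/clickstream/",  # hypothetical bucket
        layout="files partitioned by date",
        schema={"user_id": "string", "url": "string", "ts": "timestamp"},
        row_count_estimate=9_000_000_000,
        description="Raw click events, unmodified since ingestion.",
    ),
]

for entry in catalog:
    print(f"{entry.name:20s} {entry.source_system:35s} rows~{entry.row_count_estimate:,}")
```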
The most complex part of a DV System is the data virtualization engine (DV Engine), which receives requests from clients (generated using the client interface) and performs whatever processing those requests require. This typically involves communication with the specific underlying data sources that contain data relevant to each request. The DV Engine thus needs to know how to communicate with a variety of different types of systems that may store data being virtualized by the system. Furthermore, it may need to forward parts of client requests to these underlying source systems. Therefore, the engine needs to know how to express these requests such that the underlying data source system can execute them efficiently and return the results in a form that the DV System can consume scalably. The DV Engine may also need to combine results received from the multiple underlying data source systems involved in a client request.
In general, the goal of data virtualization is to allow clients to express requests over datasets without having to worry about the details of how the underlying data source systems store the source data. Yet most underlying data sources have unique interfaces that require expertise in that particular system before data can be accessed. Therefore, the DV Engine typically must translate, on the fly, from the global interface the client uses to access any underlying system into the particular interface of each underlying data source. We will discuss in Chapter 2 how this process is complex and error prone, and cover approaches that modern systems use to reduce this complexity.
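To show what this looks like from the client's perspective, the following Python sketch issues one query through a single, Trino-style SQL interface that joins a table held in a relational source with a table held in a data lake. It assumes the trino Python client is installed, that a query engine is reachable at the hypothetical host shown, and that catalogs named postgres and lake have already been registered; all of these names are illustrative assumptions, not a prescription for any particular deployment.

```python
from trino.dbapi import connect  # pip install trino (assumed available)

# Hypothetical host and catalog names; in a real deployment these come from
# the DV System's configuration and catalog.
conn = connect(host="dv.example.internal", port=8080, user="analyst")
cur = conn.cursor()

# One SQL statement spans two underlying sources: a relational database
# (catalog "postgres") and a data lake (catalog "lake"). The engine decides
# what to push down to each source and combines the partial results.
cur.execute("""
    SELECT c.region, count(*) AS clicks
    FROM postgres.sales.customers AS c
    JOIN lake.web.clickstream AS e
      ON e.user_id = c.user_id
    WHERE e.ts >= TIMESTAMP '2024-01-01 00:00:00'
    GROUP BY c.region
    ORDER BY clicks DESC
""")

for region, clicks in cur.fetchall():
    print(region, clicks)
```

The client never specifies where either table physically lives or how it is stored; those details are resolved by the engine using the catalog.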
Figure 1-1 included a data lake as one of the underlying data sources. A data lake is a cluster of servers that combine to form a distributed filesystem inside which raw datasets (often very large) can be stored scalably at low cost. In the cloud era, data lakes have become popular alternatives to storing data in expensive data management systems, which implement many features that are unnecessary for use cases that primarily require simply storing and accessing raw data. Data lakes often contain datasets in the early stages of data pipelines, before they are finalized for specific applications. These raw or early-stage datasets may contain significant value for data analysis and data science tasks, and a major use case for DV Systems has emerged in which data in a data lake is virtualized by the DV System and thus becomes accessible via the client interface that the DV System provides. Furthermore, we will see in future chapters that accessing raw data in a data lake is often much simpler and less error prone for a DV System than accessing data in more advanced data management systems. Therefore, many of the most successful deployments of data virtualization systems at the time of writing contain at least one data lake as an underlying data source system. As a result, this book places a special focus on data lake use cases for data virtualization, in addition to discussing data virtualization more generally. In the next section, we give a more detailed introduction to data lakes.
Data Lakes
Before the emergence of data virtualization, the only way to access a variety of datasets for a particular analysis task was to physically bring them all together prior to the analysis. For decades, data warehouses were the primary solution for bringing data from across an organization into a central, unified location for subsequent analysis. Data was extracted from source data systems via an extract, transform, and load (ETL) process; integrated and stored inside dedicated data warehouse software such as Oracle Exadata, Teradata, Vertica, or Netezza; and made available for data scientists and data analysts to perform detailed analyses. These data warehouses stored data in highly optimized storage formats in order to analyze it with high performance and throughput, so that analysts could experience fast responses even when analyzing very large datasets. However, because of the complexity of the software, these data warehouse solutions were expensive and typically priced by the volume of data stored; they were therefore often prohibitively expensive for truly large datasets, especially when it was unclear in advance whether those datasets would provide significant value to analysis tasks. Furthermore, data warehouses typically require up-front data cleaning and schema declaration, which may involve nontrivial human effort—effort that is wasted if the dataset ends up not being used for analysis.
Data lakes therefore emerged as much cheaper alternatives to storing large amounts of data in data warehouses. Typically built from relatively simple free and open source software, a home-built data lake's only costs of storing data were the hardware for the cluster of servers running the data lake software and the labor of the employees overseeing the deployment. Furthermore, data could be dumped into data lakes without up-front cleaning, integration, semantic modeling, or schema generation, making data lakes an attractive place to store datasets whose value for analysis tasks has yet to be determined. Data lakes allowed for a “store first, organize later” approach.
Starting in approximately 2008, a system called Hadoop emerged as one of the first popular implementations of a data lake. Based on several important pieces of infrastructure used by Google to handle many of its big data applications (including its GFS distributed filesystem and its MapReduce data processing engine), Hadoop became almost synonymous with data lakes in the decade after its emergence.
Data lakes are known for several important characteristics that initially existed in the Hadoop ecosystem and have become core requirements of any modern data lake:
- Horizontal scalability
- Support for structured, semi-structured, and unstructured data
- Open file formats
- Support for schema on read
We discuss each of these characteristics in the following sections.
Horizontal Scalability
Hadoop incorporated an open source distributed filesystem called HDFS (the Hadoop Distributed File System) that was based on the architecture of the Google File System (GFS), which itself was based on decades of research into scalable filesystems. The most prominent feature of HDFS is how well it scales to petabytes of data. Although metadata is stored on a single leader server, the main filesystem data is partitioned (and replicated) across a cluster of inexpensive commodity servers. Over time, as filesystem data continues to expand, HDFS gracefully accommodates the new data: additional inexpensive commodity servers are simply added to the existing HDFS cluster. Although some amount of rebalancing (i.e., repartitioning) of data is necessary when new nodes are added in order to keep filesystem data reasonably balanced across all available servers—including the newly added ones—the overall process of horizontally scaling the cluster by adding new machines whenever filesystem capacity needs to grow is straightforward for system administrators. Horizontal scalability proved to be a much more cost-effective way to scale to large datasets than upgrading existing servers with additional storage and processing capability, since there is a limit to how many resources can be added to a single server before those additions become prohibitively expensive.
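The following deliberately simplified Python sketch illustrates the general idea of spreading file blocks across a cluster and rebalancing when a server is added. It is purely conceptual—it does not reflect how HDFS actually places or rebalances blocks—but it shows why adding a machine, followed by a modest amount of data movement, is enough to absorb growth.

```python
# Conceptual sketch only: blocks are spread across servers, and adding a
# server triggers a rebalance that moves some blocks onto it. This is NOT
# how HDFS places or rebalances blocks internally.
from collections import defaultdict

def place_blocks(num_blocks, servers):
    """Assign block ids to servers round-robin."""
    placement = defaultdict(list)
    for block_id in range(num_blocks):
        placement[servers[block_id % len(servers)]].append(block_id)
    return placement

def rebalance(placement, new_server):
    """Move blocks from overloaded servers onto a newly added server."""
    placement[new_server] = []
    target = sum(len(b) for b in placement.values()) // len(placement)
    for server, blocks in placement.items():
        while server != new_server and len(blocks) > target:
            placement[new_server].append(blocks.pop())
    return placement

cluster = ["node1", "node2", "node3"]
layout = place_blocks(1200, cluster)   # 400 blocks per node
layout = rebalance(layout, "node4")    # roughly 300 blocks per node afterward
for node, blocks in sorted(layout.items()):
    print(node, len(blocks), "blocks")
```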
By storing data on a cluster of cheap commodity servers, and with the filesystem software itself being free and open source, HDFS reduced the cost per byte of storage relative to other existing options by more than a factor of 10. This fundamentally altered the psychology of organizations when it came to keeping around datasets of questionable value. Whereas previously organizations were forced to justify the existence of any large dataset to ensure that the high cost of keeping it around was worth it, the low-cost scalability of HDFS allowed organizations to keep datasets of questionable value with the vague justification that perhaps they would be useful in the future. This feature of scalable, cheap storage is now a core requirement of all data lakes.
Support for Structured, Semi-Structured, and Unstructured Data
Another major paradigm shift introduced by Hadoop was a general reduction of constraints about how data must be structured in storage. Most data management systems at the time were designed for a particular type of data model—such as tables, graphs, or hierarchical data—and functioned with high performance only if someone stored data using the data model expected by that system. For example, many of the most advanced data management systems were designed for data to be stored in the relational model, in which data is organized in rows and columns inside tables. Although such systems were able to handle unstructured data (e.g., raw text, video, audio files) or semi-structured data (e.g., log files, Internet of Things device output) using various techniques, they were not optimized for such data, and performance degraded dramatically if large amounts of unstructured data were stored inside the system.
In contrast, Hadoop was designed for data to be stored as raw bytes in a distributed filesystem. Although it is possible to structure data within HDFS (we will discuss this shortly), the default expectation of the system is that data is unstructured, and its original data processing capabilities (such as MapReduce) were designed accordingly to handle unstructured data as input.1 This is because MapReduce was originally designed to serve Google’s large-scale data processing needs, especially for tasks such as generating its indexes for use in its core search business. For example, counting and classifying words in a web page served as a prominent example in the original publication describing the MapReduce system. Since web pages generally consist of unstructured or semi-structured data, HDFS and MapReduce were both optimized to work with these data types.
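To give a flavor of that original word-counting workload, here is a minimal MapReduce-style word count written in plain Python. A real MapReduce job runs the map and reduce phases in parallel across a cluster of machines; this sketch runs both phases in a single process over two tiny in-memory "documents."

```python
from collections import defaultdict

# Two tiny "web pages" standing in for unstructured input documents.
documents = [
    "the quick brown fox jumps over the lazy dog",
    "the dog barks and the fox runs",
]

# Map phase: emit (word, 1) pairs from each document.
def map_phase(doc):
    for word in doc.split():
        yield word, 1

# Shuffle: group intermediate pairs by key (word).
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}

for word, total in sorted(word_counts.items(), key=lambda kv: -kv[1]):
    print(word, total)
```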
This embrace of unstructured data again altered the psychology of organizations when it came to keeping around large raw datasets. Before Hadoop became popular, organizations faced a decision for every new raw dataset: either (a) spend a large amount of effort to clean and reorganize the data so that it fit the preferred data model of the data management system the organization was using, or (b) discard the dataset. By providing a good solution for storing large unstructured and semi-structured datasets, Hadoop made it possible to keep these datasets around in the data lake.
Open File Formats
So far, we have explained that Hadoop was originally designed to work over raw files in a filesystem rather than the highly optimized compressed structured data formats typically found in relational database systems. Nonetheless, Hadoop’s strong ability to cost-effectively store and process data at scale naturally led to it being used for storing highly structured data as well. Over time, projects such as Hive and HadoopDB emerged that enabled the use of Hadoop for processing structured data.2, 3, 4 Once Hadoop started being used for structured data use cases, several open file formats (such as Parquet and ORC) emerged that were designed to store data in optimized formats for these use cases—for example, storing tables column by column rather than row by row. This enabled structured data to be stored in HDFS (a filesystem that is oblivious to whether it is storing structured or unstructured data) while still providing the performance benefits at the storage level that relational database systems use to optimize performance, such as lightweight compression formats that can be operated on directly without decompression, byte-offset attribute storage, index-oriented layout, array storage that is amenable to SIMD parallel operations, and many other well-known optimizations.
Some important features of these file formats include the following (a short code sketch illustrating them follows this list):
- Columnar storage, in which each column is stored separately, allowing for better compression and more efficient data retrieval. This format is particularly well suited to analytical workloads in which any particular query accesses only a subset of the columns in a table, so no time has to be wasted scanning the columns that are not accessed.
- Partitioning of files based on specific attributes or columns. Instead of storing all data in a single large file, the data is divided into smaller files based on the values of one or more of its attributes. Partitioning can significantly improve analysis performance by enabling data pruning, in which partitions that don't contain relevant data for a particular task are skipped, greatly reducing the amount of data that needs to be scanned. For example, if the data is partitioned by date, three years of data can be divided into 1,095 partitions (one for each day over those three years). An analysis task that only involves data from a particular week within that three-year period can directly read the seven partitions corresponding to the days in that week instead of scanning the entire dataset to find the relevant data—less than 1% of the total data volume. A separate advantage of partitioning is that it makes data processing tasks simpler to parallelize, because different partitions can be processed concurrently.
- Inclusion of metadata that describes the structure of the data in the file and how it is partitioned. This metadata helps data processing engines understand how data is organized within the files.
- Storage of data using open standards defined by a particular file format. Because the format clearly defines how data is stored, data processing systems can be developed that directly support these open formats as inputs. This allows the same dataset to be used directly as input to various big data frameworks and tools, including Trino, Apache Spark, and Apache Hive, without having to be converted on the fly into the different internal data structures used by those systems. Data therefore only needs to be stored once and can be accessed by many engines.
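The following short Python sketch demonstrates these features together using pyarrow, a library that implements the Parquet format: reading a single column, inspecting the file's self-describing metadata, and pruning partitions at read time. The file names and toy table are invented for illustration, and pyarrow is assumed to be installed.

```python
# A small end-to-end sketch of the ideas above using pyarrow and Parquet.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# A tiny "orders" table covering three days, a few rows per day.
table = pa.table({
    "order_date": ["2024-01-01", "2024-01-01", "2024-01-02",
                   "2024-01-02", "2024-01-03", "2024-01-03"],
    "customer_id": [10, 11, 10, 12, 13, 11],
    "total": [20.0, 15.5, 99.0, 5.0, 42.0, 7.25],
})

# Columnar storage: write one Parquet file, then read back a single column;
# the bytes belonging to the other columns are never scanned.
pq.write_table(table, "orders.parquet")
totals_only = pq.read_table("orders.parquet", columns=["total"])

# Self-describing metadata: the file records its own schema and layout.
meta = pq.ParquetFile("orders.parquet").metadata
print(meta.num_rows, "rows,", meta.num_row_groups, "row group(s),",
      meta.num_columns, "columns")

# Partitioning: write the same table as one directory per order_date value...
pq.write_to_dataset(table, root_path="orders_partitioned",
                    partition_cols=["order_date"])

# ...then prune partitions at read time: only the 2024-01-02 directory is read.
partitioning = ds.partitioning(pa.schema([("order_date", pa.string())]),
                               flavor="hive")
dataset = ds.dataset("orders_partitioned", format="parquet",
                     partitioning=partitioning)
one_day = dataset.to_table(filter=ds.field("order_date") == "2024-01-02")
print(one_day.num_rows, "rows read for 2024-01-02")
```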
Support for Schema on Read
Some datasets—especially those that were generated from a variety of heterogeneous sources—may not contain a consistent schema. For example, in a dataset containing information about customers of a business, some entries may contain the customer name while others do not; some entries may contain an address or phone number while others do not; some entries may contain an age or other demographic information while others do not. In traditional structured systems, a rigorous process is often required to define a single, unified schema for a particular dataset. The data is then cleaned and transformed to abide by this schema. Any information that is missing (name, address, age, etc.) must be explicitly noted for a particular entry (e.g., by including the special null value for missing data). This entire cleaning and transformation process happens prior to loading data into the system, and the process is referred to by the acronym ETL.
Since data lakes allow data to be stored in a raw format, new datasets can be stored immediately, prior to any kind of transformation. This enables unclean data with irregular schemas and missing attributes to be loaded into the system as is without being forced to conform to a global schema. Furthermore, this type of data can be made available for data processing via a concept called schema on read, in which the schema for each entry is defined separately for that entry and read alongside the entry itself. When an application or data processing tool reads the raw data, it reads the schema information as well and processes the data on the fly according to the defined schema for each individual entry. This is a significant departure from the more traditional schema on write concept, which requires data to be formatted, modeled, and integrated as it is written to storage before consumption and usage.
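As a minimal illustration of schema on read, the following Python sketch stores customer entries as self-describing JSON lines (the records and field names are invented) and applies a schema only at the moment each entry is read, tolerating missing attributes rather than requiring them to be declared and cleaned up front.

```python
import json

# Raw, irregular customer records as they might land in a data lake:
# each line is self-describing JSON, and no two lines need the same fields.
raw_lines = [
    '{"name": "Ana Silva", "city": "Lisbon", "age": 41}',
    '{"name": "Li Wei", "phone": "+86-10-5550-0000"}',
    '{"customer_id": 7713, "city": "Cairo"}',
]

# Schema on read: interpret each entry as it is read, filling in missing
# attributes with None instead of requiring an up-front unified schema.
def read_customer(line):
    record = json.loads(line)
    return {
        "name": record.get("name"),
        "city": record.get("city"),
        "age": record.get("age"),
        "phone": record.get("phone"),
    }

for line in raw_lines:
    print(read_customer(line))
```

Under schema on write, by contrast, every one of these records would have had to be cleaned and padded with explicit nulls before it could be stored at all.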
A data lake thus enables an ELT paradigm (as opposed to the ETL paradigm previously defined), where data is first extracted and loaded into the data lake storage prior to any cleaning or transformation. At first, whenever it is accessed by data processing tools, the schema on read technique is used to process the data. Over time, data can be cleaned and transformed into a unified schema for higher quality and performance. But this process only needs to happen when necessary, only when the human and computing resources required to perform this transformation process are available, and only when the value of the data after transformation increases by enough to justify the cost of the transformation. Until this point, the data remains available via the schema on read paradigm. This reduces the time-to-value for new datasets, since they become immediately available for analysis, and only increase in value over time as the data is improved iteratively on demand.
The Cloud Era
All of the leading cloud vendors offer some sort of data lake solution (often more than one). In some cases, these solutions are based directly on Hadoop, where Hadoop software runs under the covers, and the cloud solution takes over the significant burden of managing and running Hadoop environments. In other cases, these solutions are built using their own software developed in-house, while adhering to the principles we discussed previously.
One notable addition that cloud vendors have made to the data lake paradigm beyond what we previously discussed is the introduction of object storage. Object storage is designed to accommodate unstructured data by organizing it into discrete entities known as objects, which are housed within a simplified, nonhierarchical namespace. Each object encapsulates the data, associated metadata, and a unique identifier, facilitating straightforward access and retrieval for applications via an API. For example, Amazon's object storage service, S3 (Simple Storage Service), exposes a REST-based API for accessing data.
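As a brief illustration, the following Python sketch writes and then reads a single object through the S3 API using the boto3 library. The bucket name and object key are hypothetical, and valid AWS credentials are assumed to be configured in the environment.

```python
# A minimal sketch of object storage access through the S3 API using boto3.
import boto3

s3 = boto3.client("s3")

# Each object bundles the data itself, user-defined metadata, and a unique
# key within the bucket's flat (nonhierarchical) namespace.
s3.put_object(
    Bucket="example-data-lake",              # hypothetical bucket
    Key="raw/clickstream/2024-01-02.json",   # the object's unique identifier
    Body=b'{"user_id": "u1", "url": "/home"}\n',
    Metadata={"source": "web-frontend", "ingested-by": "pipeline-v2"},
)

# Retrieval is equally direct: ask for the key, get back data plus metadata.
response = s3.get_object(Bucket="example-data-lake",
                         Key="raw/clickstream/2024-01-02.json")
print(response["Metadata"])
print(response["Body"].read().decode())
```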
Both object storage and the more traditional file-based storage can be foundational components of data lakes; however, while file-based storage is designed and mostly employed for analytics use cases, object storage has a much broader applicability and is used as a location for storing data for both operational and analytical purposes.
Data Virtualization Over Data Lakes
In the next chapter we will discuss some major challenges in building functional, high-performance DV Systems and some solutions that have emerged to overcome these challenges. During the course of that discussion (along with additional discussion in the following chapters), it will become clear why data lakes are an excellent use case for virtualization systems. This does not mean that data virtualization should not be used for data outside data lakes—modern DV Systems are able to process data located anywhere with high performance. Rather, the existence of data in data lakes is pushing forward demand for DV Systems while at the same time eliminating some of the more difficult challenges in deploying such systems.
1 Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM 51, no. 1 (2008): 107-113.
2 Ashish Thusoo et al., “Hive - A Warehousing Solution Over a Map-Reduce Framework,” PVLDB 2, no. 2 (2009): 1626-1629, https://research.facebook.com/publications/hive-a-warehousing-solution-over-a-map-reduce-framework.
3 Azza Abouzied et al., “Integration of Large-Scale Data Processing Systems and Traditional Parallel Database Technology,” PVLDB 12, no. 12 (2019): 2290-2299, https://www.vldb.org/pvldb/vol12/p2290-abouzied.pdf.
4 Azza Abouzied et al., “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads,” PVLDB 2, no. 1 (2009): 922-933, https://dl.acm.org/doi/10.14778/1687627.1687731.