Chapter 4. Requirements Summary
A fast universal data access platform must meet the following organizational requirements:
- Handle a wide variety and ever-changing set of data.
- Deal with huge amounts of data that continue to grow with time.
- Provide results to analytical queries as quickly as possible to allow organizations to do analysis and make data-driven decisions in as close to real time as possible.
- Be flexible enough to be on premises, in the cloud, or any combination of the two.
To meet these requirements, the platform needs to utilize multiple components:
- Federated query system
  Allows the platform to connect to a wide variety of data types and structures, including both structured and unstructured data.
- Pluggable data connectors
  Give the flexibility to add connection types as data sources change and update.
- Multiple installation options including varying levels of cloud support
  Ensure that the organization’s needs are met regardless of whether the installation is on premises or in the cloud.
- Dynamic clustering
  Brings multiple worker nodes under a delegating authority to provide additional power, speed, and redundancy.
- Dynamic query optimization
  Continuously improves the performance of queries to ensure that data is returned in as close to real time as possible.
- Intelligent query pushdown
  Shifts the data processing to the data sources, meaning the work is distributed to the system most adept at processing it. It can also reduce the amount of data returned to the federated system by applying query predicates at the source before the data is returned.
- Ability to convert query code to machine code
  Further improves query performance by translating code written by humans into a machine-optimized language.
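To make the interplay between pluggable connectors and query pushdown more concrete, here is a minimal Python sketch. It is not the API of any particular product: the `Predicate`, `PushdownConnector`, `ScanOnlyConnector`, and `FederatedEngine` names, and the `scan` contract they share, are all assumptions made for illustration. One connector filters rows at the source, the other can only return a full scan, and the engine applies whatever predicates the source could not handle.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Dict[str, object]
_OPS = {"==": operator.eq, ">": operator.gt, "<": operator.lt}


@dataclass(frozen=True)
class Predicate:
    """A single filter condition, for example latency_ms > 60."""
    column: str
    op: str
    value: object

    def matches(self, row: Row) -> bool:
        return _OPS[self.op](row[self.column], self.value)


class PushdownConnector:
    """Hypothetical source that can evaluate predicates itself."""

    def __init__(self, rows: List[Row]):
        self._rows = rows

    def scan(self, predicates: List[Predicate]) -> Tuple[List[Row], List[Predicate]]:
        kept = [r for r in self._rows if all(p.matches(r) for p in predicates)]
        return kept, []  # no predicates left for the engine to apply


class ScanOnlyConnector:
    """Hypothetical source that can only return every row."""

    def __init__(self, rows: List[Row]):
        self._rows = rows

    def scan(self, predicates: List[Predicate]) -> Tuple[List[Row], List[Predicate]]:
        return list(self._rows), list(predicates)  # engine must filter


class FederatedEngine:
    """Routes a query to a registered connector and finishes any filtering."""

    def __init__(self) -> None:
        self._connectors = {}
        self.last_rows_transferred = 0

    def register(self, name: str, connector) -> None:
        self._connectors[name] = connector

    def query(self, source: str, predicates: List[Predicate]) -> List[Row]:
        rows, leftover = self._connectors[source].scan(predicates)
        self.last_rows_transferred = len(rows)  # rows that crossed the "network"
        return [r for r in rows if all(p.matches(r) for p in leftover)]


# Made-up sample data: ten cells with increasing latency.
data = [{"cell": i, "latency_ms": i * 10} for i in range(10)]
engine = FederatedEngine()
engine.register("pushdown", PushdownConnector(data))
engine.register("scan_only", ScanOnlyConnector(data))

slow = [Predicate("latency_ms", ">", 60)]
pushed = engine.query("pushdown", slow)
pushed_count = engine.last_rows_transferred
scanned = engine.query("scan_only", slow)
scanned_count = engine.last_rows_transferred
```

Both queries return the same rows, but the pushdown connector hands the engine only the matching rows while the scan-only connector transfers the whole table. Adding a new source in this sketch means writing one more class with a `scan` method and registering it, which is the flexibility that pluggable connectors are meant to provide.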
Case Study
How do all these components work together to form a multiaccess, high-volume, high-performance interface? Let’s take a look at an example. China Mobile is a telecommunications business in Asia, providing service for over 1.5 billion mobile service users in 2019. Within the company is China Mobile Guangdong (GMCC), which services over 100 million customers in the Guangdong province of China. The large volume of customers combined with a wide variety of diverse data sources meant that GMCC was struggling to derive insights from terabyte levels of data.
GMCC needed this data to meet customer needs, assess network performance, and optimize company performance; however, pulling insights from the data proved extremely difficult because the data was so large and so siloed. Query performance was slow, and the lack of a universal query language meant business users had to rely on the IT office to generate new reports and data sets. The result? The IT office became overwhelmed by the high demand for data, and vital reports were delayed by anywhere from days to months. In short, GMCC could not make real-time data-driven decisions due to the volume and complexity of its data environment.
The company first tried to address parts of the problem independently:
- First, it relied on the Hadoop Distributed File System (HDFS) to store all the data. HDFS was able to adapt to the huge volume, holding over 8 TB of data; however, it relies heavily on disk input and output, resulting in increased latency and reduced query speed. HDFS is also oriented toward batch processing, meaning ad hoc queries could not be run easily.
- Redis was employed to improve speed by processing data in memory; however, as stated before, Redis is limited to simple queries, which reduced its effectiveness.
- Oracle and DB2 were employed, but both rely heavily on the extract, transform, load (ETL) process, which is extremely time consuming and resource intensive. Both databases are also limited to structured data, meaning they couldn’t handle some of the unique data values GMCC was pulling in.
GMCC turned to RapidsDB to address its data challenges. RapidsDB was able to deal with many of the issues by utilizing its in-memory federated query system. The system is built on a distributed model, meaning multiple machines work in unison to process data and queries. This model not only improved performance, but also provided redundancy while allowing for easy expansion. The system also pulled all the varied data sources under a single unified language, enabling users to access the data using a single query language and login rather than multiple languages and logins.
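The distributed model described here, in which a coordinating node delegates work to several workers and merges their answers, can be sketched in a few lines of Python. This is only an illustration of the general scatter-gather pattern, not a description of how RapidsDB is implemented: threads stand in for separate machines, and the function names and sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_workers):
    """Split rows into n_workers roughly equal slices (round-robin)."""
    return [rows[i::n_workers] for i in range(n_workers)]


def worker_partial(rows):
    """Work done on one node: return a partial (sum, count) for its slice."""
    total = sum(r["latency_ms"] for r in rows)
    return total, len(rows)


def coordinator_average(rows, n_workers=4):
    """Scatter slices to workers, then gather and merge the partial results."""
    parts = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_partial, parts))
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Made-up readings: latencies of 1..100 ms.
readings = [{"latency_ms": ms} for ms in range(1, 101)]
avg = coordinator_average(readings)
```

Because each worker returns a partial sum and count rather than raw rows, the coordinator can merge the results cheaply, and adding workers only changes how the data is sliced. Redundancy, which the text also mentions, would require replicating slices across workers and is left out of this sketch.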
This implementation provided a significant improvement. Query processing times dropped from minutes to seconds. Additionally, the software relied on standard SQL and passed credential authorization through to the underlying data sets, which meant that each underlying application continued to control access to its own data. In short, if the user successfully authenticated to the universal database access platform and had access to the underlying data set, they would be granted access without the need for a separate credential.
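The access model in this paragraph, a single login at the platform with each underlying data set still deciding who may read it, can be sketched as follows. The `DataSource` and `AccessPlatform` classes, the user names, and the plaintext password table are all hypothetical and exist only to illustrate the flow; a real deployment would delegate authentication to a proper identity system.

```python
class DataSource:
    """Hypothetical underlying source that keeps its own access list."""

    def __init__(self, name, allowed_users, rows):
        self.name = name
        self._allowed = set(allowed_users)
        self._rows = rows

    def read(self, user):
        # The source, not the platform, makes the authorization decision.
        if user not in self._allowed:
            raise PermissionError(f"{user} may not read {self.name}")
        return list(self._rows)


class AccessPlatform:
    """Single point of login that forwards the user's identity to sources."""

    def __init__(self, passwords):
        self._passwords = dict(passwords)  # plaintext for illustration only
        self._sources = {}
        self._sessions = set()

    def add_source(self, source):
        self._sources[source.name] = source

    def login(self, user, password):
        if self._passwords.get(user) != password:
            raise PermissionError("authentication failed")
        self._sessions.add(user)
        return user

    def query(self, user, source_name):
        if user not in self._sessions:
            raise PermissionError("not logged in")
        # No second credential: the authenticated identity is passed through.
        return self._sources[source_name].read(user)


platform = AccessPlatform({"alice": "a-secret", "bob": "b-secret"})
platform.add_source(DataSource("billing", {"alice"}, [{"invoice": 1}]))
platform.add_source(DataSource("network", {"alice", "bob"}, [{"cell": 7}]))

alice = platform.login("alice", "a-secret")
bob = platform.login("bob", "b-secret")

billing_rows = platform.query(alice, "billing")
try:
    platform.query(bob, "billing")
    bob_denied = False
except PermissionError:
    bob_denied = True
```

In this sketch both users log in once, but only the user on the billing source's own access list can read billing data. The platform never overrides a source's decision; it only saves the user from presenting a separate credential to each source.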
This improved performance means GMCC can query its data and receive insights much closer to real time. The federated query system ensures that existing platforms and future platforms can easily be added, combined, and queried. Just as important, users can run ad hoc queries rather than rely on time-consuming batch processing.
The benefits of a fast universal data access platform are multiple. With the increased number of data sources, data varieties, and data changes, the capability to provide connectivity, storage, and usability at fast speeds is more valuable than ever. A distributed framework combined with pluggable connectors ensures not only that current data obstacles are overcome, but also that your company is prepared and able to adapt to future data challenges.