Chapter 1. Challenges of Universal Data Access

Your company relies on data to succeed. Traditionally, this data came from the business’s transactional processes. It was pulled from the transaction systems through an extract, transform, load (ETL) process and into a warehouse for reporting purposes. With the growth of the Internet of Things (IoT), web commerce, and cybersecurity, this traditional data flow no longer suffices. The copious and diverse data available to your company brings challenges with connections, speed, volume, and access. How do you ensure that your company can keep up with today’s increasing magnitude of data and insights so that it will be a leader in the field in the future?

What Is Universal Data Access?

The main problems facing businesses today are the volume and variety of data accessible for analysis. It is no longer viable to simply examine the data generated by business processes. Instead, many organizations are starting to look outside their business workflow for information on customer behavior, retail patterns, and industry trends. This supplemental data provides actionable insights, but it also creates a challenge when integrating with business-specific data sources.

Your organization likely generates data and stores it within a single platform. Your daily processes create data as part of the normal business flow, and this data is likely stored in a transactional database similar to that shown in Figure 1-1. From there, the data is extracted and transformed into a separate structure, such as a warehouse, where it can be reported on more easily. But what happens when data is needed from an external source, such as web analytics, census demographics, or elsewhere? How do you integrate data sources of differing structure, design, or format? How do you ensure peak query performance to provide insights in near-real time?

Figure 1-1. Traditional data flow from transactional processes to analytic insights

External data sources are vital to the success of your organization. The challenge lies in how to accurately and efficiently bring all this data together into a usable format. In Figure 1-2, you can see how these additional data sources impact data flow. Because many of these sources reside outside your business’s transactional process, they arrive in different formats and structures. Some may be relational databases, while others may be NoSQL databases, flat files, or streams. You need special reporting tools to combine this data into a streamlined, usable source.

Figure 1-2. Data flow illustrating the gap caused by extracting data from external sources

The process of combining all these different data sources, formats, and structures is what we refer to in this report as universal data access. In short, a framework is developed to pull data from multiple sources and combine the data sets into a single schema of queryable views similar to that shown in Figure 1-3. Ideally, this new framework would be dynamic, be easy to access, and rely on the infrastructure of its sources to provide the speed and power needed to make the data accessible in a timely fashion. In truth, however, many of today’s applications sacrifice speed for data diversity or vice versa.

Figure 1-3. Data flow from transactional processes and external sources to an integration or federation tool, consolidating both into a single format

Two of the best-known software tools for pursuing universal data access are Denodo and Presto. While each is powerful in its own right, both have drawbacks. Denodo is extremely versatile, providing a huge variety of connectors and allowing users to pull from many different data sources; however, it lacks the performance needed to combine those sources and serve ad hoc queries. Presto is more adept at handling ad hoc queries, but it lacks the diversity of connection types. Let’s take a deeper look at the areas of data diversity and data volume, as they dictate the success of combining external and internal data sources.
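
To make the federation idea concrete, here is a minimal sketch of the kind of cross-source query Presto enables, issued from Python with the presto-python-client package. The coordinator hostname, the catalogs (postgresql, hive), and the table and column names are assumptions made up for illustration, not defaults of any product.

import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",  # assumed coordinator hostname
    port=8080,
    user="analyst",
    catalog="postgresql",
    schema="sales",
)
cur = conn.cursor()

# One SQL statement joins data from two different source systems;
# Presto delegates work to each connector and merges the results.
cur.execute("""
    SELECT o.customer_id,
           COUNT(DISTINCT o.order_id) AS orders,
           COUNT(c.page_view_id)      AS page_views
    FROM postgresql.sales.orders AS o
    LEFT JOIN hive.web.clickstream AS c
           ON c.customer_id = o.customer_id
    GROUP BY o.customer_id
""")
for customer_id, orders, page_views in cur.fetchall():
    print(customer_id, orders, page_views)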

Data Diversity

Data within an organization is relatively easy to collect and combine into a single format and data structure, but what happens when data is needed from sources the organization does not control? How do you deal with data that is dynamic, unstructured, or even residing in an entirely different database structure than the one your organization uses?

To answer these questions, first we’ll examine why external data is relevant to your business. Take, for example, a retailer. A business such as this needs data related to operations, such as supply costs, inventory, product demand, and operating costs (employee pay, utilities, etc.). What happens if we look one step beyond the business itself? A retailer can improve sales by using web data to understand customer habits and trends. It can improve marketing by identifying customer demographics and regional tendencies. It can reduce transportation costs by comparing delivery services. In fact, huge amounts of data beyond a business’s operating core are available to improve its bottom line.

This leads to the first issue with data diversity. Traditional data is organized and structured, usually in a centralized database with fixed tables, fields, and data formats. This makes the data easy to store, manipulate, and analyze. Data doesn’t just come in strings, integers, or floating-point values, though. Photos, audio, video, arrays, and many other types of data are now becoming mainstream, and traditional data structures are not designed to handle these formats. Your business needs definitions and functionality to assist with storing and analyzing these data types.
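
As a rough illustration, the following sketch uses Python’s built-in sqlite3 module to keep a photo and its JSON metadata next to conventional columns. The product ID, payload bytes, and attribute names are invented for the example.

import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_media (
        product_id  INTEGER,
        media_type  TEXT,   -- 'photo', 'audio', 'video', ...
        content     BLOB,   -- raw binary payload
        attributes  TEXT    -- semi-structured metadata stored as JSON text
    )
""")

photo_bytes = b"\x89PNG\r\n\x1a\n..."  # placeholder; normally read from a file or object store
attributes = {"width": 1024, "height": 768, "tags": ["catalog", "front-view"]}

conn.execute(
    "INSERT INTO product_media VALUES (?, ?, ?, ?)",
    (42, "photo", photo_bytes, json.dumps(attributes)),
)

# The relational engine stores the blob and the JSON, but analyzing them takes
# functionality (JSON functions, image processing) beyond the fixed-schema toolkit.
media_type, attrs = conn.execute(
    "SELECT media_type, attributes FROM product_media WHERE product_id = 42"
).fetchone()
print(media_type, json.loads(attrs)["tags"])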

This issue is not limited to the format of the sources, either. Many other obstacles come into play. First, the data needs to be extracted from its source and pulled into a repository where it can be combined with other data. This process requires transferring the data over networks, translating data fields, and establishing an architecture that will provide accurate and usable data. All of this takes time, which builds in a level of latency from when the data becomes available in its source system to when it is accessible for reporting within your organization. As we’ll see later, data speed is imperative in making timely business decisions.
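
A small sketch can make this latency visible. The following Python example uses in-memory SQLite databases as stand-ins for a source system and a reporting store and records the gap between when a row is created and when it becomes reportable; all table and column names are assumptions for illustration.

import sqlite3
from datetime import datetime, timedelta, timezone

source = sqlite3.connect(":memory:")      # stand-in for an external source system
warehouse = sqlite3.connect(":memory:")   # stand-in for the reporting store

source.execute("CREATE TABLE events (event_id INTEGER, payload TEXT, created_at TEXT)")
source.execute(
    "INSERT INTO events VALUES (1, 'page_view', ?)",
    ((datetime.now(timezone.utc) - timedelta(minutes=30)).isoformat(),),
)

warehouse.execute(
    "CREATE TABLE events_reporting "
    "(event_id INTEGER, payload TEXT, created_at TEXT, loaded_at TEXT)"
)

# Extract from the source, then load with a timestamp marking when
# the data became available for reporting.
loaded_at = datetime.now(timezone.utc).isoformat()
for event_id, payload, created_at in source.execute("SELECT * FROM events"):
    warehouse.execute(
        "INSERT INTO events_reporting VALUES (?, ?, ?, ?)",
        (event_id, payload, created_at, loaded_at),
    )

for event_id, _, created_at, loaded in warehouse.execute("SELECT * FROM events_reporting"):
    lag = datetime.fromisoformat(loaded) - datetime.fromisoformat(created_at)
    print(f"event {event_id} became reportable {lag} after it was created")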

The second issue with data diversity is data quality and integrity. Adding new data to an already existing system creates potential challenges. For example, pulling data can sometimes result in duplicate data or incorrect field formats. There are also challenges with changes in the source system impacting the data pull or transformation. This obstacle requires constant monitoring and testing to ensure data quality throughout.
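
The checks themselves can be simple. Here is a minimal pandas sketch of two recurring tests: duplicate rows introduced by the pull, and fields that fail to parse into their expected format. The column names and sample values are invented for illustration.

import pandas as pd

extracted = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "signup_date": ["2023-01-05", "2023-02-30", "2023-02-30", "not a date"],
})

# Check 1: duplicate rows introduced by the pull.
duplicates = extracted[extracted.duplicated()]

# Check 2: fields that fail to parse into their expected format.
parsed = pd.to_datetime(extracted["signup_date"], errors="coerce")
bad_dates = extracted[parsed.isna()]

print(f"{len(duplicates)} duplicate rows, {len(bad_dates)} unparsable dates")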

The third issue is security. Data integration copies data from source systems into another location. Pulling data from external sources may create additional security needs, especially if that data contains sensitive or private information, such as protected health data, credit card numbers, or Social Security numbers. You need to be prepared to handle these increased security requirements if you use data integration to pull in external data sources with sensitive information.
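
One common safeguard is to mask or pseudonymize sensitive fields before they leave the source system. The sketch below shows the idea in Python; the field names and salt handling are simplified assumptions, not a complete compliance solution.

import hashlib

def pseudonymize(value: str, salt: str = "rotate-this-salt") -> str:
    """Replace a sensitive value with a salted, one-way hash."""
    return hashlib.sha256((salt + value).encode()).hexdigest()

record = {"name": "Jane Doe", "ssn": "123-45-6789", "card_number": "4111111111111111"}

masked = {
    "name": record["name"],
    "ssn": pseudonymize(record["ssn"]),        # irreversible token, still joinable
    "card_last4": record["card_number"][-4:],  # keep only what reporting needs
}
print(masked)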

The final issue with data diversity is adaptability. Data sources will continue to adapt and change. Your company’s needs will change as well. Unfortunately, many systems designed to pull in disparate data are not easily adaptable. What happens when your data federation or data integration system does not adapt to a new data source that is imperative to the operation of your business?

All these obstacles combine to create challenges when bringing together varieties of data for reporting. A platform is needed to bring this wide variety of data into a single usable format. Two processes that combine disparate data sources are data federation and data integration. Both are accomplished with software that allows multiple varieties and structures of data to be combined and queried as a single source. For now, it is enough to understand what the software does. Later in this guide, we will discuss the differences between data integration and data federation as well as evaluate the advantages and disadvantages of each.

Data diversity is not your company’s only obstacle as it searches for new and valuable insights. Organizations also face the challenges of data volume.

Data Volume

Consider your existing data structure within your company. How many different tables exist? How many years’ worth of data is available? What happens when more and more data is added as your company continues to do business year after year? Are you able to dynamically add new data types or data sources? As more time passes, additional data is collected, meaning the data stores are dramatically increasing in size. How does your company handle this? What happens when more data is required to do effective predictive and prescriptive analysis for your company?

This brings us to the second-biggest hurdle when it comes to data volume: data processing. Even if you have a large-volume cloud storage option for your data, it still takes time to organize, structure, load, and analyze that stored data. Once the data is loaded, it may still need to be evaluated for quality and integrity. The large volumes of data many companies deal with create obstacles to finding errors or inaccuracies.

There are several vendors in the data storage space, and many new methods are emerging to improve data organization, indexing, and performance. Hadoop and Apache Spark are two examples of frameworks designed for storing and processing data at large scale. Even these have limitations, though, as they must handle existing data while remaining extensible enough to meet future demand. Still, the flexibility and adaptability of cloud storage have led many companies to turn away from storing data in large local server banks.
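
For a sense of what working at this scale looks like, here is a minimal PySpark sketch that reads a large set of files in parallel and writes the results in a partitioned, columnar layout. The storage paths and column names are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("volume-example").getOrCreate()

# Spark reads the input files in parallel across whatever executors are available.
events = spark.read.parquet("s3a://example-bucket/raw/events/")

daily_totals = (
    events
    .groupBy(F.to_date("event_time").alias("event_date"), "event_type")
    .count()
)

# Partitioned, columnar output lets later queries skip irrelevant files entirely.
daily_totals.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://example-bucket/curated/daily_totals/"
)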

Regardless of what tools you use or where you store your data, the biggest hurdle when it comes to data volume is speed. Decisions are made much more effectively when they happen closer to when transactions occur. Parsing, loading, and analyzing large volumes of data take time, which creates the potential for latency. Things like network speed, server resources, and usage volume can all impact the performance of queries against data sources. Knowing this, let’s look at why speed is valuable to your business and what obstacles exist with generating near-real-time insights.

Speed of Analytic Operations

Why are rapid insights so valuable to your organization? With increased online commerce, speed is more important than ever. Faster analytics can improve your business in the following ways:

Improved customer experience

By examining browsing habits and historical purchases, companies can provide their customers with customized experiences in near-real time. These tailored experiences enhance customer interactions and increase the likelihood of successful transactions.

Utilizing customer demographics

Information about customers is key to improving sales. Demographic and regional data provides the tools necessary to generate marketing efforts specific to gender, race, or age groups as well as customers from specific regions within your service area.

Maximizing operational efficiency

Real-time information on how your company is performing is vital to success. This includes data from transactional sources that allow you to identify where business processes are lagging. Such data may also come from external sources such as suppliers or shipping providers. Access to this data not only allows you to identify what is inefficient, but also helps you preempt potential challenges in the future.

Identifying and preventing data threats

Network and security threats are becoming more and more frequent in the business world. Real-time threat assessment and mitigation are imperative to ensuring that your company and customer information remains secure, safe, and private.

Improving competitive advantage

Tracking pricing trends can benefit your business by providing near-real-time information on competitor pricing, customer feedback, and sales analysis. This data can then be used to adjust prices dynamically, ensuring that your business remains competitive in the market.

In each of these examples, latency is an obstacle to valuable insights. Your company needs to adapt as quickly as possible to the business landscape to attract and keep customers, produce products or services more efficiently, and identify network and security threats as quickly as possible. So how do you overcome the latency that can be inherent in large amounts of data from diverse sources?

There are multiple approaches to improving data speed. The first is to throw resources at the problem. Many companies adopt this method to ensure that critical processes continue to run as demand grows. In short, if the memory, processing power, or storage is insufficient to meet the needs of the data, simply add more. While this works in many cases, there are limits to how much hardware can be added, and this method significantly increases the cost of maintaining the data: additional hardware requires more power, rack space, and human resources to maintain it all.

Speed can also be improved by converting the data into a unified format. As mentioned before, data integration and data federation combine multiple varieties of data sources to provide a single unified data source. This allows data analysts, data scientists, and report writers to analyze a single source rather than face the challenges associated with disparate data sources. Ultimately, a unified data format shortens the time to insight, leaving more time for analysis and less spent figuring out how to blend sources.

There are other ways to improve speed as well. Indexing is a common solution that improves speed by creating a directory of where certain information is housed. This improves query times by identifying in advance where to look for the requested information. To remain effective, indexes need to be maintained regularly as data and sources change.
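
The effect is easy to demonstrate. The following self-contained Python sketch, using the built-in sqlite3 module, shows the query plan switching from a full table scan to an index lookup once an index exists; the table and column names are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, i % 1000, float(i)) for i in range(100_000)],
)

def plan(sql):
    """Return SQLite's query plan for a statement."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()

query = "SELECT total FROM orders WHERE customer_id = 42"
print(plan(query))  # before: a full scan of the orders table

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(plan(query))  # after: a search using idx_orders_customer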

Certain database management systems (DBMSs) are also effective at improving query performance against data sources. Columnar systems such as Amazon Redshift and Google BigQuery store data by column rather than by row, allowing for faster retrieval because a query can read only the columns it needs instead of scanning every field of every row. The trade-off is that, because of this structure, columnar databases are typically slower to load than row-based databases and favor bulk loads over row-at-a-time inserts.
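
As a small illustration of the columnar idea, the sketch below uses pandas (with the pyarrow package installed) to write row-oriented data to a Parquet file and read back only the columns a query needs; the file and column names are invented for the example.

import pandas as pd

rows = pd.DataFrame({
    "order_id": range(1_000),
    "customer_id": [i % 50 for i in range(1_000)],
    "total": [i * 1.5 for i in range(1_000)],
    "notes": ["free-text comments the query never touches"] * 1_000,
})

# Write the same data in a columnar file format.
rows.to_parquet("orders.parquet", index=False)

# A columnar reader pulls only the columns the query needs,
# instead of scanning every field of every row.
totals = pd.read_parquet("orders.parquet", columns=["customer_id", "total"])
print(totals.groupby("customer_id")["total"].sum().head())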

There are numerous reasons to improve speed, but the data sources that provide the necessary data to achieve these results are large in size, are varied in format, and require complex queries to generate valuable insights. Which methods are most effective for adapting to large and diverse data sets? How does your company overcome these obstacles to ensure that the data needed for insights is not only accurate and accessible but also timely?
