Chapter 4. Data for Embedded Analytics

One of the first questions to ask when building an analytics solution that will be embedded inside an operational application is where the data will come from. The answer is not always simple. The data will not necessarily come from the operational application itself. In fact, very often you’ll want to complement or augment the application with insights from other systems.

To do so, you’ll have to consider three things for each data source that you wish to use:

  • Where does the data come from? Which system has the best data to support your decisions, or to support the application that you are enhancing?

  • How will you connect to that data? There are different ways of connecting (even to the same data source), which may be more or less appropriate for the scenario at hand.

  • How will you manage and govern the connectivity, security, and performance of the connection to that data source?

In this chapter, we’ll look at several different data sources that can be useful for embedded analytics and consider those questions.

CSV and Other Text Files

There are few data sources as simple—in theory—as a text file of data. Text files are found in almost every business in some scenario, because they have some great advantages. For one thing, they are widely supported by data analysis tools ranging from the simplest spreadsheet to the most advanced data science workbench. Comma-separated values (CSV) files are equally well supported as an export format. Connecting to a CSV file typically only involves having read permissions to access the folder in which it is stored. So, for many situations, they are a convenient solution.

This convenience comes at a cost, however:

Security

The only security mechanism for CSV files is to protect access to the folders in which they are stored. There is no support for row-level or column-level protection. If you use CSV files as a source, therefore, make sure they are exported with the strictest security and privacy controls in place, and do not save sensitive data in these files.

Performance

Most data movement tools, such as extract, transform, load (ETL) applications for data warehouses, are optimized in some way for text files, especially for reading or writing data in bulk. These optimizations often take the form of read and write buffers that load data into memory for parsing. BI tools rarely have these optimizations in place. They will read text files easily, but performance is likely to be poor compared to a native application driver.
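
If you do need to read a sizable CSV file for analytics, you can recover some of that bulk-read efficiency yourself. The following sketch uses Python and pandas to read a hypothetical orders.csv in chunks and aggregate as it goes, so the whole file never has to fit in memory at once; the filename and the region and amount columns are assumptions for illustration.

    import pandas as pd

    CSV_PATH = "orders.csv"  # hypothetical export file

    # Read the file in fixed-size chunks, which approximates the buffered
    # bulk reads that ETL tools perform, and aggregate as we go so the
    # whole file never has to be held in memory at once.
    partials = []
    for chunk in pd.read_csv(CSV_PATH, chunksize=100_000):
        partials.append(chunk.groupby("region")["amount"].sum())

    # Combine the per-chunk aggregates into a single result.
    totals = pd.concat(partials).groupby(level=0).sum()
    print(totals)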

Parsing

Not all text files are CSV files. Some are delimited by tab characters, some by pipe or vertical bar characters. Highly customized delimiters are rare—and ironically, ASCII control codes, intended to be a standard, are rarer still—but you may encounter them. Numerous issues can arise with numbers, such as the use of different decimal separators (some countries use commas) or unusual placement of the thousands separator (such as in the Indian numbering system, which uses lakhs and crores). If your analytics environment does not support these characteristics, you will need to ensure that your data export process renders the text files in an easily readable format.
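
As a sketch of how to cope with such a file in Python, pandas exposes options for the delimiter, decimal separator, and thousands separator; the filename and the specific separators below are assumptions for illustration.

    import pandas as pd

    # Hypothetical export: pipe-delimited, comma as the decimal separator,
    # and dot as the thousands separator (e.g., "1.234,56").
    df = pd.read_csv(
        "sales_export.txt",
        sep="|",
        decimal=",",
        thousands=".",
        encoding="utf-8",
    )
    print(df.dtypes)  # confirm that numeric columns were parsed as numbers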

Metadata

With CSV files, if we only read the data in the file, we get no metadata about when it was last updated. In finance, for example, these files are often updated manually every week or month, once the department has all the data available. If we open the file in a spreadsheet, we may find some information embedded in the filename, in header cells, or in an info table within the file, but if we just read in the data with a semi-automated process, that metadata is lost. Also, unlike a database table, text files carry a higher risk of human error, because numbers can simply be typed in without validation.
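
At a minimum, the filesystem timestamp can stand in for freshness information. The snippet below is a small sketch of that idea, assuming a hypothetical finance_monthly.csv; it captures the file's last-modified time alongside the load so the embedded view can at least indicate how current the numbers are.

    import os
    from datetime import datetime, timezone

    CSV_PATH = "finance_monthly.csv"  # hypothetical file

    # A bare CSV carries no update metadata of its own, so capture the
    # filesystem's last-modified time when loading the data.
    mtime = os.path.getmtime(CSV_PATH)
    last_updated = datetime.fromtimestamp(mtime, tz=timezone.utc)
    print(f"{CSV_PATH} last updated {last_updated:%Y-%m-%d %H:%M} UTC")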

Operational Data Sources

An operational data source is simply the location where an operational application’s data is stored. All modern applications, to some extent, create, store, and manage data. For example, an application for processing orders generates the data for each order and perhaps links data for each customer; that data needs to be persisted somewhere. Operational data sources organize that data in a way that maximizes the application’s performance and ensures its integrity. In many cases, the ease of use of the application, especially its responsiveness, is enhanced by having a robust data store that can persist large volumes of data for recordkeeping with good performance.

Many applications use very simple structures for operational data sources. For example, many use an underlying operational database. Developers are typically familiar with connectivity to that database and with the table structure, because they work with the operational application on a daily basis. And because you’re connecting directly to your operational data source, the consistency of the data is pretty much guaranteed.

Sometimes, you may use a copy of the operational source. For example, you may take advantage of a technology such as replication. The advantage of that for a developer is that the work they do with analytics doesn’t affect the operational application itself. So, for example, you could extract all the records for an individual customer, aggregate them to do some analysis, and then present that to the operator. You don’t have to look up all the details in the original database because you’re using a replicated copy, which improves performance and preserves the consistency of the original data source.

As we said, operational data structures are optimized for efficiency and integrity. But the implication of this is that they are not optimized for analytics. In practice, the joins between tables in a highly normalized database can be very complex and tricky to navigate, and often don’t provide the best performance for analysis. In such cases, trying to execute analytics over these tables becomes an extra load on the database. Replication can help with that, but if replication has too much latency, you may find discontinuity between operations and analytics: the order just placed on the phone doesn’t show up in the report.

However, connectivity to the operational data source is typically simple, because your operational application is already connected to it. If the data source is a replicated version of the operational data source, connectivity is only slightly different, requiring a distinct connection string but no other changes.
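
As a minimal sketch of that point, assuming a PostgreSQL source and SQLAlchemy, the analytic query below differs from application code only in which connection string it uses; the host names, credentials, and orders table are hypothetical.

    from sqlalchemy import create_engine, text

    # Hypothetical connection strings: the application uses the primary,
    # while analytics points at a read-only replica. Only the host differs.
    PRIMARY_URL = "postgresql://app_user:secret@orders-db:5432/orders"
    REPLICA_URL = "postgresql://bi_reader:secret@orders-db-replica:5432/orders"

    engine = create_engine(REPLICA_URL)  # analytics reads from the replica
    with engine.connect() as conn:
        result = conn.execute(
            text("SELECT status, COUNT(*) FROM orders GROUP BY status")
        )
        for status, order_count in result:
            print(status, order_count)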

The situation does get a little more difficult if we are trying to combine two operational systems: one in which we are performing operations and one from which we are sourcing data to enhance the analytics.

You may have a purchase management system that handles orders and invoices and various aspects of the order-to-delivery process. The customer data may be in a different application, such as a customer relationship management (CRM) system or a financial system. Bringing those two together involves embedding one within the other, but they’re both operational systems and therefore not particularly optimized for analytics.

For any operational system, there are typically two ways to connect: through a bulk connector or through an API. A bulk connector is optimized for extracting a large amount of data. An example of this is connecting with OLE DB or JDBC to the operational database underlying the application. With such a connector, you can ask for a full table of information, or for many tables at a time, and you can probably issue a query against it and return a result set, which may be small or large. That kind of connectivity can be useful for analytics because typically we’re looking at larger volumes of data to analyze.
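
For illustration, here is a rough sketch of bulk extraction from Python using pyodbc, which plays a similar role to an OLE DB or JDBC connector; the DSN, credentials, and table are hypothetical.

    import pyodbc

    # Hypothetical ODBC data source pointing at the operational database.
    conn = pyodbc.connect("DSN=OrdersDB;UID=bi_reader;PWD=secret")
    cursor = conn.cursor()

    # A bulk connector can return a whole table (or a large query result),
    # fetched here in batches to keep memory use bounded.
    cursor.execute("SELECT order_id, customer_id, amount, placed_at FROM orders")
    row_count = 0
    while True:
        batch = cursor.fetchmany(10_000)
        if not batch:
            break
        row_count += len(batch)  # hand each batch to your analytics code here

    print(f"Extracted {row_count} rows in bulk")
    conn.close()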

The other way to connect to operational systems is through an API, if one is supported. Web developers will be familiar with this, because most of the connectivity between web applications involves API calls. However, APIs are typically not designed for returning large volumes of data: often they return only single records or the answer to a simple query, and the result set may come back as an XML document rather than as a flat table. Even worse, the limits on data volume in a single API call can mean the developer has to page through the data, making several calls to slowly build up the full result set.
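
The paging pattern usually looks something like the following sketch, which assumes a hypothetical JSON REST endpoint with page and per_page parameters (many real APIs return XML or use cursor tokens instead, but the shape of the loop is the same).

    import requests

    BASE_URL = "https://example.com/api/orders"  # hypothetical endpoint

    def fetch_all_orders(page_size=200):
        """Page through the API until an empty page signals the end."""
        records, page = [], 1
        while True:
            resp = requests.get(
                BASE_URL, params={"page": page, "per_page": page_size}, timeout=30
            )
            resp.raise_for_status()
            batch = resp.json()  # assumes each page is a JSON list of records
            if not batch:
                break
            records.extend(batch)
            page += 1
        return records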

As a result, we have to be careful in choosing the kind of connectivity that we want to use for analytics. It will not always be the case that an OLE DB connector is the best. Many of them are very simple and performance may not be good. Equally, it will not always be the case that an API is inappropriate. Some APIs do have calls that enable bulk data extraction. In other words, we have to know our system pretty well to understand the volume of data we are trying to extract, the characteristics of the source system, and the workload we ask it to perform.

One way of working with operational data sources is to add new objects to the database schema: views and aggregation tables. Both are database objects that organize and present data in a form that is more useful for analysis.

Aggregation tables contain summarized data. They can be precalculated and stored in the database to improve query performance and reduce the amount of time required to generate reports. Views, on the other hand, are virtual tables created by combining data from one or more existing tables in the database. Views can simplify complex queries and provide a simplified, organized view of the data for end users.
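
As a minimal sketch, here is how both objects might be created from Python against a SQLite database; the same DDL pattern applies to most relational operational databases, and the orders and customers tables, column names, and file name are assumptions.

    import sqlite3

    conn = sqlite3.connect("orders.db")  # hypothetical operational database

    # A view that hides a multi-table join behind a simple, query-friendly name.
    conn.execute("""
        CREATE VIEW IF NOT EXISTS v_order_details AS
        SELECT o.order_id, o.placed_at, c.customer_name, o.amount
        FROM orders o
        JOIN customers c ON c.customer_id = o.customer_id
    """)

    # An aggregate table holding precalculated daily totals; in practice it
    # would be refreshed on a schedule or by triggers.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS agg_daily_sales AS
        SELECT date(placed_at) AS sales_date, SUM(amount) AS total_amount
        FROM orders
        GROUP BY date(placed_at)
    """)

    conn.commit()
    conn.close()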

Analytic Data Sources

In this section, we discuss the three types of data sources we might consider as analytic: data warehouses, in-memory engines, and data lakes.

Data Warehouses

The most common analytic data source is the data warehouse, an architecture specifically intended to enable data analysis. Its data is denormalized to make it simple to query and to improve the performance of large analytic queries. The architecture is designed to integrate data from multiple data sources so that work does not need to be repeated. Most importantly, the data warehouse represents not just a data structure, but a business model. That is to say, it is a logical representation of how a business operates.

A data warehouse will have a customer dimension that represents how the business thinks about its customers, and a geography dimension that represents how the business thinks about its divisions, offices, and stores or factories. A data warehouse nearly always includes a time dimension that represents the calendars, working days, holidays, and seasonal divisions relevant to the business. It may include a personnel dimension, which reflects the structure of the company’s hierarchy. All of these predesigned and prepopulated dimensions make analytic queries very simple for the developer. Even better, the data warehouse should be optimized for analytic performance.
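
To make that concrete, a query against such a schema might look like the following sketch, which assumes a fact_sales table joined to hypothetical date and geography dimensions in a SQLite file called warehouse.db.

    import sqlite3

    # Hypothetical star schema: one fact table surrounded by dimensions.
    QUERY = """
        SELECT d.calendar_month, g.region, SUM(f.sales_amount) AS total_sales
        FROM fact_sales f
        JOIN dim_date d      ON d.date_key = f.date_key
        JOIN dim_geography g ON g.geo_key  = f.geo_key
        GROUP BY d.calendar_month, g.region
        ORDER BY d.calendar_month, g.region
    """

    conn = sqlite3.connect("warehouse.db")
    for month, region, total in conn.execute(QUERY):
        print(month, region, total)
    conn.close()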

In fact, a data warehouse—or its simpler, more focused cousin, the data mart—is the primary source for many business intelligence systems and an excellent source of data for embedded analytics.

But not every enterprise, and certainly not every small or medium-sized business, has a data warehouse. Even those enterprises that do have a data warehouse may not want to make it available for operational applications, because in some cases the warehouse is focused on management, financial, or tax reporting: that is, mission-critical business analysis, which may be sensitive or governed in such a way that the business does not wish to make the data directly available to other systems.

Sometimes, we can work around that by creating downstream replicas of specific subject areas, or even of the answers to specific queries that we use regularly.

In-Memory Engines

The data warehouse technology with which we are familiar has been with us since the 1990s. Around 2005, another technology started to appear that has many of the advantages of a data warehouse but offers greater performance and flexibility and has advantages for both large and small businesses. This is the in-memory engine, which takes advantage of the very high performance of memory as compared to disk for performing complex queries. With 64-bit systems (and cheap memory chips) emerging in the mid-2000s, companies could perform analysis on a desktop that previously had only been possible on the largest corporate systems.

Building on these desktop capabilities, in-memory technologies are very popular in self-service business intelligence and are the key to the performance and flexibility of many of the most popular self-service applications, such as Microsoft Power BI, Salesforce Tableau, and Qlik.

In the past, self-service applications were very self-contained and focused on their own analytic capabilities. Their connectivity to other systems was entirely focused on extracting data into their own memory space, and developers didn’t think very much about how they would share that data with other systems.

However, many of the in-memory BI applications available today enable embedding. So, more often than not, it’s possible to take a chart from a BI tool and embed it inside an operational application. This has benefits for developers in terms of familiarity, performance, and simplicity, but there are some limitations, because these systems were often not designed from the start to be embedded; the embedding capability (and the APIs) may be afterthoughts, added later. Nevertheless, for developers familiar with these systems, and where the systems are already deployed in the business or in the enterprise, it’s a very attractive proposition to reuse this functionality and embed it inside operational applications. This scenario is often well supported by the tools and their user communities, and it’s an option that is well worth considering.

Data Lakes

Like the data warehouse, the data lake architecture (and, in its more modern incarnation, the data lakehouse) is focused on the storage of large volumes of data, which are made available for purposes such as data science and, to some extent, analytics.

Unlike the data warehouse, the data lake does not have a built-in business model—that is, a logical semantic architecture that represents how the business thinks about itself, with rules, well-defined processes for governance, or easily navigated data structures. In fact, the data lake is intended to store data in its most raw format without any processing. This is a great advantage for data scientists, who often want to see data in its native state, because they can use that raw data to do more detailed and more complex analyses than they could if the data had already been cleaned up and built into a data model for reporting.

In the data lake, the data scientist can see all the raw detail, with all its errors and complexities. But the business user may find it very difficult to integrate and transform data in the lake as they need it.
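
As a small illustration of working with that raw detail, the sketch below reads a hypothetical batch of raw order events from a lake (assumed here to be JSON Lines files accessible to pandas, for example via s3fs) and simply counts some of the problems a warehouse would normally have cleaned up. The path and column names are assumptions.

    import pandas as pd

    # Hypothetical lake path: raw order events landed as JSON Lines, with no
    # cleansing applied. Reading an s3:// path assumes s3fs is installed;
    # a local path works the same way.
    raw = pd.read_json("s3://data-lake/raw/orders/2024-06-01/events.json", lines=True)

    # The raw detail is all here, including the problems a warehouse would
    # have resolved before loading.
    print(raw["customer_id"].isna().sum(), "events with no customer ID")
    print(raw.duplicated(subset="event_id").sum(), "duplicate events")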

Data Integration Pipelines

Another source of data for an embedded system may be the output of a data integration pipeline. That is to say, rather than reading data from a table in an in-memory system or a data warehouse or operational database, the data is read from a pipeline running in a data science environment. The pipeline may perform numerous operations of integration, cleansing, and preparation on the data, from whatever source it comes. This is a popular scenario for data science. However, it is of limited use for embedded analytics, because the pipeline is more dynamic and more volatile, only delivering data while it is running, which makes the process more fragile and more difficult to govern than a data warehouse or an in-memory system.
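
As a very small sketch of what such a pipeline step might look like in Python (the source file and column names are assumptions), note that the prepared data is only available to the embedded view while code like this is actually running:

    import pandas as pd

    def run_pipeline(source_csv: str) -> pd.DataFrame:
        """Integrate, cleanse, and prepare raw data for the embedded view."""
        df = pd.read_csv(source_csv)
        df = df.drop_duplicates(subset="order_id")                  # cleansing
        df["amount"] = pd.to_numeric(df["amount"], errors="coerce").fillna(0)
        df["placed_at"] = pd.to_datetime(df["placed_at"])           # preparation
        daily = (
            df.groupby(df["placed_at"].dt.date)["amount"]
            .sum()
            .reset_index(name="total")
        )
        return daily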

A scenario where this is popular and useful is when data is continuously changing, referred to as streaming data. This could include data from machines, such as automated lathes in manufacturing, or from an ecommerce ordering system where hundreds of orders may be placed every minute, or from environmental sensors.

In such cases, the analytic scenario is somewhat different and performance is still a consideration. An application offering streaming analytics must include methodologies for handling interruptions to the stream. In addition, it must be able to display changing values without users having to manually refresh the display. Finally, methodologies for governance and security may be different from the analytic governance that we’re used to.
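
One common way to handle interruptions is a reconnect-with-backoff loop around the stream consumer. The sketch below is illustrative only: read_sensor_stream() and update_dashboard() are hypothetical stand-ins for a real message-queue or websocket client and for whatever pushes new values into the embedded view.

    import time

    def consume_with_retry(read_sensor_stream, update_dashboard, max_backoff=30):
        """Consume a stream, reconnecting with exponential backoff on failure."""
        backoff = 1
        while True:
            try:
                for reading in read_sensor_stream():  # hypothetical generator
                    update_dashboard(reading)         # hypothetical UI push
                    backoff = 1                       # healthy read: reset backoff
            except ConnectionError:
                time.sleep(backoff)                   # wait, then reconnect
                backoff = min(backoff * 2, max_backoff)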

Writing Back to Sources

A final point has to be made about the handling of data in embedded analytics. We have been talking so far about data extracted from source systems and presented in the embedded analytics environment purely for the purposes of information and augmentation. This is a read-only scenario. However, there may be times when we want to make changes to the analytic system from within the embedded application—what we call write-back.

Write-back is often used for planning and simulation applications, where we want to insert numbers that represent future estimates of sales or staffing or growth, or whatever the parameters are that we are planning around. It is relatively unusual for systems designed for analysis, such as a data warehouse, to enable direct write-back. Such systems are too concerned with data integrity, data history, and compliance to allow ad hoc changes. But often they will allow write-back to special tables, which can be set up as parameter tables specifically for the purpose.
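
A minimal sketch of that pattern, assuming a dedicated plan_parameters table in a SQLite file (the table, columns, and values are illustrative), might look like this:

    import sqlite3

    conn = sqlite3.connect("warehouse.db")  # hypothetical analytic store

    # A dedicated parameter table keeps write-back away from the governed
    # warehouse tables while still making the values available to queries.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS plan_parameters (
            scenario   TEXT,
            parameter  TEXT,
            value      REAL,
            updated_by TEXT,
            updated_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.execute(
        "INSERT INTO plan_parameters (scenario, parameter, value, updated_by) "
        "VALUES (?, ?, ?, ?)",
        ("FY25-plan", "sales_growth_pct", 7.5, "planner@example.com"),
    )
    conn.commit()
    conn.close()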

Summary

We have seen that data can be sourced for embedded analytics from many different kinds of systems: operational, analytic, and streaming. There is no one best solution for a typical embedded scenario, but it is fair to say that most developers of embedded applications will start with a simple data warehouse or data mart scenario if that already exists within their organization.

Table 4-1 summarizes some of the benefits and weaknesses of the approaches we have discussed.

Table 4-1. Data sources for embedded analytics
CSV files
  Benefits:
  • Widely supported, often human readable.
  Weaknesses:
  • No security mechanism.
  • May be difficult to parse.
  • Performance may be poor.
  • Fragile to user errors.

Operational database
  Benefits:
  • Developers are familiar with the connectivity and table structure.
  • Because you are connecting to operational data, consistency is guaranteed.
  • If replication is available, you may use a copy of the operational system for analytics.
  Weaknesses:
  • Operational table structures are optimized for efficient transactions, not for analytics. Joins can be complex.
  • Analytics is an extra load on the database and can affect performance, although replication can help.

Views and aggregate tables
  Benefits:
  • Analytic queries are simplified because the joins were developed in advance.
  • Filtering is simple.
  • You can apply role-based access rules simply or build them into the view (like the secured views of some cloud data warehouses).
  • Aggregating data in advance improves performance.
  Weaknesses:
  • Response time may be slow if the underlying joins are complex.
  • Views and aggregate tables must be maintained separately from operational tables: another burden for administrators.
  • Aggregate tables must be updated, often using triggers. Updates have to be carefully scheduled or they will affect the performance of the operational system.

Data mart or data warehouse
  Benefits:
  • High performance, as the architecture and schema are optimized for analytic queries.
  • Simple to integrate with BI and analytics tools, as the schema is designed for analysis.
  • Load on the transactional source system is reduced.
  Weaknesses:
  • Loading the data warehouse using ETL is a specialized practice.
  • The load and refresh cycle of the warehouse (often scheduled to run overnight) may be too slow for analytics embedded in an operational application.
  • Changing the schema can be complex, requiring a lengthy process of specifying, implementing, and testing.

In-memory engine
  Benefits:
  • Very high performance.
  • Should be closely integrated with the embedded analytics environment.
  Weaknesses:
  • Support for writing back to the database is often limited.
  • Data updates may be occasional and often require full, rather than incremental, updates.

Data lake
  Benefits:
  • If using a data lakehouse architecture, you will find many of the same benefits as a data warehouse.
  • Data is generally stored in its native format, which can be good for predictive analytics.
  • Data may be continuously updated.
  Weaknesses:
  • If not using a lakehouse architecture, you may find performance for business intelligence poor because so much transformation is required to conform and shape data.
  • Data lakes are generally shared with data scientists, data engineers, and analysts. Security can be a concern.
  • The use of raw data increases the risk of users seeing different variations of the same data because the data has not been validated to the same extent as in a data warehouse or mart.

In general, we recommend starting with a data mart or data warehouse connection if this already exists in your organization’s architecture. You’ll find it gives the best balance of accessibility, performance, and governance for your needs. A data lake or lakehouse, if well managed, offers some similar advantages. However, if you don’t already have a formal data warehouse or lakehouse in place, it is a significant effort to build one and likely beyond the scope of your embedded analytics project on its own.

In this case, you may find it practical to start with views and aggregate tables, defined for your purposes and hosted within the operational database. The administrator can provision access and ensure security and good governance, performance should be acceptable, and data access should be possible with a suitable driver.

Having reviewed the data sources that are available, in the next chapter we’ll consider the different types of embeddable objects and how we can integrate them visually with applications for the most effective support.
