Chapter 1. The Current Landscape
Somewhere in the business, someone is asking IT for a unified view of information that's actually stored across multiple source systems within the organization. This person wants a single report, a single web page, or some single "pane of glass" that she can look through to view information crucial to her organization's success and to bolster confidence in her division's ability to execute and deliver accurate information to help the business succeed.
In addition, businesses are realizing that a "single view" alone is not enough, as the need to transact business across organizational silos becomes increasingly pressing. Today's digital-first consumers have less and less patience for hearing, "That's a different department; please hold while I transfer you."
What’s the reality? The data is in silos. Data is spread out across mainframes, relational systems, filesystems, Microsoft SharePoint, email attachments, desktops, local shares; it’s everywhere! (See Figure 1-1.) For any one person in an organization, there are multiple sources of data available.
Because the data isn’t integrated, and reports still need to be created, we often find the business performing “stare and compare” reporting and “swivel chair integrations.” This is when a person queries one system, cuts and pastes the results into Microsoft Excel or PowerPoint, swivels his chair, queries the next system, and repeats until he has all the information he thinks he needs. The final result is another silo of information manifested in an Excel or PowerPoint report that ultimately winds up in someone’s email somewhere.
This type of integration is manual, error-prone, and takes too long to deliver answers the business can act upon. So, what happens next? Fix it! The business submits a request to IT to integrate the information. The result is a data mart and the creation of a new silo. There will be DBMS provisioning, reporting schema design, index optimizations, and finally some form of ETL to produce a report. If the mart already exists, modifications to the existing schemas and updates to the ETL processes that populate the new values will be required. And the cycle continues.
The sources for these silos are varied, and each makes sense in the context in which it was created:
To ensure it can quickly report its finances and other information, the business asks IT to integrate multiple, disparate sources of data, creating a new data silo. Some thought might have been given to improving or even consolidating existing systems, but IT assessed the landscape of existing data silos and found the changes daunting.
Many silos have been stood up in support of specific applications for critical business needs, each application often coupled with its own unique backend system, even though many likely contain data duplicated across existing silos. And the data is worse than just duplicated: it’s transformed. Data might be sourced from the same record as other data elsewhere, but it no longer looks the same, and in some cases has diverged, causing a master data problem.
Overlapping systems with similar datasets and purposes are acquired as a result of mergers and acquisitions with other companies, each of which had fragmented, siloed data!
Before we look at the reality of what IT will face in integrating these disparate and varied systems, let's consider how much is at stake by looking briefly at a few real-world scenarios. If we're not able to integrate data successfully, the challenges and potential problems for our business go far beyond not being able to generate a unified report. If we can't integrate enterprise resource planning (ERP) systems, are we reporting our finances and taxes correctly to the government? If we work in regulated industries such as financial services, what fines will we face if we can't respond rapidly to audits from financial regulators that demand an integrated view of data and the ability to answer ad hoc questions? If we can't integrate HR systems, how can we be sure that an employee who has left the company for new opportunities, or who has been terminated, is no longer receiving paychecks and no longer has access to facilities and company computer systems? If you're a healthcare organization and you can't integrate systems, how can you be certain that you have all the information needed to ensure the right treatment? What if you prescribe a medicine or procedure that is contraindicated for a patient taking another medicine that you were unaware of? These types of challenges are real, and organizations, if not entire industries, face them daily.
In 2004, I was a systems engineer for ERP-IT at Sun Microsystems. In that job, I helped integrate our ERP systems. In fact, we end-of-lifed 80 legacy servers hosting 2 custom, home-grown applications to consolidate all of Sun's ERP data into what was at the time the world's largest Oracle instance. But the silos remained! I do know of people who left the company or were laid off who continued for a long time to receive paychecks and to have access to campus buildings and company computer networks. This happens because HR data sits in silos, and updates to one system don't propagate to other critical systems. The amazing thing is that even though I witnessed this in 2005, those same problems still exist!
Data silos are embedded in or support mature systems that have been implemented over long periods of time and have grown fragile. As we begin to work our way out of this mess and survey the landscape, we frequently find mainframes, relational systems, and filesystems from which we'll need to gather data. As one CIO who had lived through many acquisitions told us, "You go to work with the architecture you have, not the one you designed."
Types of Databases in Common Use
Let’s take a look at the nature of each of these typical data store types: mainframes, relational, and filesystems.
Mainframe
IBM first introduced IMS, its hierarchical database management system (DBMS), in 1966 in advance of the Apollo moon mission. It relied on a tree-like data structure, which reflects the many hierarchical relationships we see in the real world.
The hierarchical approach puts every item of data in an inverted-tree structure, extending downward in a series of parent-child relationships. This provides a high-performance path to a given bit of data. (See Figure 4-1.)
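To make the shape of the model concrete, here is a minimal sketch in Python of a parent-child record and a lookup that walks it. The segment names and fields are invented for illustration; this is not IMS's actual segment API.

```python
# A minimal sketch of a hierarchical (parent-child) record in plain Python;
# the segments are invented and this is not IMS's actual segment API.
order = {
    "segment": "ORDER", "id": "A-100",
    "children": [
        {"segment": "CUSTOMER", "name": "Acme Corp", "children": []},
        {"segment": "LINE_ITEM", "part": "P-7", "qty": 3, "children": []},
    ],
}

def find_segments(node, segment_type):
    """Walk down the tree from the root, following parent-child links."""
    matches = [node] if node["segment"] == segment_type else []
    for child in node["children"]:
        matches.extend(find_segments(child, segment_type))
    return matches

# Fast when the access path follows the tree's shape; awkward when it doesn't.
print(find_segments(order, "LINE_ITEM"))
```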
The challenge with the mainframe is that it's inflexible, expensive, and difficult to program. Queries that follow a database's hierarchical structure can be fast, but others might be very slow. There are also legacy interfaces and proprietary APIs that only a handful of people in the organization might be comfortable with, or know well enough to use at all.
Mainframes continue to be a viable mainstay, with security, availability, and superior data server capability topping the list of considerations.1 But to integrate information with a mainframe, you first need to get the data out of the mainframe; and in many organizations, this is the first major hurdle.
Relational Databases
In 1970, E.F. Codd turned the world of databases on its head. In fact, the concept of a relational database was so groundbreaking, so monumental, that in 1981 Codd received the A.M. Turing Award for his contribution to the theory and practice of database management systems. Codd’s ideas changed the way people thought about databases and became the standard for database management systems.
A relational database has a succinct mathematical definition based on the relational calculus: data is organized into relations (roughly "tables," in normal language), each row is identified by a primary key, and related data is brought back together through joins.
Previously, accessing data required sophisticated knowledge and was incredibly expensive and slow, because the data was inextricably tied to the application for which it was conceived. In Codd's model, the database's schema, or logical organization, is disconnected from physical information storage.
In 1970, it wasn't instantly apparent that the relational database model was better than existing models. But eventually it became clear that the relational model was simpler and more flexible, because SQL (developed in the early 1970s) allowed users to run ad hoc queries that could be optimized in the database rather than in the application. SQL is declarative: it tells the database what data is wanted, not how to retrieve it. Thus, SQL became the standard query language for relational databases.
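Here is a quick sketch of that declarative style using Python's built-in sqlite3 module; the table and rows are invented for illustration.

```python
import sqlite3

# Illustrative only: the table and rows are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EMEA", 120.0), (2, "AMER", 80.0), (3, "EMEA", 45.5)])

# Declarative: we state *what* we want (totals per region); the database
# decides *how* to scan, filter, and aggregate.
for region, total in conn.execute(
        "SELECT region, SUM(total) FROM orders GROUP BY region"):
    print(region, total)
```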
Relational databases have continued to dominate in the subsequent 45 years. With their ability to store and query tabular data, they proved capable of handling most online transaction-oriented applications and most data warehouse applications. When their rigid adherence to tabular data structures created problems, clever programmers circumvented the issues with stored procedures, BLOBs, object-relational mappings, and so on.
The arrival of the personal computer offered low-cost computing power that allowed any employee to input data. This coincided with the development of object-oriented (OO) methods and, of course, the internet.
In the 1970s, ’80s, and ’90s, the ease and flexibility of relational databases made them the predominant choice for financial records, manufacturing and logistical information, and personnel data. The majority of routine data transactions—accessing bank accounts, using credit cards, trading stocks, making travel reservations, buying things online—all modeled and stored information in relational structures. As data growth and transaction loads increased, these databases could be scaled up by installing them on larger and ever-more powerful servers, and database vendors optimized their products for these large platforms.
Unfortunately, scaling out by adding parallel servers that work together in a cluster is more difficult with relational technology. Data is split into dozens or hundreds of tables, which are typically stored independently and must be joined back together to access complete records. This joining process is slow enough on one server; when distributed across many servers, joins become more expensive and performance suffers. To work around this, relational databases tend to replicate data to read-only copies to increase performance. That approach risks introducing data integrity issues, because separate copies of the data (often in a different schema) must now be managed, and it tends to rely on expensive, proprietary hardware.
Data modeling in the relational era
One of the foundational elements of Codd's model was third normal form (3NF), which required a rigid decomposition of data to reduce ambiguity. In an era of expensive storage and compute power, 3NF broke records into their most atomic form and reused that data across records via joins, eliminating redundancy and ensuring atomicity, among other benefits. Today, however, storage is cheap. And although this approach works well for well-structured data sources, it fails to accommodate information in the shapes that are natural for humans to produce and parse. Trying to model all conversations, documents, emails, and so on within rows and columns becomes impossible.
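As a small illustration of what that decomposition looks like in practice (the tables and data below are invented), a normalized design stores a customer's name once and reaches it from each purchase via a key and a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A 3NF-style sketch with invented tables: the customer's name is stored
# once and referenced by key instead of being repeated on every purchase row.
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchase (
        purchase_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        amount REAL);
    INSERT INTO customer VALUES (1, 'Acme Corp');
    INSERT INTO purchase VALUES (10, 1, 250.0), (11, 1, 75.0);
""")

# Reassembling a complete record means joining the pieces back together.
for name, amount in conn.execute("""
        SELECT c.name, p.amount
        FROM purchase p JOIN customer c ON c.customer_id = p.customer_id"""):
    print(name, amount)
```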
The challenge with ETL
But from the “when you have a hammer, everything becomes a nail” department, relational databases emerged as the de facto standard for storing integrated data in most organizations. After a relational schema is populated, it is simple to query using SQL, and most developers and analysts can write SQL queries. The real challenge, though, is in creating a schema against which queries will be issued. Different uses and users need different queries; and all too often, this is provided by creating different schemas and copies of the data. Even in a new, green-field system, there will typically be one transactional system with a normalized schema and a separate analytic data warehouse, with a separate (star or snowflake) dimensional schema.
Data integration use cases make this problem even more difficult. To appropriately capture and integrate all the existing schemas (and possibly mainframe data and text content) that you want to integrate takes a tremendous amount of time and coordination between business units, subject matter experts, and implementers. When a model is finally settled upon by various stakeholders, a tremendous amount of work is required to do the following:
Extract the information needed from source systems,
Transform the data to fit the new schema, and
Load the data into the new schema.
And thus, ETL. Data movement and system-to-system connectivity proliferate to move data around the enterprise in hopes of integrating the data into a unified schema that can provide a unified view of data.
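A compressed, hypothetical sketch of that extract-transform-load pattern in Python follows; the source file, columns, and target table are invented, and real pipelines add staging, validation, error handling, and scheduling.

```python
import csv, io, sqlite3

# Invented CSV standing in for an extract from one source system.
source_csv = io.StringIO("emp_id,first,last,dept\n7,Ada,Lovelace,ENG\n")

target = sqlite3.connect(":memory:")
target.execute(
    "CREATE TABLE employee (id INTEGER, full_name TEXT, department TEXT)")

for row in csv.DictReader(source_csv):           # Extract
    record = (int(row["emp_id"]),                # Transform: reshape the
              row["first"] + " " + row["last"],  # record to fit the target
              row["dept"])                       # schema
    target.execute("INSERT INTO employee VALUES (?, ?, ?)", record)  # Load

print(target.execute("SELECT * FROM employee").fetchall())
```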
We'll illustrate this challenge in more detail when we contrast the relational approach with the multi-model approach to data integration in Chapter 2. But most of us are already very familiar with the problems here. A new target schema for integrating source content is designed around the questions the business wants to ask of the data today. ETL works great if the source systems' schemas don't change and if the target schema used for unification doesn't change. But what regularly happens is that the business changes the questions it wants to ask of the data, requiring updates to the target schema. Source systems might also adapt and update their schemas to support different areas of the business. A business might acquire another company, and then an entire new set of source systems will need to be included in the integration design. In any of these scenarios, the ETL jobs will require updates.
Often, the goal of integrating data into a target relational schema is not met because the goal posts for success keep moving whenever source systems change or the target system schema changes, requiring the supporting ETL to change. Years can be spent on integration projects that never successfully integrate all the data they were initially scoped to consolidate. This is how organizations find themselves two and a half years into a one-year integration project.
The arrows in Figure 1-2 are not by any means to scale. Initial modeling of data using relational tools can take months, even close to a year, before subsequent activities can begin. In addition, there are big design trade-offs to be addressed. You can begin with a sample of data and not examine all potential sources; this can be integrated quickly but inflexibly and will require change later when new sources are introduced. Or you can aim for a complex, "flexible" design based on full requirements; however, this can lead to poor performance and extended time to implement. The truth for the majority of data integration projects is that we don't know what sources of data we might integrate in the future. So even if you have "full requirements" at a point in time, they will change when any new source or query type is introduced.
Based on the model, ETL will be designed in support of transforming this data, and the data will be consumed into the target database. Consumer applications being developed on separate project timelines will require exports of the data to their environments so they can develop in anticipation of the unified schema. As we step through the software development lifecycle for any IT organization, we find ourselves “ETLing” the data to consume it as well as copying and migrating data and supporting data (e.g., lookup tables, references, and data for consumer projects) to a multitude of environments for development, testing, and deployment. This work must be performed for all sources and potentially for all consumers. If anything breaks in the ETL strategy along this path, it’s game over: go back to design or your previous step, and start over.
When we finally make it to deploy, that's when users can begin actually asking questions of the data. After the data is deployed, anyone can query it with SQL. But, first, it takes too long to get here: there's too much churn in the previous steps. Second, if you succeed in devising an enterprise warehouse schema for all the data, that schema might be too complex for all but a handful of experts to understand, so you're limited to having only those few people write applications against it. An interesting phenomenon can occur at this point: subsets of the data are subsequently ETL'd out of the unified source into a more query-friendly data mart for others to access, and another silo is born!
All of our components must be completely synchronized before we deploy. Until the data is delivered at deployment, it remains inaccessible to the business while developers and modelers churn on it to make it fit the target schema. This is why so many data integration projects fail to deliver.
Schema-first versus schema-later
The problem with using relational tools, ETL, and traditional technologies for integration projects is that they are schema-first. These tools are schema-driven, not data-driven. You first must define the schema for integration, which requires extensive design and modeling. That model will be out of date the day you begin, because business requirements and sources of data are always changing. When you put the schema first, it becomes the roadblock to all other integration efforts. You might recognize the schema-first pattern from master data management (MDM) projects as well. But there is a solution to this problem!
A multi-model database gives us the ability to go schema-later. A multi-model database is unique in that it separates data ingestion from data organization. Traditional tools require that we define how we are going to organize the data before we load it. Not so with multi-model. By providing a system that lets us load data as is and access it immediately for analysis and discovery, we can begin to work with the data from the start of any project and then lay schemas on top of it as needed. This data-driven approach lets us deliver meaningful information to consumers sooner rather than later. Schema-later allows us to work with the data and add schema to it at the same time! We'll cover just how multi-model databases do this in much greater detail in the coming chapters.
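To ground the idea without tying it to any specific product's API, here is a minimal sketch in plain Python (the records and field names are invented) of loading data as is and querying it immediately, with any schema deferred until a question actually calls for one.

```python
# Schema-later sketch in plain Python (not any particular product's API):
# records are loaded exactly as they arrive, each keeping its own shape.
documents = [
    {"source": "crm", "customer": {"name": "Acme Corp", "tier": "gold"}},
    {"source": "erp", "CUST_NAME": "Acme Corp", "open_orders": 4},
    {"source": "mail", "text": "Acme Corp asked about renewal pricing."},
]

def mentions(value, term):
    """Query the data as loaded, walking whatever structure each record has."""
    if isinstance(value, dict):
        return any(mentions(v, term) for v in value.values())
    if isinstance(value, list):
        return any(mentions(v, term) for v in value)
    return isinstance(value, str) and term.lower() in value.lower()

# Immediately useful for discovery; schemas or views can be layered on later,
# only for the questions that turn out to need them.
print([d["source"] for d in documents if mentions(d, "acme")])
```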
Filesystem
Now, we often don't think of text as a data model, but it very much is. Text is a form, or shape, of data that we'll want to address in a multi-model database. Structured query in a relational system is great when we have some idea of what we're looking for: if we know the table names and column names, we know where to direct our query. But when we're not sure where the data might reside in a document or record, and we want to be able to ask any question of any piece of the data, we need to be able to search text.
We might want to search text within a relational system, or we might want to search documents stored in SharePoint or on a filesystem. Most large systems have text, or evolve to include text. Organizations go to great lengths to improve query and search, either shredding documents into relational systems, augmenting their SharePoint team sites with lists, or bolting on an external search engine. It’s common to see systems implemented that are a combination of structured, transactional data stored in a relational database that’s been augmented with some sort of search engine for discovery, for which fuzzy relevance ranking is required.
Text has meaning in search. The more times a search term is repeated within a document, the more relevant that document might be to our search. Words can occur in phrases; they have proximity to one another; and they can appear in the context of other words or of a document's structure, such as a header, a footer, a paragraph, or metadata. Words have relevance and weight. Text is a different type of data from what traditional database systems handle, traditionally requiring different indexes and a different API for querying. If we want to include any text documents or fields in a unified view, we'll need to address where text fits in multi-model, and it does so in the context of search. Search is the query language for text. Search engines index text and structure, and these indexes are then often integrated with the results of other queries in an application layer to help us find the data we're looking for. But as we'll find out, in a multi-model database, search and query can be handled by a single, unified API.
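As a minimal sketch of the term-frequency intuition above (the documents and scoring function are invented, and production search engines do far more), the example below ranks documents by how often a search term occurs.

```python
import re
from collections import Counter

# Invented documents for illustration.
docs = {
    "policy.txt": "Claims must reference the policy number. Policy renewals "
                  "require the original policy document.",
    "meeting.txt": "Renewal meeting notes: discussed pricing and timelines.",
}

def score(text, term):
    """Naive relevance: count occurrences of the term. Real engines also
    weight phrases, proximity, document structure, and term rarity."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)[term.lower()]

query = "policy"
ranked = sorted(docs, key=lambda name: score(docs[name], query), reverse=True)
print(ranked)  # documents ordered by how often the search term occurs
```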
Many times, text is locked within binaries. It's not uncommon to find artifacts like Java objects or Microsoft Word documents persisted as Binary Large OBjects (BLOBs) or Character Large OBjects (CLOBs) within a relational system. BLOBs and CLOBs are typically large files. On their own, BLOBs give the database system no indication of what their contents are. They are opaque. They could contain a Word document, an image, a video, anything binary. They have to be handled in a special way through application logic, because a DBMS has no way of understanding the contents of a BLOB in order to know how to deal with it. CLOBs are only marginally better than BLOBs in that the DBMS does understand that the stored object is textual. However, CLOBs typically don't give you much more visibility into the text, given the performance overhead of how RDBMS databases index such data.
Often, the CLOB data that a business would like to query today is extracted via some custom process into columns in a relational table, with the remainder of the document stored in a CLOB column that is essentially opaque to any search or query. While searching and querying the columns that are available, the business will stumble upon some insight that makes it realize it would like to query additional information from the CLOB'd document. It will then submit a request to IT. IT will need to schedule the work, update the existing schema to accept the new data, and update its extraction process to pull the new information and populate the newly added columns. Given traditional IT release cycles, accomplishing this can typically take months! But the business wants and needs answers today. Waiting months to see new data in its queries adds to the friction between business and IT, and it is solely the result of using relational systems to store this data. Conversely, because Java objects can be saved as XML or JSON, and because Word documents are essentially ZIP files of XML parts, if we use a multi-model database we can store the information as is, readily available for search and query, without additional requests to IT to update any processes or schemas. The effort to use these data sources in their natural form is minimized.
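The claim that Word documents are ZIP files of XML parts is easy to check. Here is a small sketch that opens a .docx with Python's standard zipfile module and exposes its text for search; report.docx is a hypothetical file name used only for illustration.

```python
import re
import zipfile

# "report.docx" is a hypothetical local file used only for illustration.
with zipfile.ZipFile("report.docx") as docx:
    # A .docx file is a ZIP archive of XML parts; the body text lives in
    # the word/document.xml part.
    xml = docx.read("word/document.xml").decode("utf-8")

# Crude tag stripping, just to show the text is right there and searchable;
# a real pipeline would parse the XML properly.
text = re.sub(r"<[^>]+>", " ", xml)
print("renewal" in text.lower())
```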
The Desired Solution
We've addressed mainframes, relational systems, and filesystems, as these are the primary sources of data we're grappling with in the organization today. There might be others, too, and with all of these data stores comes a considerable load of mental inertia and physical effort required for people to interact with and make use of their data across so many different systems.
A benefit of multi-model, and the reason we find more and more organizations embracing multi-model databases, is this: even though any one person faces multiple sources of data (Figure 1-1), with a multi-model database we can implement solutions with one source of data that provides different lenses to many different consumers, each requiring a different unified view across disparate data models and formats, all contained within a single database, as demonstrated in Figure 1-3.
Figure 1-3 illustrates the desired end state for many solutions. Multi-model databases have emerged as the glue that binds disparate sources of information into a unified view, and they are being used to create solutions in which data is aggregated quickly and delivered to multiple consumers in multiple formats. In the following chapters, we'll dig deeper into how a multi-model database can get us to this desired end state quickly. We'll cover how we work with data using a multi-model approach and how this contrasts with a relational approach, and we'll examine the significant benefits realized in the reduction of time and effort required to implement a solution in multi-model, as well as the overall reduction of ETL and data movement in these solutions.
1 Ray Shaw, “Is the mainframe dead?” ITWire, January 20, 2017.