Data in its natural format
Integrate and access any form of data using a multi-model database.
Each database management system, relational or non-relational, has its strengths and weaknesses. Often, application developers or database designers choose the database system they know or like the best, and fit the data into that system—whether it’s suitable or not.
In large organizations, this process of choosing a database system, building to it, and wrangling data to fit has often been repeated over and over. An enterprise may have dozens or even hundreds of data stores created by different departments, different business units, or by companies that have been acquired. When decision-makers need to combine that data for analysis, a complex and often brittle extract-transform-load (ETL) process is built to move relevant data into an analytics data store, a spreadsheet, a report, or a presentation. Without its original context and format, the extracted data is not as meaningful as it should be.
Instead of fitting the data to the database, what if we fit the database to the data? Instead of connecting silos, what if we just got rid of them altogether?
“Model” refers to how a database stores and organizes the data it has been given. This has to do with the high-level conceptual organization, and how one interacts with the data via a query language—not precisely how the data is physically stored; databases with nearly identical models might have very different physical implementations.
The relational model
The conventional database model is relational—the SQL databases many of us are familiar with. Data is stored in tables, where each row is an item in a collection, and each column represents a particular attribute of the items.
Well-designed relational databases conventionally follow third normal form (3NF), which eliminates redundant data from the system. To accomplish this, references to any entity are made using foreign keys. So, for example, in an ecommerce system with an orders table and a customers table, the row referring to a single order will reference the ID of the customer instead of storing the customer’s information directly.
Here, we already see one of the problems with relational databases: that hypothetical orders table doesn’t have the customer information, so a join is needed to pull it in. Each customer might have many addresses, stored in a customer addresses table. The orders table can’t store a list of items in the order; that information is in an order items table. But the order items table only references product keys, so getting details like the name of a product requires looking at the products table. So, for something as simple as viewing order details, we have to join across at least five tables.
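The five-table join described above can be sketched concretely. This is a minimal illustration using an in-memory SQLite database; all table names, column names, and sample data are hypothetical, not taken from any real system.

```python
import sqlite3

# Build the five normalized tables from the example (all names hypothetical).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_addresses (
        id INTEGER PRIMARY KEY, customer_id INTEGER, city TEXT);
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE order_items (
        order_id INTEGER, product_id INTEGER, quantity INTEGER);

    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO customer_addresses VALUES (1, 1, 'London');
    INSERT INTO products VALUES (10, 'Widget');
    INSERT INTO orders VALUES (100, 1);
    INSERT INTO order_items VALUES (100, 10, 2);
""")

# "Something as simple as viewing order details" touches all five tables.
rows = con.execute("""
    SELECT o.id, c.name, a.city, p.name, i.quantity
    FROM orders o
    JOIN customers c          ON c.id = o.customer_id
    JOIN customer_addresses a ON a.customer_id = c.id
    JOIN order_items i        ON i.order_id = o.id
    JOIN products p           ON p.id = i.product_id
""").fetchall()
print(rows)  # [(100, 'Ada', 'London', 'Widget', 2)]
```

Nothing here is wrong, exactly; it is simply a lot of machinery to reassemble one logical entity.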
This deconstruction of information into the relational model has advantages:
- With a homogeneous data set, it is highly space efficient.
- The data integrity of 3NF ensures consistency about what is considered “true” in the system.
- A well-designed schema, for a domain where such a thing can exist, often reveals the shape of the domain and can guide application logic.
There are disadvantages, too:
- Data integrity requires a high degree of dependency across tables. This can cause scaling problems because sharding across multiple machines may physically separate data that has to be joined to be useful.
- Relational data schemas are complex and hard to change, and have to be designed and implemented before data is added to the system. If you can’t predict what the incoming data will look like, it is impossible to create a meaningful schema for it.
- There is a mismatch between how data exists in the world, how it is stored in the database, and how it is used by any consuming application. This can cause a loss of context and detail, making the data less meaningful.
Over the last decade or so, a number of new database models have emerged as engineers have tried to compensate for the disadvantages of the relational model, or needed to solve new sets of problems created by the size and scope of the internet. Collectively, these are called NoSQL databases.
Document databases store complete, usually self-contained representations of entities. In a document-centered application—which includes anything from web content to medical record keeping—this can be a very natural approach to storing data. For this reason, and because they are conceptually easy to understand and use, the most popular NoSQL databases are document oriented.
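For contrast with the relational example, here is the same order as a single self-contained document, the way a document database would store it. The field names are illustrative only.

```python
import json

# The whole order in one record: customer, address, and items inline,
# rather than scattered across five normalized tables.
order = {
    "order_id": 100,
    "customer": {"name": "Ada", "city": "London"},
    "items": [{"product": "Widget", "quantity": 2}],
}

# Round-trip through JSON, the typical wire format for document stores.
doc = json.dumps(order)

# No joins: everything needed to display the order is already here.
product = json.loads(doc)["items"][0]["product"]
print(product)  # Widget
```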
A graph is a web of nodes and the connections (called “edges”) between them. The connectivity of the internet, the web, and social media has provided a number of important uses for graph databases, as they are the most natural way to model the network of relationships among a collection of entities.
A subset of graph databases is the semantic triple store, which is more concerned with storing relational facts among dissimilar entities. So, while a social graph might show you who is friends with whom, a triple store will tell you which part belongs to which subassembly.
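Both flavors reduce to the same underlying shape: facts expressed as (subject, predicate, object) triples. A toy sketch, with entirely made-up entity names:

```python
# A tiny triple store: each fact is a (subject, predicate, object) tuple.
# Social-graph facts and part/subassembly facts coexist naturally.
triples = {
    ("bolt-7", "part_of", "gearbox"),
    ("gearbox", "part_of", "drivetrain"),
    ("alice", "friend_of", "bob"),
}

def objects(subject, predicate):
    """All objects matching a (subject, predicate, ?) pattern."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("bolt-7", "part_of"))   # {'gearbox'}
print(objects("alice", "friend_of"))  # {'bob'}
```

Real triple stores add standardized vocabularies and query languages (RDF and SPARQL), but the pattern-matching idea is the same.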
A key-value store is, in essence, an associative array or dictionary. Because of their relative simplicity, they allow very fast writes. Reading out is very efficient, too, as long as you already know the key you are looking for. Searching and querying by the values is typically difficult.
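The asymmetry between key lookup and value search is easy to see with an ordinary dictionary, which is essentially what a key-value store is:

```python
# A key-value store behaves like a dictionary: constant-time access by key.
store = {}
store["session:42"] = "ada"   # fast write
user = store["session:42"]    # fast read -- when you already know the key
print(user)  # ada

# Querying by value, however, means scanning every entry: the weakness
# noted above.
keys_for_ada = [k for k, v in store.items() if v == "ada"]
print(keys_for_ada)  # ['session:42']
```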
Problems of multiple models in the enterprise
In a very large organization, where different business units and departments make engineering decisions to meet their own needs, a diversity of models will prevail. This is even more true if companies have been acquired over time. Also, business longevity means that legacy databases will often be in place.
Using different models for different applications makes perfect sense, as no single model is ideal for all situations. A system managing financial transactions needs absolute consistency, which usually means a relational database and third normal form. An enterprise knowledge management or customer records system might be better served by a document-oriented system. Even if everyone in an organization used only relational databases, the differing schemas created to solve particular problems would still create difficulty.
Silos and how they get connected
When the order data is in the sales database, but the details that led up to the sale are in the CRM system, the information about how that customer first connected with your company is in the marketing database, and their previous order history is in a database from a company you acquired, you have a problem. It becomes impossible to get a holistic view of any aspect of your business.
There are a few common ways organizations try to overcome this gap.
An analyst or team of analysts manually queries the various data stores and loads the results into a spreadsheet. This aggregated data becomes the basis for a report, which is then read by executives.
Problems with this approach:
- The new information created by building the report cannot stay up to date. The process has to be repeated any time you want to see new information.
- The copy-paste-manipulate process leaves many opportunities for human error and bad judgment.
- It is extremely time consuming.
- Reports can only answer questions that report builders thought to ask ahead of building the report. Genuinely new insights that might come from being able to “play with” the data are not available.
One step up from manually aggregating data into a report is aggregating it into a new database, typically a relational database with a schema designed to answer particular types of analysis questions. Scripts are written to extract, transform, and load (ETL) data out of the other systems and into the new one, and are then scheduled to run every day or every hour.
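A scheduled ETL script of this kind is usually just a loop: read from the source, reshape, write to the warehouse. A minimal sketch, with hypothetical source and target schemas:

```python
import sqlite3

# Hypothetical source system: an orders table storing amounts in cents.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE orders (id INTEGER, amount_cents INTEGER);
    INSERT INTO orders VALUES (1, 1000), (2, 2500);
""")

# Hypothetical analytics target, with its own (different) schema.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE daily_sales (order_id INTEGER, amount_usd REAL)")

# Extract from the source, transform (cents -> dollars), load into the target.
for order_id, cents in source.execute("SELECT id, amount_cents FROM orders"):
    target.execute("INSERT INTO daily_sales VALUES (?, ?)",
                   (order_id, cents / 100))

total = target.execute("SELECT SUM(amount_usd) FROM daily_sales").fetchone()[0]
print(total)  # 35.0
```

Note what the transform step throws away: the target schema keeps only what its designers thought to ask for, which is exactly the limitation the list below describes.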
Problems with this approach:
- Defining a schema that will work well for data coming from a variety of sources can be very difficult.
- The ETL process robs data of its original context and format, which may affect its meaningfulness.
- Like the reports, the new database will often only be able to answer the types of questions that were thought of when designing the schema.
- Changes in the way data is organized in the originating databases will require changes to the aggregating database.
Multiple models, one database
A multi-model database can solve these problems by providing a single storage solution for all the different types of data an organization handles.
What is a multi-model database?
A multi-model database supports multiple data models in their native form within a single, integrated back end, and uses data standards and query standards appropriate to each model. Queries are extended or combined to provide seamless queries across all the supported data models. Indexing, parsing, and processing standards appropriate to the data model are included in the core database product.
A multi-model database has relational storage, document storage, graph storage, and key-value storage together in one place. Data does not need to be transformed before it is loaded, and a relational schema does not need to be designed before data is brought in.
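The payoff is the ability to answer one question from several models at once. The following is a toy illustration of that idea, not the API of any real multi-model product: a document and a graph fact live side by side, and a single function reads both.

```python
# Toy "multi-model" back end: documents and triples in one place
# (all names and data are invented for illustration).
documents = {
    "order-100": {"customer": "ada", "total": 19.99},
}
triples = {("ada", "lives_in", "London")}

def order_with_customer_city(doc_id):
    """One query spanning the document model and the graph model."""
    doc = documents[doc_id]
    city = next(o for s, p, o in triples
                if s == doc["customer"] and p == "lives_in")
    return {**doc, "customer_city": city}

print(order_with_customer_city("order-100"))
# {'customer': 'ada', 'total': 19.99, 'customer_city': 'London'}
```

In a real multi-model database this cross-model join happens inside the engine, through its query language, rather than in application code.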
This allows querying across data of different types, without losing the context and formatting that often gives meaning to individual pieces of information.
Why use a multi-model database?
A multi-model approach lets you keep data in the form it naturally takes, minimizes or eliminates the need to transform data to fit new formats, and allows you to consume data before knowing how it will be organized. We already live in a multi-model world; a multi-model database system brings that world view into a single application.
To learn more about how to reduce complexity and shorten time-to-value using a multi-model database, download our free report Building on Multi-Model Databases.
This post is a collaboration between O’Reilly and MarkLogic. See our statement of editorial independence.