Chapter 4. Documents and Text

Up until this point, we’ve looked at multi-model through the lens of data integration with a focus on integrating data from relational systems. This is frequently where data integration begins, but there is tremendous value in being able to search structured data with unstructured and semi-structured data such as documents and text. Unfortunately, most people who hear documents and text in relation to a database automatically think of Binary Large OBjects (BLOBs) or Character Large OBjects (CLOBs) or the amount of shredding required to get those documents and text to fit a relational schema. But a multi-model database allows us to load them and use them as is because their text and structure are self-describing.

When we asked more than 200 IT professionals what percentage of their data was not relational, 44% said 1 to 25% of their data was unstructured. That is a shockingly low volume of unstructured data—for a surprisingly high number of organizations. Although it is possible that these organizations really do have a paucity of data that isn’t relational, it is more likely that most companies aren’t dealing with their unstructured data because it is simply too inconvenient to do so.

The ubiquity of relational databases has meant that, for many, only that which fits in a relational database management system (RDBMS), such as billing, payroll, customer, and inventory data, is considered data. But this view of data is changing. Analysts estimate that more than 80% of data being created and stored is unstructured. This includes documentation, instant messages, email, customer communications, contracts, product information, and website tracking data.

So why do so many survey respondents report that no more than 25% of their organization’s data is unstructured? Possibly because data that is unstructured (or even semi-structured) is rarely dealt with by IT management.

That doesn’t make unstructured data low-value; the opposite is true. It means that this highly valuable content is going unused.

But there are options for modeling, storing, and querying unstructured data that give us insight into the 80% of the data that, for many, has remained unexplored. Further, any multi-model database that includes a structured model, such as JSON or XML, with text indexing provides the ability to query structured and unstructured data together.

Schemas Are Key to Querying Documents

XML and JSON documents are self-describing. The schema is already defined by the elements and properties written within each document. Multi-model systems can identify this implicit schema when data is imported into the database and allow you to query it immediately.

Schemas create a consistent way to describe and query your data. As we noted earlier, in a relational database, the schema is defined in terms of tables, columns, and attributes; each table has one or more columns, and each column has one or more attributes. Every row in a relational database table has the same schema, period. Relational systems are schema-first, in that we have to define the schema before pouring in the data. This makes the schema static: changing its shape after data has been poured in is slow, painful, and costly.
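To make the schema-first constraint concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for any relational system. The table name articles and the column names (title, author, headline, writer) are hypothetical and chosen only for illustration.

```python
import sqlite3


def try_insert(conn, sql, params):
    """Run an INSERT and report whether the fixed schema accepted it."""
    try:
        conn.execute(sql, params)
        return "ok"
    except sqlite3.OperationalError as exc:
        return f"rejected: {exc}"


conn = sqlite3.connect(":memory:")

# Schema-first: the table's shape is declared before any data is loaded.
conn.execute("CREATE TABLE articles (title TEXT, author TEXT)")

# A row that matches the declared columns is accepted.
fits = try_insert(
    conn,
    "INSERT INTO articles (title, author) VALUES (?, ?)",
    ("Data Models", "A. Writer"),
)

# A record with a different shape is rejected outright.
misfit = try_insert(
    conn,
    "INSERT INTO articles (headline, writer) VALUES (?, ?)",
    ("Data Models", "A. Writer"),
)

# Accepting the new shape requires changing the schema first.
conn.execute("ALTER TABLE articles ADD COLUMN headline TEXT")
after_alter = try_insert(
    conn,
    "INSERT INTO articles (headline) VALUES (?)",
    ("Data Models",),
)

print(fits)
print(misfit)
print(after_alter)
```

In this toy case the schema change is a single statement; on a large production table with dependent queries and applications, the same kind of change is where the cost described above comes from.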

In a nonrelational database, the idea of a schema is more fluid. Some NoSQL databases have a concept of a schema that is similar to the relational picture, but most do not. Among those that do not are databases for which the schema is latent, that is, implicit in the structure of each semi-structured or unstructured entity. When working with data from existing sources, your data has already been modeled. Data usually comes to us with some shape to it. When we load XML or JSON documents into a multi-model database, the latent or implicit schema is already defined within the XML elements or JSON properties of the documents, as demonstrated in Figure 4-1.

Figure 4-1. Same document saved in two different formats
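As a rough illustration of what "latent schema" means here, the following sketch derives the set of element or property paths from one small document stored as JSON and as XML. The document content is a hypothetical stand-in for the one in the figure (only the heading and paragraph names come from the text), and the path extraction is a simplification of what a real multi-model database would do at load time.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the two versions of the same document.
JSON_DOC = (
    '{"article": {"heading": "Data Models", '
    '"paragraph": "Relational systems are schema-first."}}'
)
XML_DOC = (
    "<article><heading>Data Models</heading>"
    "<paragraph>Relational systems are schema-first.</paragraph></article>"
)


def json_paths(node, prefix=""):
    """Collect the property paths that make up a JSON document's latent schema."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}/{key}"
            paths.add(path)
            paths |= json_paths(value, path)
    elif isinstance(node, list):
        # Array items share the path of the property that holds them.
        for item in node:
            paths |= json_paths(item, prefix)
    return paths


def xml_paths(element, prefix=""):
    """Collect the element paths that make up an XML document's latent schema."""
    path = f"{prefix}/{element.tag}"
    paths = {path}
    for child in element:
        paths |= xml_paths(child, path)
    return paths


json_schema = json_paths(json.loads(JSON_DOC))
xml_schema = xml_paths(ET.fromstring(XML_DOC))

print(sorted(json_schema))
print(json_schema == xml_schema)
```

No schema was declared anywhere; both formats yield the same set of paths because the structure travels with the document.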

If the primary data entity is a document, frequently represented as an XML or JSON object, the latent schema implicit in one document might be different from that of another document in the same NoSQL database or database collection. That is to say, one document might have a title and author, whereas another has a headline and writer. Both documents can live in the same NoSQL database because the schema is included in the documents themselves. In other words, in document stores, the schema is carried within each document rather than defined externally as it would be in a relational database. The latent schema is the shape the data had when you loaded it. In NoSQL databases, the schema for any document could, and often does, change frequently.
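The title/author versus headline/writer case can be sketched with a plain Python list standing in for a database collection. The sample values, the extra tags property, and the helper names latent_schema and find_by_any are all hypothetical; a real document store would provide its own query mechanism.

```python
# A plain list stands in for a collection; no shared schema is declared.
collection = [
    {"title": "Data Models", "author": "A. Writer"},
    {"headline": "Documents and Text", "writer": "B. Author", "tags": ["xml", "json"]},
]


def latent_schema(doc):
    """Return the property names a single document carries with it."""
    return sorted(doc.keys())


def find_by_any(docs, names, value):
    """Match a value stored under any of several equivalent property names."""
    return [doc for doc in docs if any(doc.get(name) == value for name in names)]


schemas = [latent_schema(doc) for doc in collection]
matches = find_by_any(collection, ("title", "headline"), "Documents and Text")

print(schemas)
print(matches)
```

Each document is stored as it arrived, and a query can still span both shapes by naming the properties it treats as equivalent.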

A true multi-model database is schema-agnostic and schema-aware so that you are not forced to cast your queries in a predefined and unchanging schema.

This is well suited to managing and querying text, and to querying text together with structured data. Combined with a semantic model, it gives you a semantic view of varied databases and one place where you can look up every fact that you have. In that case, a multi-model database handles (at least) a document data model as well as semantic RDF data.

For the documents in Figure 4-1, a multi-model database will index the JSON properties and XML elements as well as their values. In this way, the system is schema-aware, in that element and property values can be used for structured queries such as this one:

Show me the documents where heading equals "Data Models"

We also can use them as full-text search queries such as the following:

Show me any documents with the word "relational" in them 
anywhere in the document.

Combining full-text search and structured query, we should be able to perform a search query of the following type:

Search for documents with the word "relational" anywhere in a 
paragraph element/property.
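The three query types above can be approximated with a toy in-memory index. This is only a sketch of the idea, not how any particular multi-model database builds its indexes; the document IDs, sample text, and function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical documents keyed by ID.
DOCS = {
    "doc1.json": {"heading": "Data Models", "paragraph": "Relational systems are schema-first."},
    "doc2.json": {"heading": "Relational Roots", "paragraph": "Documents are self-describing."},
    "doc3.json": {"heading": "Search", "paragraph": "Text and structure can be queried together."},
}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


value_index = defaultdict(set)   # (property, exact value) -> document IDs
word_index = defaultdict(set)    # word -> document IDs
scoped_index = defaultdict(set)  # (property, word) -> document IDs

# Index every property, its value, and the words inside that value at load time.
for doc_id, doc in DOCS.items():
    for prop, value in doc.items():
        value_index[(prop, value)].add(doc_id)
        for word in tokenize(value):
            word_index[word].add(doc_id)
            scoped_index[(prop, word)].add(doc_id)


def value_query(prop, value):
    """Structured query: documents where a property equals a value."""
    return sorted(value_index.get((prop, value), set()))


def word_query(word):
    """Full-text query: documents containing a word anywhere."""
    return sorted(word_index.get(word.lower(), set()))


def scoped_word_query(prop, word):
    """Combined query: documents containing a word inside one property."""
    return sorted(scoped_index.get((prop, word.lower()), set()))


print(value_query("heading", "Data Models"))
print(word_query("relational"))
print(scoped_word_query("paragraph", "relational"))
```

The combined query returns fewer documents than the full-text query because doc2.json mentions "relational" only in its heading, which is the distinction the third query type is meant to capture.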

A multi-model database management system must let you load data with multiple schemas. You shouldn’t need to be concerned about field lengths or worry about the cardinality of elements/properties. And, if something unexpected shows up tomorrow, you can still store it just fine.
