Chapter 1. Beyond Relational Databases

If at first the idea is not absurd, then there is no hope for it.

Albert Einstein

Welcome to Cassandra: The Definitive Guide. The aim of this book is to help developers and database administrators understand this important database technology. During the course of this book, we will explore how Cassandra compares to traditional relational database management systems, and help you put it to work in your own environment.

What’s Wrong with Relational Databases?

If I had asked people what they wanted, they would have said faster horses.

Henry Ford

We ask you to consider a certain model for data, invented by a small team at a company with thousands of employees. It was accessible over a TCP/IP interface and was available from a variety of languages, including Java and web services. This model was difficult at first for all but the most advanced computer scientists to understand, until broader adoption helped make the concepts clearer. Using the database built around this model required learning new terms and thinking about data storage in a different way. But as products sprang up around it, more businesses and government agencies put it to use, in no small part because it was fast—capable of processing thousands of operations a second. The revenue it generated was tremendous.

And then a new model came along.

The new model was threatening, chiefly for two reasons. First, the new model was very different from the old model, which it pointedly controverted. It is hard to understand something different and new, and the ensuing debates tend to entrench people further in their views—views that might have been largely inherited from the climate in which they learned their craft and the circumstances in which they work. Second, and perhaps more importantly, the new model was threatening because businesses had made considerable investments in the old model and were making lots of money with it. Changing course seemed ridiculous, even impossible.

Of course, we are talking about the Information Management System (IMS) hierarchical database, invented in 1966 at IBM.

IMS was built for use in the Saturn V moon rocket. Its architect was Vern Watts, who dedicated his career to it. Many of us are familiar with IBM’s wildly popular DB2 database, which gets its name as the successor to DB1—the product built around the hierarchical data model IMS. IMS was released in 1968, and subsequently enjoyed success in Customer Information Control System (CICS) and other applications. It is still used today.

But in the years following the invention of IMS, the new model, the disruptive model, the threatening model, was the relational database.

In his 1970 paper, “A Relational Model of Data for Large Shared Data Banks”, Dr. Edgar F. Codd advanced his theory of the relational model for data while working at IBM’s San Jose research laboratory. This paper became the foundational work for relational database management systems.

Codd’s work was antithetical to the hierarchical structure of IMS. Understanding and working with a relational database required learning new terms, including relations, tuples, and normal form, all of which must have sounded very strange indeed to users of IMS. It presented certain key advantages over its predecessor, such as the ability to express complex relationships between multiple entities, well beyond what could be represented by hierarchical databases.

While these ideas and their application have evolved over four decades, the relational database is still clearly one of the most successful software applications in history. It’s used in the form of Microsoft Access in sole proprietorships, and in giant multinational corporations with clusters of hundreds of finely tuned instances representing multiterabyte data warehouses. Relational databases store invoices, customer records, product catalogs, accounting ledgers, user authentication schemes—the very world, it might appear. There is no question that the relational database is a key facet of the modern technology and business landscape, and one that will be with us in its various forms for many years to come, as will IMS in its various forms. The relational model presented an alternative to IMS, and each has its uses.

So the short answer to the question, “What’s Wrong with Relational Databases?” is “Nothing.”

There is, however, a rather longer answer, which says that every once in a while an idea is born that ostensibly changes things, and engenders a revolution of sorts. And yet, in another way, such revolutions, viewed structurally, are simply history’s business as usual. IMS, RDBMS, NoSQL. The horse, the car, the plane. They each build on prior art, they each attempt to solve certain problems, and so they’re each good at certain things—and less good at others. They coexist, even now.

So let’s examine for a moment why you might consider an alternative to the relational database, just as Codd himself four decades ago looked at the Information Management System and thought that maybe it wasn’t the only legitimate way of organizing information and solving data problems, and that maybe, for certain problems, it might prove fruitful to consider an alternative.

You encounter scalability problems when your relational applications become successful and usage goes up. The need to gather related data from multiple tables via joins is inherent in any relatively normalized relational database of even modest size, and joins can be slow. The way that databases gain consistency is typically through the use of transactions, which require locking some portion of the database so it’s not available to other clients. This can become untenable under very heavy loads, as the locks mean that competing users start queuing up, waiting for their turn to read or write the data.

You typically address these problems in one or more of the following ways, sometimes in this order:

  • Throw hardware at the problem by adding more memory, adding faster processors, and upgrading disks. This is known as vertical scaling. This can relieve you for a time.

  • When the problems arise again, the answer appears to be similar: now that one box is maxed out, you add hardware in the form of additional boxes in a database cluster. Now you have the problem of data replication and consistency during regular usage and in failover scenarios. You didn’t have that problem before.

  • Now you need to update the configuration of the database management system. This might mean optimizing the channels the database uses to write to the underlying filesystem. You turn off logging or journaling, which frequently is not a desirable (or, depending on your situation, legal) option.

  • Having put what attention you could into the database system, you turn to your application. You try to improve your indexes. You optimize the queries. But presumably at this scale you weren’t wholly ignorant of index and query optimization, and already had them in pretty good shape. So this becomes a painful process of picking through the data access code to find any opportunities for fine-tuning. This might include reducing or reorganizing joins, throwing out resource-intensive features such as XML processing within a stored procedure, and so forth. Of course, presumably you were doing that XML processing for a reason, so if you have to do it somewhere, you move that problem to the application layer, hoping to solve it there and crossing your fingers that you don’t break something else in the meantime.

  • You employ a caching layer. For larger systems, this might include distributed caches such as Redis, memcached, Hazelcast, Aerospike, or Ehcache. Now you have a consistency problem between updates in the cache and updates in the database, which is exacerbated over a cluster.

  • You turn your attention to the database again and decide that, now that the application is built and you understand the primary query paths, you can duplicate some of the data to make it look more like the queries that access it. This process, called denormalization, is antithetical to the five normal forms that characterize the relational model, and violates Codd’s 12 Rules for relational data. You remind yourself that you live in this world, and not in some theoretical cloud, and then undertake to do what you must to make the application start responding at acceptable levels again, even if it’s no longer “pure.”

Codd’s 12 Rules

Codd provided a list of 12 rules (there are actually 13, numbered 0 to 12) formalizing his definition of the relational model as a response to the divergence of commercial databases from his original concepts. Codd introduced his rules in a pair of articles in Computerworld magazine in October 1985, and formalized them in his book The Relational Model for Database Management, Version 2, which is now out of print. Although Codd’s rules represent an ideal system that commercial databases have typically implemented only partially, they have continued to exert a key influence over relational data modeling to the present day.

This likely sounds familiar to you. At web scale, engineers may legitimately ponder whether this situation isn’t similar to Henry Ford’s assertion that at a certain point, it’s not simply a faster horse that you want. And they’ve done some impressive, interesting work.

We must therefore begin here in recognition that the relational model is simply a model. That is, it’s intended to be a useful way of looking at the world, applicable to certain problems. It does not purport to be exhaustive, closing the case on all other ways of representing data, never again to be examined, leaving no room for alternatives. If you take the long view of history, Dr. Codd’s model was a rather disruptive one in its time. It was new, with strange new vocabulary and terms such as tuples—familiar words used in a new and different manner. The relational model was held up to suspicion, and doubtless suffered its vehement detractors. It encountered opposition even in the form of Dr. Codd’s own employer, IBM, which had a very lucrative product set around IMS and didn’t need a young upstart cutting into its pie.

But the relational model now arguably enjoys the best seat in the house within the data world. SQL is widely supported and well understood. It is taught in introductory university courses. Cloud-based Platform-as-a-Service (PaaS) providers such as Amazon Web Services, Google Cloud Platform, Microsoft Azure, Alibaba, and Rackspace provide relational database access as a service, including automated monitoring and maintenance features. Often the database you end up using is dictated by architectural standards within your organization. Even absent such standards, it’s prudent to learn whatever your organization already has for a database platform. Your colleagues in development and infrastructure have considerable hard-won knowledge.

If by nothing more than osmosis (or inertia), you have learned over the years that a relational database is a one-size-fits-all solution.

So perhaps a better question is not, “What’s Wrong with Relational Databases?” but rather, “What problem do you have?”

That is, you want to ensure that your solution matches the problem that you have. There are certain problems that relational databases solve very well. But the explosion of the web, and in particular social networks, means a corresponding explosion in the sheer volume of data you must deal with. When Tim Berners-Lee first worked on the web in the early 1990s, it was for the purpose of exchanging scientific documents between PhDs at a physics laboratory. Now, of course, the web has become so ubiquitous that it’s used by everyone, from those same scientists to legions of five-year-olds exchanging emoji about kittens. That means in part that it must support enormous volumes of data; the fact that it does stands as a monument to the ingenious architecture of the web.

But as the traditional relational databases started to bend under the weight, it became clear that new solutions were needed.

A Quick Review of Relational Databases

Though you are likely familiar with them, let’s briefly turn our attention to some of the foundational concepts in relational databases. This will give us a basis on which to consider more recent advances in thought around the trade-offs inherent in distributed data systems, especially very large distributed data systems, such as those that are required at web scale.

There are many reasons that the relational database has become so overwhelmingly popular over the last four decades. An important one is the Structured Query Language (SQL), which is feature-rich and uses a simple, declarative syntax. SQL was first officially adopted as an American National Standards Institute (ANSI) standard in 1986; since that time, it’s gone through several revisions and has also been extended with vendor-proprietary syntax such as Microsoft’s T-SQL and Oracle’s PL/SQL to provide additional implementation-specific features.

SQL is powerful for a variety of reasons. It allows the user to represent complex relationships with the data, using statements that form the Data Manipulation Language (DML) to insert, select, update, delete, truncate, and merge data. You can perform a rich variety of operations using functions based on relational algebra to find a maximum or minimum value in a set, for example, or to filter and order results. SQL statements support grouping aggregate values and executing summary functions. SQL provides a means of directly creating, altering, and dropping schema structures at runtime using Data Definition Language (DDL). SQL also allows you to grant and revoke rights for users and groups of users using the same syntax.
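
To put a few of these capabilities in one place, here is a minimal sketch using JDBC. The connection URL, credentials, and the invoice table are invented for illustration; the DDL statement, the DML statements, and the grouping query are ordinary SQL.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SqlBasics {
  public static void main(String[] args) throws SQLException {
    // The JDBC URL and credentials are placeholders; substitute your own.
    try (Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/shop", "app_user", "secret");
         Statement stmt = conn.createStatement()) {

      // DDL: create a schema structure at runtime.
      stmt.execute("CREATE TABLE IF NOT EXISTS invoice ("
          + "id BIGINT PRIMARY KEY, customer_id BIGINT, total NUMERIC(10,2))");

      // DML: insert and update rows.
      stmt.executeUpdate(
          "INSERT INTO invoice (id, customer_id, total) VALUES (1, 42, 99.95)");
      stmt.executeUpdate("UPDATE invoice SET total = 89.95 WHERE id = 1");

      // Relational-algebra-style query: filter, group, and aggregate.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT customer_id, MAX(total) AS largest "
          + "FROM invoice GROUP BY customer_id ORDER BY largest DESC")) {
        while (rs.next()) {
          System.out.printf("customer %d, largest invoice %s%n",
              rs.getLong("customer_id"), rs.getBigDecimal("largest"));
        }
      }
    }
  }
}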

SQL is easy to use. The basic syntax can be learned quickly, and conceptually SQL and RDBMSs offer a low barrier to entry. Junior developers can become proficient readily, and as is often the case in an industry beset by rapid changes, tight deadlines, and exploding budgets, ease of use can be very important. And it’s not just the syntax that’s easy to use; there are many robust tools that include intuitive graphical interfaces for viewing and working with your database.

In part because it’s a standard, SQL allows you to easily integrate your RDBMS with a wide variety of systems. All you need is a driver for your application language, and you’re off to the races in a very portable way. If you decide to change your application implementation language (or your RDBMS vendor), you can often do that painlessly, assuming you haven’t backed yourself into a corner using lots of proprietary extensions.

Transactions, ACID-ity, and Two-Phase Commit

In addition to the features mentioned already, RDBMSs and SQL also support transactions. A key feature of transactions is that they execute virtually at first, allowing the programmer to undo (using rollback) any changes that may have gone awry during execution; if all has gone well, the transaction can be reliably committed. As Jim Gray puts it, a transaction is “a transformation of state” that has the ACID properties (see “The Transaction Concept: Virtues and Limitations”).

ACID is an acronym for Atomic, Consistent, Isolated, Durable, which are the gauges you can use to assess that a transaction has executed properly and that it was successful:

Atomic
Atomic means “all or nothing”; that is, when a statement is executed, every update within the transaction must succeed in order to be called successful. There is no partial failure where one update was successful and another related update failed. The common example here is with monetary transfers at an ATM: the transfer requires a debit from one account and a credit to another account. This operation cannot be subdivided; both updates must succeed (a minimal sketch of such a transfer appears after these definitions).
Consistent
Consistent means that data moves from one correct state to another correct state, with no possibility that readers could view different values that don’t make sense together. For example, if a transaction attempts to delete a customer and their order history, it cannot leave order rows that reference the deleted customer’s primary key; this is an inconsistent state that would cause errors if someone tried to read those order records.
Isolated
Isolated means that transactions executing concurrently will not become entangled with each other; they each execute in their own space. That is, if two different transactions attempt to modify the same data at the same time, then one of them will have to wait for the other to complete.
Durable
Once a transaction has succeeded, the changes will not be lost. This doesn’t imply another transaction won’t later modify the same data; it just means that writers can be confident that the changes are available for the next transaction to work with as necessary.
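
To make these properties concrete, here is a minimal sketch of the ATM transfer described under Atomic, using plain JDBC. The account table, connection details, and amounts are assumptions for the example; the setAutoCommit/commit/rollback calls are the standard JDBC transaction idiom.

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferExample {
  // Moves 'amount' between two accounts as a single atomic transaction.
  static void transfer(long fromId, long toId, BigDecimal amount)
      throws SQLException {
    try (Connection conn = DriverManager.getConnection(
        "jdbc:postgresql://localhost:5432/bank", "app_user", "secret")) {
      conn.setAutoCommit(false);  // begin an explicit transaction
      try (PreparedStatement debit = conn.prepareStatement(
              "UPDATE account SET balance = balance - ? WHERE id = ?");
           PreparedStatement credit = conn.prepareStatement(
              "UPDATE account SET balance = balance + ? WHERE id = ?")) {
        debit.setBigDecimal(1, amount);
        debit.setLong(2, fromId);
        debit.executeUpdate();

        credit.setBigDecimal(1, amount);
        credit.setLong(2, toId);
        credit.executeUpdate();

        conn.commit();            // both updates become durable together
      } catch (SQLException e) {
        conn.rollback();          // undo any partial work; nothing is applied
        throw e;
      }
    }
  }
}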

The debate about support for transactions comes up very quickly as a sore spot in conversations around nonrelational data stores, so let’s take a moment to revisit what this really means. On the surface, ACID properties seem so obviously desirable as to not even merit conversation. Presumably no one who runs a database would suggest that data updates don’t have to endure for some length of time; that’s the very point of making updates—that they’re there for others to read. However, a more subtle examination might lead you to want to find a way to tune these properties a bit and control them slightly. There is, as they say, no free lunch on the internet, and once you see how you’re paying for transactions, you may start to wonder whether there’s an alternative.

Transactions become difficult under heavy load. When you first attempt to horizontally scale a relational database, making it distributed, you must now account for distributed transactions, where the transaction isn’t simply operating inside a single table or a single database, but is spread across multiple systems. In order to continue to honor the ACID properties of transactions, you now need a transaction manager to orchestrate across the multiple nodes.

In order to account for successful completion across multiple hosts, the idea of a two-phase commit (sometimes referred to as “2PC”) is introduced. The two-phase commit is a commonly used algorithm for achieving consensus in distributed systems, involving two sets of interactions between hosts known as the prepare phase and commit phase. Because the two-phase commit locks all associated resources, it is useful only for operations that can complete very quickly. Although it may often be the case that your distributed operations can complete in subsecond time, it is certainly not always the case. Some use cases require coordination between multiple hosts that you may not control yourself. Operations coordinating several different but related activities can take hours to complete.

Two-phase commit blocks; that is, clients (“competing consumers”) must wait for a prior transaction to finish before they can access the blocked resource. The protocol will wait for a node to respond, even if it has died. It’s possible to avoid waiting forever in this event, because a timeout can be set that allows the transaction coordinator node to decide that the node isn’t going to respond and that it should abort the transaction. However, an infinite loop is still possible with 2PC; that’s because a node can send a message to the transaction coordinator node agreeing that it’s OK for the coordinator to commit the entire transaction. The node will then wait for the coordinator to send a commit response (or a rollback response if, say, a different node can’t commit); if the coordinator is down in this scenario, that node conceivably will wait forever.
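
The following is a deliberately simplified sketch of a two-phase commit coordinator, intended only to show the prepare and commit phases and where the blocking happens. The Participant interface, the timeout, and the method names are all invented for illustration; real transaction managers (XA implementations, for example) are considerably more involved.

import java.time.Duration;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// A deliberately simplified participant contract; real transaction managers
// expose a richer, vendor-specific API.
interface Participant {
  boolean prepare(String txId) throws Exception;  // phase 1: vote yes or no
  void commit(String txId);                       // phase 2: make changes permanent
  void rollback(String txId);                     // phase 2: undo prepared work
}

class TwoPhaseCommitCoordinator {
  private final Duration voteTimeout = Duration.ofSeconds(5);

  boolean execute(String txId, List<Participant> participants) {
    // Phase 1 (prepare): every participant must vote yes within the timeout.
    for (Participant p : participants) {
      try {
        if (!callWithTimeout(() -> p.prepare(txId), voteTimeout)) {
          abort(txId, participants);
          return false;
        }
      } catch (Exception deadOrSilentParticipant) {
        // A participant that never answers forces an abort; this timeout is
        // what keeps the coordinator from waiting forever on a dead node.
        abort(txId, participants);
        return false;
      }
    }
    // Phase 2 (commit): each participant's resources stay locked until every
    // commit message has been delivered.
    for (Participant p : participants) {
      p.commit(txId);
    }
    return true;
  }

  private void abort(String txId, List<Participant> participants) {
    for (Participant p : participants) {
      p.rollback(txId);
    }
  }

  private boolean callWithTimeout(Callable<Boolean> call, Duration timeout)
      throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      return executor.submit(call).get(timeout.toMillis(), TimeUnit.MILLISECONDS);
    } finally {
      executor.shutdownNow();
    }
  }
}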

So in order to account for these shortcomings in two-phase commit of distributed transactions, the database world turned to the idea of compensation. Compensation, often used in web services, means in simple terms that the operation is immediately committed, and then in the event that some error is reported, a new operation is invoked to restore proper state.

There are a few basic, well-known patterns for compensatory action that architects frequently have to consider as an alternative to two-phase commit. These include writing off the transaction if it fails—that is, discarding erroneous transactions and reconciling them later—and retrying failed operations later upon notification. In a reservation system or a stock sales ticker, these approaches are not likely to meet your requirements, but for other kinds of applications, such as billing or ticketing applications, they can be acceptable.
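
Here is a minimal sketch of the compensation idea, with every service and method name invented for illustration: the first operation commits immediately, and if a subsequent step fails, a compensating operation is invoked to restore a sensible state.

// Hypothetical service interfaces; all names are invented for illustration.
interface BillingService {
  String charge(String customerId, long cents);  // commits immediately, returns a charge id
  void refund(String chargeId);                  // compensating action
}

interface ShippingService {
  void scheduleShipment(String orderId) throws Exception;
}

class OrderProcessor {
  private final BillingService billing;
  private final ShippingService shipping;

  OrderProcessor(BillingService billing, ShippingService shipping) {
    this.billing = billing;
    this.shipping = shipping;
  }

  void placeOrder(String customerId, String orderId, long cents) {
    // Step 1 is committed right away; no distributed lock is held.
    String chargeId = billing.charge(customerId, cents);
    try {
      shipping.scheduleShipment(orderId);
    } catch (Exception e) {
      // Step 2 failed, so invoke the compensating operation for step 1.
      billing.refund(chargeId);
      // Alternatively: record the failure, then retry or reconcile later.
    }
  }
}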

The Problem with Two-Phase Commit

Gregor Hohpe, a Google architect, wrote a wonderful and often-cited blog entry titled “Starbucks Does Not Use Two-Phase Commit”. It shows in real-world terms how difficult it is to scale two-phase commit and highlights some of the alternatives that are mentioned here. It’s an easy, fun, and enlightening read. If you’re interested in digging deeper, Martin Kleppmann’s comprehensive book Designing Data-Intensive Applications (O’Reilly) contains an excellent in-depth discussion of two-phase commit and other consensus algorithms.

The problems that 2PC introduces for application developers include loss of availability and higher latency during partial failures. Neither of these is desirable. So once you’ve had the good fortune of being successful enough to necessitate scaling your database past a single machine, you now have to figure out how to handle transactions across multiple machines and still make the ACID properties apply. Whether you have 10 or 100 or 1,000 database machines, atomicity is still required in transactions as if you were working on a single node. But it’s now a much, much bigger pill to swallow.

Schema

One often-lauded feature of relational database systems is the rich schemas they afford. You can represent your domain objects in a relational model. A whole industry has sprung up around (expensive) tools such as the CA ERwin Data Modeler to support this effort. In order to create a properly normalized schema, however, you are forced to create tables that don’t exist as business objects in your domain. For example, a schema for a university database might require a “student” table and a “course” table. But because of the “many-to-many” relationship here (one student can take many courses at the same time, and one course has many students at the same time), you have to create a join table. This pollutes a pristine data model, where you’d prefer to just have students and courses. It also forces you to create more complex SQL statements to join these tables together. The join statements, in turn, can be slow.
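
To see what the join table costs in practice, here is a minimal sketch of the student/course schema and the query it forces; the table names, column names, and connection details are invented for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EnrollmentSchema {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/university", "app_user", "secret");
         Statement stmt = conn.createStatement()) {
      stmt.execute("CREATE TABLE student (id BIGINT PRIMARY KEY, name TEXT)");
      stmt.execute("CREATE TABLE course  (id BIGINT PRIMARY KEY, title TEXT)");
      // The join table is not a business object; it exists only to model
      // the many-to-many relationship between students and courses.
      stmt.execute("CREATE TABLE enrollment ("
          + "student_id BIGINT REFERENCES student(id), "
          + "course_id  BIGINT REFERENCES course(id), "
          + "PRIMARY KEY (student_id, course_id))");

      // Answering "which courses is this student taking?" now requires
      // two joins through the relationship table.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT c.title FROM course c "
          + "JOIN enrollment e ON e.course_id = c.id "
          + "JOIN student s    ON s.id = e.student_id "
          + "WHERE s.name = 'Ada'")) {
        while (rs.next()) {
          System.out.println(rs.getString("title"));
        }
      }
    }
  }
}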

Again, in a system of modest size, this isn’t much of a problem. But complex queries and multiple joins can become burdensomely slow once you have a large number of rows in many tables to handle.

Finally, not all schemas map well to the relational model. One type of system that has risen in popularity in the last decade is the complex event processing system or stream processing system, which represents state changes in a very fast stream. It’s often useful to contextualize events at runtime against other events that might be related in order to infer some conclusion to support business decision-making. Although event streams can be represented in terms of a relational database, as with KSQL, the streaming SQL engine for Apache Kafka, it is often an uncomfortable stretch.

If you’re an application developer, you’ll no doubt be familiar with the many object-relational mapping (ORM) frameworks that have sprung up in recent years to help ease the difficulty in mapping application objects to a relational model. Again, for small systems, ORM can be a relief. But it also introduces new problems of its own, such as extended memory requirements, and it often pollutes the application code with increasingly unwieldy mapping code. Here’s an example of a Java method using Hibernate to “ease the burden” of having to write the SQL code:

// Legacy Hibernate mapping annotations: the Map's entries live in the
// store_description join table, keyed by the three-character for_store
// column, with the values held in the description column.
@CollectionOfElements
@JoinTable(name="store_description",
  joinColumns = @JoinColumn(name="store_code"))
@MapKey(columns={@Column(name="for_store",length=3)})
@Column(name="description")
private Map<String, String> getMap() {
  return this.map;
}
//... etc.

Is it certain that we’ve done anything but move the problem here? Of course, with some systems, such as those that make extensive use of document exchange, as with services or XML-based applications, there are not always clear mappings to a relational database. This exacerbates the problem.

Sharding and Shared-Nothing Architecture

If you can’t split it, you can’t scale it.

Randy Shoup, Distinguished Architect, eBay

Another way to attempt to scale a relational database is to introduce sharding to your architecture. This has been used to good effect at large websites such as eBay, which supports billions of SQL queries a day, and in other modern web applications. The idea here is that you split the data so that instead of hosting all of it on a single server or replicating all of the data on all of the servers in a cluster, you divide up portions of the data horizontally and host them each separately.

For example, consider a large customer table in a relational database. The least disruptive thing (for the programming staff, anyway) is to vertically scale by adding CPU, adding memory, and getting faster hard drives, but if you continue to be successful and add more customers, at some point (perhaps into the tens of millions of rows), you’ll likely have to start thinking about how you can add more machines. When you do so, do you just copy the data so that all of the machines have it? Or do you instead divide up that single customer table so that each database has only some of the records, with their order preserved? Then, when clients execute queries, they put load only on the machine that has the record they’re looking for, with no load on the other machines.

It seems clear that in order to shard, you need to find a good key by which to order your records. For example, you could divide your customer records across 26 machines, one for each letter of the alphabet, with each hosting only the records for customers whose last names start with that particular letter. It’s likely this is not a good strategy, however—there probably aren’t many last names that begin with “Q” or “Z,” so those machines will sit idle while the “J,” “M,” and “S” machines spike. You could instead shard according to something numeric, like phone number or “member since” date, or by another attribute such as the customer’s state. It all depends on how your specific data is likely to be distributed.

There are three basic strategies for determining shard structure:

Feature-based shard or functional segmentation

This is the approach taken by Randy Shoup, Distinguished Architect at eBay, who in 2006 helped bring the site’s architecture into maturity to support many billions of queries per day. Using this strategy, the data is split not by dividing records in a single table (as in the customer example discussed earlier), but rather by splitting into separate databases the features that don’t overlap with each other very much. For example, at eBay, the users are in one shard, and the items for sale are in another. This approach depends on understanding your domain so that you can segment data cleanly.

Key-based sharding

In this approach, you find a key in your data that will evenly distribute it across shards. So instead of simply storing one letter of the alphabet for each server as in the (naive and improper) earlier example, you use a one-way hash on a key data element and distribute data across machines according to the hash. It is common in this strategy to find time-based or numeric keys to hash on; a minimal sketch of this approach appears after this list.

Lookup table

In this approach, also known as directory-based sharding, one of the nodes in the cluster acts as a “Yellow Pages” directory and looks up which node has the data you’re trying to access. This has two obvious disadvantages. The first is that you’ll take a performance hit every time you have to go through the lookup table as an additional hop. The second is that the lookup table becomes not only a bottleneck, but also a single point of failure.
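
Here is a minimal sketch of key-based shard selection, with the shard names and hashing choices invented for illustration. A production system would more likely use consistent hashing (as Cassandra does) so that adding or removing a node does not reshuffle most keys.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class ShardRouter {
  private final List<String> shards;  // e.g., JDBC URLs or host names

  public ShardRouter(List<String> shards) {
    this.shards = shards;
  }

  // Applies a one-way hash to the key and maps it onto one of N shards.
  public String shardFor(String key) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5")
          .digest(key.getBytes(StandardCharsets.UTF_8));
      // Fold the first four bytes of the digest into an int, then map it
      // onto a non-negative bucket index.
      int hash = (digest[0] & 0xFF) << 24 | (digest[1] & 0xFF) << 16
          | (digest[2] & 0xFF) << 8 | (digest[3] & 0xFF);
      return shards.get(Math.floorMod(hash, shards.size()));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    ShardRouter router = new ShardRouter(
        List.of("db-shard-1", "db-shard-2", "db-shard-3"));
    System.out.println(router.shardFor("customer:314159"));
  }
}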

Sharding can minimize contention depending on your strategy and allows you not just to scale horizontally, but then to scale more precisely, as you can add power to the particular shards that need it.

Sharding could be termed a kind of shared-nothing architecture that’s specific to databases. A shared-nothing architecture is one in which there is no centralized (shared) state, but each node in a distributed system is independent, so there is no client contention for shared resources.

Shared-nothing architecture was more recently popularized by Google, which has written systems such as its Bigtable database and its MapReduce implementation that do not share state, and are therefore capable of near-infinite scaling. The Cassandra database is a shared-nothing architecture, as it has no central controller and no notion of primary/secondary replicas; all of its nodes are the same.

More on Shared-Nothing Architecture

The term was first coined by Michael Stonebraker at the University of California at Berkeley in his 1986 paper “The Case for Shared Nothing”. It’s only a few pages. If you take a look, you’ll see that many of the features of shared-nothing distributed data architecture, such as ease of high availability and the ability to scale to a very large number of machines, are the very things that Cassandra excels at.

Many nonrelational databases offer sharding automatically, and having it out of the box is very handy; creating and maintaining custom data shards by hand is a wicked proposition. For example, MongoDB, which we’ll discuss later, provides auto-sharding capabilities to manage failover and node balancing. It’s good to understand sharding in terms of data architecture in general, but especially in terms of Cassandra more specifically, as Cassandra uses an approach similar to key-based sharding to distribute data across nodes, but does so automatically.

Web Scale

In summary, relational databases are very good at solving certain data storage problems, but because of their focus, they also can create problems of their own when it’s time to scale. Then, you often need to find a way to get rid of your joins, which means denormalizing the data, which means maintaining multiple copies of data and seriously disrupting your design, both in the database and in your application. Further, you almost certainly need to find a way around distributed transactions, which will quickly become a bottleneck. These compensatory actions are not directly supported in any but the most expensive RDBMSs. And even if you can write such a huge check, you still need to carefully choose partitioning keys to the point where you can never entirely ignore the limitation.

Perhaps more importantly, as you see some of the limitations of RDBMSs and consequently some of the strategies that architects have used to mitigate their scaling issues, a picture slowly starts to emerge. It’s a picture that makes some NoSQL solutions seem perhaps less radical and less scary than you may have thought at first, and more like a natural expression and encapsulation of some of the work that was already being done to manage very large databases.

Because of some of the inherent design decisions in RDBMSs, it is not always as easy to scale as some other, more recent possibilities that take the structure of the web into consideration. However, it’s not only the structure of the web you need to consider, but also its phenomenal growth, because as more and more data becomes available, you need architectures that allow your organization to take advantage of this data in near real time to support decision making, and to offer new and more powerful features and capabilities to your customers.

Data Scale, Then and Now

It has been said, though it is hard to verify, that the 17th-century English poet John Milton had actually read every published book on the face of the earth. Milton knew many languages (he was even learning Navajo at the time of his death), and given that the total number of published books at that time was in the thousands, this would have been possible. The size of the world’s data stores has grown somewhat since then.

With the rapid growth in the web, there is great variety to the kinds of data that need to be stored, processed, and queried, and some variety to the businesses that use such data. Consider not only customer data at familiar retailers or suppliers, and not only digital video content, but also the required move to digital television and the explosive growth of email, messaging, mobile phones, RFID, Voice Over IP (VoIP) usage, and the Internet of Things (IoT). Companies that provide content—and the third-party value-add businesses built around them—require very scalable data solutions. Consider too that a typical business application developer or database administrator may be used to thinking of relational databases as the center of the universe. You might then be surprised to learn that within corporations, around 80% of data is unstructured.

The Rise of NoSQL

The recent interest in nonrelational databases reflects the growing sense of need in the software development community for web scale data solutions. The term “NoSQL” began gaining popularity around 2009 as a shorthand way of describing these databases. The term has historically been the subject of much debate, but a consensus has emerged that the term refers to nonrelational databases that support “not only SQL” semantics.

Various experts have attempted to organize these databases in a few broad categories; let’s examine a few of the most common:

Key-value stores

In a key-value store, the data items are keys that have a set of attributes. All data relevant to a key is stored with the key; data is frequently duplicated. Popular key-value stores include Amazon’s DynamoDB, Riak, and Voldemort. Additionally, many popular caching technologies act as key-value stores, including Oracle Coherence, Redis, and Memcached.

Column stores

In a column store, also known as a wide-column store or column-oriented store, data is stored by column rather than by row. For example, in a column store, all customer addresses might be stored together, allowing them to be retrieved in a single query. Popular column stores include Apache Hadoop’s HBase, Apache Kudu, and Apache Druid.

Document stores

The basic unit of storage in a document database is the complete document, often stored in a format such as JSON, XML, or YAML. Popular document stores include MongoDB, CouchDB, and several public cloud offerings.

Graph databases

Graph databases represent data as a graph—a network of nodes and edges that connect the nodes. Both nodes and edges can have properties. Because they give heightened importance to relationships, graph databases such as Neo4j, JanusGraph, and DataStax Graph have proven popular for building social networking and semantic web applications.

Object databases

Object databases store data not in terms of relations and columns and rows, but in terms of objects as understood from the discipline of object-oriented programming. This makes it straightforward to use these databases from object-oriented applications. Object databases such as db4o and InterSystems Caché allow you to avoid techniques like stored procedures and object-relational mapping (ORM) tools. The most widely used object database is Amazon Web Services’ Simple Storage Service (S3).

XML databases

XML databases are a special form of document databases, optimized specifically for working with data described in the eXtensible Markup Language (XML). So-called “XML native” databases include BaseX and eXist.

Multimodel databases

Databases that support more than one of these styles have been growing in popularity. These “multimodel” databases are based on a primary underlying database (most often a relational, key-value, or column store) and expose additional models as APIs on top of that underlying database. Examples of these include Microsoft Azure Cosmos DB, which exposes document, wide column, and graph APIs on top of a key-value store, and DataStax Enterprise, which offers a graph API on top of Cassandra’s wide column model. Multimodel databases are often touted for their ability to support an approach known as polyglot persistence, in which different microservices or components of an application can interact with data using more than one of the models we’ve described here. We’ll discuss an example of polyglot persistence in Chapter 7.

Learning More About NoSQL Databases

For a comprehensive list of NoSQL databases, see the NoSQL site. The DB-Engines site also provides popularity rankings of popular databases by type, updated monthly.

There is wide variety in the goals and features of these databases, but they tend to share a set of common characteristics. The most obvious of these is implied by the name NoSQL—these databases support data models, Data Definition Languages (DDLs), and interfaces beyond the standard SQL available in popular relational databases. In addition, these databases are typically distributed systems without centralized control. They emphasize horizontal scalability and high availability, in some cases at the cost of strong consistency and ACID semantics. They tend to support rapid development and deployment. They take flexible approaches to schema definition, in some cases not requiring any schema to be defined up front. They provide support for Big Data and analytics use cases.

Over the past decade, there have been a large number of open source and commercial offerings in the NoSQL space. The adoption and quality of these have varied widely, but leaders have emerged in the categories just discussed, and many have become mature technologies with large installation bases and commercial support. We’re happy to report that Cassandra is one of those technologies, as we’ll dig into more in the next chapter.

Summary

The relational model has served the software industry well over the past four decades, but the level of availability and scalability required for modern applications has stretched traditional relational database technology to the breaking point.

The intention of this book is not to convince you by clever argument to adopt a non-relational database such as Apache Cassandra. It is only our intention to present what Cassandra can do and how it does it so that you can make an informed decision and get started working with it in practical ways if you find it applies.

Perhaps the ultimate question, then, is not “What’s Wrong with Relational Databases?” but rather, “What kinds of things would I do with data if it wasn’t a problem?” In a world now working at web scale and looking to the future, Apache Cassandra might be one part of the answer.
