Chapter 1. Why CouchDB?
Traditional database systems have existed for many years, and they have a familiar structure and expected methods of communicating, inserting, and extracting information. Although complex to condense into a simple statement, most database systems rely on the creation of a specific structure (based on specific fields of information), organized collectively into a record of data. To get information in, you add a record of data, and to get the information out, you query the records by looking for values or ranges within those specific fields.
Apache CouchDB is different. It is one of a new breed of databases that takes a different approach to the database structure, to storing information, and to retrieving it. There are many reasons why this new breed of database systems is needed, and those reasons account for much of the motivation behind the development of CouchDB.
In this chapter, we’re going to look at the basics of CouchDB, why it is different, and why the new approach has everybody excited about using CouchDB. CouchDB was produced out of the needs and necessities of the environment. Developers are becoming more savvy every year, with better environments, better tools, and simpler and more straightforward methods for achieving a range of goals.
You only have to look at the Web and the different tools and environments available. It is easy to take the effects of modern websites for granted: pop-up lists during searches, customization, and the in-page experience (traditionally referred to as AJAX) of a dynamic website. Five years ago, this functionality was rare. Today, toolkits like jQuery or Dojo make this process easier. Outside of the Web, environments like Apple’s Xcode and Microsoft’s .NET provide toolkits that simplify the development and functionality of your applications.
So how does CouchDB make these processes easier? Here are the highlights, some of which we will expand on in this and later chapters:
An HTTP-based REST API makes communicating with the database easier, because so many modern environments are capable of talking HTTP. The simple structure of HTTP resources and methods (GET, PUT, DELETE) is easy to understand and develop with.
The flexible document-based structure eliminates the need to worry about the structure of your data, either before or during your application development.
Powerful mapping of your data allows you to query, combine, and filter the information.
Easy-to-use replication lets you copy, share, and synchronize your data between databases, machines, and even continents.
Let’s look at these features in more detail.
Learning to Relax
Perhaps most importantly, we will look at why the mantra when using CouchDB is relax, and why this message is printed out when you start the CouchDB database. CouchDB was built with web developers in mind, and anybody who has worked on the Web should be familiar with how it works. But CouchDB is also easy to understand even for non-web developers.
Relaxing with CouchDB falls into three main areas:
A key design goal for CouchDB was to let database developers build their solutions without complex processes and interfaces getting in the way. Requiring drivers, interfaces, and complex protocols is counter to that goal. CouchDB is therefore accessible through a simple HTTP-based REST API, which makes it easy to use. We’ll look at the basic mechanics of this interface later in this book.
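To make the idea of an HTTP-based REST API concrete, here is a minimal Python sketch that builds, but does not send, the three kinds of request mentioned earlier (GET, PUT, DELETE) for a single document. The server address (CouchDB's default local port), the database name "contacts", the document ID, and the revision string are all assumptions chosen for illustration.

```python
import json
import urllib.request

# Assumptions: a CouchDB server on the default local port and a
# hypothetical database named "contacts"; neither comes from the text.
BASE_URL = "http://localhost:5984"
DATABASE = "contacts"


def build_request(method, doc_id, body=None, rev=None):
    """Build (but do not send) an HTTP request for one document."""
    url = f"{BASE_URL}/{DATABASE}/{doc_id}"
    if rev is not None:
        # Deleting an existing document names the revision being removed.
        url += f"?rev={rev}"
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


create = build_request("PUT", "an-other", {"name": "AN Other", "phone": "01234 567890"})
fetch = build_request("GET", "an-other")
delete = build_request("DELETE", "an-other", rev="1-abc")  # "1-abc" is a placeholder revision

for request in (create, fetch, delete):
    print(request.get_method(), request.full_url)
```

With a server running, each request could be sent with urllib.request.urlopen(request). No database driver is involved, only the HTTP support that ships with the language.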
Lots of databases work well during development, but the experience is not always shared during deployment. CouchDB tries to address some of the pain by allowing the deployment of a database or application to be simple and straightforward. CouchDB is fault-tolerant and generally self-sufficient. If something goes wrong, the problems are dealt with simply and gracefully; you can always obtain more detailed information if you need it. In general, if something goes wrong, it should be simple to find out what happened, but such issues are rare.
Scaling your database is another important element of the deployment process. Dealing with a range of different loads on the database can be difficult to handle. CouchDB will handle a temporary increase in concurrent requests without complaining. Each request may take longer, but it will still be handled.
Furthermore, the issue of extending or expanding your deployed environment to support more requests is made easier through the simple structure of CouchDB. Instead of enforcing the way you scale, CouchDB can easily be integrated with a variety of other solutions giving you the flexibility to use whichever system suits your needs best.
As a rule, the simplicity of CouchDB enables you to develop and deploy an application in a way that is both flexible and efficient. It is unlikely that CouchDB will let you get yourself into any difficulty without giving you some indication of where the problem lies.
A Different Data Model
I’ve touched on this already, but one of the key differences between CouchDB and other database solutions is the flexible nature of the format for storing information. Probably the best way to think about this is to look at an example.
If you look at a typical contact entry, it might look something like this:
Name: AN Other
Phone: 01234 567890
Email: email@example.com
When modeling this in a typical database you might create a field for the name, phone, and email. But problems can occur when you get another record that is outside of your structure:
Name: MC Brown
Phone: 01234 567890
Mobile: +44 1234 098765
Email: firstname.lastname@example.org
Email: email@example.com
Here I have two email addresses and both phone and mobile numbers. Companies can introduce similar issues:
Name: Example
Phone: 01234 567890
Fax: 01234 098765
Email: firstname.lastname@example.org
Website: example.org
These are all fairly simple records for contacts. We haven’t even considered complexities like postal addresses (or the fact that there might be more than one), and my contact record doesn’t include additional details like my Skype IDs, instant messaging accounts, or that I have a few more addresses than those listed above.
If you think about how the contact information is used, for example on a business card, you can see that the data itself is important, even though the structure and method for storing information may not be. This is an example of where the semantics of the data (i.e., the type of information that is stored) is similar but the syntax and structure of the information varies significantly.
In a traditional database, there are many different ways of modeling this information, but a common one is to use relations to model the information. There is a core contact table, another table for each phone number, another for emails and IM, etc., and all this is then linked together using a unique ID so that you can obtain all the information you need.
There is nothing inherently wrong with this approach. In fact, in many cases it offers significant advantages when working with some types of data. However, the point here is that your data may not fit an arbitrary (and fixed) data model such as the one described here. As the data matures and expands, it can even be difficult to know where a piece of information belongs. Twenty years ago, requirements like email, website, or Skype addresses would not have occurred to most designers.
Within CouchDB, the opposite approach is used. Rather than trying to create a structure into which all the information that you want to store can be shoehorned, CouchDB stores the data as documents, and worries about how to report and aggregate the information that is stored. Using our contact example, the information could be recorded in the database exactly as written above, with each person’s contact details stored as a CouchDB document. We can make the decision during the reporting phase on how to output information, what information to output, and indeed whether to output that data at all.
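As a rough sketch of what this looks like, the three contact entries above can be written as free-form JSON documents. The field names used here (name, phone, mobile, fax, emails, website) are illustrative choices, not names required by CouchDB, and the small with_field helper is a hypothetical stand-in for deciding at reporting time what to output.

```python
import json

# The three contact entries from the text, each with its own set of fields.
contacts = [
    {"name": "AN Other", "phone": "01234 567890", "emails": ["email@example.com"]},
    {
        "name": "MC Brown",
        "phone": "01234 567890",
        "mobile": "+44 1234 098765",
        "emails": ["firstname.lastname@example.org", "email@example.com"],
    },
    {
        "name": "Example",
        "phone": "01234 567890",
        "fax": "01234 098765",
        "emails": ["firstname.lastname@example.org"],
        "website": "example.org",
    },
]

# Each document serializes to JSON on its own; no shared schema is declared.
for contact in contacts:
    print(json.dumps(contact, sort_keys=True))


def with_field(documents, field):
    """Reporting-time choice: names of the documents that carry a given field."""
    return [doc["name"] for doc in documents if field in doc]


print(with_field(contacts, "mobile"))
print(with_field(contacts, "fax"))
```

Nothing had to be decided up front about how many email addresses a contact may have or whether a fax number exists; those questions are answered only when the data is read.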
Databases are no longer isolated, single systems. Whether you want a database that can be shared among multiple devices (your desktop, laptop, and mobile phone), between multiple offices, or to be used as part of your database scaling operations, copying and sharing database information has become required functionality.
Different databases have traditionally approached this in a variety of different ways, including binary logs, data streams, row-based logging, and more complex hashing techniques. Within CouchDB, a simple but very effective method has been developed that uses the individual documents as the key to the method of sharing and distributing the document information between databases.
Note that the distinction is that replication occurs between databases, not necessarily instances of CouchDB. You can use replication to copy documents between databases on the same machine, the same database on different machines, and different databases across multiple machines and devices.
The simplicity and ease with which you can share and exchange information in this way is a key feature of CouchDB. The replication system uses the same REST API as the client interface to the database, and it supports the ability to filter and select records during the replication process.
Another useful aspect of CouchDB replication is that it operates one way. That is, if you have a desktop machine and a laptop and you want to replicate your data so that you can take it with you, you can perform a specific desktop-to-laptop replication. If you make changes to the database while you are away, you can replicate the changes back from the laptop to the desktop. Better still, you can replicate both ways and keep the two databases in sync. This approach simplifies the entire replication process and ensures that you can always replicate the data where you need it.
CouchDB allows you both to run a one-shot replication and to configure a continuous replication that keeps copying changes to your configured database. In CouchDB 1.1 and later, the replication configuration is retained when restarting CouchDB.
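As a hedged illustration, a replication can be requested over the same REST API by posting a small JSON description to the server's _replicate endpoint. The sketch below only builds the requests; the server address, the database names, and the laptop host name are assumptions made up for the example.

```python
import json
import urllib.request

# Assumptions: a local CouchDB on the default port, hypothetical database
# names, and a hypothetical laptop host name.
SERVER = "http://localhost:5984"


def replication_request(source, target, continuous=False):
    """Build (but do not send) a POST to the _replicate endpoint."""
    body = {"source": source, "target": target}
    if continuous:
        # Ask the server to keep copying changes as they happen.
        body["continuous"] = True
    return urllib.request.Request(
        f"{SERVER}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One-shot: copy the desktop database to the laptop once.
one_shot = replication_request("contacts", "http://laptop.example.org:5984/contacts")

# Continuous: keep a second local database up to date.
ongoing = replication_request("contacts", "contacts_backup", continuous=True)

for request in (one_shot, ongoing):
    print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

Swapping the source and target arguments describes the reverse direction, which is all that two-way synchronization amounts to: two one-way replications.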
The one-way nature of replication also means that you can replicate documents from multiple databases into a single database. For example, data collection or logging systems can use multiple CouchDB instances to collect information, then replicate the data all to one machine for processing and statistics.
CouchDB also handles problems with replication with ease. The Fallacies of Distributed Computing list assumptions that would all hold true only in a perfect system:
The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn’t change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.
The reality, of course, is quite different. Rather than expecting everything to work fine, CouchDB expects there to be a problem and tries to cope with it. Rather than treating a fault with replication as a serious problem, CouchDB instead tries to recover gracefully from the problem and only tells the user when there is a problem that requires user interaction. The replication process is also incremental, so that if anything goes wrong, such as a network outage, replication will pick right back up where it stopped.
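The idea of incremental replication that resumes after a failure can be shown with a toy model. This is not CouchDB's implementation: the Source and Replicator classes, the checkpoint counter, and the simulated outage are all hypothetical, and serve only to show why remembering how far the copy got makes recovery cheap.

```python
class Source:
    """A database modeled as an ordered list of (sequence, doc_id) changes."""

    def __init__(self):
        self.changes = []

    def write(self, doc_id):
        self.changes.append((len(self.changes) + 1, doc_id))

    def changes_since(self, sequence):
        return [change for change in self.changes if change[0] > sequence]


class Replicator:
    """Copies changes to a target and remembers how far it got."""

    def __init__(self, source):
        self.source = source
        self.checkpoint = 0   # last sequence known to have been copied
        self.target = []      # doc ids that reached the target

    def run(self, fail_after=None):
        """Copy pending changes; optionally simulate an outage after N documents."""
        pending = self.source.changes_since(self.checkpoint)
        for copied, (sequence, doc_id) in enumerate(pending):
            if fail_after is not None and copied == fail_after:
                return False  # outage: the checkpoint records our progress
            self.target.append(doc_id)
            self.checkpoint = sequence
        return True


source = Source()
for doc_id in ["a", "b", "c", "d"]:
    source.write(doc_id)

replicator = Replicator(source)
first_attempt = replicator.run(fail_after=2)  # simulated outage after two documents
second_attempt = replicator.run()             # resumes from the checkpoint
print(first_attempt, second_attempt, replicator.target)
```

The second run does not start over; it asks only for changes after the checkpoint, so nothing is copied twice.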
To summarize, replication offers a number of potential scenarios:
Replicate from database A to database B once
Replicate from database A to database B continuously
Replicate from database A to database B and B to A continuously
Replicate from database A to B to C to D to E to A
Replicate between databases A, B, C, D, and E
Replicate from database A, B, C, and D to database E
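Because each replication is one way, every scenario in this list can be described as a set of (source, target) pairs. The helper functions below are hypothetical names used only to illustrate that; they compute the pairs and perform no replication.

```python
from itertools import permutations


def two_way(a, b):
    """A to B and B to A."""
    return [(a, b), (b, a)]


def ring(names):
    """A to B to C ... and back to A."""
    return [(names[i], names[(i + 1) % len(names)]) for i in range(len(names))]


def mesh(names):
    """Every database replicates to every other database."""
    return list(permutations(names, 2))


def fan_in(sources, target):
    """Several databases all replicate into one."""
    return [(source, target) for source in sources]


databases = ["A", "B", "C", "D", "E"]
print(two_way("A", "B"))
print(ring(databases))
print(len(mesh(databases)), "one-way replications in a full mesh")
print(fan_in(["A", "B", "C", "D"], "E"))
```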
You may think that all of this replication introduces some interesting issues when the same record is edited or modified on multiple machines. CouchDB has a solution for this, too, called conflict resolution. But to keep things simple, even the default response in the event of a conflict is consistent so that it doesn’t stop your database from operating within a cluster.
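One way to picture a consistent default is a rule that depends only on the conflicting revisions themselves, so every database reaches the same answer regardless of the order in which the revisions arrived. The sketch below uses an assumed, simplified rule (highest generation number, then highest hash string) and made-up revision IDs of the form "generation-hash". It is an illustration of determinism, not a statement of CouchDB's exact algorithm.

```python
def pick_winner(revisions):
    """Choose one revision deterministically from conflicting candidates.

    Each revision ID is assumed to look like "<generation>-<hash>".
    """

    def sort_key(revision):
        generation, _, digest = revision.partition("-")
        return (int(generation), digest)

    return max(revisions, key=sort_key)


# The same two conflicting edits, seen in a different order on each machine.
seen_on_desktop = ["2-aaa111", "2-ccc333"]
seen_on_laptop = ["2-ccc333", "2-aaa111"]

print(pick_winner(seen_on_desktop))
print(pick_winner(seen_on_laptop))
```

Both machines select the same revision without talking to each other, which is why a conflict need not stop the databases from operating.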
As you have seen in the previous discussion on replication, distributing your data around different CouchDB instances is one way to take advantage of the functionality and flexibility that CouchDB offers.
One of the issues in a distributed system is the expectation that your network and system operate effectively and reliably. In a typical relational database management system (RDBMS) solution, for example, reliability and, in particular, consistency can start to be a problem in a distributed system. You rely on global state, timing, forced delays, and synchronous operations to ensure that your writes are available across your entire system before your application needs to read them back.
Within the three distinct concerns of consistency, availability, and partition tolerance of Brewer’s CAP theorem on distributed applications, the RDBMS is relying on the C and A portions to support the distributed model. Different solutions approach the problem differently, but a common approach includes using a single database for writes and multiple for reads, which introduces the problem of synchronizing operations so that all clients get the right data.
That is, once you scale up your system beyond a single node and you start to distribute your load across multiple machines, you have to start worrying about how to make the data available, keep it consistent, and partition the information across the database to help support the distributed model.
CouchDB approaches the problem differently using what is called eventual consistency. If the availability of your database is a priority, then CouchDB can be used in a way that allows a single node to provide read and write support, and therefore consistency for the immediate user. The other nodes in the distributed system can catch up later, becoming eventually consistent with the other nodes as the data is updated. This can be achieved while providing high-availability of the data in question.
CouchDB employs other tricks to help improve this consistency model on a single node basis, and to improve the overall performance and throughput. There is no need to go into detail, but some of the features CouchDB uses include:
- Key/value nature of the data store
The key/value nature of the data store enables very quick access to the stored documents. Using a key to read or write a single document provides a huge advantage over a row or lookup method.
- B-tree storage engine for all internal data, documents, and views
B-tree engines are quick for retrieving single keys and key ranges. Better still, the view model also allows for key/value data to be written directly into the B-tree storage engine automatically sorted by the key. This further improves single and range-based key lookups.
- Lock-free database updates
Traditional databases will lock an entire data store (table) or record while data is inserted or updated. CouchDB uses a Multi-Version Concurrency Control (MVCC) model. Instead of locking the database, CouchDB writes a new version of the existing record. This allows different processes to access old versions while the new version is being inserted, and also means that updating the information is really just a case of appending the new data, not reading, updating, and writing back a new version.
- Freeform document format
Most databases will enforce strict requirements on the format of the data, and will check and reject insert and update requests that are not in the correct format. With CouchDB, in many cases your application can use the JSON object structure directly without having to serialize your objects or data into the fixed format required by the database engine.
CouchDB can write the JSON document directly, simplifying the writing/update process, while allowing you to optionally enforce a structure on your JSON documents within the database itself if you need it. The enforcement and validation, though, continues to work with the JSON structure of the data.
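The lock-free update item above can be pictured with a toy append-only store. This is a simplified model under stated assumptions, not CouchDB's storage engine: the AppendOnlyStore class, its integer revision counter, and the snapshot method are hypothetical names for illustration.

```python
class AppendOnlyStore:
    """Toy multi-version store: updates append, nothing is overwritten."""

    def __init__(self):
        self._log = []     # every version ever written, oldest first
        self._latest = {}  # doc_id -> position in _log of the newest version

    def write(self, doc_id, body):
        """Append a new version and return its revision number."""
        revision = 1 + sum(1 for entry in self._log if entry[0] == doc_id)
        self._log.append((doc_id, revision, dict(body)))
        self._latest[doc_id] = len(self._log) - 1
        return revision

    def snapshot(self):
        """Freeze the current view; later writes do not disturb it."""
        return dict(self._latest)

    def read(self, doc_id, snapshot=None):
        positions = self._latest if snapshot is None else snapshot
        return self._log[positions[doc_id]][2]


store = AppendOnlyStore()
store.write("mc-brown", {"phone": "01234 567890"})

reader_view = store.snapshot()  # a reader starts before the update
store.write("mc-brown", {"phone": "01234 567890", "mobile": "+44 1234 098765"})

print(store.read("mc-brown", reader_view))  # the reader still sees the old version
print(store.read("mc-brown"))               # new readers see the new version
```

The writer never had to wait for the reader, and the reader never saw a half-written record, because the update was an append rather than an in-place change.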
By using these features, and the eventual consistency model within a distributed deployment, you can work with CouchDB to help support and improve your performance and latency, and to scale in a more linear fashion.
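The benefit of keeping view data sorted by key, mentioned in the B-tree item above, can be sketched with a sorted list standing in for the B-tree. The map_emails function is a hypothetical Python analogue of a view's map step, and the document field names are illustrative; the point is only that sorted keys make both single-key and range lookups a matter of finding two positions.

```python
from bisect import bisect_left, bisect_right

documents = [
    {"name": "AN Other", "emails": ["email@example.com"]},
    {"name": "MC Brown", "emails": ["firstname.lastname@example.org", "email@example.com"]},
    {"name": "Example", "emails": ["firstname.lastname@example.org"]},
]


def map_emails(doc):
    """Emit one (key, value) pair per email address in the document."""
    for address in doc.get("emails", []):
        yield address, doc["name"]


# Build the "view": every emitted pair, kept sorted by key.
index = sorted(pair for doc in documents for pair in map_emails(doc))
keys = [key for key, _ in index]


def lookup_range(start, end):
    """Return all pairs whose key lies between start and end, inclusive."""
    low = bisect_left(keys, start)
    high = bisect_right(keys, end)
    return index[low:high]


print(lookup_range("email@example.com", "email@example.com"))  # single key
print(lookup_range("f", "g"))                                  # key range
```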
Data: Local, Remote, Everywhere
The CouchDB document-based approach solves another of the major issues in the modern world, which is one of access and ability. Although it is obvious we are moving to a fully connected world and environment, the reality is that there will always be a location, device, or situation where network access is unavailable.
Being in a lift, in the middle of a desert, or on an airplane, or even just suffering a simple power cut, can completely cut you off from your database if it is only accessible on a single server or a cluster of servers in the cloud.
By allowing you to easily copy information from one database to another, CouchDB simplifies the problem of having the data where you need it. Instead of relying on one massive database you can access over the Internet, you can have a copy of the data you need on your laptop or on your iOS or Android mobile phone, and then synchronize the information back to your big database.
The locality of the information also helps solve another problem commonly seen with network-based applications: the latency of access to the information. By storing the data locally and synchronizing the information in the background, the UI can be kept responsive and response times kept low without losing the holistic approach to data storage.
This doesn’t stop you from deploying CouchDB in the cloud or providing central services. Instead it provides you with flexibility for how and where you deploy and distribute your data.
CouchDB Deployment and Performance
Looking over all the different features and functionality in this section, it should be clear that CouchDB can be used and employed in a variety of different ways.
One of the key issues for any modern database system is the problem of scaling and improving the performance of your database to cope with different loads. As a general rule, improving the performance in one area of your system typically has an effect on another area.
For example, increasing your throughput when you read or write information to and from your database will usually increase your latency of response. You can look at a variety of solutions at different points to improve that, but often the effects in one area alter the performance and capabilities in another.
CouchDB doesn’t attempt to solve your scalability problems with any single solution, but instead provides you with a simple and flexible system that can be molded and adapted to your needs. It is not going to solve every problem, and it’s not designed to, but as a basic building block in a larger system, you can use the flexibility of replication to provide scale (both reading and writing), combine it with proxy services to improve latency during scaling, or combine different systems to provide key points in different parts of your solution.