Chapter 1. Couchbase Overview
Couchbase Server is a NoSQL document database for interactive applications that has a flexible data model, is easily scalable, provides consistently high performance, and is “always on”, 24*7*365. From a developer perspective the key aspects are the flexible data structure and the always-on nature. The data model is flexible because at the core, Couchbase stores streams of bytes associated with a document ID. For additional features and practicality, that information can also be JSON-formatted (which brings additional functionality and benefits).
The scalability, performance and always-on nature are primarily focused on the administration and operations side, but they have one key impact on the developer experience. Couchbase supports applications and data loads with a very high level of concurrency. It is not uncommon to find applications with millions of active users, thousands (and more) of data operations per second, and working with some very large datasets.
The application Draw Something is a good example of a highly-concurrent application built on top of Couchbase Server. Draw Something was a smartphone application that allowed people to draw pictures, share it with friends, and let them guess what the picture depicted. As is often the case with such a simple premise, the game became very popular, very quickly. Within 6 weeks the game had gone from launch to an exceedingly popular application with:
15 million daily active users
3,000 new pictures generated every two seconds
Over two billion stored drawings
To just about anybody, those are some staggering numbers. To an administrator, it becomes more staggering when you appreciate that they managed this without taking the site down, and grew from a fairly small 3 node cluster to more than 90 nodes (across three clusters). All without taking the system offline, redeploying or architecting the core application and the database service, and still while serving those millions of picture-hungry users.
For a developer, Couchbase Server is actually a fairly straightforward application. Although the system is designed as a clustered database, as a developer this is not something you need to frequently consider. Or even, at a practical level, to be aware of. Using a Couchbase cluster with two nodes is the same as using a Couchbase cluster with 30 nodes. There are no complexities, no master/slave relationships or complex data handling that need to be supported at the client level.
To a developer, this is probably the most important aspect of the system to understand. When developing with Couchbase the most important concerns are about your application and the data that it needs to store, rather than having to design, and later cope, with sudden growth and complex and arduous scale out procedures.
Architecture and Structure
Couchbase Server is a cluster-based system where each node in the cluster is entirely equal. The software and configuration of each node is identical.
The core of Couchbase Server is a document based storage sysytem, where individual documents are identified by a unique document ID. Data is persisted to disk, and a caching layer provides fast access to the stored objects.
For a developer using a database there can be no more important connection than the one between your application and the underlying server. Couchbase Server supports a number of different client libraries that connect to the cluster. From an application perspective, the fact that the database is a cluster of individual nodes is entirely hidden from the developer. Applications read and write document objects tot he cluster, and the client library and Couchbase Server handle the distribution of information, and where a specific object is located.
Buckets and vBuckets
Two important structures internally within Couchbase Server are Buckets and vBuckets. The Bucket is a logical holder of information and is used to enable you to compartmentalize information that you store into your database. Couchbase Server supports multi-tenancy, that is, the ability for more than one application to store and retrieve information within Couchbase. Buckets are the logical elements used to support this. Buckets are:
All buckets have a name used to identify them on the system.
All bucket data is automatically shared (and sharded) across the Couchbase Server cluster. You cannot control how the information is distributed (it is automatically controlled), but your client library will handle the communication directly between your client and each node in the cluster to identify where the data should be stored (or retrieved) from.
- Optionally password controlled
All buckets can have an optional password. This can be useful both for securely providing access to the information in your database, and to help prevent you from accidentally connecting to the wrong bucket and updating/deleting information.
- Independently managed and monitored
Each bucket has a RAM quota, replication configuration and disk storage. Each bucket can also be independently monitored, and each bucket can be independently compacted and indexed.
Applications can use one or more buckets for storing information, but it’s important to remember that from a server perspective, buckets cannot talk to each other. This has particular impact on views and indexing. The current limit of the number of buckets that can be configured within a single cluster is 10.
Now we know that the information is automatically sharded and distributed around the cluster. Without wanting to get ito too much detail, the method of distribution is by dividing up the bucket content into multiple vBuckets. The vBuckets are distributed around the cluster, and because there are a fixed number of vBuckets, we can hash the document ID to a vBucket, and then through a vBucket map, map the vBucket to a node. When you store a document in the database, the document ID is processed by a hashing algorithm that produces a number, the number refers to the vBucket, and the client library uses the vBucket map to to determine which node within the cluster that is responsible for the vBuckets.
The vBucket structure is critical to the way the Couchbase Server works and is flexible and scalable. When the cluster is extended, individual vBuckets (and the data they contain) are ‘rebalanced’ around the cluster. Since the number of vBuckets does not change, finding the vBucket number for a given document ID has not changed. The hash is computed, the vBucket number determined, and the vBucket map is used to identify the node. Because a given document ID will always compute the same vBucket ID, and the vBucket map always describes this structure, the number of nodes in the cluster can change, shrinking or expanding to cope with different application loads.
For an analogy, let’s consider the typical, physical, filing cabinet. The filing cabinet is constructed of different drawers (our nodes), and there are a fixed number of folders within the drawers (with each folder analogous to the vBucket). On the wall by the filing cabinets is a directory that tells you which drawer contains which folder. If I buy a new filing cabinet, I can move the folders around within the drawers so that they are evenly distributed. Once I’ve completed that operation, I then need to update my directory (our vBucket map). To round the analogy out, there is only one copy of any piece of data within the drawers at one time. Because the data is automatically sharded and distributed, there is no problem in ensuring that the different ‘versions’ of the data across the database are in sync. There is only one version of the data. For additional security, however, you can use replication to create a copy of all the data in a bucket. The replicas are distributed around the stored data, just like keeping a carbon copy or photocopied version of each file in a different drawer.
As a developer, you don’t need to know about the vBucket system, since your client library will handle the entire hashing/lookup and communication process for you, all you need to do is connect your application to your cluster. That said, some of the errors returned by the system will refer to a vBucket specific problem, hence why I’ve included this background.
Data Storage and Retrieval
The most important part of any database system is how you store and retrieve the information from the database. Couchbase Server is a document store, you store data into the database using a document ID and the corresponding document data. The document ID must be unique within the bucket, but the document value that you store can be anything—a stream of bytes, a serialized object from your application language, or a flexible document format such as JSON. We’ll return to the significance of JSON in a moment.
Because of the document nature, the storage and retrieval of information is straightforward. There are no complicated queries or structures to write, and in fact, the core operations can be summarized into just four types of statements: Create, Read, Update, and Delete (CRUD). You can see these four statements in Table 1-1.
|Create||store(DocumentID, DocumentValue)||add(), set()|
|Update||Update(DocumentID, DocumentValue)||replace(), incr(), decr(), append(), prepend(), cas()|
As you can see, the basic operations are very simple, and this ultimately makes the programming and application development very simple. Documents are updated wholesale, when creating, you store the whole document, and when updating, you get the whole document, change the content or portion you want to change, and save it back.
Some additional information that apply to all the operations:
It should be noted that all operations within Couchbase Server are atomic. That is, an operation either works, or it doesn’t. The same operation sent by multiple clients will be processed sequentially in the order they were received.
Related to the previous entry, for all operations the response is either success or failure. If you store a value, it will either succeed, or fail. If you get a value, you will either get the value, or it will fail (i.e., it doesn’t exist). There are no mid-error points in the operations. That doesn’t mean you don’t need to handle errors; if there is an error during an update operation, for example, you will want to be able to recover from and resolve that problem. There are also some operational error conditions from the server that indicate a temporary issue.
These core operations will let you do 95% of the work and operations you would need to do with any database. But there are also some additional operations that provide some more advanced operations, or that provide additional information, such as the observe command.
The actual interface to Couchbase Server for these operations is supported by the memcached protocol. In fact, Couchbase Server is 100% memcached protocol compatible and you can use existing memcached compatible applications with Couchbase. You do, however, gain more functionality by using the full client library and taking advantage of the full cluster functionality.
Time to Live (TTL)
All values can be stored with an optional TTL, or Time to Live, value. The expiry time is particularly useful for data or information that is transient in nature, such as session stores for a website, or baskets during a shopping session. Identification of whether the item has ‘expired’ or not works through two mechanisms:
Lazy identification at the point of access. If you try to to perform a GET or UPDATE or other operation on a value, and the value has expired when the operation occurs, the operation will proceed as if the previous document never existed.
Background expiration operates on a schedule (default is hourly), and it deletes items from memory. This allows for the data to be be removed from the built-in caching layer and the version stored on disk. This deletes the item (and frees the RAM and disk space), regardless of whether or not the data has been accessed.
One common question with Couchbase Server is how to ensure information is consistent across the cluster. Because of the structure of the server, there are no multiple copies of the data. Because the information is automatically sharded across the cluster according to the hash of the document ID used to store the information, there is only one active location for a particular document.
All sets, gets, and updates on the document take place on the same server, in the single active record of the information. There are no consistency issues with the information spread across the database because of this single active copy architecture.
This architecture works on the individual document access model. There are additional layers to the system that may ultimately be affected in a different way by the consistency model. The views and indexing system (described later in this chapter), relies on a separate model when updating and building the index information, and the consistency of the information and the updates work in a very specific way.
From an operational level, you can also determine whether a particular document update has been persisted to replicas, and/or persisted to disk. If the document has been persisted to disk, then that means it’s eligible for inclusion in the view indexes. All this information can be determine through the use of the observe command and, where supported by the client library, update operations that support durability requirements.
In Couchbase Server, developers are given a lot of choice about how these different components work, and how they operate together. There are choices and decisions that can be used to drive the operations, and to control the methods and systems used. The consistency of views and document updates can be controlled to achieve the result you need, but it may require a small tradeoff in performance.
The concurrency of access to the information is related to the fact that there is only one active version of the data. Operations on a document are atomic—that is, they either work or they don’t. It is not possible to ‘accidentally’ corrupt the information by updating the record from two or more hosts simultaneously. Either the operation will succeed both times (sequentially), or one will succeed and the the other will raise an error condition that will then need to be handled. In reality, the chances of updating the same document simultaneously, given that most operations will complete in microseconds, is unlikely.
This still leaves the question of protecting the updates of information. Although you cannot corrupt the document by updating them at the same time, it is still possible to overwrite or update a document with a different version of the information. The Couchbase Server supports a number of operations that help with this, the primary one is Check-and-Set, or “Compare and Swap” (CAS), which adds a check value to the request so that you only update the when the check value supplied matches. The check value is updated every time the document changes, so if two clients obtain the record, and then try to update it, the first one will work, but the second one will fail because the check values do not match (a CAS miss).
Other atomic operations include increment and decrement, and append and prepend which update the data on the server without having to perform a server/client/server roundtrip.
There are also explicit locking operations that will stop an item being updated until the lock is released. I’ll describe these in more detail when we look at the specific operations and the client libraries and SDK interface.
Views, Indexes, and Querying
The power of a database doesn’t come from the ability to store information using specific document IDs and associated content, although you can do an awful lot with just this information. You gain a lot by being able to search, query, and unique link your data that makes it powerful.
In Couchbase Server 2.0, a single system, Views, enables you to translate your document data (ideally stored in JSON) into a fixed format, and using that structured format, you can then query and search database for information. Because the view works by examining the schema-less document data, and converts that into a structured format, you can process and work with the information in a number of different ways. It also means that you can process different documents with different formats into a structured form. This can be particularly useful as your application matures and changes, because the views system can cope with different JSON documents as the schema changes without requiring you to change all your documents to match a new structure.
Views are written using map/reduce functions; the map performs the translation from the document structure to the table output structure. The reduce function can be used to simplify and summarize data, such as building counts, sum totals, or more complex condensations.
Traditional map/reduce is an expensive process, because normally you have to run the map/reduce on the entire dataset each time. Couchbase Server uses a system called incremental map/reduce. The basic map/reduce process creates an index, and the index can be incrementally updated as the source data changes. For example, if you have 10,000 records, and create a view on the information, then when you update 200 records, only those 200 need to be reprocessed for the index to be updated. This means that your views (and indexes) can be kept up to date with your underlying data according to the schedule you set, and much quicker than having to completely reprocess the entire dataset each time.
Once the view, and the corresponding index, have been created, the information can then be queried. In fact, the query mechanism it built on top of the map/reduce structure that you create. Querying your data effectively is therefore a combination of understanding what you want to query, the underlying document structure, and writing a suitable view to produce the index you need to perform that query.
Views are a large topic, but once you get your head around the basics, they are very powerful. Views often confuse the traditional SQL users because they seem to work in reverse. In the SQL world you create the strict structure so that you can query it after the data has been stored and processed. With Couchbase, you store the data in any format and post-process it to extract the information you need in the format you need. It doesn’t normally take long to see the benefits.
Comparing Couchbase to SQL Databases
Comparing the operation and development model of the SQL databases to Couchbase, and indeed document databases in general, is an entire topic and book on its own.
The key difference is the flexibility of the document structure. Document databases such as Couchbase Server allow information to be stored into the database in simple documents, rather than the rigid, schema enforced structure of databases, tables and fields.
Getting information into, and out of, Couchbase Server is generally much more straightforward, and documents can be modeled using more application friendly structures with the information grouped together. For example, with contact information, you can have a single record that covers an entire contact, including multiple email addresses, phone numbers, and other information. In an SQL database, you would either use a fixed table structure, which might limit the number of fields for a particular type of information, or you might use relations between tables to allow for unlimited entries. The latter sounds sensible, until you have to view the information as one record when it requires collecting the information from ten or twelve different tables.
Better still, the document structure allows for the structure to change over time. For example, 20 years ago including an email address for every contact was unnecessary, adding it to an SQL database would mean either adding more tables (and more code to be able to load them), or changing every existing record to add a new field.
This simplified structure offered by document databases not only makes the data architecture easier, it can also makes development easier, as the process of storing and retrieving information is simpler, enabling you to concentrate on the application functionality, and not the complexities of reading and writing to SQL databases.
Beyond the development advantages, it is the ease of scalability, which makes it straightforward to extend and expand your database as your application and data needs increase. Within a typical SQL environment this scalability is more difficult, and must be architected and incorporated into your application structure before you start adding the data. With Couchbase Server, the scalability is built into the database and your application doesn’t need to know about the complexities of the database architecture, and doesn’t need to be changed whether you are running on one node, ten nodes or a hundred nodes.
Couchbase Server excels in a number of different environments. The high performance and concurrency aspects of the database make it ideal for those applications where you have a high number of users performing both reads and writes to the stored data.
For example, online and social gaming involves large numbers of users creating, updating and retrieving large volumes of information as they collect, use, and exchange objects. Session stores fall into this category too, where the consistent high performance and the ability to effectively red and write large volumes of information are key. Content provision, and ad-targeting are also quite popular, where the speed of access to the content, and the ability to track usage statistics of that information is very high.
With the views and querying mechanism, there are a huge range of different applications that can be written to take advantage of the high performance nature, and query ability. Couchbase Server can be used in situations where a MySQL database and memcached caching layer have previously been used together, but without the management complexities that this architecture normally involves. Because Couchbase Server is also memcached compliant, you can also use it as a direct memcached replacement within your existing database and application environment, and still take advantage of the scalability it provides.