This chapter explores the CouchDB in minute detail. It shows all the nitty-gritty and clever bits. We show you best practices and guide you around common pitfalls.
We start out by revisiting the basic operations we ran in the last chapter, looking behind the scenes. We also show what Futon needs to do behind its user interface to give us the nice features we saw earlier.
This chapter is both an introduction to the core CouchDB API as well as a reference. If you can’t remember how to run a particular request or why some parameters are needed, you can always come back here and look things up (we are probably the heaviest users of this chapter).
While explaining the API bits and pieces, we sometimes need to take a larger detour to explain the reasoning for a particular request. This is a good opportunity for us to tell you why CouchDB works the way it does.
The API can be subdivided into the following sections. We’ll explore them individually:
Server
Databases
Documents
Replication
This one is basic and simple. It can serve as a sanity check to see
if CouchDB is running at all. It can also act as a safety guard for
libraries that require a certain version of CouchDB. We’re using the
curl
utility again:
curl http://127.0.0.1:5984/
CouchDB replies, all excited to get going:
{
"couchdb"
:
"Welcome"
,
"version"
:
"0.10.1"
}
You get back a JSON string, that, if parsed into a native object or data structure of your programming language, gives you access to the welcome string and version information.
This is not terribly useful, but it illustrates nicely the way CouchDB behaves. You send an HTTP request and you receive a JSON string in the HTTP response as a result.
Now let’s do something a little more useful: create databases. For the strict, CouchDB is a database management system (DMS). That means it can hold multiple databases. A database is a bucket that holds “related data.” We’ll explore later what that means exactly. In practice, the terminology is overlapping—often people refer to a DMS as “a database” and also a database within the DMS as “a database.” We might follow that slight oddity, so don’t get confused by it. In general, it should be clear from the context if we are talking about the whole of CouchDB or a single database within CouchDB.
Now let’s make one! We want to store our favorite music albums, and
we creatively give our database the name albums
. Note
that we’re now using the -X
option again to tell
curl
to send a PUT
request instead
of the default GET
request:
curl -X PUT http://127.0.0.1:5984/albums
CouchDB replies:
{
"ok"
:
true
}
That’s it. You created a database and CouchDB told you that all went well. What happens if you try to create a database that already exists? Let’s try to create that database again:
curl -X PUT http://127.0.0.1:5984/albums
CouchDB replies:
{
"error"
:
"file_exists"
,
"reason"
:
"The database could not be created, the file
already exists."
}
We get back an error. This is pretty convenient. We also learn a little bit about how CouchDB works. CouchDB stores each database in a single file. Very simple. This has some consequences down the road, but we’ll skip the details for now and explore the underlying storage system in Appendix F.
Let’s create another database, this time with
curl
’s -v
(for “verbose”) option.
The verbose option tells curl
to show us not only the
essentials—the HTTP response body—but all the underlying request and
response details:
curl -vX PUT http://127.0.0.1:5984/albums-backup
curl
elaborates:
* About to connect() to 127.0.0.1 port 5984 (#0) * Trying 127.0.0.1... connected * Connected to 127.0.0.1 (127.0.0.1) port 5984 (#0) > PUT /albums-backup HTTP/1.1 > User-Agent: curl/7.16.3 (powerpc-apple-darwin9.0) libcurl/7.16.3 OpenSSL/0.9.7l zlib/1.2.3 > Host: 127.0.0.1:5984 > Accept: */* > < HTTP/1.1 201 Created < Server: CouchDB/0.9.0 (Erlang OTP/R12B) < Date: Sun, 05 Jul 2009 22:48:28 GMT < Content-Type: text/plain;charset=utf-8 < Content-Length: 12 < Cache-Control: must-revalidate < {"ok":true} * Connection #0 to host 127.0.0.1 left intact * Closing connection #0
What a mouthful. Let’s step through this line by line to understand what’s going on and find out what’s important. Once you’ve seen this output a few times, you’ll be able to spot the important bits more easily.
* About to connect() to 127.0.0.1 port 5984 (#0)
This is curl
telling us that it is going to
establish a TCP connection to the CouchDB server we
specified in our request URI. Not at all important, except when debugging
networking issues.
* Trying 127.0.0.1... connected * Connected to 127.0.0.1 (127.0.0.1) port 5984 (#0)
curl
tells us it successfully connected to
CouchDB. Again, not important if you aren’t trying to find problems with
your network.
The following lines are prefixed with >
and
<
characters. >
means the line
was sent to CouchDB verbatim (without the actual >
).
<
means the line was sent back to
curl
by CouchDB.
> PUT /albums-backup HTTP/1.1
This initiates an HTTP request. Its method is
PUT
, the URI is
/albums-backup
, and the HTTP version is
HTTP/1.1
. There is also HTTP/1.0
,
which is simpler in some cases, but for all practical reasons you should
be using HTTP/1.1
.
Next, we see a number of request headers. These are used to provide additional details about the request to CouchDB.
> User-Agent: curl/7.16.3 (powerpc-apple-darwin9.0) libcurl/7.16.3 OpenSSL/0.9.7l zlib/1.2.3
The User-Agent
header tell CouchDB which piece of
client software is doing the HTTP request. We don’t learn anything new:
it’s curl
. This header is often useful in web
development when there are known errors in client implementations that a
server might want to prepare the response for. It also helps to determine
which platform a user is on. This information can be used for technical
and statistical reasons. For CouchDB, the User-Agent
header is irrelevant.
> Host: 127.0.0.1:5984
The Host
header is required by HTTP 1.1. It tells
the server the hostname that came with the request.
> Accept: */*
The Accept
header tells CouchDB that
curl
accepts any media type. We’ll look into why this
is useful a little later.
>
An empty line denotes that the request headers are now finished and the rest of the request contains data we’re sending to the server. In this case, we’re not sending any data, so the rest of the curl output is dedicated to the HTTP response.
< HTTP/1.1 201 Created
The first line of CouchDB’s HTTP response includes the HTTP version
information (again, to acknowledge that the requested version could be
processed), an HTTP status code, and a
status code message. Different requests trigger
different response codes. There’s a whole range of them telling the client
(curl
in our case) what effect the request had on the
server. Or, if an error occurred, what kind of error. RFC 2616 (the
HTTP 1.1 specification) defines
clear behavior for response codes. CouchDB fully follows the
RFC.
The 201 Created status code tells the client
that the resource the request was made against was successfully created.
No surprise here, but if you remember that we got an error message when we
tried to create this database twice, you now know that this response could
include a different response code. Acting upon responses based on response
codes is a common practice. For example, all response codes of 400 or
larger tell you that some error occurred. If you want to shortcut your
logic and immediately deal with the error, you could just check a
>= 400
response code.
< Server: CouchDB/0.10.1 (Erlang OTP/R13B)
The Server
header is good for diagnostics. It
tells us which CouchDB version and which underlying Erlang version we are
talking to. In general, you can ignore this header, but it is good to know
it’s there if you need it.
< Date: Sun, 05 Jul 2009 22:48:28 GMT
The Date
header tells you the time of the server.
Since client and server time are not necessary synchronized, this header
is purely informational. You shouldn’t build any critical application
logic on top of this!
< Content-Type: text/plain;charset=utf-8
The Content-Type
header tells you which MIME type
the HTTP response body is and its encoding. We already know CouchDB
returns JSON strings. The appropriate Content-Type
header is
application/json
. Why do we see
text/plain
? This is where pragmatism wins over purity.
Sending an application/json
Content-Type
header will make a browser offer you the
returned JSON for download instead of just displaying it. Since it is
extremely useful to be able to test
CouchDB from a browser, CouchDB sends a text/plain
content type, so all browsers will display the JSON as text.
There are some browser extensions that make your browser JSON-aware, but they are not installed by default.
Do you remember the Accept
request header and how
it is set to \*/\* -> */*
to express interest in any
MIME type? If you send Accept: application/json
in your
request, CouchDB knows that you can deal with a pure JSON response with
the proper Content-Type
header and will use it instead
of text/plain
.
< Content-Length: 12
The Content-Length
header simply tells us how
many bytes the response body has.
< Cache-Control: must-revalidate
This Cache-Control
header tells you, or any proxy
server between CouchDB and you, not to cache this response.
<
This empty line tells us we’re done with the response headers and what follows now is the response body.
{"ok":true}
We’ve seen this before.
* Connection #0 to host 127.0.0.1 left intact * Closing connection #0
The last two lines are curl
telling us that it
kept the TCP connection it opened in the beginning open for a moment, but
then closed it after it received the entire response.
Throughout the book, we’ll show more requests with the
-v
option, but we’ll omit some of the headers we’ve
seen here and include only those that are important for the particular
request.
Creating databases is all fine, but how do we get rid of one? Easy—just change the HTTP method:
> curl -vX DELETE http://127.0.0.1:5984/albums-backup
This deletes a CouchDB database. The request will remove the file that the database contents are stored in. There is no “Are you sure?” safety net or any “Empty the trash” magic you’ve got to do to delete a database. Use this command with care. Your data will be deleted without a chance to bring it back easily if you don’t have a backup copy.
This section went knee-deep into HTTP and set the stage for discussing the rest of the core CouchDB API. Next stop: documents.
Documents are CouchDB’s central data structure. The idea behind a document is, unsurprisingly, that of a real-world document—a sheet of paper such as an invoice, a recipe, or a business card. We already learned that CouchDB uses the JSON format to store documents. Let’s see how this storing works at the lowest level.
Each document in CouchDB has an ID. This ID is unique per database. You are free to choose any string to be the ID, but for best results we recommend a UUID (or GUID), i.e., a Universally (or Globally) Unique IDentifier. UUIDs are random numbers that have such a low collision probability that everybody can make thousands of UUIDs a minute for millions of years without ever creating a duplicate. This is a great way to ensure two independent people cannot create two different documents with the same ID. Why should you care what somebody else is doing? For one, that somebody else could be you at a later time or on a different computer; secondly, CouchDB replication lets you share documents with others and using UUIDs ensures that it all works. But more on that later; let’s make some documents:
curl -X PUT http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af \ -d '{"title":"There is Nothing Left to Lose","artist":"Foo Fighters"}'
CouchDB replies:
{"ok":true,"id":"6e1295ed6c29495e54cc05947f18c8af","rev":"1-2902191555"}
The curl
command appears complex, but let’s break
it down. First, -X PUT
tells curl
to
make a PUT request. It is followed by the URL that specifies your CouchDB IP address and port. The resource
part of the URL
/albums/6e1295ed6c29495e54cc05947f18c8af
specifies the
location of a document inside our albums
database. The
wild collection of numbers and characters is a UUID. This UUID is your
document’s ID. Finally, the -d
flag tells curl
to use the following string as the body
for the PUT request. The string is a simple JSON structure including
title
and artist
attributes with
their respective values.
Note
If you don’t have a UUID handy, you can ask CouchDB to give you
one (in fact, that is what we did just now without showing you). Simply
send a GET request to /_uuids
:
curl -X GET http://127.0.0.1:5984/_uuids
CouchDB replies:
{
"uuids"
:
[
"6e1295ed6c29495e54cc05947f18c8af"
]}
Voilá, a UUID. If you need more than one, you can pass in the
?count=10
HTTP parameter to request 10 UUIDs, or really,
any number you need.
To double-check that CouchDB isn’t lying about having saved your document (it usually doesn’t), try to retrieve it by sending a GET request:
curl -X GET http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af
We hope you see a pattern here. Everything in CouchDB has an address, a URI, and you use the different HTTP methods to operate on these URIs.
CouchDB replies:
{
"_id"
:
"6e1295ed6c29495e54cc05947f18c8af"
,
"_rev"
:
"1-2902191555"
,
"title"
:
"There is Nothing Left to Lose"
,
"artist"
:
"Foo Fighters"
}
This looks a lot like the document you asked CouchDB to save, which
is good. But you should notice that CouchDB added two fields to your JSON
structure. The first is _id
, which holds the UUID we
asked CouchDB to save our document under. We always know the ID of a
document if it is included, which is very convenient.
The second field is _rev
. It stands for
revision.
If you want to change a document in CouchDB, you don’t tell it to
go and find a field in a specific document and insert a new value.
Instead, you load the full document out of CouchDB, make your changes in
the JSON structure (or object, when you are doing actual programming),
and save the entire new revision (or version) of that document back into
CouchDB. Each revision is identified by a new _rev
value.
If you want to update or delete a document, CouchDB expects you to
include the _rev
field of the revision you wish to
change. When CouchDB accepts the change, it will generate a new revision
number. This mechanism ensures that, in case somebody else made a change
unbeknownst to you before you got to request the document update,
CouchDB will not accept your update because you are likely to overwrite
data you didn’t know existed. Or simplified: whoever saves a change to a
document first, wins. Let’s see what happens if we don’t provide a
_rev
field (which is equivalent to providing a
outdated value):
curl -X PUT http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af \ -d '{"title":"There is Nothing Left to Lose","artist":"Foo Fighters","year":"1997"}'
CouchDB replies:
{
"error"
:
"conflict"
,
"reason"
:
"Document update conflict."
}
If you see this, add the latest revision number of your document to the JSON structure:
curl -X PUT http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af \ -d '{"_rev":"1-2902191555","title":"There is Nothing Left to Lose", "artist":"Foo Fighters","year":"1997"}'
Now you see why it was handy that CouchDB returned that
_rev
when we made the initial request. CouchDB
replies:
{
"ok"
:
true
,
"id"
:
"6e1295ed6c29495e54cc05947f18c8af"
,
"rev"
:
"2-2739352689"
}
CouchDB accepted your write and also generated a new revision
number. The revision number is the md5 hash of the transport
representation of a document with an N-
prefix
denoting the number of times a document got updated. This is useful for
replication. See Chapter 17 for more
information.
There are multiple reasons why CouchDB uses this revision system, which is also called Multi-Version Concurrency Control (MVCC). They all work hand-in-hand, and this is a good opportunity to explain some of them.
One of the aspects of the HTTP protocol that CouchDB uses is that it is stateless. What does that mean? When talking to CouchDB you need to make requests. Making a request includes opening a network connection to CouchDB, exchanging bytes, and closing the connection. This is done every time you make a request. Other protocols allow you to open a connection, exchange bytes, keep the connection open, exchange more bytes later—maybe depending on the bytes you exchanged at the beginning—and eventually close the connection. Holding a connection open for later use requires the server to do extra work. One common pattern is that for the lifetime of a connection, the client has a consistent and static view of the data on the server. Managing huge amounts of parallel connections is a significant amount of work. HTTP connections are usually short-lived, and making the same guarantees is a lot easier. As a result, CouchDB can handle many more concurrent connections.
Another reason CouchDB uses MVCC is that this model is simpler conceptually and, as a consequence, easier to program. CouchDB uses less code to make this work, and less code is always good because the ratio of defects per lines of code is static.
The revision system also has positive effects on replication and storage mechanisms, but we’ll explore these later in the book.
Warning
The terms version and revision might sound familiar (if you are programming without version control, drop this book right now and start learning one of the popular systems). Using new versions for document changes works a lot like version control, but there’s an important difference: CouchDB does not guarantee that older versions are kept around.
Now let’s have a closer look at our document creation requests
with the curl
-v
flag that was
helpful when we explored the database API earlier. This is also a good
opportunity to create more
documents that we can use in later examples.
We’ll add some more of our favorite music albums. Get a fresh UUID
from the /_uuids
resource. If you don’t
remember how that works, you can look it up a few pages back.
curl -vX PUT http://127.0.0.1:5984/albums/70b50bfa0a4b3aed1f8aff9e92dc16a0 \ -d '{"title":"Blackened Sky","artist":"Biffy Clyro","year":2002}'
Note
By the way, if you happen to know more information about your favorite albums, don’t hesitate to add more properties. And don’t worry about not knowing all the information for all the albums. CouchDB’s schema-less documents can contain whatever you know. After all, you should relax and not worry about data.
Now with the -v
option, CouchDB’s reply (with
only the important bits shown) looks like this:
> PUT /albums/70b50bfa0a4b3aed1f8aff9e92dc16a0 HTTP/1.1 > < HTTP/1.1 201 Created < Location: http://127.0.0.1:5984/albums/70b50bfa0a4b3aed1f8aff9e92dc16a0 < Etag: "1-2248288203" < {"ok":true,"id":"70b50bfa0a4b3aed1f8aff9e92dc16a0","rev":"1-2248288203"}
We’re getting back the 201 Created
HTTP status
code in the response headers, as we saw earlier when we created a
database. The Location
header gives us a full URL to
our newly created document. And there’s a new header. An Etag in
HTTP-speak identifies a specific version of a resource. In this case, it
identifies a specific version (the first one) of our new document. Sound
familiar? Yes, conceptually, an Etag is the same as a CouchDB document
revision number, and it shouldn’t come as a surprise that CouchDB uses
revision numbers for Etags. Etags are useful for caching
infrastructures. We’ll learn how to use them in Chapter 8.
CouchDB documents can have attachments just like an email message can have attachments. An attachment is identified by a name and includes its MIME type (or Content-Type) and the number of bytes the attachment contains. Attachments can be any data. It is easiest to think about attachments as files attached to a document. These files can be text, images, Word documents, music, or movie files. Let’s make one.
Attachments get their own URL where you can upload data. Say we
want to add the album artwork to the
6e1295ed6c29495e54cc05947f18c8af
document (“There
is Nothing Left to Lose”), and let’s also say the artwork is in a file
artwork.jpg in the current directory:
> curl -vX PUT http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af/ \ artwork.jpg?rev=2-2739352689 --data-binary @artwork.jpg -H "Content-Type: image/jpg"
The -d@
option tells curl
to read a file’s contents into the HTTP request body. We’re using the
-H
option to tell CouchDB that we’re uploading a
JPEG file. CouchDB will keep this information around and will send the
appropriate header when requesting this attachment; in case of an
image like this, a browser will render the image instead of offering
you the data for download. This will come in handy later. Note that
you need to provide the current revision number of the document you’re
attaching the artwork to, just as if you would update the document.
Because, after all, attaching some data is changing the
document.
You should now see your artwork image if you point your browser to http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af/artwork.jpg.
If you request the document again, you’ll see a new member:
curl http://127.0.0.1:5984/albums/6e1295ed6c29495e54cc05947f18c8af
CouchDB replies:
{
"_id"
:
"6e1295ed6c29495e54cc05947f18c8af"
,
"_rev"
:
"3-131533518"
,
"title"
:
"There is Nothing Left to Lose"
,
"artist"
:
"Foo Fighters"
,
"year"
:
"1997"
,
"_attachments"
:
{
"artwork.jpg"
:
{
"stub"
:
true
,
"content_type"
:
"image/jpg"
,
"length"
:
52450
}}}
_attachments
is a list of keys and values
where the values are JSON objects containing the attachment metadata.
stub=true
tells us that this entry is just the
metadata. If we use the ?attachments=true
HTTP
option when requesting this document, we’d get a Base64-encoded string
containing the attachment data.
We’ll have a look at more document request options later as we explore more features of CouchDB, such as replication, which is the next topic.
CouchDB replication is a mechanism to synchronize databases. Much
like rsync
synchronizes two directories locally or over
a network, replication synchronizes two databases locally or remotely.
In a simple POST request, you tell CouchDB the source and the target of a replication and CouchDB will figure out which documents and new document revisions are on source that are not yet on target, and will proceed to move the missing documents and revisions over.
We’ll take an in-depth look at replication later in the book; in this chapter, we’ll just show you how to use it.
First, we’ll create a target database. Note that CouchDB won’t automatically create a target database for you, and will return a replication failure if the target doesn’t exist (likewise for the source, but that mistake isn’t as easy to make):
curl -X PUT http://127.0.0.1:5984/albums-replica
Now we can use the database albums-replica
as a
replication target:
curl -vX POST http://127.0.0.1:5984/_replicate \ -d '{"source":"albums","target":"albums-replica"}'
Note
As of version 0.11, CouchDB supports the option
"create_target":true
placed in the JSON POSTed to the
_replicate
URL. It implicitly creates the target
database if it doesn’t exist.
CouchDB replies (this time we formatted the output so you can read it more easily):
{
"history"
:
[
{
"start_last_seq"
:
0
,
"missing_found"
:
2
,
"docs_read"
:
2
,
"end_last_seq"
:
5
,
"missing_checked"
:
2
,
"docs_written"
:
2
,
"doc_write_failures"
:
0
,
"end_time"
:
"Sat, 11 Jul 2009 17:36:21 GMT"
,
"start_time"
:
"Sat, 11 Jul 2009 17:36:20 GMT"
}
],
"source_last_seq"
:
5
,
"session_id"
:
"924e75e914392343de89c99d29d06671"
,
"ok"
:
true
}
CouchDB maintains a session history of replications. The response for a replication request contains the history entry for this replication session. It is also worth noting that the request for replication will stay open until replication closes. If you have a lot of documents, it’ll take a while until they are all replicated and you won’t get back the replication response until all documents are replicated. It is important to note that replication replicates the database only as it was at the point in time when replication was started. So, any additions, modifications, or deletions subsequent to the start of replication will not be replicated.
We’ll punt on the details again—the "ok": true
at
the end tells us all went well. If you now have a look at the
albums-replica
database, you should see all the
documents that you created in the albums
database.
Neat, eh?
What you just did is called local replication in CouchDB terms. You created a local copy of a database. This is useful for backups or to keep snapshots of a specific state of your data around for later. You might want to do this if you are developing your applications but want to be able to roll back to a stable version of your code and data.
There are more types of replication useful in other situations. The
source
and target
members of our
replication request are actually links (like in HTML) and so far we’ve
seen links relative to the server we’re working on (hence
local). You can also specify a remote database as the
target:
curl -vX POST http://127.0.0.1:5984/_replicate \ -d '{"source":"albums","target":"http://127.0.0.1:5984/albums-replica"}'
Using a local source
and a remote
target
database is called push
replication. We’re pushing changes to a remote
server.
Note
Since we don’t have a second CouchDB server around just yet, we’ll just use the absolute address of our single server, but you should be able to infer from this that you can put any remote server in there.
This is great for sharing local changes with remote servers or buddies next door.
You can also use a remote source
and a local
target
to do a pull replication.
This is great for getting the latest changes from a server that is used by
others:
curl -vX POST http://127.0.0.1:5984/_replicate \ -d '{"source":"http://127.0.0.1:5984/albums-replica","target":"albums"}'
Finally, you can run remote replication, which is mostly useful for management operations:
curl -vX POST http://127.0.0.1:5984/_replicate \ -d '{"source":"http://127.0.0.1:5984/albums", "target":"http://127.0.0.1:5984/albums-replica"}'
Get CouchDB: The Definitive Guide now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.