O'Reilly logo

Scaling CouchDB by Bradley Holt

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Chapter 4. Load Balancing

Load balancing allows you to distribute the workload evenly across multiple CouchDB nodes. Since CouchDB uses an HTTP API, standard HTTP load balancing software or hardware can be used. With simple load balancing, each CouchDB node will maintain a full copy of your database through replication. Each document will eventually need to be written to every node, which is a limitation of this approach since the sustained write throughput of your entire system will be limited to that of the slowest node. You could replicate only certain documents using filter functions or by specifying document IDs, as discussed in Chapter 3. This approach to clustering could get complicated very quickly. See Chapter 5 for details on an alternative way to distribute your data across multiple CouchDB nodes.

In this scenario, we will set up a write-only master node and three read-only slave nodes. We will send all “unsafe” HTTP write requests (POST, PUT, DELETE, MOVE, and COPY) to the master node and load balance all “safe” HTTP read requests (GET, HEAD, and OPTIONS) across the three slave nodes. We will set up continuous replication from the write-only master to each of the read-only slave nodes. See Figure 4-1 for a diagram of the configuration we will be creating in this chapter.

Load balancing configuration
Figure 4-1. Load balancing configuration

Note

MOVE and COPY are non-standard HTTP methods. Only versions 0.8 and 0.9 of CouchDB supported the MOVE HTTP method. The MOVE HTTP method was removed from CouchDB since it was really just a COPY followed by a DELETE, but implied that there was a transaction across these two operations (which there was not). Assuming you are using a newer version of CouchDB, then there’s no need to concern yourself with the MOVE HTTP method.

There are many load balancing software and hardware options available. A full discussion of all the available tools and how to install and configure each on every available platform is beyond the scope of this book. Instead, we will focus on installing and configuring the Apache HTTP Server as a load balancer. We’ll be using Ubuntu, but these instructions should be easily adaptable to your operating system.

Warning

You may want to consider having multiple load balancers so that you can remove the load balancer as a single point of failure. This setup typically involves having two or more load balancers sharing the same IP address, with one configured as a failover. The details of this configuration are beyond the scope of this book.

Apache was chosen as the load balancer for this scenario because it is relatively easy to configure and has the basic capabilities needed. Other load balancers you may want to consider are HAProxy, Varnish, Pound, Perlbal, Squid, nginx, and Linux-HA (High-Availability Linux) on Linux Standard Base (LSB). This example illustrate a basic load balancing setup. Hardware load balancers are also available. Your hosting provider may also offer its own, proprietary load balancing tools. For example, Amazon has a tool called Elastic Load Balancing and Rackspace provides a service called Rackspace Cloud Load Balancers (in beta as of this writing).

CouchDB Nodes

In the following scenario, we will send write requests to one master node with a domain name of couch-master.example.com and distribute read requests to three nodes on machines with domain names of couch-a.example.com, couch-b.example.com, and couch-c.example.com. Install CouchDB on the master node and on all three slave nodes:

sudo aptitude install couchdb

We need to configure CouchDB to allow connections from the outside world. On all four nodes, configure CouchDB to bind to each server’s IP address by editing the [httpd] section of /etc/couchdb/local.ini as follows, replacing <server IP address> with your server’s IP address:

Warning

Unless your server is behind a firewall, this configuration change will allow anyone to access your CouchDB database. You will likely want to configure authentication and authorization. For information on this, see Chapter 22 in CouchDB: The Definitive Guide (O’Reilly).

[httpd]
; port = 5984
bind_address = <server IP address>
; Uncomment next line to trigger basic-auth popup on unauthorized requests.
;WWW-Authenticate = Basic realm="administrator"

If your server has multiple IP addresses, you can use 0.0.0.0 as the bind_address to bind on all IP addresses.

Restart CouchDB on all four nodes:

sudo /etc/init.d/couchdb restart

Test all four nodes by trying to connect to each from a different machine:

curl -X GET http://couch-master.example.com:5984/
curl -X GET http://couch-a.example.com:5984/
curl -X GET http://couch-b.example.com:5984/
curl -X GET http://couch-c.example.com:5984/

The response to each request should be:

{"couchdb":"Welcome","version":"1.0.1"}

If you can’t connect remotely to one or more of the CouchDB nodes, then double-check that the bind_address in /etc/couchdb/local.ini is set to each machine’s correct IP address, respectively, and that you have restarted CouchDB.

Create a database named api on all four nodes:

curl -X PUT http://couch-master.example.com:5984/api
curl -X PUT http://couch-a.example.com:5984/api
curl -X PUT http://couch-b.example.com:5984/api
curl -X PUT http://couch-c.example.com:5984/api

The response to each request should be:

{"ok":true}

At this point we have four CouchDB databases, on four different nodes, running independently. If we write data to the master node, it will not be replicated to any of the slave nodes yet.

Replication Setup

Note

CouchDB supports both pull replication and push replication. Pull replication is when replication is triggered from the same node as the target. Push replication is when replication is triggered from the same node as the source.

Set up a pull replication with couch-a.example.com pulling changes from couch-master.example.com:

curl -X POST http://couch-a.example.com:5984/_replicate \
-H "Content-Type: application/json" \
-d \
'{
   "source":"http://couch-master.example.com:5984/api",
   "target":"api",
   "continuous":true
}'

The response (your details will be different):

{"ok":true,"_local_id":"471f59393820994cd50fb432b17c9a96"}

Set up a pull replication with couch-b.example.com pulling changes from couch-master.example.com:

curl -X POST http://couch-b.example.com:5984/_replicate \
-H "Content-Type: application/json" \
-d \
'{
   "source":"http://couch-master.example.com:5984/api",
   "target":"api",
   "continuous":true
}'

The response (your details will be different):

{"ok":true,"_local_id":"0fdc3a4b85e224e51fdb6ce63f9bcbc6"}

Set up a pull replication with couch-c.example.com pulling changes from couch-master.example.com:

curl -X POST http://couch-c.example.com:5984/_replicate \
-H "Content-Type: application/json" \
-d \
'{
   "source":"http://couch-master.example.com:5984/api",
   "target":"api",
   "continuous":true
}'

The response (your details will be different):

{"ok":true,"_local_id":"ed8941ac5abf1cd7370b2d9a79000a11"}

Note

You may want to set up replication using a separate, private, network so that you can have dedicated bandwidth for private and public requests.

Proxy Server Configuration

The Apache HTTP server is extremely versatile. It has many features including a proxy server and a load balancer (as of version 2.1). See Recipe 10.9 in Apache Cookbook, Second Edition (O’Reilly). In this exercise, we will create our load balancer on a machine with a domain name of couch-proxy.example.com..

On couch-proxy.example.com, install Apache 2:

sudo aptitude install apache2

On couch-proxy.example.com, install mod_proxy:

sudo aptitude install libapache2-mod-proxy-html

On couch-proxy.example.com, enable mod_proxy:

sudo a2enmod proxy

On couch-proxy.example.com, enable mod_proxy_http:

sudo a2enmod proxy_http

On couch-proxy.example.com, enable mod_proxy_balancer:

sudo a2enmod proxy_balancer

We will also need mod_headers enabled:

sudo a2enmod headers

Finally, we will need mod_rewrite enabled:

sudo a2enmod rewrite

On couch-proxy.example.com, edit /etc/apache2/httpd.conf and add the following (it is likely that the file will be empty to start with):

Header append Vary Accept
Header add Set-Cookie "NODE=%{BALANCER_WORKER_ROUTE}e; path=/api" \
env=BALANCER_ROUTE_CHANGED

<Proxy balancer://couch-slave>
    BalancerMember http://couch-a.example.com:5984/api route=couch-a max=4
    BalancerMember http://couch-b.example.com:5984/api route=couch-b max=4
    BalancerMember http://couch-c.example.com:5984/api route=couch-c max=4
    ProxySet stickysession=NODE
    ProxySet timeout=5
</Proxy>

RewriteEngine On
RewriteCond %{REQUEST_METHOD} ^(POST|PUT|DELETE|MOVE|COPY)$
RewriteRule ^/api(.*)$ http://couch-master.example.com:5984/api$1 [P]
RewriteCond %{REQUEST_METHOD} ^(GET|HEAD|OPTIONS)$
RewriteRule ^/api(.*)$ balancer://couch-slave$1 [P]
ProxyPassReverse /api http://couch-master:5984/api
ProxyPassReverse /api balancer://couch-slave

Note

Apache allows for three possible load balancer scheduler algorithms. Traffic can be balanced based on number of requests (lbmethod=byrequests), the number of bytes transferred (lbmethod=bytraffic), or by the number of currently pending requests (lbmethod=bybusyness). The default is to balance by requests. To instead balance by busyness, add a ProxySet lbmethod=bybusyness directive to the end of the <Proxy> directive group (after ProxySet timeout=5 and before </Proxy>), although the order doesn’t matter.

You will also need to configure your virtual host to enable the rewrite engine and inherit the rewrite options from the server configuration above. Edit /etc/apache2/sites-enabled/000-default (or the configuration file for the appropriate virtual host) and add the following before the closing </VirtualHost> directive group:

    RewriteEngine On
    RewriteOptions inherit

Let’s take a look at each line of the /etc/apache2/httpd.conf configuration file. The Header append Vary Accept line appends the value Accept to the Vary HTTP header. If you have mod_deflate enabled then this module will add a Vary HTTP header with a value of Accept-Encoding. A Vary HTTP header informs a client as to what set of request-header fields it is permitted to base its caching on. Since mod_deflate may be adding this header, and CouchDB uses the Accept header to vary the media type (reflected in a Content-Type header with either a value of text/plain or application/json), it’s a good idea to make sure that clients know to also vary their caching based on the Accept HTTP header, and not just the Accept-Encoding HTTP header.

The line beginning with Header add Set-Cookie sets a cookie named NODE on the client. The value of this cookie will be the route name associated with the load balancer member that served the request. This allows for sticky sessions meaning that, once a client has been routed to a specific load balancer member, that client’s requests will continue to be routed to that same load balancer member node. This provides more consistency to the client. The path=/api part indicates to the client the URL path for which the cookie is valid. The env=BALANCER_ROUTE_CHANGED part indicates that the cookie should only be sent if the load balancer route has changed.

The <Proxy balancer://couch-slave> directive group defines a load balancer named couch-slave. A later configuration directive will define what requests should be sent to this load balancer. The three BalancerMember lines each add a member to the load balancer. The route= parameters (e.g., route=couch-a) give each route a name. The route name is used as the value of the NODE cookie. The max=4 parameters indicate the maximum number of connections that the proxy will allow to the backend server. The ProxySet stickysession=NODE directive indicates to the load balancer the cookie name to use (in this case NODE) when determining which route to use. The ProxySet timeout=5 directive instructs the proxy server to wait 5 seconds before timing out connections to the backend server.

Note

Keep the maximum number of connections to each CouchDB node low (e.g., max=4). This will prevent each node from getting overloaded. While 4 may seem like a very low number, CouchDB will respond to each request very quickly and allow for a high level of throughput. If the proxy server has enough memory and is configured to allow enough concurrent clients itself, then it can effectively queue requests for the backend servers.

If we didn’t need to proxy requests based on the HTTP method, we could have used the ProxyPass directive. However, for this added flexible we need to use mod_rewrite with the proxy ([P]) flag. The RewriteEngine On line enables the rewrite engine. The next line sets up a rewrite condition that says to only run the subsequent rewrite rule if the request HTTP method is POST, PUT, DELETE, MOVE, or COPY:

RewriteCond %{REQUEST_METHOD} ^(POST|PUT|DELETE|MOVE|COPY)$

The subsequent rewrite rule then proxies all requests to URIs starting with /api to the equivalent URI on http://couch-master.example.com:5984 (again, only if the previous rewrite condition has been met):

RewriteRule ^/api(.*)$ http://couch-master.example.com:5984/api$1 [P]

The next line contains another rewrite condition. This one says to only run the subsequent rewrite rule if the request HTTP method is GET, HEAD, or OPTIONS:

RewriteCond %{REQUEST_METHOD} ^(GET|HEAD|OPTIONS)$

The subsequent rewrite rule then proxies all requests to URIs starting with /api to the equivalent URI on the couch-master load balancer (again, only if the previous rewrite condition has been met):

RewriteRule ^/api(.*)$ balancer://couch-slave$1 [P]

The following ProxyPassReverse directives instructs Apache to adjust the URLs in the HTTP response headers to match that of the proxy server, instead of the reverse proxied server. This is mainly useful for the Location header that is sent when CouchDB creates a new document:

ProxyPassReverse /api http://couch-master:5984/api
ProxyPassReverse /api balancer://couch-slave

Open /etc/apache2/apache2.conf and look for the ServerLimit, ThreadsPerChild, and MaxClients directives. Apache limits the MaxClients to the ServerLimit multiplied by the ThreadsPerChild. These directives are intended to prevent your server from running out of memory and swapping, which would significantly decrease performance. Following is an example configuration with the MaxClients increased to 5,000 (this is from a machine with 1 GB of RAM):

ServerLimit         200
ThreadsPerChild      25
MaxClients         5000

On couch-proxy.example.com, restart Apache:

sudo /etc/init.d/apache2 restart

Testing

Test your load balancer by making an HTTP request to the proxy server:

curl -X GET http://couch-proxy.example.com/api

It should proxy the request through to one of the CouchDB nodes and respond as follows (your details will be different):

{
   "db_name":"api",
   "doc_count":0,
   "doc_del_count":0,
   "update_seq":0,
   "purge_seq":0,
   "compact_running":false,
   "disk_size":79,
   "instance_start_time":"1296720556231838",
   "disk_format_version":5,
   "committed_update_seq":0
}

Let’s try and POST a new document to the load balancer, treating it as if it’s a CouchDB node:

curl -X POST http://couch-proxy.example.com/api \
-H "Content-Type: application/json" \
-d '{
   "_id":"doc-a"
}'

The response:

{
   "ok":true,
   "id":"doc-a",
   "rev":"1-967a00dff5e02add41819138abb3284d"
}

Let’s try and GET the newly created document from the load balancer, again treating it as if it’s a CouchDB node:

curl -X GET http://couch-proxy.example.com/api/doc-a

The response:

{
   "_id":"doc-a",
   "_rev":"1-967a00dff5e02add41819138abb3284d"
}

If you GET the newly created document from couch-a.example.com, then you should get the exact same response:

curl -X GET http://couch-a.example.com:5984/api/doc-a

If you GET the newly created document from couch-b.example.com, then you should get the exact same response:

curl -X GET http://couch-b.example.com:5984/api/doc-a

Finally, if you GET the newly created document from couch-c.example.com, then you should get the exact same response:

curl -X GET http://couch-c.example.com:5984/api/doc-a

If the document did not replicate from one CouchDB node to the other, then make sure that continuous replication is running.

Note

See Chapter 6 for information about how to perform distributed load testing. When load testing, watch Apache’s error log for errors such as “server reached MaxClients setting, consider raising the MaxClients setting” or “server is within MinSpareThreads of MaxClients, consider raising the MaxClients setting,” and adjust Apache’s settings as described earlier.

After changing these settings and restarting Apache, watch for startup warnings such as “WARNING: MaxClients of 1000 would require 40 servers, and would exceed the ServerLimit value of 16. Automatically lowering MaxClients to 400. To increase, please see the ServerLimit directive.” As mentioned before, Apache limits the MaxClients to the ServerLimit multiplied by the ThreadsPerChild.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required