A few chapters earlier in this book, we mentioned that we would cover how to actually direct traffic in a failover situation in a later chapter. Congratulations, this is that chapter! Your patience has paid off. Sadly, the wait may not be entirely worth it.
Being able to consistently direct traffic between two servers, depending on which one is marked “active,” is tremendously important for high availability in general and for many of the services that back Drupal sites specifically. HA MySQL clusters, NFS clusters, Solr clusters, and load balancing clusters all depend on this. However, it’s not the most exciting thing in the world.
The general concept is very simple: you run a daemon on two servers, and those two daemons ping each other fairly constantly. When serverA’s daemon doesn’t get a response from serverB’s daemon within a certain failure criterion, serverB is marked down and traffic is directed to serverA. The interesting issues with failover configurations are:
We will cover some of these issues here, starting with traffic direction.
In most cases, we are assuming that whatever service you are failing over is already prepared for failover (i.e., set up to be synchronized between the two servers in question).
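The ping-and-timeout idea described at the start of this chapter can be sketched in a few lines. This is only an illustration of the concept, not how Heartbeat itself is implemented; the class, the function, and the 10-second threshold are all hypothetical, and timestamps are passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class PeerMonitor:
    """Tracks heartbeats received from the other node (hypothetical helper)."""

    deadtime: float          # seconds of silence tolerated before marking the peer down
    last_seen: float = 0.0   # timestamp of the most recent heartbeat from the peer

    def record_heartbeat(self, now: float) -> None:
        """Call whenever a ping response arrives from the peer."""
        self.last_seen = now

    def peer_is_down(self, now: float) -> bool:
        """The failure criterion: no heartbeat for longer than deadtime."""
        return (now - self.last_seen) > self.deadtime


def choose_active(local: str, current: str, peer_down: bool) -> str:
    """Return the node that should receive traffic, from the local node's view."""
    if peer_down:
        return local      # the peer is marked down, so traffic is directed here
    return current        # otherwise leave traffic where it already is


# serverA watching serverB, which is currently the active node
monitor = PeerMonitor(deadtime=10.0)
monitor.record_heartbeat(now=100.0)
still_up = choose_active("serverA", "serverB", monitor.peer_is_down(now=105.0))
after_timeout = choose_active("serverA", "serverB", monitor.peer_is_down(now=111.0))
```

Five seconds after the last heartbeat, traffic stays on serverB; once more than ten seconds have passed without one, serverA takes over.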
There are two types of failover that are commonly deployed: IP-based failover and DNS-based failover. In essence, these two methodologies only differ in what “resource” is changed in order to direct traffic. IP-based failover, not surprisingly, directs traffic by moving an IP address between two machines. Its major advantages are simplicity and very “immediate” results. However, it is problematic to use in most cloud environments (usually you cannot get an extra IP assigned for the failover IP, also called the virtual IP or VIP) and can be somewhat difficult to manage for those not used to it. DNS failover, by comparison, involves changing a DNS record to point to the “new” server in order to direct traffic.
DNS failover is a fairly common method of implementing HA in cloud environments and can be easier to set up, especially with some specific DNS services designed for it. However, the failover is definitely not as immediate or dependable as IP-based methods. DNS answers are cached by resolvers for the record’s time to live (TTL), and local caches such as the Name Service Cache Daemon (NSCD) can add further failover lag. Due to this, you must plan for the possibility of a “slow” failover, where some traffic continues to hit the “old master” before the “new master” fully takes over.
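To get a feel for how “slow” a DNS failover can be, it helps to add up the delays involved. The function below is a rough, conservative upper bound rather than a measurement: it assumes the failure must first be detected, the record must then be updated, a resolver may have cached the old answer for the record's full time to live (TTL), and a local cache such as NSCD may hold that stale answer for its own lifetime on top of that. All names and numbers are hypothetical.

```python
def worst_case_dns_failover(detect_seconds, update_seconds, ttl_seconds,
                            local_cache_seconds=0):
    """Rough upper bound, in seconds, before every client sees the new record.

    Caches can stack: a resolver may hand out the old answer until its TTL
    expires, and a local cache may keep that answer for its own lifetime.
    """
    return detect_seconds + update_seconds + ttl_seconds + local_cache_seconds


# 30 s to detect the failure, 5 s to update the record,
# a 300 s TTL, and a local cache that holds entries for 600 s
slow_failover = worst_case_dns_failover(30, 5, 300, local_cache_seconds=600)
```

With these example numbers, some clients could keep hitting the old master for more than fifteen minutes, which is why short TTLs are usually part of any DNS failover plan.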
In most cases, this decision will be heavily influenced, if not entirely decided, by how open your provider is to IP failover/moving IP addresses between your servers. Some providers won’t allow you to have an additional IP address assigned to your server to use for failover, and others are nervous about IP failover confusing their switches. However, these issues are fairly rare for dedicated server providers.
One might note that much of our discussion so far has centered on host-level issues. We have discussed directing network traffic between two full hosts and checking whether hosts are up from a network and OS perspective. This is nice, but it is fairly rare (or at least it should be) for an entire host to fail. More commonly a service will fail or start responding slowly. In cases like this, an HA cluster must detect this and start a failover. Likewise, simply moving traffic from serverA to serverB sometimes isn’t enough to actually fail over a service. Many times there are other actions that need to be taken, such as setting
false on a MySQL slave.
The failover system we introduce here doesn’t really handle service-level detection and failover in and of itself, but instead relies on external services and scripts. For example, to fail over a MySQL service using Heartbeat, you could write a “mysql resource” script. This script would perform the failover and would be triggered in the event of a MySQL failover scenario. As far as monitoring is concerned, the system we will cover here can easily be triggered by an external source (i.e., it can be told to “fail over this host”). Thus, you can use another monitoring system for services and have it trigger failovers. A common choice for this is Mon (not to be confused with Monit), a very simple framework that can check for service health and then trigger Heartbeat upon a detected failure.
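As a rough illustration of an external monitor triggering a failover, the sketch below checks a TCP port and decides to fail over only after several consecutive failures. It is a stand-in for a tool like Mon, not a description of how Mon works. The function names and the threshold are hypothetical, and the path to hb_standby is an assumption that varies by distribution.

```python
import socket
import subprocess


def service_is_up(host, port, timeout=2.0):
    """Return True if a TCP connection to the service succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_fail_over(check_results, max_failures=3):
    """True only when the most recent max_failures checks all failed.

    Requiring consecutive failures avoids failing over on one slow response.
    """
    recent = check_results[-max_failures:]
    return len(recent) == max_failures and not any(recent)


def trigger_failover(command=("/usr/share/heartbeat/hb_standby",)):
    """Ask Heartbeat to hand this node's resources to the other node.

    The default path is an assumption; adjust it for your distribution.
    """
    subprocess.run(list(command), check=True)
```

A monitoring loop would append the result of `service_is_up("127.0.0.1", 3306)` to a list on each pass and call `trigger_failover()` once `should_fail_over()` returns True.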
Heartbeat has been the de facto Linux failover tool for a very long time. It is quite stable, very well supported, and decently documented. However, it comes in two “versions,” and the difference between the two can be quite confusing. Heartbeat by itself (sometimes called Heartbeat v1) is a simple failover tool that moves resources and IP addresses between two servers. It only supports two servers and is quite simple to set up and use. Heartbeat+Pacemaker (formerly called Heartbeat v2 or Heartbeat+CRM) is a full clustering suite that is quite complicated to set up and use but supports complex configurations and more than two servers. This level of complication is simply not needed for the services that back most Drupal deployments. Because of this, we will only be covering Heartbeat v1 (henceforth just called Heartbeat).
Discussion of v1 versus v2 doesn’t imply actual software changes, but just different types of configuration. The Heartbeat package and actual software are the same in both versions.
Most Linux distributions and *BSDs will have a Heartbeat package. It may also install Pacemaker, but you can mostly just ignore that unless you really need advanced clustering. Thus, to install Heartbeat you can just use one of the following commands:
# yum install heartbeat
# apt-get install heartbeat
We’ll look at some examples of these files next. In these examples, we have two servers that are load balancing MySQL, named node1 and node2.
The authkeys file needs to be the same on both servers—they both need to have the same key to successfully be a cluster. Likewise, this file needs to be secured (not group or “other” readable/writeable).
Here’s an example authkeys configuration:
auth 1
1 sha1 your_secret_key_here
You can generate a secret key like this:
dd if=/dev/urandom count=4 2>/dev/null | md5sum | cut -c1-32
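Generating the key and locking down the file can also be done in one step. The sketch below writes an authkeys file containing a random 32-character key and restricts it to owner read/write only. The function name is hypothetical, and the destination path is left to the caller because the location of authkeys depends on your installation.

```python
import os
import secrets


def write_authkeys(path, key=None):
    """Write a sha1 authkeys file readable and writeable only by its owner."""
    key = key or secrets.token_hex(16)   # 16 random bytes as 32 hex characters
    content = f"auth 1\n1 sha1 {key}\n"
    # Create the file with mode 600 so it is never group or "other" readable
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # Enforce the mode even if the file already existed with looser permissions
    os.chmod(path, 0o600)
    return key
```

The same file, with the same key, then needs to be copied to the other node over a secure channel.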
And here’s an example ha.cf configuration:
ucast eth1 <the IP of the other node>
node node1
node node2
auto_failback on
The ha.cf file can become very complicated, but for some situations, it is as simple as this. There are a few very important lines here. First, the ucast line tells Heartbeat the IP address of the other node and the interface on which to ping that node. This file differs on each node (as the IP listed in each case will be the IP of the “other” one). If this IP is incorrect, the interface is incorrect, or there is a firewall preventing this ping, Heartbeat will not work correctly: the nodes cannot communicate, so both will believe themselves to be the master, a situation often called split-brain. Equally importantly, the node lines identifying the members of the cluster must contain hostnames that actually match the output of “hostname” on each node.
The auto_failback line is actually quite important, too. If you set this to on, resources will be failed back to their “home” node whenever that node comes back from a failure. If it is set to off, you will have to fail them back manually. Having this option off is generally safer, as you can decide when you are ready to fail back to the “home” node and you avoid the possibility of resources “ping-ponging” back and forth between nodes.
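Because a mismatched node name is an easy mistake to make, a quick sanity check can help before starting Heartbeat. The sketch below pulls the names from the node lines of an ha.cf file and checks whether the local hostname is among them. The function names and the sample IP address are hypothetical, and the parsing is deliberately minimal: it only handles node lines and trailing comments.

```python
import platform


def node_names(ha_cf_text):
    """Return the hostnames listed on the node lines of an ha.cf file."""
    names = []
    for line in ha_cf_text.splitlines():
        parts = line.split("#", 1)[0].split()   # drop comments, then tokenize
        if len(parts) >= 2 and parts[0] == "node":
            names.extend(parts[1:])
    return names


def local_node_is_listed(ha_cf_text, hostname=None):
    """True if this machine's hostname matches one of the node lines exactly."""
    hostname = hostname or platform.node()
    return hostname in node_names(ha_cf_text)


sample = "ucast eth1 10.0.0.2\nnode node1\nnode node2\nauto_failback on\n"
listed = local_node_is_listed(sample, hostname="node1")
mismatch = local_node_is_listed(sample, hostname="node1.example.com")
```

Note that the match is exact: a machine whose hostname is the fully qualified node1.example.com does not match a node line that says node1.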
Finally, here’s an example haresources configuration:
node1 192.168.1.3/32/eth1 httpd mysqld <resource4> <resource5>
This file is very simple: you just list the resources that Heartbeat will be managing and which node each resource “belongs” to by default. A “resource” is either a Heartbeat-specific resource, such as IPaddr or Filesystem (file system mounts), or just an init script. So, you can manage system-level resources such as IPs and mounts, as well as services such as apache, mysql, and memcache (via their init scripts).
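To make the structure of a haresources line concrete, the sketch below splits one into its home node and its resources. It assumes two conventions: arguments to a Heartbeat-specific resource are separated by double colons, and a bare IP address is shorthand for the IPaddr resource. The function name and the Filesystem example values are hypothetical, and the parser expects a single non-empty line.

```python
def parse_haresources_line(line):
    """Split a haresources line into (home_node, [(resource, args), ...])."""
    fields = line.split()
    node, items = fields[0], fields[1:]
    resources = []
    for item in items:
        name, *args = item.split("::")
        if name[0].isdigit():
            # A bare IP address is treated as shorthand for the IPaddr resource
            resources.append(("IPaddr", [name]))
        else:
            # Anything else is a Heartbeat resource or an init script name
            resources.append((name, args))
    return node, resources


home, managed = parse_haresources_line("node1 192.168.1.3/32/eth1 httpd mysqld")
```

Here `home` is node1, and `managed` lists the virtual IP first, followed by the httpd and mysqld init scripts with no arguments.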
So what IPs might you put in haresources for failover? You should never put the main IP address for a server in this file and have it managed for failover. This would render the server unreachable upon failover. Instead, you should always assign an extra IP (often called the virtual IP or VIP) to be managed by Heartbeat. This IP will then be the one used by whatever service you want to fail over. For example, if you were trying to fail over HTTP traffic between two servers, you’d request an extra IP from your provider, set up Heartbeat to manage it between the two servers, and then use that IP in your DNS records. When the IP was failed over, traffic would then move transparently.
Once configured, using Heartbeat is quite easy. You start it via the init script (/etc/init.d/heartbeat in many cases) and can watch its progress in the system log. Once it’s fully started, you can fail over resources either by fully shutting down Heartbeat on one of the nodes or by using hb_standby or hb_takeover, two small tools that, respectively, set resources into standby mode (i.e., fail them over to the other server) or take over resources from the other server. These utilities have three main options: