Chapter 12. Failover Configuration

A few chapters earlier in this book, we mentioned that we would cover how to actually direct traffic in a failover situation in a later chapter. Congratulations, this is that chapter! Your patience has paid off. Sadly, the wait may not be entirely worth it. Being able to consistently direct traffic between two servers, depending on which one is marked “active,” is tremendously important for high availability in general and for many services that back Drupal sites specifically. HA MySQL clusters, NFS clusters, Solr clusters, and load balancing clusters all depend on it. However, it’s not the most exciting thing in the world. The general concept is very simple: you run a daemon on two servers, and those two daemons ping each other fairly constantly. When serverA’s daemon doesn’t get a response from serverB’s daemon within a certain failure threshold, serverB is marked down and traffic is directed to serverA. The interesting issues with failover configurations are:

  • What are the failure conditions?
  • Can we insert other conditions besides just a full host down? Service-level failure conditions, perhaps?
  • How do you direct traffic consistently?
  • How do you deal with split-brain?
  • Wait, what is split-brain?
  • It is not a problem with my brain specifically, is it?
  • This seems scary; can’t we just take a downtime?

We will cover some of these issues here, starting with traffic direction.

Note

In most cases, we are assuming that whatever service you are failing over is already prepared for failover (i.e., set up to be synchronized between the two servers in question).

IP Failover Versus DNS Failover

There are two types of failover that are commonly deployed: IP-based failover and DNS-based failover. In essence, these two methodologies differ only in what “resource” is changed in order to direct traffic. IP-based failover, not surprisingly, directs traffic by moving an IP address between two machines. Its major advantages are simplicity and very “immediate” results. However, it is problematic to use in most cloud environments (usually you cannot get an extra IP assigned for the failover IP, also called the virtual IP or VIP) and can be somewhat difficult to manage for those not used to it. DNS failover, by contrast, involves changing a DNS record to point to the “new” server in order to direct traffic.

DNS failover is a fairly common method of implementing HA in cloud environments and can be easier to set up, especially with DNS services specifically designed for it. However, the failover is definitely not as immediate or dependable as IP-based methods. DNS records are cached at multiple levels (resolvers, operating systems, and sometimes applications), and services such as the Name Service Cache Daemon (nscd) can add to the failover lag. Due to this, you must plan for the possibility of a “slow” failover, where some traffic continues to hit the “old master” before the “new master” fully takes over.
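If you control your own DNS zone, one partial mitigation is to keep the TTL on the record you plan to flip very low. As a minimal sketch (the zone name, hostname, and addresses here are purely illustrative), a BIND-style record might look like:

www    60    IN    A    192.0.2.10
; on failover, repoint the record to the standby (e.g., 192.0.2.11) and reload the zone

Even with a low TTL, some resolvers ignore or clamp TTL values, so treat this as reducing failover lag, not eliminating it.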

In most cases, this decision will be heavily influenced, if not entirely decided, by how open your provider is to IP failover/moving IP addresses between your servers. Some providers won’t allow you to have an additional IP address assigned to your server to use for failover, and others are nervous about IP failover confusing their switches. However, these issues are fairly rare for dedicated server providers.

Service-Level Issues

One might note that much of our discussion so far has centered on host-level issues. We have discussed directing network traffic between two full hosts and checking whether hosts are up from a network and OS perspective. This is nice, but it is fairly rare (or at least it should be) for an entire host to fail. More commonly, a service will fail or start responding slowly. In cases like this, an HA cluster must detect the problem and initiate a failover. Likewise, simply moving traffic from serverA to serverB sometimes isn’t enough to actually fail over a service. Many times there are other actions that need to be taken, such as setting read_only to false on a MySQL slave.
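To make that concrete, promoting a MySQL slave during a failover typically involves a couple of statements in addition to moving the IP. A minimal sketch (the exact steps depend on your replication setup) would be:

mysql> STOP SLAVE;
mysql> SET GLOBAL read_only = OFF;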

The failover system we introduce here doesn’t really handle service-level detection and failover in and of itself, but instead relies on external services and scripts. For example, to fail over a MySQL service using Heartbeat, you could write a “mysql resource” script that performs the MySQL-specific failover steps and is triggered whenever Heartbeat fails that resource over. As far as monitoring is concerned, the system we will cover here can easily be triggered by an external source (i.e., it can be told to “fail over this host”). Thus, you can use another monitoring system for services and have it trigger failovers. A common choice for this is Mon (not to be confused with Monit), a very simple framework that can check service health and then trigger Heartbeat upon a detected failure.
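As a rough illustration of the idea (not a drop-in configuration), a health check along these lines could be run by Mon or even cron to push resources to the other node when MySQL stops answering. The hb_standby path is the usual packaged location, but it may differ on your system:

#!/bin/sh
# Crude MySQL health check: if the local server stops responding to
# a ping, hand all Heartbeat resources over to the other node.
if ! mysqladmin ping >/dev/null 2>&1; then
    /usr/share/heartbeat/hb_standby all
fi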

Heartbeat

Heartbeat has been the de facto Linux failover tool for a very long time. It is quite stable, very well supported, and decently documented. However, it comes in two “versions,” and the difference between the two can be quite confusing. Heartbeat by itself (sometimes called Heartbeat v1) is a simple failover tool that moves resources and IP addresses between two servers. It only supports two servers and is quite simple to set up and use. Heartbeat+Pacemaker (formerly called Heartbeat v2 or Heartbeat+CRM) is a full clustering suite that is quite complicated to set up and use but supports complex configurations and more than two servers. This level of complication is simply not needed for the services that back most Drupal deployments. Because of this, we will only be covering Heartbeat v1 (henceforth just called Heartbeat).

Note

The v1 versus v2 distinction doesn’t refer to different software, just to different styles of configuration. The Heartbeat package and the actual software are the same in both versions.

Installation

Most Linux distributions and *BSDs will have a Heartbeat package. It may also install Pacemaker, but you can mostly just ignore that unless you really need advanced clustering. Thus, to install Heartbeat you can just use one of the following commands:

# yum install heartbeat
# apt-get install heartbeat

Configuration

There are three important configuration files for Heartbeat:

authkeys
This sets the type of authentication for the cluster and the hash to use for authentication.
ha.cf
This is the general configuration file, defining the two nodes of the cluster, failure timeouts, failback settings, and other major configuration options.
haresources
This defines the resources Heartbeat will be controlling. Usually this means IP resources and various services to be started/stopped during resource acquisition and failover.

We’ll look at some examples of these files next. In these examples, we have two servers that are load balancing MySQL, named node1 and node2.

Note

The authkeys file needs to be the same on both servers—they both need to have the same key to successfully be a cluster. Likewise, this file needs to be secured (not group or “other” readable/writeable).

Here’s an example authkeys configuration:

auth 1
 1 sha1 your_secret_key_here
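Assuming the usual /etc/ha.d/authkeys location, locking the file down as described in the preceding note is simply:

# chmod 600 /etc/ha.d/authkeys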

Tip

You can generate a secret key like this: dd if=/dev/urandom count=4 2>/dev/null | md5sum | cut -c1-32.

And here’s an example ha.cf configuration:

ucast eth1 <the IP of the other node>
node node1
node node2
auto_failback on

The ha.cf file can become very complicated, but for some situations, it is as simple as this. There are a few very important lines here. First, the ucast line tells Heartbeat the IP address of the other node and the interface on which to ping that node. This line differs on each node (the IP listed in each case will be the IP of the “other” one). If this IP is incorrect, the interface is incorrect, or a firewall prevents this ping, Heartbeat will not work correctly: both nodes will believe themselves to be the master because they cannot communicate, a situation often called split-brain. Equally important, the node lines identifying the members of the cluster must contain hostnames that actually match the output of the hostname command on each node.
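If you want explicit control over the failure timeouts rather than relying on Heartbeat’s defaults, a slightly fuller ha.cf sketch might look like the following (the timing values are illustrative, not recommendations for your environment):

logfacility local0
keepalive 2        # seconds between heartbeat packets
warntime 10        # log a warning after this many seconds of silence
deadtime 30        # declare the other node dead after this long
initdead 120       # extra allowance while both nodes are booting
udpport 694
ucast eth1 <the IP of the other node>
auto_failback on
node node1
node node2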

Note

The auto_failback line is actually quite important, too. If you set this to on, resources will be failed back to their “home” node whenever that node comes back from a failure. If it is set to off, you will have to fail them back manually. Having this option off is generally safer, as you can decide when you are ready to fail back to the “home” node and you avoid the possibility of resources “ping-ponging” back and forth between nodes.

Finally, here’s an example haresources configuration:

node1 192.168.1.3/32/eth1 httpd mysqld <resource4> <resource5>

This file is very simple: you just list the resources that Heartbeat will be managing and which node each resource “belongs” to by default. A “resource” is either a Heartbeat-specific resource, such as IPaddr or Filesystem (file system mounts), or just an init script. So, you can manage system-level resources such as IPs and mounts, as well as services such as apache, mysql, and memcache (via their init scripts).
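For example, a haresources line that uses the built-in IPaddr and Filesystem resource scripts explicitly (the device, mount point, and filesystem type here are made up for illustration) might look like:

node1 IPaddr::192.168.1.3/32/eth1 Filesystem::/dev/drbd0::/var/lib/mysql::ext4 mysqld

Resources are started left to right during acquisition and stopped in reverse order during release, so list them in dependency order.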

So what IPs might you put in haresources for failover? You should never put a server’s main IP address in this file and have it managed for failover; doing so would render the server unreachable upon failover. Instead, you should always assign an extra IP (often called the virtual IP or VIP) to be managed by Heartbeat. This IP is then the one used by whatever service you want to fail over. For example, if you were trying to fail over HTTP traffic between two servers, you’d request an extra IP from your provider, configure Heartbeat to manage it between the two servers, and then use that IP in your DNS records. When it failed over, traffic would move transparently.

Usage

Once configured, using Heartbeat is quite easy. You start it via the init script (/etc/init.d/heartbeat in many cases) and can watch its progress in the system log. Once it’s fully started, you can fail over resources either by fully shutting down Heartbeat on one of the nodes or by using hb_takeover or hb_standby, two small tools that either put resources into standby mode (i.e., fail them over to the other server) or take over resources from the other server. These utilities have three main options:

all
Fail over all resources.
foreign
Fail over just those resources not “owned” by the current node (this goes back to the haresources file).
local
Fail over only those resources “owned” by the current node.

These small utilities are also how you integrate Heartbeat with an external monitoring system—you have it call them when a failure is detected.
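On most distributions these utilities are installed under Heartbeat’s shared directory rather than on the default PATH, so a manual failover looks something like this (the exact path can vary by package):

# /usr/share/heartbeat/hb_standby all

Run that on the node currently holding the resources to hand everything to its partner; running hb_takeover all on the other node accomplishes the same thing from the opposite direction.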
