Chapter 4. Replication for High Availability

Buying expensive machines known for their reliability and ensuring that you have a really good UPS in case of power failures should give you a highly available system. Right?

Well, high availability is actually not that easy to achieve. To have a system that is truly available all the time, you have to plan carefully for any contingency and ensure that you have redundancy to handle failing components. True high availability—a system that does not go down even in the most unexpected circumstances—is hard to achieve and very costly.

The principles for achieving high availability are simple enough. You need to have three things:

Redundancy

If a component fails, you have to have a replacement for it. The replacement can be either idly standing by or part of the existing deployment.

Contingency plans

If a component fails, you have to know what to do. This depends on which component failed and how it failed.

Procedure

If a component fails, you have to be able to detect it and then execute your plans swiftly and efficiently.

If a system has a single point of failure—a single component that, should it fail, will cause the entire system to fail—it puts a severe limit on your ability to achieve high availability. This means that one of your first goals is to locate these single points of failure and ensure you have redundancy for them.

Redundancy

To understand where redundancy might be needed, you have to identify every potential point of failure in the deployment. Even though it sounds easy—not to mention a tad tedious and boring—it requires some imagination to ensure that you really have found them all. Switches, routers, network cards, and even network cables are single points of failure. Outside of your architecture, but no less important, are power sources and physical facilities. But what about services needed to keep the deployment up? Suppose all network management is consolidated in a web-based interface? Or what if you have only one staff person who knows how to handle some types of failure?

Identifying the points of failure does not necessarily mean that you have to eliminate them all. Sometimes it is just not possible for economical, technical, or geographic reasons, but being aware of them helps you with planning.

Some things that you should consider, or at least make a conscious decision about whether to consider, are the cost of duplicating components, the probability of failure for different components, the time to replace a component, and the risk exposure while repairing a component. If repairing a component takes a week and you are running with the spare as the single point of failure during this time, you are taking a certain risk that the spare could be lost as well, which may or may not be acceptable.

Once you have identified where you need redundancy, you have to choose between two fundamental alternatives: you can either keep duplicates around for each component—ready to take over immediately if the original component should fail—or you can ensure you have extra capacity in the system so that if a component fails, you can still handle the load. This choice does not have to be made in an all-or-nothing fashion: you can combine the two techniques so that you duplicate some components and use extra capacity for some other parts of the system.

On the surface, the easiest approach is to duplicate components, but duplication is expensive. You have to leave a standby around and keep it up-to-date with the main component all the time. The advantages of duplicating components are that you do not lose performance when switching and that switching to the standby is usually faster than restructuring the system, which you would have to do if you approached the problem by creating spare capacity.

Creating spare capacity lets you use all the components for running the business, possibly allowing you to handle higher peaks in your load. When a component breaks, you restructure the system so that all remaining components are in use. It is, however, important to have more capacity than you normally need.

To understand why, consider a simple case where you have a master that handles the writes—actually, you should have two, since you need to have redundancy—with a set of slaves connected to the master whose only purpose is to serve read requests.

Should one of the slaves fail, the system will still be able to respond, but with reduced capacity. If you have 10 slaves, each running at 50 percent capacity, the failure of one slave increases the load on each remaining slave to about 55 percent, which is easy to handle. However, if the slaves are running at 95 percent capacity and one of them fails, each remaining server would have to run at about 105 percent of its capacity to carry the same load, which is clearly not possible. In this case, the read capacity of the system will be reduced and response times will be longer.

And planning for the loss of one server is not sufficient: you have to consider the probability of losing more than one server and prepare for that situation as well. Continuing with our previous example, even if each server is running at 80 percent capacity, the system will be able to handle the loss of one server. However, the loss of two servers means that the load on each remaining server will increase to 100 percent, leaving you with no room for unexpected bursts in traffic. If this occurs once a year, it might be manageable, but you have to know how often it is likely to happen.

Table 4-1 gives example probabilities for losing 1, 2, or 3 servers in a setup of 100 servers, given different probabilities of losing a single server. As you can see, with a 1 percent probability of losing a server, you have a 16 percent risk of losing three or more servers. If you are not prepared to handle that, you’re in for some problems if it actually happens.

Note

For a stochastic variable X representing the number of servers lost, the probabilities are calculated using the binomial tail distribution:

P(X >= k) = sum from i = k to n of C(n, i) * p^i * (1 - p)^(n - i)

where n is the number of servers and p is the probability of losing a single server.

Table 4-1. Probabilities of losing servers

Probability of losing a single server    1 server    2 servers    3 servers
1.00%                                    100.00%     49.50%       16.17%
0.50%                                    50.00%      12.38%       2.02%
0.10%                                    10.00%      0.50%        0.02%
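
If you want to redo this calculation for your own deployment, the numbers are easy to compute. The following is a minimal Python sketch, not tied to any MySQL tool, that computes both the exact binomial tail and the simpler upper bound C(n, k)p^k; the figures in Table 4-1 appear to match the latter, which is close to the exact tail value when the expected number of failures is small.

from math import comb  # Python 3.8+

def prob_at_least(n, k, p):
    """Exact binomial tail: the probability of losing at least k out of n
    servers when each server fails independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def upper_bound(n, k, p):
    """Simple upper bound C(n, k) * p^k on the same probability."""
    return comb(n, k) * p**k

for p in (0.01, 0.005, 0.001):
    for k in (1, 2, 3):
        print("p=%.3f k=%d exact=%6.2f%% bound=%6.2f%%"
              % (p, k, 100 * prob_at_least(100, k, p),
                 100 * upper_bound(100, k, p)))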

To avoid such a situation, you have to monitor the deployment closely to know what the load is, figure out the capacity of your system through measurements, and do your math to see where the response times will start to suffer.

Planning

Having redundancy is not sufficient; you also need to have plans for what to do when the components fail. In the previous example, it is easy to handle a failing slave, since new connections will be redirected to the working slaves, but consider the following:

  • What happens with the existing connections? Just aborting and returning an error message to the user is probably not a good idea. Typically, there is an application layer between the user and the database, so in this case the application layer has to retry the query with another server.

  • What happens if the master fails? In the previous example, only the slaves failed, but the master can also fail. Assuming you have added redundancy by keeping an extra master around (we will cover how to do that later in the chapter), you must also have plans for moving all the slaves over to the new master.

This chapter will cover some of the techniques and topologies that you can use to handle failures of MySQL servers in their various roles. There are basically three types of failures to consider: slave failures, master failures, and relay failures. Slave failures are failures of slaves that are used only for read scale-out. Slaves that also act as masters are relay slaves and need special care. Master failures are the most important to handle quickly, since the deployment cannot accept any writes until the master is restored.

Slave Failures

By far, the easiest failures to handle are slave failures. Since the slaves are only used for read queries, it is sufficient to inform the load balancer that the slave is missing, which will direct new queries to the functioning slaves. There have to be enough slaves to handle the reduced capacity of the system, but apart from that, a failing slave does not normally affect the replication topology and there are no specific topologies that you need to consider to make slave failure easier to manage.

When a slave has failed, there are inevitably some queries that have been sent to the slave that are waiting for a reply. Once these connections report an error resulting from a lost server, the queries have to be repeated with a functioning slave.
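
How the retry is implemented depends on the application, but the pattern is always the same: catch the error from the lost connection, ask the load balancer for another slave, and reissue the query. The following Python sketch illustrates the idea; get_slave_connection and report_failed_slave are hypothetical helpers standing in for whatever interface your load balancer provides.

def run_read_query(query, params=(), max_attempts=3):
    # get_slave_connection() is a hypothetical helper returning a DB-API
    # connection to a slave picked by the load balancer;
    # report_failed_slave() is a hypothetical helper that removes the
    # failed slave from the pool.
    for attempt in range(max_attempts):
        conn = get_slave_connection()
        try:
            cur = conn.cursor()
            cur.execute(query, params)
            return cur.fetchall()
        except Exception as err:            # a real application would catch
            report_failed_slave(conn, err)  # the connector's error class here
    raise RuntimeError("no working slave after %d attempts" % max_attempts)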

Master Failures

If the master fails, it has to be replaced to keep the deployment up, and it has to be replaced quickly. The moment the master fails, all write queries will be aborted, so the first thing to do is to get a new master available and direct all clients to it.

Since the main master failed, all the slaves are now without a master as well, meaning that all the slaves have stale data, but they are still up and can reply to read queries.

However, some queries may block if they are waiting for changes to arrive at the slave. Changes that made it into the relay log of the slave will eventually be executed, so no special consideration has to be taken on behalf of queries waiting for them.

For queries that are waiting for events that did not leave the master before it crashed, the situation is bleaker. In this case, it is necessary to ensure they are handled. This usually means they are reported as failures, so the user will have to reissue the query.

Relay Failures

For servers acting as relay servers, the situation has to be handled specially. If they fail, the remaining slaves have to be redirected to use some other relay or the master itself. Since the relay has been added to relieve the master of some load, it is likely that the master will not be able to handle the load of a batch of slaves connected to one of its relays.

Disaster Recovery

In the world of high availability, “disaster” does not have to mean earthquakes or floods; it just means that something went very wrong and that the failure is not local to a single machine.

A typical example is losing power in the data center—not necessarily because power was lost in the whole city; losing power in the building is sufficient. For a real-life story, and some insight into what happened to MySQL, see the section The mysql.com Outage.

The nature of a disaster is that many things fail at once, making it impossible to handle redundancy by duplicating servers at a single data center. Instead, it is necessary to ensure data is kept safe at another geographic location, and it is quite common for companies to have different components at different offices, even when the company is relatively small.

Procedures

After you have eliminated all single points of failure, ensured you have sufficient redundancy for the system, and made plans for every contingency, you should be ready for the last step.

All your resources and careful planning are of no use unless you can wield them properly. You can usually manage a small site with a few servers manually with very little planning, but as the number of servers increases, automation becomes a necessity—and if you run a successful business, the number of servers might have to increase quickly.

You’re likely better off if you plan from day one to have automation—if you have to grow, you will be busy handling other matters and will probably not have time to create the necessary automation support.

Some of the basic procedures have already been discussed, but you need to consider having ready-made procedures for at least the following tasks:

Adding new slaves

Creating new slaves when you need to scale is the basis for running a big site. There are several options for creating new slaves. They all circle around methods for taking a snapshot of an existing server, usually a slave, restoring the snapshot on a new server, and then starting replication from the correct position.

The time it takes to create the snapshot will, of course, affect how quickly you can bring the new slave up: while the snapshot is being taken, the master keeps making changes that the new slave will have to catch up with, so the longer the snapshot takes, the more there is to catch up with. For this reason, the snapshot time is important. Figure 4-1 shows how the outstanding changes build up while a snapshot is taken and are worked off afterward. When the slave is stopped to take the snapshot, changes start to accumulate, so the number of outstanding changes grows. Once the slave is restarted, it begins to apply the outstanding changes, and the backlog shrinks until the slave has caught up.

Some different methods of taking a snapshot include the following:

Using mysqldump

Using mysqldump is safe but slow. It allows you to restore the data using a different storage engine than that on the original. If you use InnoDB tables, you can take a consistent snapshot, meaning you do not have to bring the server offline.

Copying the database files

This is relatively fast, but requires you to bring the server offline before copying the files.

Using an online backup method

There are different methods available, such as the InnoDB Hot Backup.

Using LVM to get a snapshot

On Linux, it is possible to take a snapshot of a volume using Logical Volume Manager (LVM). It does require that you prepare beforehand, since a special LVM volume has to be created.

Using filesystem snapshot methods

Solaris ZFS, for example, has built-in support for taking snapshots. This is a very fast technique for creating backups, but like the other techniques above (except for mysqldump), it does not allow you to restore the data into a different storage engine than the one used when the snapshot was taken.

Techniques for creating new slaves are covered in Chapter 2, and the different backup methods are covered in Chapter 12.

Figure 4-1. Outstanding changes when taking a snapshot
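
As an illustration of the first method, the following sketch creates a new slave from a mysqldump snapshot. To keep it short, the snapshot is taken directly from the master rather than from an existing slave, and all host names, accounts, and passwords are made up; the mysqldump option --master-data=1 embeds an active CHANGE MASTER TO statement with the binlog position of the snapshot into the dump.

import subprocess

def clone_slave_from_master():
    # Take a consistent snapshot. --single-transaction avoids taking the
    # server offline (InnoDB tables), and --master-data=1 records the
    # binlog file and position of the snapshot in the dump itself.
    subprocess.check_call(
        "mysqldump --host=master --user=backup_user --password=xyzzy "
        "--single-transaction --master-data=1 --all-databases > snapshot.sql",
        shell=True)

    # Set the master host and replication credentials on the new slave
    # first. The CHANGE MASTER TO embedded in the dump specifies only the
    # binlog file and position, so these settings are preserved.
    subprocess.check_call(
        'mysql --host=new-slave --user=root --password=xyzzy --execute='
        '"CHANGE MASTER TO MASTER_HOST=\'master\', '
        'MASTER_USER=\'repl_user\', MASTER_PASSWORD=\'xyzzy\'"',
        shell=True)

    # Restore the snapshot (this runs the embedded CHANGE MASTER TO) and
    # start replicating from the recorded position.
    subprocess.check_call(
        "mysql --host=new-slave --user=root --password=xyzzy < snapshot.sql",
        shell=True)
    subprocess.check_call(
        "mysql --host=new-slave --user=root --password=xyzzy "
        "--execute='START SLAVE'",
        shell=True)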

Removing slaves from the topology

Removing slaves from the setup only requires notifying the load balancer that the slave is absent. An example load balancer—with methods for adding and removing servers—can be found in Chapter 5.

Switching the master

For routine maintenance, it is common to have to switch all the slaves of a master over to a secondary master as well as notify load balancers of the master’s absence. This procedure can and should be handled with no downtime at all, since it should not affect normal operations.

Using slave promotion (described later in this chapter) is one way to handle this, but it might be easier to use a hot standby instead (also covered later in this chapter).

Handling slave failures

Your slaves will fail—it is just a matter of how often. Handling slave failures must be a routine event in any deployment. It is only necessary to detect that the slave is absent and remove it from the load balancer’s pool, as described in Chapter 5.

Handling master failures

When the master goes down suddenly, you have to detect the failure and move all the slaves over to a standby, or promote one of the slaves to be the new master. Techniques for this are described later in this chapter.

Upgrading slaves

Upgrading slaves to new versions of the server should usually not be a problem. However, bringing the slave out of the system for the upgrade requires removing it from the load balancer and maybe notifying other systems of the slave’s absence.

Upgrading masters

Upgrading the master usually requires upgrading all the slaves first, although this may not always be the case. It also usually requires either using a standby as a master while you are performing the upgrade or promoting one of the slaves to be the master for the duration of the upgrade.

Hot Standby

The easiest of the topologies for duplicating servers is the hot standby topology. This topology is shown in Figure 4-2 and consists of the master and a dedicated server called a hot standby that duplicates the main master. The hot standby server is connected to the master as a slave, and it reads and applies all changes.

Figure 4-2. Master with a hot standby

The idea is that when the main master fails, the hot standby provides a faithful replica of the master, and all the clients and slaves can therefore be switched over to the hot standby and continue operating. As with many ideas, the reality is not always that rosy.

Failure is inevitable, at least when you run a large deployment. It is not a question of if servers fail, but when and how often they fail. If the master fails for any reason, it should not bring the deployment to a halt. To ensure operations proceed, it is necessary to have a hot standby server available and to redirect all slaves to the hot standby when the main master fails. This will give you a chance to check what happened with the main master, and maybe fix it or replace it. After you have repaired the master, you have to bring it back on track and either set it to be the hot standby, or redirect the slaves to the original master again.

Sounds simple, doesn’t it? Ah, if only it were that easy—unfortunately, you have the following potential problems to ponder:

  • When failing over to the hot standby, you are replicating from a new master, so it will be necessary to translate the binlog positions from those of the original master to those of the hot standby.

  • When failing over a slave to a hot standby, the hot standby might actually not have all the changes that the slave has.

  • When bringing the repaired master back into the configuration, the repaired master might have changes in the binary log that never left the server.

All these are relevant issues, but for starters, let’s just consider the simpler case illustrated in Figure 4-3: that of performing a switchover from a running master to a standby in order to, for example, perform maintenance on the original master. In this case, the master is still running, so the situation becomes a lot simpler, since we can control the master and make it work for us instead of against us. We will later consider how to handle the case when the master just goes down because its software crashed, a frustrated coworker decided to kick the server, or the janitor tripped over the power cord.

Figure 4-3. Switching over from a running master to a standby

By default, events executed by the slave thread are not logged to the binary log, which turns out to be a problem if the slave is a standby waiting to be used as a master. In this case, it is necessary to have all the changes sent by the master to the standby written to the binary log of the standby—if not, there will be nothing to replicate. To configure the standby server for this, add the option log-slave-updates to the my.cnf file. This option ensures that statements received from the master and executed are also written to the slave’s binary log.

[mysqld]
user            = mysql
pid-file        = /var/run/mysqld/mysqld.pid
socket          = /var/run/mysqld/mysqld.sock
port            = 3306
basedir         = /usr
datadir         = /var/lib/mysql
tmpdir          = /tmp
log-bin         = master-bin
log-bin-index   = master-bin.index
server-id       = 1
log-slave-updates

After updating the options file, restart the server.

The main problem with switching over to a standby in this case is to perform the switchover in such a way that the slave starts replicating from the standby at precisely the point where it stopped replicating from the master. If the positions were easy to translate—for example, if the positions were the same on both the master and the standby—we would not have a problem. Unfortunately, the positions may differ on the master and the standby for a number of reasons. The most common one is that the standby was not attached to the master when the master was started, but even if it was, events are not guaranteed to be written to the binary log on the standby at the same positions as they were written to the binary log on the master.

The basic idea for performing the switchover is to stop the slave and the standby at exactly the same position and then just redirect the slave to the standby. Because the standby hasn’t made any changes after the position where you stopped it, you can just check the binlog position on the standby and direct the slave to start at that position. However, just stopping the slave and the standby will not guarantee that they are synchronized, so you have to do this manually.

To do this, stop both the slave and the standby and compare the binlog positions. Since both positions refer to positions on the same master—the slave and standby are both connected to the same master—you can check the positions just by comparing the filename and the byte position lexicographically (in that order).
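
In code, this comparison is easiest to express on (file, position) pairs, remembering that the byte position has to be compared as a number, not as a string. The following sketch shows what such a helper could look like; the sql() call is assumed to behave like the Replicant library used later in Example 4-1 and to return the SHOW SLAVE STATUS row as a dictionary.

def fetch_slave_position(server):
    # Return the position in the binary log of the master that the slave
    # has executed up to, as a (file, position) pair.
    row = server.sql("SHOW SLAVE STATUS")
    return (row['Relay_Master_Log_File'], int(row['Exec_Master_Log_Pos']))

def compare_positions(slave, standby):
    # Python compares tuples element by element: first the filename
    # (lexicographically), then the numeric byte position.
    slave_pos = fetch_slave_position(slave)
    standby_pos = fetch_slave_position(standby)
    if slave_pos < standby_pos:
        return "slave is behind the standby"
    elif slave_pos > standby_pos:
        return "standby is behind the slave"
    return "slave and standby are at the same position"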

standby> SHOW SLAVE STATUS\G
     ...
 Relay_Master_Log_File: master-bin.000096
     ...
   Exec_Master_Log_Pos: 756648
1 row in set (0.00 sec)

slave> SHOW SLAVE STATUS\G
     ...
 Relay_Master_Log_File: master-bin.000096
     ...
   Exec_Master_Log_Pos: 743456
1 row in set (0.00 sec)

In this case, the standby is ahead of the slave, so just write down the slave position of the standby and start the slave to run until it has caught up with the standby. To have the slave catch up with the standby and stop at the right position, use the START SLAVE UNTIL command as we did when stopping the reporting slave earlier in this chapter:

slave> START SLAVE UNTIL
     ->   MASTER_LOG_FILE = 'master-bin.000096',
     ->   MASTER_LOG_POS =  756648;
Query OK, 0 rows affected (0.18 sec)

slave> SELECT MASTER_POS_WAIT('master-bin.000096',  756648);
Query OK, 0 rows affected (1.12 sec)

The slave and standby have now stopped at exactly the same position, and everything is ready to do the switchover to the standby using CHANGE MASTER TO to direct the slave to the standby and start it. But what position should you specify? Since the file and position that the master recorded for its stopping point are different from the file and position recorded by the standby for the same point, it is necessary to fetch the position that the standby recorded while recording the changes as a master. To do this, execute SHOW MASTER STATUS on the standby:

standby> SHOW MASTER STATUS\G
*************************** 1. row ***************************
            File: standby-bin.000019
        Position: 56447
    Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)

Now you can redirect the slave to the standby using the correct position:

slave> CHANGE MASTER TO
     ->   MASTER_HOST = 'standby-1',
     ->   MASTER_PORT = 3306,
     ->   MASTER_USER = 'repl_user',
     ->   MASTER_PASSWORD = 'xyzzy',
     ->   MASTER_LOG_FILE = 'standby-bin.000019',
     ->   MASTER_LOG_POS = 56447;
Query OK, 0 rows affected (0.18 sec)

slave> START SLAVE;
Query OK, 0 rows affected (0.25 sec)

If the opposite is true—that the slave is ahead of the standby—you can just switch the roles of the standby and the slave in the previous steps. This is possible since the master is running and can provide either the slave or the standby with the missing changes. In the next section, we will consider how to handle the situation in which the master has stopped unexpectedly and hence cannot provide either the slave or the standby with the missing changes.

Handling a switchover in Python

Example 4-1 shows the Python code for switching a slave over to another master. The replicate_to_position function instructs a server to read from the master only to the given position. When the procedure returns, the slave will have stopped at exactly this position. The switch_to_master directs a slave to a new master. The procedure assumes that both the server on which it executes and the new master are connected to the same original master. If they are not, the positions are not comparable and the procedure will raise an exception.

Example 4-1. Procedure for switching to a new master

def replicate_to_position(server, pos):
    # Run the slave until the given master position is reached and wait
    # until all events up to that position have been executed.
    server.sql("START SLAVE UNTIL MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
               (pos.file, pos.pos))
    server.sql("SELECT MASTER_POS_WAIT(%s,%s)", (pos.file, pos.pos))


def switch_to_master(server, standby):
    # Stop both servers and compare how far each has replicated from the
    # common master; let whichever is behind catch up with the other.
    stop_slave(server)
    stop_slave(standby)
    server_pos = fetch_slave_position(server)
    standby_pos = fetch_slave_position(standby)
    if server_pos < standby_pos:
        replicate_to_position(server, standby_pos)
    elif server_pos > standby_pos:
        replicate_to_position(standby, server_pos)
    # Both servers have now executed the same events; redirect the slave to
    # the corresponding position in the standby's binary log and restart.
    master_pos = fetch_master_position(standby)
    change_master(server, standby, master_pos)
    start_slave(standby)
    start_slave(server)

Dual Masters

One frequently mentioned setup for high availability is the dual masters topology. In this setup, two masters replicate each other to keep both current. This setup is very simple to use since it is symmetric. Failing over to the standby master does not require any reconfiguration of the main master, and failing back to the main master again when the standby master fails in turn is very easy.

Servers can be either active or passive. If a server is active, it means that it accepts writes, which are likely to be propagated elsewhere using replication. If a server is passive, it does not accept writes and just follows the active master, usually in order to be ready to take over when the active master fails.

When using dual masters, there are two different setups, each serving a different purpose:

Active-active

In an active-active setup, writes go to both servers, and each server then transfers its changes to the other master.

Active-passive

In this setup, one of the masters, called the active master, handles writes while the other server, called the passive master, just keeps current with the active master.

This is almost identical to the hot standby setup, but since it is symmetric, it is easy to switch back and forth between the masters, each taking turns being the active master.

Note that this setup does not necessarily let the passive master answer queries. For some of the solutions that you’ll see in this section, the passive master is a cold standby.

These setups do not necessarily mean that replication is used to keep the servers synchronized—there are other techniques that can serve that purpose. Some techniques can support active-active masters, while other techniques can only support active-passive masters.

The most common use of an active-active dual masters setup is to have the servers geographically close to different sets of users, for example, in offices at different places in the world. The users can then work with the local server, and the changes will be replicated over to the other master so that both masters are kept in sync. Since the transactions are committed locally, the system will be perceived as more responsive. It is important to understand, however, that because transactions are committed locally, the two masters are not always consistent in the sense of having the same information. The changes committed to one master will be propagated to the other master eventually, but until that has been done, the masters have inconsistent data.

This has two main consequences that you need to be aware of:

  • If the same information is updated on the two masters—for example, a user is accidentally added to both masters—there will be a conflict between the two updates and it is likely that replication will stop.

  • If a crash occurs while the two masters are inconsistent, some transactions will be lost.

To some extent, you can avoid the problem with conflicting changes by allowing writes to only one of the servers, thereby making the other master a passive master. This is called an active-passive setup—where the active server is called the primary and the passive server is called the secondary.

Losing transactions when the server crashes is an inevitable result of using asynchronous replication, but depending on the application, it does not necessarily have to be a serious problem. You can limit the number of transactions that are lost when the server crashes by using a new feature in MySQL 5.5 called semisynchronous replication. The idea behind semisynchronous replication is that the thread committing a transaction will block until at least one slave acknowledges that it has received the transaction. Since the events for the transaction are sent to the slave after the transaction has been committed to the storage engine, the number of lost transactions can be kept down to at most one per thread.

Similar to the active-active approach, the active-passive setup is symmetrical and therefore allows you to switch easily from the main master to the standby and back. Depending on the way you handle the mirroring, it may also be possible to use the passive master for administrative tasks such as upgrading the server and use the upgrade server as the active master once the upgrade is finished without any downtime at all.

One fundamental problem that has to be resolved when using an active-passive setup is the risk of both servers deciding that they are the primary master—this is called the split-brain syndrome. This can occur if network connectivity is lost for a brief period, long enough to have the secondary promote itself to primary, but then the primary is brought online again. If changes have been made to both servers while they are both in the role of primary, there may be a conflict. In the case of using a shared disk, simultaneous writes to the disks by two servers are likely to cause “interesting” problems with the database—that is, probably disastrous and difficult to pinpoint.

Shared disks

A straightforward dual masters approach is shown in Figure 4-4, where a pair of masters is connected using a shared disk architecture such as a SAN (storage area network). In this approach, both servers are connected to the same SAN and are configured to use the same files. Since one of the masters is passive, it will not write anything to the files while the active master is running as usual. If the main server fails, the standby will be ready to take over.

Figure 4-4. Dual masters using a shared disk

The advantage of this approach is that since the binlog files are stored on a shared disk, there is no need for translating binlog positions. The two servers are truly mirror images of each other, but they are running on two different machines. This means that switching over from the main master to the standby is very fast. There is no need for the slaves to translate positions to the new master; all that is necessary is to note the position where the slave stopped, issue a CHANGE MASTER command, and start replication again.

When you fail over using this technique, you have to perform recovery on the tables, since it is very likely updates were stopped midstream. Each storage engine behaves differently in this situation. For example, InnoDB has to perform a normal recovery from the transaction log, as it would in the event of a crash, whereas if you use MyISAM you probably have to repair the tables before being able to continue operation. Of these two choices, InnoDB is preferred because recovery is significantly faster than repairing a MyISAM table.

Example 4-2 shows a Python script for handling such a failover using the Replicant library. Notice that the position uses the server ID of the main server, but since both servers are using the same files, the standby server is really a mirror image of the main server. Since the position contains the server ID as well, this will also catch any mistakes made by the user, such as passing a master that is not a mirror image of the main master.

Example 4-2. Procedure to remaster a slave when using a shared disk

def remaster_slave(slave, master):
    position = fetch_slave_position(slave)
    change_master(slave, master, position)

The ability to set up dual masters using shared disks is dependent on the shared storage solution used, a discussion that is beyond the scope of this book.

The problem with using shared storage is that since the two masters are using the same files for storing data, you have to be very careful when doing any administrative tasks on the passive master. Overwriting the configuration files, even by mistake, can be fatal.

The handling of split-brain syndrome depends on which shared disk solution is used and is beyond the scope of this book. One example, however, occurs when using SCSI, which has support for reserving disks by servers. This allows a server to detect that it is really not the primary anymore by noticing that the disks are reserved by another server.

Replicated disks using DRBD

The Linux High Availability project contains a lot of useful tools for maintaining high availability systems. Most of these tools are beyond the scope of this book, but there is one tool that is interesting for our purposes: DRBD (Distributed Replicated Block Device), which is software for replicating block devices over the network.

Figure 4-5 shows a typical setup of two nodes where DRBD is used to replicate a disk to a secondary server. The setup creates two DRBD block devices, one on each node, which in turn write the data to the real disks. The two DRBD processes communicate over the network to ensure any changes made to the primary are replicated over to the secondary. To the MySQL server, the device replication is transparent. The DRBD devices look and behave like normal disks, so no special configuration is needed for the servers.

Figure 4-5. Using DRBD to replicate disks

You can only use DRBD in an active-passive setup, meaning that the passive disk cannot be accessed at all. In contrast with the shared disk solution outlined earlier and the bidirectional replication implementation described later in this chapter, the passive master cannot be used—not even for pure read-only tasks.

Similar to the shared disk solution, DRBD has the advantage of not needing to translate positions between the two masters since they share the same files. However, failing over to the standby master takes longer than in the shared disk setup described earlier.

For both the shared disk and the DRBD setup, it is necessary to perform recovery of the database files before bringing the servers online. Since recovery of MyISAM tables is quite expensive, it is recommended that you use a transactional engine with good recovery performance for the database tables—InnoDB is the proven solution in this case, but other transactional engines such as PBXT are maturing quickly, so investigating the alternatives is well-invested time.

Since the mysql database contains strictly MyISAM tables, you should, as a general principle, avoid unnecessary changes to these tables during normal operations. It is, of course, impossible to avoid when you need to perform administrative tasks.

One advantage of DRBD over shared disks is that for the shared disk solution, the disks actually provide a single point of failure. Should the network to the shared disk array go down, it is possible that the server will not work at all. In contrast, replicating the disks means that the data is available on both servers, which reduces the risk of a total failure.

DRBD also has support built in to handle split-brain syndrome and can be configured to automatically recover from it.

Bidirectional replication

When using dual masters in an active-passive setup, there are no significant differences compared to the hot standby solution outlined earlier. However, in contrast to the other dual-masters solutions outlined earlier, it is possible to have an active-active setup (shown in Figure 4-6).

Figure 4-6. Bidirectional replication

Although controversial in some circles, an active-active setup does have its uses. A typical case is when there are two offices working with local information in the same database—for example, sales data or employee data—and want low response times when working with the database, while ensuring the data is available in both places. In this case, the data is naturally local to each office—for example, each salesperson is normally working with his own sales and rarely, if ever, makes changes to another salesperson’s data.

Use the following steps to set up bidirectional replication:

  1. Ensure both servers have different server IDs.

  2. Ensure both servers have the same data (and that no changes are made to either system until replication has been activated).

  3. Create a replication user and prepare replication (using the information in Chapter 1) on both servers.

  4. Start replication on both servers.

Warning

When using bidirectional replication, be forewarned that replication includes no concept of conflict resolution. If both servers update the same piece of data, you will have a conflict that may or may not be noticed. If you are lucky, replication will stop at the offending statement, but you shouldn’t count on it. If you intend to have a high availability system, you should ensure at the application level that two servers do not try to update the same data.

Even if data is naturally partitioned—as in the example given previously with two offices in separate locations—it is critical to put provisions in place to ensure data is not accidentally updated at the wrong server.

In this case the application has to connect to the server responsible for the employee and update the information there, not just update the information locally and hope for the best.
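
One way to enforce this in the application is to route every update through a lookup that maps the entity (in this example, the office an employee belongs to) to the master responsible for it. The following sketch is purely illustrative: the office-to-master mapping, the connect_to helper, and the Employee table layout are assumptions for the example.

# Hypothetical mapping from office to the master responsible for its data.
MASTER_FOR_OFFICE = {
    'Stockholm': 'master-sweden.example.com',
    'Seattle': 'master-usa.example.com',
}

def set_employee_office(uid, current_office, new_office):
    # Always send the update to the master responsible for the employee,
    # never to whichever master happens to be local.
    conn = connect_to(MASTER_FOR_OFFICE[current_office])  # hypothetical helper
    cur = conn.cursor()
    cur.execute("UPDATE Employee SET office = %s WHERE uid = %s",
                (new_office, uid))
    conn.commit()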

If you want to connect slaves to either of the servers, you have to ensure the log-slave-updates option is enabled. Since the other master is also connected as a slave, an obvious question is: what happens with events that the server sends out when they return to the server?

When replication is running, the server ID of the server that created the event is attached to each event. This server ID is then propagated further when the slave writes the event to its binary log. When a server sees an event with the same server ID as its own server ID, that event is simply skipped and replication proceeds with the next event.

Sometimes, you want to process the event anyway. This might be the case if you have removed the old server and created a new one with the same server ID and you are in the process of performing point-in-time recovery (PITR). In those cases, it is possible to disable this check using the replicate-same-server-id configuration variable. However, to prevent you from shooting yourself in the foot, you cannot set this option at the same time as log-slave-updates. Otherwise, it would be possible to send events in a circle and quickly thrash all the servers. To prevent that from happening, it is not possible to forward events when using replicate-same-server-id.

When using an active-active setup, there is a need to handle conflicts in a safe way, and by far the easiest way—and the only recommended way to handle an active-active setup—is to ensure the different active servers write to different areas.

One possible solution is to assign different databases—or different tables—to different masters. Example 4-3 shows a setup that uses two different tables, each updated by different masters. To make it easy to view the split data, a view is created that combines the two tables.

Example 4-3. Different tables for different offices

CREATE TABLE Employee_Sweden (
   uid INT AUTO_INCREMENT PRIMARY KEY,
   name VARCHAR(20)
);

CREATE TABLE Employee_USA (
   uid INT AUTO_INCREMENT PRIMARY KEY,
   name VARCHAR(20)
);


-- This view is used when reading from the two tables simultaneously.
CREATE VIEW Employee AS
    SELECT 'Swe', uid, name FROM Employee_Sweden
  UNION
    SELECT 'USA', uid, name FROM Employee_USA;

This approach is best to use if the split is natural in that, for example, different offices have different tables for their local data and the data only needs to be combined for reporting purposes. This might seem easy enough, but the following issues can complicate usage and administration of the tables:

Reads and writes to different tables

Because of the way the view is defined, you cannot update it. Writes have to be directed at the real tables, while reads can either use the view or read directly from the tables.

It might therefore be necessary to introduce application logic to handle the split into reads and writes that go to different tables.

Accurate and current data

Since the two tables are managed by different sites, simultaneous updates to the two tables will cause the system to temporarily enter a state where both servers have information that is not available on the other server. If a snapshot of the information is taken at this time, it will not be accurate.

If accurate information is required, generate methods for ensuring the information is accurate. Since such methods are highly application-dependent, they will not be covered here.

Optimization of views

When using views, two techniques are available to construct a result set. In the first method—called MERGE—the view is expanded in place, optimized, and executed as if it was a SELECT query. In the second method—called TEMPTABLE—a temporary table is constructed and populated with the data.

If the server uses a TEMPTABLE view, it performs very poorly, whereas the MERGE view is close to the corresponding SELECT. MySQL uses TEMPTABLE whenever the view definition does not have a simple one-to-one mapping between the rows of the view and the rows of the underlying table—for example, if the view definition contains UNION, GROUP BY, subqueries, or aggregate functions—so careful design of the views is paramount for getting good performance.

In either case, you have to consider the implications of using a view for reporting, since it might affect performance.

If each server is assigned separate tables, there will be no risk of conflict at all since updates are completely separated. However, if all the sites have to update the same tables, you will have to use some other scheme.

The MySQL server has special support for handling this situation in the form of two server variables:

auto_increment_offset

This variable controls the starting value for any AUTO_INCREMENT column in a table. This is the value that the first row inserted into the table gets for the AUTO_INCREMENT column. For subsequent rows, the value is calculated using auto_increment_increment.

auto_increment_increment

This is the increment used to compute the next value of an AUTO_INCREMENT column.

Note

There are session and global versions of these two variables, and they affect all tables on the server, not just tables created after the variables are set. Whenever a new row is inserted into a table with an AUTO_INCREMENT column, the next value available in the sequence below is used:

valueN = auto_increment_offset + N*auto_increment_increment

Notice that the next value is not computed by adding the auto_increment_increment to the last value in the table.

Use auto_increment_offset and auto_increment_increment to ensure new rows added to a table are assigned numbers from different sequences of numbers depending on which server is used. The idea is that the first server uses the sequence 1, 3, 5… (odd numbers), while the second server uses the sequence 2, 4, 6… (even numbers).

Continuing with Example 4-3, Example 4-4 uses these two variables to ensure the two servers use different IDs when inserting new employees into the Employee table.

Example 4-4. Two servers writing to the same table

-- The common table can be created on either server
CREATE TABLE Employee (
   uid INT AUTO_INCREMENT PRIMARY KEY,
   name VARCHAR(20),
   office VARCHAR(20)
);

-- Setting for first master
SET GLOBAL AUTO_INCREMENT_INCREMENT = 2;
SET GLOBAL AUTO_INCREMENT_OFFSET = 1;

-- Setting for second master
SET GLOBAL AUTO_INCREMENT_INCREMENT = 2;
SET GLOBAL AUTO_INCREMENT_OFFSET = 2;

This scheme handles the insertion of new items in the tables, but when entries are being updated, it is still critical to ensure the update statements are sent to the correct server—the server responsible for the employee. Otherwise, data is likely to be inconsistent. If updates are not done correctly, the slaves will normally not stop—they will just replicate the information, which leads to inconsistent values on the two servers.

For example, if the first master executes the statement:

master-1> UPDATE Employee SET office = 'Vancouver' WHERE uid = 3;
Query OK, 1 row affected (0.00 sec)

and at the same time, the same row is updated at the second server using the statement:

master-2> UPDATE Employee SET office = 'Paris' WHERE uid = 3;
Query OK, 1 row affected (0.00 sec)

the result will be that the first master will place the employee in Paris while the second master will place the employee in Vancouver (the assignments end up swapped because each server applies the other server’s update after its own).

Detecting and preventing such inconsistencies is important because they will cascade and create more inconsistency over time. Statement-based replication executes statements based on the data in the two servers, so one inconsistency can lead to others.

If you take care to separate the changes made by the two servers as outlined previously, the row changes will be replicated and the two masters will therefore be consistent.

If users use different tables on the different servers, the easiest way to prevent such mistakes is to assign privileges so that a user cannot accidentally change tables on the wrong server. This is, however, not always possible, and it cannot prevent the case just shown.
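
For the table-per-office layout in Example 4-3, one possible arrangement is to give each office its own application account that can modify only its own table. The sketch below issues the grants through a Replicant-style server object as in Example 4-1; the account names, host patterns, and the database name (test) are made up for the example.

def restrict_office_accounts(server):
    # Each office account gets write access only to its own table but can
    # still read the other table (for example, through the Employee view).
    server.sql("GRANT SELECT, INSERT, UPDATE, DELETE"
               " ON test.Employee_Sweden TO 'swe_app'@'sweden.example.com'")
    server.sql("GRANT SELECT ON test.Employee_USA"
               " TO 'swe_app'@'sweden.example.com'")
    server.sql("GRANT SELECT, INSERT, UPDATE, DELETE"
               " ON test.Employee_USA TO 'usa_app'@'usa.example.com'")
    server.sql("GRANT SELECT ON test.Employee_Sweden"
               " TO 'usa_app'@'usa.example.com'")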

Semisynchronous Replication

Google has an extensive set of patches for MySQL and InnoDB to tailor the server and the storage engine. One of the patches that is available for MySQL version 5.0 is the semisynchronous replication patch. MySQL has since reworked the patch and released it with MySQL 5.5.

The idea behind semisynchronous replication is to ensure the changes are written to disk on at least one slave before allowing execution to continue. This means that for each connection, at most one transaction can be lost due to a master crash.

It is important to understand that the semisynchronous replication patch does not hold off commits of the transaction; it just avoids sending a reply back to the client until the transaction has been written to the relay log of at least one slave. Figure 4-7 shows the order of the calls when committing a transaction. As you can see, the transaction is committed to the storage engine before the transaction is sent to the slave, but the return from the client’s commit call occurs after the slave has acknowledged that the transaction is in durable storage.

Figure 4-7. Transaction commit with semisynchronous replication

This means that for each connection, it is possible to lose a transaction if a crash occurs after the transaction has been committed to the storage engine but before the transaction has been sent to the slave. However, since the acknowledgment of the transaction goes to the client after the slave has acknowledged that it has the transaction, at most one transaction can be lost.

This usually means that, at most, one transaction can be lost per client, but if the client has multiple connections active with the master at the same time, it can lose one transaction for each connection if the client is committing multiple transactions at the same time and the server crashes.

Configuring semisynchronous replication

To use semisynchronous replication, both the master and the slave need to support it, so both the master and the slave have to be running MySQL version 5.5 or later and have semisynchronous replication enabled. If either the master or the slave does not support semisynchronous replication, it will not be used, but replication works as usual, meaning that more than one transaction can be lost unless special precautions are taken to ensure each transaction reaches the slave before a new transaction is started.

Use the following steps to enable semisynchronous replication:

  1. Install the master plug-in on the master:

    master> INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
  2. Install the slave plug-in on each slave:

    slave> INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
  3. Once you have installed the plug-ins, enable them on the master and the slave. This is controlled through two server variables that are also available as options, so to ensure that the settings take effect even after restart, it is best to bring down the server and add the options to the my.cnf file of the master:

    [mysqld]
    rpl-semi-sync-master-enabled = 1

    and to the slave:

    [mysqld]
    rpl-semi-sync-slave-enabled = 1
  4. Restart the servers.
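
After the restart, you can verify that the plug-ins were loaded and that the options took effect, for example by querying the plugins table in the information schema and the server variables. The sketch below does this through a Replicant-style server object as in Example 4-1, assumed to return result rows as dictionaries.

def check_semisync(master, slave):
    checks = ((master, 'rpl_semi_sync_master', 'rpl_semi_sync_master_enabled'),
              (slave, 'rpl_semi_sync_slave', 'rpl_semi_sync_slave_enabled'))
    for server, plugin, variable in checks:
        row = server.sql("SELECT PLUGIN_STATUS FROM INFORMATION_SCHEMA.PLUGINS"
                         " WHERE PLUGIN_NAME = %s", (plugin,))
        print(plugin, row['PLUGIN_STATUS'])   # should print ACTIVE
        row = server.sql("SHOW VARIABLES LIKE %s", (variable,))
        print(variable, row['Value'])         # should print ON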

If you followed the instructions just given, you now have a semisynchronous replication setup and can test it, but consider these cases:

  • What happens if all slaves crash—which is not unlikely if you have only a single server—and therefore no slave acknowledges that it has stored the transaction to the relay log?

  • What happens if all slaves disconnect gracefully? In this case, the master has no slave to which the transaction can be sent for safekeeping.

In addition to rpl-semi-sync-master-enabled and rpl-semi-sync-slave-enabled, there are two options that you can use to handle the situations given above:

rpl-semi-sync-master-timeout=milliseconds

To prevent semisynchronous replication from blocking if it does not receive an acknowledgment, it is possible to set a timeout using the rpl-semi-sync-master-timeout=milliseconds option.

If the master does not receive any acknowledgment before the timeout expires, it will revert to normal asynchronous replication and continue operating without semisynchronous replication.

This option is also available as a server variable and can be set without bringing the server down. Note, however, that as with every server variable, the value will not be saved between restarts.

rpl-semi-sync-master-wait-no-slave={ON|OFF}

If a transaction is committed but the master does not have any slaves connected, it is not possible for the master to send the transaction anywhere for safekeeping. By default, the master will then wait for a slave to connect—as long as it is within the timeout limit—and acknowledge that the transaction has been properly written to disk.

You can use the rpl-semi-sync-master-wait-no-slave={ON|OFF} option to turn off this behavior, in which case the master reverts to asynchronous replication if there are no connected slaves.

Monitoring semisynchronous replication

Both plug-ins install a number of status variables that allow you to monitor semisynchronous replication. We will cover the most interesting ones here—for a complete list, consult the online reference manual for semisynchronous replication.

rpl_semi_sync_master_clients

This status variable reports the number of connected slaves that support and have registered for semisynchronous replication.

rpl_semi_sync_master_status

The status of semisynchronous replication on the master is 1 if it is active, and 0 if it is inactive—either because it has not been enabled or because it was enabled but has reverted to asynchronous replication.

rpl_semi_sync_slave_status

The status of semisynchronous replication on the slave is 1 if active—that is, it has been enabled and the I/O thread is running—and 0 if it is inactive.

You can read the values of these variables either using the SHOW STATUS command or through the information schema table GLOBAL_STATUS. If you want to use the values for other purposes, the output of SHOW STATUS is awkward to work with, so the query in Example 4-5 instead uses SELECT on the information schema to extract the value and store it in a user-defined variable.

Example 4-5. Retrieving values using the information schema

master> SELECT Variable_value INTO @value
    ->   FROM INFORMATION_SCHEMA.GLOBAL_STATUS
    ->  WHERE Variable_name = 'Rpl_semi_sync_master_status';
Query OK, 1 row affected (0.00 sec)
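
Building on that query, a small monitoring loop can periodically check that the master is still running with semisynchronous replication and that at least one semisynchronous slave is connected. The following sketch again assumes a Replicant-style server object returning rows as dictionaries; the polling interval and the warning messages are just placeholders.

import time

def watch_semisync(master, interval=60):
    query = ("SELECT Variable_value FROM INFORMATION_SCHEMA.GLOBAL_STATUS"
             " WHERE Variable_name = %s")
    while True:
        status = master.sql(query, ('Rpl_semi_sync_master_status',))
        clients = master.sql(query, ('Rpl_semi_sync_master_clients',))
        print("semisync active:", status['Variable_value'],
              "- connected semisync slaves:", clients['Variable_value'])
        if int(clients['Variable_value']) == 0:
            print("WARNING: no semisynchronous slaves connected")
        time.sleep(interval)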

Slave Promotion

The procedures described so far work well when you have a master running that you can use to synchronize the standby and the slave before the switchover, but what happens if the master dies all of a sudden? Since replication has stopped in its tracks with all slaves (including the standby), it will not be possible to run replication just a little more to get all the necessary changes that would put the new master in sync.

If the standby is ahead of all the slaves that need to be reassigned, there is no problem, because you can run replication on each slave to the place where the standby stopped. You will lose any changes that were made on the master but not yet sent to the standby. We will cover how to handle the recovery of the master in this case separately.

If the standby is behind one of the slaves, you shouldn’t use the standby as the new master, since the slave knows more than the standby. As a matter of fact, it would be better if the “more knowledgeable” slave—that is, the slave that has replicated most events from the common master—were the master instead!

This is exactly the approach taken to handle master failures using slave promotion: instead of trying to keep a dedicated standby around, ensure that any one of the slaves connected to the master can be promoted to master and take over at the point where the master was lost. By selecting the “most knowledgeable” slave as the new master, you guarantee that none of the other slaves will be more knowledgeable than the new master, so they can connect to the new master and read events from it.

There is, however, a critical issue that needs to be resolved—synchronizing all slaves with the new master so that no events are lost or repeated. The problem in this situation is that all of the slaves need to read events from the new master, but the positions of the new master are not the same as the positions for the old master. So what is a poor DBA to do?

The traditional method for promoting a slave

Before delving into the final solution, let us first take a look at the recommended practice for handling slave promotion. This will work as a good introduction to the problem, and also allow us to pinpoint the tricky issues that we need to handle for the final solution.

Figure 4-8 shows a typical setup with a master and several slaves.

Figure 4-8. Promoting a slave to replace a failed master

For the traditional method of slave promotion, the following are required:

  • Each promotable slave must have a user account for the replication user.

  • Each promotable slave should run with --log-bin, that is, with the binary log enabled.

  • Each promotable slave should run without the --log-slave-updates option (the reason will become obvious shortly).

Assume you are starting with the original setup shown in Figure 4-8 and that the master fails. You can promote a slave to be the new master by doing the following:

  1. Stop the slave using STOP SLAVE.

  2. Reset the slave that is going to be the new master using RESET MASTER. This will ensure the slave starts as the new master and that any connecting slave will start reading events from the time the slave was promoted.

  3. Connect the other slaves to the new master using CHANGE MASTER TO. Since you reset the new master, you can start replication from the beginning of the binary log, so it is not necessary to provide any position to CHANGE MASTER TO.

Unfortunately, this approach is based on an assumption that is not generally true—that the slaves have received all changes that the master has made. In a typical setup, the slaves will lag behind the master to various degrees. It might be just a few transactions, but nevertheless, they lag behind. In the next section you will see a solution to that problem.

Regardless of that, this approach is so simple that it is useful if you can handle lost transactions or if you are operating under a low load.
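
Expressed with the same Replicant-style helpers as in Example 4-1, the traditional procedure might look like the following sketch. The host and port attributes on the server object, as well as the replication account and password, are assumptions for the example.

def promote_slave_traditional(new_master, other_slaves):
    # Stop the candidate and reset it so it starts with a fresh binary log.
    stop_slave(new_master)
    new_master.sql("RESET MASTER")
    for slave in other_slaves:
        stop_slave(slave)
        # No binlog file or position is given, so replication starts from
        # the beginning of the promoted slave's binary log.
        slave.sql("CHANGE MASTER TO MASTER_HOST=%s, MASTER_PORT=%s,"
                  " MASTER_USER=%s, MASTER_PASSWORD=%s",
                  (new_master.host, new_master.port, 'repl_user', 'xyzzy'))
        start_slave(slave)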

A revised method for promoting a slave

The traditional approach to promoting a slave is inadequate in most cases because slaves usually lag behind the master. Figure 4-9 illustrates the typical situation when the master disappears unexpectedly. The box labeled “binary log” in the center is the master’s binary log and each arrow represents how much of the binary log the slave has executed.

Figure 4-9. Binary log positions of the master and the connected slaves

In the figure, each slave has stopped at a different binlog position. To resolve the issue and bring the system back online, one slave has to be selected as the new master—preferably the one that has the latest binlog position—and the other slaves have to be synchronized with the new master.

The critical problem lies in translating the positions for each slave—which are the positions in the now-defunct master—to positions on the promoted slave. Unfortunately, the history of events executed and the binlog positions they correspond to on the slaves are lost in the replication process—each time the slave executes an event that has arrived from the master, it writes a new event to its binary log, with a new binlog position. The slave’s position bears no relation to the master’s binlog position of the same event. The only option that remains for us is to scan the binary log of the promoted slave. To use this technique:

  • Enable the binary log; otherwise, no changes can be replicated.

  • Enable log slave updates (using the log-slave-updates option); otherwise, no changes from the original master can be forwarded.

  • Each slave needs to have a replication user to act as a master so that if it turns out to be the best candidate for a new master, other slaves can connect to it and replicate from it.

Carry out the following steps for each of the slaves that are not promoted:

  1. Figure out the last transaction it executed.

  2. Find the transaction in the binary log of the promoted slave.

  3. Take the binlog position for the transaction from the promoted slave.

  4. Start the nonpromoted slaves to replicate from that position on the promoted slave.

To match the latest transaction on each of the slaves with the corresponding event in the binary log of the promoted slave, you need to tag each transaction. The content and structure of the tags don’t matter; they just need to be uniquely identifiable no matter who executed the transaction so each transaction on the master can be found in the promoted slave’s binary log. We call this kind of tag the global transaction ID.

The easiest way to accomplish this is to insert a statement at the end of each transaction that updates a special table and use that to keep track of where each slave is. Just before committing each transaction, a statement updates the table with a number that is unique for the transaction.

Tagging can be handled in two main ways:

  • Extending the application code to perform the necessary statements

  • Calling a stored procedure to perform each commit and writing the tag in the procedure

Because the first approach is easier to follow, it will be demonstrated here. If you are interested in the second approach, see Stored Procedures to Commit Transactions.

To implement the global transaction ID, we have created the two tables in Example 4-6: one table named Global_Trans_ID to generate sequence numbers and a separate table named Last_Exec_Trans to record the global transaction ID.

The server ID is added to the definition of Last_Exec_Trans to distinguish transactions committed on different servers. If, for example, the promoted slave fails before all the slaves have managed to connect, it is very important to distinguish between the transaction ID of the original master and the transaction ID of the promoted slave. Otherwise, the slaves that didn’t manage to connect to the promoted slave might start to execute from a position that is wrong when being redirected to the second promoted slave. This example uses MyISAM to define the counter table, but it is possible to use InnoDB for this as well.

Example 4-6. Tables used for generating and tracking global transaction IDs

CREATE TABLE Global_Trans_ID (
    number INT UNSIGNED AUTO_INCREMENT PRIMARY KEY
) ENGINE = MyISAM;

CREATE TABLE Last_Exec_Trans (
    server_id INT UNSIGNED,
    trans_id INT UNSIGNED
) ENGINE = InnoDB;

-- Insert a single row with NULLs to be updated.
INSERT INTO Last_Exec_Trans() VALUES ();

The next step is to construct a procedure for adding a global transaction ID to the binary log so that a program promoting a slave can read the ID from the log. The following procedure is suitable for our purposes:

  1. Insert an item into the transaction counter table, making sure to turn off the binary log before doing this, since the insert should not be replicated to the slaves:

    master> SET SQL_LOG_BIN = 0;
    Query OK, 0 rows affected (0.00 sec)
    
    master> INSERT INTO Global_Trans_ID() VALUES ();
    Query OK, 1 row affected (0.00 sec)
  2. Fetch the global transaction ID using the function LAST_INSERT_ID. To simplify the logic, the server ID is fetched from the server variable server_id at the same time:

    master> SELECT @@server_id as server_id, LAST_INSERT_ID() as trans_id;
    +-----------+----------+
    | server_id | trans_id |
    +-----------+----------+
    |         0 |      235 |
    +-----------+----------+
    1 row in set (0.00 sec)
  3. Before inserting the global transaction ID into the Last_Exec_Trans tracking table, you can remove its row from the transaction counter table to save space. This optional step works only for a MyISAM table. If you use InnoDB, you have to be careful about leaving the last used global transaction ID in the table. InnoDB determines the next number from the maximum value in the autoincrement column currently in the table.

    master> DELETE FROM Global_Trans_ID WHERE number < 235;
    Query OK, 1 row affected (0.00 sec)
  4. Turn on the binary log:

    master> SET SQL_LOG_BIN = 1;
    Query OK, 0 rows affected (0.00 sec)
  5. Update the Last_Exec_Trans tracking table with the server ID and the transaction ID you got in step 2. This is the last step before committing the transaction through a COMMIT:

    master> UPDATE Last_Exec_Trans SET server_id = 0, trans_id = 235;
    Query OK, 1 row affected (0.00 sec)
    
    master> COMMIT;
    Query OK, 0 rows affected (0.00 sec)

Each global transaction ID represents a point where replication can be resumed. Therefore, you must carry out this procedure for every transaction. If it is not used for some transaction, the transaction will not be tagged properly and it will not be possible to start from that position.
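
The five-step procedure above can be wrapped in a single helper on the application side. Here is a minimal sketch using Python's DB-API, assuming a connection conn to the master with autocommit disabled and a driver (such as MySQLdb) that uses the %s parameter style; the function name is ours, not part of the book's code:

def commit_with_global_trans_id(conn):
    cur = conn.cursor()
    # Step 1: generate a new transaction ID without writing to the binary log.
    cur.execute("SET SQL_LOG_BIN = 0")
    cur.execute("INSERT INTO Global_Trans_ID() VALUES ()")
    # Step 2: pick up the server ID and the freshly generated transaction ID.
    cur.execute("SELECT @@server_id, LAST_INSERT_ID()")
    server_id, trans_id = cur.fetchone()
    # Step 3 (optional, MyISAM counter table only): trim older counter rows.
    cur.execute("DELETE FROM Global_Trans_ID WHERE number < %s", (trans_id,))
    # Step 4: turn the binary log back on so that the tag below is replicated.
    cur.execute("SET SQL_LOG_BIN = 1")
    # Step 5: record the tag as the last statement before the commit.
    cur.execute("UPDATE Last_Exec_Trans SET server_id = %s, trans_id = %s",
                (server_id, trans_id))
    conn.commit()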

Now, to promote a slave after the master is lost, find the slave that has the latest changes of all the slaves—that is, has the largest binlog position—and promote it to master. Then have each of the other slaves connect to it.

For a slave to connect to the promoted slave and start replication at the right position, it is necessary to find out what position on the promoted slave has the last executed transaction of the slave. Scan the binary log of the promoted slave to find the right transaction ID.

Use the following steps to carry out the recovery:

  1. Stop the slave. Get the last-seen global transaction ID from its Last_Exec_Trans table.

  2. Pick the slave with the highest global transaction ID to promote to master. If there are several, pick one.

  3. Get the master position and the list of binary logs of the slave to promote, both via SHOW MASTER LOGS. Note that the last row of SHOW MASTER LOGS matches what you would see in SHOW MASTER STATUS.

  4. Bring the promoted slave online and let it start accepting updates.

  5. Connect to the promoted slave and scan its binary log for each of the global transaction IDs that you collected from the slaves in step 1. Unless you have a file position that you know is good, the only safe starting point for reading a binary log is the beginning. Therefore, you have to scan the binary logs in reverse order, starting with the latest.

    This step will give you a binlog position on the promoted slave for each global transaction ID that you collected in step 1.

  6. Reconnect each slave to the promoted slave, starting at the position where the slave needs to start in order to recover all information, using the information from step 5.

The first four steps are straightforward, but step 5 is tricky. To illustrate the situation, let’s start with an example of some basic information gathered from the first three steps. Table 4-2 lists three sample slaves with the global transaction ID of each slave.

Table 4-2. Global transaction ID for all connected slaves

Server     Server ID   Trans ID
slave-1    1           245
slave-2    1           248
slave-3    1           256

As you can see in Table 4-2, slave-3 has the latest global transaction ID and is therefore the slave you will promote. It is therefore necessary to translate the global transaction ID of each slave to binlog positions on slave-3. For that, we need information about the binary log on slave-3, which we’ll obtain in Example 4-7.

Example 4-7. Master positions of slave-3, which will be promoted

slave-3> SHOW MASTER LOGS;
+--------------------+-----------+
| Log_name           | File_size |
+--------------------+-----------+
| slave-3-bin.000001 |      3115 |
| slave-3-bin.000002 |    345217 |
| slave-3-bin.000003 |     24665 |
| slave-3-bin.000004 |    788243 |
| slave-3-bin.000005 |      1778 |
+--------------------+-----------+
5 rows in set (0.00 sec)

The important thing to know from the output of SHOW MASTER LOGS is the names of the logs, so you can scan them for global transaction IDs. For instance, when reading the slave-3-bin.000005 file using mysqlbinlog, part of the output will look like that shown in Example 4-8. The transaction that slave-3 received starting at position 596 (the first line of the output) carries the last global transaction ID seen by slave-1, as shown by the UPDATE of the Last_Exec_Trans table.

Example 4-8. Output from the mysqlbinlog command for one transaction

# at 596
#091018 18:35:42 server id 1  end_log_pos 664  Query   thread_id=952   ...
SET TIMESTAMP=1255883742/*!*/;
BEGIN
/*!*/;
# at 664
#091018 18:35:42 server id 1  end_log_pos 779  Query   thread_id=952   ...
SET TIMESTAMP=1255883742/*!*/;
UPDATE user SET messages = messages + 1 WHERE id = 1
/*!*/;
# at 779
#091018 18:35:42 server id 1  end_log_pos 904  Query   thread_id=952   ...
SET TIMESTAMP=1255883742/*!*/;
INSERT INTO message VALUES (1,'MySQL Python Replicant rules!')
/*!*/;
# at 904
#091018 18:35:42 server id 1  end_log_pos 1021  Query   thread_id=952   ...
SET TIMESTAMP=1255883742/*!*/;
UPDATE Last_Exec_Trans SET server_id = 1, trans_id = 245
/*!*/;
# at 1021
#091018 18:35:42 server id 1  end_log_pos 1048  Xid = 1433
COMMIT/*!*/;

Table 4-2 shows that the trans_id 245 is the last transaction seen by slave-1, so now you know that the start position for slave-1 is in file slave-3-bin.000005 at byte position 1048. So to start slave-1 at the correct position, you can now execute CHANGE MASTER TO and START SLAVE:

slave-1> CHANGE MASTER TO
       ->    MASTER_HOST = 'slave-3',
       ->    MASTER_LOG_FILE = 'slave-3-bin.000005',
       ->    MASTER_LOG_POS = 1048;
Query OK, 0 rows affected (0.04 sec)

slave-1> START SLAVE;
Query OK, 0 rows affected (0.17 sec)

By going backward in this manner—locating each of the transactions that you recorded in the first step in the procedure—you can connect the slaves one by one to the new master at exactly the right position.

This technique works well if the update statement is added to every transaction commit. Unfortunately, there are statements that perform an implicit commit before and after the statement. Typical examples include CREATE TABLE, DROP TABLE, and ALTER TABLE. Since these statements do an implicit commit, they cannot be tagged properly, hence it is not possible to restart just after them. This means that if the sequence of statements in Example 4-9 is executed and there is a crash, you will potentially have problems.

If a slave has just executed the CREATE TABLE and then loses the master, the last seen global transaction ID is the one for the INSERT INTO, that is, for the transaction just before the CREATE TABLE statement. Therefore, the slave will try to reconnect to the promoted slave using the transaction ID of the INSERT INTO statement. Since it will find that position in the binary log of the promoted slave, it will resume from there and replicate the CREATE TABLE statement again, causing the slave to stop with an error.

You can avoid these problems through careful use and design of statements; for example, if CREATE TABLE is replaced with CREATE TABLE IF NOT EXISTS, the slave will notice that the table already exists and skip execution of the statement.

Example 4-9. Statements where global transaction ID cannot be assigned

INSERT INTO message_board VALUES ('mats@sun.com', 'Hello World!');
CREATE TABLE admin_table (a INT UNSIGNED);
INSERT INTO message_board VALUES ('', '');

Slave promotion in Python

You have now seen two techniques for promoting a slave: a traditional technique that suffers from a loss of transactions on some slaves, and a more complex technique that recovers all available transactions. The traditional technique is straightforward to implement in Python, so let’s concentrate on the more complicated one. To handle slave promotion this way, it is necessary to:

  • Configure all the slaves correctly

  • Add the tables Global_Trans_ID and Last_Exec_Trans to the master

  • Provide the application code to commit a transaction correctly

  • Write code to automate slave promotion

You can use the Promotable class (shown in Example 4-10) to handle the new kind of server. As you can see, this example reuses the previously introduced _enable_binlog helper method and adds a method to set the log-slave-updates option. Since a promotable slave requires the special tables we showed earlier for the master, adding a promotable slave also requires adding those tables to the master. To do this, we write a function named _add_global_id_tables. The function assumes that if the tables already exist, they have the correct definition, so no attempt is made to re-create them. However, the Last_Exec_Trans table needs to start with one row for the update to work correctly, so if no warning was produced to indicate that a table already exists, a row with NULLs is inserted into it.

Example 4-10. The definition of a promotable slave role

_GLOBAL_TRANS_ID_DEF = """
CREATE TABLE IF NOT EXISTS Global_Trans_ID (
  number INT UNSIGNED NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (number)
) ENGINE=MyISAM
"""

_LAST_EXEC_TRANS_DEF = """
CREATE TABLE IF NOT EXISTS Last_Exec_Trans (
  server_id INT UNSIGNED DEFAULT NULL,
  trans_id INT UNSIGNED DEFAULT NULL
) ENGINE=InnoDB
"""

class Promotable(Role):
    def __init__(self, repl_user, master):
        self.__master = master
        self.__user = repl_user

    def _add_global_id_tables(self, master):
        master.sql(_GLOBAL_TRANS_ID_DEF)
        master.sql(_LAST_EXEC_TRANS_DEF)
        if not master.sql("SELECT @@warning_count"):
            master.sql("INSERT INTO Last_Exec_Trans() VALUES ()")

    def _relay_events(self, server, config):
        config.set('mysqld', 'log-slave-updates')

    def imbue(self, server):
        # Fetch and update the configuration
        config = server.get_config()
        self._set_server_id(server, config)
        self._enable_binlog(server, config)
        self._relay_events(server, config)

        # Put the new configuration in place
        server.stop()
        server.put_config(config)
        server.start()

        # Add tables to master
        self._add_global_id_tables(self.__master)

        server.repl_user = self.__master.repl_user
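
A hypothetical use of the role, following the conventions of the earlier examples (the server objects, the slaves list, and the loop are assumptions rather than code from the book's library), might look like this:

# Make every slave in the deployment promotable. The global transaction
# ID tables are created on the master the first time a slave is imbued.
role = Promotable(repl_user=master.repl_user, master=master)
for slave in slaves:
    role.imbue(slave)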

This routine configures the slaves and the master correctly for using global transaction IDs. You still have to update the Last_Exec_Trans table when committing each transaction. In Example 4-11 you can see an example implementation in PHP for committing transactions. The code is written using PHP, since this is part of the application code and not part of the code for managing the deployment.

Example 4-11. Code for starting, committing, and aborting transactions

function start_trans($link) {
  mysql_query("START TRANSACTION", $link);
}

function commit_trans($link) {
  mysql_select_db("common", $link);
  mysql_query("SET SQL_LOG_BIN = 0", $link);
  mysql_query("INSERT INTO Global_Trans_ID() VALUES ()", $link);
  $trans_id = mysql_insert_id($link);
  $result = mysql_query("SELECT @@server_id as server_id", $link);
  $row = mysql_fetch_row($result);
  $server_id = $row[0];

  $delete_query = "DELETE FROM Global_Trans_ID WHERE number = %d";
  mysql_query(sprintf($delete_query, $trans_id),
           $link);
  mysql_query("SET SQL_LOG_BIN = 1", $link);

  $update_query = "UPDATE Last_Exec_Trans SET server_id = %d, trans_id = %d";
  mysql_query(sprintf($update_query, $server_id, $trans_id), $link);
  mysql_query("COMMIT", $link);
}

function rollback_trans($link) {
  mysql_query("ROLLBACK", $link);
}

We can then use this code to commit transactions by calling the functions instead of the usual COMMIT and ROLLBACK. For example, we could write a PHP function to add a message to a database and update a message counter for the user:

function add_message($email, $message, $link) {
  start_trans($link);
  mysql_select_db("common", $link);
  $query = sprintf("SELECT user_id FROM user WHERE email = '%s'", $email);
  $result = mysql_query($query, $link);
  $row = mysql_fetch_row($result);
  $user_id = $row[0];

  $update_user = "UPDATE user SET messages = messages + 1 WHERE user_id = %d";
  mysql_query(sprintf($update_user, $user_id), $link);

  $insert_message = "INSERT INTO message VALUES (%d,'%s')";
  mysql_query(sprintf($insert_message, $user_id, $message), $link);
  commit_trans($link);
}

$conn = mysql_connect(":/var/run/mysqld/mysqld1.sock", "root");
add_message('mats@example.com', "MySQL Python Replicant rules!", $conn);

What remains is the job of handling the actual slave promotion in the event of a failure. The procedure was already outlined above, but the implementation is more involved.

The first step is to fetch the binlog files remotely, similarly to the method used in Chapter 2. In this case, we need to fetch the entire binlog file, since we do not know where to start reading. The fetch_remote_binlog function in Example 4-12 returns an iterator to the lines of the binary log.

Example 4-12. Fetching a remote binary log

import subprocess

def fetch_remote_binlog(server, binlog_file):
    command = ["mysqlbinlog",
               "--read-from-remote-server",
               "--force",
               "--host=%s" % (server.host),
               "--user=%s" % (server.sql_user.name)]
    if server.sql_user.passwd:
        command.append("--password=%s" % (server.sql_user.passwd))
    command.append(binlog_file)
    return iter(subprocess.Popen(command, stdout=subprocess.PIPE).stdout)

The iterator returns the lines of the binary log one by one, so the lines have to be further separated into transactions and events to make the binary log easier to work with. In Example 4-13 you can see a function named group_by_event that groups lines belonging to the same event into a single string, and a function named group_by_trans that groups a stream of events (as returned by group_by_event) into lists, where each list represents a transaction.

Example 4-13. Parsing the mysqlbinlog output to extract transactions

delimiter = "/*!*/;"

def group_by_event(lines):
    event_lines = []
    for line in lines:
        if line.startswith('#'):
            if line.startswith("# End of log file"):
                del event_lines[-1]
                yield ''.join(event_lines)
                return
            if line.startswith("# at"):
                yield ''.join(event_lines)
                event_lines = []
        event_lines.append(line)

def group_by_trans(lines):
    group = []
    in_transaction = False
    for event in group_by_event(lines):
        group.append(event)
        if event.find(delimiter + "\nBEGIN\n" + delimiter) >= 0:
            in_transaction = True
        elif not in_transaction:
            yield group
            group = []
        else:
            p = event.find("\nCOMMIT")
            if p >= 0 and (event.startswith(delimiter, p+7)
                           or event.startswith(delimiter, p+8)):
                yield group
                group = []
                in_transaction = False

Example 4-14 shows a function named scan_logfile that scans the mysqlbinlog output for the global transaction IDs that were introduced. The function accepts a master from which to fetch the binlog file, the name of a binlog file to scan (the filename is the name of the binary log on the master), and a callback function on_gid that will be called whenever a global transaction ID is seen. The on_gid function will be called with the global transaction ID (consisting of a server_id and a trans_id) and the binlog position of the end of the transaction.

Example 4-14. Scanning a binlog file for global transaction IDs

import re

_GIDCRE = re.compile(r"^UPDATE Last_Exec_Trans SET\s+"
                     r"server_id = (?P<server_id>\d+),\s+"
                     r"trans_id = (?P<trans_id>\d+)$", re.MULTILINE)
_HEADCRE = re.compile(r"#\d{6}\s+\d?\d:\d\d:\d\d\s+"
                      r"server id\s+(?P<sid>\d+)\s+"
                      r"end_log_pos\s+(?P<end_pos>\d+)\s+"
                      r"(?P<type>\w+)")

def scan_logfile(master, logfile, on_gid):
    from mysqlrep import Position
    lines = fetch_remote_binlog(master, logfile)
    # Scan the output to find global transaction ID update statements
    for trans in group_by_trans(lines):
        if len(trans) < 3:
            continue
        # Check for an update of the Last_Exec_Trans table
        m = _GIDCRE.search(trans[-2])
        if m:
            server_id = int(m.group("server_id"))
            trans_id = int(m.group("trans_id"))
            # Check for an information comment with end_log_pos. We
            # assume InnoDB tables only, so we can therefore rely on
            # the transactions to end in an Xid event.
            m = _HEADCRE.search(trans[-1])
            if m and m.group("type") == "Xid":
                pos = Position(server_id, logfile, int(m.group("end_pos")))
                on_gid(server_id, trans_id, pos)
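
As a quick, hypothetical check of the scanner, you can pass a callback that simply prints what it finds; the server variable and filename below are placeholders based on the earlier example output:

def print_gid(server_id, trans_id, position):
    print("server %d, trans %d ends at %s" % (server_id, trans_id, position))

# promoted_slave is a placeholder for a Server object (e.g., slave-3).
scan_logfile(promoted_slave, "slave-3-bin.000005", print_gid)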

The code for the last step is given in Example 4-15. The promote_slave function takes a list of slaves that lost their master and executes the promotion by identifying the new master from the slaves. Finally, it reconnects all the other slaves to the promoted slave by scanning the binary logs. The code uses the support function fetch_global_trans_id to fetch the global transaction ID from the table that we introduced.

Example 4-15. Identifying the new master and reconnecting all slaves to it

def fetch_global_trans_id(slave):
    result = slave.sql("SELECT server_id, trans_id FROM Last_Exec_Trans")
    return (int(result["server_id"]), int(result["trans_id"]))

def promote_slave(slaves):
    slave_info = {}

    # Collect the global transaction ID of each slave
    for slave in slaves:
        slave.connect()
        server_id, trans_id = fetch_global_trans_id(slave)
        slave_info.setdefault(trans_id, []).append((server_id, trans_id, slave))
        slave.disconnect()

    # Pick the slave to promote by taking the slave with the highest
    # global transaction id.
    new_master = slave_info[max(slave_info)].pop()[2]

    def maybe_change_master(server_id, trans_id, position):
        from mysqlrep.utility import change_master
        try:
            for sid, tid, slave in slave_info[trans_id]:
                if slave is not new_master:
                    change_master(slave, new_master, position)
        except KeyError:
            pass

    # Read the master logfiles of the new master.
    new_master.connect()
    logs = [row["Log_name"] for row in new_master.sql("SHOW MASTER LOGS")]
    new_master.disconnect()

    # Read the master logfiles one by one in reverse order, the
    # latest binlog file first.
    logs.reverse()
    for log in logs:
        scan_logfile(new_master, log, maybe_change_master)

Worth noting in the code is that the slaves are collected into a dictionary using the transaction ID from the global transaction ID as a key. Since there can be several slaves associated with the same key, we used the “Associating Multiple Values with Each Key in a Dictionary” recipe in Alex Martelli et al.’s Python Cookbook (O’Reilly). This puts a list of servers under each key and allows quick lookup and processing in maybe_change_master based on only the transaction ID.
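
In isolation, the pattern is only a few lines; the values below are illustrative only:

# Two slaves that stopped at the same transaction ID end up in the same list.
slave_info = {}
slave_info.setdefault(245, []).append((1, 245, "slave-1"))
slave_info.setdefault(245, []).append((1, 245, "slave-4"))
slave_info.setdefault(248, []).append((1, 248, "slave-2"))
# slave_info[245] now holds the entries for both slave-1 and slave-4.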

Note

With the code in Example 4-15, there is no guarantee that the transaction IDs will be in order, so if that is important, you will have to take additional measures. The transaction IDs could end up out of order if one transaction fetches a global transaction ID but is delayed before committing, while another transaction fetches its own ID and commits first. To make sure the transaction IDs reflect the order in which transactions start, add a SELECT ... FOR UPDATE just before fetching a global transaction ID by changing the code as follows:

def commit_trans(cur):
    cur.execute("SELECT * FROM Last_Exec_Trans FOR UPDATE")
    cur.execute("SET SQL_LOG_BIN = 0")
    cur.execute("INSERT INTO Global_Trans_ID() VALUES ()")
    .
    .
    .
    cur.commit()

This will lock the row until the transaction is committed, but will also slow down the system some, which is wasteful if the ordering is not required.

Circular Replication

After reading about dual masters, you might wonder if it is possible to set up a multimaster topology with more than two masters replicating to each other. Since each slave can have only a single master, the only way to achieve this is to set up replication in a circular fashion.

Although this is not a recommended setup, it is certainly possible. It is not recommended because it is very hard to get it to work correctly in the presence of failures. The reasons for this will become clear in a moment.

Using a circular replication setup with three or more servers can be quite practical for reasons of locality. As a real-life example, consider the case of a mobile phone operator with subscribers all over Europe. Since the mobile phone owners travel around quite a lot, it is convenient to have the registry for the customers close to the actual phone, so by placing the data centers at some strategic places in Europe, it is possible to quickly verify call data and also register new calls locally. The changes can then be replicated to all the servers in the ring, and eventually all servers will have accurate billing information. In this case, circular replication is a perfect setup: all subscriber data is replicated to all sites, and updates of data are allowed in all data centers.

Setting up circular replication (as shown in Figure 4-10) is quite easy. Example 4-16 provides a script that sets up circular replication automatically, so where are the complications? As in every setup, you should ask yourself, “What happens when something goes wrong?”

Figure 4-10. Circular replication setup

Example 4-16. Setting up circular replication

from mysqlrep.utility import change_master

def circular_replication(server_list):
    count = len(server_list)
    for i in range(0, count):
        change_master(server_list[(i+1) % count], server_list[i])
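
For the four servers in Figure 4-10, a hypothetical invocation (assuming server objects named after the cities) would be:

# Sets up the ring Stockholm -> Moscow -> Paris -> London -> Stockholm.
circular_replication([stockholm, moscow, paris, london])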

In Figure 4-10, there are four servers named for the cities in which they are located (the names are arbitrarily picked and do not reflect a real setup). Replication goes in a circle: “Stockholm” to “Moscow” to “Paris” to “London” and back to “Stockholm.” This means that “Moscow” is upstream of “Paris,” but downstream of “Stockholm.” Suppose that “Moscow” goes down suddenly and unexpectedly. To allow replication to continue, it is necessary to reconnect the “downstream” server “Paris” to the “upstream” server “Stockholm” to ensure the continuing operation of the system.

Figure 4-11 shows a scenario in which a single server fails and the servers reconnect to allow replication to continue. Sounds simple enough, doesn’t it? Well, it’s not really as simple as it looks. There are basically three issues that you have to consider:

  • The downstream server—the server that was slave to the failed master—needs to connect to the upstream server and start replication from what it last saw. How is that position decided?

  • Suppose that the crashed server has managed to send out some events before crashing. What happens with those events?

  • We need to consider how we should bring the failed server into the topology again. What if the server applied some transactions of its own that were written to the binary log but not yet sent out? It is clear that these transactions are lost, so we need to handle this.

Figure 4-11. Changing topology in response to a failing server

When detecting that one of the servers failed, it is easy to use the CHANGE MASTER command to connect the downstream server to the upstream server, but for replication to work correctly, we must determine the right position. To find the correct position, use binary log scanning techniques similar to what we used for slave promotion. However, in this case, we have several servers to consider when deciding what position to start from. The Last_Exec_Trans table introduced earlier already contains the server ID and the global transaction ID seen from that server.

The second problem is more complicated. If the failing server managed to send out an event, there is nothing that can remove that event from the replication stream, so it will circle around the replication topology forever. If the statement is idempotent—it can be reapplied multiple times without causing problems—the situation could be manageable for a short period, but in general, the statement has to be removed somehow.

In MySQL version 5.5, the IGNORE_SERVER_IDS parameter was added to the CHANGE MASTER command. This parameter allows a slave to filter out events with additional server IDs from the replication stream, not only events whose server ID matches its own. So, assuming that the servers have the IDs shown in Figure 4-11, we can reconnect Paris to Stockholm using the following command:

paris> CHANGE MASTER TO
    ->   MASTER_HOST='stockholm.example.com',
    ->   IGNORE_SERVER_IDS = (2);

For versions of MySQL earlier than version 5.5, there is no such support and you may have to devise some other means of removing the offending events. The easiest method is probably to bring in a server temporarily with the same ID as the crashed server for the sole purpose of removing the offending event.

The complete procedure to shrink the ring in a circular setup—assuming that you are using MySQL 5.5—is as follows:

  1. On the downstream server, determine the global transaction ID of the last committed transaction seen from each server that is still up and running.

    paris> SELECT Server_ID, Trans_ID FROM Last_Exec_Trans WHERE Server_ID != 2;
    +-----------+----------+
    | Server_ID | Trans_ID |
    +-----------+----------+
    |         1 |     5768 |
    |         3 |     4563 |
    |         4 |      768 |
    +-----------+----------+
    3 rows in set (0.00 sec)
  2. Scan the binary log of the upstream server for the last of the global transaction IDs seen in Last_Exec_Trans (a sketch of this step is shown after the list).

  3. Connect the downstream server to this position using CHANGE MASTER.

    paris> CHANGE MASTER TO
        ->   MASTER_HOST='stockholm.example.com',
        ->   IGNORE_SERVER_IDS = (2);
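
As promised in step 2, here is a rough sketch of the scanning step, reusing the scan_logfile function from Example 4-14. The set of (server ID, transaction ID) pairs comes from step 1; the helper itself and its wiring are assumptions, not code from the book's library:

def find_ring_position(upstream, logfile, last_seen):
    # Return the binlog position on the upstream server of the latest
    # transaction whose (server_id, trans_id) pair appears in last_seen.
    matches = []

    def on_gid(server_id, trans_id, position):
        if (server_id, trans_id) in last_seen:
            matches.append(position)

    scan_logfile(upstream, logfile, on_gid)
    return matches[-1] if matches else None

Starting with the latest binlog file of the upstream server and working backward, the first file that yields a position is the place to point the downstream server at.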

Since the failed server can be in an alternative future compared to the other servers, the safest way to bring it into the circle again is to restore the server from one of the servers in the ring and reconnect the circle so that the new server is in the ring again. The steps to accomplish that are:

  1. Restore the server from one of the existing servers—the server that will eventually be the upstream server—in the ring and attach it as a slave to that server.

    moscow> CHANGE MASTER TO MASTER_HOST='stockholm.example.com';
    Query OK, 0 rows affected (0.18 sec)
    
    moscow> START SLAVE;
    Query OK, 0 rows affected (0.00 sec)
  2. Once the server has caught up sufficiently, break the ring by disconnecting the downstream server. This server will no longer receive any updates.

    paris> STOP SLAVE;
    Query OK, 0 rows affected (0.00 sec)
  3. Since the restored server might not have all the events that the downstream server has, it is necessary to wait for the restored server to have at least all the events the downstream server has. Since the positions are for the same server, you can do this using a combination of SHOW SLAVE STATUS and MASTER_POS_WAIT.

    paris> SHOW SLAVE STATUS;
        ...
     Relay_Master_Log_File: stockholm-bin.000096
         ...
      Exec_Master_Log_Pos: 756648
    1 row in set (0.00 sec)
    
    
    
    moscow> SELECT MASTER_POS_WAIT('stockholm-bin.000096', 756648);
    +-------------------------------------------------+
    | MASTER_POS_WAIT('stockholm-bin.000096', 756648) |
    +-------------------------------------------------+
    |                                          985761 |
    +-------------------------------------------------+
    1 row in set (156.32 sec)
  4. Determine the position of the event on the restored server by scanning the binary log of the restored server for the global ID that was last seen by the downstream server.

  5. Connect the downstream server to the restored server and start replication.

    paris> CHANGE MASTER TO
        ->    MASTER_HOST='moscow.example.com',
        ->    MASTER_LOG_FILE='moscow-bin.000107',
        ->    MASTER_LOG_POS=196758;
    Query OK, 0 rows affected (0.18 sec)
    
    paris> START SLAVE;
    Query OK, 0 rows affected (0.00 sec)

Conclusion

High availability is a nontrivial concept to implement in practice. In this chapter, we presented a look into high availability and how you can achieve it with MySQL. In the next chapter, we will look more at high availability as we examine a companion topic: scaling out.
