MySQL High Availability, 2nd Edition

Chapter 4. The Binary Log

“Joel?”

Joel jumped, nearly banging his head as he crawled out from under his desk. “I was just rerouting a few cables,” he said by way of an explanation.

Mr. Summerson merely nodded and said in a very authoritative manner, “I need you to look into a problem the marketing people are having with the new server. They need to roll back the data to a certain point.”

“Well, that depends…” Joel started, worried about whether he had snapshots of old states of the system.

“I told them you’d be right down.”

With that, Mr. Summerson turned and walked away. A moment later a woman stopped in front of his door and said, “He’s always like that. Don’t take it personally. Most of us call it a drive-by tasking.” She laughed and introduced herself. “My name’s Amy. I’m one of the developers here.”

Joel walked around his desk and met her at the door. “I’m Joel.”

After a moment of awkward silence Joel said, “I, er, better get on that thing.”

Amy smiled and said, “See you around.”

“Just focus on what you have to do to succeed,” Joel thought as he returned to his desk to search for that MySQL book he bought last week.

The previous chapter included a very brief introduction to the binary log. In this chapter, we will fill in more details and give a more thorough description of the binary log structure, the replication event format, and how to use the mysqlbinlog tool to investigate and work with the contents of binary logs.

The binary log records changes made to the database. It is usually used for replication, and the binary log then allows the same changes to be made on the slaves as well. Because the binary log normally keeps a record of all changes, you can also use it for auditing purposes to see what happened in the database, and for point-in-time recovery (PITR) by playing back the binary log to a server, repeating changes that were recorded in the binary log. (This is what we did in Reporting, where we played back all changes done between 23:55:00 and 23:59:59.)

The binary log contains information that could change the database. Note that statements that could potentially change the database are also logged, even if they don’t actually do change the database. The most notable cases are those statements that optionally make a change, such as DROP TABLE IF EXISTS or CREATE TABLE IF NOT EXISTS, along with statements such as DELETE and UPDATE that have WHERE conditions that don’t happen to match any rows on the master.

SELECT statements are not normally logged because they do not make any changes to any database. There are, however, exceptions.

The binary log records each transaction in the order that the commit took place on the master. Although transactions may be interleaved on the master, each appears as an uninterrupted sequence in the binary log, the order determined by the time of the commit.

Structure of the Binary Log

Conceptually, the binary log is a sequence of binary log events (also called binlog events or even just events when there is no risk of confusion). As you saw in Chapter 3, the binary log actually consists of several files, as shown in Figure 4-1, that together form the binary log.

Figure 4-1. The structure of the binary log

The actual events are stored in a series of files called binlog files with names in the form host-bin.000001, accompanied by a binlog index file that is usually named host-bin.index and keeps track of the existing binlog files. The binlog file that is currently being written to by the server is called the active binlog file. If no slaves are lagging, this is also the file that is being read by the slaves. The names of the binlog files and the binlog index file can be controlled using the log-bin and log-bin-index options, which you are familiar with from Configuring the Master. The options are covered in more detail later in this chapter.

The index file keeps track of all the binlog files used by the server so that the server can correctly create new binlog files when necessary, even after server restarts. Each line in the index file contains the name of a binlog file that is part of the binary log. Depending on the MySQL version, it can either be the full name or a name relative to the data directory. Commands that affect the binlog files, such as PURGE BINARY LOGS, RESET MASTER, and FLUSH LOGS, also affect the index file by adding or removing lines to match the files that were added or removed by the command.

As shown in Figure 4-1, each binlog file is made up of binlog events, with the Format_description event serving as the file’s header and the Rotate event as its footer. Note that a binlog file might not end with a rotate event if the server was interrupted or crashed.

The Format_description event contains information about the server that wrote the binlog file as well as some critical information about the file’s status. If the server is stopped and restarted, it creates a new binlog file and writes a new Format_description event to it. This is necessary because changes can potentially occur between bringing a server down and bringing it up again. For example, the server could be upgraded, in which case a new Format_description event would have to be written.

When the server has finished writing a binlog file, a Rotate event is added to end the file. The event points to the next binlog file in sequence by giving the name of the file as well as the position to start reading from.

The Format_description event and the Rotate event will be described in detail in the next section.

With the exception of control events (e.g., Format_description, Rotate, and Incident), events of a binlog file are grouped into units called groups, as seen in Figure 4-2. In transactional storage engines, each group is roughly equivalent to a transaction, but for nontransactional storage engines or statements that cannot be part of a transaction, such as CREATE or ALTER, each statement is a group by itself.^[4]

In short, each group of events in the binlog file contains either a single statement not in a transaction or a transaction consisting of several statements.

Figure 4-2. A single binlog file with groups of events

Each group is executed entirely or not at all (with the exception of a few well-defined cases). If, for some reason, the slave stops in the middle of a group, replication will start from the beginning of the group and not from the last statement executed. Chapter 8 describes in detail how the slave executes events.

Binlog Event Structure

MySQL 5.0 introduced a new binlog format: binlog format 4. The preceding formats were not easy to extend with additional fields if the need were to arise, so binlog format 4 was designed specifically to be extensible. This is still the event format used in every server version since 5.0, even though each version of the server has extended the binlog format with new events and some events with new fields. Binlog format 4 is the event format described in this chapter.

Each binlog event consists of four parts:

Common header

The common header is—as the name suggests—common to all events in the binlog file.

The common header contains basic information about the event, the most important fields being the event type and the size of the event.

Post header

The post header is specific to each event type; in other words, each event type stores different information in this field. But the size of this header, just as with the common header, is the same throughout a given binlog file. The size of each event type is given by the Format_description event.

Event body

After the headers comes the event body, which is the variable-sized part of the event. The size and the end position is listed in the common header for the event. The event body stores the main data of the event, which is different for different event types. For the Query event, for instance, the body stores the query, and for the User_var event, the body stores the name and value of a user variable that was just set by a statement.

Checksum

Starting with MySQL 5.6, there is a checksum at the end of the event, if the server is configured to generate one. The checksum is a 32-bit integer that is used to check that the event has not been corrupted since it was written.

A complete listing of the formats of all events is beyond the scope of this book, but because the Format_description and Rotate events are critical to how the other events are interpreted, we will briefly cover them here. If you are interested in the details of the events, you can find them in the MySQL Internals Manual.

As already noted, the Format_description event starts every binlog file and contains common information about the events in the file. The result is that the Format_de⁠scription event can be different between different files; this typically occurs when a server is upgraded and restarted. The Format_description_log_event contains the following fields:

Binlog file format version

This is the version of the binlog file, which should not be confused with the version of the server. MySQL versions 3.23, 4.0, and 4.1 use version 3 of the binary log, while MySQL versions 5.0 and later use version 4 of the binary log.

The binlog file format version changes when developers make significant changes in the overall structure of the file or the events. In MySQL version 5.0, the start event for a binlog file was changed to use a different format and the common headers for all events were also changed, which prompted the change in the binlog file format version.

Server version

This is a version string denoting the server that created the file. This includes the version of the server as well as additional information if special builds are made. The format is normally the three-position version number, followed by a hyphen and any additional build options. For example, “5.5.10-debug-log” means debug build version 5.5.10 of the server.

Common header length

This field stores the length of the common header. Because it’s here in the Format_description, this length can be different for different binlog files. This holds for all events except the Format_description and Rotate events, which cannot vary.

Post-header lengths

The post-header length for each event is fixed within a binlog file, and this field stores an array of the post-header length for each event that can occur in the binlog file. Because the number of event types can vary between servers, the number of different event types that the server can produce is stored before this field.

Note

The Rotate and Format_description log events have a fixed length because the server needs them before it knows the size of the common header length. When connecting to the server, it first sends a Format_description event. Because the length of the common header is stored in the Format_description event, there is no way for the server to know what the size of the common header is for the Rotate event unless it has a fixed size. So for these two events, the size of the common header is fixed and will never change between server versions, even if the size of the common header changes for other events.

Because both the size of the common header and the size of the post header for each event type are given in the Format_description event, extending the format with new events or even increasing the size of the post headers by adding new fields is supported by this format and will therefore not require a change in the binlog file format.

With each extension, particular care is taken to ensure that the extension does not affect interpretation of events that were already in earlier versions. For example, the common header can be extended with an additional field to indicate that the event is compressed and the type of compression used, but if this field is missing—which would be the case if a slave is reading events from an old master—the server should still be able to fall back on its default behavior.

Event Checksums

Because hardware can fail and software can contain bugs, it is necessary to have some way to ensure that data corrupted by such events is not applied on the slave. Random failures can occur anywhere, and if they occur inside a statement, they often lead to a syntax error causing the slave to stop. However, relying on this to prevent corrupt events from being replicated is a poor way to ensure integrity of events in the binary log. This policy would not catch many types of corruptions, such as in timestamps, nor would it work for row-based events where the data is encoded in binary form and random corruptions are more likely to lead to incorrect data.

To ensure the integrity of each event, MySQL 5.6 introduced replication event checksums. When events are written, a checksum is added, and when the events are read, the checksum is computed for the event and compared against the checksum written with the event. If the checksums do not match, execution can be aborted before any attempt is made to apply the event on the slave. The computation of checksums can potentially impact performance, but benchmarking has demonstrated no noticeable performance degradation from the addition and checking of checksums, so they are enabled by default in MySQL 5.6. They can, however, be turned off if necessary.

In MySQL 5.6, checksums can be generated when changes are written either to the binary log or to the relay log, and verified when reading events back from one of these logs.

Replication event checksums are controlled using three options:

binlog-checksum=type: This option enables checksums and tells the server what checksum computation to use. Currently there are only two choices: CRC32 uses ISO-3309 CRC-32 checksums, whereas NONE turns off checksumming. The default is CRC32, meaning that checksums are generated.
master-verify-checksum=boolean: This option controls whether the master verifies the checksum when reading it from the binary log. This means that the event checksum is verified when it is read from the binary log by the dump thread (see Replication Architecture Basics), but before it is sent out, and also when using SHOW BINLOG EVENTS. If any of the events shown is corrupt, the command will throw an error. This option is off by default.
slave-sql-verify-checksum=boolean: This option controls whether the slave verifies the event checksum after reading it from the relay log and before applying it to the slave database. This option is off by default.

If you get a corrupt binary log or relay log, mysqlbinlog can be used to find the bad checksum using the --verify-binlog-checksum option. This option causes mysqlbinlog to verify the checksum of each event read and stop when a corrupt event is found, which the following example demonstrates:

$ client/mysqlbinlog --verify-binlog-checksum master-bin.000001
        .
        .
        .
# at 261
#110406  8:35:28 server id 1  end_log_pos 333 CRC32 0xed927ef2...
SET TIMESTAMP=1302071728/*!*/;
BEGIN
/*!*/;
# at 333
#110406  8:35:28 server id 1  end_log_pos 365 CRC32 0x01ed254d  Intvar
SET INSERT_ID=1/*!*/;
ERROR: Error in Log_event::read_log_event(): 'Event crc check failed!...
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

Logging Statements

Starting with MySQL 5.1, row-based replication is also available, which will be covered in Row-Based Replication.

In statement-based replication, the actual executed statement is written to the binary log together with some execution information, and the statement is re-executed on the slave. Because not all events can be logged as statements, there are some exceptions that you should be aware of. This section will describe the process of logging individual statements as well as the important caveats.

Because the binary log is a common resource—all threads write to it—it is critical to prevent two threads from updating the binary log at the same time. To handle this, a lock for the binary log—called the LOCK_log mutex—is acquired just before each group is written to the binary log and released just after the group has been written. Because all session threads for the server can potentially log transactions to the binary log, it is quite common for several session threads to block on this lock.

Logging Data Manipulation Language Statements

Data manipulation are usually language (DML) statements are usually DELETE, INSERT, and UPDATE statements. To support logging changes in a consistent manner, MySQL writes the binary log while transaction-level locks are held, and releases them after the binary log has been written.

To ensure that the binary log is updated consistently with the tables that the statement modifies, each statement is logged to the binary log during statement commit, just before the table locks are released. If the logging were not made as part of the statement, another statement could be “injected” between the changes that the statement introduces to the database and the logging of the statement to the binary log. This would mean that the statements would be logged in a different order than the order in which they took effect in the database, which could lead to inconsistencies between master and slave. For instance, an UPDATE statement with a WHERE clause could update different rows on the slave because the values in those rows might be different if the statement order changed.

Logging Data Definition Language Statements

Data definition language (DDL) statements affect a schema, such as CREATE TABLE and ALTER TABLE statements. These create or change objects in the filesystem—for example, table definitions are stored in .frm files and databases are represented as filesystem directories—so the server keeps information about these available in internal data structures. To protect the update of the internal data structure, it is necessary to acquire an internal lock (called LOCK_open) before altering the table definition.

Because a single lock is used to protect these data structures, the creation, alteration, and destruction of database objects can be a considerable source of performance problems. This includes the creation and destruction of temporary tables, which is quite common as a technique to create an intermediate result set to perform computations on.

If you are creating and destroying a lot of temporary tables, it is often possible to boost performance by reducing the creation (and subsequent destruction) of temporary tables.

Logging Queries

For statement-based replication, the most common binlog event is the Query event, which is used to write a statement executed on the master to the binary log. In addition to the actual statement executed, the event contains some additional information necessary to execute the statement.

Recall that the binary log can be used for many purposes and contains statements in a potentially different order from that in which they were executed on the master. In some cases, part of the binary log may be played back to a server to perform PITR, and in some cases, replication may start in the middle of a sequence of events because a backup has been restored on a slave before starting replication.

In all these cases, the events are executing in different contexts (i.e., there is information that is implicit when the server executes the statement but that has to be known to execute the statement correctly). Examples include:

Current database: If the statement refers to a table, function, or procedure without qualifying it with the database, the current database is implicit for the statement.
Value of user-defined variable: If a statement refers to a user-defined variable, the value of the variable is implicit for the statement.
Seed for the RAND function: The RAND function is based on a pseudorandom number function, meaning that it can generate a sequence of numbers that are reproducible but appear random in the sense that they are evenly distributed. The function is not really random, but starts from a seed number and applies a pseudorandom function to generate a deterministic sequence of numbers. This means that given the same seed, the RAND function will always return the same number. However, this makes the seed implicit for the statement.
The current time: Obviously, the time the statement started executing is implicit. Having a correct time is important when calling functions that are dependent on the current time—such as NOW and UNIX_TIMESTAMP—because otherwise they will return different results if there is a delay between the statement execution on the master and on the slave.
Value used when inserting into an AUTO_INCREMENT column: If a statement inserts a row into a table with a column defined with the AUTO_IN⁠CRE⁠MENT attribute, the value used for that row is implicit for the statement because it depends on the rows inserted before it.
Value returned by a call to LAST_INSERT_ID: If the LAST_INSERT_ID function is used in a statement, it depends on the value inserted by a previous statement, which makes this value implicit for the statement.
Thread ID: For some statements, the thread ID is implicit. For example, if the statement refers to a temporary table or uses the CURRENT_ID function, the thread ID is implicit for the statement.

Because the context for executing the statements cannot be known when they’re replayed—either on a slave or on the master after a crash and restart—it is necessary to make the implicit information explicit by adding it to the binary log. This is done in slightly different ways for different kinds of information.

In addition to the previous list, some information is implicit to the execution of triggers and stored routines, but we will cover that separately in Triggers, Events, and Stored Routines.

Let’s consider each of the cases of implicit information individually, demonstrate the problem with each one, and examine how the server handles it.

Current database

The log records the current database by adding it to a special field of the Query event. This field also exists for the events used to handle the LOAD DATA INFILE statement, discussed in LOAD DATA INFILE Statements, so the description here applies to that statement as well. The current database also plays an important role in filtering on the database and is described later in this chapter.

Current time

Five functions use the current time to compute their values: NOW, CURDATE, CURTIME, UNIX_TIMESTAMP, and SYSDATE. The first four functions return a value based on the time when the statement started to execute. In contrast, SYSDATE returns the value when the function is executed. The difference can best be demonstrated by comparing the execution of NOW and SYSDATE with an intermediate sleep:

mysql> SELECT SYSDATE(), NOW(), SLEEP(2), SYSDATE(), NOW()\G
*************************** 1. row ***************************
SYSDATE(): 2013-06-08 23:24:08
    NOW(): 2013-06-08 23:24:08
 SLEEP(2): 0
SYSDATE(): 2013-06-08 23:24:10
    NOW(): 2013-06-08 23:24:08
1 row in set (2.00 sec)

Both functions are evaluated when they are encountered, but NOW returns the time that the statement started executing, whereas SYSDATE returns the time when the function was executed.

To handle these time functions correctly, the timestamp indicating when the event started executing is stored in the event. This value is then copied from the event to the slave execution thread and used as if it were the time the event started executing when computing the value of the time functions.

Because SYSDATE gets the time from the operating system directly, it is not safe for statement-based replication and will return different values on the master and slave when executed. So unless you really want to have the actual time inserted into your tables, it is prudent to stay away from this function.

Context events

Some implicit information is associated with statements that meet certain conditions:

If the statement contains a reference to a user-defined variable (as in Example 4-1), it is necessary to add the value of the user-defined variable to the binary log.
If the statement contains a call to the RAND function, it is necessary to add the pseudorandom seed to the binary log.
If the statement contains a call to the LAST_INSERT_ID function, it is necessary to add the last inserted ID to the binary log.
If the statement performs an insert into a table with an AUTO_INCREMENT column, it is necessary to add the value that was used for the column (or columns) to the binary log.

Example 4-1. Statements with user-defined variables

SET @value = 45;
INSERT INTO t1 VALUES (@value);

In each of these cases, one or more context events are added to the binary log before the event containing the query is written. Because there can be several context events preceding a Query event, the binary log can handle multiple user-defined variables together with the RAND function, or (almost) any combination of the previously listed conditions. The binary log stores the necessary context information through the following events:

User_var

Each such event records the name and value of a single user-defined variable.

Rand

Records the random number seed used by the RAND function. The seed is fetched internally from the session’s state.

Intvar

If the statement is inserting into an autoincrement column, this event records the value of the internal autoincrement counter for the table before the statement starts.

If the statement contains a call to LAST_INSERT_ID, this event records the value that this function returned in the statement.

Example 4-2 shows some statements that generate all of the context events and how the events appear when displayed using SHOW BINLOG EVENTS. Note that there can be several context events before each statement.

Example 4-2. Query events with context events

master> SET @foo = 12;
Query OK, 0 rows affected (0.00 sec)

master> SET @bar = 'Smoothnoodlemaps';
Query OK, 0 rows affected (0.00 sec)

master> INSERT INTO t1(b,c) VALUES
     ->   (@foo,@bar), (RAND(), 'random');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

master> INSERT INTO t1(b) VALUES (LAST_INSERT_ID());
Query OK, 1 row affected (0.00 sec)

master> SHOW BINLOG EVENTS FROM 238\G
*************************** 1. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 238
 Event_type: Query
  Server_id: 1
End_log_pos: 306
       Info: BEGIN
*************************** 2. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 306
 Event_type: Intvar
  Server_id: 1
End_log_pos: 334
       Info: INSERT_ID=1
*************************** 3. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 334
 Event_type: RAND
  Server_id: 1
End_log_pos: 369
       Info: rand_seed1=952494611,rand_seed2=949641547
*************************** 4. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 369
 Event_type: User var
  Server_id: 1
End_log_pos: 413
       Info: @`foo`=12
*************************** 5. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 413
 Event_type: User var
  Server_id: 1
End_log_pos: 465
       Info: @`bar`=_utf8 0x536D6F6F74686E6F6F6...
*************************** 6. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 465
 Event_type: Query
  Server_id: 1
End_log_pos: 586
       Info: use `test`; INSERT INTO t1(b,c) VALUES (@foo,@bar)...
*************************** 7. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 586
 Event_type: Xid
  Server_id: 1
End_log_pos: 613
       Info: COMMIT /* xid=44 */
*************************** 8. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 613
 Event_type: Query
  Server_id: 1
End_log_pos: 681
       Info: BEGIN
*************************** 9. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 681
 Event_type: Intvar
  Server_id: 1
End_log_pos: 709
       Info: LAST_INSERT_ID=1
*************************** 10. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 709
 Event_type: Intvar
  Server_id: 1
End_log_pos: 737
       Info: INSERT_ID=3
*************************** 11. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 737
 Event_type: Query
  Server_id: 1
End_log_pos: 843
       Info: use `test`; INSERT INTO t1(b) VALUES (LAST_INSERT_ID())
*************************** 12. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 843
 Event_type: Xid
  Server_id: 1
End_log_pos: 870
       Info: COMMIT /* xid=45 */
12 rows in set (0.00 sec)

Thread ID

The last implicit piece of information that the binary log sometimes needs is the thread ID of the MySQL session handling the statement. The thread ID is necessary when a function is dependent on the thread ID—such as when it refers to CONNECTION_ID—but most importantly for handling temporary tables.

Temporary tables are specific to each thread, meaning that two temporary tables with the same name are allowed to coexist, provided they are defined in different sessions. Temporary tables can provide an effective means to improve the performance of certain operations, but they require special handling to work with the binary log.

Internally in the server, temporary tables are handled by creating obscure names for storing the table definitions. The names are based on the process ID of the server, the thread ID that creates the table, and a thread-specific counter to distinguish between different instances of the table from the same thread. This naming scheme allows tables from different threads to be distinguished from each other, but each statement can access its proper table only if the thread ID is stored in the binary log.

Similar to how the current database is handled in the binary log, the thread ID is stored as a separate field in every Query event and can therefore be used to compute thread-specific data and handle temporary tables correctly.

When writing the Query event, the thread ID to store in the event is read from the pseudo_thread_id server variable. This means that it can be set before executing a statement, but only if you have SUPER privileges. This server variable is intended to be used by mysqlbinlog to emit statements correctly and should not normally be used.

For a statement that contains a call to the CONNECTION_ID function or that uses or creates a temporary table, the Query event is marked as thread-specific in the binary log. Because the thread ID is always present in the Query event, this flag is not necessary but is mainly used to allow mysqlbinlog to avoid printing unnecessary assignments to the pseudo_thread_id variable.

LOAD DATA INFILE Statements

The LOAD DATA INFILE statement makes it easy to fill tables quickly from a file. Unfortunately, it is dependent on a certain kind of context that cannot be covered by the context events we have discussed: files that need to be read from the filesystem.

To handle LOAD DATA INFILE, the MySQL server uses a special set of events to handle the transfer of the file using the binary log. In addition to solving the problem for LOAD DATA INFILE, this makes the statement a very convenient tool for transferring large amounts of data from the master to the slave, as you will see soon. To correctly transfer and execute a LOAD DATA INFILE statement, several new events are introduced into the binary log:

Begin_load_query

This event signals the start of data transfer in the file.

Append_block

A sequence of one or more of these events follows the Begin_load_query event to contain the rest of the file’s data, if the file was larger than the maximum allowed packet size on the connection.

Execute_load_query

This event is a specialized variant of the Query event that contains the LOAD DATA INFILE statement executed on the master.

Even though the statement contained in this event contains the name of the file that was used on the master, this file will not be sought by the slave. Instead, the contents provided by the preceding Begin_load_query and Append_block events will be used.

For each LOAD DATA INFILE statement executed on the master, the file to read is mapped to an internal file-backed buffer, which is used in the following processing. In addition, a unique file ID is assigned to the execution of the statement and is used to refer to the file read by the statement.

While the statement is executing, the file contents are written to the binary log as a sequence of events starting with a Begin_load_query event—which indicates the beginning of a new file—followed by zero or more Append_block events. Each event written to the binary log is no larger than the maximum allowed packet size, as specified by the max-allowed-packet option.

After the entire file is read and applied to the table, the execution of the statement terminates by writing the Execute_load_query event to the binary log. This event contains the statement executed together with the file ID assigned to the execution of the statement. Note that the statement is not the original statement as the user wrote it, but rather a recreated version of the statement.

Note

If you are reading an old binary log, you might instead find Load_log_event, Execute_log_event, and Create_file_log_event. These were the events used to replicate LOAD DATA INFILE prior to MySQL version 5.0.3 and were replaced by the implementation just described.

Example 4-3 shows the events written to the binary log by a successful execution of a LOAD DATA INFILE statement. In the Info field, you can see the assigned file ID—1, in this case—and see that it is used for all the events that are part of the execution of the statement. You can also see that the file foo.dat used by the statement contains more than the maximum allowed packet size of 16384, so it is split into three events.

Example 4-3. Successful execution of LOAD DATA INFILE

master> SHOW BINLOG EVENTS IN 'master-bin.000042' FROM 269\G
*************************** 1. row ***************************
   Log_name: master-bin.000042
        Pos: 269
 Event_type: Begin_load_query
  Server_id: 1
End_log_pos: 16676
       Info: ;file_id=1;block_len=16384
*************************** 2. row ***************************
   Log_name: master-bin.000042
        Pos: 16676
 Event_type: Append_block
  Server_id: 1
End_log_pos: 33083
       Info: ;file_id=1;block_len=16384
*************************** 3. row ***************************
   Log_name: master-bin.000042
        Pos: 33083
 Event_type: Append_block
  Server_id: 1
End_log_pos: 33633
       Info: ;file_id=1;block_len=527
*************************** 4. row ***************************
   Log_name: master-bin.000042
        Pos: 33633
 Event_type: Execute_load_query
  Server_id: 1
End_log_pos: 33756
       Info: use `test`; LOAD DATA INFILE 'foo.dat' INTO...;file_id=1
4 rows in set (0.00 sec)

Binary Log Filters

It is possible to filter out statements from the binary log using two options: binlog-do-db and binlog-ignore-db (which we will call binlog-*-db, collectively). The binlog-do-db option is used when you want to filter only statements belonging to a certain database, and binlog-ignore-db is used when you want to ignore a certain database but replicate all other databases.

These options can be given multiple times, so to filter out both the database one_db and the database two_db, you must give both options in the my.cnf file. For example:

[mysqld]
binlog-ignore-db=one_db
binlog-ignore-db=two_db

The way MySQL filters events can be quite a surprise to unfamiliar users, so we’ll explain how filtering works and make some recommendations on how to avoid some of the major headaches.

Figure 4-3 shows how MySQL determines whether the statement is filtered. The filtering is done on a statement level—either the entire statement is filtered out or the entire statement is written to the binary log—and the binlog-*-db options use the current database to decide whether the statement should be filtered, not the database of the tables affected by the statement.

To help you understand the behavior, consider the statements in Example 4-4. Each line uses bad as the current database and changes tables in different databases.

Figure 4-3. Logic for binlog-*-db filters

Example 4-4. Statements using different databases

USE bad; INSERT INTO t1 VALUES (1),(2);
USE bad; INSERT INTO good.t2 VALUES (1),(2);
USE bad; UPDATE good.t1, ugly.t2 SET a = b;

: This line changes a table in the database named bad since it does not qualify the table name with a database name.
: This line changes a table in a different database than the current database.
: This line changes two tables in two different databases, neither of which is the current database.

Now, given these statements, consider what happens if the bad database is filtered using binlog-ignore-db=bad. None of the three statements in Example 4-4 will be written to the binary log, even though the second and third statements change tables on the good and ugly database and make no reference to the bad database. This might seem strange at first—why not filter the statement based on the database of the table changed? But consider what would happen with the third statement if the ugly database was filtered instead of the bad database. Now one database in the UPDATE is filtered out and the other isn’t. This puts the server in a catch-22 situation, so the problem is solved by just filtering on the current database, and this rule is used for all statements (with a few exceptions).

Note

To avoid mistakes when executing statements that can potentially be filtered out, make it a habit not to write statements so they qualify table, function, or procedure names with the database name. Instead, whenever you want to access a table in a different database, issue a USE statement to make that database the current database. In other words, instead of writing:

INSERT INTO other.book VALUES ('MySQL', 'Paul DuBois');

write:

USE other; INSERT INTO book VALUES ('MySQL', 'Paul DuBois');

Using this practice, it is easy to see by inspection that the statement does not update multiple databases simply because no tables should be qualified with a database name.

This behavior does not apply when row-based replication is used. The filtering employed in row-based replication will be discussed in Filtering in Row-Based Replication, but since row-based replication can work with each individual row change, it is able to filter on the actual table that the row is targeted for and does not use the current database.

So, what happens when both binlog-do-db and binlog-ignore-db are used at the same time? For example, consider a configuration file containing the following two rules:

[mysqld]
binlog-do-db=good
binlog-ignore-db=bad

In this case, will the following statement be filtered out or not?

USE ugly; INSERT INTO t1 VALUES (1);

Following the diagram in Figure 4-3, you can see that if there is at least a binlog-do-db rule, all binlog-ignore-db rules are ignored completely, and since only the good database is included, the previous statement will be filtered out.

Warning

Because of the way that the binlog-*-db rules are evaluated, it is pointless to have both binlog-do-db and binlog-ignore-db rules at the same time. Since the binary log can be used for recovery as well as replication, the recommendation is not to use the binlog-*-db options and instead filter out the event on the slave by using replicate-* options (these are described in Filtering and skipping events). Using the binlog-*-db option would filter out statements from the binary log and you will not be able to restore the database from the binary log in the event of a crash.

Triggers, Events, and Stored Routines

A few other constructions that are treated specially when logged are stored programs—that is, triggers, events, and stored routines (the last is a collective name for stored procedures and stored functions). Their treatment with respect to the binary log contains some elements in common, so they will be covered together in this section. The explanation distinguishes statements of two types: statements that define, destroy, or alter stored programs and statements that invoke them.

Statements that define or destroy stored programs

The following discussion shows triggers in the examples, but the same principles apply to definition of events and stored routines. To understand why the server needs to handle these features specially when writing them to the binary log, consider the code in Example 4-5.

In the example, a table named employee keeps information about all employees of an imagined system and a table named log keeps a log of interesting information. Note that the log table has a timestamp column that notes the time of a change and that the name column in the employee table is the primary key for the table. There is also a status column to tell whether the addition succeeded or failed.

To track information about employee information changes—for example, for auditing purposes—three triggers are created so that whenever an employee is added, removed, or changed, a log entry of the change is added to a log table.

Notice that the triggers are after triggers, which means that entries are added only if the executed statement is successful. Failed statements will not be logged. We will later extend the example so that unsuccessful attempts are also logged.

Example 4-5. Definitions of tables and triggers for employee administration

CREATE TABLE employee (
   name CHAR(64) NOT NULL,
   email CHAR(64),
   password CHAR(64),
   PRIMARY KEY (name)
);

CREATE TABLE log (
   id INT AUTO_INCREMENT,
   email CHAR(64),
   status CHAR(10),
   message TEXT,
   ts TIMESTAMP,
   PRIMARY KEY (id)
);

CREATE TRIGGER tr_employee_insert_after AFTER INSERT ON employee
FOR EACH ROW
   INSERT INTO log(email, status, message)
     VALUES (NEW.email, 'OK', CONCAT('Adding employee ', NEW.name));

CREATE TRIGGER tr_employee_delete_after AFTER DELETE ON employee
FOR EACH ROW
   INSERT INTO log(email, status, message)
     VALUES (OLD.email, 'OK', 'Removing employee');

delimiter $$
CREATE TRIGGER tr_employee_update_after AFTER UPDATE ON employee
FOR EACH ROW
BEGIN
  IF OLD.name != NEW.name THEN
    INSERT INTO log(email, status, message)
      VALUES (OLD.email, 'OK',
              CONCAT('Name change from ', OLD.name, ' to ', NEW.name));
  END IF;
  IF OLD.password != NEW.password THEN
    INSERT INTO log(email, status, message)
      VALUES (OLD.email, 'OK', 'Password change');
  END IF;
  IF OLD.email != NEW.email THEN
    INSERT INTO log(email, status, message)
      VALUES (OLD.email, 'OK', CONCAT('E-mail change to ', NEW.email));
  END IF;
END $$
delimiter ;

With these trigger definitions, it is now possible to add and remove employees as shown in Example 4-6. Here an employee is added, modified, and removed, and as you can see, each of the operations is logged to the log table.

The operations of adding, removing, and modifying employees may be done by a user who has access to the employee table, but what about access to the log table? In this case, a user who can manipulate the employee table should not be able to make changes to the log table. There are many reasons for this, but they all boil down to trusting the contents of the log table for purposes of maintenance, auditing, disclosure to legal authorities, and so on. So the DBA may choose to make access to the employee table available to many users while keeping access to the log table very restricted.

Example 4-6. Adding, removing, and modifying users

master> SET @pass = PASSWORD('xyzzy');
Query OK, 0 rows affected (0.00 sec)

master> INSERT INTO employee VALUES ('mats', 'mats@example.com', @pass);

Query OK, 1 row affected (0.00 sec)

master> UPDATE employee SET name = 'matz'
     -> WHERE email = 'mats@example.com';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

master> SET @pass = PASSWORD('foobar');
Query OK, 0 rows affected (0.00 sec)

master> UPDATE employee SET password = @pass
     -> WHERE email = 'mats@example.com';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

master> DELETE FROM employee WHERE email = 'mats@example.com';
Query OK, 1 row affected (0.00 sec)

master> SELECT * FROM log;
+----+------------------+-------------------------------+---------------------+
| id | email            | message                       | ts                  |
+----+------------------+-------------------------------+---------------------+
|  1 | mats@example.com | Adding employee mats          | 2012-11-14 18:56:08 |
|  2 | mats@example.com | Name change from mats to matz | 2012-11-14 18:56:11 |
|  3 | mats@example.com | Password change               | 2012-11-14 18:56:41 |
|  4 | mats@example.com | Removing employee             | 2012-11-14 18:57:11 |
+----+------------------+-------------------------------+---------------------+
4 rows in set (0.00 sec)

The INSERT, UPDATE, and DELETE in the example can generate a warning that it is unsafe to log when statement mode is used. This is because it invokes a trigger that inserts into an autoincrement column. In general, such warnings should be investigated.

To make sure the triggers can execute successfully against a highly protected table, they are executed as the user who defined the trigger, not as the user who changed the contents of the employee table. So the CREATE TRIGGER statements in Example 4-5 are executed by the DBA, who has privileges to make additions to the log table, whereas the statements altering employee information in Example 4-6 are executed through a user management account that only has privileges to change the employee table.

When the statements in Example 4-6 are executed, the employee management account is used for updating entries in the employee table, but the DBA privileges are used to make additions to the log table. The employee management account cannot be used to add or remove entries from the log table.

As an aside, Example 4-6 assigns passwords to a user variable before using them in the statement. This is done to avoid sending sensitive data in plain text to another server.

In general, a user with REPLICATION SLAVE privileges has privileges to read everything that occurs on the master and should therefore be secured so that the account cannot be compromised. Details are beyond the scope of this book, but here are some examples of precautions you can take:

Make it impossible to log in to the account from outside the firewall.
Track all login attempts for accounts with REPLICATION SLAVE privileges, and place the log on a separate secure server.
Encrypt the connection between the master and the slave using, for example, MySQL’s built-in Secure Sockets Layer (SSL) support.

Even if the account has been secured, there is information that does not have to be in the binary log, so it makes sense not to store it there in the first place.

One of the more common types of sensitive information is passwords. Events containing passwords can be written to the binary log when executing statements that change tables on the server and that include the password required for access to the tables.

A typical example is:

UPDATE employee SET pass = PASSWORD('foobar')
  WHERE email = 'chuck@example.com';

If replication is in place, it is better to rewrite this statement without the password. This is done by computing and storing the hashed password into a user-defined variable and then using that in the expression:

SET @password = PASSWORD('foobar');
UPDATE employee SET pass = @password WHERE email = 'chuck@example.com';

Since the SET statement is not replicated, the original password will not be stored in the binary log, only in the memory of the server while executing the statement.

As long as the password hash, rather than the plain-text password, is stored in the table, this technique works. If the raw password is stored directly in the table, there is no way to prevent the password from ending up in the binary log. But storing hashes for passwords is a standard good practice in any case, to prevent someone who gets his hands on the raw data from learning the passwords.

Encrypting the connection between the master and the slave offers some protection, but if the binary log itself is compromised, encrypting the connection doesn’t help.

If you recall the earlier discussion about implicit information, you may already have noticed that both the user executing a line of code and the user who defines a trigger are implicit. As you will see in Chapter 8, neither the definer nor the invoker of the trigger is critical to executing the trigger on the slave, and the user information is effectively ignored when the slave executes the statement. However, the information is important when the binary log is played back to a server—for instance, when doing a PITR.

To play back a binary log to the server without problems in handling privileges on all the various tables, it is necessary to execute all the statements as a user with SUPER privileges. But the triggers may not have been defined using SUPER privileges, so it is important to recreate the triggers with the correct user as the trigger’s definer. If a trigger is defined with SUPER privileges instead of by the user who defined the trigger originally, it might cause a privilege escalation.

To permit a DBA to specify the user under which to execute a trigger, the CRE⁠ATE TRIG⁠GER syntax includes an optional DEFINER clause. If a DEFINER is not given to the statements—as is the case in Example 4-7—the statement will be rewritten for the binary log to add a DEFINER clause and use the current user as the definer. This means that the definition of the insert trigger appears in the binary log, as shown in Example 4-7. It lists the account that created the trigger (root@localhost) as the definer, which is what we want in this case.

Example 4-7. A CREATE TRIGGER statement in the binary log

master> SHOW BINLOG EVENTS FROM 92236 LIMIT 1\G
*************************** 1. row ***************************
   Log_name: master-bin.000038
        Pos: 92236
 Event_type: Query
  Server_id: 1
End_log_pos: 92491
       Info: use `test`; CREATE DEFINER=`root`@`localhost` TRIGGER ...
1 row in set (0.00 sec)

Statements that invoke triggers and stored routines

Moving over from definitions to invocations, we can ask how the master’s triggers are handled during replication. Well, actually they’re not handled at all.

The statement that invokes the trigger is logged to the binary log, but it is not linked to the particular trigger. Instead, when the slave executes the statement, it automatically executes any triggers associated with the tables affected by the statement. This means that there can be different triggers on the master and the slave, and the triggers on the master will be invoked on the master while the triggers on the slave will be invoked on the slave. For example, if the trigger to add entries to the log table is not necessary on the slave, performance can be improved by eliminating the trigger from the slave.

Still, any context events necessary for replicating correctly will be written to the binary log before the statement that invokes the trigger, even if it is just the statements in the trigger that require the context events. Thus, Example 4-8 shows the binary log after executing the INSERT statement in Example 4-5. Note that the first event writes the INSERT ID for the log table’s primary key. This reflects the use of the log table in the trigger, but it might appear to be redundant because the slave will not use the trigger.

You should, however, note that using different triggers on the master and slave—or no trigger at all on either the master or slave—is the exception and that the INSERT ID is necessary for replicating the INSERT statement correctly when the trigger is both on the master and slave.

Example 4-8. Contents of the binary log after executing INSERT

master> SHOW BINLOG EVENTS FROM 93340\G
*************************** 1. row ***************************
   Log_name: master-bin.000038
        Pos: 93340
 Event_type: Intvar
  Server_id: 1
End_log_pos: 93368
       Info: INSERT_ID=1
*************************** 2. row ***************************
   Log_name: master-bin.000038
        Pos: 93368
 Event_type: User var
  Server_id: 1
End_log_pos: 93396h
       Info: @`pass`=_utf8 0x2A3942353030333433424335324...
utf8_general_ci
*************************** 3. row ***************************
   Log_name: master-bin.000038
        Pos: 93396
 Event_type: Query
  Server_id: 1
End_log_pos: 93537
       Info: use `test`; INSERT INTO employee VALUES ...
3 rows in set (0.00 sec)

Stored Procedures

Stored functions and stored procedures are known by the common name stored routines. Since the server treats stored procedures and stored functions very differently, stored procedures will be covered in this section and stored functions in the next section.

The situation for stored routines is similar to triggers in some aspects, but very different in others. Like triggers, stored routines offer a DEFINER clause, and it must be explicitly added to the binary log whether or not the statement includes it. But the invocation of stored routines is handled differently from triggers.

To begin, let’s extend Example 4-6, which defines tables for employees and logs, with some utility routines to work with the employees. Even though this can be handled with standard INSERT, DELETE, and UPDATE statements, we’ll use stored procedures to demonstrate some issues involved in writing them to the binary log. For these purposes, let’s extend the example with the functions in Example 4-9 for adding and removing employees.

Example 4-9. Stored procedure definitions for managing employees

delimiter $$
CREATE PROCEDURE employee_add(p_name CHAR(64), p_email CHAR(64),
                              p_password CHAR(64))
   MODIFIES SQL DATA
BEGIN
   DECLARE l_pass CHAR(64);
   SET l_pass = PASSWORD(p_password);
   INSERT INTO employee(name, email, password)
     VALUES (p_name, p_email, l_pass);
END $$

CREATE PROCEDURE employee_passwd(p_email CHAR(64),
                                 p_password CHAR(64))
   MODIFIES SQL DATA
BEGIN
   DECLARE l_pass CHAR(64);
   SET l_pass = PASSWORD(p_password);
   UPDATE employee SET password = l_pass
    WHERE email = p_email;
END $$

CREATE PROCEDURE employee_del(p_name CHAR(64))
   MODIFIES SQL DATA
BEGIN
   DELETE FROM employee WHERE name = p_name;
END $$
delimiter ;

For the employee_add and employee_passwd procedures, we have extracted the encrypted password into a separate variable for the reasons already explained, but the employee_del procedure just contains a DELETE statement, since nothing else is needed. A binlog entry corresponding to one function is:

master> SHOW BINLOG EVENTS FROM 97911 LIMIT 1\G
*************************** 1. row ***************************
   Log_name: master-bin.000038
        Pos: 97911
 Event_type: Query
  Server_id: 1
End_log_pos: 98275
       Info: use `test`; CREATE DEFINER=`root`@`localhost`PROCEDURE ...
1 row in set (0.00 sec)

As expected, the definition of this procedure is extended with the DEFINER clause before writing the definition to the binary log, but apart from that, the body of the procedure is left intact. Notice that the CREATE PROCEDURE statement is replicated as a Query event, as are all DDL statements.

In this regard, stored routines are similar to triggers in the way they are treated by the binary log. But invocation differs significantly from triggers. Example 4-10 calls the procedure that adds an employee and shows the resulting contents of the binary log.

Example 4-10. Calling a stored procedure

master> CALL employee_add('chuck', 'chuck@example.com', 'abrakadabra');
Query OK, 1 row affected (0.00 sec)

master> SHOW BINLOG EVENTS FROM 104033\G
*************************** 1. row ***************************
   Log_name: master-bin.000038
        Pos: 104033
 Event_type: Intvar
  Server_id: 1
End_log_pos: 104061
       Info: INSERT_ID=1
*************************** 2. row ***************************
   Log_name: master-bin.000038
        Pos: 104061
 Event_type: Query
  Server_id: 1
End_log_pos: 104416
       Info: use `test`; INSERT INTO employee(name, email, password)
             VALUES ( NAME_CONST('p_name',_utf8'chuck' COLLATE ...),
             NAME_CONST('p_email',_utf8'chuck@example.com' COLLATE ...),
             NAME_CONST('pass',_utf8'*FEB349C4FDAA307A...' COLLATE ...))
2 rows in set (0.00 sec)

In Example 4-10, there are four things that you should note:

The CALL statement is not written to the binary log. Instead, the statements executed as a result of the call are written to the binary log. In other words, the body of the stored procedure is unrolled into the binary log.
The statement is rewritten to not contain any references to the parameters of the stored procedure—that is, p_name, p_email, and p_password. Instead, the NAME_CONST function is used for each parameter to create a result set with a single value.
The locally declared variable pass is also replaced with a NAME_CONST expression, where the second parameter contains the encrypted password.
Just as when a statement that invokes a trigger is written to the binary log, the statement that calls the stored procedure is preceded by an Intvar event holding the insert ID used when adding the employee to the log table.

Since neither the parameter names nor the locally declared names are available outside the stored routine, NAME_CONST is used to associate the name of the parameter or local variable with the constant value used when executing the function. This guarantees that the value can be used in the same way as the parameter or local variable. However, this change is not significant; currently it offers no advantages over using the parameters directly.

Stored Functions

Stored functions share many similarities with stored procedures and some similarities with triggers. Similar to both stored procedures and triggers, stored functions have a DEFINER clause that is normally (but not always) used when the CREATE FUNCTION statement is written to the binary log.

In contrast to stored procedures, stored functions can return scalar values and you can therefore embed them in various places in SQL statements. For example, consider the definition of a stored routine in Example 4-11, which extracts the email address of an employee given the employee’s name. The function is a little contrived—it is significantly more efficient to just execute statements directly—but it suits our purposes well.

Example 4-11. A stored function to fetch the name of an employee

delimiter $$
CREATE FUNCTION employee_email(p_name CHAR(64))
   RETURNS CHAR(64)
   DETERMINISTIC
BEGIN
   DECLARE l_email CHAR(64);
   SELECT email INTO l_email FROM employee WHERE name = p_name;
   RETURN l_email;
END $$
delimiter ;

This stored function can be used conveniently in other statements, as shown in Example 4-12. In contrast to stored procedures, stored functions have to specify a characteristic—such as DETERMINISTIC, NO SQL, or READS SQL DATA—if they are to be written to the binary log.

Example 4-12. Examples of using the stored function

master> CREATE TABLE collected (
     ->   name CHAR(32),
     ->   email CHAR(64)
     -> );
Query OK, 0 rows affected (0.09 sec)

master> INSERT INTO collected(name, email)
     ->  VALUES ('chuck', employee_email('chuck'));
Query OK, 1 row affected (0.01 sec)

master> SELECT employee_email('chuck');
+-------------------------+
| employee_email('chuck') |
+-------------------------+
| chuck@example.com       |
+-------------------------+
1 row in set (0.00 sec)

When it comes to calls, stored functions are replicated in the same manner as triggers: as part of the statement that executes the function. For instance, the binary log doesn’t need any events preceding the INSERT statement in Example 4-12, but it will contain the context events necessary to replicate the stored function inside the INSERT.

What about SELECT? Normally, SELECT statements are not written to the binary log since they don’t change any data, but a SELECT containing a stored function, as in Example 4-13, is an exception.

Example 4-13. Example of stored function that updates a table

CREATE TABLE log(log_id INT AUTO_INCREMENT PRIMARY KEY, msg TEXT);

delimiter $$
CREATE FUNCTION log_message(msg TEXT)
  RETURNS INT
  DETERMINISTIC
BEGIN
  INSERT INTO log(msg) VALUES(msg);
  RETURN LAST_INSERT_ID();
END $$
delimiter ;

SELECT log_message('Just a test');

When executing the stored function, the server notices that it adds a row to the log table and marks the statement as an “updating” statement, which means that it will be written to the binary log. So, for the slightly artificial example in Example 4-13, the binary log will contain the event:

*************************** 7. row ***************************
   Log_name: mysql-bin.000001
        Pos: 845
 Event_type: Query
  Server_id: 1
End_log_pos: 913
       Info: BEGIN
*************************** 8. row ***************************
   Log_name: mysql-bin.000001
        Pos: 913
 Event_type: Intvar
  Server_id: 1
End_log_pos: 941
       Info: LAST_INSERT_ID=1
*************************** 9. row ***************************
   Log_name: mysql-bin.000001
        Pos: 941
 Event_type: Intvar
  Server_id: 1
End_log_pos: 969
       Info: INSERT_ID=1
*************************** 10. row ***************************
   Log_name: mysql-bin.000001
        Pos: 969
 Event_type: Query
  Server_id: 1
End_log_pos: 1109
       Info: use `test`; SELECT `test`.`log_message`(_utf8'Just a test' COLLATE...
*************************** 11. row ***************************
   Log_name: mysql-bin.000001
        Pos: 1105
 Event_type: Xid
  Server_id: 1
End_log_pos: 1132
       Info: COMMIT /* xid=237 */

The CREATE ROUTINE privilege is required to define a stored procedure or stored function. Strictly speaking, no other privileges are needed to create a stored routine, but since it normally executes under the privileges of the definer, defining a stored routine would not make much sense if the definer of the procedure didn’t have the necessary privileges to read to or write from tables referenced by the stored procedure.

But replication threads on the slave execute without privilege checks. This leaves a serious security hole allowing any user with the CREATE ROUTINE privilege to elevate her privileges and execute any statement on the slave.

In MySQL versions earlier than 5.0, this does not cause problems, because all paths of a statement are explored when the statement is executed on the master. A privilege violation on the master will prevent a statement from being written to the binary log, so users cannot access objects on the slave that were out of bounds on the master. However, with the introduction of stored routines, it is possible to create conditional execution paths, and the server does not explore all paths when executing a stored routine.

Since stored procedures are unrolled, the exact statements executed on the master are also executed on the slave, and since the statement is logged only if it was successfully executed on the master, it is not possible to get access to other objects. Not so with stored functions.

If a stored function is defined with SQL SECURITY INVOKER, a malicious user can craft a function that will execute differently on the master and the slave. The security breach can then be buried in the branch executed on the slave. This is demonstrated in the following example:

CREATE FUNCTION magic()
  RETURNS CHAR(64)
  SQL SECURITY INVOKER
BEGIN
  DECLARE result CHAR(64);
  IF @@server_id <> 1 THEN
     SELECT what INTO result FROM secret.agents LIMIT 1;
     RETURN result;
  ELSE
     RETURN 'I am magic!';
  END IF;
END $$

One piece of code executes on the master (the ELSE branch), whereas a separate piece of code (the IF branch) executes on the slave where the privilege checks are disabled. The effect is to elevate the user’s privileges from CREATE ROUTINE to the equivalent of SUPER.

Notice that this problem doesn’t occur if the function is defined with SQL SECURITY DEFINER, because the function executes with the user’s privileges and will be blocked on the slave.

To prevent privilege escalation on a slave, MySQL requires SUPER privileges by default to define stored functions. But because stored functions are very useful, and some database administrators trust their users with creating proper functions, this check can be disabled with the log-bin-trust-function-creators option.

Events

The events feature is a MySQL extension, not part of standard SQL. Events, which should not be confused with binlog events, are handled by a stored program that is executed regularly by a special event scheduler.

Similar to all other stored programs, definitions of events are also logged with a DEFINER clause. Since events are invoked by the event scheduler, they are always executed as the definer and do not pose a security risk in the way that stored functions do.

When events are executed, the statements are written to the binary log directly.

Since the events will be executed on the master, they are automatically disabled on the slave and will therefore not be executed there. If the events were not disabled, they would be executed twice on the slave: once by the master executing the event and replicating the changes to the slave, and once by the slave executing the event directly.

Warning

Because the events are disabled on the slave, it is necessary to enable these events if the slave, for some reason, should lose the master.

So, for example, when promoting a slave as described in Chapter 5, don’t forget to enable the events that were replicated from the master. This is easiest to do using the following statement:

UPDATE mysql.events
    SET Status = ENABLED
  WHERE Status = SLAVESIDE_DISABLED;

The purpose of the check is to enable only the events that were disabled when being replicated from the master. There might be events that are disabled for other reasons.

Special Constructions

Even though statement-based replication is normally straightforward, some special constructions have to be handled with care. Recall that for the statement to be executed correctly on the slave, the context has to be correct for the statement. Even though the context events discussed earlier handle part of the context, some constructions have additional context that is not transferred as part of the replication process.

The LOAD_FILE function

The LOAD_FILE function allows you to fetch a file and use it as part of an expression. Although quite convenient at times, the file has to exist on the slave server to replicate correctly since the file is not transferred during replication, as the file to LOAD DATA INFILE is. With some ingenuity, you can rewrite a statement involving the LOAD_FILE function either to use the LOAD DATA INFILE statement or to define a user-defined variable to hold the contents of the file. For example, take the following statement that inserts a document into a table:

master> INSERT INTO document(author, body)
     ->   VALUES ('Mats Kindahl', LOAD_FILE('go_intro.xml'));

You can rewrite this statement to use LOAD DATA INFILE instead. In this case, you have to take care to specify character strings that cannot exist in the document as field and line delimiters, since you are going to read the entire file contents as a single column.

master> LOAD DATA INFILE 'go_intro.xml' INTO TABLE document
     ->    FIELDS TERMINATED BY '@*@' LINES TERMINATED BY '&%&'
     ->    (author, body) SET author = 'Mats Kindahl';

An alternative is to store the file contents in a user-defined variable and then use it in the statement.

master> SET @document = LOAD_FILE('go_intro.xml');
master> INSERT INTO document(author, body) VALUES
     ->    ('Mats Kindahl, @document);

Nontransactional Changes and Error Handling

So far we have considered only transactional changes and have not looked at error handling at all. For transactional changes, error handling is pretty uncomplicated: a statement that tries to change transactional tables and fails will not have any effect at all on the table. That’s the entire point of having a transactional system—so the changes that the statement attempts to introduce can be safely ignored. The same applies to transactions that are rolled back: they have no effect on the tables and can therefore simply be discarded without risking inconsistencies between the master and the slave.

A specialty of MySQL is the provisioning of nontransactional storage engines. This can offer some speed advantages, because the storage engine does not have to administer the transactional log that the transactional engines use, and it allows some optimizations on disk access. From a replication perspective, however, nontransactional engines require special considerations.

The most important aspect to note is that replication cannot handle arbitrary nontransactional engines, but has to make some assumptions about how they behave. Some of those limitations are lifted with the introduction of row-based replication in version 5.1—a subject that will be covered in Row-Based Replication—but even in that case, it cannot handle arbitrary storage engines.

One of the features that complicates the issue further, from a replication perspective, is that it is possible to mix transactional and nontransactional engines in the same transaction, and even in the same statement.

To continue with the example used earlier, consider Example 4-14, where the log table from Example 4-5 is given a nontransactional storage engine while the employee table is given a transactional one. We use the nontransactional MyISAM storage engine for the log table to improve its speed, while keeping the transactional behavior for the employee table.

We can further extend the example to track unsuccessful attempts to add employees by creating a pair of insert triggers: a before trigger and an after trigger. If an administrator sees an entry in the log with a status field of FAIL, it means the before trigger ran, but the after trigger did not, and therefore an attempt to add an employee failed.

Example 4-14. Definition of log and employee tables with storage engines

CREATE TABLE employee (
   name CHAR(64) NOT NULL,
   email CHAR(64),
   password CHAR(64),
   PRIMARY KEY (email)
) ENGINE = InnoDB;

CREATE TABLE log (
   id INT AUTO_INCREMENT,
   email CHAR(64),
   message TEXT,
   status ENUM('FAIL', 'OK') DEFAULT 'FAIL',
   ts TIMESTAMP,
   PRIMARY KEY (id)
) ENGINE = MyISAM;

delimiter $$
CREATE TRIGGER tr_employee_insert_before BEFORE INSERT ON employee
FOR EACH ROW
BEGIN
  INSERT INTO log(email, message)
    VALUES (NEW.email, CONCAT('Adding employee ', NEW.name));
  SET @LAST_INSERT_ID = LAST_INSERT_ID();
END $$
delimiter ;

CREATE TRIGGER tr_employee_insert_after AFTER INSERT ON employee
FOR EACH ROW
   UPDATE log SET status = 'OK' WHERE id = @LAST_INSERT_ID;

What are the effects of this change on the binary log?

To begin, let’s consider the INSERT statement from Example 4-6. Assuming the statement is not inside a transaction and AUTOCOMMIT is 1, the statement will be a transaction by itself. If the statement executes without errors, everything will proceed as planned and the statement will be written to the binary log as a Query event.

Now, consider what happens if the INSERT is repeated with the same employee. Since the email column is the primary key, this will generate a duplicate key error when the insertion is attempted, but what will happen with the statement? Is it written to the binary log or not?

Let’s have a look…

master> SET @pass = PASSWORD('xyzzy');
Query OK, 0 rows affected (0.00 sec)

master> INSERT INTO employee(name,email,password)
     ->   VALUES ('chuck','chuck@example.com',@pass);
ERROR 1062 (23000): Duplicate entry 'chuck@example.com' for key 'PRIMARY'
master> SELECT * FROM employee;
+------+--------------------+-------------------------------------------+
| name | email              | password                                  |
+------+--------------------+-------------------------------------------+
| chuck | chuck@example.com | *151AF6B8C3A6AA09CFCCBD34601F2D309ED54888 |
+------+--------------------+-------------------------------------------+
1 row in set (0.00 sec)

master> SHOW BINLOG EVENTS FROM 38493\G
*************************** 1. row ***************************
   Log_name: master-bin.000038
        Pos: 38493
 Event_type: User var
  Server_id: 1
End_log_pos: 38571
       Info: @`pass`=_utf8 0x2A31353141463642384333413641413...
*************************** 2. row ***************************
   Log_name: master-bin.000038
        Pos: 38571
 Event_type: Query
  Server_id: 1
End_log_pos: 38689
       Info: use `test`; INSERT INTO employee(name,email,password)...
2 rows in set (0.00 sec)

As you can see, the statement is written to the binary log even though the employee table is transactional and the statement failed. Looking at the contents of the table using the SELECT reveals that there is still a single employee, proving the statement was rolled back—so why is the statement written to the binary log?

Looking into the log table will reveal the reason.

master> SELECT * FROM log;
+----+------------------+------------------------+--------+---------------------+
| id | email            | message                | status | ts                  |
+----+------------------+------------------------+--------+---------------------+
|  1 | mats@example.com | Adding employee mats   | OK     | 2010-01-13 15:50:45 |
|  2 | mats@example.com | Name change from ...   | OK     | 2010-01-13 15:50:48 |
|  3 | mats@example.com | Password change        | OK     | 2010-01-13 15:50:50 |
|  4 | mats@example.com | Removing employee      | OK     | 2010-01-13 15:50:52 |
|  5 | mats@example.com | Adding employee mats   | OK     | 2010-01-13 16:11:45 |
|  6 | mats@example.com | Adding employee mats   | FAIL   | 2010-01-13 16:12:00 |
+----+--------------+----------------------------+--------+---------------------+
6 rows in set (0.00 sec)

Look at the last line, where the status is FAIL. This line was added to the table by the before trigger tr_employee_insert_before. For the binary log to faithfully represent the changes made to the database on the master, it is necessary to write the statement to the binary log if there are any nontransactional changes present in the statement or in triggers that are executed as a result of executing the statement. Since the statement failed, the after trigger tr_employee_insert_after was not executed, and therefore the status is still FAIL from the execution of the before trigger.

Since the statement failed on the master, information about the failure needs to be written to the binary log as well. The MySQL server handles this by using an error code field in the Query event to register the exact error code that caused the statement to fail. This field is then written to the binary log together with the event.

The error code is not visible when using the SHOW BINLOG EVENTS command, but you can view it using the mysqlbinlog tool, which we will cover later in the chapter.

Logging Transactions

You have now seen how individual statements are written to the binary log, along with context information, but we did not cover how transactions are logged. In this section, we will briefly cover how transactions are logged.

A transaction can start under a few different circumstances:

When the user issues START TRANSACTION (or BEGIN).
When AUTOCOMMIT=1 and a statement accessing a transactional table starts to execute. Note that a statement that writes only to nontransactional tables—for example, only to MyISAM tables—does not start a transaction.
When AUTOCOMMIT=0 and the previous transaction was committed or aborted either implicitly (by executing a statement that does an implicit commit) or explicitly by using COMMIT or ROLLBACK.

Not every statement that is executed after the transaction has started is part of that transaction. The exceptions require special care from the binary log.

Nontransactional statements are by their very definition not part of the transaction. When they are executed, they take effect immediately and do not wait for the transaction to commit. This also means that it is not possible to roll them back. They don’t affect an open transaction: any transactional statement executed after the nontransactional statement is still added to the currently open transaction.

In addition, several statements do an implicit commit. These can be separated into three groups based on the reason they do an implicit commit.

Statements that write files

Most DDL statements (CREATE, ALTER, etc.), with some exceptions, do an implicit commit of any outstanding transaction before starting to execute and an implicit commit after they have finished. These statements modify files in the filesystem and are for that reason not transactional.

Statements that modify tables in the mysql database

All statements that create, drop, or modify user accounts or privileges for users do an implicit commit and cannot be part of a transaction. Internally, these statements modify tables in the mysql database, which are all nontransactional.

In MySQL versions earlier than 5.1.3, these statements did not cause an implicit commit, but because they were writing to nontransactional tables, they were treated as nontransactional statements. As you will soon see, this caused some inconsistencies, so implicit commits were added for these statements over the course of several versions.

Statements that require implicit commits for pragmatic reasons

Statements that lock tables, statements that are used for administrative purposes, and LOAD DATA INFILE cause implicit commits in various situations because the implementation requires this to make them work correctly.

Statements that cause an implicit commit are clearly not part of any transaction, because any open transaction is committed before execution starts. You can find a complete list of statements that cause an implicit commit in the online MySQL Reference Manual.

Transaction Cache

The binary log can have statements in a different order from their actual execution, because it combines all the statements in each transaction to keep them together. Multiple sessions can execute simultaneous transactions on a server, and the transactional storage engines maintain their own transactional logs to make sure each transaction executes correctly. These logs are not visible to the user. In contrast, the binary log shows all transactions from all sessions in the order in which they were committed as if each executed sequentially.

To ensure each transaction is written as a unit to the binary log, the server has to separate statements that are executing in different threads. When committing a transaction, the server writes all the statements that are part of the transaction to the binary log as a single unit. For this purpose, the server keeps a transaction cache for each thread, as illustrated in Figure 4-4. Each statement executed for a transaction is placed in the transaction cache, and the contents of the transaction cache are then copied to the binary log and emptied when the transaction commits.

Figure 4-4. Threads with transaction caches and a binary log

Statements that contain nontransactional changes require special attention. Recall from our previous discussion that nontransactional statements do not cause the current transaction to terminate, so the changes introduced by the execution of a nontransactional statement have to be recorded somewhere without closing the currently open transaction. The situation is further complicated by statements that simultaneously affect transactional and nontransactional tables. These statements are considered transactional but include changes that are not part of the transaction.

Statement-based replication cannot handle this correctly in all situations, so a best-effort approach has been taken. We’ll describe the measures taken by the server, followed by the issues you have to be aware of in order to avoid the replication problems that are left over.

How nontransactional statements are logged

When no transaction is open, nontransactional statements are written to the binary log at the end of execution of the statement and do not “transit” in the transaction cache before ending up in the binary log. If, however, a transaction is open, the rules for how to handle the statement are as follows:

Rule 1: If the statement is marked as transactional, it is written to the transaction cache.
Rule 2: If the statement is not marked as transactional and there are no statements in the transaction cache, the statement is written directly to the binary log.
Rule 3: If the statement is not marked as transactional, but there are statements in the transaction cache, the statement is written to the transaction cache.

The third rule might seem strange, but you can understand the reasoning if you look at Example 4-15. Returning to our employee and log tables, consider the statements in Example 4-15, where a modification of a transactional table comes before modification of a nontransactional table in the transaction.

Example 4-15. Transaction with nontransactional statement

1     START TRANSACTION;
2     SET @pass = PASSWORD('xyzzy');
3     INSERT INTO employee(name,email,password)
      VALUES ('mats','mats@example.com', @pass);
4     INSERT INTO log(email, message)
      VALUES ('root@example.com', 'This employee was bad');
5     COMMIT;

Following rule 3, the statement on line 4 is written to the transaction cache even though the table is nontransactional. If the statement were written directly to the binary log, it would end up before the statement in line 3 because the statement in line 3 would not end up in the binary log until a successful commit in line 5. In short, the slave’s log would end up containing the comment added by the DBA in line 4 before the actual change to the employee in line 3, which is clearly inconsistent with the master. Rule 3 avoids such situations. The left side of Figure 4-5 shows the undesired effects if rule 3 did not apply, whereas the right side shows what actually happens thanks to rule 3.

Figure 4-5. Alternative binary logs depending on rule 3

Rule 3 involves a trade-off. Because the nontransactional statement is cached while the transaction executes, there is a risk that two transactions will update a nontransactional table on the master in a different order from that in which they are written to the binary log.

This situation can arise when there is a dependency between the first transactional and the second nontransactional statement of the transaction, but this cannot generally be handled by the server because it would require parsing each statement completely, including code in all triggers invoked, and performing a dependency analysis. Although technically possible, this would add extra processing to all statements during an open transaction and would therefore affect performance, perhaps significantly. Because the problem can almost always be avoided by designing transactions properly and ensuring that there are no dependencies of this kind in the transaction, the overhead was not added to MySQL.

How to avoid replication problems with nontransactional statements

The best strategy for avoiding problems is not to use nontransactional tables. However, if they are required by the application, a strategy for avoiding the dependencies discussed in the previous section is to ensure that statements affecting nontransactional tables are written first in the transaction. In this case, the statements will be written directly to the binary log, because the transaction cache is empty (refer to rule 2 in the preceding section). The statements are known to have no dependencies.

If you need any values from these statements later in the transaction, you can assign them to temporary tables or variables. After that, the real contents of the transaction can be executed, referencing the temporary tables or variables.

Starting with MySQL 5.6, it is possible to force nontransactional statements to be written directly to the binary log by using the option binlog_direct_non_transactional_updates. When this option is enabled, all nontransactional statements will be written to the binary log before the transaction where they appear instead of inside the transaction, even when the nontransactional statement appears after a transactional statement. This option changes the behavior only in statement-based replication. In row-based replication, the rows for the nontransactional statement are always written before the transaction. This will work because row-based replication does not execute the statements, but just changes the data in the table. Hence, the nontransactional “statement” will not have any dependencies at all and can be safely written before the transaction.

As an example, if the transaction in Example 4-15 is executed with binlog_direct_non_transactional_updates disabled (which is the default), statements are written to the binary log in the order shown by Figure 4-6.

Order of statements from in natural order

Figure 4-6. Order of statements from Example 4-15 in natural order

If, instead, binlog_direct_non_transactional_updates is enabled, the sequence of events will be in the order shown in Figure 4-7. Here, a separate transaction created for the write to the log table, which is then written before the transaction containing transactional statements.

Figure 4-7. Order of statements from Example 4-15 in safe order

Because the nontransactional statement is written before the transaction, it is critical that there are no dependencies on the statements executed before the nontransactional statement in the transaction. If this is something that you need, you should consider using row-based replication instead.

Distributed Transaction Processing Using XA

MySQL version 5.0 lets you coordinate transactions involving different resources by using the X/Open Distributed Transaction Processing model, XA. Although currently not very widely used, XA offers attractive opportunities for coordinating all kinds of resources with transactions.

In version 5.0, the server uses XA internally to coordinate the binary log and the storage engines.

A set of commands allows the client to take advantage of XA synchronization as well. XA allows different statements entered by different users to be treated as a single transaction. On the other hand, it imposes some overhead, so some administrators turn it off globally.

Instructions for working with the XA protocol are beyond the scope of this book, but we will give a brief introduction to XA here before describing how it affects the binary log.

XA includes a transaction manager that coordinates a set of resource managers so that they commit a global transaction as an atomic unit. Each transaction is assigned a unique XID, which is used by the transaction manager and the resource managers. When used internally in the MySQL server, the transaction manager is usually the binary log and the resource managers are the storage engines. The process of committing an XA transaction is shown in Figure 4-8 and consists of two phases.

In phase 1, each storage engine is asked to prepare for a commit. When preparing, the storage engine writes any information it needs to commit correctly to safe storage and then returns an OK message. If any storage engine replies negatively—meaning that it cannot commit the transaction—the commit is aborted and all engines are instructed to roll back the transaction.

After all storage engines have reported that they have prepared without error, and before phase 2 begins, the transaction cache is written to the binary log. In contrast to normal transactions, which are terminated with a normal Query event with a COMMIT, an XA transaction is terminated with an Xid event containing the XID.

Figure 4-8. Distributed transaction commit using XA

In phase 2, all the storage engines that were prepared in phase 1 are asked to commit the transaction. When committing, each storage engine will report that it has commit⁠ted the transaction in stable storage. It is important to understand that the commit cannot fail: once phase 1 has passed, the storage engine has guaranteed that the transaction can be committed and therefore is not allowed to report failure in phase 2. A hardware failure can, of course, cause a crash, but since the storage engines have stored the information in durable storage, they will be able to recover properly when the server restarts. The restart procedure is discussed in the section The Binary Log and Crash Safety.

After phase 2, the transaction manager is given a chance to discard any shared resources, should it choose to. The binary log does not need to do any such cleanup actions, so it does not do anything special with regard to XA at this step.

In the event that a crash occurs while committing an XA transaction, the recovery procedure in Figure 4-9 will take place when the server is restarted. At startup, the server will open the last binary log and check the Format_description event. If the binlog-in-use flag is set (described in Binlog File Rotation), it indicates that the server crashed and XA recovery has to be executed.

Figure 4-9. Procedure for XA recovery

The server starts by walking through the binary log that was just opened and finding the XIDs of all transactions in the binary log by reading the Xid events. Each storage engine loaded into the server will then be asked to commit the transactions in this list. For each XID in the list, the storage engine will determine whether a transaction with that XID is prepared but not committed, and commit it if that is the case. If the storage engine has prepared a transaction with an XID that is not in this list, the XID obviously did not make it to the binary log before the server crashed, so the transaction should be rolled back.

Binary Log Group Commit

Writing data to disk takes time; quite a lot actually. The performance of disk is orders of magnitude slower than main memory, where disk access times is counted in milliseconds, main memory access is counted in nanoseconds. For that reason, operating systems have elaborate memory management systems for keeping part of the files in memory and not performing more writes to disk than necessary. Since database systems have to be safe from crashes, it is necessary to force the data to be written to disk when a transaction is committed.

To avoid the performance impact resulting if each transaction has to be written to disk, multiple independent transactions can be grouped together and written to disk as a single unit, which is called group commit. Because the disk write time is mainly a consequence of the time it takes to move the disk head to the correct position on the disk, and not the amount of data written, this improves performance significantly.

It is, however, not sufficient to commit the transaction data in the storage engine efficiently, the binary log also has to be written efficiently. For that reason, binary log group commit was added in MySQL 5.6. With this, multiple independent transactions can be committed as a group to the binary log, improving performance significantly.

In order for online backup tools like MySQL Enterprise Backup to work correctly, it is important that transactions are written to the binary log in the same order as they are prepared in the storage engine.

The implementation is built around a set of stages that each transaction has to pass through before being fully committed, you can see an illustration in Figure 4-10. Each stage is protected by a mutex so there can never be more than one thread in each stage.

Each stage handles one part of the commit procedure. The first stage flushes the transaction caches of the threads to the file pages, the second stage executes a sync to get the file pages to disk, and the last stage commits all transactions.

Figure 4-10. Binary log group commit architecture

To move sessions between stages in an ordered manner, each stage has an associated queue where sessions can queue up for processing. Each stage queue is protected by a mutex that is held briefly while manipulating the queue. You can see the names of the mutexes for each stage in Table 4-1, in case you want to use the performance schema to monitor performance of the commit phase.

Table 4-1. Binlog group commit stages and their mutexes

Stage	Stage mutex	Stage queue mutex
Flush	`LOCK_log`	`LOCK_flush_queue`
Sync	`LOCK_sync`	`LOCK_sync_queue`
Commit	`LOCK_commit`	`LOCK_commit_queue`

Sessions normally run in separate threads, so any session thread that wants to commit a transaction enqueues the session in the flush stage queue. A session thread then proceeds through each stage in the following way:

If the session thread is enqueued to a non-empty stage queue, it is a follower and will go and wait for its transaction to be committed by some other session thread.
If the session thread enqueued to an empty stage queue, it is a (stage) leader and will bring all the sessions registered for the stage through the stage.
The leader empties all the sessions of the queue in a single step. The order of the sessions will be maintained, but after this step, new sessions can enqueue to the stage queue.
The stage processing is done as follows:
- For the flush queue, the transaction of each session will be flushed to the binary log in the same order as they enqueued to the flush queue.
- For the sync stage, an fsync call is executed.
- For the commit stage, the transactions are committed in the storage engines in the same order as they registered.
The sessions are enqueued to the queue for the next stage in the same order as they were registered for this stage.
Note that the queue might not be empty, which means that a session thread has “caught up” with a preceding leader thread. In this case, the session thread that was the leader and did the processing will become a follower for the next stage and let the leader already queued up handle the processing for the combined sets of sessions. Note that a leader can become a follower, but a follower can never become a leader.

As explained, new leader threads merge their session queues with older threads that are in the middle of commit procedures. This lets the system adapt dynamically to changing situations. Normally, the sync stage is the most expensive stage, so many batches of sessions will pass through the flush stage and “pile up” in the sync stage queue, after which a single leader thread will bring all the sessions through that stage. However, if battery-backed caches are used, fsync calls are very cheap, so in that case, sessions might not pile up at this stage.

Transactions have to be committed in order so that online backup methods (such as XtraBackup or MySQL Enterprise Backup) work correctly. For this reason, the commit stage commits the transactions in the same order as they are written to the binary log.

However, it is also possible to commit transactions in parallel in the commit stage. This means that the commits are done in an arbitrary order. This does not affect correctness in normal operations, but it means that you should not take an online backup while doing parallel commits.

Benchmarks did not show any measurable improvement in performance when committing transactions in parallel, so the default is to always commit in order. This is also the recommended setting, unless you have special needs (or just want to test it).

You can control whether transactions should be ordered using binlog_order_commits. If this is set to OFF, transactions are committed in parallel and the last stage is skipped (the threads commit themselves in an arbitrary order instead of waiting for the leader).

The flush stage includes an optimization when reading sessions from the queue. Instead of entering the flush stage and taking the entire queue for processing, the leader “skims” one session at a time from the queue, flushes it, and then repeats the procedure. The idea behind this optimization is that as long as there are sessions enqueueing, there is no point in advancing to the queue stage, and since flushing the transaction to the binary log takes time, performing it on each session provides an opportunity for more sessions that need committing to enqueue themselves. This optimization showed significant performance improvements in throughput. But because it can affect latency, you can control how long the leader thread will keep “skimming” sessions from the flush queue using the binlog_max_flush_queue_time, which takes the number of microseconds that the leader should skim the queue. Once the timer expires, the entire queue will be grabbed and the leader reverts to executing the procedure described in the preceding list for the stage. This means that the time given is not the maximum latency of a transaction: the leader will have to flush all the transactions that are registered for the queue before moving to the sync stage.

Row-Based Replication

The primary goal of replication is to keep the master and the slave synchronized so they have the same data. As you saw earlier, replication offers a number of special features to ensure the results are as close as possible to being identical on master and slave: context events, session-specific IDs, etc.

Despite this, there are still some situations that statement-based replication can’t currently handle correctly:

If an UPDATE, DELETE, or INSERT statement contains a LIMIT clause, it may cause problems if a database crashes during execution.
If there is an error during execution of a nontransactional statement, there is no guarantee that the effects are the same on the master and the slave.
If a statement contains a call to a UDF, there is no way to ensure the same value is used on the slave.
If the statement contains any nondeterministic function—such as USER, CURRENT_USER, or CONNECTION_ID—results may differ between master and slave.
If a statement updates two tables with autoincrement columns, it will not work correctly, because only a single last insert ID can be replicated, which will then be used for both tables on the slave, while on the master, the insert ID for each table will be used individually.

In these cases, it is better to replicate the actual data being inserted into the tables, which is what row-based replication does.

Instead of replicating the statement that performs the changes, row-based replication replicates each row being inserted, deleted, or updated separately, with the values that were used for the operation. Since the row that is sent to the slave is the same row that is sent to the storage engine, it contains the actual data being inserted into the table. Hence there are no UDFs to consider, no autoincrement counters to keep track of, and no partial execution of statements to take into consideration—just data, plain and simple.

Row-based replication opens up an entirely new set of scenarios that you just cannot accomplish with statement-based replication. However, you must also be aware of some differences in behavior.

When choosing between statement-based and row-based replication, consider the following:

Do you have statements that update a lot of rows, or do the statements usually only change or insert a few rows?
If the statement changes a lot of rows, statement-based replication will have more compact statements and may execute faster. But since the statement is executed on the slave as well, this is not always true. If the statement has a complex optimization and execution plan, it might be faster to use row-based replication, because the logic for finding rows is much faster.
If the statement changes or inserts only a few rows, row-based replication is potentially faster because there is no parsing involved and all processing goes directly to the storage engine.
Do you need to see which statements are executed? The events for handling row-based replication are hard to decode, to say the least. In statement-based replication, the statements are written into the binary log and hence can be read directly.^[5]
Statement-based replication has a simple replication model: just execute the same statement on the slave. This has existed for a long time and is familiar to many DBAs. Row-based replication, on the other hand, is less familiar to many DBAs and can therefore potentially be harder to fix when replication fails.
If data is different on master and slave, executing statements can yield different results on master and slave. Sometimes this is intentional—in this case, statement-based replication can and should be used—but sometimes this not intentional and can be prevented through row-based replication.

Enabling Row-based Replication

You can control the format to use when writing to the binary log using the option binlog-format. This option can take the values STATEMENT, MIXED, or ROW (see Options for Row-Based Replication for more information), which comes both as a global variable and session variable. This makes it possible for a session to temporarily switch replication format (but you need SUPER privileges to change this variable, even the session version). However, to ensure that row-based replication is used when starting the server, you need to stop the master and update the configuration file. Example 4-16 shows the addition needed to enable row-based replication.

Example 4-16. Options to configure row-based replication

[mysqld]
user            = mysql
pid-file        = /var/run/mysqld/mysqld.pid
socket          = /var/run/mysqld/mysqld.sock
port            = 3306
basedir         = /usr
datadir         = /var/lib/mysql
tmpdir          = /tmp
log-bin         = master-bin
log-bin-index   = master-bin.index
server-id       = 1
binlog-format   = ROW

Using Mixed Mode

Mixed-mode replication is recommended for MySQL version 5.1 and later, but the default value for the binlog-format option is STATEMENT. This might seem odd, but that decision was made to avoid problems for users who upgrade from versions 5.0 or earlier. Because those versions had no row-based replication and users had to use statement-based replication, the MySQL developers did not want servers to make a sudden switch. If the servers suddenly started sending out row-based replication events when they were upgraded, the deployment would likely be a mess. To reduce the number of factors that an upgrading DBA has to consider, the default for this option remains STATEMENT.

However, if you use one of the template files distributed with MySQL version 5.1, you will notice the binlog-format option has the value MIXED, per the recommendation.

The principles behind mixed-mode replication are simple: use statement-based replication normally and switch to row-based replication for unsafe statements. We have already examined the kinds of statements that can lead to problems and why. To summarize, mixed-mode currently switches to row-based replication if:

The statement calls any of the following:
- The UUID function
- A user-defined function
- The CURRENT_USER or USER function
- The LOAD_FILE function
Two or more tables with an AUTO_INCREMENT column are updated in the same statement.
A server variable is used in the statement.
The storage engine does not allow statement-based replication (e.g., the MySQL Cluster engine).

This list is, by necessity, incomplete: it is being extended as new constructions are discovered unsafe. For a complete and accurate list, refer to the online MySQL Reference Manual.

Binary Log Management

The events mentioned thus far represent some real change of data that occurred on the master. However, things can happen that do not represent any change of data on the master but can affect replication. For example, if the server is stopped, it can potentially affect replication. No events can, of course, be written to the binary log when the server is not running, but this means that if anything is changed in the datafile, it will not be represented in the binary log. A typical example of this is restoring a backup, or otherwise manipulating the datafiles. Such changes are not replicated simply because the server is not running. However, the fact that the server stopped is sometimes represented in the binary log using a binary log event precisely to be able to recognize that there can be a “gap” in the binary log where things could have happened.

Events are needed for other purposes as well. Because the binary logs consist of multiple files, it is necessary to split the groups at convenient places to form the sequence of binlog files. To handle this safely, special events (rotate events) are added to the binary log.

The Binary Log and Crash Safety

As you have seen, changes to the binary log do not correspond to changes to the master databases on a one-to-one basis. It is important to keep the databases and the binary log mutually consistent in case of a crash. In other words, there should be no changes committed to the storage engine that are not written to the binary log, and vice versa.

Nontransactional engines introduce problems right away. For example, it is not possible to guarantee consistency between the binary log and a MyISAM table because MyISAM is nontransactional and the storage engine will carry through any requested change long before any attempt to log the statement.

But for transactional storage engines, MySQL includes measures to make sure that a crash does not cause the binary log to lose too much information.

As we described in Logging Statements, events are written to the binary log before releasing the locks on the table, but after all the changes have been given to the storage engine. So if there is a crash before the storage engine releases the locks, the server has to ensure that any changes recorded to the binary log are actually in the table on the disk before allowing the statement (or transaction) to commit. This requires coordination with standard filesystem synchronization.

Because disk accesses are very expensive compared to memory accesses, operating systems are designed to cache parts of the file in a dedicated part of the main memory—usually called the page cache—and wait to write file data to disk until necessary. Writing to disk becomes necessary when another page must be loaded from disk and the page cache is full, but it can also be requested by an application by doing an explicit call to write the pages of a file to disk.

Recall from the earlier description of XA that when the first phase is complete, all data has to be written to durable storage—that is, to disk—for the protocol to handle crashes correctly. This means that every time a transaction is committed, the page cache has to be written to disk. This can be very expensive and, depending on the application, not always necessary. To control how often the data is written to disk, you can set the sync-binlog option. This option takes an integer specifying how often to write the binary log to disk. If the option is set to 5, for instance, the binary log will be written to disk on every fifth commit group of statements or transactions. The default value is 0, which means that the binary log is not explicitly written to disk by the server, but happens at the discretion of the operating system.

Note

Note that since the introduction of binary log group commit in MySQL version 5.6, there is a sync with each commit group and not with each transaction or statement. This means that with sync-binlog=1 several transactions will be written to disk in a batch. You can read more about binary log group commit in Binary Log Group Commit.

For storage engines that support XA, such as InnoDB, setting the sync-binlog option to 1 means that you will not lose any transactions under normal crashes. For engines that do not support XA, you might lose at most one transaction.

If, however, every group is written to disk, it means that the performance suffers, usually a lot. Disk accesses are notoriously slow and caches are used for precisely the purpose of improving the performance because they remove the need to always write data to disk. If you are prepared to risk losing a few transactions or statements—either because you can handle the work it takes to recover this manually or because it is not important for the application—you can set sync-binlog to a higher value or leave it at the default.

Binlog File Rotation

MySQL starts a new file to hold binary log events at regular intervals. For practical and administrative reasons, it wouldn’t work to keep writing to a single file—operating systems have limits on file sizes. As mentioned earlier, the file to which the server is currently writing is called the active binlog file.

Switching to a new file is called binary log rotation or binlog file rotation, depending on the context.

Four main activities cause a rotation:

The server stops: Each time the server starts, it begins a new binary log. We’ll discuss why shortly.
The binlog file reaches a maximum size: If the binlog file grows too large, it will be automatically rotated. You can control the size of the binlog files using the binlog-cache-size server variable.
The binary log is explicitly flushed: The FLUSH LOGS command writes all logs to disk and creates a new file to continue writing the binary log. This can be useful when administering recovery images for PITR. Reading from an open binlog file can have unexpected results, so it is advisable to force an explicit flush before trying to use binlog files for recovery.
An incident occurred on the server: In addition to stopping altogether, the server can encounter incidents that cause the binary log to be rotated. These incidents sometimes require special manual intervention from the administrator, because they can leave a “gap” in the replication stream. It is easier for the DBA to handle the incident if the server starts on a fresh binlog file after an incident.

The first event of every binlog file is the Format_description event, which describes the server that wrote the file along with information about the contents and status of the file.

Three items are of particular interest here:

The binlog-in-use flag: Because a crash can occur while the server is writing to a binlog file, it is critical to indicate when a file was closed properly. Otherwise, a DBA could replay a corrupted file on the master or slave and cause more problems. To provide assurance about the file’s integrity, the binlog-in-use flag is set when the file is created and cleared after the final event (Rotate) has been written to the file. Thus, any program can see whether the binlog file was properly closed.
Binlog file format version: Over the course of MySQL development, the format for the binary log has changed several times, and it will certainly change again. Developers increment the version number for the format when significant changes—notably changes to the common headers—render new files unreadable to previous versions of the server. (The current format, starting with MySQL version 5.0, is version 4.) The binlog file format version field lists its version number; if a different server cannot handle a file with that version, it simply refuses to read the file.
Server version: This is a string denoting the version of the server that wrote the file. The server version used to run the examples in this chapter was “5.5.31-0ubuntu0.12.04.1,” for instance. As you can see, the string is guaranteed to include the MySQL server version, but it also contains additional information related to the specific build. In some situations, this information can help you or the developers figure out and resolve subtle bugs that can occur when replicating between different versions of the server.

To rotate the binary log safely even in the presence of crashes, the server uses a write-ahead strategy and records its intention in a temporary file called the purge index file (this name was chosen because the file is used while purging binlog files as well, as you will see). Its name is based on that of the index file, so for instance if the name of the index file is master-bin.index, the name of the purge index file is master-bin.~rec~. After creating the new binlog file and updating the index file to point to it, the server removes the purge index file.

Note

In versions of MySQL earlier than 5.1.43, rotation or binlog file purging could leave orphaned files; that is, the files might exist in the filesystem without being mentioned in the index file. Because of this, old files might not be purged correctly, leaving them around and requiring manual cleaning of the files from the directory.

The orphaned files do not cause a problem for replication, but can be considered an annoyance. The procedure shown in this section ensures that no files are orphaned in the event of a crash.

In the event of a crash, if a purge index file is present on the server, the server can compare the purge index file and the index file when it restarts and see what was actually accomplished compared to what was intended.

Before MySQL version 5.6 it was possible that the binary log could be left partially written. This could occur if only a part of the cache was written to the binary log before the server crashed. With MySQL version 5.6, the binary log will be trimmed and the partially written transaction removed from the binary log. This is safe because the transaction is not committed in the storage engine until it was written completely to the binary log.

Incidents

The term “incidents” refers to events that don’t change data on a server but must be written to the binary log because they have the potential to affect replication. Most incidents don’t require special intervention from the DBA—for instance, servers can stop and restart without changes to database files—but there will inevitably be some incidents that call for special action.

Currently, there are two incident events that you might discover in a binary log:

Stop

Indicates that the server was stopped through normal means. If the server crashed, no stop event will be written, even when the server is brought up again. This event is written in the old binlog file (restarting the server rotates to a new file) and contains only a common header; no other information is provided in the event.

When the binary log is replayed on the slave, it ignores any Stop events. Normally, the fact that the server stopped does not require special attention and replication can proceed as usual. If the server was switched to a new version while it was stopped, this will be indicated in the next binlog file, and the server reading the binlog file will then stop if it cannot handle the new version of the binlog format. In this sense, the Stop event does not represent a “gap” in the replication stream. However, the event is worth recording because someone might manually restore a backup or make other changes to files before restarting replication, and the DBA replaying the file could find this event in order to start or stop the replay at the right time.

Incident

An event type introduced in version 5.1 as a generic incident event. In contrast with the Stop event, this event contains an identifier to specify what kind of incident occurred. It is used to indicate that the server was forced to perform actions that almost guarantee that changes are missing from the binary log.

For example, incident events in version 5.1 are written if the database was reloaded or if a nontransactional event was too big to fit in the binlog file. MySQL Cluster generates this event when one of the nodes had to reload the database and could therefore be out of sync.

When the binary log is replayed on the slave, it stops with an error if it encounters an Incident event. In the case of the MySQL Cluster reload event, it indicates a need to resynchronize the cluster and probably to search for events that are missing from the binary log.

Purging the Binlog File

Over time, the server will accumulate binlog files unless old ones are purged from the filesystem. The server can automatically purge old binary logs from the filesystem, or you can explicitly tell the server to purge the files.

To make the server automatically purge old binlog files, set the expire-logs-days option—which is available as a server variable as well—to the number of days that you want to keep binlog files. Remember that as with all server variables, this setting is not preserved between restarts of the server. So if you want the automatic purging to keep going across restarts, you have to add the setting to the my.cnf file for the server.

To purge the binlog files manually, use the PURGE BINARY LOGS command, which comes in two forms:

PURGE BINARY LOGS BEFORE datetime: This form of the command will purge all files that are before the given date. If datetime is in the middle of a logfile (which it usually is), all files before the one holding datetime will be purged.
PURGE BINARY LOGS TO 'filename': This form of the command will purge all files that precede the given file. In other words, all files before filename in the output from SHOW MASTER LOGS will be removed, leaving filename as the first binlog file.

Binlog files are purged when the server starts or when a binary log rotation is done. If the server discovers files that require purging, either because a file is older than expire-logs-days or because a PURGE BINARY LOGS command was executed, it will start by writing the files that the server has decided are ripe for purging to the purge index file (for example, master-bin.~rec~). After that, the files are removed from the filesystem, and finally the purge index file is removed.

In the event of a crash, the server can continue removing files by comparing the contents of the purge index file and the index file and removing all files that were not removed because of a crash. As you saw earlier, the purge index file is used when rotating as well, so if a crash occurs before the index file can be properly updated, the new binlog file will be removed and then re-created when the rotate is repeated.

The mysqlbinlog Utility

One of the more useful tools available to an administrator is the client program mysqlbinlog. This is a small program that can investigate the contents of binlog files as well as relay logfiles (we will cover the relay logs in Chapter 8). In addition to reading binlog files locally, mysqlbinlog can also fetch binlog files remotely from other servers.

The mysqlbinlog is a very useful tool when investigating problems with replication, but you can also use it to implement PITR, as demonstrated in Chapter 3.

The mysqlbinlog tool normally outputs the contents of the binary log in a form that can be executed by sending them to a running server. When statement-based replication is employed, the statements executed are emitted as SQL statements. For row-based replication, which will be introduced in Chapter 8, mysqlbinlog generates some additional data necessary to handle row-based replication. This chapter focuses entirely on statement-based replication, so we will use the command with options to suppress output needed to handle row-based replication.

Some options to mysqlbinlog will be explained in this section, but for a complete list, consult the online MySQL Reference Manual.

Basic Usage

Let’s start with a simple example where we create a binlog file and then look at it using mysqlbinlog. We will start up a client connected to the master and execute the following commands to see how they end up in the binary log:

mysqld1> RESET MASTER;
Query OK, 0 rows affected (0.01 sec)

mysqld1> CREATE TABLE employee (
      ->    id INT AUTO_INCREMENT,
      ->    name CHAR(64) NOT NULL,
      ->    email CHAR(64),
      ->    password CHAR(64),
      ->    PRIMARY KEY (id)
      -> );
Query OK, 0 rows affected (0.00 sec)

mysqld1> SET @password = PASSWORD('xyzzy');
Query OK, 0 rows affected (0.00 sec)

mysqld1> INSERT INTO employee(name,email,password)
      ->   VALUES ('mats','mats@example.com',@password);
Query OK, 1 row affected (0.01 sec)

mysqld1> SHOW BINARY LOGS;
+--------------------+-----------+
| Log_name           | File_size |
+--------------------+-----------+
| mysqld1-bin.000038 |       670 |
+--------------------+-----------+
1 row in set (0.00 sec)

Let’s now use mysqlbinlog to dump the contents of the binlog file master-bin.000038, which is where all the commands ended up. The output shown in Example 4-17 has been edited slightly to fit the page.

Example 4-17. Output from execution of mysqlbinlog

$ sudo mysqlbinlog                       \
>     --short-form                       \
>     --force-if-open                    \
>     --base64-output=never              \
>     /var/lib/mysql1/mysqld1-bin.000038
1  /*!40019 SET @@session.max_insert_delayed_threads=0*/;
2  /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
3  DELIMITER /*!*/;
4  ROLLBACK/*!*/;
5  use test/*!*/;
6  SET TIMESTAMP=1264227693/*!*/;
7  SET @@session.pseudo_thread_id=999999999/*!*/;
8  SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=1,
       @@session.unique_checks=1, @@session.autocommit=1/*!*/;
9  SET @@session.sql_mode=0/*!*/;
10 SET @@session.auto_increment_increment=1,
       @@session.auto_increment_offset=1/*!*/;
11 /*!\C utf8 *//*!*/;
12 SET @@session.character_set_client=8,@@session.collation_connection=8,
       @@session.collation_server=8/*!*/;
13 SET @@session.lc_time_names=0/*!*/;
14 SET @@session.collation_database=DEFAULT/*!*/;
15 CREATE TABLE employee (
16    id INT AUTO_INCREMENT,
17    name CHAR(64) NOT NULL,
18    email CHAR(64),
19    password CHAR(64),
20    PRIMARY KEY (id)
21 ) ENGINE=InnoDB
22 /*!*/;
23 SET TIMESTAMP=1264227693/*!*/;
24 BEGIN
25 /*!*/;
26 SET INSERT_ID=1/*!*/;
27 SET @`password`:=_utf8 0x2A31353141463... COLLATE `utf8_general_ci`/*!*/;
28 SET TIMESTAMP=1264227693/*!*/;
29 INSERT INTO employee(name,email,password)
30   VALUES ('mats','mats@example.com',@password)
31 /*!*/;
32 COMMIT/*!*/;
33 DELIMITER ;
34 # End of log file
35 ROLLBACK /* added by mysqlbinlog */;
36 /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

To get this output, we use three options:

short-form: With this option, mysqlbinlog prints only information about the SQL statements issued, and leaves out comments with information about the events in the binary log. This option is useful when mysqlbinlog is used only to play back the events to a server. If you want to investigate the binary log for problems, you will need these comments and should not use this option.
force-if-open: If the binlog file is not closed properly, either because the binlog file is still being written to or because the server crashed, mysqlbinlog will print a warning that this binlog file was not closed properly. This option prevents the printing of that warning.
base64-output=never: This prevents mysqlbinlog from printing base64-encoded events. If mysqlbinlog has to print base64-encoded events, it will also print the Format_description event of the binary log to show the encoding used. For statement-based replication, this is not necessary, so this option is used to suppress that event.

In Example 4-17, lines 1–4 contain the preamble printed in every output. Line 3 sets a delimiter that is unlikely to occur elsewhere in the file. The delimiter is also designed to appear as a comment in processing languages that do not recognize the setting of the delimiter.

The rollback on line 4 is issued to ensure the output is not accidentally put inside a transaction because a transaction was started on the client before the output was fed into the client.

We can skip momentarily to the end of the output—lines 33–35—to see the counterpart to lines 1–4. They restore the values set in the preamble and roll back any open transaction. This is necessary in case the binlog file was truncated in the middle of a transaction, to prevent any SQL code following this output from being included in a transaction.

The USE statement on line 5 is printed whenever the database is changed. Even though the binary log specifies the current database before each SQL statement, mysqlbinlog shows only the changes to the current database. When a USE statement appears, it is the first line of a new event.

The first line that is guaranteed to be in the output for each event is SET TIMESTAMP, as shown on lines 6 and 23. This statement gives the timestamp when the event started executing in seconds since the epoch.

Lines 7–14 contain general settings, but like USE on line 5, they are printed only for the first event and whenever their values change.

Because the INSERT statement on lines 29–30 is inserting into a table with an autoincrement column using a user-defined variable, the INSERT_ID session variable on line 26 and the user-defined variable on line 27 are set before the statement. This is the result of the Intvar and User_var events in the binary log.

If you omit the short-form option, each event in the output will be preceded by some comments about the event that generated the lines. You can see these comments, which start with hash marks (#) in Example 4-18.

Example 4-18. Interpreting the comments in mysqlbinlog output

$ sudo mysqlbinlog                       \
>     --force-if-open                    \
>     --base64-output=never              \
>     /var/lib/mysql1/mysqld1-bin.000038
   .
   .
   .
1  # at 386
2  #100123  7:21:33 server id 1  end_log_pos 414   Intvar
3  SET INSERT_ID=1/*!*/;
4  # at 414
5  #100123  7:21:33 server id 1  end_log_pos 496   User_var
6  SET @`password`:=_utf8 0x2A313531...838 COLLATE `utf8_general_ci`/*!*/;
7  # at 496
8  #100123  7:21:33 server id 1  end_log_pos 643
   Query   thread_id=6     exec_time=0     error_code=0
9  SET TIMESTAMP=1264227693/*!*/;
10 INSERT INTO employee(name,email,password)
11   VALUES ('mats','mats@example.com',@password)
12 /*!*/;
13 # at 643
14 #100123  7:21:33 server id 1  end_log_pos 670   Xid = 218
15 COMMIT/*!*/;
16 DELIMITER ;
17 # End of log file
18 ROLLBACK /* added by mysqlbinlog */;
19 /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

The first line of the comment gives the byte position of the event, and the second line contains other information about the event. Consider, for example, the INSERT statement line:

# at 496
#100123  7:21:33 server id 1  end_log_pos 643   Query   thread_id=6
    exec_time=0     error_code=0

The various parts of the comments have the following meanings:

at 496: The byte position where the event starts (i.e., the first byte of the event).
100123 7:21:33: The timestamp of the event as a datetime (date plus time). This is the time when the query started executing or when the events were written to the binary log.
server_id 1: The server ID of the server that generated the event. This server ID is used to set the pseudo_thread_id session variable, and a line setting this variable is printed if the event is thread-specific and the server ID is different from the previously printed ID.
end_log_pos 643: The byte position of the event that follows this event. By taking the difference between this value and the position where the event starts, you can get the length of the event.
Query: The type of event. In Example 4-18, you can see several different types of events, such as User_var, Intvar, and Xid.

The fields after these are event-specific, and hence different for each event. For the Query event, we can see two additional fields:

thread_id=6: The ID of the thread that executed the event. This is used to handle thread-specific queries, such as queries that access temporary tables.
exec_time=0: The execution time of the query in seconds.

Example 4-17 and Example 4-18 dump the output of a single file, but mysqlbinlog accepts multiple files as well. If several binlog files are given, they are processed in order.

The files are printed in the order you request them, and there is no checking that the Rotate event ending each file refers to the next file in sequence. The responsibility for ensuring that these binlog files make up part of a real binary log lies on the user.

Thanks to the way the binlog files binlog files are given, they will be processed in are named, submitting multiple files to mysqlbinlog—such as by using * as a file-globbing wildcard—is usually not a problem. However, let’s look at what happens when the binlog file counter, which is used as an extension to the filename, goes from 999999 to 1000000:

$ ls mysqld1-bin.[0-9]*
mysqld1-bin.000007  mysqld1-bin.000011  mysqld1-bin.000039
mysqld1-bin.000008  mysqld1-bin.000035  mysqld1-bin.1000000
mysqld1-bin.000009  mysqld1-bin.000037  mysqld1-bin.999998
mysqld1-bin.000010  mysqld1-bin.000038  mysqld1-bin.999999

As you can see, the last binlog file to be created is listed before the two binlog files that are earlier in binary log order. So it is worth checking the names of the files before you use wildcards.

Since your binlog files are usually pretty large, you won’t want to print the entire contents of the binlog files and browse them. Instead, there are a few options you can use to limit the output so that only a range of the events is printed.

start-position=bytepos

The byte position of the first event to dump. Note that if several binlog files are supplied to mysqlbinlog, this position will be interpreted as the position in the first file in the sequence.

If an event does not start at the position given, mysqlbinlog will still try to interpret the bytes starting at that position as an event, which usually leads to garbage output.

stop-position=bytepos

The byte position of the last event to print. If no event ends at that position, the last event printed will be the event with a position that precedes bytepos. If multiple binlog files are given, the position will be the position of the last file in the sequence.

start-datetime=datetime

Prints only events that have a timestamp at or after datetime. This will work correctly when multiple files are given—if all events of a file are before the datetime, all events will be skipped—but there is no checking that the events are printed in order according to their timestamps.

stop-datetime=datetime

Prints only events that have a timestamp before datetime. This is an exclusive range, meaning that if an event is marked 2010-01-24 07:58:32 and that exact datetime is given, the event will not be printed.

Note that since the timestamp of the event uses the start time of the statement but events are ordered in the binary log based on the commit time, it is possible to have events with a timestamp that comes before the timestamp of the preceding event. Since mysqlbinlog stops at the first event with a timestamp outside the range, there might be events that aren’t displayed because they have timestamps before datetime.

Reading remote files

In addition to reading files on a local filesystem, the mysqlbinlog utility can read binlog files from a remote server. It does this by using the same mechanism that the slaves use to connect to a master and ask for events. This can be practical in some cases, since it does not require a shell account on the machine to read the binlog files, just a user on the server with REPLICATION SLAVE privileges.

To handle remote reading of binlog files, include the read-from-remote-server option along with a host and user for connecting to the server, and optionally a port (if different from the default) and a password.

When reading from a remote server, give just the name of the binlog file, not the full path.

So to read the Query event from Example 4-18 remotely, the command would look something like the following (the server prompts for a password, but it is not output when you enter it):

$ sudo mysqlbinlog
>    --read-from-remote-server
>    --host=master.example.com
>    --base64-output=never
>    --user=repl_user --password
>    --start-position=294
>    mysqld1-bin.000038
Enter password:
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 294
#130608 22:09:19 server id 1  end_log_pos 0 	Start: binlog v 4, server v 5.5.31
 -0ubuntu0.12.04.1-log created 130608 22:09:19
# at 294
#130608 22:13:08 server id 1  end_log_pos 362 	Query	thread_id=53	exec_time=0
 error_code=0
SET TIMESTAMP=1370722388/*!*/;
SET @@session.pseudo_thread_id=53/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0...
SET @@session.sql_mode=0/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset...
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33...
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 362
#130608 22:13:08 server id 1  end_log_pos 390 	Intvar
SET INSERT_ID=1/*!*/;
# at 390
#130608 22:13:08 server id 1  end_log_pos 472 	User_var
SET @`password`:=_utf8 0x2A31353141463642384333413641413039434643434244333...
# at 472
#130608 22:13:08 server id 1  end_log_pos 627 	Query	thread_id=53	exec_time=0
use `test`/*!*/;
SET TIMESTAMP=1370722388/*!*/;
INSERT INTO employee(name, email, password)
  VALUES ('mats', 'mats@example.com', @password)
/*!*/;
# at 627
#130608 22:13:08 server id 1  end_log_pos 654 	Xid = 175
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

Reading raw binary logfiles

The mysqlbinlog utility is very useful for investigating binary logs, but it can also be used for taking backups of binlog files. This is very useful when you do not have access to the machine using normal shell but have REPLICATION SLAVE privileges. In this case, you can read the binary logfiles from the server, but they should not be parsed and rather just saved to files.

The normal way to use mysqlbinlog for taking a backup of binlog files remotely is:

mysqlbinlog --raw --read-from-remote-server \
   --host=master.example.com --user=repl_user \
   master-bin.000012 master-bin.000013 ...

In this example, the binlog files master-bin.000012 and master-bin.000013 will be read and saved in the current directory. Note that you have to use --read-from-remote-server together with --raw. Using --raw without --read-from-remote-server is pointless since that would be the same as using a plain file copy. The most interesting options to control the behavior of mysqlbinlog are described below. You can find a detailed description of how to use mysqlbinlog for backups at the MySQL Reference Manual.

--result-file=prefix: This option gives the prefix to use when constructing the files to write to. The prefix can be a directory name (with trailing slash), or any other prefix. It defaults to the empty string, so if this option is not used, files will be written using the same name as they have on the master.
--to-last-log: Normally, only the files given on the command line are transferred, but if this option is provided, only the starting binary log file has to be given. After that, mysqlbinlog will transfer all files after the first one.
--stop-never: Do not stop reading after reaching end of file of the last log: keep waiting for more input. This option is useful when taking a backup that is to be used for point-in-time recovery. See Backup and MySQL Replication for a detailed treatment of backing up for point-in-time recovery.

Interpreting Events

Sometimes, the standard information printed by mysqlbinlog is not sufficient for spotting a problem, so it is necessary to go into the details of the event and investigate its content. To handle such situations, you can pass the hexdump option to tell mysqlbinlog to write the actual bytes of the events.

Before going into the details of the events, here are some general rules about the format of the data in the binary log:

Integer data: Integer fields in the binary log are printed in little-endian order, so you have to read integer fields backward. This means that, for example, the 32-bit block 03 01 00 00 represents the hexadecimal number 103.
String data: String data is usually stored both with length data and null-terminated. Sometimes, the length data appears just before the string and sometimes it is stored in the post header.

This section will cover the most common events, but an exhaustive reference concerning the format of all the events is beyond the scope of this book. Check the Binary Log section in the MySQL Internals Manual for an exhaustive list of all the events available and their fields. The most common of all the events is the Query event, so let’s concentrate on it first. Example 4-19 shows the output for such an event.

Example 4-19. Output when using option - -hexdump

   $ sudo mysqlbinlog                       \
   >     --force-if-open                    \
   >     --hexdump                          \
   >     --base64-output=never              \
   >     /var/lib/mysql1/mysqld1-bin.000038
      .
      .
      .
1  # at 496
2  #100123  7:21:33 server id 1  end_log_pos 643
3  # Position  Timestamp   Type   Master ID        Size      Master Pos    Flags
4  #      1f0 6d 95 5a 4b   02   01 00 00 00   93 00 00 00   83 02 00 00   10 00
5  #      203 06 00 00 00 00 00 00 00  04 00 00 1a 00 00 00 40 |................|
6  #      213 00 00 01 00 00 00 00 00  00 00 00 06 03 73 74 64 |.............std|
7  #      223 04 08 00 08 00 08 00 74  65 73 74 00 49 4e 53 45 |.......test.INSE|
8  #      233 52 54 20 49 4e 54 4f 20  75 73 65 72 28 6e 61 6d |RT.INTO.employee|
9  #      243 65 2c 65 6d 61 69 6c 2c  70 61 73 73 77 6f 72 64 |.name.email.pass|
10 #      253 29 0a 20 20 56 41 4c 55  45 53 20 28 27 6d 61 74 |word....VALUES..|
11 #      263 73 27 2c 27 6d 61 74 73  40 65 78 61 6d 70 6c 65 |.mats...mats.exa|
12 #      273 2e 63 6f 6d 27 2c 40 70  61 73 73 77 6f 72 64 29 |mple.com...passw|
13 #      283 6f 72 64 29                                      |ord.|
14 #       Query   thread_id=6     exec_time=0     error_code=0
   SET TIMESTAMP=1264227693/*!*/;
   INSERT INTO employee(name,email,password)
     VALUES ('mats','mats@example.com',@password)

The first two lines and line 14 are comments listing basic information that we discussed earlier. Notice that when you use the hexdump option, the general information and the event-specific information are split into two lines, whereas they are merged in the normal output.

Lines 3 and 4 list the common header:

Timestamp: The timestamp of the event as an integer, stored in little-endian format.
Type: A single byte representing the type of the event. Some event types are given in the MySQL Internals Manual but to get the values for your specific server you need to look in the source code (in the file sql/log_event.h).
Master ID: The server ID of the server that wrote the event, written as an integer. For the event shown in Example 4-19, the server ID is 1.
Size: The size of the event in bytes, written as an integer.
Master Pos: The same as end_log_pos (i.e., the start of the event following this event).
Flags: This field has 16 bits reserved for general flags concerning the event. The field is mostly unused, but it stores the binlog-in-use flag. As you can see in Example 4-19, the binlog-in-use flag is set, meaning that the binary log is not closed properly (in this case, because we didn’t flush the logs before calling mysqlbinlog).

After the common header come the post header and body for the event. As already mentioned, an exhaustive coverage of all the events is beyond the scope of this book, but we will cover the most important and commonly used events: the Query and Format_description log events.

Query event post header and body

The Query event is by far the most used and also the most complicated event issued by the server. Part of the reason is that it has to carry a lot of information about the context of the statement when it was executed. As already demonstrated, integer variables, user variables, and random seeds are covered using specific events, but it is also necessary to provide other information, which is part of this event.

The post header for the Query event consists of five fields. Recall that these fields are of fixed size and that the length of the post header is given in the Format_description event for the binlog file, meaning that later MySQL versions may add additional fields if the need should arise.

Thread ID: A four-byte unsigned integer representing the thread ID that executed the statement. Even though the thread ID is not always necessary to execute the statement correctly, it is always written into the event.
Execution time: The number of seconds from the start of execution of the query to when it was written to the binary log, expressed as a four-byte unsigned integer.
Database name length: The length of the database name, stored as an unsigned one-byte integer. The database name is stored in the event body, but the length is given here.
Error code: The error code resulting from execution of the statement, stored as a two-byte unsigned integer. This field is included because, in some cases, statements have to be logged to the binary log even when they fail.
Status variables length: The length of the block in the event body storing the status variables, stored as a two-byte unsigned integer. This status block is sometimes used with a Query event to store various status variables, such as SQL_MODE.

The event body consists of the following fields, which are all of variable length.

Status variables

A sequence of status variables. Each status variable is represented by a single integer followed by the value of the status variable. The interpretation and length of each status variable value depends on which status variable it concerns. Status variables are not always present; they are added only when necessary. Some examples of status variables follow:

Q_SQL_MODE_CODE: The value of SQL_MODE used when executing the statement.
Q_AUTO_INCREMENT: This status variable contains the values of auto_increment_increment and auto_increment_offset used for the statement, assuming that they are not the default of 1.
Q_CHARSET: This status variable contains the character set code and collation used by the connection and the server when the statement was executed.

Current database

The name of the current database, stored as a null-terminated string. Notice that the length of the database name is given in the post header.

Statement text

The statement that was executed. The length of the statement can be computed from the information in the common header and the post header. This statement is normally identical to the original statement written, but in some cases, the statement is rewritten before it is stored in the binary log. For instance, as you saw earlier in this chapter, triggers and stored procedures are stored with DEFINER clauses specified.

Format description event post header and body

The Format_description event records important information about the binlog file format, the event format, and the server. Because it has to remain robust between versions—it should still be possible to interpret it even if the binlog format changes—there are some restrictions on which changes are allowed.

One of the more important restrictions is that the common header of both the Format_description event and the Rotate event is fixed at 19 bytes. This means that it is not possible to extend the event with new fields in the common header.

The post header and event body for the Format_description event contain the following fields:

Binlog file version

The version of the binlog file format used by this file. For MySQL versions 5.0 and later, this is 4.

Server version string

A 50-byte string storing server version information. This is usually the three-part version number followed by information about the options used for the build, “5.5.31-0ubuntu0.12.04.1,” for instance.

Creation time

A four-byte integer holding the creation time—the number of seconds since the epoch—of the first binlog file written by the server since startup. For later binlog files written by the server, this field will be zero.

This scheme allows a slave to determine that the server was restarted and that the slave should reset state and temporary data—for example, close any open transactions and drop any temporary tables it has created.

Common header length

The length of the common header for all events in the binlog file except the Format_description and Rotate events. As described earlier, the length of the common header for the Format_description and Rotate events is fixed at 19 bytes.

Post-header lengths

This is the only variable-length field of the Format_description log event. It holds an array containing the size of the post header for each event in the binlog file as a one-byte integer. The value 255 is reserved as the length for the field, so the maximum length of a post header is 254 bytes.

Binary Log Options and Variables

A set of options and variables allow you to configure a vast number of aspects of binary logging.

Several options control such properties as the name of the binlog files and the index file. Most of these options can be manipulated as server variables as well. Some have already been mentioned earlier in the chapter, but here you will find more details on each:

expire-log-days=days

The number of days that binlog files should be kept. Files that are older than the specified number will be purged from the filesystem when the binary log is rotated or the server restarts.

By default this option is 0, meaning that binlog files are never removed.

log-bin[=basename ]

The binary log is turned on by adding the log-bin option in the my.cnf file, as explained in Chapter 3. In addition to turning on the binary log, this option gives a base name for the binlog files; that is, the portion of the filename before the dot. If an extension is provided, it is removed when forming the base name of the binlog files.

If the option is specified without a basename, the base name defaults to host-bin where host is the base name—that is, the filename without directory or extension—of the file given by the pid-file option, which is usually the hostname as given by gethostname(2). For example, if pid-file is /usr/run/mysql/master.pid, the default name of the binlog files will be master-bin.000001, master-bin.000002, etc.

Since the default value for the pid-file option includes the hostname, it is strongly recommended that you give a value to the log-bin option. Otherwise the binlog files will change names when the hostname changes (unless pid-file is given an explicit value).

log-bin-index[=filename]

Gives a name to the index file. This can be useful if you want to place the index file in a different place from the default.

The default is the same as the base name used for log-bin. For example, if the base name used to create binlog files is master-bin, the index file will be named master-bin.index.

Similar to the situation for the log-bin option, the hostname will be used for constructing the index filename, meaning that if the hostname changes, replication will break. For this reason, it is strongly recommended that you provide a value for this option.

log-bin-trust-function-creators

When creating stored functions, it is possible to create specially crafted functions that allow arbitrary data to be read and manipulated on the slave. For this reason, creating stored functions requires the SUPER privilege. However, since stored functions are very useful in many circumstances, it might be that the DBA trusts anyone with CREATE ROUTINE privileges not to write malicious stored functions. For this reason, it is possible to disable the SUPER privilege requirement for creating stored functions (but CREATE ROUTINE is still required).

binlog-cache-size=bytes

The size of the in-memory part of the transaction cache in bytes. The transaction cache is backed by disk, so whenever the size of the transaction cache exceeds this value, the remaining data will go to disk.

This can potentially create a performance problem, so increasing the value of this option can improve performance if you use many large transactions.

Note that just allocating a very large buffer might not be a good idea, because that means that other parts of the server get less memory, which might cause performance degradation.

max-binlog-cache-size=bytes

Use this option to restrict the size of each transaction in the binary log. Since large transactions can potentially block the binary log for a long time, they will cause other threads to convoy on the binary log and can therefore create a significant performance problem. If the size of a transaction exceeds bytes, the statement will be aborted with an error.

max-binlog-size=bytes

Specifies the size of each binlog file. When writing a statement or transaction would exceed this value, the binlog file is rotated and writing proceeds in a new, empty binlog file.

Notice that if the transaction or statement exceeds max-binlog-size, the binary log will be rotated, but the transaction will be written to the new file in its entirety, exceeding the specified maximum. This is because transactions are never split between binlog files.

sync-binlog=period

Specifies how often to write the binary log to disk using fdatasync(2). The value given is the number of transaction commits for each real call to fdatasync(2). For instance, if a value of 1 is given, fdatasync(2) will be called for each transaction commit, and if a value of 10 is given, fdatasync(2) will be called after each 10 transaction commits.

A value of zero means that there will be no calls to fdatasync(2) at all and that the server trusts the operating system to write the binary log to disk as part of the normal file handling.

read-only

Prevents any client threads—except the slave thread and users with SUPER privileges—from updating any data on the server (this does not include temporary tables, which can still be updated). This is useful on slave servers to allow replication to proceed without data being corrupted by clients that connect to the slave.

Options for Row-Based Replication

Use the following options to configure row-based replication:

binlog-format

The binlog-format option can be set to use one of the following modes:

STATEMENT: This will use the traditional statement-based replication for all statements.
ROW: This will use the shiny new row-based replication for all statements that insert or change data (data manipulation language, or DML, statements). However, statement-based replication must still be used for statements that create tables or otherwise alter the schema (data definition language, or DDL, statements).
MIXED: This is intended to be a safe version of statement-based replication and is the recommended mode to use with MySQL version 5.1 and later. In mixed-mode replication, the server writes the statements to the binary log as statements, but switches to row-based replication if a statement is considered unsafe through one of the criteria we have discussed in this chapter.

The variable also exists as a global server variable and as a session variable. When starting a new session, the global value is copied to the session variable and then the session variable is used to decide how to write statements to the binary log.

binlog-max-row-event-size

Use this option to specify when to start a new event for holding the rows. Because the events are read fully into memory when being processed, this option is a rough way of controlling the size of row-holding events so that not too much memory is used when processing the rows.

binlog-rows-query-log-events (new in MySQL 5.6)

This option causes the server to add an informational event to the binary log before the the row events. The informational event contains the original query that generated the rows.

Note that, because unrecognized events cause the slave to stop, any slave server before MySQL 5.6.2 will not recognize the event and hence stop replicating from the master. This means that if you intend to use an informational event, you need to upgrade all slaves to MySQL 5.6.2 (or later) before enabling binlog-rows-query-log-events.

Starting with MySQL 5.6.2, informational events can be added for other purposes as well, but they will not cause the slave to stop. They are intended to allow information to be added to the binary log for those readers (including slaves) that can have use for it, but they should not in any way change the semantics of execution and can therefore be safely ignored by slaves that do not recognize them.

Conclusion

Clearly, there is much to the binary log—including its use, composition, and techniques. We presented these concepts and more in this chapter, including how to control the binary log behavior. The material in this chapter builds a foundation for a greater understanding of the mechanics of the binary log and its importance in logging changes to data.

Joel opened an email message from his boss that didn’t have a subject. “I hate it when people do that,” he thought. Mr. Summerson’s email messages were like his taskings—straight and to the point. The message read, “Thanks for recovering that data for the marketing people. I’ll expect a report by tomorrow morning. You can send it via email.”

Joel shrugged and opened a new email message, careful to include a meaningful subject. He wondered what level of detail to include and whether he should explain what he learned about the binary log and the mysqlbinlog utility. After a moment of contemplation, he included as many details as he could. “He’ll probably tell me to cut it back to a bulleted list,” thought Joel. That seemed like a good idea, so he wrote a two-sentence summary and a few bullet points and moved them to the top of the message. When he was finished, he sent it on its way to his boss. “Maybe I should start saving these somewhere in case I have to recount something,” he mused.

^[4]In some special cases, covered in How nontransactional statements are logged, nontransactional statements can be part of a group.

^[5]MySQL 5.6 supports logging the statement that generated the rows to the binary log together with the rows. See the note in Options for Row-Based Replication.

Get MySQL High Availability, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Chapter 4. The Binary Log

Structure of the Binary Log

Binlog Event Structure

Note

Event Checksums

Logging Statements

Logging Data Manipulation Language Statements

Logging Data Definition Language Statements

Logging Queries

Current database

Current time

Context events

Thread ID

LOAD DATA INFILE Statements

Note

Binary Log Filters

Note

Warning

Triggers, Events, and Stored Routines

Statements that define or destroy stored programs

Statements that invoke triggers and stored routines

Stored Procedures

Stored Functions

Events

Warning

Special Constructions

The LOAD_FILE function

Nontransactional Changes and Error Handling

Logging Transactions

Transaction Cache

How nontransactional statements are logged

How to avoid replication problems with nontransactional statements

Distributed Transaction Processing Using XA

Binary Log Group Commit

Row-Based Replication

Enabling Row-based Replication

Using Mixed Mode

Binary Log Management

The Binary Log and Crash Safety

Note

Binlog File Rotation

Note

Incidents

Purging the Binlog File

The mysqlbinlog Utility

Basic Usage

Reading remote files

Reading raw binary logfiles

Interpreting Events

Query event post header and body

Format description event post header and body

Binary Log Options and Variables

Options for Row-Based Replication

Conclusion

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly