So, you have decided to take on a large installation and take care of its operation. Well, you are up for some very interesting—as well as rewarding—times.
Compared to running a small site, supporting a large venture requires planning, foresight, experience, and even more planning. As a database administrator for a large venture you are required to—or will be required to—do things like the following:
Provide plans for recovery of business-essential data in the event of a disaster. It is also likely that you will have to execute the procedure at least once.
Provide plans for handling a large customer/user base and monitoring the load of each node in the site in order to optimize it.
Plan for rapid scale-out in the event the user base grows rapidly.
For all these cases, it is critical to plan for the events in advance and be prepared to act quickly when necessary.
Since not all applications using big sets of servers are websites, we prefer to use the term deployment—rather than the term site or website—to refer to the server that you are using to support some kind of application. This could be a website, but could just as well be a CRM (customer relationship management) system or an online game. The book focuses on the database layer of such a system, but there are some examples that demonstrate how the application layer and the database layer integrate.
There are two things that you need in order to keep a site responsive and available: backups of data and redundancy in the system. The backups can restore a node to the state it was in before a crash, and redundancy allows the site to continue to operate even if one or more of the nodes stops functioning.
There are many ways to do backups, and the method you choose will depend on your needs. Do you need to recover to an exact point in time? In that case, you have to ensure that you have all that is necessary for performing a point-in-time recovery (PITR). Do you want to keep the servers up while making a backup? If so, you need to ensure that you are using some form of online backup method.
Redundancy is handled by duplicating hardware, keeping several instances running in parallel, and using replication to keep multiple copies of the same data available on several machines. If one of the machines fails, it is possible to switch over to another machine that has a copy of the same data.
Together with replication, backup also plays an important role in scaling your system and adding new nodes when needed. If done right, it is even possible to automatically add new slaves at the press of a button, at least figuratively speaking.
If you’re reading this book, you probably have a pretty good idea of what replication is about. It is nevertheless a good idea to introduce the concepts and ideas.
Replication is used to replicate all changes made on a server—called the master server or just master—to another server, which is called the slave server or just slave. This is normally used to create a faithful copy of the master server, but replication can be used for other purposes as well.
The two most common uses of replication are to create a backup of the main server to avoid losing any data if the master crashes and to have a copy of the main server to perform reporting and analysis work without disturbing the rest of the business.
For a small business, this makes a lot of things simpler, but it is possible to do a lot more with replication, including the following:
- Supporting several offices
It is possible to maintain servers at each location and replicate changes to the other offices so that the information is available everywhere. This may be necessary to protect the data and also to satisfy legal requirements to keep information about the business available for auditing purposes.
- Ensuring the business stays operational even if one of the servers goes down
An extra server can be used to handle all the traffic if the original server goes down.
- Ensuring the business can operate even in the presence of a disaster
Replication can be used to send changes to an alternative data center at a different geographic location.
- Protecting against mistakes (“oopses”)
It is possible to create a delayed slave by connecting a slave to a master such that the slave is always a fixed period—for example, an hour—behind the master. If a mistake is made on the master, it is possible to find the offending statement and remove it before it is executed by the slave.
One of the two most important uses of replication in many modern applications is that of scaling out. Modern applications are typically very read-intensive; they have a high proportion of reads compared to writes. To reduce the load on the master, you can set up a slave with the sole purpose of answering read queries. By connecting a load balancer, it is possible to direct read queries to a suitable slave, while write queries go to the master.
When using replication in a scale-out scenario, it is important to understand that MySQL replication is asynchronous in the sense that transactions are committed at the master server first, then replicated to the slave and applied there. This means that the master and slave may not be consistent, and if replication is running continuously, the slave will lag behind the master.
The advantage of using asynchronous replication is that it is faster and scales better than synchronous replication, but in cases where it is important to have current data, the asynchrony must be handled to ensure the information is actually up-to-date.
Another important application of replication is ensuring high availability by adding redundancy. The most common technique is to use a dual-master setup, that is, using replication to keep a pair of masters available all the time, where each master mirrors the other. If one of the masters goes down, the other one is ready to take over immediately.
In addition to the dual-master setup, there are other techniques for achieving high availability that do not involve replication, such as using shared or replicated disks. Although they are not specifically tied to MySQL, these techniques are important tools for ensuring high availability.
A backup strategy is a critical component of keeping a system available. Regular backups of the servers provide safety against crashes and disasters, which, to some extent, can be handled by replication. Even when replication is used correctly and efficiently, however, there are some things that replication cannot handle. You’ll need to have a working backup strategy for the following cases:
- Protection against mistakes
If a mistake is discovered, potentially a long time after it actually occurred, replication will not help. In this case, it is necessary to roll back the system to a time before the mistake was introduced and fix the problem. This requires a working backup schedule.
Replication provides some protection against mistakes if you are using a time-delayed slave, but if the mistake is discovered after the delay period, the change will have already taken effect on the slave as well. So, in general, it is not possible to protect against mistakes using replication only—backups are required as well.
- Creating new servers
When creating new servers—either slaves for scale-out purposes or new masters to act as standbys—it is necessary to make a backup of an existing server and restore that backup image on the new server. This requires a quick and efficient backup method to minimize the downtime and keep the load on the system at an acceptable level.
- Legal reasons
In addition to pure business reasons for data preservation, you may have legal requirements to keep data safe, even in the event of a disaster. Not complying with these requirements can pose significant problems to operating the business.
In short, a backup strategy is necessary for operating the business, regardless of any other precautions you have in place to ensure that the data is safe.
Even if you have replication set up correctly, it is necessary to understand the load on your system and to keep a keen eye on any problems that surface. As business requirements change due to changed customer usage patterns, it is necessary to balance the system to use resources as efficiently as possible and to reduce the risk of losing availability due to sudden changes in resource utilization.
There are a number of different things that you can monitor, measure, and plan for to handle these types of changes. Some examples are:
You can add indexes to tables that are frequently read.
You can rewrite queries or change the structure of databases to speed up execution time.
If locks are held for a long time, it is an indication that several connections are using the same table. It might pay off to switch storage engines.
If some of your scale-out slaves are hot-processing a disproportionate number of queries, the system might require some rebalancing to ensure that all the scale-out slaves are hit evenly.
To handle sudden changes in resource usage, it is necessary to determine the normal load of each server and understand when the system will start to respond slowly because of a sudden increase in load.
Without monitoring, you have no way of spotting problematic queries, hot slaves, or improperly utilized tables.
There is plenty of literature on using MySQL for various jobs, and also a lot of literature about high-availability systems. Here is a list of books that we strongly recommend if you are going to work with MySQL:
- MySQL by Paul DuBois (Addison-Wesley)
This is the reference to MySQL and consists of 1,200 pages (really!) packed with everything you want to know about MySQL (and probably a lot that you don’t want to know).
- High Performance MySQL, Second Edition by Baron Schwartz, et al. (O’Reilly, http://oreilly.com/catalog/9780596101718/)
This is one of the best books on using MySQL in an enterprise setting. It covers optimizing queries and ensuring your system is responsive and available.
- Scalable Internet Architectures by Theo Schlossnagle (Sams Publishing).
Written by one of the most prominent thinkers in the industry, this is a must for anybody working with systems of scale.
Note
The book uses a Python library developed by the authors (called the MySQL Python Replicant) for many of the administrative tasks. MySQL Python is available on Launchpad at https://launchpad.net/mysql-replicant-python.
Get MySQL High Availability now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.