BUY THIS BOOK
Add to Cart

Print Book $49.99


Add to Cart

Print+PDF $64.99

Add to Cart

PDF $39.99

Safari Books Online

What is this?

Add to UK Cart

Print Book £30.99

What is this?

Looking to Reprint or License this content?


High Performance MySQL
High Performance MySQL, Second Edition Optimization, Backups, Replication, and More By Baron Schwartz, Peter Zaitsev, Vadim Tkachenko, Jeremy Zawodny, Arjen Lentz, Derek J. Balling
June 2008
Pages: 708

Cover | Table of Contents | Colophon


Table of Contents

Chapter 1: MySQL Architecture
MySQL's architecture is very different from that of other database servers, and makes it useful for a wide range of purposes. MySQL is not perfect, but it is flexible enough to work well in very demanding environments, such as web applications. At the same time, MySQL can power embedded applications, data warehouses, content indexing and delivery software, highly available redundant systems, online transaction processing (OLTP), and much more.
To get the most from MySQL, you need to understand its design so that you can work with it, not against it. MySQL is flexible in many ways. For example, you can configure it to run well on a wide range of hardware, and it supports a variety of data types. However, MySQL's most unusual and important feature is its storage-engine architecture, whose design separates query processing and other server tasks from data storage and retrieval. In MySQL 5.1, you can even load storage engines as runtime plug-ins. This separation of concerns lets you choose, on a per-table basis, how your data is stored and what performance, features, and other characteristics you want.
This chapter provides a high-level overview of the MySQL server architecture, the major differences between the storage engines, and why those differences are important. We've tried to explain MySQL by simplifying the details and showing examples. This discussion will be useful for those new to database servers as well as readers who are experts with other database servers.
A good mental picture of how MySQL's components work together will help you understand the server. shows a logical view of MySQL's architecture.
The topmost layer contains the services that aren't unique to MySQL. They're services most network-based client/server tools or servers need: connection handling, authentication, security, and so forth.
Figure : A logical view of the MySQL server architecture
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
MySQL's Logical Architecture
A good mental picture of how MySQL's components work together will help you understand the server. shows a logical view of MySQL's architecture.
The topmost layer contains the services that aren't unique to MySQL. They're services most network-based client/server tools or servers need: connection handling, authentication, security, and so forth.
Figure : A logical view of the MySQL server architecture
The second layer is where things get interesting. Much of MySQL's brains are here, including the code for query parsing, analysis, optimization, caching, and all the built-in functions (e.g., dates, times, math, and encryption). Any functionality provided across storage engines lives at this level: stored procedures, triggers, and views, for example.
The third layer contains the storage engines. They are responsible for storing and retrieving all data stored "in" MySQL. Like the various filesystems available for GNU/Linux, each storage engine has its own benefits and drawbacks. The server communicates with them through the storage engine API. This interface hides differences between storage engines and makes them largely transparent at the query layer. The API contains a couple of dozen low-level functions that perform operations such as "begin a transaction" or "fetch the row that has this primary key." The storage engines don't parse SQL or communicate with each other; they simply respond to requests from the server.
Each client connection gets its own thread within the server process. The connection's queries execute within that single thread, which in turn resides on one core or CPU. The server caches threads, so they don't need to be created and destroyed for each new connection.
When clients (applications) connect to the MySQL server, the server needs to authenticate them. Authentication is based on username, originating host, and password. X.509 certificates can also be used across an Secure Sockets Layer (SSL) connection. Once a client has connected, the server verifies whether the client has privileges for each query it issues (e.g., whether the client is allowed to issue a
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Concurrency Control
Anytime more than one query needs to change data at the same time, the problem of concurrency control arises. For our purposes in this chapter, MySQL has to do this at two levels: the server level and the storage engine level. Concurrency control is a big topic to which a large body of theoretical literature is devoted, but this book isn't about theory or even about MySQL internals. Thus, we will just give you a simplified overview of how MySQL deals with concurrent readers and writers, so you have the context you need for the rest of this chapter.
We'll use an email box on a Unix system as an example. The classic mbox file format is very simple. All the messages in an mbox mailbox are concatenated together, one after another. This makes it very easy to read and parse mail messages. It also makes mail delivery easy: just append a new message to the end of the file.
But what happens when two processes try to deliver messages at the same time to the same mailbox? Clearly that could corrupt the mailbox, leaving two interleaved messages at the end of the mailbox file. Well-behaved mail delivery systems use locking to prevent corruption. If a client attempts a second delivery while the mailbox is locked, it must wait to acquire the lock itself before delivering its message.
This scheme works reasonably well in practice, but it gives no support for concurrency. Because only a single process can change the mailbox at any given time, this approach becomes problematic with a high-volume mailbox.
Reading from the mailbox isn't as troublesome. There's nothing wrong with multiple clients reading the same mailbox simultaneously; because they aren't making changes, nothing is likely to go wrong. But what happens if someone tries to delete message number 25 while programs are reading the mailbox? It depends, but a reader could come away with a corrupted or inconsistent view of the mailbox. So, to be safe, even reading from a mailbox requires special care.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Transactions
You can't examine the more advanced features of a database system for very long before transactions enter the mix. A transaction is a group of SQL queries that are treated atomically, as a single unit of work. If the database engine can apply the entire group of queries to a database, it does so, but if any of them can't be done because of a crash or other reason, none of them is applied. It's all or nothing.
Little of this section is specific to MySQL. If you're already familiar with ACID transactions, feel free to skip ahead to "Transactions in MySQL" on .
A banking application is the classic example of why transactions are necessary. Imagine a bank's database with two tables: checking and savings. To move $200 from Jane's checking account to her savings account, you need to perform at least three steps:
  1. Make sure her checking account balance is greater than $200.
  2. Subtract $200 from her checking account balance.
  3. Add $200 to her savings account balance.
The entire operation should be wrapped in a transaction so that if any one of the steps fails, any completed steps can be rolled back.
You start a transaction with the START TRANSACTION statement and then either make its changes permanent with COMMIT or discard the changes with ROLLBACK. So, the SQL for our sample transaction might look like this:
1   START TRANSACTION;
2   SELECT balance FROM checking WHERE customer_id = 10233276;
3   UPDATE checking SET balance = balance - 200.00 WHERE customer_id = 10233276;
4   UPDATE savings  SET balance = balance + 200.00 WHERE customer_id = 10233276;
5   COMMIT;
But transactions alone aren't the whole story. What happens if the database server crashes while performing line 4? Who knows? The customer probably just lost $200. And what if another process comes along between lines 3 and 4 and removes the entire checking account balance? The bank has given the customer a $200 credit without even knowing it.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Multiversion Concurrency Control
Most of MySQL's transactional storage engines, such as InnoDB, Falcon, and PBXT, don't use a simple row-locking mechanism. Instead, they use row-level locking in conjunction with a technique for increasing concurrency known as multiversion concurrency control (MVCC). MVCC is not unique to MySQL: Oracle, PostgreSQL, and some other database systems use it too.
You can think of MVCC as a twist on row-level locking; it avoids the need for locking at all in many cases and can have much lower overhead. Depending on how it is implemented, it can allow nonlocking reads, while locking only the necessary rows during write operations.
MVCC works by keeping a snapshot of the data as it existed at some point in time. This means transactions can see a consistent view of the data, no matter how long they run. It also means different transactions can see different data in the same tables at the same time! If you've never experienced this before, it may be confusing, but it will become easier to understand with familiarity.
Each storage engine implements MVCC differently. Some of the variations include optimistic and pessimistic concurrency control. We'll illustrate one way MVCC works by explaining a simplified version of InnoDB's behavior.
InnoDB implements MVCC by storing with each row two additional, hidden values that record when the row was created and when it was expired (or deleted). Rather than storing the actual times at which these events occurred, the row stores the system version number at the time each event occurred. This is a number that increments each time a transaction begins. Each transaction keeps its own record of the current system version, as of the time it began. Each query has to check each row's version numbers against the transaction's version. Let's see how this applies to particular operations when the transaction isolation level is set to REPEATABLE READ
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
MySQL's Storage Engines
This section gives an overview of MySQL's storage engines. We won't go into great detail here, because we discuss storage engines and their particular behaviors throughout the book. Even this book, though, isn't a complete source of documentation; you should read the MySQL manuals for the storage engines you decide to use. MySQL also has forums dedicated to each storage engine, often with links to additional information and interesting ways to use them.
If you just want to compare the engines at a high level, you can skip ahead to .
MySQL stores each database (also called a schema) as a subdirectory of its data directory in the underlying filesystem. When you create a table, MySQL stores the table definition in a .frm file with the same name as the table. Thus, when you create a table named MyTable, MySQL stores the table definition in MyTable.frm. Because MySQL uses the filesystem to store database names and table definitions, case sensitivity depends on the platform. On a Windows MySQL instance, table and database names are case insensitive; on Unix-like systems, they are case sensitive. Each storage engine stores the table's data and indexes differently, but the server itself handles the table definition.
To determine what storage engine a particular table uses, use the SHOW TABLE STATUS command. For example, to examine the user table in the mysql database, execute the following:
mysql> SHOW TABLE STATUS LIKE 'user' \G
*************************** 1. row ***************************
           Name: user
         Engine: MyISAM
     Row_format: Dynamic
           Rows: 6
 Avg_row_length: 59
    Data_length: 356
Max_data_length: 4294967295
   Index_length: 2048
      Data_free: 0
 Auto_increment: NULL
    Create_time: 2002-01-24 18:07:17
    Update_time: 2002-01-24 21:56:29
     Check_time: NULL
      Collation: utf8_bin
       Checksum: NULL
 Create_options: 
        Comment: Users and global privileges
1 row in set (0.00 sec)
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 2: Finding Bottlenecks: Benchmarking and Profiling
At some point, you're bound to need more performance from MySQL. But what should you try to improve? A particular query? Your schema? Your hardware? The only way to know is to measure what your system is doing, and test its performance under various conditions. That's why we put this chapter early in the book.
The best strategy is to find and strengthen the weakest link in your application's chain of components. This is especially useful if you don't know what prevents better performance—or what will prevent better performance in the future.
Benchmarking and profiling are two essential practices for finding bottlenecks. They are related, but they're not the same. A benchmark measures your system's performance. This can help determine a system's capacity, show you which changes matter and which don't, or show how your application performs with different data.
In contrast, profiling helps you find where your application spends the most time or consumes the most resources. In other words, benchmarking answers the question "How well does this perform?" and profiling answers the question "Why does it perform the way it does?"
We've arranged this chapter in two parts, the first about benchmarking and the second about profiling. We begin with a discussion of reasons and strategies for benchmarking, then move on to specific benchmarking tactics. We show you how to plan and design benchmarks, design for accurate results, run benchmarks, and analyze the results. We end the first part with a look at benchmarking tools and examples of how to use several of them.
The rest of the chapter shows how to profile both applications and MySQL. We show detailed examples of real-life profiling code we've used in production to help analyze application performance. We also show you how to log MySQL's queries, analyze the logs, and use MySQL's status counters and other tools to see what MySQL and your queries are doing.
Many medium to large MySQL deployments have staff dedicated to benchmarking. However, every developer and DBA should be familiar with basic benchmarking principles and practices, because they're broadly useful. Here are some things benchmarks can help you do:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Why Benchmark?
Many medium to large MySQL deployments have staff dedicated to benchmarking. However, every developer and DBA should be familiar with basic benchmarking principles and practices, because they're broadly useful. Here are some things benchmarks can help you do:
  • Measure how your application currently performs. If you don't know how fast it currently runs, you can't be sure any changes you make are helpful. You can also use historical benchmark results to diagnose problems you didn't foresee.
  • Validate your system's scalability. You can use a benchmark to simulate a much higher load than your production systems handle, such as a thousand-fold increase in the number of users.
  • Plan for growth. Benchmarks help you estimate how much hardware, network capacity, and other resources you'll need for your projected future load. This can help reduce risk during upgrades or major application changes.
  • Test your application's ability to tolerate a changing environment. For example, you can find out how your application performs during a sporadic peak in concurrency or with a different configuration of servers, or you can see how it handles a different data distribution.
  • Test different hardware, software, and operating system configurations. Is RAID 5 or RAID 10 better for your system? How does random write performance change when you switch from ATA disks to SAN storage? Does the 2.4 Linux kernel scale better than the 2.6 series? Does a MySQL upgrade help performance? What about using a different storage engine for your data? You can answer these questions with special benchmarks.
You can also use benchmarks for other purposes, such as to create a unit test suite for your application, but we focus only on performance-related aspects here.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Benchmarking Strategies
There are two primary benchmarking strategies: you can benchmark the application as a whole, or isolate MySQL. These two strategies are known as full-stack and single-component benchmarking, respectively. There are several reasons to measure the application as a whole instead of just MySQL:
  • You're testing the entire application, including the web server, the application code, and the database. This is useful because you don't care about MySQL's performance in particular; you care about the whole application.
  • MySQL is not always the application bottleneck, and a full-stack benchmark can reveal this.
  • Only by testing the full application can you see how each part's cache behaves.
  • Benchmarks are good only to the extent that they reflect your actual application's behavior, which is hard to do when you're testing only part of it.
On the other hand, application benchmarks can be hard to create and even harder to set up correctly. If you design the benchmark badly, you can end up making bad decisions, because the results don't reflect reality.
Sometimes, however, you don't really want to know about the entire application. You may just need a MySQL benchmark, at least initially. Such a benchmark is useful if:
  • You want to compare different schemas or queries.
  • You want to benchmark a specific problem you see in the application.
  • You want to avoid a long benchmark in favor of a shorter one that gives you a faster "cycle time" for making and measuring changes.
It's also useful to benchmark MySQL when you can repeat your application's queries against a real dataset. The data itself and the dataset's size both need to be realistic. If possible, use a snapshot of actual production data.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Benchmarking Tactics
With the general behind us, let's move on to the specifics of how to design and execute benchmarks. Before we discuss how to do benchmarks well, though, let's look at some common mistakes that can lead to unusable or inaccurate results:
  • Using a subset of the real data size, such as using only one gigabyte of data when the application will need to handle hundreds of gigabytes, or using the current dataset when you plan for the application to grow much larger.
  • Using incorrectly distributed data, such as uniformly distributed data when the real system's data will have "hot spots." (Randomly generated data is often unrealistically distributed.)
  • Using unrealistically distributed parameters, such as pretending that all user profiles are equally likely to be viewed.
  • Using a single-user scenario for a multiuser application.
  • Benchmarking a distributed application on a single server.
  • Failing to match real user behavior, such as "think time" on a web page. Real users request a page and then read it; they don't click on links one after another without pausing.
  • Running identical queries in a loop. Real queries aren't identical, so they cause cache misses. Identical queries will be fully or partially cached at some level.
  • Failing to check for errors. If a benchmark's results don't make sense—e.g., if a slow operation suddenly completes very quickly—check for errors. You might just be benchmarking how quickly MySQL can detect a syntax error in the SQL query! Always check error logs after benchmarks, as a matter of principle.
  • Ignoring how the system performs when it's not warmed up, such as right after a restart. Sometimes you need to know how long it'll take your server to reach capacity after a restart, so you'll want to look specifically at the warm-up period. Conversely, if you intend to study normal performance, you'll need to be aware that if you benchmark just after a restart many caches will be cold, and the benchmark results won't reflect the results you'll get under load when the caches are warmed up.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Benchmarking Tools
You don't have to roll your own benchmarking system, and in fact you shouldn't unless there's a good reason why you can't use one of the available ones. There are a wide variety of tools ready for you to use. We show you some of them in the following sections.
Recall that there are two types of benchmarks: full-stack and single-component. Not surprisingly, there are tools to benchmark full applications, and there are tools to stress-test MySQL and other components in isolation. Testing the full stack is usually a better way to get a clear picture of your system's performance. Existing full-stack tools include:
ab
ab is a well-known Apache HTTP server benchmarking tool. It shows how many requests per second your HTTP server is capable of serving. If you are benchmarking a web application, this translates to how many requests per second the entire application can satisfy. It's a very simple tool, but its usefulness is also limited because it just hammers one URL as fast as it can. More information on ab is available at http://httpd.apache.org/docs/2.0/programs/ab.html.
http_load
This tool is similar in concept to ab; it is also designed to load a web server, but it's more flexible. You can create an input file with many different URLs, and http_load will choose from among them at random. You can also instruct it to issue requests at a timed rate, instead of just running them as fast as it can. See http://www.acme.com/software/http_load/ for more information.
JMeter
JMeter is a Java application that can load another application and measure its performance. It was designed for testing web applications, but you can also use it to test FTP servers and issue queries to a database via JDBC.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Benchmarking Examples
In this section, we show you some examples of actual benchmarks with tools we mentioned in the preceding sections. We can't cover each tool exhaustively, but these examples should help you decide which benchmarks might be useful for your purposes and get you started using them.
Let's start with a simple example of how to use http_load, and use the following URLs, which we saved to a file called urls.txt:
http://www.mysqlperformanceblog.com/
http://www.mysqlperformanceblog.com/page/2/
http://www.mysqlperformanceblog.com/mysql-patches/
http://www.mysqlperformanceblog.com/mysql-performance-presentations/
http://www.mysqlperformanceblog.com/2006/09/06/slow-query-log-analyzes-tools/
The simplest way to use http_load is to simply fetch the URLs in a loop. The program fetches them as fast as it can:
$ http_load -parallel 1 -seconds 10 urls.txt
19 fetches, 1 max parallel, 837929 bytes, in 10.0003 seconds
44101.5 mean bytes/connection
1.89995 fetches/sec, 83790.7 bytes/sec
msecs/connect: 41.6647 mean, 56.156 max, 38.21 min
msecs/first-response: 320.207 mean, 508.958 max, 179.308 min
HTTP response codes:
  code 200 - 19
The results are pretty self-explanatory; they simply show statistics about the requests. A slightly more complex usage scenario is to fetch the URLs as fast as possible in a loop, but emulate five concurrent users:
$ http_load -parallel 5 -seconds 10 urls.txt
94 fetches, 5 max parallel, 4.75565e+06 bytes, in 10.0005 seconds
50592 mean bytes/connection
9.39953 fetches/sec, 475541 bytes/sec
msecs/connect: 65.1983 mean, 169.991 max, 38.189 min
msecs/first-response: 245.014 mean, 993.059 max, 99.646 min
HTTP response codes:
  code 200 - 94
Alternatively, instead of fetching as fast as possible, we can emulate the load for a predicted rate of requests (such as five per second):
$ http_load -rate 5 -seconds 10 urls.txt
48 fetches, 4 max parallel, 2.50104e+06 bytes, in 10 seconds
52105 mean bytes/connection
4.8 fetches/sec, 250104 bytes/sec
msecs/connect: 42.5931 mean, 60.462 max, 38.117 min
msecs/first-response: 246.811 mean, 546.203 max, 108.363 min
HTTP response codes:
  code 200 - 48
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Profiling
Profiling shows you how much each part of a system contributes to the total cost of producing a result. The simplest cost metric is time, but profiling can also measure the number of function calls, I/O operations, database queries, and so forth. The goal is to understand why a system performs the way it does.
Just like with benchmarking, you can profile at the application level or on a single component, such as the MySQL server. Application-level profiling usually yields better insight into how to optimize the application and provides more accurate results, because the results include the work done by the whole application. For example, if you're interested in optimizing the application's MySQL queries, you might be tempted to just run and analyze the queries. However, if you do this, you'll miss a lot of important information about the queries, such as insights into the work the application has to do when reading results into memory and processing them.
Because web applications are such a common use case for MySQL, we use a PHP web site as our example. You'll typically need to profile the application globally to see how the system is loaded, but you'll probably also want to isolate some subsystems of interest, such as the search function. Any expensive subsystem is a good candidate for profiling in isolation.
When we need to optimize how a PHP web site uses MySQL, we prefer to gather statistics at the granularity of objects (or modules) in the PHP code. The goal is to measure how much of each page's response time is consumed by database operations. Database access is often, but not always, the bottleneck in applications. Bottlenecks can also be caused by any of the following:
  • External resources, such as calls to web services or search engines
  • Operations that require processing large amounts of data in the application, such as parsing big XML files
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Operating System Profiling
It's often useful to peek into operating system statistics and try to find out what the operating system and hardware are doing. This can help not only when profiling an application, but also when troubleshooting.
This section is admittedly biased toward Unix-like operating systems, because that's what we work with most often. However, you can use the same techniques on other operating systems, as long as they provide the statistics.
The tools we use most frequently are vmstat, iostat, mpstat, and strace. Each of these shows a slightly different perspective on some combination of process, CPU, memory, and I/O activity. These tools are available on most Unix-like operating systems. We show examples of how to use them throughout this book, especially at the end of .
Be careful with strace on GNU/Linux on production servers. It seems to have issues with multithreaded processes sometimes, and we've crashed servers with it.
One set of tools we don't discuss elsewhere in detail is tools for discovering network activity and doing basic troubleshooting. As an example of how to do this, we show how you can track a MySQL connection back to its origin on another server.
Begin with the output of SHOW PROCESSLIST in MySQL, and note the Host column in one of the processes. We use the following example:
*************************** 21. row ***************************
     Id: 91296
   User: web
   Host: sargon.cluster3:37636
     db: main
Command: Sleep
   Time: 10
  State:
   Info: NULL
The Host column shows where the connection originated and, just as importantly, the TCP port from which it came. You can use that information to find out which process opened the connection. If you have root access to sargon, you can use netstat and the port number to find out which process opened the connection:
root@sargon# 
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 3: Schema Optimization and Indexing
Optimizing a poorly designed or badly indexed schema can improve performance by orders of magnitude. If you require high performance, you must design your schema and indexes for the specific queries you will run. You should also estimate your performance requirements for different kinds of queries, because changes to one query or one part of the schema can have consequences elsewhere. Optimization often involves tradeoffs. For example, adding indexes to speed up retrieval will slow updates. Likewise, a denormalized schema can speed up some types of queries but slow down others. Adding counter and summary tables is a great way to optimize queries, but they may be expensive to maintain.
Sometimes you may need to go beyond the role of a developer and question the business requirements handed to you. People who aren't experts in database systems often write business requirements without understanding their performance impacts. If you explain that a small feature will double the server hardware requirements, they may decide they can live without it.
Schema optimization and indexing require a big-picture approach as well as attention to details. You need to understand the whole system to understand how each piece will affect others. This chapter begins with a discussion of data types, then covers indexing strategies and normalization. It finishes with some notes on storage engines.
You will probably need to review this chapter after reading the chapter on query optimization. Many of the topics discussed here—especially indexing—can't be considered in isolation. You have to be familiar with query optimization and server tuning to make good decisions about indexes.
MySQL supports a large variety of data types, and choosing the correct type to store your data is crucial to getting good performance. The following simple guidelines can help you make better choices, no matter what type of data you are storing:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Choosing Optimal Data Types
MySQL supports a large variety of data types, and choosing the correct type to store your data is crucial to getting good performance. The following simple guidelines can help you make better choices, no matter what type of data you are storing:
Smaller is usually better.
In general, try to use the smallest data type that can correctly store and represent your data. Smaller data types are usually faster, because they use less space on the disk, in memory, and in the CPU cache. They also generally require fewer CPU cycles to process.
Make sure you don't underestimate the range of values you need to store, though, because increasing the data type range in multiple places in your schema can be a painful and time-consuming operation. If you're in doubt as to which is the best data type to use, choose the smallest one that you don't think you'll exceed. (If the system is not very busy or doesn't store much data, or if you're at an early phase in the design process, you can change it easily later.)
Simple is good.
Fewer CPU cycles are typically required to process operations on simpler data types. For example, integers are cheaper to compare than characters, because character sets and collations (sorting rules) make character comparisons complicated. Here are two examples: you should store dates and times in MySQL's built-in types instead of as strings, and you should use integers for IP addresses. We discuss these topics further later.
Avoid NULL if possible.
You should define fields as NOT NULL whenever you can. A lot of tables include nullable columns even when the application does not need to store
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Indexing Basics
Indexes are data structures that help MySQL retrieve data efficiently. They are critical for good performance, but people often forget about them or misunderstand them, so indexing is a leading cause of real-world performance problems. That's why we put this material early in the book—even earlier than our discussion of query optimization.
Indexes (also called "keys" in MySQL) become more important as your data gets larger. Small, lightly loaded databases often perform well even without proper indexes, but as the dataset grows, performance can drop very quickly.
The easiest way to understand how an index works in MySQL is to think about the index in a book. To find out where a particular topic is discussed in a book, you look in the index, and it tells you the page number(s) where that term appears.
MySQL uses indexes in a similar way. It searches the index's data structure for a value. When it finds a match, it can find the row that contains the match. Suppose you run the following query:
mysql> SELECT first_name FROM sakila.actor WHERE actor_id = 5;
There's an index on the actor_id column, so MySQL will use the index to find rows whose actor_id is 5. In other words, it performs a lookup on the values in the index and returns any rows containing the specified value.
An index contains values from a specified column or columns in a table. If you index more than one column, the column order is very important, because MySQL can only search efficiently on a leftmost prefix of the index. Creating an index on two columns is not the same as creating two separate single-column indexes, as you'll see.
There are many types of indexes, each designed to perform well for different purposes. Indexes are implemented in the storage engine layer, not the server layer. Thus, they are not standardized: indexing works slightly differently in each engine, and not all engines support all types of indexes. Even when multiple engines support the same index type, they may implement it differently under the hood.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Indexing Strategies for High Performance
Creating the correct indexes and using them properly is essential to good query performance. We've introduced the different types of indexes and explored their strengths and weaknesses. Now let's see how to really tap into the power of indexes.
There are many ways to choose and use indexes effectively, because there are many special-case optimizations and specialized behaviors. Determining what to use when and evaluating the performance implications of your choices are skills you'll learn over time. The following sections will help you understand how to use indexes effectively, but don't forget to benchmark!
MySQL generally can't use indexes on columns unless the columns are isolated in the query. "Isolating" the column means it should not be part of an expression or be inside a function in the query.
For example, here's a query that can't use the index on actor_id:
mysql> SELECT actor_id FROM sakila.actor WHERE actor_id + 1 = 5;
A human can easily see that the WHERE clause is equivalent to actor_id = 4, but MySQL can't solve the equation for actor_id. It's up to you to do this. You should get in the habit of simplifying your WHERE criteria, so the indexed column is alone on one side of the comparison operator.
Here's another example of a common mistake:
mysql> SELECT ... WHERE TO_DAYS(CURRENT_DATE) - TO_DAYS(date_col) <= 10;
This query will find all rows where the date_col value is newer than 10 days ago, but it won't use indexes because of the TO_DAYS() function. Here's a better way to write this query:
mysql> SELECT ... WHERE date_col >= DATE_SUB(CURRENT_DATE, INTERVAL 10 DAY);
This query will have no trouble using an index, but you can still improve it in another way. The reference to CURRENT_DATE will prevent the query cache from caching the results. You can replace CURRENT_DATE with a literal to fix that problem:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
An Indexing Case Study
The easiest way to understand indexing concepts is with an illustration, so we've prepared a case study in indexing.
Suppose we need to design an online dating site with user profiles that have many different columns, such as the user's country, state/region, city, sex, age, eye color, and so on. The site must support searching the profiles by various combinations of these properties. It must also let the user sort and limit results by the last time the profile's owner was online, ratings from other members, etc. How do we design indexes for such complex requirements?
Oddly enough, the first thing to decide is whether we have to use index-based sorting, or whether filesorting is acceptable. Index-based sorting restricts how the indexes and queries need to be built. For example, we can't use an index for a WHERE clause such as WHERE age BETWEEN 18 AND 25 if the same query uses an index to sort users by the ratings other users have given them. If MySQL uses an index for a range criterion in a query, it cannot also use another index (or a suffix of the same index) for ordering. Assuming this will be one of the most common WHERE clauses, we'll take for granted that many queries will need a filesort.
Now we need to look at which columns have many distinct values and which columns appear in WHERE clauses most often. Indexes on columns with many distinct values will be very selective. This is generally a good thing, because it lets MySQL filter out undesired rows more efficiently.
The country column may or may not be selective, but it'll probably be in most queries anyway. The sex column is certainly not selective, but it'll probably be in every query. With this in mind, we create a series of indexes for many different combinations of columns, prefixed with (sex,country).
The traditional wisdom is that it's useless to index columns with very low selectivity. So why would we place a nonselective column at the beginning of every index? Are we out of our minds?
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Index and Table Maintenance
Once you've created tables with proper data types and added indexes, your work isn't over: you still need to maintain your tables and indexes to make sure they perform well. The three main goals of table maintenance are finding and fixing corruption, maintaining accurate index statistics, and reducing fragmentation.
The worst thing that can happen to a table is corruption. With the MyISAM storage engine, this often happens due to crashes. However, all storage engines can experience index corruption due to hardware problems or internal bugs in MySQL or the operating system.
Corrupted indexes can cause queries to return incorrect results, raise duplicate-key errors when there is no duplicated value, or even cause lockups and crashes. If you experience odd behavior—such as an error that you think shouldn't be happening—run CHECK TABLE to see if the table is corrupt. (Note that some storage engines don't support this command, and others support multiple options to specify how thoroughly they check the table.) CHECK TABLE usually catches most table and index errors.
You can fix corrupt tables with the REPAIR TABLE command, but again, not all storage engines support this. In these cases you can do a "no-op" ALTER, such as altering a table to use the same storage engine it currently uses. Here's an example for an InnoDB table:
mysql> ALTER TABLE innodb_tbl ENGINE=INNODB;
Alternatively, you can either use an offline engine-specific repair utility, such as myisamchk, or dump the data and reload it. However, if the corruption is in the system area, or in the table's "row data" area instead of the index, you may be unable to use any of these options. In this case, you may need to restore the table from your backups or attempt to recover data from the corrupted files (see ).
The MySQL query optimizer uses two API calls to ask the storage engines how index values are distributed when deciding how to use indexes. The first is the
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Normalization and Denormalization
There are usually many ways to represent any given data, ranging from fully normalized to fully denormalized and anything in between. In a normalized database, each fact is represented once and only once. Conversely, in a denormalized database, information is duplicated, or stored in multiple places.
If you're not familiar with normalization, you should study it. There are many good books on the topic and resources online; here, we just give a brief introduction to the aspects you need to know for this chapter. Let's start with the classic example of employees, departments, and department heads:
EMPLOYEE
DEPARTMENT
HEAD
Jones
Accounting
Jones
Smith
Engineering
Smith
Brown
Accounting
Jones
Green
Engineering
Smith
The problem with this schema is that abnormalities can occur while the data is being modified. Say Brown takes over as the head of the Accounting department. We need to update multiple rows to reflect this change, and while those updates are being made the data is in an inconsistent state. If the "Jones" row says the head of the department is something different from the "Brown" row, there's no way to know which is right. It's like the old saying, "A person with two watches never knows what time it is." Furthermore, we can't represent a department without employees—if we delete all employees in the Accounting department, we lose all records about the department itself. To avoid these problems, we need to normalize the table by separating the employee and department entities. This process results in the following two tables for employees:
EMPLOYEE_NAME
DEPARTMENT
Jones
Accounting
Smith
Engineering
Brown
Accounting
Green
Engineering
and departments:
DEPARTMENT
HEAD
Accounting
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Speeding Up ALTER TABLE
MySQL's ALTER TABLE performance can become a problem with very large tables. MySQL performs most alterations by making an empty table with the desired new structure, inserting all the data from the old table into the new one, and deleting the old table. This can take a very long time, especially if you're short on memory and the table is large and has lots of indexes. Many people have experience with ALTER TABLE operations that have taken hours or days to complete.
MySQL AB is working on improving this. Some of the upcoming improvements include support for "online" operations that won't lock the table for the whole operation. The InnoDB developers are also working on support for building indexes by sorting. MyISAM already supports this technique, which makes building indexes much faster and results in a compact index layout. (InnoDB currently builds its indexes one row at a time in primary key order, which means the index trees aren't built in optimal order and are fragmented.)
Not all ALTER TABLE operations cause table rebuilds. For example, you can change or drop a column's default value in two ways (one fast, and one slow). Say you want to change a film's default rental duration from 3 to 5 days. Here's the expensive way:
mysql> ALTER TABLE sakila.film
    -> MODIFY COLUMN rental_duration TINYINT(3) NOT NULL DEFAULT 5;
Profiling that statement with SHOW STATUS shows that it does 1,000 handler reads and 1,000 inserts. In other words, it copied the table to a new table, even though the column's type, size, and nullability didn't change.
In theory, MySQL could have skipped building a new table. The default value for the column is actually stored in the table's .frm file, so you should be able to change it without touching the table itself. MySQL doesn't yet use this optimization; however, any MODIFY COLUMN will cause a table rebuild.
You can change a column's default with
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Notes on Storage Engines
We close this chapter with some storage engine-specific schema design choices you should keep in mind. We're not trying to write an exhaustive list; our goal is just to present some key factors that are relevant to schema design.
Table locks
MyISAM tables have table-level locks. Be careful this doesn't become a bottleneck.
No automated data recovery
If the MySQL server crashes or power goes down, you should check and possibly repair your MyISAM tables before using them. If you have large tables, this could take hours.
No transactions
MyISAM tables don't support transactions. In fact, MyISAM doesn't even guarantee that a single statement will complete; if there's an error halfway through a multirow UPDATE, for example, some of the rows will be updated and some won't.
Only indexes are cached in memory
MyISAM caches only the index inside the MySQL process, in the key buffer. The operating system caches the table's data, so in MySQL 5.0 an expensive operating system call is required to retrieve it.
Compact storage
Rows are stored jam-packed one after another, so you get a small disk footprint and fast full table scans for on-disk data.
Table locks
Like MyISAM tables, Memory tables have table locks. This isn't usually a problem though, because queries on Memory tables are normally fast.
No dynamic rows
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 4: Query Performance Optimization
In the previous chapter, we explained how to optimize a schema, which is one of the necessary conditions for high performance. But working with the schema isn't enough—you also need to design your queries well. If your queries are bad, even the best-designed schema will not perform well.
Query optimization, index optimization, and schema optimization go hand in hand. As you gain experience writing queries in MySQL, you will come to understand how to design schemas to support efficient queries. Similarly, what you learn about optimal schema design will influence the kinds of queries you write. This process takes time, so we encourage you to refer back to this chapter and the previous one as you learn more.
This chapter begins with general query design considerations—the things you should consider first when a query isn't performing well. We then dig much deeper into query optimization and server internals. We show you how to find out how MySQL executes a particular query, and you'll learn how to change the query execution plan. Finally, we look at some places MySQL doesn't optimize queries well and explore query optimization patterns that help MySQL execute queries more efficiently.
Our goal is to help you understand deeply how MySQL really executes queries, so you can reason about what is efficient or inefficient, exploit MySQL's strengths, and avoid its weaknesses.
The most basic reason a query doesn't perform well is because it's working with too much data. Some queries just have to sift through a lot of data and can't be helped. That's unusual, though; most bad queries can be changed to access less data. We've found it useful to analyze a poorly performing query in two steps: