Chapter 1. Query Response Time
Performance is query response time.
This book explores that idea from various angles with a single intent: to help you achieve remarkable MySQL performance. Efficient MySQL performance means focusing on the best practices and techniques that directly affect MySQL performance—no superfluous details or deep internals required by DBAs and experts. I presume that you’re a busy professional who is using MySQL, not managing it, and that you need the most results for the least effort. That’s not laziness, that’s efficiency. To that end, this book is direct and to the point. And by the end, you will be able to achieve remarkable MySQL performance.
MySQL performance is a complex and multifaceted subject, but you do not need to become an expert to achieve remarkable performance. I narrow the scope of MySQL complexity by focusing on the essentials. MySQL performance begins with query response time.
Query response time is how long it takes MySQL to execute a query. Synonymous terms are: response time, query time, execution time, and (inaccurately) query latency.1 Timing starts when MySQL receives the query and ends when it has sent the result set to the client. Query response time comprises many stages (steps during query execution) and waits (lock waits, I/O waits, and so on), but a complete and detailed breakdown is neither possible nor necessary. As with many systems, basic troubleshooting and analysis reveal the majority of problems.
Note
Performance increases as query response time decreases. Improving query response time is synonymous with reducing query response time.
This chapter is the foundation. It expounds query response time so that, in subsequent chapters, you can learn how to improve it. There are seven major sections. The first is a true story to motivate and amuse. The second discusses why query response time is the North Star of MySQL performance. The third outlines how query metrics are transformed into meaningful reports: query reporting. The fourth addresses query analysis: using query metrics and other information to understand query execution. The fifth maps out the journey of improving query response time: query optimization. The sixth gives an honest and modest schedule for optimizing queries. The seventh discusses why MySQL cannot simply go faster—why query optimization is necessary.
A True Story of False Performance
In 2004, I was working the night shift at a data center—2 p.m. to midnight. It was a great job for two reasons. First, the only employees in the data center after 5 p.m. were a handful of engineers monitoring and managing thousands of physical servers for an undisclosed number of customers and websites—probably tens of thousands of websites. It was an engineer’s dream. Second, there were countless MySQL servers that always had problems to fix. It was a gold mine of learning and opportunity. But at the time, there were few books, blogs, or tools about MySQL. (Though that same year, O’Reilly published the first edition of High Performance MySQL.) Consequently, the state of the art for “fixing” MySQL performance problems was “sell the customer more RAM.” For sales and management it always worked, but for MySQL the results were inconsistent.
One night I decided not to sell the customer more RAM and, instead, to do a technical deep dive to find and fix the true root cause of their MySQL performance problem. Their database was powering a bulletin board which had slowed to a crawl under the weight of its success—still a common problem today, almost 20 years later. To make a long story short, I found a single query missing a critical index. After properly indexing the query, performance improved dramatically and the website was saved. It cost the customer zero dollars.
Not all performance problems and solutions are that straightforward and glamorous. But just shy of 20 years’ experience with MySQL has taught me (and many others) that MySQL performance problems are very often solved by the best practices and techniques in this book.
North Star
I’m a MySQL DBA and a software engineer, so I know what it’s like working with MySQL as the latter. Especially when it comes to performance, we (software engineers) just want it (MySQL) to work. Between shipping features and putting out fires, who has time for MySQL performance? And when MySQL performance is poor—or worse: when it suddenly becomes poor—the way forward can be difficult to see because there are many considerations: where do we begin? Do we need more RAM? Faster CPUs? More storage IOPS? Is the problem a recent code change? (Fact: code changes deployed in the past can cause performance problems in the future, sometimes days in the future.) Is the problem a noisy neighbor? Are the DBAs doing something to the database? Has the app gone viral and is it the good kind of problem to have?
As an engineer whose expertise is the application, not MySQL, that situation can be overwhelming. To move forward confidently, start by looking at query response time because it is meaningful and actionable. These are powerful qualities that lead to real solutions:
- Meaningful: Query response time is the only metric anyone truly cares about because, let’s be honest, when the database is fast, nobody looks at it or asks questions. Why? Because query response time is the only metric we experience. When a query takes 7.5 seconds to execute, we experience 7.5 seconds of impatience. That same query might examine a million rows, but we don’t experience a million rows examined. Our time is precious.
- Actionable: There’s so much you can do to improve query response time and make everyone happy again that you’re holding a book about it. (Do people still hold books in the future? I hope so.) Query response time is directly actionable because you own the code, so you can change the queries. Even if you don’t own the code (or have access to it), you can still indirectly optimize query response time. “Improving Query Response Time” addresses direct and indirect query optimization.
Focus on improving query response time—the North Star of MySQL performance. Do not begin by throwing hardware at the problem. Begin by using query metrics to determine what MySQL is doing, then analyze and optimize slow queries to reduce response time, and repeat. Performance will improve.
Query Reporting
Query metrics provide invaluable insights into query execution: response time, lock time, rows examined, and so on. But query metrics, like all metrics, are raw values that need to be collected, aggregated, and reported in a way that’s meaningful to (and readable for) engineers. That’s what this section outlines: how query metric tools transform query metrics into query reports. But query reporting is only a means to an end, as discussed in “Query Analysis”.
Looking ahead, query analysis is the real work: analyzing query metrics (as reported) and other information with the goal of understanding query execution. To improve MySQL performance, you must optimize queries. To optimize queries, you must understand how they execute. And to understand that, you must analyze them with pertinent information, including query reports and metadata.
But first you need to understand query reporting, since it represents the trove of query metrics that provide invaluable insights into query execution. The next three sections teach you about the following:
- Sources: query metrics originate from two sources and vary by MySQL distribution and version
- Aggregation: query metric values are grouped and aggregated by normalized SQL statements
- Reporting: query reports are organized by a high-level profile and a query-specific report
Then you’re ready for “Query Analysis”.
Note
This is not a book about database administration, so this section does not discuss the setup and configuration of query metrics in MySQL. I presume this is already done or will be done. If not, don’t worry: ask your DBA, hire a consultant, or learn how by reading the MySQL manual.
Sources
Query metrics originate from the slow query log or the Performance Schema. As the names indicate, the former is a log file on disk, and the latter is a database with the same name: performance_schema. Although completely different in nature (a log file on disk as opposed to tables in a database), both provide query metrics. The important difference is how many metrics they provide: apart from query response time, which both provide, the number of metrics ranges from 3 to more than 20.
Note
The name slow query log is historical. Long ago, MySQL logged only queries that took greater than N seconds to execute, and the minimum value for N was 1. Old versions of MySQL would not log a query that took 900 milliseconds to execute because that was “fast.” The slow query log really earned its name. Today, the minimum value is zero with a resolution of microseconds. When set to zero, MySQL logs every query executed. Therefore, the name is a little misleading, but now you know why.
All things considered, the Performance Schema is the best source of query metrics because it exists in every current version and distribution of MySQL, it works locally and in the cloud, it provides all nine metrics covered in “Query Metrics”, and it’s the most consistent. Plus, the Performance Schema contains a wealth of other data for deep MySQL analysis, so its usefulness extends far beyond query metrics. The slow query log is a good source, too, but it varies considerably:
- MySQL: As of MySQL 8.0.14, enable system variable log_slow_extra and the slow query log provides six of the nine metrics in “Query Metrics”, lacking only Rows_affected, Select_scan, and Select_full_join. It’s still a good source, but use the Performance Schema if possible.

  Before MySQL 8.0.14, which includes MySQL 5.7, the slow query log is bare bones, providing only Query_time, Lock_time, Rows_sent, and Rows_examined. You can still analyze queries with only these four metrics, but the analysis is much less insightful. Consequently, avoid the slow query log before MySQL 8.0.14 and instead use the Performance Schema.

- Percona Server: Percona Server provides a significantly greater number of metrics in the slow query log when system variable log_slow_verbosity is configured: all nine metrics covered in “Query Metrics” and more. It also supports query sampling (logging a percentage of queries) when system variable log_slow_rate_limit is configured, which is helpful for busy servers. These features make the Percona Server slow query log a great source. See “Slow Query Log” in the Percona Server manual for details.

- MariaDB Server: MariaDB Server 10.x uses the Percona Server slow query log enhancements, but there are two notable differences: system variable log_slow_verbosity is configured differently in MariaDB, and it does not provide metric Rows_affected. Otherwise, it’s essentially the same and a great source, too. See “Slow Query Log Extended Statistics” in the MariaDB knowledge base for details.
The slow query log is disabled by default, but you can enable it dynamically (without restarting MySQL). The Performance Schema should be enabled by default, though some cloud providers disable it by default. Unlike the slow query log, the Performance Schema cannot be enabled dynamically—you must restart MySQL to enable it.
Make sure the best query metric source is used and properly configured. Ask your DBA, hire a consultant, or learn how by reading the MySQL manual.
Warning
The slow query log can log all queries when long_query_time is set to zero, but be careful: on a busy server, this can increase disk I/O and use a significant amount of disk space.
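If you need to enable the slow query log, here is a minimal sketch using standard system variables; the file path is hypothetical, and defaults and required privileges vary by distribution and version:

SET GLOBAL slow_query_log = ON;                              -- enable dynamically (no restart required)
SET GLOBAL slow_query_log_file = '/var/lib/mysql/slow.log';  -- hypothetical path
SET GLOBAL long_query_time = 0;                              -- log every query (mind the warning above)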
Aggregation
Query metrics are grouped and aggregated by query. That sounds obvious since they’re called query metrics, but some query metric tools can group by username, hostname, database, and so on. These alternate groupings are exceptionally rare and yield a different type of query analysis, so I don’t address them in this book. Since query response time is the North Star of MySQL performance, grouping query metrics by query is the best way to see which queries have the slowest response time, which forms the basis of query reporting and analysis.
There’s one little problem: how do you uniquely identify queries to determine the groups to which they belong? For example, system metrics (CPU, memory, storage, and so on) are grouped by hostname because hostnames are unique and meaningful. But queries don’t have any uniquely identifying properties like hostname. The solution: a SHA-256 hash of the normalized SQL statement. Example 1-1 shows how a SQL statement is normalized.
Example 1-1. SQL statement normalization
-- SQL statement
SELECT col FROM tbl WHERE id=1

-- Digest text
SELECT `col` FROM `tbl` WHERE `id` = ?

-- Digest hash
f49d50dfab1c364e622d1e1ff54bb12df436be5d44c464a4e25a1ebb80fc2f13
MySQL normalizes SQL statements to digest texts, then computes the SHA-256 hash of the digest text to yield the digest hash. (It’s not necessary to understand the full process of normalization; it’s sufficient to know that normalization replaces all values with ? and collapses multiple whitespaces to a single space.) Since the digest text is unique, the digest hash is also unique (hash collisions notwithstanding).
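You can see digests directly in the Performance Schema. Here is a sketch of retrieving the digest text and hash for the example statement; the digest summary table and its columns are standard, but the LIKE pattern is just an illustrative filter:

SELECT DIGEST,       -- digest hash (SHA-256, as a hex string)
       DIGEST_TEXT   -- digest text (normalized SQL)
FROM performance_schema.events_statements_summary_by_digest
WHERE DIGEST_TEXT LIKE '%`tbl`%'
LIMIT 1;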
Note
The MySQL manual uses the term digest equivocally to mean either digest text or digest hash. Since the digest hash is computed from the digest text, the equivocation is only a language ambiguity, not a technical error. Please allow me to equivocate, too, and use digest to mean either digest text or digest hash when the technical difference doesn’t matter.
There is an important shift in terminology in the context of query metrics: the term query changes to be synonymous with digest text. The shift in terminology aligns with the shift in focus: grouping metrics by query. To group by query, query must be unique, which is only true of digests.
SQL statements are also called query samples (or samples for short), and they may or may not be reported. For security, most query metric tools discard samples by default (because they contain real values) and report only digest texts and hashes. Samples are required for query analysis because you can EXPLAIN them, which produces metadata necessary for understanding query execution. Some query metric tools EXPLAIN a sample, then discard it, and report the EXPLAIN plan (the output of EXPLAIN). Others only report the sample, which is still very convenient: copy-paste to EXPLAIN. If you have neither, then manually extract samples from the source or manually write them when needed.
Two more clarifications about terminology and then I promise we’ll move on to more exciting material. First, terminology varies widely depending on the query metric tool, as shown in Table 1-1.
Official (MySQL) | Alternatives |
---|---|
SQL statement | Query |
Sample | Query |
Digest text | Class, family, fingerprint, query |
Digest hash | Class ID, query ID, signature |
Second, another term that originated from Percona is query abstract: a SQL statement highly abstracted to its SQL command and table list. Example 1-2 is the query abstract for SELECT col FROM tbl WHERE id=1.
Example 1-2. Query abstract
SELECT tbl
Query abstracts are not unique, but they are useful because they’re succinct. Usually, developers only need to see a query abstract to know the full query that it represents.
Brevity is the soul of wit.
William Shakespeare
It’s important to understand that SQL statements are normalized because the queries you write are not the queries you see. Most of the time, this is not a problem because digest texts closely resemble SQL statements. But the process of normalization raises another important point: do not dynamically generate the same logical query with different syntax, else it will normalize to different digests and be reported as different queries. For example, in the case of a programmatically generated query that changes the WHERE clause based on user input:

SELECT name FROM captains WHERE last_name = 'Picard'

SELECT name FROM captains WHERE last_name = 'Picard' AND first_name = 'Jean-Luc'
Those two queries may be logically the same to you and the application, but they’re different queries with respect to reporting because they normalize to different digests. To my knowledge, no query metric tool allows you to combine queries. And it’s technically correct to report those queries separately because every condition—especially in the WHERE clause—affects query execution and optimization.
One point about query normalization: values are removed, so the following two queries normalize to the same digest:
-- SQL statements
SELECT `name` FROM star_ships WHERE class IN ('galaxy')

SELECT `name` FROM star_ships WHERE class IN ('galaxy', 'intrepid')

-- Digest text
SELECT `name` FROM `star_ships` WHERE `class` IN (...)
Since the digest is the same for both queries, the metrics for both queries are grouped, aggregated, and reported as one query.
Enough about terminology and normalization. Let’s talk about reporting.
Reporting
Reporting is a challenge and an art form because a single application can have hundreds of queries. Each query has many metrics and each metric has several statistics: minimum, maximum, average, percentile, and so forth. On top of that, each query has metadata: samples, EXPLAIN plans, table structures, and so on. It’s challenging to store, process, and present all this data. Almost every query metric tool presents the data in a two-level hierarchy: query profile and query report. Those terms vary by query metric tool, but you will easily recognize each when you see them.
Query profile
A query profile shows slow queries. It is the top-level organization for query reporting, usually the first thing you see in a query metric tool. It presents query digests and a limited subset of query metrics, which is why it’s called a profile.
Slow is relative to the sort metric: the aggregate value of a query metric by which queries are ordered. The first ordered query is called the slowest, even if the sort metric is not query time (or any time). For example, if the sort metric is average rows sent, the first ordered query is still called the slowest.
Although any query metric can be the sort metric, query time is the universal default sort metric. When you reduce query execution time, you free up time that allows MySQL to do more work, or possibly do other work more quickly. Sorting queries by query time shows you where to begin: the slowest, most time-consuming queries.
What’s not universal is how query time is aggregated. The most common aggregate values are:
- Total query time: Total query time is the sum of execution time (per query). This is the most common aggregate value because it answers an important question: which query does MySQL spend the most time executing? To answer that, a query metric tool adds up all the time MySQL spends executing each query. The query with the greatest total time is the slowest, most time-consuming query. Here’s an example of why this is important. Suppose query A has a 1-second response time and executes 10 times, while query B has a 0.1-second response time and executes 1,000 times. Query A has a much slower response time, but query B is 10 times more time-consuming: 10 seconds in total versus 100 seconds in total, respectively. In a query profile sorted by total query time, query B is the slowest query. This is important because you free up the most time for MySQL by optimizing query B.

- Percentage execution time: Percentage execution time is total query time (per query) divided by grand total execution time (all queries). For example, if query C has a total query time of 321 ms and query D has a total query time of 100 ms, then grand total execution time is 421 ms. Individually, query C is (321 ms / 421 ms) × 100 = 76.2% of grand total execution time, and query D is (100 ms / 421 ms) × 100 = 23.8% of grand total execution time. In other words, MySQL spent 421 ms executing queries, 76.2% of which was spent executing query C. In a query profile sorted by percentage execution time, query C is the slowest query. Percentage execution time is used by some query metric tools, but not all.

- Query load: Query load is total query time (per query) divided by clock time, where clock time is the number of seconds in the time range. If the time range is 5 minutes, then clock time is 300 seconds. For example, if query E has a total query time of 250.2 seconds, then its load is 250.2 s / 300 s = 0.83; and if query F has a total query time of 500.1 seconds, then its load is 500.1 s / 300 s = 1.67. In a query profile sorted by query load, query F is the slowest query because its load is the greatest.

  Load is relative to time but also subtly indicative of concurrency: multiple instances of a query executing at the same time. Query load less than 1.0 means that, on average, the query does not execute concurrently. Query load greater than 1.0 indicates query concurrency. For example, a query load of 3.5 means that, any time you look, you’re likely to see 3.5 instances of the query executing. (In reality, 3 or 4 instances of the query since there cannot be 0.5 instances of a query.) The higher the query load, the greater the possibility of contention if the query accesses the same or nearby rows. Query load greater than 10 is high and likely to be a slow query, but there are exceptions. As I write this, I’m looking at a query with a load of 5,962. How is that possible? I reveal the answer in “Data Access”.
When the sort metric uses a nontemporal query metric, like rows sent, a different aggregate value (average, maximum, and so on) might make sense depending on what you’re trying to diagnose. This is far less common than total query time, but it occasionally reveals interesting queries worth optimizing.
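As a rough sketch of how a tool might compute these aggregates from the Performance Schema (assuming MySQL 8.0 for the window function, and assuming the digest summary table was reset at the start of a 300-second window; in reality it accumulates since the last reset):

-- Profile sorted by total query time (the universal default sort metric)
SELECT DIGEST_TEXT AS query,
       COUNT_STAR AS executions,
       SUM_TIMER_WAIT / 1e12 AS total_time_s,                          -- picoseconds to seconds
       100 * SUM_TIMER_WAIT / SUM(SUM_TIMER_WAIT) OVER () AS pct_time, -- percentage execution time
       SUM_TIMER_WAIT / 1e12 / 300 AS query_load                       -- total time / clock time (300 s)
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;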
Query report
A query report shows you everything there is to know about one query. It is the second-level organization for query reporting, usually accessed by selecting a slow query in the query profile. It presents all query metrics and metadata. Whereas the query profile tells you something just by looking at it (which queries are the slowest), a query report is an organized information dump used for query analysis. As such, the more information, the better because it helps you understand query execution.
Query reports vary dramatically depending on the query metric tool. A bare minimum report includes all query metrics from the source and the basic statistics for those metrics: minimum, maximum, average, percentile, and so forth. A thorough report includes metadata: query samples, EXPLAIN plans, table structures, and more. (Samples may be disabled for security purposes because they contain real values.) A few query metric tools go further by adding additional information: metric graphs, histograms (distributions), anomaly detection, time shift comparison (now versus last week), developer notes, SQL comment key-value extraction, and so on.
Query analysis only requires query metrics in the report. Metadata can be collected manually. If the query metric tool you use reports only query metrics, don’t worry: that’s a start, but you will need to manually collect EXPLAIN plans and table structures, at the very least.
With a query report figuratively in hand, you’re equipped for query analysis.
Query Analysis
The goal of query analysis is understanding query execution, not solving slow response time. That might surprise you, but solving slow response time happens after query analysis, during query optimization. First, you need to understand what you’re trying to change: query execution.
Query execution is like a story with a beginning, middle, and end: you read all three to understand the story. Once you understand how MySQL executes a query, then you will understand how to optimize it. Understanding through analysis, then action through optimization.
Tip
I have helped many engineers analyze queries, and the primary difficulty is not understanding the metrics but getting stuck in the analysis: staring deeply into the numbers, waiting for a revelation. Don’t get stuck. Carefully review all the metrics and metadata—read the whole story—then turn your attention to query optimization with the goal of improving response time.
The following sections address key aspects to an efficient and insightful query analysis. Sometimes the cause of slow response time is so obvious that the analysis reads more like a tweet than a story. But when it’s not—when the analysis reads like a graduate thesis on French existentialism—understanding these aspects will help you find the cause and determine a solution.
Query Metrics
From “Sources”, you know that query metrics vary depending on the source, MySQL distribution, and MySQL version. All query metrics are important because they help you understand query execution, but the nine metrics detailed in the following sections are essential to every query analysis.
The Performance Schema provides all nine essential query metrics.
Note
Query metric names also vary by source. In the slow query log, query time is Query_time; but in the Performance Schema, it’s TIMER_WAIT. I don’t use either convention. Instead, I use human-friendly names like query time and rows sent. Query reporting almost always uses human-friendly names, too.
Query time
Query time is the most important metric—you knew that already. What you may not know is that query time includes another metric: lock time.
Lock time is an inherent part of query time, so it’s not surprising that the latter includes the former. What’s surprising is that query time and lock time are the only two time-based query metrics, with one exception: the Percona Server slow query log has metrics for InnoDB read time, row lock wait time, and queue wait time. Lock time is important, but there’s an unfortunate technical gotcha: it’s accurate only in the slow query log. More on this later.
Using the Performance Schema, you can see many (but not all) parts of query execution. This is off-topic and beyond the scope of this book, but it’s good awareness so you know where to look if you need to dig deeper. MySQL instruments a bewildering number of events that the manual defines as, “anything the server does that takes time and has been instrumented so that timing information can be collected.” Events are organized in a hierarchy:
transactions
└── statements
    └── stages
        └── waits
- Transactions: Transactions are the top-level event because every query executes in a transaction (Chapter 8 covers transactions).
- Statements: Statements are queries, to which query metrics apply.
- Stages: Stages are “steps during the statement-execution process, such as parsing a statement, opening a table, or performing a filesort operation.”
- Waits: Waits are “events that take time.” (This definition amuses me. It’s tautological and oddly satisfying in its simplicity.)
Example 1-3 shows the stages for a single UPDATE statement (as of MySQL 8.0.22).

Example 1-3. Stages for a single UPDATE statement
+----------------------------------+----------------------------------+-----------+
| stage                            | source:line                      | time (ms) |
+----------------------------------+----------------------------------+-----------+
| stage/sql/starting               | init_net_server_extension.cc:101 |     0.109 |
| stage/sql/Executing hook on trx  | rpl_handler.cc:1120              |     0.001 |
| stage/sql/starting               | rpl_handler.cc:1122              |     0.008 |
| stage/sql/checking permissions   | sql_authorization.cc:2200        |     0.004 |
| stage/sql/Opening tables         | sql_base.cc:5745                 |     0.102 |
| stage/sql/init                   | sql_select.cc:703                |     0.007 |
| stage/sql/System lock            | lock.cc:332                      |     0.072 |
| stage/sql/updating               | sql_update.cc:781                | 10722.618 |
| stage/sql/end                    | sql_select.cc:736                |     0.003 |
| stage/sql/query end              | sql_parse.cc:4474                |     0.002 |
| stage/sql/waiting handler commit | handler.cc:1591                  |     0.034 |
| stage/sql/closing tables         | sql_parse.cc:4525                |     0.015 |
| stage/sql/freeing items          | sql_parse.cc:5007                |     0.061 |
| stage/sql/logging slow query     | log.cc:1640                      |     0.094 |
| stage/sql/cleaning up            | sql_parse.cc:2192                |     0.002 |
+----------------------------------+----------------------------------+-----------+
The real output is more complex; I simplified it for easy reading. The UPDATE statement executed in 15 stages. The actual execution of the UPDATE was the eighth stage: stage/sql/updating. There were 42 waits, but I removed them from the output because they’re too far off topic.
Performance Schema events (transactions, statements, stages, and waits) are the fine details of query execution. Query metrics apply to statements. If you need to dig deeper in a query, look in the Performance Schema.
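If you do need stage-level detail, stages link to their parent statement via nesting event IDs. Here is a sketch (assuming the stage instruments and the history_long consumers are enabled, which they are not by default; the SQL_TEXT filter is a hypothetical way to find the statement of interest):

-- Stages for a recent UPDATE statement, in execution order
SELECT stg.EVENT_NAME AS stage,
       stg.SOURCE     AS source_line,
       stg.TIMER_WAIT / 1e9 AS time_ms   -- picoseconds to milliseconds
FROM performance_schema.events_statements_history_long stmt
JOIN performance_schema.events_stages_history_long stg
  ON  stg.THREAD_ID = stmt.THREAD_ID
  AND stg.NESTING_EVENT_ID = stmt.EVENT_ID
WHERE stmt.SQL_TEXT LIKE 'UPDATE%'
ORDER BY stg.TIMER_START;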
Efficiency is our modus operandi, so don’t get lost in the Performance Schema until you need to, which may be never. Query time is sufficient.
Lock time
Lock time is time spent acquiring locks during query execution. Ideally, lock time is a minuscule percentage of query time, but values are relative (see “Relative Values”). For example, on one extremely optimized database that I manage, lock time is 40% to 50% of query time for the slowest query. Sounds terrible, right? But it’s not: the slowest query has a maximum query time of 160 microseconds and a maximum lock time of 80 microseconds—and the database executes over 20,000 queries per second (QPS).
Although values are relative, I can safely say that lock time greater than 50% of query time is a problem because MySQL should spend the vast majority of its time doing work, not waiting. A theoretically perfect query execution would have zero wait time, but that’s impossible due to shared resources, concurrency, and latency inherent in the system. Still, we can dream.
Remember the unfortunate technical gotcha mentioned earlier? Here it is: lock time from the Performance Schema does not include row lock waits, only table and metadata lock waits. Row lock waits are the most important part of lock time, which makes lock time from the Performance Schema nearly useless. By contrast, lock time from the slow query log includes all lock waits: metadata, table, and row. Lock time from either source does not indicate which type of lock wait. From the Performance Schema, it’s certainly metadata lock wait; and from the slow query log, it’s probably row lock wait, but metadata lock wait is a possibility, too.
Warning
Lock time from the Performance Schema does not include row lock waits.
Locks are primarily used for writes (INSERT, UPDATE, DELETE, REPLACE) because rows must be locked before they can be written. Response time for writes depends, in part, on lock time. The amount of time needed to acquire row locks depends on concurrency: how many queries are accessing the same (or nearby) rows at the same time. If a row has zero concurrency (accessed by only one query at a time), then lock time is vanishingly small. But if a row is hot—jargon for very frequently accessed—then lock time could account for a significant percentage of response time. Concurrency is one of several data access patterns (see “Data Access Patterns” in Chapter 4).
For reads (SELECT), there are nonlocking and locking reads. The distinction is easy because there are only two locking reads: SELECT…FOR UPDATE and SELECT…FOR SHARE. If not one of those two, then SELECT is nonlocking, which is the normal case.
Although SELECT…FOR UPDATE and SELECT…FOR SHARE are the only locking reads, don’t forget about writes with an optional SELECT. In the following SQL statements, the SELECT acquires shared row locks on table s:

- INSERT…SELECT FROM s
- REPLACE…SELECT FROM s
- UPDATE…WHERE…(SELECT FROM s)
- CREATE TABLE…SELECT FROM s

Strictly speaking, those SQL statements are writes, not reads, but the optional SELECT acquires shared row locks on table s. See “Locks Set by Different SQL Statements in InnoDB” in the MySQL manual for details.
Locking reads should be avoided, especially SELECT…FOR UPDATE, because they don’t scale, they tend to cause problems, and there is usually a nonlocking solution to achieve the same result. With respect to lock time, a locking read is like a write: it depends on concurrency. Be careful with SELECT…FOR SHARE: shared locks are compatible with other shared locks, but they’re incompatible with exclusive locks, which means shared locks block writes on the same (or nearby) rows.
For nonlocking reads, even though row locks are not acquired, lock time will not be zero because metadata and table locks are acquired. But acquiring these two should be very fast: less than 1 millisecond. For example, another database I manage executes over 34,000 QPS, but the slowest query is a nonlocking SELECT that does a full table scan, reading six million rows every execution, with very high concurrency: 168 query load. Despite these large values, its maximum lock time is 220 microseconds, and average lock time is 80 microseconds.
Nonlocking read does not mean non-blocking. SELECT queries must acquire shared metadata locks (MDL) on all tables accessed. As usual with locks, shared MDL are compatible with other shared MDL, but one exclusive MDL blocks all other MDL. ALTER TABLE is the common operation that acquires an exclusive MDL. Even using ALTER TABLE…ALGORITHM=INPLACE, LOCK=NONE or third-party online schema change tools like pt-online-schema-change and gh-ost, an exclusive MDL must be acquired at the end to swap the old table structure for the new one. Although the table swap is very quick, it can cause a noticeable disruption when MySQL is heavily loaded because all table access is blocked while the exclusive MDL is held. This problem shows up as a blip in lock time, especially for SELECT statements.
Warning
SELECT can block waiting for metadata locks.
Locking might be the most complex and nuanced aspect of MySQL. To avoid going down the proverbial rabbit hole, let me state five points but defer explanation for now. Merely being aware of these points greatly increases your MySQL prowess:
- Lock time can be significantly greater than innodb_lock_wait_timeout because this system variable applies to each row lock.
- Locking and transaction isolation levels are related.
- InnoDB locks every row it accesses including rows it does not write.
- Locks are released on transaction commit or rollback, and sometimes during query execution.
- InnoDB has different types of locks: record, gap, next-key, and more.
“Row Locking” goes into detail. For now, let’s put it all together and visualize how query time includes lock time. Figure 1-1 shows locks acquired and released during query execution.
Labels 1 to 10 mark events and details with respect to locking:
1. Acquire shared metadata lock on table
2. Acquire intention exclusive (IX) table lock
3. Acquire row lock 1
4. Update (write) row 1
5. Acquire row lock 2
6. Release row lock 2
7. Acquire row lock 3
8. Update (write) row 3
9. Commit transaction
10. Release all locks
Two points of interest:
- Lock time from the Performance Schema includes only labels 1 and 2. From the slow query log it includes labels 1, 2, 3, 5, and 7.
- Although row 2 is locked (label 5), it’s not written and its lock is released (label 6) before the transaction commits (label 9). This can happen, but not always. It depends on the query and transaction isolation level.
That was a lot of information about lock time and locking, but now you are well-equipped to understand lock time in your query analysis.
Rows examined
Rows examined is the number of rows that MySQL accessed to find matching rows. It indicates the selectivity of the query and the indexes. The more selective both are, the less time MySQL wastes examining non-matching rows. This applies to reads and writes, except INSERT unless it’s an INSERT…SELECT statement.

To understand rows examined, let’s look at two examples. First, let’s use the following table, t1, and three rows:
CREATE TABLE `t1` (
  `id` int NOT NULL,
  `c` char(1) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

+----+---+
| id | c |
+----+---+
|  1 | a |
|  2 | b |
|  3 | c |
+----+---+
Column id is the primary key, and column c is not indexed. The query SELECT c FROM t1 WHERE c = 'b' matches one row but examines three rows because there is no unique index on column c. Therefore, MySQL has no idea how many rows match the WHERE clause. We can see that only one row matches, but MySQL doesn’t have eyes, it has indexes. By contrast, the query SELECT c FROM t1 WHERE id = 2 matches and examines only one row because there is a unique index on column id (the primary key) and the table condition uses the entire index. Now MySQL can figuratively see that only one row matches, so that’s all it examines. Chapter 2 teaches indexes and indexing, which explain table conditions and a lot more.
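You can verify both cases with EXPLAIN; the output below is abbreviated to the relevant fields, and exact values vary by MySQL version:

EXPLAIN SELECT c FROM t1 WHERE c = 'b';
-- type: ALL, key: NULL, rows: 3       (full table scan: all three rows examined)

EXPLAIN SELECT c FROM t1 WHERE id = 2;
-- type: const, key: PRIMARY, rows: 1  (primary key lookup: one row examined)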
For the second example, let’s use the following table, t2, and seven rows:

CREATE TABLE `t2` (
  `id` int NOT NULL,
  `c` char(1) NOT NULL,
  `d` varchar(8) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `c` (`c`)
) ENGINE=InnoDB;

+----+------+--------+
| id | c    | d      |
+----+------+--------+
|  1 | a    | apple  |
|  2 | a    | ant    |
|  3 | a    | acorn  |
|  4 | a    | apron  |
|  5 | b    | banana |
|  6 | b    | bike   |
|  7 | c    | car    |
+----+------+--------+
Column id is the same as before (primary key). Column c has a nonunique index. Column d is not indexed. How many rows will query SELECT d FROM t2 WHERE c = 'a' AND d = 'acorn' examine? The answer is: four. MySQL uses the nonunique index on column c to look up rows matching the condition c = 'a', and that matches four rows. And to match the other condition, d = 'acorn', MySQL examines each of those four rows. As a result, the query examines four rows but matches (and returns) only one row.
It’s not uncommon to discover that a query examines more rows than expected. The cause is usually the selectivity of the query or the indexes (or both), but sometimes it’s because the table has grown a lot larger than expected, so there are a lot more rows to examine. Chapter 3 examines this further (pun intended).
Rows examined only tells half the story. The other half is rows sent.
Rows sent
Rows sent is the number of rows returned to the client—the result set size. Rows sent is most meaningful in relation to rows examined.
- Rows sent = Rows examined: The ideal case is when rows sent and rows examined are equal and the value is relatively small, especially as a percentage of total rows, and query response time is acceptable. For example, 1,000 rows from a table with one million rows is a reasonable 0.1%. This is ideal if response time is acceptable. But 1,000 rows from a table with only 10,000 rows is a questionable 10% even if response time is acceptable. Regardless of the percentage, if rows sent and rows examined are equal and the value is suspiciously high, it strongly indicates that the query is causing a table scan, which is usually terrible for performance—“Table scan” explains why.
- Rows sent < Rows examined: Fewer rows sent than examined is a reliable sign of poor query or index selectivity. If the difference is extreme, it likely explains slow response time. For example, 1,000 rows sent and 100,000 rows examined aren’t large values, but they mean 99% of rows did not match—the query caused MySQL to waste a lot of time. Even if response time is acceptable, an index could dramatically reduce the wasted time.
- Rows sent > Rows examined: It’s possible, but rare, to send more rows than were examined. This happens under special conditions, like when MySQL can “optimize away” the query. For example, SELECT COUNT(id) FROM t2 on the table in the previous section sends one row for the value of COUNT(id) but examines zero rows.
Rows sent is rarely a problem by itself. Modern networks are fast and the MySQL protocol is efficient. If your distribution and version of MySQL have the bytes sent metric in the slow query log (the Performance Schema does not provide this query metric), you can use it in two ways. First, the minimum, maximum, and average values reveal the result set size in bytes. This is usually small, but it can be large if the query returns BLOB or JSON columns. Second, total bytes sent can be converted to a network throughput (Mbps or Gbps) to reveal the network utilization of the query, which is also usually very small.
Rows affected
Rows affected is the number of rows inserted, updated, or deleted. Engineers are very careful to affect only the correct rows. It’s a serious bug when the wrong rows are changed. Viewed this way, the value of rows affected is always correct. But a surprisingly large value could indicate a new or modified query that affects more rows than intended.
Another way to view rows affected is as the batch size of bulk operations. Bulk INSERT, UPDATE, and DELETE are a common source of several problems: replication lag, history list length, lock time, and overall performance degradation. Equally common is the question, “How large should the batch size be?” There’s no universally correct answer. Instead, you must determine the batch size and rate that MySQL and the application can sustain without impacting query response time. I explain in “Batch Size”, which focuses on DELETE but is also applicable to INSERT and UPDATE.
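As a sketch of the idea (the table, condition, and batch size here are hypothetical), a bulk delete can be split into statements that each affect a bounded number of rows, repeated until no rows are affected:

DELETE FROM log_entries           -- hypothetical table
WHERE created_at < '2023-01-01'   -- hypothetical condition
LIMIT 1000;                       -- batch size: at most 1,000 rows affected per execution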
Select scan
Select scan is the number of full table scans on the first table accessed. (If the query accesses two or more tables, the next metric applies: select full join.) This is usually bad for performance because it means the query isn’t using an index. After Chapter 2, which teaches indexes and indexing, it should be easy to add an index to fix a table scan. If select scan is not zero, query optimization is strongly advised.
It’s possible, but very rare, that a query causes a table scan sometimes but not always. To determine why, you need a query sample and EXPLAIN plan for both: a query sample that causes a table scan, and a query sample that does not. One likely reason is how many rows MySQL estimates the query will examine relative to index cardinality (the number of unique values in the index), total rows in the table, and other costs. (The MySQL query optimizer uses a cost model.) Estimates aren’t perfect and sometimes MySQL is wrong, resulting in a table scan or suboptimal execution plan, but again: this is very rare.
More than likely, select scan is either all zero or all one (it’s a binary value). Be happy if it’s zero. Optimize the query if it’s not zero.
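For example, recall table t1 from “Rows examined”: column c is not indexed, so SELECT c FROM t1 WHERE c = 'b' causes a table scan. A minimal sketch of the fix (the index name is arbitrary):

ALTER TABLE t1 ADD INDEX idx_c (`c`);
-- With the index, MySQL looks up matching rows instead of scanning the whole table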
Select full join
Select full join is the number of full table scans on tables joined. This is similar to select scan but worse—I explain why in a moment. Select full join should always be zero; if not, query optimization is practically required.
When you EXPLAIN a query with multiple tables, MySQL prints the table join order from top (first table) to bottom (last table). Select scan applies only to the first table. Select full join applies only to the second and subsequent tables. Table join order is determined by MySQL, not the query.2 Example 1-4 shows the EXPLAIN plan for SELECT…FROM t1, t2, t3: MySQL determines a different join order than the implicit three-table join in the query.

Example 1-4. EXPLAIN plan for three tables joined
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t3
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 3
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 2
     filtered: 100.00
        Extra: Using where
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 7
     filtered: 100.00
        Extra: NULL
MySQL reads table t3 first, then joins table t1, then joins table t2. That join order is different than the query (FROM t1, t2, t3), which is why you must EXPLAIN a query to see its join order.
Tip
Always EXPLAIN a query to see its join order.
Select scan applies to table t3 because it’s the first table in the join order and it causes a table scan (indicated by type: ALL). Select full join would apply to table t1 if it caused a table scan, but it doesn’t: MySQL joins the table using a range scan on the primary key (indicated by type: range and key: PRIMARY, respectively). Select full join applies to table t2 because MySQL joins it using a full table scan (indicated by type: ALL). The table scan on t2 is called a full join because MySQL scans the full table on join.

Select full join is worse than select scan because the number of full joins that occur on a table during query execution is equal to the product of rows from the preceding tables. MySQL estimates three rows from table t3 (indicated by rows: 3) and two rows from table t1 (indicated by rows: 2). Therefore, 3 × 2 = 6 full joins on table t2 during query execution. But the select full join metric value will be 1 because it counts full joins in the execution plan, not during query execution, which is sufficient because even one full join is too many.
Note
As of MySQL 8.0.18, the hash join optimization improves performance for certain joins, but avoiding full joins remains the best practice. See “Table Join Algorithms” for a brief overview of hash join.
Created tmp disk tables
Created tmp disk tables is the number of temporary tables created on disk. It’s normal for queries to create temporary tables in memory; but when an in-memory temporary table becomes too large, MySQL writes it to disk. That can affect response time because disk access is orders of magnitude slower than memory access.
However, temporary tables on disk are not a common problem because MySQL tries to avoid them. Excessive “tmp disk tables” indicates a query that can be optimized, or (perhaps) that the system variable tmp_table_size is too small. Always optimize queries first. Change system variables as a last resort—especially ones that affect memory allocation. See “Internal Temporary Table Use in MySQL” in the MySQL manual for more information.
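To gauge how often temporary tables spill to disk server-wide, you can check the global status counters; note that these are server-wide totals, not per-query metrics:

SHOW GLOBAL STATUS LIKE 'Created_tmp%';
-- Created_tmp_tables:      internal temporary tables created (in memory or on disk)
-- Created_tmp_disk_tables: internal temporary tables that were written to disk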
Query count
Query count is the number of query executions. The value is arbitrary unless extremely low and the query is slow. “Low and slow” is an odd combination worth investigating.
As I write this, I’m looking at a query profile that’s a perfect example: the slowest query executed once but took 44% of execution time. Other metrics are:
- Response time: 16 s
- Lock time: 110 μs
- Rows examined: 132,000
- Rows sent: 13
Not your everyday query. It looks like an engineer manually executed the query, but I can tell from the digest text that it was programmatically generated. What’s the story behind this query? To find out, I’ll have to ask the application developers.
Metadata and the Application
There’s more to query analysis than query metrics: metadata. In fact, you can’t complete a query analysis without at least two pieces of metadata: the EXPLAIN plan (also called the query execution plan) and the table structure for each table. A few query metric tools automatically collect metadata and show it in the query report. If your query metric tool does not, don’t worry: it’s easy to collect metadata. EXPLAIN and SHOW CREATE TABLE report the EXPLAIN plan and table structures, respectively.
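For example, using the query and table from “Rows examined” (substitute your own slow query):

EXPLAIN SELECT d FROM t2 WHERE c = 'a' AND d = 'acorn';  -- EXPLAIN plan
SHOW CREATE TABLE t2;                                    -- table structure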
Metadata is indispensable for query analysis, query optimization, and MySQL performance in general. EXPLAIN is a vital tool in your MySQL repertoire. I explain it in “EXPLAIN: Query Execution Plan” and use it extensively throughout this book.
There’s even more to query analysis than query metrics and metadata: the application. Metrics and metadata are must-haves for any query analysis, but the story is only complete when you know what purpose the query serves: why does the application execute the query? Knowing this allows you to evaluate changes to the application, which is the focus of Chapter 4. More than once, I’ve seen engineers realize that a query can be a lot simpler—or completely removed.
Query metrics, metadata, and the application should complete the story. But I’d be remiss not to mention that, sometimes, issues outside MySQL and the application influence the story—and usually not for the better. “Noisy Neighbors” is a classic case. If response time is slow but a thorough query analysis doesn’t reveal why, then consider outside issues. But don’t be too quick to jump to this conclusion; outside issues should be the exception, never the norm.
Relative Values
For each query metric, the only objectively good value is zero because, as the saying goes, the fastest way to do something is to not do it. Nonzero values are always relative to the query and application. For example, one thousand rows sent is fine in general, but it could be terrible if the query is supposed to return only one row. Relative values are meaningful when considering the full story: metrics, metadata, and the application.
Here’s another true story to illustrate that values are relative and meaningful with the full story. I inherited an application that became slower and slower over the years. It was an internal application—not used by customers—so fixing it wasn’t a priority until it became unbearably slow. In the query profile, the slowest query was examining and returning over ten thousand rows—not a full table scan, just a lot of rows. Instead of fixating on the values, I went spelunking in the source code and discovered that the function executing the query was only counting the number of rows, not using the rows. It was slow because it needlessly accessed and returned thousands of rows, and it became slower over time because the number of rows increased as the database grew. With the full story, the optimization was glaringly obvious and simple: SELECT COUNT(*).
Average, Percentile, and Maximum
It’s standard to talk about query response time as if it’s a single value, but it’s not. From “Aggregation” you know that query metrics are grouped and aggregated by query. As a result, query metrics are reported as single, statistical values: minimum, maximum, average, and percentile. You are undoubtedly familiar with these ubiquitous “stats,” but with respect to query response time, the following points may surprise you:
- Average is overly optimistic
- Percentile is an assumption
- Maximum is the best representation
Let me explain:
- Average: Don’t be fooled by the average: if query count is small, a few very large or small values can skew the average response time (or any metric). Moreover, without knowing the distribution of values, we cannot know what percentage of values the average represents. For example, if the average equals the median, then the average represents the bottom 50% of values, which are the better (faster) response times. In that case, the average is overly optimistic. (Most values are overly optimistic if you ignore the worst half.) The average only tells you, at a glance, if the query typically executes in a matter of microseconds, milliseconds, or seconds. Don’t read more into it than that.

- Percentile: Percentiles solve the problem of averages. Without going into a full explanation of percentiles, P95 is the value that 95% of samples are less than or equal to.3 For example, if P95 equals 100 ms, then 95% of values are less than or equal to 100 ms, and 5% of values are greater than 100 ms. Consequently, P95 represents 95% of values, which is objectively more representative—and less optimistic—than the average. There’s another reason percentiles are used: the small percentage of values ignored are considered outliers. For example, network jitter and flukes can cause a small percentage of query executions to take longer than normal. Since that’s no fault of MySQL, we ignore those execution times as outliers.

  Percentiles are standard practice, but they’re also an assumption. Yes, there can be outliers, but they should be proven, not assumed. Until the top N% are proven not to be outliers, they are the most interesting values precisely because they’re not normal. What’s causing them? That’s difficult to answer, which is why percentiles are standard practice: it’s easier to ignore the top N% of values than to dig deep and find the answer.

  The best percentile is P999 (99.9%) because discarding 0.1% of values is an acceptable tradeoff between assuming that they’re outliers and the reality that outliers do exist.4

- Maximum: Maximum query time solves the problem of percentiles: don’t discard any values. The maximum value is not a myth or statistical apparition like the average. Somewhere in the world, some application user experienced the maximum query response time—or gave up after a few seconds and left. You should want to know why, and you can find the answer. Whereas explaining the top N% of values is difficult because there are many values and, thus, many potentially different answers, explaining the maximum is a single value and answer. Query metric tools often use the query with the maximum response time as the sample, which makes explaining it almost trivial because you have the proverbial smoking gun. With that sample, one of two things will happen: either it reproduces the problem, in which case you continue with the analysis; or, it does not reproduce the problem, in which case you have proven that it’s an outlier that can be ignored.

  Here’s another true story of the former case. An otherwise good application would randomly respond very slowly. Minimum, average, and P99 query time were all milliseconds, but maximum query time was seconds. Instead of ignoring the maximum, I collected query samples of normal and maximum execution time. The difference was the size of the IN list in the WHERE clause: hundreds of values for normal query time, and several thousand values for maximum query time. Fetching more values takes longer to execute, but milliseconds to seconds is not normal even for thousands of values. EXPLAIN provided the answer: normal query time used an index, but maximum query time caused a full table scan. MySQL can switch query execution plans (see “It’s a Trap! (When MySQL Chooses Another Index)”), which explains MySQL, but what explains the application? Long story short, the query was used to look up data for fraud detection, and occasionally a big case would look up several thousand rows at once, which caused MySQL to switch query execution plans. Normally, the query was perfectly fine, but digging into the maximum response time revealed not only a MySQL gotcha but also an opportunity to improve the application and user experience by handling large lookups more efficiently.
Average, percentile, and maximum are useful, just be aware of what they do and do not represent.
Also consider the distribution of values between the minimum and maximum. If you’re lucky, the query report includes histograms, but don’t count on it: calculating histograms for an arbitrary time range is difficult, so almost no query metric tool does it. The basic statistics (minimum, maximum, average, and percentile) indicate enough of the distribution to determine if the query is stable: metrics are roughly the same for every execution. (In Chapter 6, I return to the idea of stability. See “Normal and Stable: The Best Database Is a Boring Database”.) Unstable queries complicate the analysis: what causes the query to execute differently? The cause is likely outside MySQL, which makes it more difficult to find, but it’s necessary to find because stable queries are easier to analyze, understand, and optimize.
Improving Query Response Time
Improving query response time is a journey called query optimization. I call it a journey to set the proper expectations. Query optimization takes time and effort, and there is a destination: faster query response time. To make the journey efficient—not a waste of time and effort—there are two parts: direct query optimization and indirect query optimization.
Direct Query Optimization
Direct query optimization is changes to queries and indexes. These changes solve a lot of performance problems, which is why the journey begins with direct query optimization. And because these changes are so powerful, the journey often ends here, too.
Let me use an analogy that’s a little simplistic now but will be more insightful later. Think of a query as a car. Mechanics have tools to fix a car when it’s not running well. Some tools are common (like a wrench), and others are specialized (like a dual overhead cam lock). Once a mechanic pops the hood and finds the problem, they know which tools are needed to fix it. Likewise, engineers have tools to fix a query when it’s running slowly. The common tools are query analysis, EXPLAIN, and indexes. The specialized tools are query-specific optimizations. To name only a few from “Optimizing SELECT Statements” in the MySQL manual:
- Range Optimization
- Index Merge Optimization
- Hash Join Optimization
- Index Condition Pushdown Optimization
- Multi-Range Read Optimization
- Constant-Folding Optimization
- IS NULL Optimization
- ORDER BY Optimization
- GROUP BY Optimization
- DISTINCT Optimization
- LIMIT Query Optimization
In this book, I do not explain query-specific optimizations because Chapter 8, “Optimization” in the MySQL manual already explains them in detail, and it’s authoritative and regularly updated. Plus, query-specific optimizations vary by MySQL version and distribution. Instead, I teach indexes and indexing in Chapter 2: the foundation for knowing which query-specific optimizations to use—and how—when fixing a slow query. After Chapter 2, you will wield specialized tools like the “Index Condition Pushdown Optimization” like a master mechanic wields a dual overhead cam lock.
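As a concrete taste of the common tools at work, here is a minimal sketch with a hypothetical table and query: EXPLAIN reveals the problem, and an index is the fix.

-- Before the index, EXPLAIN reports type: ALL (a full table scan)
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Direct query optimization: index the column in the WHERE clause
ALTER TABLE orders ADD INDEX idx_customer_id (customer_id);

-- After the index, EXPLAIN reports type: ref, key: idx_customer_id
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;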
Every so often I talk with an engineer who is surprised and a little unhappy when the query optimizations they so assiduously applied do not solve the problem. Direct query optimization is necessary but not always sufficient. An optimized query can be or become a problem under different circumstances. When you can’t optimize a query any further (or you can’t optimize it at all because you don’t have access to the source code), you can optimize around the query, which leads to part two of the journey: indirect query optimization.
Indirect Query Optimization
Indirect query optimization is changes to data and access patterns. Instead of changing a query, you change what the query accesses and how: its data and access patterns, respectively. These changes indirectly optimize the query because query, data, and access patterns are inextricable with respect to performance. Changes to one influence the others. It’s easy to prove.
Suppose you have a slow query. Data size and access patterns don’t matter for this proof, so imagine whatever you like. I can reduce query response time to near-zero. (Let’s say near-zero is 1 microsecond. For a computer that’s a long time, but for a human it’s imperceptible.) The indirect “optimization” is: TRUNCATE TABLE. With no data, MySQL can execute any query in near-zero time. That’s cheating, but it nonetheless proves the point: reducing data size improves query response time.
Let’s revisit the car analogy. Indirect query optimization is analogous to changing major design elements of the car. For example, weight is a factor in fuel efficiency: decreasing weight increases fuel efficiency. (Data is analogous to weight, which is why TRUNCATE TABLE dramatically increases performance—but don’t use this “optimization.”) Reducing weight is not a straightforward (direct) change because engineers can’t magically make parts weigh less. Instead, they have to make significant changes, such as switching from steel to aluminum, which can affect many other design elements. Consequently, these changes require a greater level of effort.
A greater level of effort is why indirect query optimization is part two of the journey. If direct query optimization solves the problem, then stop—be efficient. (And congratulations.) If it doesn’t and you’re certain the query cannot be further optimized, then it’s time to change data and access patterns, which Chapters 3 and 4 cover.
When to Optimize Queries
When you fix a slow query, another one takes its place. There will always be slow queries, but you should not always optimize them because it’s not an efficient use of your time. Instead, recall “North Star” and ask: is query response time acceptable? If not, then please continue optimizing queries. If yes, then you’re done for now because when the database is fast, nobody looks or asks questions.
As a DBA, I would like you to review query metrics (starting with the “Query profile”) every week and optimize the slowest queries if needed, but as a software engineer I know that’s not practical and almost never happens. Instead, here are three occasions when you should optimize queries.
Performance Affects Customers
When performance affects customers, it is the duty of engineers to optimize queries. I don’t think any engineer would disagree; rather, engineers are eager to improve performance. Some might say this is bad advice because it’s reactive, not proactive, but my overwhelming experience is that engineers (and even DBAs) don’t look at query metrics until customers report that the application is too slow or timing out. As long as query metrics are always on and at the ready, this is an objectively good time to optimize queries because the need for better performance is as real as your customers.
Before and After Code Changes
Most engineers don’t argue against prioritizing query optimization before and after code changes, but my experience is that they don’t do it, either. I implore you to avoid this common pattern: seemingly innocent changes are made to code, vetted in staging, deployed to production, then performance starts to “swirl the bowl” (a colorful metaphor related to toilets that means “become worse”). What happened? The cause is usually changes to queries and access patterns, which are closely related. Chapter 2 begins to explain why; Chapters 3 and 4 complete the explanation. For now, the point is: you will be a hero if you review query metrics before and after code changes.
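One low-effort way to do that review, assuming the Performance Schema is enabled and you can tolerate resetting its counters (this affects anything else reading these tables): reset the digest metrics just before the deploy, then compare the query profile afterward.

-- Just before the deploy: reset digest metrics so "after" numbers are clean
TRUNCATE TABLE performance_schema.events_statements_summary_by_digest;

-- After the deploy has soaked: the top queries by total execution time
SELECT DIGEST_TEXT,
       COUNT_STAR            AS executions,
       SUM_TIMER_WAIT / 1e12 AS total_s
  FROM performance_schema.events_statements_summary_by_digest
 ORDER BY SUM_TIMER_WAIT DESC
 LIMIT 10;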
Once a Month
Even if your code and queries do not change, at least two things around them are changing: data and access patterns. I hope your application is wildly successful and stores ever more data as the number of users climbs “up and to the right.” Query response time changes over time as data and access patterns change. Fortunately, these changes are relatively slow, usually on the order of weeks or months. Even for an application experiencing hyper-growth (for example, adding thousands of new users every day to millions of existing users), MySQL is really good at scaling up so that query response time remains stable—but nothing lasts forever (even the stars die). There is always a point at which good queries go bad. This reality becomes clear after Chapters 3 and 4. For now, the point is: you will rise from hero to legend—possibly with song and story written about you—if you review query metrics once a month.
MySQL: Go Faster
There is no magic or secret to make MySQL significantly faster without changing queries or the application. Here’s another true story to illustrate what I mean.
A team of developers learned that their application was going to be mentioned by a celebrity. They expected a flood of traffic, so they planned ahead to ensure that MySQL and the application would survive. An engineer on the team asked me to help increase MySQL throughput (QPS). I asked, “By how much?” She said, “By 100x”. I said, “Sure. Do you have a year and a willingness to rearchitect the application?” She said, “No, we have one day.”
I understand what the engineer was thinking: how much throughput could MySQL handle if we significantly upgraded the hardware—more CPU cores, more memory, more IOPS? There’s no simple or single answer because it depends on many factors that this book explores in the coming chapters. But one thing is certain: time is a hard limit.
There are 1,000 milliseconds in 1 second—no more, no less. If a query takes 100 milliseconds to execute, then its worst-case throughput is 10 QPS per CPU core: 1,000 ms / 100 ms/query = 10 QPS. (Its real throughput is likely higher—more on this in a moment.) If nothing changes, then there’s simply no more time to execute the query with greater throughput.
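The arithmetic is trivial, and MySQL will even do it for you (a toy calculation, not a benchmark):

SELECT 1000 / 100 AS worst_case_qps_per_core;  -- 100 ms query: 10 QPS
SELECT 1000 / 10  AS worst_case_qps_per_core;  -- 10 ms query: 100 QPS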
To make MySQL do more work in the same amount of time, you have three options:
- Change the nature of time
- Decrease response time
- Increase load
Option one is beyond the scope of this book, so let’s focus on options two and three.
Decreasing response time frees time that MySQL can use to do more work. It’s simple math: if MySQL is busy 999 milliseconds out of every second, then it has one free millisecond to do more work. If that’s not enough free time, then you must decrease the time that the current work is consuming. The best way to accomplish that: direct query optimization. Failing that: indirect query optimization. And finally: better, faster hardware. The following chapters teach you how.
Increasing load—the number of queries executing concurrently—tends to happen first because it doesn’t require any query or application changes: simply execute more queries at once (concurrently), and MySQL responds by using more CPU cores. This happens because one CPU core executes one thread, which executes one query. Worst case, MySQL uses N CPU cores to execute N queries concurrently. But the worst case is practically nonexistent because response time is not CPU time. A nonzero amount of response time is CPU time, and the rest is off-CPU. For example, response time might be 10 ms of CPU time and 90 ms of disk I/O wait. Therefore, the worst-case throughput for a query that takes 100 milliseconds to execute is 10 QPS per CPU core, but its real throughput should be higher since the worst case is practically nonexistent. Sounds great, right? Just push MySQL harder and voilà: more performance. But you know how the story ends: push MySQL too hard and it stops working because every system has finite capacity. MySQL can easily push most modern hardware to its limits, but don’t try it until you’ve read “Performance Destabilizes at the Limit”.
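You can watch load in this sense with a stock MySQL status variable; a sustained value well above the number of CPU cores usually means queries are queuing, not running:

-- Threads_running approximates queries executing concurrently right now
SHOW GLOBAL STATUS LIKE 'Threads_running';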
Bottom line: MySQL cannot simply go faster. To make MySQL go faster, you must embark on the journey of direct and indirect query optimization.
Summary
This chapter expounded query response time so that, in subsequent chapters, you can learn how to improve it. The central takeaway points are:
- Performance is query response time: how long it takes MySQL to execute a query.
- Query response time is the North Star of MySQL performance because it is meaningful and actionable.
- Query metrics originate from the slow query log or the Performance Schema.
- The Performance Schema is the best source of query metrics.
- Query metrics are grouped and aggregated by digest: normalized SQL statements.
- A query profile shows slow queries; slow is relative to the sort metric.
- A query report shows all available information for one query; it’s used for query analysis.
- The goal of query analysis is understanding query execution, not solving slow response time.
- Query analysis uses query metrics (as reported), metadata (EXPLAIN plans, table structures, and so on), and knowledge of the application.
- Nine query metrics are essential to every query analysis: query time, lock time, rows examined, rows sent, rows affected, select scan, select full join, created tmp disk tables, and query count.
- Improving query response time (query optimization) is a two-part journey: direct query optimization, then indirect query optimization.
- Direct query optimization is changes to queries and indexes.
- Indirect query optimization is changes to data and access patterns.
- At the very least, review the query profile and optimize slow queries when performance affects customers, before and after code changes, and once a month.
- To make MySQL go faster, you must decrease response time (free time to do more work) or increase load (push MySQL to work harder).
The next chapter teaches MySQL indexes and indexing—direct query optimization.
Practice: Identify Slow Queries
The goal of this practice is to identify slow queries using pt-query-digest: a command-line tool that generates a query profile and query reports from a slow query log.
Warning
Use a development or staging MySQL instance—do not use production unless you are confident that it will not cause problems. The slow query log is inherently safe, but enabling it on a busy server can increase disk I/O.
If you have DBAs who manage MySQL, ask them to enable and configure the slow query log. Or, you can learn how by reading “The Slow Query Log” in the MySQL manual. (You need a MySQL user account with SUPER privileges to configure MySQL.) If you’re using MySQL in the cloud, read the cloud provider documentation to learn how to enable and access the slow query log.
MySQL configurations vary, but the simplest way to configure and enable the slow query log is:
SET GLOBAL long_query_time=0;

SET GLOBAL slow_query_log=ON;

SELECT @@GLOBAL.slow_query_log_file;
+-------------------------------+
| @@GLOBAL.slow_query_log_file  |
+-------------------------------+
| /usr/local/var/mysql/slow.log |
+-------------------------------+
Zero in the first statement, SET GLOBAL long_query_time=0;, causes MySQL to log every query. Be careful: on a busy server, this can increase disk I/O and use gigabytes of disk space. If needed, use a slightly larger value like 0.0001 (100 microseconds) or 0.001 (1 millisecond).
Note
Percona Server and MariaDB Server support slow query log sampling: set system variable log_slow_rate_limit to log every Nth query. For example, log_slow_rate_limit = 100 logs every 100th query, which equals 1% of all queries. Over time, this creates a representative sample when combined with long_query_time = 0. When using this feature, be sure that the query metric tool accounts for sampling, else it will underreport values. pt-query-digest accounts for sampling.
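For example, on Percona Server (MariaDB uses the same variable name; details vary by version, so check your distribution’s manual):

SET GLOBAL long_query_time = 0;        -- consider every query
SET GLOBAL log_slow_rate_limit = 100;  -- log 1 of every 100 queries
-- Percona Server also has log_slow_rate_type; set it to 'query' to
-- sample per query rather than per session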
The last statement, SELECT @@GLOBAL.slow_query_log_file;, outputs the slow query log filename that you need as the first command line argument to pt-query-digest. You can dynamically change this variable if you want to log to a different file.
Second, run pt-query-digest with the slow query log filename as the first command line argument.
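For example, using the log file path from the earlier output (your path will differ):

pt-query-digest /usr/local/var/mysql/slow.log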
The tool prints a lot of output, but for now look at the Profile near the top of the output:
# Profile
# Rank Query ID                            Response time    Calls
# ==== =================================== ===============  =====
#    1 0x95FD3A847023D37C95AADD230F4EB56A  1000.0000 53.8%    452 SELECT tbl
#    2 0xBB15BFCE4C9727175081E1858C60FD0B   500.0000 26.9%     10 SELECT foo bar
#    3 0x66112E536C54CE7170E215C4BFED008C    50.0000  2.7%      5 INSERT tbl
# MISC 0xMISC                               310.0000 16.7%    220 <2 ITEMS>
The preceding output is a text-based table listing the slowest queries from the slow query log. In this example, SELECT tbl (a query abstract) is the slowest query, accounting for 53.8% of total execution time. (By default, pt-query-digest sorts queries by percentage execution time.)
Below the query profile, a query report is printed for each query.
Explore the pt-query-digest output. Its manual documents the output, and there is a trove of information on the internet because the tool is widely used. Also check out Percona Monitoring and Management: a comprehensive database monitoring solution that uses Grafana to report query metrics. Both tools are free, open source, and supported by Percona.
By reviewing slow queries, you know exactly which queries to optimize for the most efficient performance gains. More importantly, you’ve begun to practice MySQL performance like an expert: with a focus on queries, because performance is query response time.
1 Latency is delay inherent in the system. Query response time is not a delay inherent in MySQL; it comprises various latencies: network, storage, and so on.
2 Unless STRAIGHT_JOIN is used—but don’t use this. Let the MySQL query optimizer choose the join order for the best query execution plan. It’s almost always right, so trust it unless you can prove it wrong.
3 For a full explanation of percentiles, see HackMySQL.
4 P95, P99, and P999 are conventional. I’ve never seen other percentiles used with MySQL—median (P50) and maximum (P100) notwithstanding.