Chapter 15. MySQL Query Optimization

We noted in a previous chapter the importance of MySQL query optimization in relation to MySQL configuration tuning. While tuning is important, it often has nowhere near the impact of actually fixing a poorly performing query. Fixing queries is also a lot more fun. As in the other MySQL chapters, we need to note that this subject is massive and we cannot possibly cover it fully here. It is highly recommended that you get a book on this subject, as it is a very deep one and is well worth learning for any Drupal developer. As a web developer using a CMS, you are only slightly removed from the SQL layer. Not deeply knowing how to use this layer and how to optimize it is very limiting. To get you started, this chapter will cover some very basic optimization, index usage, and join optimization techniques.

Index Basics

Indexes are very important for database performance. But even though they have this level of importance, they are not completely understood by many developers. This often leads to problems that could easily have been avoided. The biggest issue is a belief that indexes are kind of magical and that the MySQL optimizer should be able to run a query quickly if an index even touches the columns in question. Sadly, indexes are not magic.

It is best to think of an index as a tree, largely because they are trees in most DB systems (B+Trees, specifically; there is a link to an excellent article on B+Trees and how InnoDB uses them at the end of Chapter 21). Thus, if you have an example index test that covers (columnA, columnB), you literally have a tree of columnA values, with columnB values in the leaves. If you have a query that has a WHERE condition on these two columns, MySQL will go through this tree looking for the correct columnA value first, and then go into the leaves of that object and find the correct columnB value.

Due to this tree organization, there are some limitations. For example, if you have a query with a WHERE condition on columnB and not columnA, it can’t use the index on (columnA, columnB). Thinking about this, it makes a lot of sense. How would you get to the columnB values in this index? You must have a way to traverse the columnA branches to get to the columnB values. This leads to some interesting decisions. For example, if you have five queries that run against a table and three of them have WHERE conditions on columnB alone, that column should come first in the index so that it can be used.

Note

Range queries are a special case. A range query is one that is going to return a range of values in an index. A good example is a datetime query, such as SELECT * FROM table WHERE date > "SOMEDATE";. Sometimes you may want to put a particular column at the beginning of an index (henceforth called the prefix of the index), but if that column’s condition is a range, you have to reconsider. MySQL cannot use any column of an index after a range query. So, if your index is (columnA, columnB) and your query is SELECT * FROM table WHERE columnA > 5 AND columnB = 2; you will not be using the index to satisfy the second condition (columnB = 2). In this situation, you must have columnB in the prefix of the index for it to be used.

Base Tables and Join Order

Now that you have a basic understanding of how indexes work, you need an equally basic understanding of joins. In keeping with our nature theme, we are going to visualize indexes as trees and tables as pools of water—specifically, you can think of a table as a large pool of water at the top of a hill, with smaller pools of water under it as you progress down the hill. Each pool has a little waterfall that flows from it into the pool below. The largest pool at the top of the hill is the base table for the join. The goal is to limit the size of the waterfalls and ensure that the resulting pool doesn’t need water treatment (temporary tables used to service a GROUP BY, ORDER BY, or DISTINCT).

Let’s consider an example. Our biggest pool is the Drupal node table. We then join against a taxonomy table (pool) and filter on a taxonomy type. So, we have no real filter on the node table, and our only real filter (the one that defines our dataset) is on the taxonomy table. How do we limit the water flowing from the node pool into the taxonomy pool? We don’t. This is going to be a big waterfall. The node table being the base table and all the filtering/conditions being on subtables is a very common problem in Drupal. The issue here is conceptually somewhat simple: to fulfill this query, we need to join every node against every taxonomy term, whereas in reality, it would be far better to start with the taxonomy table and only join the filtered rows that match our condition. You always want to have your major conditions (the ones that limit your returned data the most) in the base table. Drupal, in particular Views, makes this very difficult sometimes.

Let’s consider another example. Again, we have node as our base pool, and we are going to join against taxonomy. This time our only filter is on node, so that’s good, but then we are going to GROUP BY a column in taxonomy. Unfortunately, we have a problem. We can’t execute this GROUP BY via an index when it’s not in the base pool, and thus the resulting pool after the join will need some water treatment (a temporary table sort). Again, this is a very common problem with Drupal Views queries. It all goes back to ensuring that you actually have the right table as the base for your query.

The question now is, how do you determine the base table for a join? For Drupal, the answer usually comes down to making sure your relationships in Drupal Views are set correctly. It is quite easy to have a Views query with all LEFT JOINs. A query like this forces MySQL to treat the node table as the base table, which is almost always a problem. Other than this issue, you often have to just run EXPLAIN on the query in question and figure out why the base table is what it is, then edit your View or Views handlers to swap the tables or set the conditions differently. We will go over some common examples of issues with Drupal Views next.

Common Issues

We are now going to cover some examples of common problems that arise with Drupal Views, indexing, and joins.

First, let’s discuss the debugging options available to you with Views. In the Views settings, there are a few very important options for tracking Views and Views performance:

“Show the SQL query” (live preview section)
This allows you to see the SQL query Views is building for execution. Obviously, this is quite important.
“Show performance statistics” (live preview section)
This allows you to easily track build and execution time.
“Show other queries run during render during live preview” (live preview section).
It is quite possible that other queries are being called during the processing of a View, besides the main view query. This is a common point of confusion and this debug setting will help immensely.
“Add Views signature to all SQL queries” (advanced tab)
This is arguably the most important option on this page, as it allows you to actually tell where a query is coming from. Back in “olden times” / “the good old days,” you could just search for a query in an application to find it. These days, with the advent of Views and other dynamic query builders, this is basically impossible. Having the Views signature built into the query is hugely valuable.

Now that you have Views set up to assist you, or at least to not insist on making your job impossible, it’s time to look at some common issues.

The ORDER BY on an Unrelated Table

This truly is a classic for Drupal Views. Because so much of the data in Drupal is heavily normalized (split into different tables), it’s quite common to have your dataset defined by tableA (i.e., all your WHERE conditions are on this table) and then ORDER BY your data on a column in a totally different table. This is a problem, as the ORDER BY will have to be serviced by a temp table or filesort. The tracker query that ships with Views is a great example of this:

SELECT node.type AS node_type, node.title AS node_title, node.nid AS nid,
users_node.name AS users_node_name, users_node.uid AS users_node_uid,
node_comment_statistics.comment_count AS node_comment_statistics_comment_count,
node_comment_statistics.last_comment_timestamp AS
node_comment_statistics_last_comment_timestamp, history.timestamp AS
history_timestamp, node.created AS node_created, node.changed AS node_changed,
'tracker:default' AS view_name
FROM
{node} node
LEFT JOIN {users} users_node ON node.uid = users_node.uid
INNER JOIN {node_comment_statistics} node_comment_statistics ON
node.nid = node_comment_statistics.nid
LEFT JOIN {history} history ON node.nid = history.nid AND history.uid = '1'
WHERE (( (node.status = '1') ))
ORDER BY node_comment_statistics_last_comment_timestamp DESC
LIMIT 25 OFFSET 0

This query has a single filter on node.status, and then ORDER BYs on node_comment_statistics. However, this query is better than most because at least node_comment_statistics is a required relationship in this View, which means it’s an INNER JOIN and not a LEFT JOIN. This means that if the optimizer decides that node.status as a filter is not very useful (i.e., all your nodes have the same status), it can “flip the join” and start with node_comment_statistics, making it possible to use an index for the ORDER BY. (You can only use an index for this when the table containing the column or columns in question is the base table.)

The Useless DISTINCT (“In Case of Accidents!”)

Views allows you to just “throw in” a DISTINCT in the Query Settings of the View. The ease with which you can add a DISTINCT to a query to ensure it doesn’t start returning duplicates is a big problem, much more so than you might think. Many times performance can be hugely improved by just removing these safety-net DISTINCTs. However, it takes a lot of work to differentiate between a useless DISTINCT and one that is actually needed. It’s better to think about this from the beginning.

Tip

It is a good idea to always EXPLAIN a query after you add a DISTINCT. It’s good to know if you should be thinking about structuring a query a different way or if the DISTINCT is fine when you are still in the development stage. Restructuring a query is far more difficult to do in the testing stage.

Starfish Syndrome (All LEFT JOINS)

When you create a query, Drupal Views has you define relationships between the data. When a relationship is not required, it’s translated into a LEFT JOIN in most cases. However, when it is required, it’s just an INNER JOIN. The problem with this is the amount of importance placed on a somewhat poorly worded checkbox. Quite often, this checkbox is left unchecked even when a relationship is required.

For example, if you have a user table and another table that holds email addresses for those users, there will always be at least one row per user, and this is a required relationship. However, it’s tempting to not check that box, because it doesn’t really impact functionality and it seems more complicated. The impact of this decision will be to greatly limit MySQL’s optimizer, as it will only ever be able to use the user table as the base table for the query.

We call this starfish syndrome, because if you diagram queries like this, you have the base table in the center and all the joined tables around it (like the legs of a starfish). The optimizer has no choice in JOINs. This is almost always bad for Views queries in particular, because often the default base table is node. Not only is this a huge table in most installations, but it often doesn’t define the returned data very well (it doesn’t contain columns that define custom data).

Note

Going back to our waterfall metaphor, you could think of this as a big pool in the middle with all the smaller pools flowing into it. The amount of water never gets smaller; the middle pool just gets bigger and bigger.

Node Access

Node access is a problem. Having fine-grained permissions on a per-node basis is a nice idea. However, it is terrible from a query performance perspective. It adds a set of filters that are totally separate from anything defining the dataset for your query and that use tables that have nothing to do with your dataset. While it is possible if you work really hard to make this work quickly and have good queries, in general you can either have node access or have good queries. It’s an OR, not an AND, question.

Note

There is work underway on refactoring node_access to work in a different way with better performance on new versions of MySQL. Hopefully this work will resolve this long-standing issue.

Get High Performance Drupal now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.