Chapter 14. Tools for Managing and Monitoring MySQL

The last few years have been tremendously exciting when it comes to third-party tools for MySQL. We have gone from a few little tools here and there to professionally maintained packages of tools, with many of the utilities in these packages solving issues that have been problematic for MySQL DBAs for a very long time. We will be covering a few of our favorite packages in this chapter. We will not be covering any commercial (for-pay) utilities. There are many great ones, of course, but we will limit this chapter to either open source or “free as in beer” utilities.

Percona Toolkit

The Percona Toolkit actually used to be called Maatkit, before its adventures in rebranding. It was a must-have toolkit before the rename and has only gotten more so since. It contains over 30 tools, varying from hugely complicated and impactful utilities to glorified output reformatters. Some of the most important tools include:

pt-query-digest
There once was a tool called mysqlsla, which basically everyone used to read the MySQL slow query log and output reports showing the top slow queries in the log. This was very useful, but mysqlsla didn’t really progress much. pt-query-digest is a replacement that allows for running EXPLAINs automatically, monitoring process lists, and even generating ASCII graphs. Is it an overengineered tool for its purpose? Probably. Is it awesome enough that we don’t care about that? Absolutely.
pt-table-checksum
We recommend this tool for every replication cluster that uses statement-based or mixed binary logging (It is somewhat less needed for, and more importantly not supported with, row-based replication). pt-table-checksum allows you to run a replicated checksum across your cluster to ensure that the two (or more) servers are actually consistent. MySQL itself doesn’t have checksums or validation built into the replication engine, and this tool fills a critical gap. (Note that some MySQL forks, such as MariaDB, are now integrating checksums into the binary log.)
pt-online-schema-change
This tool is very dangerous to use, but quite cool to think about using. This script automates the somewhat common practice of ALTERing a large table by creating a new table with a new table definition and then filling that table from the old table, while triggers are inserted to update the new table in real time while the fill is executed. At the end of all of this, you switch the two tables. It’s a complicated procedure that is fully automated by pt-online-schema-change.
pt-index-usage
This tool can take a slow log (or general log), EXPLAIN the queries, and generate reports on index usage. It is a useful companion to pt-query-digest, more specific to index usage.

There are many more tools in this package, and we recommend taking a look at all of them. They generally make your life far easier when managing a DB server and examining query performance.

Openark Kit

The openark kit gets less attention than Percona Toolkit, but it has some equally useful tools. While Percona Toolkit tends to be slightly more focused on performance optimization and replication, openark is more focused on operational tools. These are a few of the tools we have found most useful:

oak-chunk-update
This tool takes a large UPDATE/DELETE statement and automatically splits it into chunks to prevent it from locking a large segment of an often-used table.
oak-kill-slow-queries
In a perfect world, this script wouldn’t need to exist. However, perfection is sadly lacking in most database clusters. For those times when perfection is escaping you, we have this script. It connects to your DB and kills queries that take longer than a configurable time period. This is useful when you have a site with a known “bad” query that hasn’t been fixed yet and could possibly harm your database server’s performance if left unchecked (for example, a query that only causes problem when run many times concurrently).
oak-purge-master-logs
This script can connect to your entire replication chain and then purge the master logs on a server (leaving a configurable number of logs). What is useful about this is that it will check every slave server and won’t purge a log that a slave still needs. Thus, it is a safe way to purge master logs.
oak-security-audit
This tool runs a quick audit of some standard security settings, accounts, and privileges.
oak-show-limits
This script is really only useful for large installations, but it’s very useful for those—it shows the “space” you have left in AUTO_INCREMENT columns. This doesn’t sound critical, right up until it’s incredibly critical.

mysqlreport

mysqlreport has been the standard in “general MySQL statistics” for as long as we can remember. It should be run on every MySQL server at least once. This utility will print out current usage information for the query cache, key buffer, and InnoDB buffer pool, as well as detailed InnoDB statistics and much more. If you only install one of these utilities, this should be the one. A guide for understanding its output is available here. We mentioned a few variables that are important to tune correctly for a Drupal/InnoDB MySQL instance in the previous chapter—this tool is how you would track those variables and validate your configuration. Let’s look at some example output, with comments describing each section.

The first section of mysqlreport’s output describes the MyISAM key buffer cache: for most deployments, you are mainly interested in whether it’s 100% used or not. If it is 100% used, it should likely be increased in size (note that even with an InnoDB server, the key buffer will be used for temporary and internal tables):

__ Key _________________________________________________________________
Buffer used    14.00k of  30.00M  %Used:   0.05
  Current       5.50M            %Usage:  18.33
Write hit       0.00%
Read hit       70.21%

The next section describes the breakdown by query type: in most cases, you will have far more SELECTs than the volatile UPDATE/INSERT/DELETE/REPLACE options, but if that’s ever not true, this will allow you to discover that and track it. If you suddenly see a huge spike in volatile statements, you may need to investigate why that is (a cache table being refreshed repeatedly, a spam network registering new users, etc.) and take action:

__ Questions ___________________________________________________________
Total           1.81G    2.3k/s
  QC Hits     920.26M    1.2k/s  %Total:  50.78
  DMS         831.76M    1.0k/s           45.90
  Com_         39.21M    49.5/s            2.16
  COM_QUIT     21.09M    26.6/s            1.16
  -Unknown     91.11k     0.1/s            0.01
Slow 2 s       68.31k     0.1/s            0.00  %DMS:   0.01  Log:  ON
DMS           831.76M    1.0k/s           45.90
  SELECT      809.20M    1.0k/s           44.65         97.29
  INSERT       11.49M    14.5/s            0.63          1.38
  UPDATE        9.07M    11.4/s            0.50          1.09
  DELETE        1.24M     1.6/s            0.07          0.15
  REPLACE     769.68k     1.0/s            0.04          0.09
Com_           39.21M    49.5/s            2.16
  set_option   29.66M    37.4/s            1.64
  show_tables   9.49M    12.0/s            0.52
  begin        29.75k     0.0/s            0.00

The following section gives a breakdown of the types of sorts, scans, and joins you’re performing. This section is usually a lot more useful when you have a huge amount of query control. Drupal deployments (especially multisite deployments), are not always a great fit for making this section useful. However, you should always monitor this for large changes and trends:

__ SELECT and Sort _____________________________________________________
Scan           15.80M    19.9/s %SELECT:   1.95
Range         234.83M   296.3/s           29.02
Full join      12.67k     0.0/s            0.00
Range check         0       0/s            0.00
Full rng join 201.66k     0.3/s            0.02
Sort scan     121.17M   152.9/s
Sort range     26.33M    33.2/s
Sort mrg pass   3.26M     4.1/s

The Query Cache section describes the query cache fill percentage and its usage: There are two important aspects to this part of the report. First, it allows you to monitor the hit to insert ratio and the prune ratio. The hit to insert ratio in particular is very important, as you’re really looking for a case where this ratio is quite hit-heavy. This represents a stable query cache that is very effective. However, if this ratio is close to 1:1, the cache is not being very effective and may not be worth the extra locking. Secondly, this section allows you to monitor how full the query cache is and whether there are excessive amounts of prunes. Assuming your cache is effective, it being full and being pruned often could mean it needs to be bigger:

__ Query Cache _________________________________________________________
Memory usage   26.57M of  50.00M  %Used:  53.15
Block Fragmnt  16.39%
Hits          920.26M    1.2k/s
Inserts       807.94M    1.0k/s
Insrt:Prune    1.24:1   196.1/s
Hit:Insert     1.14:1

The next section is pretty simple: it just describes table locks. If you have a lot of waited table locks, you may have a MyISAM table that you need to convert to InnoDB:

__ Table Locks _________________________________________________________
Waited              0       0/s  %Total:   0.00
Immediate       1.21G    1.5k/s

The Tables section describes the table cache. The major concern here is whether it is 100% full or not. In this example, the table cache should likely be increased as it is full:

__ Tables ______________________________________________________________
Open             1000 of 1000    %Cache: 100.00
Opened          1.55k     0.0/s

The next section allows you to monitor your peak connection count:

__ Connections _________________________________________________________
Max used          289 of  300      %Max:  96.33
Total          21.09M    26.6/s

The following section is largely useful for figuring out how many of your temporary tables are hitting disk and finding temporary table spikes. If you have a huge spike after a deployment, for example, there is likely an ill-behaved query in that code push and it should either be reverted or hot-fixed:

__ Created Temp ________________________________________________________
Disk table      4.59M     5.8/s
Table         129.73M   163.7/s    Size: 200.0M
File          859.24k     1.1/s

The Threads section of the report shows what proportion of your threads are coming off the thread cache (ideally, most of them). As you can see in this example, this server isn’t doing well in this regard and should have the thread cache settings checked:

__ Threads _____________________________________________________________
Running             4 of   19
Cached              0 of    0      %Hit:      0
Created        21.09M    26.6/s
Slow                0       0/s

The next section is often not a focus, but if you see large spikes in aborted clients and connections, it can indicate network issues:

__ Aborted _____________________________________________________________
Clients            50     0.0/s
Connects       31.32k     0.0/s

The Bytes section gives a simple data transfer report:

__ Bytes _______________________________________________________________
Sent            1.65T    2.1M/s
Received      240.05G  302.9k/s

The next section of the report covers the InnoDB buffer pool usage and fill rate; it is likely the most important section of the report for an InnoDB server. In this example, you can see that the buffer pool is 100% full and should be increased in size. However, it’s not the end of the world, as the read hit rate is still quite high, if not 100%. This part of the report also includes information on the pages in the buffer pool, the number of pages free, the number of data pages, etc., and it covers the IO InnoDB is performing (pages read and written and flushes to disk performed). In general, you are reading this section to ensure that your buffer pool has a 100% read hit rate and has free pages:

__ InnoDB Buffer Pool __________________________________________________
Usage          15.00G of  15.00G  %Used: 100.00
Read hit       99.97%
Pages
  Free              0            %Total:   0.00
  Data        828.94k                     84.32 %Drty:   3.37
  Misc         154101                     15.68
  Latched                                  0.00
Reads         411.66G  519.5k/s
  From file   129.71M   163.7/s            0.03
  Ahead Rnd         0       0/s
  Ahead Sql                 0/s
Writes        350.34M   442.1/s
Flushes        18.50M    23.3/s
Wait Free         222     0.0/s

The following section gives some introspection on InnoDB locking. The biggest things to watch for here are excessive waits (if it can actually compute a per-second value for this, it is a bad sign) and your average time for acquiring a lock. As you can see in this example, we actually have some slight locking issues on this server. At this point, we would want to review the slow log to look for queries with high lock times (note that only recent MySQL versions support InnoDB lock times in the slow log):

__ InnoDB Lock _________________________________________________________
Waits           16218     0.0/s
Current             0
Time acquiring
  Total       4370595 ms
  Average         269 ms
  Max           19367 ms

Finally, the last section describes InnoDB data and IO usage in detail. Unless you are doing fairly advanced tuning and configuration related to IO throughput, this section is likely not a focus for you:

__ InnoDB Data, Pages, Rows ____________________________________________
Data
  Reads       130.65M   164.9/s
  Writes       15.47M    19.5/s
  fsync         2.09M     2.6/s
  Pending
    Reads           1
    Writes          0
    fsync           0

Pages
  Created     686.23k     0.9/s
  Read        140.26M   177.0/s
  Written      18.50M    23.3/s

Rows
  Deleted      12.93M    16.3/s
  Inserted     14.48M    18.3/s
  Read        178.26G  224.9k/s
  Updated       8.96M    11.3/s

Percona Monitoring Plug-Ins

Percona Monitoring Plug-Ins is a set of extensions for Nagios/Icinga and Cacti that make them far more useful for MySQL administrators. They are a very useful addition to most Nagios or Cacti installations.

Get High Performance Drupal now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.