Chapter 16. Alternative Storage and Cache Backends

Drupal uses MySQL as the default storage backend for all data on a site. This includes low-level APIs such as caching and queues, as well as site content.

MySQL provides a great database backend for Drupal, but there are certain issues that arise, especially as Drupal sites get larger. Drupal is very modular and flexible in terms of how data can be defined. Entity types such as nodes, users, and comments can be extended with configurable fields, and then the Views module allows those to be queried and presented—all without leaving the user interface. This is a very powerful set of features, but it can make getting great performance out of a traditional relational database quite difficult. The entity/field storage provided by core provides one or two tables for each core entity, then two tables for each individual field (one for the current or “live” revision and one for older revisions). Queries with a condition on one field that sort on another then have to join across multiple tables, which is very hard to optimize with a relational database.

Another common performance bottleneck that particularly affects higher-traffic Drupal sites is that cache and session data are stored in MySQL tables by default. While this is generally fine for smaller sites, as the number of cache and session objects grows, frequent actions such as cache invalidation put a high demand on the database layer. This can cause increased network traffic and ultimately increase the load on the database servers. There are multiple Drupal contributed projects that provide a way to pull cache entries out of MySQL and store them in an alternative storage backend, such as Memcached or Redis. Not only can these backends help to offload queries from MySQL, but they can be easily scaled out horizontally as your site grows.

Note

Drupal’s default MySQL-based search can also be replaced with alternative backend solutions. This is covered in Chapter 17.

Cache, Lock, and Session Storage

As introduced in Chapter 3, Drupal’s cache API stores data in MySQL by default, but it can be easily integrated with other technologies. Swapping out the cache backend can help improve both performance and scalability; however, we should look closely at exactly what changes when the cache storage is replaced. One incorrect assumption that many people make is that replacing the cache backend will automatically make each individual cache request faster. For cache gets, this is often not the case—a well-tuned MySQL server can perform as well as or better than Memcache in some cases, although Memcache should almost always outperform MySQL for cache sets.

The true benefit becomes clearer as traffic to the site increases. Key/value stores such as Memcached and Redis are great at dealing with large amounts of data, and Memcached in particular can be scaled horizontally simply by adding more servers. MySQL, on the other hand, is far more difficult to scale horizontally and is likely to also be being used as your primary data source. Having your caching layer impacting the speed of your primary data store is not really acceptable. By swapping out cache backends, we achieve two things:

  1. Offloading queries from MySQL, allowing the server to perform better for other queries
  2. Placing data into a key/value store, which will maintain performance as the data size increases to multiple gigabytes and allows for easy horizontal scaling

This is why even though replacing the cache backend may not improve speed for an individual request, it can actually make a dramatic difference for large or heavily loaded sites.

So far we have concentrated on two alternatives: Redis and Memcached. There are other cache storage options, such as the APC user cache (this allows you to use the APC cache for objects other than PHP opcode); however, Redis and Memcache are by far the most widely used options for large Drupal sites, as they perform well and are generally much easier to scale. APC in particular is unable to share its cache across multiple servers, which makes it unusable as a data cache for any site that uses more than one web server, and even on a single server the user cache suffers from fragmentation and does not implement the least recently used (LRU) eviction. (APC is obviously still very usable as an opcode cache, even when you have multiple servers.)

Redis and Memcached both provide a similar service: an in-memory key/value store. They are both designed to provide O(1) performance for all GET and SET operations. What this means in practice is that GET and SET calls maintain a constant speed as your cache size grows (this is a very good thing!). While both are strictly in-memory caches for all queries, Redis also adds the option for on-disk persistence using one of two options: using periodic flushes to disk or an append-only file to store all commands. The benefit to persistence in this situation is that if your caching servers are restarted or go offline for any reason, the cache will be prewarmed when the servers come back online—meaning a smaller dip in performance while the caches are rebuilt.

As far as the Drupal modules for Memcached and Redis are concerned, Memcache has been around a lot longer and is more mature, with a wider user base. That said, the Redis module is quickly gaining popularity now that Redis technology has matured. For new sites, either option works; we suggest testing out both and seeing which you are more comfortable with. Both modules include their own lock implementations, and Redis can also be leveraged for a queue implementation (using the Redis Queue module). While not specifically related to caching, this can be useful in situations where you require a job queue.

Memcache In Depth

With the Drupal Memcache module being a bit more mature and better tested, we’ll focus on it for a more in-depth exploration. First, it’s important to understand how things differ between Memcache and the default MySQL-backed cache API:

  1. All data that was being stored in the cache* tables in MySQL will instead be stored in Memcached. There is a way to override this for individual caches, covered in the next section.
  2. Clearing Drupal caches does not explicitly delete things out of Memcached. Instead, the time of the last cache clear is tracked and retrieved items are compared to that timestamp (along with any item-specific expiry time) to see if they are still valid.
  3. Since the maximum amount of memory the Memcached daemon can allocate is defined at startup, it’s possible to fill up available memory, at which time Memcached will evict least-recently-used items—meaning you can lose your items out of the cache if it isn’t large enough. Don’t worry, we’ll explain how to monitor and avoid this.

PHP Extensions for Memcache

In order to use the Memcache module, you’ll need to add Memcache support to PHP. There are currently two PHP Memcache extensions available in PECL: Memcache and Memcached. The older and more stable of the two is Memcache (no d). Later, the Memcached extension was created as a rewrite and to leverage libmemcached, a lightweight library shared by several client implementations. Both are considered stable, and either will work with the Drupal Memcache module. In general, we recommend the Memcache extension, as it’s consistently proven to be more stable. Though the Memcached extension enjoyed a surge of popularity, it’s been linked to a number of bugs that have proven difficult to solve. All modern Memcache features are provided by both PECL extensions, though as of this writing, you will need to use a Beta 3.x version of the Memcache extension to access some newer features (this has been in Beta since 2008 and is widely considered stable). The Memcache PECL extension has been more heavily tested by the Drupal module’s authors. That said, either one should work fine for most situations, and both are packaged on the majority of Linux distributions.

Warning

The PHP Memcache extensions use different hashing algorithms, which will lead to cache inconsistencies if you have multiple web servers using different Memcache extensions. Be sure that whichever extension you choose is used by all of your servers.

Assigning Memcached Servers and Bins

The Memcache module supports splitting a single cache storage bin across multiple Memcached instances. The Memcached instances could be distributed across different servers, or multiple instances on the same server (or some combination). There are a lot of references online that suggest splitting up the various Drupal cache tables so that each has its own Memcached instance. This was required at one point due to the lack of wildcard flushes in the Memcache module. However, now that wildcard flushes are supported, there is generally no reason to split the caches into individual Memcached instances (sessions being a notable exception, as covered in the section What to Store in Memcache). The settings for this would look something like the following configuration, but for the latest documentation on how to configure the Memcache module, please see the module’s README.txt file:

$conf['memcache_servers'] = array(
  '172.16.1.5:11211' => 'default',
  '172.16.1.6:11211' => 'default',
  '172.16.1.6:11212' => 'default'
);

Note

If you are connecting to a Memcached instance on a local server, it’s also possible (and potentially better performing) to connect over a Unix socket. Simply use the syntax unix:///path/to/socket for the server definition in your $conf[memcache_servers] array.

It is useful and recommended to run multiple Memcached instances across servers for all cache bins. By splitting the cache across multiple servers, you ensure that if one of the servers goes offline, only that portion of your cache is lost. On the other hand, if you only have one Memcached instance, you will be left without any cache entries if that Memcached instance goes offline. In many cases, we find that creating a Memcached instance on each of the web servers is a good setup. Another alternative, if your web servers don’t have memory to spare, is to create a group of servers specifically for Memcached. Whatever you decide, be sure that the Memcached port is not open to external traffic—this is covered below in the section Configuring the Memcache Daemon.

When using multiple Memcached instances for a single cache bin, the PHP Memcache(d) extension uses a hashing algorithm to create a key hash for each item, and then uses that hash to decide which Memcached instance to store/fetch the item on. If you are using multiple Memcached instances, it is important to use the consistent hashing algorithm, since it will minimize cache item reassignments if your Memcached instances ever change—for example, when adding or removing a server. The two PHP extensions have different settings for this value, and different packages or Linux distributions may override the default, so you should always verify that this is set correctly.

To enable consistent hashing in the PHP Memcache extension, use:

memcache.hash_strategy consistent

To enable consistent hashing in the PHP Memcached extension, use:

memcached.sess_consistent_hash 1

The default behavior of the PHP Memcache(d) extensions is to only store values in a single Memcached instance. Adding additional Memcached servers and grouping them into a single bin in your Drupal settings does not get you high availability for your cache data. Both of the extensions provide a configuration option that enables them to write items to multiple servers at the same time. However, this can lead to invalid cache items being served should a server go offline for a time and then come back. Generally, it’s better to have a few cache misses if a server goes down instead of worrying about high availability for your cache data and then having to deal with potential cache corruption.

Memcache Locking and Stampede Protection

The Drupal Memcache module provides stampede protection, which helps minimize the overhead of rebuilding cache items during a cache rebuild. For example, if the cached version of a popular article on your website was invalidated by a visitor posting a comment, and then several other visitors requested that page at the same time, each Drupal process accessing the article would typically go through the following steps (unaware that other processors were doing the same thing):

  1. Check for the item in Memcache.
  2. See that the item is invalid/expired.
  3. Regenerate the item (this step can involve many SQL queries, PHP processing, etc.).
  4. Push the data into a cache item in Memcache.

This can be particularly problematic if step 3 is a large operation that could put heavy load onto your database. Enabling stampede protection, ensures that the module will create a lock for the cache item so that only one process will attempt to update it in Memcache. Other requests that come along while the lock is in place will serve stale content out of Memcache (or simply wait for a valid cache entry if the existing item is missing or invalid as opposed to expired) to prevent overloading of the database while the cache is being rebuilt.

As stampede protection relies on Drupal’s locking layer, it’s critical that you also move locks out of MySQL and into Memcache. If you enable stampede protection without also moving locks into Memcache, you can experience severe performance degradation. Enabling stampede protection and the Memcache lock implementation is done with the following settings in settings.php:

$conf['memcache_stampede_protection'] = TRUE;
$conf['lock_inc'] = 'sites/all/modules/memcache/memcache-lock.inc';

What to Store in Memcache

When you enable the Drupal Memcache module, by default, all cache tables will be stored in Memcache. In Drupal 8, form state information has been moved into a key/value store, but in previous versions, it was stored in a cache table (though not strictly a cache). For Drupal 7 sites using Memcache, it is important to keep the cache_form table in MySQL to ensure form data isn’t lost if Memcached goes offline. If the form IDs are lost, then any form submissions will fail, and since Memcached is not persistent, there is the potential that this could happen. So, be sure to follow the instructions and always keep the cache_form table in MySQL. This override is done for Drupal 7 with the following setting in settings.php:

$conf['cache_class_cache_form'] = 'DrupalDatabaseCache';

The Memcache module also gives the option to store Drupal sessions in Memcached instead of MySQL. As of this writing, though the 6.x implementation is considered stable, the 7.x version has known bugs (tracked in https://drupal.org/node/656838). Some people see session storage in Memcache as a great feature, while others see it as a disaster waiting to happen. The nice part about moving sessions out of MySQL is it means a lot less read and write activity on the database. The downside is the potential to lose the session information out of Memcache in the event of a Memcached server going offline or because of an eviction should the Memcache bin fill up. We recommend doing two things to reduce the risk of sessions being evicted from Memcache: create a new bin dedicated to sessions with ample space so other caches can’t cause session evictions, and spread the sessions bin across multiple servers so a failing Memcached server only invalidates a portion of live sessions.

Configuring the Memcache Daemon

As mentioned earlier in this chapter, it’s very important to allocate enough space in Memcached that you aren’t constantly running out of memory for cache storage and forcing Memcache to evict valid cache items in order to store new items. Most Linux distributions ship Memcached packages with a default bin size of 32–64 MB. Generally, this is way too small for Drupal sites, as many sites can end up with multiple gigabytes of cached content. When first setting up Memcached, you can guess your size requirements by looking at the size of your cache tables in MySQL—while this won’t be an exact representation of the size required to store your cache items in Memcache, it can provide at least a ballpark figure to get you started. From there, you can watch the Memcache usage using the built-in stats command or the statistics section provided by the Memcache_admin module, or track it with something like Munin or Cacti. If you start to see space near full or a large number of evictions, then you should increase the memory allocation for Memcached.

It’s possible that you’ll start to see evictions happen before the Memcached instance is using 100% of its allocated memory. This is because Memcached allocates space in various slabs, where each slab is used to store items of a certain size range—for example, one slab will store items that need 1,000–2,000 bytes, another slab will store items that need 2,001–4,000 bytes, etc. So, if you have a lot of items with a similar size, then it’s possible to fill up a slab while the overall Memcached instance still has free space in other slabs. While you can adjust the slab allocation size increments, it is generally not a good practice; except in very special cases, you’re generally better off to simply allocate more memory to Memcached overall.

Beyond the memory allocation, you shouldn’t need to customize much in the Memcached configuration. One important point, however, is to ensure that Memcached is locked down to prevent remote access. This can be done either with iptables, by hiding the Memcached server(s) behind a network address translated network, by using the -l flag to Memcached to specify which port to listen on, or using some combination of those options.

How to Break Your Site with Memcache

Although the Memcache module is in very wide use, its usage seems to be a source of confusion for many. There are a number of common problems we’ve come across related to Drupal Memcache configuration that we’ll share here, in the hope that it will help others to implement Memcache correctly.

Inconsistent Caching

In infrastructures with multiple web nodes, it’s important that the Drupal Memcache server and bin configuration is the same across all servers. The three most popular ways to break this rule are to:

  • Configure one web server to list more or different Memcache bins than other servers. This will cause the hashing algorithm on each web server to assign different servers for the same cache item, leading to an inconsistent cache.
  • Configure each web server to only connect to Memcached on localhost. A common variation of the previous example, this causes each web server to have its own separate cache, meaning you may see changes in content based on which web server the load balancer directs you to. Also, depending how your cache clears are done, you may only ever be clearing out the cache from one of the Memcached instances.
  • Forget to enable consistent hashing in the PHP Memcache extension configuration. This causes the PHP extension to use the standard hashing strategy, which invalidates your entire cache on all servers any time a server is brought online or offline for any reason. You should be sure to always set this explicitly in your PHP configuration, because different packages on different Linux distributions will have different defaults for this setting.

Constant Evictions

As described previously, it’s important to allocate enough memory to Memcached so that it can fit all of your active cache items. Failing to do this, or failure to watch the usage as your site grows, can cause it to run out of space and be forced to evict cache items before they expire. Avoid this by closely monitoring memory usage and evictions. You should keep in mind that some evictions in the statistics are not bad. Because Memcache uses an LRU algorithm and the Drupal Memcache module, by design, doesn’t actually flush out data, some evictions are to be expected. What you don’t want to have is a constant level of evictions, as that likely means you don’t have enough memory to fit your full data set.

Vanishing Sessions

If you are storing Drupal sessions in Memcache, it’s important to ensure two things:

  • Your Memcached servers and your network connections between them and your web nodes are stable. If you lose even one of your Memcached servers, even for a small duration of time, you will lose session data for some number of active users.
  • Sessions are being stored in a bin where they will not be evicted. Generally this means creating a separate bin exclusively to hold sessions, and ensuring that it has more than enough space for your active traffic.

Sessions are an area where Redis has an advantage over memcache: items are persisted to disk, and it’s possible to have an instance with LRU disabled specifically for session storage, ensuring that sessions won’t be evicted until they’ve expired.

Entity/Field Storage

Drupal 7 introduced the field API to Drupal core, providing functionality previously provided by the Content Construction Kit, or CCK. It’s this API that allows for entity types such as nodes, users, and comments to be extended with configurable fields. Since fields may be added to or removed from entities at any time, each field is stored in its own set of database tables. One table stores the values for the current revision, and another stores the values for any historical revisions of the entity. Both are written to when entities are saved, but only one is queried at any one time.

Loading of field values happens via a single query, and the values are cached via Drupal’s cache API, so in itself is not a significant performance issue, and the Entity Cache contributed module allows for caching of the entire entity object, including data from base tables and hook implementations. However, in certain cases, saving entities can require dozens or hundreds of writes to the various field tables. Where performance issues usually arise first are with queries generating lists of entities: a condition on one field and a sort on the other requires a JOIN, making it impossible for MySQL to effectively use indexes for those queries.

The vast majority of sites are able to deal with these issues by caching the results of listing queries (either using Drupal’s cache API directly, or by enabling caching in the Views module) and using pregeneration or custom denormalization in SQL for particularly bad queries.

However, with large data sets and a highly dynamic site with lots of pages, cache hit rates may be too low, or queries may be unacceptably slow on a cache miss.

To allow sites to optimize for these cases, the field API also has the concept of pluggable storage. This allows the field_sql_storage module to be replaced entirely. Drupal 7 does not, however, allow the entity base tables to be replaced, so data is still written to those and core queries them directly. In practice, it was found that field storage implementations needed to store the entity base table information as well, requiring duplication between the hardcoded entity storage and the field storage.

As a result, Drupal 8 has the same default MySQL implementation, but storage is controlled at the entity level rather than for individual fields, allowing the base tables to be replaced as well as the field storage.

While entity/field storage is pluggable, changing the storage layer is unfortunately not as transparent (especially in Drupal 7) as it is with the cache or sessions. Many contributed modules query entity tables directly, and poorly written custom modules may directly query field tables. If considering using pluggable storage, there are several best practices to keep in mind in order to ensure the smoothest possible transition. If you have an existing site and are considering changing the entity storage, enforcing these best practices at the application level early on is a prerequisite to making the switch, and you should also be prepared to audit contributed modules as well as your own custom code.

EntityFieldQuery/EntityQuery

When querying entities, using EntityQuery as opposed to db_select() or db_query() on the tables directly ensures that the query will work regardless of the storage used.

When using Views, the EntityFieldQuery Views Backend module replaces Views’s SQL query builder with an EntityFieldQuery builder, providing the same interoperability for exported views.

CRUD

Contributed and custom modules often attempt to avoid loading or saving full entities, opting instead to query individual values or to update via db_update(), especially when multiple entities are being loaded or updated. Bypassing the API like this can lead to inconsistent results due to caches not being invalidated correctly, alter hooks not firing on load, etc. It also makes it impossible to use those modules with an alternative storage backend, since the database table being queried directly may not even exist. And when loading, querying the table bypasses both static and persistent caches, so it’s often slower anyway. Always use the proper CRUD (create, read, update, and delete) functions whenever dealing with entities.

MongoDB

While there are theoretically many potential options for entity storage, currently the only viable option is MongoDB. MongoDB is an open source document database that has so far proved extremely promising in solving many of the performance issues that affect Drupal’s entity storage. Integration is provided via Prod: change text to the MongoDB module. MongoDB allows considerable improvements in both the performance and the scalability of sites with very large data sets, but it’s still a relatively new technology both in terms of general adoption and Drupal specifically. You should carefully consider whether your project requires MongoDB entity/field storage before making the change, since it imposes development constraints (albeit good ones) as well as introducing an additional system to administer—one that holds all of your site’s most important data! You’ll need to set up replication and backups for MongoDB in addition to MySQL. Also ensure you’re comfortable maintaining MongoDB long-term. Entity data cannot be regenerated in the same way as a cache entry if it gets lost, and migration from MySQL to MongoDB and back again is a manual process at the time of writing, so while it is possible, changing entity storage on a live site is a very complex process.

The main feature of MongoDB as a document database is that each entity is stored as a single record. MongoDB stores documents in BSON format (a variant of JSON), so the resulting entity storage is very similar to what you’d see running json_encode(entity_load(1));. It should look quite familiar to any Drupal developer who’s ever looked at an entity structure with var_export() or via a web service. Queries against MongoDB are written in JSON syntax; while this is useful on the command line, at the Drupal level, you should only ever use EntityFieldQuery to avoid being locked into MongoDB at the application layer (much as using db_query() and db_select() enforces MySQL storage).

As opposed to potentially dozens of individual database tables, a single record is created with both the properties and configurable fields of the entity all in the same place. Since configurable fields can have multiple values, this is impossible to achieve in SQL in terms of a single database table with a column for each field. This allows complex entity queries, which might JOIN across 20 or more tables with the default MySQL field storage, to be executed against a single collection.

MongoDB is schemaless, in that it’s not necessary to define the structure of a collection up front before documents can be saved. Therefore, adding new fields or field values does not require any Data Definition Language (DDL) operations on the storage. However, being schemaless doesn’t mean no schema as such; it just means it’s entirely up to the application to enforce it.

Indexes can be applied to collections in a similar way to how MySQL indexes are applied to tables. There is currently no automated way for the MongoDB module to create indexes for EntityFieldQuery, so sites planning to use the MongoDB module in production should expect to audit queries and add their own indexes.

One of the main limitations of MongoDB is the inability to join between collections. The MongoDB module stores entities in their own collections, meaning that it’s not possible to query specific groups of nodes as easily as it would be with MySQL—for example, nodes created by users who live in Australia. There are workarounds for this—for example, adding an “author country” field to nodes and copying over the value from the user would allow that query to be run—but as with denormalization in MySQL, this requires maintaining both sets of data should the user record be updated.

Get High Performance Drupal now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.