Chapter 19. Reverse Proxies and Content Delivery Networks

We’ve discussed a number of ways to speed up Drupal sites by improving code, optimizing infrastructure, and speeding up database queries. These changes can make a huge difference in website performance, but they can only go so far. Site performance can be improved even further by caching content before requests even reach Apache (and Drupal, PHP, or MySQL). Reverse proxies provide a way to do just that: they cache static items such as images, JavaScript, and CSS, and potentially even full pages, and serve those items in a fraction of the time it would take if the request had to go through Apache, PHP, and MySQL. This chapter focuses on reverse proxies with built-in caches, also referred to as web accelerators.

Content delivery networks (CDNs) take the idea of a reverse proxy a step further by moving cached content physically closer to website visitors. In addition to offloading network traffic from your infrastructure, this also makes your site seem much faster to visitors by reducing their network latency.

Using a Reverse Proxy with Drupal

When implementing a reverse proxy cache in front of your web servers, there are many options to consider. First up is which reverse proxy to use. Varnish has become very popular in recent years and its configuration language is extremely powerful, allowing for very specific caching configurations. Varnish has generally become accepted by the Drupal community as the reverse proxy of choice, so we’ll concentrate solely on Varnish here. The overall ideas apply to any reverse proxy, but the specific configuration examples will apply to Varnish only.

Other Popular Reverse Proxy Caches

Of course, Varnish isn’t the only option out there. Other popular reverse proxy caches include the following:

Nginx: In addition to functioning as a web server in place of Apache, Nginx is also frequently used as a standalone reverse proxy.
Squid: Squid is one of the oldest reverse proxies still available today, though its use has declined in favor of Varnish in recent years.
Apache Traffic Server: Originally developed at Yahoo!, Traffic Server was donated to the Apache Software Foundation and is now fully open source.

Secondly, you need to decide how much you want to cache and how integrated the proxy will be with your website. For example, caching static content items like images and JavaScript is relatively easy and can be done “out of the box” with minimal configuration, using pretty much any reverse proxy. If you want to take things a step further and start caching full pages served by Drupal, that requires a bit more configuration. Going even further, you can closely integrate Varnish and Drupal such that when an object is edited, Drupal can immediately purge pages containing that object out of Varnish’s cache to prevent stale content from being served. While the more advanced actions may take a bit more configuration, it’s not terribly difficult once you get comfortable with the configuration language—and there are a number of contributed Drupal modules that will help ease the process as well.

Let’s look at an example of using Varnish to cache. Figure 19-1 shows Varnish handling incoming HTTP requests for the site. When Varnish has a valid item in its cache, it will serve that item from the cache immediately with no backend request to the web server. When Varnish does not have an item cached, or has an out-of-date item cached, it will make a request to the web server for the item, store it in the local Varnish cache if it’s cacheable, and then return it to the client.

[Figure: Varnish handling incoming HTTP requests, serving cached items, and connecting to web server(s) on the backend as needed]

Figure 19-1. Varnish reverse proxy

Understanding Varnish Configuration Language

Varnish Configuration Language (VCL) is used to define how Varnish will handle requests, cache items, and connect to one or more backends (web servers). You don’t need to be an expert in VCL in order to use Varnish, but understanding at least the default subroutines and their behavior will make it much easier to customize a VCL file for your specific needs. VCL was designed to be similar to C and Perl, and therefore is easy to pick up for most developers and system administrators.

Note

Changes to the VCL file used by Varnish do not take effect immediately; they must first be compiled and then loaded into Varnish. This can be done with commands through the Varnish admin interface, or happens automatically when the Varnish daemon is restarted.

Loading VCL Changes

In order to pull in changes to your VCL file, you will need to compile it in Varnish. Compiling the VCL in Varnish is preferred over fully restarting the Varnish daemon for a couple of reasons. First, you won’t interrupt existing connections. Second, if you restart Varnish and the new VCL file has an error, Varnish will refuse to start; however, if you manually compile a new VCL file with an error, Varnish will report the error but continue to run with the old, working configuration.

To compile a VCL file, you need to connect to the Varnish administrative interface. This can be done either with telnet or with the varnishadm utility. Here’s an example using varnishadm to connect to the Varnish admin port, then compile and load the updated VCL file /etc/varnish/example.vcl. The name newconfig is an arbitrary label used to reference the compiled configuration; you can use whatever name you like:

$ varnishadm
200
-----------------------------
Varnish Cache CLI 1.0
-----------------------------
Linux,3.9.10-100.fc17.x86_64,x86_64,-sfile,-smalloc,-hcritbit

Type 'help' for command list.
Type 'quit' to close CLI session.

> vcl.load newconfig /etc/varnish/example.vcl
200 13
VCL compiled.
> vcl.use newconfig
200 0

Defining a Backend

The first step when setting up Varnish is to configure it with your backend information. The backend declaration supplies Varnish with information on how to connect to your backend web server(s). The declaration can be as simple as providing a hostname and port, but it also allows you to configure additional options such as connection timeouts, max connections, and probe checks, which are used to check whether the backend is healthy or not. An example backend server declaration looks like this:

backend default {
  .host = "10.0.1.15";
  .port = "80";
  .connect_timeout = 20s;
  .first_byte_timeout = 20s;
  .between_bytes_timeout = 10s;
  .max_connections = 120;
  .probe = {
    .request =
      "GET /healthcheck.php HTTP/1.1"
      "Host: www.example.com"
      "Connection: close"
      "Accept-Encoding: gzip" ;
    .interval = 5s;
    .timeout = 3s;
    .window = 5;
    .threshold = 3;
  }
}

This example defines a backend named default, which will connect to port 80 on host 10.0.1.15. Varnish will return an error if any of the timeouts are hit while making a request to the backend. The max_connections setting limits the number of simultaneous connections that Varnish will make to the backend; this should not be more than your Apache MaxClients setting on the web server, discussed in the previous chapter.

Note

When running a single server with both Apache and Varnish, you will need to have Varnish listen on port 80 and move Apache to listen on an alternate port (via the Listen setting in httpd.conf). Also be sure to set the correct Apache port in your VCL backend settings.
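
As a minimal sketch of that single-server layout, the backend declaration might look like the following. The port number 8080 is an assumption for illustration; use whatever port you moved Apache to:

# Apache and Varnish on the same machine: Varnish listens on port 80,
# and Apache has been moved to an alternate port (8080 is assumed here).
backend default {
  .host = "127.0.0.1";
  .port = "8080";
}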

The probe section is optional; it defines a periodic health check for the backend. Using a probe gives you a proactive way to check the status of a backend—otherwise, the status will only be updated if it hits a timeout when making a backend request for a client. Obviously, that particular client request will be served a Varnish error. It’s best to define a probe, especially if you have multiple backends, so sick backends are automatically avoided with the use of a director. Once a probe detects that the backend has returned to a healthy state, the backend will be made active again in the director.

Inside of the probe definition, you are able to specify a URL to request, along with any other headers that should be used for the request. In addition, you define how often to run the check, what its timeout is, and window and threshold values. In order to declare a backend as healthy, Varnish will look at the past X probe responses, where X is your window value; at least threshold of them must be successful. In our example, Varnish will look at the previous five probe requests, and if three or more of them were successful, the backend will be marked as healthy. If two or fewer probe requests were successful, the backend will be marked as sick.
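
For a simple GET check, a probe can also be written with the .url shorthand instead of spelling out a full .request. The following is a sketch only; the /healthcheck.php path and the timing values are illustrative assumptions:

backend default {
  .host = "10.0.1.15";
  .port = "80";
  .probe = {
    # Hypothetical lightweight page; Varnish sends a GET request for it.
    .url = "/healthcheck.php";
    .interval = 5s;   # run the check every 5 seconds
    .timeout = 3s;    # a probe taking longer than this counts as failed
    .window = 5;      # consider the last 5 probes...
    .threshold = 3;   # ...and require at least 3 successes to be healthy
  }
}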

Note

Your probe should use a simple page as a check. You don’t want to use a page that will put a large load on your web server or take a long time to respond, though it’s best to choose a page that verifies the full stack is working. For example, use a simple Drupal page instead of using a static HTML page, which might succeed even if there were problems with Drupal, PHP, and/or the database.

Directors: Dealing with Multiple Backend Servers

If you have more than one web server, you can define each in its own backend declaration, but then you’ll need a way to group them together and tell Varnish how to direct traffic between them; that’s where directors come in. In Varnish, a director is a logical grouping of backend servers. There are a number of different director types that use different algorithms to decide which backend to use for a given request. To declare a director, you need to give it a name, tell it which director type to use, and then define which backend servers to include.
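
As a rough sketch, two web servers might be declared as separate backends like this before being grouped into a director. The names web1 and web2 and the IP addresses are assumptions for illustration:

# First web server (hypothetical address).
backend web1 {
  .host = "10.0.1.15";
  .port = "80";
}

# Second web server (hypothetical address).
backend web2 {
  .host = "10.0.1.16";
  .port = "80";
}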

Varnish Director Types

The types of director available in Varnish include:

Random: Picks a backend at random, though weights can be used to adjust the chance of using a particular backend.
Client: Uses the client’s identity to choose a backend.
Hash: Chooses a backend based on the request URL hash. This is useful if you have one Varnish instance load balancing in front of multiple Varnish servers and want to split the cache among them; it prevents cache duplication between the servers.
Round-robin: Cycles through a list of backends, directing one request to each, then moving on to the next.
DNS: Allows you to specify a list of IPs or a netblock to use for backend servers. This director makes it easy to define a large number of backend servers with minimal configuration in the VCL.
Fallback: Provided with a list of backends, this director will start at the top of the list and use the first one that is considered healthy.

The following example shows how to define a round-robin director, which will simply loop over all the backends listed, directing a single request to one backend and then moving to the next backend for the next request. Assume we’ve already defined two backends, web1 and web2:

director main round-robin {
  {
    .backend = web1;
  }
  {
    .backend = web2;
  }
}

This defines a director named main, which will direct traffic evenly between both of the backends. Another commonly used director is the random director, which will randomly select a backend from a list, though you can also weight the backend servers in order to send more or less traffic to them. This can be useful if your web servers are not uniform and one has more processing power than another. An example definition looks like this:

director loadbalance random {
  {
    .backend = web1;
    .weight  = 3;
  }
  {
    .backend = web2;
    .weight  = 1;
  }
  {
    .backend = web3;
    .weight  = 1;
  }
}

This example declares a director named loadbalance that randomly selects a backend server from those listed, but will give the web1 backend a weight such that it will be selected roughly three times as often as either of the other two backend servers.
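
Other director types are declared in the same way. As a sketch, assuming the same web1 and web2 backends, a fallback director (the name failover is hypothetical) would send all traffic to web1 and use web2 only when web1 is considered sick. This relies on health probes being defined for the backends:

director failover fallback {
  # Backends are tried in the order listed; the first healthy one wins.
  {
    .backend = web1;
  }
  {
    .backend = web2;
  }
}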

Once you define a director, Varnish needs to be told when to use that director. This is configured within the vcl_recv subroutine, described in more depth in the next section. At its simplest, all you need to do is set the req.backend variable to the backend or director you want to use. The setting used is the same whether you are setting it to a single backend or to a director:

sub vcl_recv {
  set req.backend = loadbalance;
}

This example sets the default backend to the loadbalance director we defined in the previous example.
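
Because the backend is chosen per request, vcl_recv can also pick different backends for different requests. The following sketch assumes the web1 backend and loadbalance director from the earlier examples; the /admin path prefix is purely illustrative:

sub vcl_recv {
  if (req.url ~ "^/admin") {
    # Hypothetical rule: send administrative pages to a single server.
    set req.backend = web1;
  } else {
    # Everything else is spread across the director's backends.
    set req.backend = loadbalance;
  }
}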

Built-in VCL Subroutines

There are a number of subroutines used by Varnish to handle requests. You can override or “prepend” any of these subroutines with your own definitions. Most of the time, only one or two of these subroutines need to be modified in order to work well with your site. However, some sites need a bit more customization and end up overriding most or all of the built-in subroutines. It’s important to know that if you don’t define one of the built-in subroutines in your VCL file, that subroutine will still exist in its default configuration. Here is a quick overview of some of the predefined subroutines used in the VCL file that are commonly modified with site-specific configurations:

vcl_recv: This subroutine is called to deal with an incoming request. It contains the logic to tell Varnish if and how to serve the request—for example, whether it should return a cached item or bypass the cache and fetch an object from a backend server in order to fulfill the request. vcl_recv is also where you declare which backend (or director) to use for a given request.
vcl_fetch: This subroutine is called after an item has been fetched from a backend. Generally, things to check for here include if a cookie is being set, or some other response that indicates the item should not be cached.
vcl_error: This subroutine is called when an error is hit either within Varnish or with the backend response. Here you have the option of calling a restart on the request to try again, in the hopes of not getting an error the second time. This subroutine is also used to customize the Varnish error output and to deal with custom error codes that may have been set elsewhere in the VCL.
vcl_hash: This subroutine creates a hash for the cached item. The hash is used internally by Varnish for future lookups and by default includes the URL and either the HTTP host or IP address. If you are doing some custom caching (e.g., splitting the cache) based on a cookie or a certain header in the request, then you would likely do so by including that item in your hash calculation. This is useful, for example, if you want to cache mobile requests separately from regular “desktop” requests based on a device type cookie or header.
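
To make the vcl_hash case concrete, here is a minimal sketch for Varnish 3.0 that splits the cache by device type. The X-Device header is a hypothetical name; it is assumed to be set earlier (for example, in vcl_recv) to a value such as "mobile" or "desktop":

sub vcl_hash {
  # Add the (hypothetical) device type to the hash so that mobile and
  # desktop variants of the same URL are cached as separate objects.
  if (req.http.X-Device) {
    hash_data(req.http.X-Device);
  }
  # No return statement here, so the default vcl_hash still runs and
  # adds the URL and the host (or server IP) to the hash.
}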

Note

For Varnish 3.0, the VCL reference in the official Varnish documentation gives an overview of all VCL subroutines and shows the default configuration for each. The default VCL example in that documentation provides a flowchart showing how a request flows through the various VCL subroutines.

Customizing Subroutines

When you add code for any of the built-in subroutines in your VCL configuration, your code is prepended to the default code for that subroutine rather than replacing it. If your custom code terminates with an action (e.g., some sort of return statement), then the default won’t be run for that subroutine. There are sometimes good reasons to bypass the default routines, though generally it is best to only prepend to them by not including a return statement at the end of your code.

As an example, let’s write two custom vcl_recv subroutines with one difference: one will return at the end of the subroutine (bypassing the default vcl_recv stub), and one will not, meaning the default stub will also be executed. For starters, here is the default vcl_recv stub:

/*
 * Copyright (c) 2006 Verdens Gang AS
 * Copyright (c) 2006-2011 Varnish Software AS
 * All rights reserved.
 *
 * Author: Poul-Henning Kamp <phk@phk.freebsd.dk>
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE
 * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
 * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
 * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
 * OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
 * EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

sub vcl_recv {
  if (req.restarts == 0) {
    if (req.http.x-forwarded-for) {
      set req.http.X-Forwarded-For =
        req.http.X-Forwarded-For + ", " + client.ip;
    } else {
      set req.http.X-Forwarded-For = client.ip;
    }
  }
  if (req.request != "GET" &&
      req.request != "HEAD" &&
      req.request != "PUT" &&
      req.request != "POST" &&
      req.request != "TRACE" &&
      req.request != "OPTIONS" &&
      req.request != "DELETE") {
        /* Non-RFC2616 or CONNECT which is weird. */
        return (pipe);
  }
  if (req.request != "GET" && req.request != "HEAD") {
    /* We only deal with GET and HEAD by default */
    return (pass);
  }
  if (req.http.Authorization || req.http.Cookie) {
    /* Not cacheable by default */
    return (pass);
  }
  return (lookup);
}

Let’s walk through what the default code does. First, it adds the client’s IP address to the X-Forwarded-For header, appending to the header if it is already set. Next, it checks for nonstandard request types; if it encounters something unexpected, it will return pipe, meaning that the request will be piped directly to the backend instead of going through the normal request flow in Varnish. After that, if the request type is anything except GET or HEAD, Varnish will pass it through without caching (you wouldn’t want to cache a POST request, for example). In the final check, Varnish looks for an Authorization header or a Cookie header on the request. If either is present, the subroutine returns pass, meaning the request will be sent to the backend and the response won’t be cached. Finally, if none of those checks matched, the subroutine returns lookup, meaning Varnish will look for the item in its cache and, if it is not there, fetch it from the backend and attempt to cache the result.

For the sake of this example, let’s assume that you want to add one additional check to vcl_recv. Specifically, there is a URL, /update.php, which you want to tell Varnish never to cache. Seems reasonable enough. Let’s see how that would be handled in vcl_recv:

sub vcl_recv {
  # Match /update.php with or without a query string.
  if (req.url ~ "^/update\.php") {
    return (pass);
  }
}

Once you’ve loaded that code into Varnish, your custom vcl_recv will be run for incoming requests. In this case, if a request comes in for /update.php Varnish will return pass, meaning that it will bypass its cache for the request. Requests for any other URL on the site will fall through your subroutine and, because you did not include a return at the end of the subroutine, the default stub shown earlier will be executed as well. Compare that to the following:

sub vcl_recv {
  # Match /update.php with or without a query string.
  if (req.url ~ "^/update\.php") {
    return (pass);
  }
  return (lookup);
}

The code in this version includes return(lookup) in the subroutine. In this case, requests for URLs other than /update.php will be cached in Varnish and delivered to clients. The problem is that because the default stub is not run, it’s possible that you are caching an item even in cases where you probably shouldn’t (for example, if the request contains a session cookie).

The difference may seem minimal, but it can have quite an impact. It’s not “wrong” to override/bypass the default subroutines, but you should be aware when doing so, and be sure you understand the consequences. Some sites copy and paste the defaults below their custom code as a way to better visualize the code path (and not forget what the default stub is doing, even though it may be hidden behind the scenes).
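
As a sketch of that copy-and-paste approach, the following vcl_recv combines the /update.php rule with a copy of the Varnish 3.0 default logic shown earlier, so the full code path is visible in one place. Because it ends with an explicit return, the built-in stub is never reached, and keeping the copied portion in sync with your Varnish version is your responsibility:

sub vcl_recv {
  # Site-specific rule: never cache update.php.
  if (req.url ~ "^/update\.php") {
    return (pass);
  }

  # Everything below is copied from the default vcl_recv stub.
  if (req.restarts == 0) {
    if (req.http.x-forwarded-for) {
      set req.http.X-Forwarded-For =
        req.http.X-Forwarded-For + ", " + client.ip;
    } else {
      set req.http.X-Forwarded-For = client.ip;
    }
  }
  if (req.request != "GET" &&
      req.request != "HEAD" &&
      req.request != "PUT" &&
      req.request != "POST" &&
      req.request != "TRACE" &&
      req.request != "OPTIONS" &&
      req.request != "DELETE") {
    # Unknown request method: hand the connection to the backend.
    return (pipe);
  }
  if (req.request != "GET" && req.request != "HEAD") {
    # Only GET and HEAD are candidates for caching.
    return (pass);
  }
  if (req.http.Authorization || req.http.Cookie) {
    # Requests with credentials or cookies are not cacheable by default.
    return (pass);
  }
  return (lookup);
}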

Cookies and Varnish

In its default configuration, Varnish will not cache any request that has a cookie set. This means any logged-in user traffic will not be cached in Varnish, but it also means that any custom or contrib modules that set any type of cookie (session or otherwise) may cause Varnish cache misses. One common example of this is Google Analytics, which sets a tracking cookie for every visitor. With a default Varnish configuration, enabling Google Analytics would cause all page visits to miss the Varnish cache because of the cookie. That’s obviously not ideal behavior, so let’s take a look at how to modify the Varnish configuration to ignore certain cookies when deciding whether or not to serve a cached item.

The way to do this is to do a regular expression replacement (this is one of the few built-in functions available in Varnish) on the request cookie in order to strip out cookies that we know should be ignored as far as caching is concerned. Stripping out certain cookies should be dealt with in the vcl_recv subroutine, as that is where Varnish makes the request cookie object available. Consider this example:

sub vcl_recv {
  # Remove Google Analytics cookie.
  # These are all of the form "__utm[a-z]=<value>".
  set req.http.cookie = regsuball(req.http.cookie,
                                  "(^|;\s*)__utm[a-z]=[^;]*", "");
  # Remove a ";" prefix, if present.
  set req.http.cookie = regsub(req.http.cookie, "^;\s*", "");
  # Remove the cookie if it is now empty or contains only spaces.
  if (req.http.cookie ~ "^\s*$") {
    unset req.http.cookie;
  }
}

The preceding code will allow Varnish to cache pages even if a Google Analytics cookie is present. If you have multiple cookies you want to remove, simply add additional calls to regsuball to strip out known cookies that don’t affect caching.
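
For instance, Drupal sets a has_js cookie from JavaScript, and it is commonly treated as irrelevant to page caching. A sketch that strips it alongside the Google Analytics cookies could look like this; which cookies are safe to ignore is an assumption you should verify for your own site:

sub vcl_recv {
  if (req.http.Cookie) {
    # Remove Google Analytics cookies ("__utm[a-z]=<value>").
    set req.http.Cookie = regsuball(req.http.Cookie,
                                    "(^|;\s*)__utm[a-z]=[^;]*", "");
    # Remove the has_js cookie (assumed safe to ignore for caching).
    set req.http.Cookie = regsuball(req.http.Cookie,
                                    "(^|;\s*)has_js=[^;]*", "");
    # Remove a leading ";" left behind by the replacements.
    set req.http.Cookie = regsub(req.http.Cookie, "^;\s*", "");
    # Drop the header entirely if nothing is left.
    if (req.http.Cookie ~ "^\s*$") {
      unset req.http.Cookie;
    }
  }
}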

Caching for Authenticated Users

We mentioned in the previous section that by default, requests from logged-in users will not be cached by Varnish. For many cases, that’s actually the preferred behavior. For example, you wouldn’t want to cache per-user or per-role page customizations and then serve those cached items to an anonymous user. However, there are some files that remain static for all requests, such as image, JavaScript, and CSS files. There is no reason not to cache those in Varnish and serve them for any request, regardless of whether or not the user is logged in.

There are a couple of different approaches to solve this particular issue. One option is to create a list of file extensions that should always be cached. The second option makes the assumption that any files served out of the sites/ subdirectory can be cached regardless of whether the user is logged in. Either way, the implementation is very similar: add a check in vcl_recv, and if the check is met, unset any cookies that might be present and return lookup. return(lookup) will return an item from the cache or, if it’s not present, fetch it from the backend and store it in the cache for future requests. Here’s a VCL example for serving cached items for common “static” file extensions:

sub vcl_recv {
  if (req.url ~ "\.(js|css|jpg|jpeg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf)$") {
    unset req.http.Cookie;
    return (lookup);
  }
}
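
The second option, caching anything served from the sites/ subdirectory, can be sketched the same way. This assumes that everything under /sites/ is a public, static file; the check on the request method is an added precaution, since returning lookup here skips the default vcl_recv checks:

sub vcl_recv {
  # Only GET and HEAD requests for files under /sites/ are affected.
  if ((req.request == "GET" || req.request == "HEAD") &&
      req.url ~ "^/sites/") {
    # Drop cookies so logged-in users share the same cached copy.
    unset req.http.Cookie;
    return (lookup);
  }
}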

Edge-Side Includes

There are many cases where it would be possible (and ideal!) to cache a page in Varnish for authenticated users, but where there is some small amount of personalized content in that page that can’t be shared between users. Edge-side includes (ESI) provides a way to work around this problem by referencing the personalized data in a separate ESI tag. The full page without the personalized content can then be cached in Varnish, and that content can be pulled in separately and integrated with the cached content.

Imagine the simple case of a logged-in user block that displays the user’s name. This is something you wouldn’t want to cache and serve to other users, for obvious reasons. However, if the rest of the page contents are not user-specific, then Varnish could cache the entire page and dynamically pull in an ESI block containing the user-specific block content.
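
On the Varnish 3.0 side, ESI processing is switched on per response in vcl_fetch. The following is a minimal sketch, not a complete integration: it assumes the backend emits a tag such as <esi:include src="/esi/user-block"/> in its HTML (the path is hypothetical), and it enables ESI for all HTML responses, which a real deployment would likely narrow down:

sub vcl_fetch {
  # Ask Varnish to parse HTML responses for ESI tags. Each <esi:include>
  # it finds is fetched as a separate request, with its own caching rules.
  if (beresp.http.Content-Type ~ "text/html") {
    set beresp.do_esi = true;
  }
}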

ESI is a very powerful tool that can greatly increase your cache hit rate by allowing much more of the site to be cached for authenticated users. However, misused or misconfigured, ESI could have the opposite effect and greatly reduce your site’s client-side performance, so be sure to thoroughly test any deployment. We could easily fill an entire chapter (or book!) discussing ESI, but because of space constraints, we won’t go into it in more depth here.

Note

The Drupal ESI module provides example documentation for how to integrate Drupal ESI into Varnish.

Serving Expired Content

Sometimes it makes sense to serve cached content that has already expired. This isn’t quite as bad as it sounds; cached content is not like expired milk, and certainly smells better. Actually, there are a couple of good reasons that you might want to serve expired content:

The backend server is down or unreachable.
Varnish has sent a request to the backend for an object, but that request is slow to process on the backend. Meanwhile, Varnish could serve an old version of that same object to any incoming requests for the same object.

Both of these situations are handled by setting a grace period during which objects remain available in Varnish after they have expired. This is set using the req.grace variable in vcl_recv, and you’ll also need to set beresp.grace in vcl_fetch. req.grace controls how far past its expiration an object may be and still be served for the current request. beresp.grace sets the maximum grace time allowed for an object, controlling how long after expiration it is kept in the cache before being removed. Consider the following VCL snippet:

sub vcl_recv {
  if (req.backend.healthy) {
    set req.grace = 20s;
  } else {
    set req.grace = 30m;
  }
}

sub vcl_fetch {
  set beresp.grace = 30m;
}

In vcl_recv, we check if the backend is healthy (health is based on backend probes or recent failed backend requests). If the backend is healthy, a grace period of 20 seconds is used for requests—this applies to cases where a cache item has expired and a new request has been sent to the backend for the updated object. In this case, any subsequent requests for the item will be served the expired content while the backend request is waiting to complete. On the other hand, if the backend is considered sick, the grace time is increased to 30 minutes. This allows Varnish to serve content up to 30 minutes past its expiration time, allowing time for the backend server to recover without taking the website entirely offline. After the grace period has run out, Varnish will return to the default behavior of fetching from the backend—in the case of a backend server downtime, this likely means Varnish will start returning errors after the grace period has expired.

The beresp.grace setting in vcl_fetch should simply reflect the maximum time that you use for req.grace, which is 30 minutes in this case.

Error Pages

More likely than not, you’ve seen a default Varnish error page. It probably was pretty ugly and would seem quite confusing to regular users of your website. Thankfully, the default error pages are quite easy to customize within the vcl_error subroutine. All you need to do is use the synthetic keyword to define an HTML document to output for errors, and then have Varnish deliver that document. Here’s how this is achieved in the default VCL:

sub vcl_error {
  set obj.http.Content-Type = "text/html; charset=utf-8";
  set obj.http.Retry-After = "5";
  synthetic {"
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>"} + obj.status + " " + obj.response + {"</title>
  </head>
  <body>
    <h1>Error "} + obj.status + " " + obj.response + {"</h1>
    <p>"} + obj.response + {"</p>
    <h3>Guru Meditation:</h3>
    <p>XID: "} + req.xid + {"</p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>
"};
  return (deliver);
}

Notice how the synthetic keyword is simply passed a long string containing HTML output. This can be easily overridden with your own HTML, but ideally this would not include external CSS or images since your backend may be down when this error page is served. Be sure to add the return (deliver) at the end of your custom vcl_error so that the default vcl_error isn’t used.
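
As a sketch of such an override, the following vcl_error serves a self-contained page with inline styling only, so nothing needs to be fetched from the backend. The wording and markup are placeholders, and the single retry on a 503 is an optional assumption rather than a required part of the technique:

sub vcl_error {
  # Optionally retry a failed backend request once before giving up.
  if (obj.status == 503 && req.restarts == 0) {
    return (restart);
  }
  set obj.http.Content-Type = "text/html; charset=utf-8";
  set obj.http.Retry-After = "5";
  # Placeholder page: no external CSS, JavaScript, or images.
  synthetic {"<!DOCTYPE html>
<html>
  <head>
    <title>Site temporarily unavailable</title>
  </head>
  <body style="font-family: sans-serif; text-align: center;">
    <h1>We will be right back</h1>
    <p>The site is temporarily unavailable. Please try again shortly.</p>
    <p>Error "} + obj.status + {" (request ID "} + req.xid + {")</p>
  </body>
</html>
"};
  # Returning here prevents the default vcl_error from running.
  return (deliver);
}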

Memory Allocation

Varnish provides two stable methods for memory allocation: file-backed, or completely in-memory. With file-backed memory allocation, Varnish depends on the OS memory caching/paging system to keep recently used items in RAM; with in-memory allocation (“malloc”), Varnish claims a dedicated chunk of RAM and stores all cache items there. There are trade-offs to either option. Using file-backed storage means you can generally have a much larger cache, but certain cache items may be slower than others if they end up getting paged out to disk by the operating system. In the case of malloc, cached items are guaranteed to be in RAM, but if you are caching a lot of items, you may need more space than you have available RAM.

You will nearly always get better performance when using malloc. For that reason, we recommend using malloc except in the cases where you need to cache more than you have space for in RAM (and are unable to add more RAM). That said, Varnish is actually very smart about how it caches to disk: it relies on the operating system’s file cache, and frequently used cache items end up in the FS cache under optimal circumstances.

In order to figure out how much RAM is required for your website, it’s easiest to simply allocate something like 512 MB or 1 GB, and then let Varnish run for a while. Monitor the memory usage using top and varnishstat, specifically looking at the SMA bytes allocated output of varnishstat. If you see that the cache has filled up or are seeing many nuked objects, then you should increase the memory allocation.

Note

Varnish will use a bit more memory (or file system space) than allocated, due to internal overhead. When setting the amount of memory to use, you are only limiting the size of the cache; the overhead size is not configurable.

Logging and Monitoring Varnish

Once you have Varnish in place, there are a number of tools you can use to log requests and monitor hit rates, usage, and other information:

varnishncsa: Many administrators like to have a simple request log (much like Apache’s access log) for tracking all requests handled by Varnish. Varnish ships with the varnishncsa daemon, which provides just that. Simply start the varnishncsa service (command-line options include a file to output to), and it will start logging.
varnishstat: Run without any options, varnishstat will provide a continuously updated snapshot of statistics. This will give you an idea of your current cache hit rate, cache usage, and other request and backend statistics. If you run varnishstat with the -1 flag, it will output all statistics once and exit; this is useful for capturing the output to a file.
varnishhist: This will provide you with a visual representation of how long requests take to be served. Items graphed toward the left side are served faster than those on the right side. This can help give you an idea of your cache hit rate and help you spot any outliers that take an unusual amount of time.
varnishlog: varnishlog reads from Varnish’s shared memory log and outputs detailed information about each request handled. You have the option of filtering the output to certain requests based on things like the request URL, which is almost always how this utility is used because otherwise it produces far too much information.

All of these tools are useful for tracking how well Varnish is caching your site’s content, especially when you are making changes to your VCL file and want to see the effects. When troubleshooting VCL issues, a good approach is to watch cache hits and misses with varnishstat and then track down the misses using varnishlog or by inspecting the request/response headers in your browser.
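
To make the hit-rate check concrete, here is a small Python sketch that computes the cache hit ratio from output captured with varnishstat -1. The counter names cache_hit and cache_miss are assumed from typical Varnish output (newer versions may prefix them with MAIN.), and the sample figures are invented.

```python
def hit_ratio(stat_text):
    """Return hits / (hits + misses) from `varnishstat -1` text, or None."""
    hits = misses = 0
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        # Drop any "MAIN." style prefix so both naming schemes are handled.
        name = fields[0].split(".")[-1]
        if name == "cache_hit":
            hits = int(fields[1])
        elif name == "cache_miss":
            misses = int(fields[1])
    total = hits + misses
    return hits / total if total else None


# Invented sample output; cache_hitpass is ignored by the exact name match.
SAMPLE = """\
cache_hit                   9000         1.50 Cache hits
cache_hitpass                 20         0.00 Cache hits for pass
cache_miss                  1000         0.17 Cache misses
"""

print(hit_ratio(SAMPLE))
```

Comparing the ratio before and after a VCL change gives a quick indication of whether the change helped.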

Sample VCL for Drupal

To wrap up our discussion of Varnish, let’s take a look at a sample VCL file that can be used on a Drupal site with just a few configuration changes—you will need to adjust the backend declaration to point to your web server, and add additional backend definitions and a director if you have more than one backend server.

There are a few things included in this VCL file that weren’t covered in this chapter:

In vcl_deliver, we add response headers to track whether or not the item returned was cached. In the case of a cache hit, we also add a header with the number of cache hits for that particular item. This is very useful for tracking your hits and misses, especially when first setting up Varnish in a new environment.
There is a list of file extensions in vcl_recv that we always want to cache, so we unset any cookies for these requests. This same list is duplicated in vcl_fetch so that if the backend attempts to set a new cookie with the response, that Set-Cookie will be caught and dropped by Varnish, ensuring that the item will be cached. The important thing to note here is that if you edit the list in vcl_recv, you should update the list in vcl_fetch to match.
vcl_fetch includes a check for a few different status codes (404, 301, 500), which correspond to page not found, moved permanently, and internal server error. By default, the backend will return these with a TTL of 0, so they won’t be cached by Varnish. But because these requests can cause a full Drupal bootstrap and database queries, it’s beneficial to cache the responses for some amount of time. In this example, we set the TTL to 10 minutes so that Varnish will keep the responses in the cache.

Our sample Drupal VCL file looks like this:

# Sample VCL based on VCL created by Four Kitchens, available at
# https://fourkitchens.atlassian.net/wiki/display/TECH/Configure+Varnish+3+for+Drupal+7
/*
 * Copyright (c) 2013 Four Kitchens
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE
 * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
 * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
 * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
 * OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
 * EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

# Default backend definition.  Set this to point to your content server.
backend default {
  .host = "127.0.0.1";
  .port = "81";
}

sub vcl_recv {
  # Use anonymous, cached pages if all backends are down.
  if (!req.backend.healthy) {
    unset req.http.Cookie;
  }

  # Allow Varnish to serve up stale content if the backend is responding slowly.
  set req.grace = 6h;

  # Do not cache these paths.
  if (req.url ~ "^/status\.php$" ||
      req.url ~ "^/update\.php$" ||
      req.url ~ "^/admin$" ||
      req.url ~ "^/admin/.*$" ||
      req.url ~ "^/flag/.*$" ||
      req.url ~ "^.*/ajax/.*$" ||
      req.url ~ "^.*/ahah/.*$") {
    return (pass);
  }

  # Always cache the following file types for all users. This list of extensions
  # appears twice, once here and again in vcl_fetch, so make sure you edit both
  # and keep them equal.
  if (req.url ~
    "(?i)\.(pdf|txt|doc|xls|ppt|csv|png|gif|jpeg|jpg|ico|swf|css|js)(\?.*)?$") {
    unset req.http.Cookie;
  }

  # Remove all cookies that Drupal doesn't need to know about. We explicitly
  # list the ones that Drupal does need, the SESS and NO_CACHE cookies. If after
  # running this code we find that either of these two cookies remains, we
  # will pass as the page shouldn't be cached.
  if (req.http.Cookie) {
    # Append a semicolon to the front of the cookie string.
    set req.http.Cookie = ";" + req.http.Cookie;

    # Remove all spaces that appear after semicolons.
    set req.http.Cookie = regsuball(req.http.Cookie, "; +", ";");

    # Match the cookies we want to keep, adding back the space we removed
    # previously. "\1" is the first matching group in the regular expression.
    set req.http.Cookie = regsuball(req.http.Cookie,
                          ";(SESS[a-z0-9]+|SSESS[a-z0-9]+|NO_CACHE)=", "; \1=");

    # Remove all other cookies, identifying them by the fact that they have
    # no space after the preceding semicolon.
    set req.http.Cookie = regsuball(req.http.Cookie, ";[^ ][^;]*", "");

    # Remove all spaces and semicolons from the beginning and end of the
    # cookie string.
    set req.http.Cookie = regsuball(req.http.Cookie, "^[; ]+|[; ]+$", "");

    if (req.http.Cookie == "") {
      # If there are no remaining cookies, remove the cookie header
      # so that Varnish will cache the request.
      unset req.http.Cookie;
    }
    else {
      # If there are any cookies left (a session or NO_CACHE cookie), do not
      # cache the page. Pass it on to the backend directly.
      return (pass);
    }
  }
}

sub vcl_deliver {
  # Set a header to track if this was a cache hit or miss.
  # Include hit count for cache hits.
  if (obj.hits > 0) {
    set resp.http.X-Varnish-Cache = "HIT";
    set resp.http.X-Varnish-Hits = obj.hits;
  }
  else {
    set resp.http.X-Varnish-Cache = "MISS";
  }
}

sub vcl_fetch {
  # Responses with these status codes wouldn't be cached by default,
  # but caching them briefly saves some Drupal overhead.
  if (beresp.status == 404 || beresp.status == 301 || beresp.status == 500) {
    set beresp.ttl = 10m;
  }

  # Don't allow static files to set cookies.
  # This list of extensions appears twice, once here and again in vcl_recv, so
  # make sure you edit both and keep them equal.
  if (req.url ~
    "(?i)\.(pdf|txt|doc|xls|ppt|csv|png|gif|jpeg|jpg|ico|swf|css|js)(\?.*)?$") {
    unset beresp.http.set-cookie;
  }

  # Allow items to be stale if needed, in case of problems with the backend.
  set beresp.grace = 6h;
}

sub vcl_error {
  # In the event of an error, show friendlier messages.
  set obj.http.Content-Type = "text/html; charset=utf-8";
  synthetic {"
<html>
<head>
  <title>Page Unavailable</title>
  <style>
    body { background: #303030; text-align: center; color: white; }
    .error { color: #222; }
  </style>
</head>
<body>
  <div id="page">
    <h1 class="title">Page Unavailable</h1>
    <p>The page you requested is temporarily unavailable.</p>
    <p>Please try again later.</p>
    <div class="error">(Error "} + obj.status + " " + obj.response + {")</div>
  </div>
</body>
</html>
"};
  return (deliver);
}
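
The cookie handling in vcl_recv above is easier to follow when traced step by step outside Varnish. The Python sketch below mirrors the same sequence of regular expression substitutions so you can try it against sample Cookie headers; it is only an illustration of the logic, not part of the VCL, and the function name filter_cookies and the sample cookie values are made up for this example.

```python
import re

# Cookies Drupal needs to see: session cookies and NO_CACHE.
KEEP = r";(SESS[a-z0-9]+|SSESS[a-z0-9]+|NO_CACHE)="


def filter_cookies(cookie_header):
    """Return the Cookie header with everything but the kept cookies removed."""
    c = ";" + cookie_header                   # prepend a semicolon
    c = re.sub(r"; +", ";", c)                # drop spaces after semicolons
    c = re.sub(KEEP, r"; \1=", c)             # mark kept cookies with a space
    c = re.sub(r";[^ ][^;]*", "", c)          # remove the unmarked cookies
    c = re.sub(r"^[; ]+|[; ]+$", "", c)       # trim leftover separators
    return c


# Only the session cookie survives, so Varnish would pass this request.
print(repr(filter_cookies("has_js=1; _ga=GA1.2.3; SESSabc123=xyz")))

# Nothing survives, so Varnish would unset the header and use the cache.
print(repr(filter_cookies("has_js=1; _ga=GA1.2.3")))
```

An empty result corresponds to the branch in vcl_recv that unsets req.http.Cookie, while a nonempty result corresponds to return (pass).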

Content Delivery Networks

CDNs can be used either in place of or in addition to a reverse proxy. As we mentioned in the introduction to this chapter, a CDN can dramatically increase the speed of your website, not only by caching your content, but also by dispersing that content geographically and making it available on a fast network link in order to optimize performance for visitors from all over the world.

In the simplest configuration, a CDN can be set up to serve all static content from your site. However, CDNs are capable of doing much more: for example, they can handle all traffic to your website (imagine pointing your website’s domain to a CDN server instead of it pointing to your web server) and even potentially handle SSL requests, which is something that Varnish can’t do.

Serving Static Content Through a CDN

The easiest (and cheapest!) way to integrate a CDN with your site is to use the CDN as a static cache store. Generally this means storing images, JavaScript, and CSS on the CDN, but serving all page requests from your own servers. When a request comes in for a page on your site, the request is handled by your web server, but all static content is referenced with a URL that points to the CDN server, so clients will fetch all of that content from the CDN server(s) directly. When a client requests content that is missing or expired on the CDN, the CDN fetches it directly from the backend server(s) to update its cache. The CDN Drupal module makes this configuration very easy by automatically rewriting URLs for you.

In the case of a CDN set up to pull content, rewriting the URLs is all that is needed because the CDN will automatically request items from your web server if it doesn’t have them in its cache. Another type of CDN is a push CDN, where you must manually upload content to the CDN servers before it will be served. The Drupal CDN module also handles such CDNs (with additional configuration), though pull-based CDNs are much more common.
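
The URL rewriting idea for a pull CDN can be sketched in a few lines of Python. This is not how the CDN module is implemented; it is only a minimal illustration, and the hostname cdn.example.com, the extension list, and the function name to_cdn_url are assumptions made for the example.

```python
import re
from urllib.parse import urlsplit

CDN_HOST = "cdn.example.com"  # hypothetical CDN hostname

# Static file types to serve from the CDN (an illustrative subset).
STATIC_EXT = re.compile(r"\.(png|gif|jpeg|jpg|ico|css|js)$", re.IGNORECASE)


def to_cdn_url(url, cdn_host=CDN_HOST):
    """Point static asset URLs at the CDN host; leave page URLs untouched."""
    parts = urlsplit(url)
    if STATIC_EXT.search(parts.path):
        # Swap only the host; path and query string stay the same, so the
        # CDN can request the identical path from the origin on a miss.
        return parts._replace(netloc=cdn_host).geturl()
    return url


print(to_cdn_url("http://www.example.com/sites/default/files/logo.png?v=2"))
print(to_cdn_url("http://www.example.com/node/1"))
```

Because the path is preserved, a pull CDN needs no further setup beyond knowing which origin server to fetch from.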

When to Use a CDN

CDNs can provide an amazing performance boost to your site with very little configuration overhead. Cached items are served faster to visitors, and load is reduced on your servers as more requests are dealt with by the CDN. In general, if you can afford the cost of a CDN, then there is no reason not to use one. While the cost for larger sites and those needing special features can grow quite large, there are many affordable CDN providers for small- to medium-sized sites.

Choosing Between a CDN and a Reverse Proxy

There is actually no reason that this needs to be an either/or decision. If you have a reverse proxy in place, you will still see benefits from adding a CDN. The caches can layer well, and any special request handling that needs to happen can easily be configured with some custom headers passed on by the CDN and handled in the reverse proxy (or vice versa).


High Performance Drupal by Jeff Sheltren, Narayan Newton, Nathaniel Catchpole
