Web Caching by Duane Wessels

Proxy Auto-Configuration Script

The proxy auto-configuration (PAC) technique is designed to fix many of the manual configuration problems described previously. Instead of using static proxy addresses, the browser executes a function for every request. This function returns a list of proxy addresses that the browser tries until the request is successfully forwarded.

The PAC function is written in JavaScript. In theory, any browser that supports JavaScript can also support PAC. Netscape invented the PAC feature, and it was first available in Version 2 of their browser. Microsoft added PAC support to MSIE Version 3.

Both the Netscape and Microsoft browsers retrieve the PAC script from a URL, and this requirement is perhaps the biggest drawback to proxy auto-configuration: either someone must enter the URL in a pop-up window, or the browser must be preconfigured with it.

The best thing about proxy auto-configuration is that it allows administrators to reconfigure the browsers without further intervention from the users. If the proxy address changes, the administrator simply edits the PAC script to reflect the change. The browsers fetch the PAC URL every time they are started, but apparently not while the browser is running, unless the user forces a reload.

Another very nice feature is failure detection, coupled with the ability to specify multiple proxy addresses. If the first proxy in the list is not available, the browser tries the next entry, and so on until the end of the list. Failure is detected when the browser receives a Connection Refused error or a timeout during connection establishment.

The PAC script provides greater flexibility than a manual configuration. Instead of forwarding all HTTP requests to a single proxy address, the auto-configuration script can select a proxy based on the URI. As mentioned in Section 2.2.7, objects with “cgi” in the URI are usually not cachable. A proxy auto-configuration script can detect these requests and forward them directly to origin servers. Unfortunately, the auto-configuration script is not given the request method, so we cannot have similar checks for POST and PUT requests, which are also rarely cachable.
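
As a minimal sketch of such a check, the function below sends any request with “cgi” in the URI directly to the origin server and everything else to a proxy. The proxy address proxy.example.com:3128 is a placeholder, not a name from this chapter, and the test uses the ordinary JavaScript indexOf() string method rather than one of the PAC built-in functions:

```javascript
// Sketch only: proxy.example.com:3128 is a placeholder proxy address.
function FindProxyForURL(url, host)
{
  // URIs containing "cgi" are usually not cachable, so bypass the proxy.
  if (url.indexOf("cgi") != -1)
    return "DIRECT";
  // Everything else goes to the proxy, falling back to a direct connection.
  return "PROXY proxy.example.com:3128; DIRECT";
}
```

Note that the function sees only the URL and hostname, so it has no way to distinguish a GET from a POST for the same URI.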

The PAC script can also be used creatively to implement load sharing. Without a PAC script, one way to do load sharing is by entering multiple address records for the cache’s hostname in the DNS. This results in a somewhat random scheme. There is no guarantee that the same URI always goes to the same proxy cache. If possible, we would rather have requests for the same URIs always going to the same caches, to maximize our hit ratios. We can accomplish this by writing a function that always returns the same proxy list for a given URI. An example of this is shown later in Example 4-1.

Writing a Proxy Auto-Configuration Function

The proxy auto-configuration function is named FindProxyForURL() and has two arguments: url and host.[20] The return value is a string specifying how to forward the request. The return string is one or more of the following, separated by semicolons:

PROXY host:port
SOCKS host:port
DIRECT

For example:

"PROXY proxy.web-cache.net:3128; DIRECT;
 SOCKS socks.web-cache.net:1080;"
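
To make the structure of the return string concrete, here is a rough sketch that splits it into an ordered list of entries. The parseProxyList() helper is hypothetical and written only for illustration; it is not how any particular browser parses the string:

```javascript
// Hypothetical helper, not part of any browser API: splits a PAC return
// string into an ordered list of { type, address } entries.
function parseProxyList(s)
{
  var entries = [];
  var parts = s.split(";");
  for (var i = 0; i < parts.length; i++) {
    var item = parts[i].trim();
    if (item == "")
      continue;                       // empty piece after a trailing semicolon
    var fields = item.split(/\s+/);   // "PROXY host:port" -> ["PROXY", "host:port"]
    entries.push({ type: fields[0], address: fields[1] || null });
  }
  return entries;
}

var list = parseProxyList(
  "PROXY proxy.web-cache.net:3128; DIRECT; SOCKS socks.web-cache.net:1080;");
```

For the string shown above, list holds three entries in order: the PROXY entry, then DIRECT (with no address), then the SOCKS entry. The browser works through them in that order.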

When writing FindProxyForURL(), you may want to use some of the built-in functions for analyzing the URL. The most useful ones are described here. For the full details, see Netscape’s PAC documentation at http://home.netscape.com/eng/mozilla/2.0/relnotes/demo/proxy-live.html.

  • The isPlainHostName(host) function returns true if host is a single-component hostname rather than a fully qualified domain name. If host contains any periods, this function returns false.

    Many PAC scripts are written so that requests with plain hostnames are sent directly to the origin server. It’s likely that such a request refers to an internal server, which probably doesn’t benefit from caching anyway. Also, the caching proxy may not be able to resolve unqualified hostnames, depending on how the proxy is configured.

  • The dnsDomainIs(host, domain) function returns true if host is a member of domain. For example, foo.bar.com is a member of bar.com, whereas www.foobar.com is not.

  • The isResolvable(host) function returns true if a DNS lookup for host results in an IP address. This function allows the browser, instead of the proxy, to generate error pages for invalid hostnames. When the browser generates an error message rather than the proxy, users are less likely to complain that the proxy cache is broken. Fewer complaints, of course, mean fewer headaches for your support staff.

  • The shExpMatch(string, pattern) function performs Unix shell-style pattern matching on string. For example, to match URLs that end with .cgi, you can write:

    shExpMatch(url, "*.cgi")

    To match the request protocol at the beginning of the URL, use:

    shExpMatch(url, "ftp:*")
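
If you want to try a PAC script outside a browser, you need stand-ins for these built-in functions. The versions below are a sketch that follows the descriptions in this list; they are assumptions written for illustration, not the browsers' own code, and real implementations may differ in detail. isResolvable() is omitted because it requires a DNS lookup:

```javascript
// Illustrative stand-ins only; browsers supply their own implementations.
function isPlainHostName(host)
{
  // A plain hostname contains no periods.
  return host.indexOf(".") == -1;
}

function dnsDomainIs(host, domain)
{
  // True if host equals domain or ends with "." + domain, so that
  // foo.bar.com matches bar.com but www.foobar.com does not.
  var h = host.toLowerCase();
  var d = domain.toLowerCase().replace(/^\./, "");
  return h == d || h.slice(-(d.length + 1)) == "." + d;
}

function shExpMatch(string, pattern)
{
  // Escape regular expression metacharacters, then translate the
  // shell wildcards * and ? into their regular expression forms.
  var re = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&")
                  .replace(/\*/g, ".*")
                  .replace(/\?/g, ".");
  return new RegExp("^" + re + "$").test(string);
}
```

With these definitions loaded, you can call a FindProxyForURL() function from an ordinary JavaScript interpreter and inspect the string it returns.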

Some sample FindProxyForURL() functions are given in the next section.

The PAC script must be placed on a web server, and the server must be configured to return a specific MIME Content-type header in the response. If Content-type is not set to application/x-ns-proxy-autoconfig, browsers do not recognize it as a proxy auto-configuration script. Generally, administrators name the PAC script with a .pac extension and then instruct the HTTP server to return the desired Content-type for all URIs with that extension. With Apache, you can add this line to srm.conf:

AddType application/x-ns-proxy-autoconfig .pac

Sample PAC Scripts

First, let’s look at a very simple proxy auto-configuration script that returns a single proxy address for all HTTP and FTP requests. For all other requests, it instructs the browser to forward the request directly to the origin server:

function FindProxyForURL(url, host)
{
  if (shExpMatch(url, "http:*"))
    return "PROXY proxy.isp.net:8080";
  if (shExpMatch(url, "ftp:*"))
    return "PROXY proxy.isp.net:8080";
  return "DIRECT";
}

Now, let’s look at a more complicated example for a company with a firewall. We want to forward all internal requests directly and all external requests via the firewall proxy. First, we look for internal hosts. These are single-component hostnames or fully qualified hostnames inside our domain (company.com). We use the isResolvable() trick so error messages for invalid hostnames come directly from the browser instead of the proxy. This trick works only if the internal hosts can look up addresses for external hosts:

function FindProxyForURL(url, host)
{
  if (isPlainHostName(host))
    return "DIRECT";
  if (dnsDomainIs(host, "company.com"))
    return "DIRECT";
  if (!isResolvable(host))
    return "DIRECT";
  return "PROXY proxy.company.com:8080";
}

Next, let’s see how you can use a proxy auto-configuration script for load sharing and redundancy. Three methods are commonly used for sharing the load between a set of N caches. One simple approach is to assign N IP addresses to a single hostname. While this spreads the load, it has the undesirable effect of randomizing mappings from requests to caches. It is better to have the same request always sent to the same cache. A hash function accomplishes this effect. A hash function takes a string (e.g., a URL) as input and returns an integer value. Given the same input, a hash function always returns the same value. We apply the modulo operator to the hash result to select from the N caches. This scheme works well, but the mappings change entirely when additional caches are added. The final technique is to use some aspect of the URL, such as the domain name or perhaps the filename extension. For example, .com requests can be sent to one cache and all other domains to another cache. Depending upon the incoming requests, this approach might result in significantly unbalanced load sharing, however.

Example 4-1 uses the hash function technique. We have four caches, and the hash function is simply the length of the URL, modulo four. Furthermore, for redundancy, we return multiple proxy addresses. If the first is unavailable, the browser tries the second. This failover breaks the partitioning scheme, but the users are more likely to get service.

Example 4-1. Sample PAC Script with Hashing

var N = 4;

function FindProxyForURL(url, host)
{
    var i = url.length % N;
    if (i == 0)
        return "PROXY a.proxy.company.com:8080; "
             + "PROXY b.proxy.company.com:8080; "
             + "DIRECT";
    else if (i == 1)
        return "PROXY b.proxy.company.com:8080; "
             + "PROXY c.proxy.company.com:8080; "
             + "DIRECT";
    else if (i == 2)
        return "PROXY c.proxy.company.com:8080; "
             + "PROXY d.proxy.company.com:8080; "
             + "DIRECT";
    else if (i == 3)
        return "PROXY d.proxy.company.com:8080; "
             + "PROXY a.proxy.company.com:8080; "
             + "DIRECT";
}
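
The URL length is a very coarse hash, since many different URLs share the same length. As a sketch of one possible variation (not the method used in Example 4-1), the script below hashes the characters of the URL and builds the failover list from an array. The proxy names are those of Example 4-1; urlHash() and the proxies array are hypothetical names introduced here:

```javascript
// Proxy names as in Example 4-1; urlHash() is a hypothetical helper.
var proxies = [
  "a.proxy.company.com:8080",
  "b.proxy.company.com:8080",
  "c.proxy.company.com:8080",
  "d.proxy.company.com:8080"
];

function urlHash(s)
{
  // Simple polynomial hash over the character codes, reduced at each
  // step so the value stays small and non-negative.
  var h = 0;
  for (var i = 0; i < s.length; i++)
    h = (h * 31 + s.charCodeAt(i)) % 65521;
  return h;
}

function FindProxyForURL(url, host)
{
  var n = proxies.length;
  var i = urlHash(url) % n;
  // Primary cache first, then the next cache for failover, then DIRECT.
  return "PROXY " + proxies[i] + "; "
       + "PROXY " + proxies[(i + 1) % n] + "; "
       + "DIRECT";
}
```

As with any modulo-based scheme, the same URL always maps to the same cache, but the mappings still change when a cache is added to or removed from the array.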

Setting the Proxy Auto-Configuration Script

Once a PAC script has been written and placed on a server, configuring a browser to use it is relatively simple. All you need to do is enter the PAC script URL in the appropriate configuration window for your browser.

For Netscape’s browser, set the proxy auto-configuration URL in one of the same windows used for manual proxy configuration. Start by selecting Edit → Preferences… from the main menu bar. In the Preferences window, click on the small triangle next to Advanced and select Proxies. Select the “Automatic proxy configuration” option and enter the URL as shown in Figure 4-5.

Figure 4-5. Netscape Navigator proxy configuration window

If you’re using Microsoft Internet Explorer, select View → Internet Options from the main menu bar. Select the Connection tab; the window shown in Figure 4-1 appears. Again, click on LAN Settings… and you’ll see the window in Figure 4-2. At the top is a subwindow titled Automatic configuration. If you select “Automatically detect settings,” Explorer will try to use WPAD, which we’ll talk about next. To use a PAC script, select “Use automatic configuration script” and enter its URL in the Address box.

Normally, browsers read the PAC URL only at startup. Thus, if you change the PAC script, users do not get the changes until they exit and restart their browser. Users can force a reload of the PAC script at any time by going to the proxy auto-configuration window and clicking on the Reload or Refresh button. Unfortunately, the Netscape browser does not obey the Expires header for PAC replies. That is, you cannot make Netscape Navigator reload the PAC script by providing an expiration time in the response.

Organizations with hundreds or even thousands of desktop systems may want to preconfigure browsers with a PAC URL. One way to accomplish this is to use a special kit from the manufacturer that allows you to distribute and install specially configured browsers. Microsoft calls theirs the Internet Explorer Administration Kit. Netscape’s is the Client Customization Kit, but it works only on the Microsoft Windows and Macintosh versions of Netscape Navigator. Both kits are available for download at no charge.[21]



[20] Of course, the url includes host, but it has been extracted for your convenience.

[21] You can download the Internet Explorer Administration Kit from http://www.microsoft.com/windows/ieak/en/download/default.asp, and Netscape’s Client Customization Kit is available for download at http://home.netscape.com/download/cck.html.
