The proxy auto-configuration (PAC) technique is designed to fix many of the manual configuration problems described previously. Instead of using static proxy addresses, the browser executes a function for every request. This function returns a list of proxy addresses that the browser tries until the request is successfully forwarded.
The PAC function is written in JavaScript. In theory, any browser that supports JavaScript can also support PAC. Netscape invented the PAC feature, and it was first available in Version 2 of their browser. Microsoft added PAC support to MSIE Version 3.
Both the Netscape and Microsoft browsers retrieve the PAC script as a URL. This is perhaps the biggest drawback to proxy auto-configuration. Setting the PAC URL requires someone to enter the URL in a pop-up window, or the browser must be preconfigured with the URL.
The best thing about proxy auto-configuration is that it allows administrators to reconfigure the browsers without further intervention from the users. If the proxy address changes, the administrator simply edits the PAC script to reflect the change. The browsers fetch the PAC URL every time they are started, but apparently not while the browser is running, unless the user forces a reload.
Another very nice feature is failure detection, coupled with the ability to specify multiple proxy addresses. If the first proxy in the list is not available, the browser tries the next entry, and so on until the end of the list. Failure is detected when the browser receives a Connection Refused error or a timeout during connection establishment.
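The failover behavior can be pictured with a small simulation. This is only a sketch of the idea, not how any browser is actually implemented: pickFirstReachable and the isReachable callback are hypothetical names, and the proxy addresses are placeholders.

```javascript
// Hypothetical simulation of browser-side failover; not actual browser code.
// `entries` is the ordered list produced by the PAC function.
// `isReachable` stands in for a connection attempt that fails on
// "connection refused" or on a timeout during connection establishment.
function pickFirstReachable(entries, isReachable) {
  for (var i = 0; i < entries.length; i++) {
    if (isReachable(entries[i])) {
      return entries[i]; // first entry that accepts the connection wins
    }
  }
  return null; // every entry in the list failed
}

// Pretend the first proxy is down.
var down = { "PROXY a.example.com:3128": true };
var chosen = pickFirstReachable(
  ["PROXY a.example.com:3128", "PROXY b.example.com:3128", "DIRECT"],
  function (entry) { return !down[entry]; }
);
```

Here chosen ends up as the second entry, because the first one is marked as unreachable.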
The PAC script provides greater flexibility than a manual configuration. Instead of forwarding all HTTP requests to a single proxy address, the auto-configuration script can select a proxy based on the URI. As mentioned in Section 2.2.7, objects with “cgi” in the URI are usually not cachable. A proxy auto-configuration script can detect these requests and forward them directly to origin servers. Unfortunately, the auto-configuration script is not given the request method, so we cannot have similar checks for POST and PUT requests, which are also rarely cachable.
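As a minimal sketch of URI-based selection, the following function sends any URL containing "cgi" directly to the origin server and everything else to a cache. The address proxy.example.com:3128 is a placeholder, and the plain substring test is an assumption chosen for brevity; a real script might use a narrower pattern.

```javascript
// Sketch only: proxy.example.com:3128 is a placeholder address.
function FindProxyForURL(url, host) {
  // URLs containing "cgi" are usually not cachable, so bypass the cache.
  if (url.indexOf("cgi") !== -1) {
    return "DIRECT";
  }
  // Everything else goes through the cache, with a direct fallback.
  return "PROXY proxy.example.com:3128; DIRECT";
}
```

Note that the function sees only the URL and the hostname, so it has no way to treat a POST differently from a GET for the same URL.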
The PAC script can also be used creatively to implement load sharing. Without a PAC script, one way to do load sharing is by entering multiple address records for the cache’s hostname in the DNS. This results in a somewhat random scheme. There is no guarantee that the same URI always goes to the same proxy cache. If possible, we would rather have requests for the same URIs always going to the same caches, to maximize our hit ratios. We can accomplish this by writing a function that always returns the same proxy list for a given URI. An example of this is shown later in Example 4-1.
The proxy auto-configuration function is named FindProxyForURL() and has two arguments: url and host.[20] The return value is a string specifying how to forward the request. The return string is one or more of the following, separated by semicolons:
PROXY host:port
SOCKS host:port
DIRECT
For example:
"PROXY proxy.web-cache.net:3128; DIRECT; SOCKS socks.web-cache.net:1080;"
When writing FindProxyForURL(), you may want to use some of the built-in functions for analyzing the URL. The most useful ones are described here. For the full details, see Netscape’s PAC documentation at http://home.netscape.com/eng/mozilla/2.0/relnotes/demo/proxy-live.html.
The isPlainHostName(host) function returns true if host is a single-component hostname rather than a fully qualified domain name. If host contains any periods, this function returns false.
Many PAC scripts are written so that requests with plain hostnames are sent directly to the origin server. It’s likely that such a request refers to an internal server, which probably doesn’t benefit from caching anyway. Also, the caching proxy may not be able to resolve unqualified hostnames, depending on how the proxy is configured.
The dnsDomainIs(host, domain) function returns true if host is a member of domain. For example, foo.bar.com is a member of bar.com, whereas www.foobar.com is not.
The isResolvable(host) function returns true if a DNS lookup for host results in an IP address. This function allows the browser, instead of the proxy, to generate error pages for invalid hostnames. When the browser generates an error message rather than the proxy, users are less likely to complain that the proxy cache is broken. Fewer complaints, of course, mean fewer headaches for your support staff.
The shExpMatch(string, pattern) function performs Unix shell-style pattern matching on string. For example, to match URLs that end with .cgi, you can write:
shExpMatch(url, "*.cgi")
To match the request protocol at the beginning of the URL, use:
shExpMatch(url, "ftp:*")
Some sample FindProxyForURL() functions are given in the next section.
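Because these built-in functions exist only inside the browser, it can be handy to have rough stand-ins when trying out a PAC script elsewhere, such as in a command-line JavaScript interpreter. The versions below are approximations written from the descriptions in this section, not the browsers' own implementations, and real browsers may differ in edge cases. The KNOWN_HOSTS table is a hypothetical substitute for a real DNS lookup.

```javascript
// Simplified stand-ins for testing a PAC script outside a browser.
// Approximations only; a browser supplies the real functions.

function isPlainHostName(host) {
  // True when the hostname has no periods at all.
  return host.indexOf(".") === -1;
}

function dnsDomainIs(host, domain) {
  // Membership as described in the text: the host is the domain itself,
  // or it ends with "." followed by the domain.
  var suffix = "." + domain;
  return host === domain || host.slice(-suffix.length) === suffix;
}

function shExpMatch(str, pattern) {
  // Escape regular-expression metacharacters, then translate the
  // shell wildcards * (any run of characters) and ? (one character).
  var re = pattern
    .replace(/[.+^${}()|[\]\\]/g, "\\$&")
    .replace(/\*/g, ".*")
    .replace(/\?/g, ".");
  return new RegExp("^" + re + "$").test(str);
}

// Hypothetical table standing in for DNS; the address is a placeholder.
var KNOWN_HOSTS = { "www.example.com": "192.0.2.10" };

function isResolvable(host) {
  return Object.prototype.hasOwnProperty.call(KNOWN_HOSTS, host);
}
```

With these definitions loaded first, a PAC script can be exercised by calling FindProxyForURL() with sample URLs and hostnames and inspecting the returned string.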
The PAC script must be placed on a web server, and the server must be configured to return a specific MIME Content-type header in the response. If Content-type is not set to application/x-ns-proxy-autoconfig, browsers do not recognize it as a proxy auto-configuration script. Generally, administrators name the PAC script with a .pac extension and then instruct the HTTP server to return the desired Content-type for all URIs with that extension. With Apache, you can add this line to srm.conf:
AddType application/x-ns-proxy-autoconfig .pac
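The same requirement applies to any server, not just Apache. As an illustration only, here is a sketch of a request handler written for Node's built-in http module, which is not something the original setup uses. The handler name, the PAC_BODY contents, and the port in the closing comment are all placeholders.

```javascript
// Sketch of a request handler that serves a PAC script with the MIME type
// browsers expect. Names and contents here are placeholders.
var PAC_MIME_TYPE = "application/x-ns-proxy-autoconfig";
var PAC_BODY = 'function FindProxyForURL(url, host) { return "DIRECT"; }\n';

function handleRequest(req, res) {
  if (/\.pac$/.test(req.url)) {
    // Any URI ending in .pac gets the PAC Content-Type.
    res.writeHead(200, { "Content-Type": PAC_MIME_TYPE });
    res.end(PAC_BODY);
  } else {
    res.writeHead(404, { "Content-Type": "text/plain" });
    res.end("Not found\n");
  }
}

// To serve it with Node:
//   require("http").createServer(handleRequest).listen(8000);
```

Whatever server is used, it is worth fetching the PAC URL by hand and checking the Content-type header before pointing browsers at it.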
First, let’s look at a very simple proxy auto-configuration script that returns a single proxy address for all HTTP and FTP requests. For all other requests, it instructs the browser to forward the request directly to the origin server:
function FindProxyForURL(url, host)
{
    if (shExpMatch(url, "http:*"))
        return "PROXY proxy.isp.net:8080";
    if (shExpMatch(url, "ftp:*"))
        return "PROXY proxy.isp.net:8080";
    return "DIRECT";
}
Now, let’s look at a more complicated example for a company with a firewall. We want to forward all internal requests directly and all external requests via the firewall proxy. First, we look for internal hosts. These are single-component hostnames or fully qualified hostnames inside our domain (company.com). We use the isResolvable() trick so error messages for invalid hostnames come directly from the browser instead of the proxy. This trick works only if the internal hosts can look up addresses for external hosts:
function FindProxyForURL(url, host)
{
    if (isPlainHostName(host))
        return "DIRECT";
    if (dnsDomainIs(host, "company.com"))
        return "DIRECT";
    if (!isResolvable(host))
        return "DIRECT";
    return "PROXY proxy.company.com:8080";
}
Next, let’s see how you can use a proxy auto-configuration script for load sharing and redundancy. Three methods are commonly used for sharing the load between a set of N caches. One simple approach is to assign N IP addresses to a single hostname. While this spreads the load, it has the undesirable effect of randomizing mappings from requests to caches. It is better to have the same request always sent to the same cache. A hash function accomplishes this effect. A hash function takes a string (e.g., a URL) as input and returns an integer value. Given the same input, a hash function always returns the same value. We apply the modulo operator to the hash result to select from the N caches. This scheme works well, but the mappings change entirely when additional caches are added. The final technique is to use some aspect of the URL, such as the domain name or perhaps the filename extension. For example, .com requests can be sent to one cache and all other domains to another cache. Depending upon the incoming requests, this approach might result in significantly unbalanced load sharing, however.
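The third technique, partitioning on some aspect of the URL, might look like the following sketch. It sends .com hostnames to one cache and everything else to another. The cache addresses are placeholders, and the lowercase comparison is an assumption added so that mixed-case hostnames are handled the same way.

```javascript
// Sketch only: cache1 and cache2 are placeholder addresses.
function FindProxyForURL(url, host) {
  // Hostnames are case-insensitive, so compare in lowercase.
  var lower = host.toLowerCase();
  if (lower.slice(-4) === ".com") {
    return "PROXY cache1.example.net:3128; DIRECT";
  }
  // All other domains go to the second cache.
  return "PROXY cache2.example.net:3128; DIRECT";
}
```

How well this balances the load depends entirely on the mix of requests, which is the weakness noted above.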
Example 4-1 uses the hash function technique. We have four caches, and the hash function is simply the length of the URL, modulo four. Furthermore, for redundancy, we return multiple proxy addresses. If the first is unavailable, the browser tries the second. This failover breaks the partitioning scheme, but the users are more likely to get service.
Example 4-1. Sample PAC Script with Hashing
var N = 4;

function FindProxyForURL(url, host)
{
    var i = url.length % N;
    if (i == 0)
        return "PROXY a.proxy.company.com:8080; " +
               "PROXY b.proxy.company.com:8080; " +
               "DIRECT";
    else if (i == 1)
        return "PROXY b.proxy.company.com:8080; " +
               "PROXY c.proxy.company.com:8080; " +
               "DIRECT";
    else if (i == 2)
        return "PROXY c.proxy.company.com:8080; " +
               "PROXY d.proxy.company.com:8080; " +
               "DIRECT";
    else if (i == 3)
        return "PROXY d.proxy.company.com:8080; " +
               "PROXY a.proxy.company.com:8080; " +
               "DIRECT";
}
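URL length is a very coarse hash, since many different URLs share the same length. As a variation on Example 4-1, and not something taken from it, the sketch below hashes every character of the URL and keeps the proxies in a table so the list can grow without adding branches. The hashString function and its constants (31 and 65521) are assumptions of this sketch; any deterministic string hash would serve.

```javascript
// Variation on Example 4-1 using a character-based hash. The hash function
// and its constants are illustrative choices, not part of the PAC API.
var PROXIES = [
  "a.proxy.company.com:8080",
  "b.proxy.company.com:8080",
  "c.proxy.company.com:8080",
  "d.proxy.company.com:8080"
];

// Simple polynomial hash; the modulus keeps the value a small integer.
function hashString(s) {
  var h = 0;
  for (var i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) % 65521;
  }
  return h;
}

function FindProxyForURL(url, host) {
  var n = PROXIES.length;
  var i = hashString(url) % n;   // primary cache for this URL
  var next = (i + 1) % n;        // neighbor used if the primary is down
  return "PROXY " + PROXIES[i] + "; " +
         "PROXY " + PROXIES[next] + "; " +
         "DIRECT";
}
```

As with the length-based version, the same URL always maps to the same primary cache, and the mappings still change when the number of caches changes.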
Once a PAC script has been written and placed on a server, configuring a browser to use it is relatively simple. All you need to do is enter the PAC script URL in the appropriate configuration window for your browser.
For Netscape’s browser, set the proxy auto-configuration URL in one of the same windows used for manual proxy configuration. Start by selecting Edit → Preferences… from the main menu bar. In the Preferences window, click on the small triangle next to Advanced and select Proxies. Select the “Automatic proxy configuration” option and enter the URL as shown in Figure 4-5.
If you’re using Microsoft Internet Explorer, select View → Internet Options from the main menu bar. Select the Connection tab; the window shown in Figure 4-1 appears. Again, click on LAN Settings… and you’ll see the window in Figure 4-2. At the top is a subwindow titled Automatic configuration. If you select “Automatically detect settings,” Explorer will try to use WPAD, which we’ll talk about next. To use a PAC script, select “Use automatic configuration script” and enter its URL in the Address box.
Normally, browsers read the PAC URL only at startup. Thus, if you change the PAC script, users do not get the changes until they exit and restart their browser. Users can force a reload of the PAC script at any time by going to the proxy auto-configuration window and clicking on the Reload or Refresh button. Unfortunately, the Netscape browser does not obey the Expires header for PAC replies. That is, you cannot make Netscape Navigator reload the PAC script by providing an expiration time in the response.
Organizations with hundreds or even thousands of desktop systems may want to preconfigure browsers with a PAC URL. One way to accomplish this is to use a special kit from the manufacturer that allows you to distribute and install specially configured browsers. Microsoft calls theirs the Internet Explorer Administration Kit. Netscape’s is the Client Customization Kit, but it works only on the Microsoft Windows and Macintosh versions of Netscape Navigator. Both kits are available for download at no charge.[21]
[20] Of course, the url includes host, but it has been extracted for your convenience.
[21] You can download the Internet Explorer Administration Kit from http://www.microsoft.com/windows/ieak/en/download/default.asp, and Netscape’s Client Customization Kit is available for download at http://home.netscape.com/download/cck.html.