While this book contains a lot of detailed information about monitoring, load testing, problem analysis, and background about how things work, I often find myself referring to this small set of questions and answers I wrote up to quickly diagnose and treat the most common problems. Since a majority of problems can be solved by simply reading through this list and checking things off, I provide it here right up front. There are many references to concepts that have not been discussed yet, but they are explained later in the book.
echo AT > /dev/modem
echo AT > COM1
If your modem is connected, you will see the send and read lights flash as the modem responds OK to the AT command. If the lights do not flash, either the modem is not connected, or you have configured it for the wrong COM port, PCMCIA slot, or other attachment point.
External modems should have a light labeled CD (Carrier Detect) to indicate whether there is a carrier signal; that is, whether you are online. If it is not lit, it may be that the remote end hung up on you, or you lost your connection through too much noise on the line or an inactivity timeout.
Look at external modem lights when you request a web page. The read and send lights should be flashing. This is also true for DSL modems, cable modems, hubs, and other network equipment. The send light will tell you that your modem is trying to send data out to the Internet. The read light will tell you if your modem is getting anything back from the network. If you cannot see these lights flashing, there is no data flowing through the modem.
On Windows, check that you actually have an assigned IP address by using the ipconfig command from a DOS prompt. On Linux, use the command ifconfig -a to check that you have an IP address. An IP address consists of four numbers between 0 and 255, separated by periods.
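The address format just described is easy to check mechanically. The following is a minimal sketch in Python; the function name is_dotted_quad is hypothetical, and it checks only the shape of the address, not whether the address is actually assigned to your machine or routable.

```python
def is_dotted_quad(address: str) -> bool:
    """Return True if address is four decimal numbers (0-255) separated by periods."""
    parts = address.strip().split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # Require plain ASCII digits; int() alone would also accept "+1" or " 1".
        if not (part.isascii() and part.isdigit()):
            return False
        if int(part) > 255:
            return False
    return True
```

You could paste in whatever ipconfig or ifconfig reports, for example is_dotted_quad("192.168.1.10").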
If you do have an IP address, you still may have forgotten to set a gateway router entry in the operating system. Use the graphical configuration tools under Windows or the Mac to enter a valid gateway router for your IP address. Under Linux, you can use the route command, like this:
route add default gw <router IP address>
If you can use telnet or ftp, then you are definitely connected with a valid IP address. If you can hit http://www.yahoo.com/ or other well-known sites from your web browser, you know you have a valid address and gateway. Try to hit a site with constantly changing content to be sure you’re not just seeing a previously cached page. Stock quote pages are good for checking whether you’re getting cached or viewing current data.
Browsers have been known to hang. On the other hand, your browser may just be thinking some deep thoughts at the moment. Give it a minute, especially if you just requested a page. The system call to resolve DNS names may hang the browser for a moment if the DNS server is slow. If you give it a minute and it’s still stuck, kill the browser and try again.
Many browsers have an offline mode where they disconnect themselves from the Internet even if the PC is still connected. Make sure your browser is not offline. If it is offline, you may see a little “disconnected wire” icon in the lower left corner of the browser.
Maybe your DNS server is down or not configured. Try a known IP address in the browser. In case you don’t keep the IP addresses of web servers around, try hitting http://126.96.36.199 (which is http://www.yahoo.com/). If this URL works, but http://www.yahoo.com/ does not, your problem is DNS resolution and you need to set a DNS server using one of the graphical tools on Windows or the Mac, or by entering the IP address of your DNS server in /etc/resolv.conf on Linux. Case does not matter for DNS names, but it does matter for the part of the URL after the machine name, which is confusing.
telnet www.yahoo.com 80
Note how long it takes before Telnet returns a “connected” response. If it is consistently a second or more, try a traceroute to the server to see how far you can get. The traceroute program comes packaged with most versions of Unix, but there is also an imitation called tracert on NT and a commercial version called Net.Medic from Vital Signs Software. If traceroute stops within your ISP, it could be that your Internet provider is down, i.e., not connected to the rest of the Internet. Sometimes your whole region may be down because of a regional NAP (Network Access Point) or Internet backbone issue. There’s not much you can do about that.
If any of the routers along the way show times of more than a few tenths of a second, consider which router it is. If it belongs to your ISP, you may benefit from changing your ISP. If it belongs to a large network provider such as MCI or Sprint, you may still benefit from changing your ISP because a different ISP may use a different route. But if the slow router belongs to the target site’s ISP, all you can do is complain to the webmaster of the target site. Quite often their email address is given on the site. Other times they fit a pattern like email@example.com. If you don’t have traceroute or tracert, you can simply try to Telnet to some other web servers. If the connect time is long for each of them, the problem is probably with a router in your own organization or your client machine itself.
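If you would rather put a number on the connect time than judge it by eye in Telnet, a few lines of Python will do. This is only a sketch: the function name connect_time and the five-second default timeout are assumptions made for illustration.

```python
import socket
import time


def connect_time(host: str, port: int = 80, timeout: float = 5.0) -> float:
    """Return the seconds taken to open a TCP connection to host:port."""
    start = time.perf_counter()
    # create_connection resolves the name and then completes the TCP handshake,
    # so a slow DNS server will also show up in this number.
    with socket.create_connection((host, port), timeout=timeout):
        return time.perf_counter() - start
```

Calling connect_time("www.yahoo.com") several times, and then against a few other servers, gives a rough picture of whether the delay follows one site or follows you.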
If you are on a Unix system, you can try pointing rstat at any remote server to see if you can find out something about how loaded it is. Running rstat won’t hurt anything; if that server does not run the rstatd daemon, or if rstat’s request is blocked by a firewall, you won’t see any response. See Chapter 4 for more about rstat.
An immediate “connection refused” message means that the remote web server software is down, yet the remote web server machine is still up and working well enough to send a TCP reset packet, telling you that nothing is listening on the web server’s port. If the attempted connection hangs, it probably means you cannot get a connection to the remote web server machine at all, perhaps because it has crashed or is turned off or disconnected. You can check whether you’re getting any packets at all from the remote side by using the tcpdump tool on Linux or the snoop tool on Solaris.
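The difference between a refusal and a hang can also be detected from a script. Here is a minimal sketch in Python; the function name probe and the labels it returns are hypothetical, and any error other than a refusal or a timeout (an unresolvable name, for example) is deliberately left to propagate.

```python
import socket


def probe(host: str, port: int = 80, timeout: float = 5.0) -> str:
    """Classify a TCP connection attempt as 'connected', 'refused', or 'no response'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The machine answered with a TCP reset: it is up, but nothing is listening.
        return "refused"
    except socket.timeout:
        # Nothing came back within the timeout: the machine may be down,
        # disconnected, or hidden behind a filter that drops packets.
        return "no response"
```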
If you’re trying to read a popular site, consider that there may be mirror sites that are less heavily loaded. Mirror sites are usually mentioned on the home page of a site. The Apache Web Server site (http://www.apache.org/), for example, has mirror sites around the world.
The Maximum Transmission Unit (MTU) is the largest packet your network interface will send. If you set it too big, packets will be rejected by the interface and returned to be split up. This will slow down your browsing. If you set it too small, you will send many small packets when you could have been more efficient and sent a few large ones. This will also slow down your browsing. See Chapter 12 for information about how to adjust your MTU.
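To get a feel for the cost of a small MTU, it helps to count packets. The sketch below assumes 40 bytes of header per packet (a 20-byte IPv4 header plus a 20-byte TCP header with no options); the function name packets_needed is hypothetical.

```python
def packets_needed(payload_bytes: int, mtu: int, header_bytes: int = 40) -> int:
    """Number of packets needed to carry payload_bytes when no packet exceeds mtu bytes.

    header_bytes=40 assumes IPv4 and TCP headers of 20 bytes each, with no options.
    """
    per_packet = mtu - header_bytes
    if per_packet <= 0:
        raise ValueError("MTU must be larger than the header size")
    # Integer ceiling division: a partly filled last packet still counts as a packet.
    return (payload_bytes + per_packet - 1) // per_packet
```

Under these assumptions a 14,600-byte page fits in 10 packets at an MTU of 1500, but needs 28 packets at an MTU of 576.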
Most large companies do not allow internal users to connect directly to the Internet, but instead require that they go through proxy servers, both for security and performance reasons. All browsers allow you to set a proxy through a preferences dialog box. Your organization can tell you the proxy settings to use. If you can telnet to port 80 of well-known web sites, then you are directly on the Internet and do not need to use a proxy server.
Note that some proxy servers cannot handle the Secure Socket Layer (SSL). If your proxy cannot handle SSL, and you have to use that proxy, then you just can’t see any SSL-protected pages. SSL-protected pages are the ones that start with “https” rather than “http”. Most commercial web sites use SSL for transaction security.
Most large organizations have several proxy servers, some more loaded than others. You can often see a dramatic increase in performance just by picking the right proxy. If you are within an organization with multiple proxy servers, try to find the most lightly loaded proxy. If your proxies run on Solaris or some other OS that supports the rstatd remote statistics daemon, you can use rstat or perfmeter to get an indication of which one is least loaded. See Chapter 4 for more information about rstat.
One problem is simply figuring out the names of your proxies. You can ask around in your company, or if your browser is automatically configured via a proxy.pac file, you can manually get the proxy.pac file through Telnet and look through it for the names of the proxies. It would be nice if browsers had the ability to automatically switch to a faster proxy based on rstat statistics or response times, but as far as I know, they do not. On the bright side, most proxies do not require any authentication, so you can switch to a faster one at will.
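Once you have the text of the proxy.pac file, picking the proxy names out of it can be automated. This is a rough sketch that assumes the file names its proxies in the usual "PROXY host:port" form; the function name proxies_in_pac is hypothetical, and the pattern will miss proxies written in any other way.

```python
import re


def proxies_in_pac(pac_text: str) -> list:
    """Return the distinct (host, port) pairs named in PROXY directives, in order."""
    found = []
    for host, port in re.findall(r"PROXY\s+([A-Za-z0-9.-]+):(\d+)", pac_text):
        entry = (host, int(port))
        if entry not in found:
            found.append(entry)
    return found
```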
You can also see how fast the target site is by putting the URL into the web site analysis tool at http://patrick.net/. If the analysis tool reports that the site is fast, but you are getting it slowly, that also indicates that your proxy or some other network component may be to blame.
Some “network nanny” software may be installed on your PC to prevent you from viewing certain sites, or a proxy may refuse to access certain sites.
Check whether your client is overloaded from running other tasks in addition to the browser. On Linux, top will show you the top processes by CPU usage or memory usage and let you kill them. On Windows NT, Ctrl-Alt-Delete will bring up the task manager, which is an imitation of top.
Fewer processes are always better, but be sure you know what you’re killing, or you might crash your machine. If you aren’t completely sure that you know what you are doing, consider rebooting to get rid of processes that are leaking memory or otherwise abusing the system. Your initialization files may start them up again, but at least they will be starting small.
The classic sign of a memory shortage is very poor performance and a constantly working hard disk. This is because your operating system will try to use hard disk as “virtual” memory when there isn’t enough RAM. Unfortunately, disk is extremely slow compared to RAM. The solutions are to run fewer or smaller applications and turn off Java in the browser, or buy more RAM.
On the client side, the most common performance bottleneck is lack of bandwidth between ISP and PC. If you have to use a modem, it is well worth the money to buy the fastest modem available, but make sure your ISP supports that speed. ISDN is better than a modem, but difficult to configure. ADSL and cable modem are the best options for home users. If you are on a LAN, 100 Mbps “fast” Ethernet is noticeably better than standard 10 Mbps Ethernet. Fast Ethernet can be configured to run full-duplex, increasing its advantage even more.
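A little arithmetic shows how much the connection speed matters. The sketch below is a best-case estimate only: it ignores latency, protocol overhead, and compression, and the function name transfer_seconds is hypothetical.

```python
def transfer_seconds(size_bytes: int, bits_per_second: float) -> float:
    """Best-case time to move size_bytes over a link of the given raw speed."""
    if bits_per_second <= 0:
        raise ValueError("link speed must be positive")
    return size_bytes * 8 / bits_per_second
```

As an illustration, a 100,000-byte page takes a little over 14 seconds on a 56,000 bps modem link, but well under a tenth of a second on 100 Mbps Ethernet.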
Turning off autoloading of images will help performance dramatically if the problem is simply that your bandwidth is limited to that of a modem on a regular dial-up telephone line (also known as a POTS line, for Plain Old Telephone Service). Of course, without graphics you won’t enjoy a lot of what the Web is about, which is, well, graphics. On the other hand, you’ll escape most advertising. In Netscape, turn off automatic loading by choosing Edit → Preferences → Advanced and then unchecking the Automatically Load Images box.
Even if you turn off automatic loading of images, you can load and view an interesting image by clicking on the associated image icon. Your next question should be how to tell whether an image looks interesting before you’ve seen it. This is exactly what the ALT attribute of the HTML <IMG> tag is for: the HTML author is supposed to add a text description of the associated image, which the browser will display if image loading is off. ALT stands for “alternate text.” Here is an example:
<img src="images/foo.gif" alt="Picture of a Foo" width=190 height=24>
Most browsers also have a button that forces all unloaded images to load. Many sites offer a light-graphics or text-only link for the bandwidth-impaired user. Another option is to use a text-only browser such as lynx, which also has the advantage that it can be run remotely over a VT100 or other terminal-mode connection rather than requiring a TCP/IP connection all the way to the client. That is, your ISP may let you dial up and run lynx on the ISP’s computer rather than on your computer at home.
It is frequently helpful to set the browser to start on a blank page, so that you do not have to wait for a default page to load when starting up. The Netscape home page can be particularly heavy with graphics and features, so it’s a poor choice to leave as the default. To change the startup page to blank in Netscape, choose Edit → Preferences → Navigator and then click the radio button for “Navigator starts with blank page.”
If you don’t care about graphics, you could use the lynx browser, which starts instantly.
Newer browsers take advantage of the newest performance improvements in the HTTP protocol, but they also tend to get slower and fatter with each generation. On the other hand, Netscape 6 is a complete rewrite and much faster than Netscape 4. (Apparently Netscape has simply skipped the number 5.) IE 5 is also an improvement over its predecessors.
Browsers cache the documents you view and then retrieve an item from the browser’s cache if you request it again. Because the document may have changed in the meantime, the browser will by default contact the original server to validate the freshness of every cached page. If the document has changed on the server, the new version will be downloaded. If the locally cached copy is up to date, then it is displayed.
The validation request may require only a little network traffic if the document has not been modified, but you’ll still get better performance from using what’s in the cache without verification, and you won’t have to download any pages with trivial changes. You may get stale pages, but at least you’ll get them quickly.
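In HTTP, this validation is typically a conditional request: the browser sends the date of its cached copy in an If-Modified-Since header, and the server answers 304 (Not Modified) with no body if nothing has changed. The following sketch shows only the decision the server makes; the function name validation_response is hypothetical, and real servers consider more than the date.

```python
from email.utils import parsedate_to_datetime


def validation_response(if_modified_since: str, last_modified: str) -> int:
    """Return the HTTP status a server would send for a date-based validation request."""
    cached = parsedate_to_datetime(if_modified_since)
    current = parsedate_to_datetime(last_modified)
    # 304 carries no body, so the browser redisplays its cached copy.
    # 200 means the full, newer document is sent again.
    return 304 if current <= cached else 200
```

Even the cheap 304 answer costs a round trip to the server, which is the delay you avoid by skipping verification.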
To get the performance gain from not verifying cached documents in Netscape, set Options → Network Preferences → Verify Document: to Never. If you suspect you’ve got a stale page, it’s an easy matter to force Netscape to get the current version. Simply hold down the Shift key and hit Reload. Setting Verify Document: to “Once per Session” is second-best; this will verify the timeliness of the document just once for that Netscape session. Setting Verify Document: to “Every Time” is worse from a performance point of view. This instructs Netscape to check with the original server for a fresher version every time you view that page.
It can take 15 or 20 seconds to start up the Java virtual machine the first time you hit a page with Java in it. This Java initialization freezes the browser and cannot be interrupted, which can be very annoying. One solution is to turn off Java in the browser unless you know you want a specific applet. Another solution is to try the latest Java “Plugin” from Sun, which is capable of caching applets indefinitely, so you won’t need to download a particular applet more than once. However, the Plugin itself is very large and takes a long time to download when you first install it.
If you are spending most of your time getting data from one server, it may be worthwhile to get an account with the ISP that connects that server to the Internet. You’ll probably see better throughput and latency working from an account on the same ISP than from somewhere else. Telecommuters probably want an account with their company’s ISP.
If you are on the West Coast of the U.S., be aware that there is a lot of network traffic in the morning because the East Coast has been up and surfing for three hours already. So the East Coast gets better speed early in the morning because the Californians are asleep, and the West Coast is faster late at night because the East Coasters are asleep.
A proxy server between your organization and the Internet will cache frequently requested pages, reducing the load on your connection to the Internet while providing faster response time to the users for cached pages. The benefit you see depends on the number of times the requested page is in the cache. If all web requests were for unique URLs, then a proxy would actually reduce performance, but in practice, a few web pages are very popular and the cache is well used.
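The tradeoff can be put into a simple formula: a cache hit costs only the time to reach the proxy, while a miss costs the normal fetch time plus whatever the proxy adds. Here is a minimal sketch under that assumption; the function and parameter names are hypothetical, and the numbers in the usage note are made up for illustration.

```python
def mean_response_time(hit_ratio: float, hit_time: float,
                       miss_time: float, proxy_overhead: float) -> float:
    """Expected response time through a caching proxy, in the units of the inputs."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    # Hits are served from the cache; misses pay the full fetch plus the proxy's overhead.
    return hit_ratio * hit_time + (1.0 - hit_ratio) * (miss_time + proxy_overhead)
```

With a 2-second direct fetch, a 0.1-second cache hit, and 0.05 seconds of proxy overhead, a hit ratio of zero gives 2.05 seconds, slightly worse than no proxy at all, while a hit ratio of one half gives about 1.08 seconds.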
Keep in mind that proxies may make some Java applets unusable, since applets can currently connect only back to the server they came from. The server they came from will be the proxy, which is not where the applet probably thinks it came from.
Now let’s look at things from the server side. Here’s what you should look at if your web server seems sluggish.
If you are running a web site from a PC, be sure to disable the power conservation features that spin down the disk and go into sleep mode after a period of inactivity. Sleep mode will slow down the first user who hits your site while it is sleeping, because it takes a few moments for the disk to spin up again. Some operating systems — for example, Mac OS X — are capable of quickly serving pages in their sleep; but even they will eventually have to wake up to log to disk, so it is best to turn off sleep mode.
DNS servers can become overloaded like anything else on the Internet. Since DNS lookups block the calling process, a slow DNS server can have a big impact on perceived performance. Check whether your DNS server’s CPU or network load is nearing its capacity by monitoring that machine’s hardware statistics. See Chapter 4 for more information on monitoring.
If you determine that your DNS server is a problem, consider setting up additional servers or simply pointing your DNS resolver to another DNS server. Using a different DNS server is done by modifying /etc/resolv.conf under Linux or using the Network Control Panel on Windows.
Netscape browsers do not display a page at all until all image sizes are known. If you do not include the image sizes in your HTML, this means that the browser must actually download all the images before it knows the sizes, resulting in a long delay before the user sees anything at all. Many users also do not download images for one reason or another, but would like to know what kind of image it is they are missing, especially if you use images for navigation tools. So for best performance and usability, make sure all your images have size parameters and ALT text in the HTML like this:
<img src="images/foo.gif" alt="Picture of a Foo" width=190 height=24>
Similarly, many users turn off Java because VM startup time and applet download time are very annoying. Like the ALT text for images, any text within the <APPLET></APPLET> tags will be displayed when Java is off, so the user will have an idea of whether he wants to turn Java back on and reload the page. This text can include any valid HTML, so it is possible for the content designer to create a useful alternative to the applet and put it within the applet tag.
<META HTTP-EQUIV = "Refresh" Content = "2;URL=http://www.go here.com">
Avoid redirects if at all possible because they waste time. But if you have to use one, at least make it fast by putting in a zero-second delay.
Web servers are often set by default to take the IP address of the client and do a reverse DNS lookup on it (finding the name associated with the IP address) in order to pass the name to the logging facility or to fill in the REMOTE_HOST CGI environment variable. This is time consuming and not necessary, since a log parsing program can do all the lookups when parsing your log file later.
You might be tempted to turn off logging altogether, but that would not be wise. You really need logs to show how much bandwidth you’re using, whether it’s increasing, and lots of other valuable performance information. You just don’t need to log DNS names. CGIs can also do the reverse lookup themselves if they need it. Every web server has the option to turn off reverse DNS lookups in its configuration files. Refer to your web server’s documentation.
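Doing the lookups afterward might look something like the following sketch. It assumes each log line begins with the client IP address followed by a space, as in the common log format; the function names are hypothetical. Each distinct address is looked up only once, and the lookup function is a parameter so you can substitute your own.

```python
import socket
from typing import Callable, Dict, Iterable, Iterator


def default_lookup(ip: str) -> str:
    """Reverse-resolve ip, falling back to the address itself if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip


def add_hostnames(lines: Iterable[str],
                  lookup: Callable[[str], str] = default_lookup) -> Iterator[str]:
    """Replace the leading IP address on each log line with its host name."""
    cache: Dict[str, str] = {}
    for line in lines:
        ip, sep, rest = line.partition(" ")
        if ip not in cache:
            cache[ip] = lookup(ip)  # one lookup per distinct client
        yield cache[ip] + sep + rest
```

Because this runs offline, a slow DNS server delays only your report, not your users.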
TCP will begin a connection with the assumption that a segment has been lost if it has not been acknowledged within a certain amount of time, typically 200 milliseconds. For some slow Internet connections, this is not long enough. TCP segments may be arriving safely at the browser, only to be counted as lost by the server, which then retransmits them, using up bandwidth. Turning up the TCP retransmit timeout will fix this problem, but it will also reduce performance for fast but lossy connections, where the reliability is poor even if the speed is good. For long-lived TCP connections, TCP will dynamically adapt to the performance of that connection, but most connections to web servers are short, so the initial timeout setting has a big impact.
Internet Protocol data packets must go through a number of forks in the road on the way from the server to the client. Dedicated computers called routers make the decision about which fork to take for every packet. That decision, called a router “hop,” takes some small but measurable amount of time, typically a millisecond or two. Servers should be located as few router hops away from the audience as possible.
ISPs usually have their own high-speed network connecting all of their dial-in points of presence (POPs). A web surfer on a particular ISP will probably see better network performance from web servers on that same ISP than from web servers located elsewhere, partly because there are fewer routers between the surfer and the server. National ISPs are near a lot of people. If you know most of your users are on AOL, for example, get one of your servers located inside AOL. The worst situation is to try to serve a population far away, forcing packets to travel long distances and through many routers. A single HTTP transfer from New York to Sydney can be painfully slow to start and simply creep along once it does start, or just stall. The same is true for transfers that cross small distances but too many routers. Another solution is to host your data on one of the many content distribution services, such as Akamai.
The most effective blunt instrument for servers and users alike is a better network connection, with the caveat that it’s rather dangerous to spend money on it without doing any analysis. For example, a better network connection won’t help an overloaded server in need of a faster disk or more RAM. In fact, it may crash the server because of the additional load from the network.
While server hardware is rarely the bottleneck for serving static HTML, a powerful server is a big help if you are generating a lot of dynamic content or making a lot of database queries. If the CPU usage is at 100 percent, you have found a problem that needs immediate attention.
Whether you will benefit from a CPU upgrade depends entirely on the problem, and the vendor is not likely to tell you that you don’t really need more hardware. You may just have a poorly written application. If you’ve profiled your application and really need the extra power, it helps to upgrade from PC hardware to Unix boxes from Sun, IBM, or HP. They have much better I/O subsystems and scalability. Monitor your server’s hardware utilization to be aware of hardware bottlenecks.
On Solaris up to Version 7, run vmstat and look at the sr column, which is the scan rate for free memory. If the sr column is consistently above zero, you have a memory shortage. Other indications that you are short of memory are any swapping (swapping activity should be zero at all times) or consistent paging. On Solaris 8 and later, look at free memory.
RAM accesses data thousands of times faster than any disk. So getting more data from RAM rather than from disk can have a huge positive impact on performance. All free memory will automatically be used as filesystem cache in most versions of Unix and in NT, so your machine will perform repetitive file serving faster if you have more RAM. Web servers themselves can make use of available memory for caches. More RAM also gives you more room for network buffers and more room for concurrent CGIs to execute.
You may have plenty of memory, yet find it gets used up over time because a process is leaking (losing references to allocated memory). Simply by looking at the size of individual processes over time with top, you should be able to get a feel for which ones are leaking memory. They will have to either be fixed or restarted on a regular basis.
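If you record a process's size at regular intervals, a crude test can flag the likely leakers for you. The following is a heuristic sketch only: the function name looks_like_leak, the five-sample minimum, and the 10 percent growth threshold are all arbitrary assumptions, and a process that is legitimately filling a cache will trip it too.

```python
def looks_like_leak(sizes_kb, min_samples=5, min_growth=0.10):
    """Heuristic: True if the size never shrinks and total growth is at least min_growth."""
    if len(sizes_kb) < min_samples or sizes_kb[0] <= 0:
        return False
    # A leaking process rarely gives memory back, so every sample is >= the one before.
    never_shrinks = all(b >= a for a, b in zip(sizes_kb, sizes_kb[1:]))
    growth = (sizes_kb[-1] - sizes_kb[0]) / sizes_kb[0]
    return never_shrinks and growth >= min_growth
```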
On Solaris, look at the output from iostat -x. Disk access latencies consistently higher than 100 milliseconds are a cause for concern. When buying disks, get those with the lowest seek time, because disks spend most of their time seeking (moving the arm to the correct track) in the kind of random access typical of web serving.
A collection of small disks is often better than a single large disk. 10,000 rpm is better than 7,200 rpm. Bigger disk controller caches are better. SCSI is better than IDE or EIDE. But all of these things cost more money as well.
Use multiple mirrored servers of the same capacity and balance the load between them. There are now many commercial services, such as Akamai, that provide caching servers. Your load will naturally be balanced to some degree if you are running a web site with an audience scattered across time zones or around the world, such as a web site for a multinational corporation.
Software generally gets faster and better with each revision. At least that’s how things are supposed to work. Try the latest version of the operating system and web server and apply all of the non-beta patches, especially the networking and performance-related patches. This rule can sometimes be profitably broken, since old software often takes less memory.
If a performance problem happens only at certain intervals, check what cron or Autosys jobs the server is running. (Autosys is a commercial version of cron from Computer Associates.) These intermittent problems can be infuriating if you notice the slowdown and look for the culprit just as it finishes and goes away. You might just leave perfmeter running if you’re on Solaris to look for regular CPU spikes. This should illustrate repeating load patterns well. You can disable the cron daemon if necessary.
Don’t run anything unnecessary for web service on your web server, middleware, or database machine. In particular, your web server should not be an NFS server, an NNTP server, a mail server, or a DNS server. Find those things other homes. You should run top (or taskmanager on Windows, or prstat on Solaris 8) and figure out which of the processes are using the most CPU and memory. Kill all unnecessary daemons, such as lpd.
Don’t even run a windowing system on your web server. You don’t really need it, and it takes up a lot of RAM. Terminal mode is sufficient for you to administer your web server. On Windows, however, you don’t have any choice; Windows always wastes memory and CPU on the GUI because there is no terminal mode.
Server Side Includes (SSI) are very inefficient. SSI means that the server parses your HTML and looks for commands to run programs and insert content. It is better to dynamically generate the whole page from one CGI or servlet than to run SSIs. CGI is not as bad as it used to be because operating systems have already improved the ability to run many short-lived processes because of demands from the Web.
You may think that you have to generate content on demand where that’s not really the case. You can update static HTML many times a day, giving the impression of dynamic content without incurring nearly the same overhead. It depends on the number of possible inputs from the user. If there are only a few, you can precalculate responses to them all.
If you use a middleware server that keeps a database connection pool, beware that growing that pool on demand is very bad for performance. You may be able to start the pool high enough that it will not need to increase. A typical symptom is that performance is fine at low loads, but intermittently slow as the load increases and the pool takes time to grow.
If you are allocating database connections from a pool but not reclaiming them, you may be forcing unnecessary growth of the pool, or even bringing your site to a halt until unused connections time out and are collected. To find such leaks, you can watch the number of connections used under load. Chapter 4 has a script that can screen scrape the Weblogic Admin web page and graph usage. To fix the connection leak, you will have to closely examine your code for overt failures to release connections, and for possible exceptions that can divert code from releasing connections.
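Both pool problems, growth on demand and leaked connections, can be illustrated in a few lines. This is a generic sketch in Python, not the API of any particular middleware product: the class name ConnectionPool and the make_connection argument are hypothetical. All connections are created up front, and a context manager returns each one to the pool even when the calling code raises an exception.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: every connection is created up front, none on demand."""

    def __init__(self, make_connection, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(make_connection())

    @contextmanager
    def connection(self, timeout=5.0):
        # Waits (up to timeout seconds) rather than growing the pool under load;
        # raises queue.Empty if no connection becomes free in time.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Runs even if the caller's code raises, so the connection is not leaked.
            self._idle.put(conn)

    def idle_count(self):
        """Approximate number of connections currently available."""
        return self._idle.qsize()
```

Code that uses the pool writes "with pool.connection() as conn:" around its queries, and watching idle_count() under load is one way to spot a leak.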
Most network hardware, such as hubs, switches, and routers, is SNMP-compliant, meaning it will give statistics on its load and collision rates to any SNMP-compliant tool. Watch these statistics for signs of overload. Overloaded hubs are especially likely to be offenders, and are easily replaced with better-performing switches. Also beware of Ethernet connections misconfigured such that one side is full duplex while the other side is not.
Your Java application may be stalling for garbage collection (GC). Since GC is usually single-threaded, you may see one CPU at 100 percent while the others are at 0 percent during the stall. Use mpstat on Solaris to see each CPU’s load. While increasing the initial and maximum heap sizes helps delay the inevitable GC, it also makes GC take longer for most VMs. IBM’s generational garbage collecting VM may be an exception. Also, set -verbosegc when you start the Java VM to clearly see when garbage collection is happening. The latest JDK releases from Sun allow the programmer some control over GC.
Don’t. The overhead of serialization of object parameters is very large. Local method calls are many thousands of times faster than remote calls. If at all possible, choose as your client a standard browser displaying HTML for your GUI, not an applet making RMI calls.
Run strace on Linux or truss on Solaris to see what your server processes are doing. It will quickly become apparent if you are doing too much logging. You will see many small write OS calls, all to the same file descriptor. First, try to buffer the logging, so that it happens in larger increments. Buffered logging is usually an option on most servers. Second, try not to log from Java programs, to avoid the overhead of temporary object creation and conversion between Unicode and ASCII.
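If your server or application has no buffered logging option, the idea is simple enough to sketch yourself. The class below is a hypothetical illustration in Python, with an arbitrary 8192-character default threshold: it collects lines in memory and hands them to the underlying file in one large write instead of many small ones.

```python
class BufferedLog:
    """Collect log lines in memory and pass them to sink.write() in large chunks."""

    def __init__(self, sink, threshold=8192):
        self._sink = sink            # any object with a write(str) method
        self._threshold = threshold  # flush once this many characters are pending
        self._pending = []
        self._pending_size = 0

    def log(self, message):
        line = message + "\n"
        self._pending.append(line)
        self._pending_size += len(line)
        if self._pending_size >= self._threshold:
            self.flush()

    def flush(self):
        if self._pending:
            self._sink.write("".join(self._pending))  # one write for many lines
            self._pending = []
            self._pending_size = 0
```

The price of buffering is that lines still in memory are lost if the process dies, so call flush() at shutdown and keep the threshold modest.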
Revision control systems are wonderful for tracking changes to HTML and code, but terrible for performance. Copy your production data to your web servers rather than serving directly out of ClearCase or other revision control systems.
Expanding a database connection pool
Reverse DNS lookups
TCP retransmit timeouts
Overloaded hubs and switches
Java doing garbage collection
Waiting for the return of an RMI or CORBA or EJB call
Writing massive amounts of data to JDBC logs or other logs in tiny increments
Accessing production web content directly from revision control systems such as ClearCase
Too few Apache daemons or Netscape threads
Turn off images on the client.
Turn off Java on the client.
Turn off cache validation on the client.
Put more RAM on the server.
Put more RAM on the client.
Buy a better connection to the Internet.
On a LAN, if you can cache static content in RAM, you can probably serve it at full network speed. If you can’t cache content, then your disk is probably the bottleneck.
On the Internet, the Internet is usually the bottleneck; the next bottlenecks are dynamic content generation and database queries.
If you have other suggestions for quick checks, please write firstname.lastname@example.org.