Here are a few tried-and-true configurations that have good performance relative to the cost of the setup. Beware that prices are volatile and these examples are only approximate.
A low-volume site gets one to ten thousand hits per day. Such a site can easily be run out of your home. A typical configuration for this level is good PC hardware ($2000) running Linux 2.0 (free), Apache 1.2.6 (free), with connectivity through a cable modem with 100kbps upstream ($100 per month). For database functionality, you may use flat files, or read all of the elements into a Perl hashtable or array in a CGI, and not see any performance problems for a moderate number of users if the database is smaller than, say, one thousand items. Once you start getting more than one hit per second, or when the database gets bigger than one thousand items or has multiple tables, you may want to move to the mSQL free relational database from http://www.hughes.com.au/. The database and connectivity are the weak links here. Apache and Linux, on the other hand, are capable of handling large sites.
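The flat-file approach described above can be sketched in a few lines. This is a minimal illustration in Python rather than Perl, and it assumes a tab-separated file whose first column is a unique key; the function name `load_catalog` and the sample records are hypothetical.

```python
import csv
import io

def load_catalog(flat_file):
    """Read a tab-separated flat file into a dict keyed by the first column."""
    catalog = {}
    for row in csv.reader(flat_file, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        key, *fields = row
        catalog[key] = fields
    return catalog

# Hypothetical three-item catalog standing in for a real flat file on disk.
sample = io.StringIO("1001\twidget\t4.95\n1002\tgadget\t12.50\n1003\tgizmo\t7.25\n")
catalog = load_catalog(sample)
print(catalog["1002"])
```

Every request re-reads the whole file under plain CGI, which is why this approach only holds up while the file stays small.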
A medium-volume site gets ten thousand to a million hits per day. A typical configuration for a medium-volume site is a Sun Ultra or an Intel Pentium Pro machine with 64MB for the operating system and filesystem buffer overhead plus 2 to 4MB per server process. Of course, more memory is better if you can afford it. Such workstation-class machines cost anywhere between $3,000 and $30,000. You should have separate disks for serving content and for logging hits (and consider a separate disk for swap space), but the size of the content disk really depends on how much content you are serving. RAID disks or any other kind of disk array get better random access performance because multiple seeks can happen in parallel. You can increase the number of network interfaces to handle the expected number of hits by simply adding more 10BaseT or 100BaseT cards, up to a limit of about 45 for some Solaris systems. The Apache web server still works fine for medium-volume web sites, but you may want to go to one of the Netscape or other commercial servers for heavier loads, or for formal support or particular security or publishing features.
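The memory rule of thumb above (64MB of overhead plus 2 to 4MB per server process) is easy to turn into a rough estimate. The sketch below is only that; the process counts of 16, 32, and 64 are assumptions chosen for illustration, not figures from the text.

```python
def server_memory_mb(processes, per_process_mb, base_mb=64):
    """Estimate RAM in MB: fixed OS and buffer overhead plus a per-process cost."""
    return base_mb + processes * per_process_mb

# Assumed process counts; the 2MB and 4MB bounds come from the rule of thumb.
for processes in (16, 32, 64):
    low = server_memory_mb(processes, 2)
    high = server_memory_mb(processes, 4)
    print(f"{processes} server processes: {low} to {high} MB")
```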
One million hits per day sounds like a lot, but that’s only about twelve hits per second if it’s spread evenly throughout the day. Even twenty hits per second is within the capacity of most workstations if the site is serving only static HTML and images rather than creating dynamic content. On the other hand, twenty hits per second is a pretty large load from a network capacity point of view. If the average hit is about 10KB, the average load is 10,240 bytes × 8 bits/byte × 12 hits/second = 983,040 bits/second. You might think that a single T1 line at 1,544,000 bits per second can handle one million hits per day, but remember that web traffic is bursty: each HTML page results in an immediate request for all embedded images, applets, and so on, so you should expect frequent peaks of three to five times the average. This means you probably cannot effectively serve a million hits per day from a single T1 line, but you should be able to serve one hundred thousand hits per day.
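The capacity check in this paragraph can be worked through in code. The sketch below uses the paragraph's own figures (10KB per hit, a T1 at 1,544,000 bits per second, peaks of three to five times the average); the function names are hypothetical. It divides by the exact number of seconds in a day instead of rounding up to twelve hits per second, so its average comes out slightly below the figure in the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60
T1_BITS_PER_SECOND = 1_544_000

def average_bits_per_second(hits_per_day, bytes_per_hit=10 * 1024):
    """Average load in bits per second if hits are spread evenly over the day."""
    hits_per_second = hits_per_day / SECONDS_PER_DAY
    return hits_per_second * bytes_per_hit * 8

def fits_on_line(hits_per_day, line_bps=T1_BITS_PER_SECOND, peak_factor=5):
    """True if the line can carry the average load multiplied by the peak factor."""
    return average_bits_per_second(hits_per_day) * peak_factor <= line_bps

for hits in (100_000, 1_000_000):
    avg = average_bits_per_second(hits)
    print(f"{hits} hits/day: average {avg:.0f} bits/s, "
          f"fits at 3x peak: {fits_on_line(hits, peak_factor=3)}, "
          f"fits at 5x peak: {fits_on_line(hits, peak_factor=5)}")
```

Under these assumptions, one million hits per day overruns a T1 even at three times the average, while one hundred thousand hits per day fits with room to spare at five times.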
If your site has a large database access component to it, you’ll probably want to use one of the high-capacity commercial RDBMS systems like Oracle, Informix, or Sybase, which can have price tags in the $10,000 to $20,000 range. You’ll get best performance by keeping database connections open with a connection manager package from the vendor, but you can also write a connection manager yourself or use a TP monitor to manage connections. You probably should not use CGIs at all, but rather servlets, FastCGI, or a server API such as the Apache API, NSAPI, or ISAPI.
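If you do write a connection manager yourself, the core idea is to open a fixed set of connections once and lend them out per request. The following is a minimal sketch of that idea, not any vendor's package: the `ConnectionPool` class is hypothetical, and `sqlite3` is used only as a stand-in for a commercial RDBMS client library so that the example runs on its own.

```python
import queue
import sqlite3
from contextlib import contextmanager

class ConnectionPool:
    """Open a fixed number of connections once and hand them out on request."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self):
        conn = self._idle.get()      # blocks until a connection is free
        try:
            yield conn
        finally:
            self._idle.put(conn)     # return it to the pool instead of closing it

# sqlite3 is only a stand-in here for a commercial RDBMS client library.
opened = []

def connect():
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    opened.append(conn)
    return conn

pool = ConnectionPool(connect, size=2)
for _ in range(5):
    with pool.connection() as conn:
        conn.execute("SELECT 1").fetchone()

print(len(opened))  # number of connections opened to serve five requests
```

The saving comes from skipping connection setup on every request, which is also why a long-lived server process (servlet, FastCGI, or server API module) suits this pattern better than a CGI that starts fresh each time.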
A high-volume site gets more than one million hits per day. There is no standard configuration for such high-volume sites yet, so let’s just consider a few examples.
AltaVista delivered about 20 million searches per day in late 1997, according to http://www.altavista.digital.com/av/content/pr122997.htm, which is very impressive given that each search required some sort of database access. AltaVista is one of the most popular sites on the Web, according to http://www.hot100.com/, and one of the best performing, according to http://www.keynote.com/.
The AltaVista query interface, which is the part you see when you go to http://www.altavista.digital.com/, is served from the site’s web server, which is a set of 3 Digital AlphaStation 500/500s, each with 1GB of RAM and 6GB of disk running Digital Unix. The web server and database are both custom software written in C under Digital Unix, making heavy use of multithreading. But that’s just the front end. The really big iron runs the search engine on the back end.
The AltaVista search engine runs on 16 AlphaServer 8400 5/440s, each with 12 64-bit CPUs, 8GB of RAM, and 300GB of RAID disk. Each holds a portion of the web index and has a response time of under one second. Most queries are satisfied from RAM, minimizing disk access. Note that 8GB of RAM cannot be addressed by a 32-bit machine (2^32 bytes is 4GB), so a 64-bit CPU is a necessity rather than a luxury.
AltaVista’s Internet connection in Palo Alto is 100Mbps to UUNet plus two other connections of unspecified speed, to BBN and Genuity. The whole thing is mirrored at five sites distributed around the world.
See http://www.altavista.digital.com/av/content/about_our_technology.htm for more information on their site.
As of the end of 1997, Netscape was getting over 135 million hits per day, which is over 1000 HTTP connections per second. There are probably peaks of three to five times the average, or 3000 to 5000 hits per second. If you were to direct a click to /dev/audio for every hit on their servers, you’d hear a rather high-pitched whine. This is impressive.
Netscape’s web site runs on more than 100 servers, representing every major server hardware manufacturer. A little test showed that hits on home.netscape.com or www.netscape.com are indeed redirected to one of 102 servers with names following the pattern www#.netscape.com. Here’s a one-line Perl script that will show you the IP addresses for the 102 machines, with some help from the Linux nsquery program:
% perl -e 'for ($i=1; $i<103; $i++) { print `nsquery www$i.netscape.com`; }'
Telnetting to port 80 of one of the machines shows that it’s running Netscape Enterprise 2.01. The web site itself contains a little more information about the setup, but not much detail. There are four T3 lines, one going to each gateway machine, and all of the gateways are connected by a 100Mbps FDDI ring. There are mirrors of the main site in Europe and Australia. Some machines are hosted at Globalcenter, http://www.globalcenter.net/, one of several large providers of space for web servers and high speed connectivity. See http://home.netscape.com/site/.
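The telnet check mentioned above amounts to sending a short request and reading the `Server` header from the reply. Here is a rough sketch of doing the same thing programmatically; the function names are hypothetical, and the canned response is illustrative only, since the original Netscape hosts may no longer answer as they did.

```python
import socket

def parse_server_header(response_text):
    """Return the value of the Server header from raw HTTP response text, or None."""
    for line in response_text.split("\r\n")[1:]:   # skip the status line
        if not line:
            break  # a blank line ends the headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "server":
            return value.strip()
    return None

def fetch_server_header(host, port=80, timeout=5.0):
    """Send a HEAD request, as you would by hand over telnet, and read the Server header."""
    request = f"HEAD / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break  # server closed the connection
            chunks.append(data)
    return parse_server_header(b"".join(chunks).decode("latin-1"))

# Canned response for illustration; no network access is needed to run this part.
sample = ("HTTP/1.0 200 OK\r\n"
          "Server: Netscape-Enterprise/2.01\r\n"
          "Content-Type: text/html\r\n\r\n")
print(parse_server_header(sample))
```

Calling `fetch_server_header` with a live host name performs the real lookup. Keep in mind that some servers omit or disguise the `Server` header.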
Sun Microsystems’ web site at http://www.sun.com/ gets more than 2 million hits per day, making it about the 30th-busiest site on the web. The web server is a pair of UltraServer 1/170 systems, each with a 167MHz UltraSPARC processor, 256MB of memory, and a SPARCstorage Array. The two systems are in different physical locations for reliability, and round-robin DNS is used to split the load between them. The OS is Solaris 2.6 and the server software is NS Enterprise/2.01. Internet connectivity is through a T3 from BBN to the local router, which is connected via 100Mbps Ethernet to the web server. See http://www.sun.com/sun-on-net/.
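Round-robin DNS splits load by rotating the order of the addresses returned for a name, on the expectation that most clients connect to the first one. The sketch below simulates only that rotation; it is not a DNS server and says nothing about how Sun configured theirs. The class name is hypothetical and the addresses come from the range reserved for documentation.

```python
from collections import Counter, deque

class RoundRobinName:
    """Simulate a name whose address list rotates on every lookup."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def resolve(self):
        """Return the address list, then rotate so the next query starts one later."""
        answer = list(self._addresses)
        self._addresses.rotate(-1)
        return answer

# Documentation-range addresses (RFC 5737) standing in for the two web servers.
name = RoundRobinName(["192.0.2.10", "192.0.2.20"])
first_choices = Counter(name.resolve()[0] for _ in range(1000))
print(first_choices)
```

In practice the split is less even than this simulation suggests, because resolvers cache answers and one cached answer may serve many clients.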
There is a list of the 100 most trafficked web sites at http://www.hot100.com/. The data comes largely from analysis of proxy server logs. The site rankings vary over time, but the following sites are usually included:
AltaVista
AOL
City.Net
CNET
CNN
Excite
Geocities
Magellan
Microsoft
Netscape
Pathfinder
Starwave
Warner Bros.
Yahoo!
There is a list of Internet access providers and server software used for 40 big companies at http://www.keynote.com/measures/business/business40.html. You can figure out all of this information for yourself by using traceroute and telnetting to port 80 of well-known servers, but Keynote has been kind enough to publish the results of their own research. For the 40 sites they publish data on, the most popular Internet providers are UUNET, BBN, and MCI, and the most popular web servers are Netscape Enterprise and Apache.