Chapter 22. Summarizing Web Pages with HTML::Summary
Canon, like many other large companies, is a multinational organization with multiple web sites, each managed by a different part of the company. This is a problem for the typical Canon customer, who knows nothing about Canon’s internal organization and simply wants to find information about their cameras or download a new printer driver. They need a single clear way to find what they want.
CS-Web: A Search Engine for Canon’s Web Space
Back in 1997, we wrote CS-Web, a set of Perl programs to collect information from all of Canon’s web sites, index it, and make it searchable from the web. We wrote our own solution because at the time the available products were either services designed for searching the entire web (such as AltaVista) or tools for indexing and searching a single web site.
CS-Web consists of a robot, a database, and a web interface (written in mod_perl). The robot traverses all of Canon’s web sites and stores a description of each page in the database. The search engine queries the database and gives you a list of candidate documents and their descriptions. You can try CS-Web for yourself: it is linked from the main “gateway” page for Canon (http://www.canon.com/). You can also access it directly at http://csweb.cre.canon.co.uk/.
CS-Web presented a variety of challenges, many of which make suitable war stories for TPJ. However, for this article, we will focus on one crucial problem: generating the ...