The code presented here is derived from the perlindex script (available on CPAN), rewritten to emphasize clarity over speed. The scripts are self-contained and useful for simple applications; I have reused them myself a couple of times when the full power of WAIT wasn’t required. (WAIT is a Perl implementation of the WAIS information retrieval system, also available on CPAN.)
Information retrieval—the science of matching documents to users—depends heavily on relevance: identifying when a document matches a user’s needs. Relevance is something that only users can assess; there’s no surefire way to compute it. IR researchers detest SQL-style document retrieval, because true IR systems take the users into account; they’re rated by their ability to fulfill users’ needs, not by the speed at which they process SQL statements. Good IR systems are like Perl—designed with the human being in mind.
In this article, we’ll develop a simple IR application: retrieving appropriate documents from a set of online manuals. Since our knowledge of the user’s need is inevitably incomplete and the collection of documents is limited, retrieval is doomed to some fuzziness.
Figure 25-1 depicts a generic IR system. On the left, knowledge in the real world is incorporated into a document, which in turn is transformed into some representation usable by the system. On the right, a user’s need is expressed in a query language that is also transformed ...