Chapter 25. Information Retrieval

Ulrich Pfeifer

Tip

The code presented here is derived from the perlindex script (available on CPAN), rewritten to emphasize clarity over speed. The scripts are self-contained and useful for simple applications; I have reused them myself a couple of times when the full power of WAIT wasn’t required. (WAIT is a Perl implementation of the WAIS information retrieval system, also available on CPAN.)

Information retrieval—the science of matching documents to users—depends heavily on relevance: identifying when a document matches a user’s needs. Relevance is something that only users can assess; there’s no surefire way to compute it. IR researchers detest SQL-style document retrieval, because true IR systems take the users into account; they’re rated by their ability to fulfill users’ needs, not by the speed at which they process SQL statements. Good IR systems are like Perl—designed with the human being in mind.

In this article, we’ll develop a simple IR application: retrieving appropriate documents from a set of online manuals. Since our knowledge about the user’s need is inevitably incomplete, and the collection of documents is limited, the retrieval is doomed to some fuzziness.[14]

Figure 25-1 depicts a generic IR system. On the left, knowledge in the real world is incorporated into a document, which in turn is transformed into some representation usable by the system. On the right, a user’s need is expressed in a query language that is also transformed ...
