Chapter 6. Information Retrieval
In the previous chapter we came across common words that made it difficult to characterize a corpus. This is a problem for many kinds of NLP tasks. Fortunately, the field of information retrieval has developed many techniques that can be used to improve a variety of NLP applications.
Earlier, we talked about how much text data already exists and how more is being generated every day. We need some way to manage and search through this data. If there is an ID or title, we can of course build an index on it, but how do we search by content? With structured data, we can create logical expressions and retrieve all rows that satisfy the expressions. This can also be done with text, though less exactly.
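To make the contrast concrete, here is a minimal sketch in plain Python. The records, field names, and the whitespace tokenizer are all hypothetical and chosen only for illustration. It applies a logical expression first to a structured field and then to the tokens of the text.

```python
# Hypothetical toy records: each has a structured field and free text.
records = [
    {"id": 1, "year": 2019, "text": "Spark makes distributed processing simple"},
    {"id": 2, "year": 2021, "text": "Information retrieval predates computers"},
    {"id": 3, "year": 2021, "text": "Search engines rank documents by relevance"},
]


def tokens(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return set(text.lower().split())


# Structured query: an exact logical expression over a field.
structured_hits = [r["id"] for r in records if r["year"] == 2021]

# Content query: the same kind of expression, but over tokens.
# Here: contains "documents" AND NOT "spark".
content_hits = [
    r["id"]
    for r in records
    if "documents" in tokens(r["text"]) and "spark" not in tokens(r["text"])
]

# The inexactness shows up quickly: the singular form matches nothing.
missed = [r["id"] for r in records if "document" in tokens(r["text"])]

print(structured_hits)  # [2, 3]
print(content_hits)     # [3]
print(missed)           # []
```

The structured filter either matches a row or it does not. The content filter depends on how the text was tokenized and on the exact word forms used, which is one sense in which searching text is less exact.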
The foundation of information retrieval predates computers. Information retrieval focuses on how to find specific pieces of information in a larger set of information, especially information in text data. The most common type of task in information retrieval is search, or more specifically, document search.
The following are the components of a document search:
- Query
A logical statement describing the document or kind of document you are looking for
- Query term
A term in the query, generally a token
- Corpus of documents D
A collection of documents
- Document
A document in D, with terms t_d that describe the document
- Ranking function
A function that ranks the documents in D according to relevance to the query
- Result
The ranked list of documents
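The components above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the corpus contents, the names `q`, `R`, `tokenize`, and `rank`, and the scoring rule (counting occurrences of query terms) are all hypothetical, not a method prescribed by this chapter.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; stands in for a real tokenizer."""
    return text.lower().split()


# Corpus D: a hypothetical collection of documents keyed by ID.
D = {
    "d1": "the cat sat on the mat",
    "d2": "dogs and cats make good pets",
    "d3": "the stock market fell sharply today",
}


def rank(q, D):
    """Toy ranking function: score each document by how often the
    query terms occur in it, and return matching IDs, best first."""
    query_terms = set(tokenize(q))
    scores = {}
    for doc_id, text in D.items():
        t_d = Counter(tokenize(text))  # terms describing the document
        score = sum(t_d[term] for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Sort by descending score, breaking ties by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


q = "the cat"   # the query
R = rank(q, D)  # the result: a ranked list of documents
print(R)        # ['d1', 'd3']
```

Note that `d3` appears in the result only because it contains the common word "the", while `d2` is missed because "cats" does not match "cat" exactly. Even this toy ranking runs into the common-word problem mentioned at the start of the chapter.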