Chapter 6. Information Retrieval

In the previous chapter we came across common words that made it difficult to characterize a corpus. This is a problem for many kinds of NLP tasks. Fortunately, the field of information retrieval has developed many techniques that can be used to improve a variety of NLP applications.

Earlier, we talked about how much text data already exists and how more is being generated every day. We need some way to manage and search through this data. If there is an ID or title, we can of course build an index on that field, but how do we search by content? With structured data, we can create logical expressions and retrieve all rows that satisfy them. This can also be done with text, though less exactly.
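
To make the contrast concrete, here is a minimal sketch in plain Python. The rows, documents, field names, and the `terms` helper are all hypothetical and chosen only for illustration; splitting on whitespace is an assumed, deliberately naive way to get terms.

```python
# Structured data: rows with typed fields, filtered by an exact logical expression.
rows = [
    {"id": 1, "year": 2019, "pages": 120},
    {"id": 2, "year": 2021, "pages": 340},
    {"id": 3, "year": 2021, "pages": 95},
]
matching_rows = [r for r in rows if r["year"] == 2021 and r["pages"] > 100]

# Text data: the analogous expression tests whether terms are present.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "dogs chase cats",
    "d3": "a cat and a dog",
}


def terms(text):
    """Hypothetical helper: lowercase the text and split it on whitespace."""
    return set(text.lower().split())


# Retrieve documents that contain both "cat" AND "dog".
matching_docs = [
    doc_id
    for doc_id, text in docs.items()
    if "cat" in terms(text) and "dog" in terms(text)
]
```

Here `matching_rows` holds only the row with ID 2, and `matching_docs` holds only `d3`. Document `d2` is about both animals but is missed because "cats" and "dogs" are not the same strings as "cat" and "dog", which is one way that logical expressions over text are less exact than over structured fields.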

The foundations of information retrieval predate computers. Information retrieval focuses on how to find specific pieces of information in a larger set of information, especially information in text data. The most common task in information retrieval is search, and more specifically document search.

The following are the components of a document search:

Query q

A logical statement describing the document or kind of document you are looking for

Query term q_t

A term in the query, generally a token

Corpus of documents D

A collection of documents

Document d

A document in D with terms t_d that describe the document

Ranking function r(q, D)

A function that ranks the documents in D according to relevance to the query q

Result R

The ranked list of documents
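
The components above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the query is simplified to a string of terms rather than a full logical statement, the `tokenize` and `rank` names are hypothetical, and the score (how often the query terms occur in a document) is a toy stand-in for a real ranking function.

```python
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase the text and split it on whitespace."""
    return text.lower().split()


def rank(query, corpus):
    """Toy r(q, D): order document IDs by how often the query terms occur."""
    query_terms = tokenize(query)  # the query terms q_t
    scored = []
    for doc_id, text in corpus.items():
        term_counts = Counter(tokenize(text))  # the document terms t_d
        score = sum(term_counts[t] for t in query_terms)
        if score > 0:  # leave out documents that match no query term
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document ID for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Corpus D: a collection of documents keyed by ID.
D = {
    "d1": "spark is a cluster computing engine",
    "d2": "spark nlp is an nlp library built on spark",
    "d3": "pandas is a data analysis library",
}

q = "spark nlp"  # query q
R = rank(q, D)   # result R, the ranked list of documents
```

With this corpus, `R` is `["d2", "d1"]`: `d2` mentions both query terms twice, `d1` mentions only "spark" once, and `d3` matches neither term and is dropped.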

Before ...
