Chapter 5. HOW DOES A SEARCH ENGINE WORK

"Every once in a while a revolutionary product comes along that changes everything. One is very fortunate if you get to work on just one of these in your career."

Steve Jobs, Cofounder of Apple

This chapter describes the nuts and bolts of how a search engine works and how it is evaluated. We detail how the content relevance of web pages is measured, how the link structure of the web is used to measure the authority of web pages (with emphasis on Google's PageRank), and how popularity measures can be used to improve the quality of search.

CHAPTER OBJECTIVES

  • Discuss the issue of the relevance of a search result to a query, and how we might measure this relevance.

  • Explain how the indexer processes web pages before adding words and updating their posting lists in the inverted index.

  • Explain how search engines process stop words, and discuss the issues related to stemming words.

  • Introduce the notion of term frequency (TF) as a measure of content relevance.

  • Discuss the activity of search engine optimization (SEO) and the problem of search engine spam.

  • Introduce Luhn's argument that the words that best discriminate between documents are in the mid-frequency range.

  • Introduce Zipf's law regarding the frequency of occurrences of words in texts.

  • Introduce the notion of inverse document frequency (IDF) as a measure of content relevance.

  • Explain how TF–IDF (term frequency–inverse document frequency) is computed for a web page with respect to ...
