Search Engine Anatomy
When I talk about important search engines, I am really talking about the “big three”: Google, Bing, and Yahoo! Search. At the time of this writing, each of these search engines uses its own search technology.
Web spider–based search engines usually comprise three key components: the so-called web spider, a search or query interface, and underlying indexing software (an algorithm) that determines rankings for particular search keywords or phrases.
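To make the three components concrete, here is a toy sketch in Python. It is not how any real search engine works: the `PAGES` dictionary stands in for content a spider has already fetched, `build_index` plays the role of the indexing software, and `search` is a bare-bones query interface. All names, URLs, and the ranking rule (a simple count of keyword occurrences) are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical "crawled" pages: URL -> text a spider might have fetched.
PAGES = {
    "http://example.com/a": "search engines rank web pages",
    "http://example.com/b": "spiders crawl web pages and more web pages",
    "http://example.com/c": "cooking recipes",
}


def build_index(pages):
    """Indexing step: map each word to a Counter of {url: occurrences}."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] += 1
    return index


def search(index, query):
    """Query step: score each URL by the summed occurrences of the query words."""
    scores = Counter()
    for word in query.lower().split():
        # .get() avoids creating empty entries for unknown words.
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties are broken by URL so the order is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked]


INDEX = build_index(PAGES)
print(search(INDEX, "web pages"))
```

Page `b` mentions both query words twice, so it ranks ahead of page `a`; page `c` contains neither word and is not returned. Real ranking algorithms weigh far more signals than raw keyword counts.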
Spiders, Robots, Bots, and Crawlers
The terms spider, robot, bot, and crawler refer to the same thing: an automated program that traverses the Internet so that its search engine can index as many websites, and their associated web documents, as possible.
Not all spiders are “good.” Rogue web spiders come and go as they please, and can scrape your content from areas you want to block. Good, obedient spiders conform to the Robots Exclusion Protocol (REP), which we will discuss in Chapter 9.
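As a rough illustration of what “obedient” means, the sketch below uses Python's standard `urllib.robotparser` module to check a URL against a set of robots.txt rules before fetching it. The rules in `ROBOTS_TXT`, the `ExampleBot` user agent, and the example.com URLs are all made up for this example; a rogue spider simply skips this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every spider is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is an invented spider name used only for this demonstration.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/private/page.html"))
```

The first call reports that the home page may be fetched; the second reports that the page under `/private/` may not. Nothing enforces the answer, which is why REP only protects you from spiders that choose to honor it.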
Web spiders in general, just like regular users, can be tracked in your web server logs or your web analytics software. For more information on web server logs, see Chapters 6 and 7.
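One simple way to see spider activity is to look at the user-agent field in your access log. The sketch below assumes the log uses the common “combined” format, in which the user agent is the last quoted field on each line. The log lines in `SAMPLE_LOG` are invented, and `BOT_TOKENS` is a short, hand-picked list of substrings that the big three's spiders are known to include in their user-agent strings; a real list would be much longer.

```python
import re
from collections import Counter

# In the combined log format, the user agent is the last quoted field on the line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Lowercase substrings to look for; an illustrative list, not an exhaustive one.
BOT_TOKENS = ("googlebot", "bingbot", "slurp")

# Invented log lines, used only to demonstrate the parsing.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2009:13:55:36 -0700] "GET /robots.txt HTTP/1.1" 200 68 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.2 - - [10/Oct/2009:13:56:01 -0700] "GET /index.html HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (Windows NT 6.1) Firefox/3.5"',
    '192.0.2.3 - - [10/Oct/2009:13:57:12 -0700] "GET /sitemap.xml HTTP/1.1" 200 900 '
    '"-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
]


def count_spider_hits(lines):
    """Count requests per spider token, based on each line's user-agent field."""
    hits = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # Line is not in the expected format; skip it.
        agent = match.group(1).lower()
        for token in BOT_TOKENS:
            if token in agent:
                hits[token] += 1
                break  # Count each request once.
    return hits


print(dict(count_spider_hits(SAMPLE_LOG)))
```

Keep in mind that a user-agent string is supplied by the client and can be spoofed, so counts like these show what visitors claim to be rather than a verified identity.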
Web spiders crawl not only web pages, but also many other files, including robots.txt, sitemap.xml, and so forth. There are many web spiders. For a list of known web spiders, see http://www.user-agents.org/.
These spiders visit websites randomly. Depending on the freshness and size of your website’s content, ...