CRAWLING

The crawling process consists of several components. First, search engines have to learn about the pages. Then, they need to have enough time during the period they’ve allocated to the site to crawl those pages. Finally, they have to be able to technically access the pages.

Discovery

The first step in crawling is discovery. How does the engine find out about pages on the Web? Generally, this happens in one of the following ways:

  • By finding links to the pages from other sites on the Web
  • By finding links to the pages from within your site
  • By reading an XML Sitemap that lists the pages
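Of the three discovery routes above, the XML Sitemap is the one a site owner controls most directly. As a rough illustration only, here is a minimal Python sketch that builds a Sitemap in the sitemaps.org 0.9 format. The `build_sitemap` helper and the example.com URLs are hypothetical, invented for this example. A real Sitemap may also carry optional fields such as a last-modified date for each URL.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org Sitemap protocol (version 0.9).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a Sitemap XML string listing the given page URLs.

    Hypothetical helper for illustration; it emits only the required
    <loc> element for each URL.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url_element = ET.SubElement(urlset, "url")
        ET.SubElement(url_element, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs, assumed for the example.
example_pages = [
    "https://www.example.com/",
    "https://www.example.com/products/",
]
sitemap_xml = build_sitemap(example_pages)
print(sitemap_xml)
```

The resulting file is typically placed on the site itself and submitted to the search engines so that they learn about pages they might not reach through links alone.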

Crawl Allocation

Search engine bots do not have the resources to crawl every page on the Web, and they are also mindful of not overcrawling a site and placing an undue burden on its server. For these reasons, search engine bots spend only a limited period crawling each site. The following factors can aid in a more comprehensive crawl:

  • Fast server response times: If the server is slow to respond to requests, the search engine bots may slow down their crawl to ensure they aren’t overloading the server.
  • Fast page load times: The faster each page loads (server-side), the more pages search engines will likely be able to crawl during their allocated crawl period. You can monitor page load times for Google in Google Webmaster Tools. If you see a spike in page load times when you haven’t made significant changes to your site (such as adding substantial multimedia content), there may be a problem with the server (Figure 7.1).
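To make the relationship between load time and crawl coverage concrete, here is a simplified Python sketch. It assumes a bot fetches pages one at a time within a fixed time budget, which is an illustrative model and not a description of how any search engine actually allocates its crawl. The function names, the 60-second budget, and the "twice the usual load time" spike threshold are all hypothetical values chosen for the example.

```python
from statistics import median


def pages_crawlable(budget_seconds, seconds_per_page):
    """Estimate how many pages fit in a crawl budget.

    Assumes sequential fetches and a constant load time per page
    (a deliberate simplification).
    """
    if seconds_per_page <= 0:
        raise ValueError("seconds_per_page must be positive")
    return int(budget_seconds // seconds_per_page)


def is_load_time_spike(samples, factor=2.0):
    """Flag the latest sample if it exceeds `factor` times the median
    of the earlier samples. The default factor is an arbitrary choice."""
    if len(samples) < 2:
        return False
    baseline = median(samples[:-1])
    return samples[-1] > factor * baseline


# The same hypothetical 60-second budget covers four times as many
# pages when each page loads in 0.5 seconds instead of 2 seconds.
fast_site_pages = pages_crawlable(60, 0.5)
slow_site_pages = pages_crawlable(60, 2.0)

# Daily average load times in seconds (made-up data); the last value
# jumps well above the earlier ones.
recent_load_times = [1.0, 1.1, 0.9, 1.0, 3.2]
spike_detected = is_load_time_spike(recent_load_times)

print(fast_site_pages, slow_site_pages, spike_detected)
```

A spike flagged this way is only a prompt to investigate. As noted above, it matters most when it appears without a corresponding change to the site's content.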

