So far, we have implemented two simple crawlers that take advantage of the structure of our sample website to download all published countries. These techniques should be used when available, because they minimize the number of web pages to download. However, for other websites, we need to make our crawler act more like a typical user and follow links to reach the interesting content.
We could simply download the entire website by following every link. However, this would likely download many web pages we don't need. For example, to scrape user account details from an online forum, only account pages need to be downloaded and not discussion threads. The link crawler we use in this chapter will use regular expressions to determine ...