2.3.1 Crawlers and scrapers
A crawler is an algorithm that generates a sequence of web pages that may be searched for news content. The word crawler signifies that the algorithm begins at some web page and then branches out to other pages from there (i.e., it “crawls” around the web). The algorithm must make intelligent choices among all the pages it might visit. One common approach is to move to a page that is linked to (i.e., hyperlinked) from the current page. Essentially, a crawler explores the tree emanating from any given node, using heuristics to determine relevance along any path, and then chooses which paths to focus on. Crawling algorithms have become increasingly sophisticated (see Edwards, McCurley, and Tomlin, 2001).
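The branching behavior described above can be sketched as a traversal over a link graph. The following Python sketch is purely illustrative: the graph, the URLs, and the relevance heuristic are hypothetical stand-ins for real fetched pages, and a production crawler would fetch and parse live pages instead of reading a dictionary.

```python
from collections import deque

def crawl(start, links, is_relevant, max_pages=10):
    """Breadth-first crawl over a link graph.

    links: dict mapping a URL to the URLs it points to (a toy web)
    is_relevant: heuristic deciding whether to branch out from a page
    """
    seen = {start}
    frontier = deque([start])
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        if not is_relevant(url):
            continue  # prune this branch of the tree
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return visited

# Hypothetical link graph, for illustration only
graph = {
    "a.com": ["a.com/news", "a.com/about"],
    "a.com/news": ["a.com/news/1", "a.com/news/2"],
    "a.com/about": [],
}
# Heuristic: do not branch out from "about" pages
pages = crawl("a.com", graph, is_relevant=lambda u: "about" not in u)
```

The heuristic here only decides whether to expand a page's outgoing links; visited pages are retained either way, which mirrors the idea of exploring the tree while focusing effort on promising paths.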
A web scraper downloads the content of a chosen web page and may or may not format it for analysis. Almost all programming languages contain modules for web scraping. These built-in functions open a channel to the web and then download user-specified (or crawler-specified) URLs. The growing statistical analysis of web text has led most statistical packages to include built-in web-scraping functions. For example, R, a popular open-source environment for technical computing, has web scraping built into its base distribution. To download a page into a vector of lines, we use a single-line command, such as the one below, which reads my web page:
> text = readLines("http://algo.scu.edu/~sanjivdas/") ...
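The same one-line download can be approximated with Python's standard library. The sketch below is an illustrative analogue of R's readLines, not part of the R example itself; the live fetch is shown commented out because it requires network access, and the helper function simply decodes the raw bytes and splits them into lines.

```python
from urllib.request import urlopen

def to_lines(raw, encoding="utf-8"):
    """Decode raw page bytes and split into a list of lines,
    analogous to R's readLines on a URL."""
    return raw.decode(encoding, errors="replace").splitlines()

# Fetching a live page (requires network access):
# raw = urlopen("http://algo.scu.edu/~sanjivdas/").read()
# text = to_lines(raw)
```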