February 2018
Beginner to intermediate
364 pages
10h 32m
English
The code begins by defining a spider class that subclasses CrawlSpider, along with its name and start URL:
class PaginatedSearchResultsSpider(CrawlSpider):
    name = "paginationscraper"
    start_urls = ["http://localhost:5001/pagination/page1.html"]
Next, the rules attribute is defined, which tells Scrapy how to find the links to follow on each page. This code uses the XPath discussed earlier to locate the Next link. Scrapy applies the rule to every page it crawls to find the next page, and queues that request for processing after the current page. The callback parameter tells Scrapy which method to call to process each page that is found, in this case parse_result_page:
rules = (
    # Extract links for next pages
    Rule(LinkExtractor(allow=(), restrict_xpaths ...