January 2020
The last crawler component that we will be examining is the link extractor. It scans retrieved HTML documents and attempts to identify and extract all links present inside.
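To make the idea concrete, here is a minimal sketch of such a scan in Go. It assumes a single, deliberately naive regular expression over quoted `href` attributes; the names `hrefRegex` and `extractLinks` are hypothetical and chosen only for illustration, not taken from the crawler's actual code.

```go
package main

import (
	"fmt"
	"regexp"
)

// hrefRegex is a deliberately naive pattern (an assumption for this sketch):
// it matches href attributes whose value is wrapped in double or single quotes.
var hrefRegex = regexp.MustCompile(`(?i)href\s*=\s*(?:"([^"]*)"|'([^']*)')`)

// extractLinks returns the raw href values found in an HTML document,
// in order of appearance. Empty href values are skipped.
func extractLinks(html string) []string {
	var links []string
	for _, m := range hrefRegex.FindAllStringSubmatch(html, -1) {
		// m[1] holds a double-quoted value, m[2] a single-quoted one.
		link := m[1]
		if link == "" {
			link = m[2]
		}
		if link != "" {
			links = append(links, link)
		}
	}
	return links
}

func main() {
	doc := `<a href="https://example.com/a">A</a> <a href='/b'>B</a>`
	fmt.Println(extractLinks(doc)) // prints [https://example.com/a /b]
}
```

Note that this sketch returns links exactly as they appear in the document, so relative links such as `/b` are left unresolved, and the pattern would also match attributes like `data-href`.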
Unfortunately, link extraction is not a trivial task. While the majority of links can be extracted with a handful of regular expressions, a few edge cases require additional logic on our end, as in the following examples: