January 2020
Intermediate to advanced
640 pages
16h 56m
English
A naive crawler implementation would attempt to retrieve any links that are provided as input to it. But as we all know, the web is home to all sorts of content, ranging from plain-text and HTML documents to images, music, videos, and a wide variety of other binary data (for example, archives, ISOs, and executables).
You would probably agree that attempting to download items that cannot be processed by the search engine would not only waste resources but also incur additional running costs for the operator of the Links 'R' Us service: us! Consequently, excluding such content from the crawler is a sensible cost-reduction strategy.
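One inexpensive way to exclude such content is to inspect a link's file extension before fetching it. The following Go sketch illustrates the idea; the extension list and the `retainLink` helper are hypothetical names for illustration, not part of the book's actual implementation:

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// binaryExtensions is a (deliberately small, illustrative) set of
// extensions whose content the search engine cannot index.
var binaryExtensions = map[string]bool{
	".jpg": true, ".png": true, ".gif": true,
	".mp3": true, ".mp4": true, ".avi": true,
	".zip": true, ".iso": true, ".exe": true,
}

// retainLink reports whether a link is worth downloading. Links that
// fail to parse are dropped; otherwise the decision is based purely on
// the extension of the URL's path component.
func retainLink(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false // unparsable links are discarded outright
	}
	ext := strings.ToLower(path.Ext(u.Path))
	return !binaryExtensions[ext]
}

func main() {
	for _, link := range []string{
		"https://example.com/articles/go-crawlers.html",
		"https://example.com/downloads/release.iso",
	} {
		fmt.Printf("%s -> keep=%v\n", link, retainLink(link))
	}
}
```

Note that an extension check is only a first line of defense: a production crawler would typically also issue a HEAD request and inspect the `Content-Type` header, since many URLs carry no extension at all.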
This is where the link filter component comes into play. Before ...