Breadth-first crawling
Breadth-first crawling gives priority to finding new domains and spreading out as far as possible, rather than continuing through a single domain in a depth-first manner.
Writing a breadth-first crawler is left as an exercise for the reader, based on the information provided in this chapter. It is not very different from the depth-first crawler in the previous section, except that it should prioritize URLs that point to domains that have not been seen before.
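As a starting point for that exercise, the prioritization rule can be sketched in Python. This is only one possible approach, not the chapter's own implementation: it keeps two queues and always serves URLs whose domain has not been visited yet before falling back to already-seen domains. The names `breadth_first_crawl`, `get_links`, and `max_pages` are hypothetical; `get_links` stands in for whatever function fetches a page and returns the URLs it links to.

```python
from collections import deque
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host part of a URL (empty string if there is none)."""
    return urlsplit(url).hostname or ""


def breadth_first_crawl(seed, get_links, max_pages=100):
    """Visit pages starting at `seed`, preferring domains not yet visited.

    `get_links` is a hypothetical callable: it takes a URL and returns an
    iterable of URLs found on that page. Returns the URLs in visit order.
    """
    new_domain_queue = deque([seed])   # URLs whose domain looked new when queued
    seen_domain_queue = deque()        # URLs on domains already visited
    seen_urls = {seed}
    seen_domains = set()
    visited = []

    def next_url():
        # Serve unseen domains first. A URL may have been queued as "new"
        # before a sibling on the same domain was visited, so re-check here
        # and demote it if its domain has been visited in the meantime.
        while new_domain_queue:
            url = new_domain_queue.popleft()
            if domain_of(url) in seen_domains:
                seen_domain_queue.append(url)
            else:
                return url
        return seen_domain_queue.popleft() if seen_domain_queue else None

    while len(visited) < max_pages:
        url = next_url()
        if url is None:
            break
        seen_domains.add(domain_of(url))
        visited.append(url)
        for link in get_links(url):
            if link in seen_urls:
                continue
            seen_urls.add(link)
            if domain_of(link) in seen_domains:
                seen_domain_queue.append(link)
            else:
                new_domain_queue.append(link)
    return visited
```

Passing the link extractor in as a parameter keeps the ordering logic separate from the networking code, so the same sketch can be exercised against an in-memory dictionary of pages before it is pointed at real sites.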
There are a couple of things to keep in mind. If you don't set a maximum limit, you could end up crawling petabytes of data! You might also choose to ignore subdomains; otherwise, you can enter a site that has infinite subdomains ...
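Both precautions can be illustrated with a small Python sketch. It is an assumption-laden example rather than a recommended design: `site_key` and `CrawlBudget` are hypothetical names, and collapsing a host to its last two labels is a naive shortcut that gives wrong answers for suffixes such as `co.uk` and for IP addresses, where a public suffix list would be needed.

```python
from collections import Counter
from urllib.parse import urlsplit


def site_key(url, labels=2):
    """Collapse subdomains by keeping only the last `labels` host labels.

    Naive assumption: the registrable domain is the last two labels, so
    a.b.example.com and example.com share the key "example.com".
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-labels:])


class CrawlBudget:
    """Hypothetical helper enforcing a global cap and a per-site cap."""

    def __init__(self, max_pages, max_pages_per_site):
        self.max_pages = max_pages
        self.max_pages_per_site = max_pages_per_site
        self.total = 0
        self.per_site = Counter()

    def allow(self, url):
        """Return True and count the URL if both limits still permit it."""
        key = site_key(url)
        if self.total >= self.max_pages:
            return False
        if self.per_site[key] >= self.max_pages_per_site:
            return False
        self.total += 1
        self.per_site[key] += 1
        return True
```

A crawler would call `allow(url)` just before fetching each page and skip the URL when it returns `False`. Because every subdomain maps to the same key, a site that generates endless subdomains still exhausts a single per-site budget.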