Chapter 12. Avoiding Scraping Traps
There are few things more frustrating than scraping a site, viewing the output, and not seeing the data that’s so clearly visible in your browser. Or submitting a form that should be perfectly fine but gets denied by the web server. Or getting your IP address blocked by a site for unknown reasons.
These are some of the most difficult bugs to solve, not only because they can be so unexpected (a script that works just fine on one site might not work at all on another, seemingly identical, site), but also because they deliberately come with no telltale error messages or stack traces to work from. You’ve been identified as a bot and rejected, and you don’t know why.
Regardless of how immediately useful the material in this chapter is to you at the moment, I highly recommend you at least skim it. You never know when it might help you solve a very difficult bug or prevent a problem altogether.
A Note on Ethics
In the first few chapters of this book, I discussed the legal gray area that web ...