Instant Web Scraping with Java by Ryan Mitchell

Going undercover (Intermediate)

Not every website is as welcoming to scraping as Wikipedia. Many sites check whether the client is actually a web browser (or at least claims to be one) before sending any data. In this recipe, we will learn how to get past this check (while making sure to comply with the site's Terms of Service) in order to get the desired data from a website.

Getting ready

Web servers can tell which browser you are using by inspecting the HTTP header information sent with every request you make for a web page.

HTTP header information looks like this:

Host: www.google.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 ...
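As a rough illustration of how a Java program can send its own User-Agent header, here is a minimal sketch using the standard java.net.HttpURLConnection class. The class name BrowserHeaders, the helper method open, and the placeholder URL and User-Agent string are assumptions made for this example and are not taken from the recipe itself.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper class; the name is an assumption for illustration only.
class BrowserHeaders {

    // Prepares (but does not yet send) an HTTP request that carries the
    // given User-Agent header in place of Java's default one.
    static HttpURLConnection open(String url, String userAgent) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        // Request headers must be set before the connection is established.
        conn.setRequestProperty("User-Agent", userAgent);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder values: substitute the site you are allowed to scrape
        // and the full User-Agent string of a real browser.
        HttpURLConnection conn =
                open("http://example.com/", "Mozilla/5.0 (placeholder browser string)");
        // Echo the header that would be sent; no network traffic happens here.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

No request is actually sent until the program calls a method such as getInputStream() or getResponseCode() on the returned connection, so the header can be inspected or adjusted beforehand.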