One of the first questions that comes up when you start to move beyond the basics of web scraping is: “How do I access information behind a login screen?” The web is increasingly moving toward interaction, social media, and user-generated content. Forms and logins are an integral part of these types of sites and almost impossible to avoid. Fortunately, they are also relatively easy to deal with.
Until this point, most of our interactions with web servers in our example scrapers have consisted of using HTTP GET to request information. This chapter focuses on the POST method, which pushes information to a web server for storage and analysis.
Forms basically give users a way to submit a POST request that the web server can understand and use. Just as link tags on a website help users format GET requests, HTML forms help them format POST requests. Of course, with a little bit of coding, it is possible to create these requests ourselves and submit them with a scraper.
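To make the idea of creating a POST request ourselves concrete, here is a minimal sketch that uses only the Python core libraries. The URL and the field names (username, password) are hypothetical placeholders, not taken from any real site, and the request is only constructed, never sent, so the example runs without network access.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields and URL, used purely for illustration.
form_fields = {"username": "alice", "password": "secret"}
form_url = "http://example.com/login"

# A typical form body is the fields encoded as key=value pairs joined by "&".
body = urlencode(form_fields).encode("ascii")

# Supplying a data payload makes urllib treat the request as a POST
# instead of the default GET.
request = Request(form_url, data=body)

print(request.get_method())  # POST
print(request.data)          # b'username=alice&password=secret'
```

Passing this object to urllib.request.urlopen would actually submit it. A real form may expect different field names or extra hidden fields, so the form's HTML should be inspected before adapting this sketch.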
Although it’s possible to navigate web forms by using only the Python core libraries, sometimes a little syntactic sugar makes life a lot sweeter. When you start to do more than a basic GET request with urllib, looking outside the Python core libraries can be helpful.
The Requests library is excellent at handling complicated HTTP requests, cookies, headers, and much more. Here’s what Requests creator Kenneth Reitz has to say about Python’s core tools: ...