Chapter 9. Crawling Through Forms and Logins

One of the first questions that comes up when you start to move beyond the basics of web scraping is: “How do I access information behind a login screen?” The Web is increasingly moving toward interaction, social media, and user-generated content. Forms and logins are an integral part of these types of sites and almost impossible to avoid. Fortunately, they are also relatively easy to deal with.

Up until this point, most of our interactions with web servers in our example scrapers have consisted of using HTTP GET to request information. In this chapter, we’ll focus on the POST method, which pushes information to a web server for storage and analysis.

Forms give users a way to submit a POST request that the web server can understand and use. Just as link tags on a website help users format GET requests, HTML forms help them format POST requests. Of course, with a little bit of coding, it is also possible to create these requests ourselves and submit them with a scraper, as the sketch below shows.
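
Here is a minimal sketch of that idea, using only the core urllib libraries. The target URL and the field names (firstname and email) are invented for illustration; in practice you would read them out of the form’s markup:

    import urllib.parse
    import urllib.request

    # Suppose the page contains a form like this (hypothetical markup):
    #   <form method="post" action="http://example.com/process-form">
    #     <input type="text" name="firstname">
    #     <input type="text" name="email">
    #   </form>
    # We can format the same POST request ourselves, with no browser involved.
    data = urllib.parse.urlencode({
        'firstname': 'Ryan',
        'email': 'ryan@example.com',
    }).encode('utf-8')

    # Passing a data argument makes urllib send a POST rather than a GET
    request = urllib.request.Request('http://example.com/process-form', data=data)
    with urllib.request.urlopen(request) as response:
        print(response.read().decode('utf-8'))

The field names and values come straight from the form’s input tags; everything else (encoding the body, switching the request method) is bookkeeping we have to handle ourselves.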

Python Requests Library

Although it’s possible to navigate web forms using only the Python core libraries, sometimes a little syntactic sugar makes life a lot sweeter. When you start to do more than make a basic GET request with urllib, it can help to look outside the Python core libraries.

The Requests library is excellent at handling complicated HTTP requests, cookies, headers, and much more. 
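
To see the difference, here is the same hypothetical form submission from the earlier urllib sketch, rewritten with Requests:

    import requests

    # The same invented form fields and URL as in the urllib example above
    params = {'firstname': 'Ryan', 'email': 'ryan@example.com'}
    response = requests.post('http://example.com/process-form', data=params)
    print(response.text)

Requests takes care of encoding the body, setting the request method and Content-Type header, and decoding the response text, which is exactly the kind of bookkeeping that becomes tedious with urllib alone.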

Here’s what Requests creator Kenneth Reitz has to say about Python’s ...
