Chapter 5. Getting Data off the Web with Python
A fundamental part of the data visualizer’s skill set is getting the right dataset in as clean a form as possible. And more often than not these days, this involves getting it off the Web. There are various ways you can do this, and Python provides some great libraries that make sucking up the data easy.
The main ways to get data off the Web are:
Get a raw data file in a recognized data format (e.g., JSON or CSV) over HTTP
Use a dedicated API to get the data
Scrape the data by getting web pages via HTTP and parsing them locally for the required data
This chapter will deal with these ways in turn, but first let’s get acquainted with the best Python HTTP library out there:
Getting Web Data with the requests Library
As we saw in Chapter 4, the files that are used by web browsers to construct web pages are communicated via the Hypertext Transfer Protocol (HTTP), first developed by Tim Berners-Lee. Getting web content in order to parse it for data involves making HTTP requests.
Negotiating HTTP requests is a vital part of any general-purpose language, but getting web pages with Python used to be a rather irksome affair. The venerable urllib2 library was hardly user-friendly, with a very clunky API. requests, courtesy of Kenneth Reitz, changed that, making HTTP a relative breeze and fast establishing itself as the go-to Python HTTP library.
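To give a flavor of how little ceremony requests needs, here is a minimal sketch of fetching a JSON document over HTTP. It assumes Python 3 with requests installed. So that it runs without touching the Web, it serves a made-up payload from a throwaway local server. The DemoHandler class, the fetch_json helper, the /data.json path, and the payload itself are all illustrative assumptions, not part of requests or of any real site.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a real web server: always returns the same JSON."""

    def do_GET(self):
        body = json.dumps({"name": "demo", "values": [1, 2, 3]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence the default per-request logging


def fetch_json(url, timeout=5):
    """GET a URL and return its decoded JSON body (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.json()


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:{}/data.json".format(server.server_address[1])
data = fetch_json(url)
print(data["values"])  # [1, 2, 3]

# Stop the demo server and release its socket.
server.shutdown()
server.server_close()
```

Against a real site, only the URL passed to fetch_json would change. The timeout and the raise_for_status call are optional, but they keep a stalled or failed request from passing silently.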