Chapter 5. Getting Data off the Web with Python

A fundamental part of the data visualizer’s skill set is getting the right dataset in as clean a form as possible. More often than not these days, this means getting it off the Web. There are various ways to do this, and Python provides some great libraries that make gathering the data easy.

The main ways to get data off the Web are:

  • Get a raw data file in a recognized data format (e.g., JSON or CSV) over HTTP

  • Use a dedicated API to get the data

  • Scrape the data by getting web pages via HTTP and parsing them locally for the required data
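
As a rough sketch of what each of these three routes leaves you holding once the HTTP response has arrived, the following uses only the standard library to parse three small payloads. The strings RAW_CSV, API_JSON, and PAGE_HTML are made-up stand-ins for response bodies, and the function and class names are hypothetical, chosen purely for illustration.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Hypothetical payloads standing in for the bodies of HTTP responses.
RAW_CSV = "name,year\nMarie Curie,1903\nAlbert Einstein,1921\n"
API_JSON = '{"results": [{"name": "Marie Curie", "year": 1903}]}'
PAGE_HTML = (
    "<ul><li class='winner'>Marie Curie</li>"
    "<li class='winner'>Albert Einstein</li></ul>"
)


def parse_csv(text):
    """Route 1: a raw data file in a recognized format (here CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_api_response(text):
    """Route 2: a web API response, assumed here to be a JSON object
    with a 'results' list."""
    return json.loads(text)["results"]


class WinnerScraper(HTMLParser):
    """Route 3: scrape a page by collecting the text of <li class='winner'>."""

    def __init__(self):
        super().__init__()
        self.in_winner = False
        self.winners = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "winner") in attrs:
            self.in_winner = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_winner = False

    def handle_data(self, data):
        if self.in_winner:
            self.winners.append(data.strip())


def scrape_winners(html):
    scraper = WinnerScraper()
    scraper.feed(html)
    return scraper.winners
```

Note that the CSV route yields every field as a string, the JSON route preserves numbers, and the scraping route needs knowledge of the page's markup, which is why it tends to be the most fragile of the three.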

This chapter will deal with these ways in turn, but first let’s get acquainted with the best Python HTTP library out there: requests.

Getting Web Data with the requests Library

As we saw in Chapter 4, the files that are used by web browsers to construct web pages are communicated via the Hypertext Transfer Protocol (HTTP), first developed by Tim Berners-Lee. Getting web content in order to parse it for data involves making HTTP requests.

Negotiating HTTP requests is a vital part of any general-purpose language, but getting web pages with Python used to be a rather irksome affair. The venerable urllib2 library was hardly user-friendly, with a very clunky API. requests, courtesy of Kenneth Reitz, changed that, making HTTP a relative breeze and fast establishing itself as the go-to Python HTTP library.
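
To give a flavor of that simplicity, here is a minimal sketch of fetching JSON with requests. The helper name fetch_json and its http parameter are assumptions made for this illustration, not part of the library. The parameter lets you pass in a requests.Session (or a stand-in object for testing) and defaults to the requests module itself, which is imported lazily because it is a third-party package.

```python
def fetch_json(url, params=None, timeout=10, http=None):
    """GET a URL and return its body decoded from JSON.

    `http` is a hypothetical hook: any object with a requests-style
    get() method, such as a requests.Session. It defaults to the
    requests module itself.
    """
    if http is None:
        import requests  # third-party package: pip install requests

        http = requests
    # params are encoded into the URL's query string
    response = http.get(url, params=params, timeout=timeout)
    # Raise an exception for 4xx/5xx status codes instead of parsing
    # an error page as if it were data.
    response.raise_for_status()
    return response.json()
```

A call such as fetch_json("https://example.com/data.json", params={"year": 1921}) would then return the decoded data, where the URL and query parameter are placeholders rather than a real endpoint. Setting a timeout explicitly is worthwhile because requests will otherwise wait indefinitely for a response.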

requests is not part of the Python standard library but is part of the Anaconda ...
