Chapter 5. Accessing Web-Based Data

The internet is an incredible source of data; it is, arguably, the reason that data has become such a dominant part of our social, economic, political, and even creative lives. In Chapter 4, we focused our data wrangling efforts on accessing and reformatting file-based data that had already been saved to our devices or to the cloud. Much of that data, however, came from the internet originally, whether it was downloaded from a website, like the unemployment data, or retrieved from a URL, like the Citi Bike data. Now that we have a handle on how to use Python to parse and transform a variety of file-based data formats, it’s time to look at what’s involved in collecting those files in the first place, especially when the data they contain is of the real-time, feed-based variety. To do this, we’re going to spend the bulk of this chapter learning how to get ahold of data made available through APIs, the application programming interfaces I mentioned early in Chapter 4. APIs are the primary (and sometimes only) way that we can access the data generated by real-time or on-demand services like social media platforms, streaming music, and search services, as well as many other private and public (e.g., government-generated) data sources.

While the many benefits of APIs (see “Why APIs?” for a refresher) make them a popular resource for data-collecting companies to offer, there are significant costs and risks to doing so. For advertising-driven ...

