Data Sources

All of the data analyzed in this book are available on the book’s website and GitHub repository. These datasets are from open repositories and from individuals. We acknowledge them all here, and include, as appropriate, the filename for the data stored in our repository, a description of the resource, a link to the original source, a related publication, and the author(s)/owner(s).

To begin, we provide the sources for the four case studies in the book. Our analysis of the data in these case studies is based on research articles or, in one case, a blog post. We generally follow the line of inquiry in these sources, simplifying the analyses to match the level of the book.

Here are the four case studies:

seattle_bus_times.csv
Mark Hallenbeck of the Washington State Transportation Center provides the Seattle Transit data. Our analysis is based on “The Waiting Time Paradox, or, Why Is My Bus Always Late?” by Jake VanderPlas.
aqs_06-067-0010.csv, list_of_aqs_sites.csv, matched_pa_aqs.csv, list_of_purpleair_sensors.json, and purpleair_AMTS
The datasets used in the study of air quality monitors are available from Karoline Barkjohn of the Environmental Protection Agency. These were originally acquired by Barkjohn and collaborators from the US Air Quality System and PurpleAir. Our analysis is based on “Development and Application of a United States-Wide Correction for PM2.5 Data Collected with the PurpleAir Sensor” by Barkjohn, Brett Gantt, and Andrea Clements.
donkeys.csv ...
