Chapter 3. Data

This chapter introduces the dataset we will work with in the rest of the book. It also covers the kinds of tools we’ll be using and our reasons for choosing them. Finally, it outlines several perspectives on analyzing data for you to keep in mind moving forward.

Air Travel Data

Air travel is an essential part of modern life. It is a fundamental part of globalized culture, linking major cities across the planet into a global urban economy. Thanks to regulation, a lot of aviation data is freely available. Over the course of the book, we’ll use many aviation datasets. The core, or atomic, logs we’ll be using are the on-time records for each flight. We will supplement these with data on airlines, weather, routes, and more.

Flight on-time records aren’t quite big data, but they do add up to several gigabytes per year, uncompressed. We immediately face a “big” (or really, a “medium”) data problem: processing the data on your local machine will be just barely feasible. Working with data too large to fit in RAM forces us to use scalable tools, which is helpful as a learning device. Air travel is also a familiar experience for most of us, so we’ll use it to give you a sense of how to analyze and query flight data and to help you see which techniques are effective. This is how we cultivate data intuition, a major theme in Agile Data Science.
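To make the memory constraint concrete, here is a minimal sketch of one way to process records that do not fit in RAM: stream them row by row and keep only running totals. It is an illustration, not the book's own pipeline. The column names Carrier and ArrDelay, the function name mean_delay_by_carrier, and the sample rows are all assumptions made up for this example.

```python
import csv
import io
from collections import defaultdict


def mean_delay_by_carrier(lines):
    """Stream CSV rows, holding only per-carrier running totals in memory."""
    totals = defaultdict(float)  # sum of arrival delays, in minutes
    counts = defaultdict(int)    # number of flights with a recorded delay
    for row in csv.DictReader(lines):
        delay = row["ArrDelay"]  # hypothetical column name
        if delay == "":          # skip flights with no recorded delay
            continue
        totals[row["Carrier"]] += float(delay)
        counts[row["Carrier"]] += 1
    return {carrier: totals[carrier] / counts[carrier] for carrier in totals}


# Tiny made-up sample standing in for a multi-gigabyte file.
sample = io.StringIO(
    "Carrier,ArrDelay\n"
    "AA,10\n"
    "AA,20\n"
    "DL,-5\n"
    "DL,\n"
)
result = mean_delay_by_carrier(sample)
print(result)
```

With a real file, you would pass an open file handle (for example, one opened with newline="") instead of the in-memory sample. Memory use then depends on the number of distinct carriers rather than the number of flights.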

In this book, we use the same tools that you would use at petabyte scale, but in local mode on your own machine. ...
