Chapter 7. Geospatial and Temporal Data Analysis on Taxi Trip Data

Geospatial data is data that has location information embedded in it in some form. Billions of sources, such as mobile phones and sensors, generate it at a massive scale every day. Data about the movement of humans and machines, and data from remote sensing, matters to our economy and general well-being. Geospatial analytics gives us the tools and methods we need to make sense of all that data and put it to use in solving the problems we face.

The PySpark and PyData ecosystems have evolved considerably over the last few years when it comes to geospatial analysis. They are used across industries to handle location-rich data and, in turn, affect our daily lives. One daily activity where geospatial data is especially visible is local transport: the rise of digital cab-hailing services over the last few years has made many of us more aware of geospatial technology. In this chapter, we’ll apply our PySpark and data analysis skills to this domain as we work with a dataset containing information about trips taken by cabs in New York City.

One statistic that is important to understanding the economics of taxis is utilization: the fraction of time that a cab is on the road and is occupied by one or more passengers. One factor that impacts utilization is the passenger’s destination: a cab that drops off passengers near Union Square at ...
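To make the definition of utilization concrete, here is a minimal sketch in plain Python rather than PySpark. It assumes that a cab's time on the road can be approximated by a single shift window and that its trips do not overlap. The function name `utilization`, the shift times, and the trip records are made up for illustration and are not taken from the dataset.

```python
from datetime import datetime


def utilization(trips, shift_start, shift_end):
    """Return the fraction of the shift during which the cab was occupied.

    trips is a list of (pickup, dropoff) datetime pairs. This sketch assumes
    the trips do not overlap and all fall inside the shift window.
    """
    on_road_seconds = (shift_end - shift_start).total_seconds()
    if on_road_seconds <= 0:
        raise ValueError("shift_end must be after shift_start")
    occupied_seconds = sum(
        (dropoff - pickup).total_seconds() for pickup, dropoff in trips
    )
    return occupied_seconds / on_road_seconds


# Hypothetical four-hour shift with three trips (illustrative values only).
shift_start = datetime(2024, 1, 1, 8, 0)
shift_end = datetime(2024, 1, 1, 12, 0)
trips = [
    (datetime(2024, 1, 1, 8, 10), datetime(2024, 1, 1, 8, 40)),    # 30 minutes
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 45)),     # 45 minutes
    (datetime(2024, 1, 1, 10, 30), datetime(2024, 1, 1, 11, 15)),  # 45 minutes
]

result = utilization(trips, shift_start, shift_end)
print(f"Utilization: {result:.2f}")  # 120 occupied minutes / 240 on-road minutes = 0.50
```

In this example the cab is empty for half of the shift. The gaps between a drop-off and the next pickup are where the drop-off location can influence utilization.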
