Chapter 1. Introducing Polars
In 2022, we found ourselves in the middle of a challenging project for a client. Their data pipeline was growing out of control. The codebase was a mix of Python and R, with the Python side relying heavily on the Pandas package for data wrangling. Over time, three major issues emerged: the code was becoming increasingly difficult to maintain, performance had slowed to a crawl, and memory consumption had skyrocketed to over 500 GB. These problems were stifling productivity and pushing the limits of the infrastructure.
Back then, Polars was still relatively unknown, but we had experimented with it and seen some promising results. Convincing the rest of the team to migrate both the pandas and R code to Polars wasn’t easy, but once the switch was made, the impact was immediate. The new data pipeline was much faster, and the memory footprint shrank to just 40 GB—a fraction of what it had been.
Thanks to this success, we’re fully convinced of the power of Polars. It’s why we wrote this book, Python Polars: The Definitive Guide, to share with you what we’ve learned and help you unlock the same potential in your data workflows.
In this chapter, you’ll learn:
- The main features of Polars
- Why Polars is fast and popular
- How Polars compares to other data processing packages
- Why you should use Polars
- How we have organized this book
- Why we focus on Python Polars
In addition, we’ll demonstrate Polars’ capabilities through a use case: transforming, analyzing, and visualizing data related to bike trips in New York City.
What Is This Thing Called Polars?
Polars is a high-performance data processing package designed for efficient handling of large-scale datasets. What started as a side project by Ritchie Vink to learn Rust and to better understand data processing has now grown into a popular package. Data scientists, data engineers, and software developers use it to perform data analysis, create data visualizations, and build data-intensive applications used in production.
Features
Here are some key features of Polars:
- Fast and efficient: Written in Rust and leveraging decades of database research, Polars is engineered for speed and performance. Thanks to parallel processing and memory optimization techniques, it can process large datasets significantly faster than other data processing packages, often 10 to 100 times faster for common operations.
- DataFrame structure: Polars uses DataFrames as its core data structure. A DataFrame is a two-dimensional data structure composed of rows and columns, similar to a spreadsheet or database table. Polars DataFrames are immutable, promoting functional-style operations and ensuring thread safety. (A minimal sketch follows this list.)
- Expressive API: Polars provides an intuitive and concise syntax for data processing tasks, making it easy to learn, use, and maintain.
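To make the DataFrame structure concrete, here is a minimal sketch (the column names and values are made up for illustration):

import polars as pl

# A tiny DataFrame: three named, typed columns and three rows (hypothetical data).
df = pl.DataFrame(
    {
        "station": ["W 21 St", "1 Ave & E 16 St", "Broadway"],
        "trips": [120, 85, 240],
        "electric": [True, False, True],
    }
)
print(df)         # shows the shape, the column names, and their data types
print(df.schema)  # every column has a fixed data type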
Key Concepts
Key concepts you’ll become familiar with in this book include:
- Lazy evaluation: Polars employs “lazy” evaluation, where computations are built into an optimized execution plan and executed only when needed. This approach minimizes unnecessary work and can lead to substantial performance gains.
- Expressions: Polars uses expressions to define operations on DataFrames. These expressions are composable, allowing users to create complex data pipelines without intermediate results.
- Query optimization: The package automatically optimizes the execution plan for efficient resource use, based on expressions and data characteristics. (The sketch after this list illustrates all three concepts.)
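As a first taste of these three concepts, here is a minimal sketch. It assumes a hypothetical CSV file sales.csv with the columns store and amount; nothing is read or computed until collect() is called:

import polars as pl

# An expression: a description of a computation, not a computed result.
high_amount = pl.col("amount") > 100

# Lazy evaluation: scan_csv() builds a query plan instead of reading the file.
lazy_query = (
    pl.scan_csv("sales.csv")  # hypothetical file with columns store and amount
    .filter(high_amount)      # expressions compose into a pipeline
    .group_by("store")
    .agg(pl.col("amount").sum())
)

# Query optimization: Polars rewrites the plan before executing it.
print(lazy_query.explain())    # show the optimized execution plan
result = lazy_query.collect()  # only now is the work actually done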
Advantages
Here is a quick rundown of Polars’ main advantages:
- Performance: Thanks to its efficient algorithms, parallel execution engine, and use of vectorization with Single Instruction, Multiple Data (SIMD), Polars is designed to take full advantage of modern hardware. It can optionally leverage NVIDIA GPUs to further improve performance (we benchmark the difference the GPU makes in the Appendix).
- Memory efficiency: Polars requires less memory for operations than other data processing packages.
- Interoperability: Built on Apache Arrow, a standardized columnar memory format for flat and hierarchical data, Polars offers excellent interoperability with other data processing tools and packages. It can be used directly in Rust, and has language bindings for Python, R, SQL, JavaScript, and Julia. In a moment we’ll explain why this book focuses on Python specifically. (A short sketch of this interoperability follows this list.)
- Streaming capabilities: Polars can process data in chunks, allowing for out-of-core computations on datasets that don’t fit in memory.
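The interoperability advantage is easy to see in practice. Here is a minimal sketch, assuming Pandas and PyArrow are installed alongside Polars:

import pandas as pd
import polars as pl

# A small Polars DataFrame (hypothetical values).
df = pl.DataFrame({"borough": ["Manhattan", "Brooklyn"], "trips": [100, 80]})

# Because Polars stores its data in Arrow memory, handing it to other tools is cheap.
arrow_table = df.to_arrow()             # a pyarrow.Table sharing the columnar data
pandas_df = df.to_pandas()              # convert to Pandas when a library requires it
back_again = pl.from_pandas(pandas_df)  # and back into Polars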
In summary, Polars is a powerful package for data analysis tasks, particularly suited for large-scale operations where performance and efficiency are crucial.
Why You Should Use Polars
“Come for the speed, stay for the API” is a popular saying within the Polars community. It nicely captures the two main reasons for choosing Polars: performance and usability. Let’s dive into those two reasons. After that, we’ll also address the popularity and sustainability of Polars.
Performance
First and foremost, you should use Polars for its outstanding performance. Figure 1-1 shows, for a number of data processing packages, the duration (in seconds) for running a variety of queries. These queries come from a standardized set of benchmarks and include reading the data from disk.
Polars consistently outperforms the other packages: Apache Spark, Pandas, Dask, DuckDB, and Vaex.
Usability
While performance may be the initial draw, many users find themselves staying with Polars due to its well-designed API. The Polars API is characterized by:
- Consistency: Operations behave predictably across different data types and structures, and are based on the data processing grammar that you already know.
- Expressiveness: Polars offers its own expression system that allows you to create complex data transformations in a concise and readable way.
- Functional approach: The API encourages a functional programming style, which fits well with data processing and makes your code easy to read, write, and maintain.
- Eager and lazy APIs: You can choose and easily switch between eager execution for quick, ad-hoc results and lazy evaluation for optimized performance, depending on your needs. A minimal sketch of this switching follows this list, and you’ll get a preview of both APIs in the showcase at the end of this chapter.
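To give you a flavor of that switching before the showcase, here is a minimal sketch with a small, hypothetical DataFrame:

import polars as pl

df = pl.DataFrame({"rider_type": ["member", "casual", "member"], "minutes": [12, 35, 7]})

# Eager API: every method call executes immediately and returns a DataFrame.
eager_result = df.group_by("rider_type").agg(pl.col("minutes").mean())

# Lazy API: df.lazy() turns the DataFrame into a LazyFrame; nothing runs until collect().
lazy_result = (
    df.lazy()
    .group_by("rider_type")
    .agg(pl.col("minutes").mean())
    .collect()
)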
The combination of performance and usability has led to Polars gaining popularity at a fast pace, as we’ll discuss next.
Popularity
You should never choose a particular piece of software just because it’s popular. It can cause you to miss out on options that might better fit your needs, as popularity doesn’t always mean it’s the best choice. On the other hand, picking something that isn’t well-known or maintained can lead to issues such as limited community support, security risks, and lack of updates, making it a less reliable option in the long run.
Luckily, Polars is very much actively maintained (with a new release nearly every week on GitHub1) and its community is growing rapidly (with a Discord server2 that currently has over 4500 members). From our own experience we can say that bugs are fixed quickly and that questions are addressed kindly and swiftly.
There’s no perfect measure for the popularity of an open source project. The number of GitHub stars is, however, a good indicator of a project’s visibility and community interest. It reflects how many people find it noteworthy or potentially useful. Figure 1-2 shows the number of GitHub stars for a variety of Python packages for processing data.
Pandas and Apache Spark, two projects that have been around for over 10 years, have the highest number of GitHub stars. Polars, which is one of the youngest projects, comes in at third place. If these three projects maintain their current trajectory, then Polars is set to overtake both Pandas and Apache Spark within the next few years. In short, Polars is here to stay, and it’s worth investing time in learning it.
Sustainability
Because of the way Polars is designed, it computes queries efficiently. According to research by Felix Nahrstedt et al. (2024), Polars consumes 63% of the energy needed by Pandas on the TPC-H benchmark, and an eighth of the energy that Pandas needs on synthetic data. In an age where there is more and more data to process, doing so sustainably becomes increasingly important. Polars sets an example for processing data with a low carbon footprint.
Polars Compared to Other Data Processing Packages
Polars is of course not the only package for processing data. In this section we provide an overview of how Polars compares to other popular data processing packages in Python. We highlight the strengths and weaknesses of each, helping you understand where Polars fits in the landscape of data processing tools.
- Pandas
Pandas is the most widely used data processing package for Python. It provides data structures like DataFrames and Series, along with a rich set of functions for data analysis, cleaning, and transformation.
Compared to Pandas, Polars offers significantly better performance, especially for large datasets. Polars is built in Rust and uses Apache Arrow for memory management, allowing it to process data much faster than Pandas. While Pandas uses eager execution by default, Polars provides both eager and lazy execution options, enabling query optimization. However, Pandas still has a larger ecosystem and better integration with other data science packages. (A short side-by-side sketch follows this list.)
- Dask
Dask is a flexible package for parallel computing in Python. It extends the functionality of NumPy, Pandas, and Scikit-learn to distributed computing systems. Dask is particularly useful for processing datasets that are too large to fit in memory.
Like Dask, Polars supports parallel processing and can handle large datasets. However, Polars is designed for single-machine use, while Dask focuses on distributed computing. Polars generally offers better performance for operations that fit in memory, while Dask excels at processing truly massive datasets across multiple machines.
- DuckDB
DuckDB is an in-process SQL OLAP database management system. It’s designed to be fast and efficient for analytical queries on structured data. DuckDB can be embedded directly in applications and supports SQL queries.
Both Polars and DuckDB are optimized for analytical workloads and offer excellent performance. Polars provides a more Pythonic API, while DuckDB uses SQL for querying.
- PySpark
PySpark is the Python API for Apache Spark, a distributed computing system designed for big data processing. It provides a wide range of functionalities, including SQL queries, machine learning, and graph processing. PySpark is particularly useful for processing very large datasets across clusters of computers.
While PySpark is designed for distributed computing, Polars focuses on single-machine performance. Polars generally offers faster performance for datasets that can fit on a single machine. However, PySpark is more suitable for truly massive datasets that require distributed processing across multiple nodes. Polars is also easier to set up and use than the more complex PySpark ecosystem.
- Vaex
Vaex is a high-performance Python package for lazy, out-of-core DataFrames. It’s designed to handle datasets larger than memory efficiently. Vaex uses memory mapping and lazy evaluation to process large datasets quickly.
Compared to Vaex, Polars offers a more comprehensive set of operations and better integration with the Python ecosystem. While both packages are optimized for large datasets, Polars generally provides faster in-memory processing. Vaex may offer an advantage when working with datasets that are significantly larger than the available RAM.
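To make the Pandas comparison concrete, here is the same aggregation written in both packages. This is a minimal sketch with hypothetical data; the Polars version can also be run lazily to benefit from query optimization:

import pandas as pd
import polars as pl

data = {"borough": ["Manhattan", "Brooklyn", "Manhattan"], "minutes": [12, 35, 7]}

# Pandas: eager, index-based API.
pandas_result = pd.DataFrame(data).groupby("borough")["minutes"].mean()

# Polars: expression-based API; swap pl.DataFrame for a LazyFrame (e.g., pl.scan_csv) to go lazy.
polars_result = (
    pl.DataFrame(data)
    .group_by("borough")
    .agg(pl.col("minutes").mean())
)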
Why We Focus on Python Polars
Since Polars is built in Rust, and has language bindings for Python, R, SQL, JavaScript, and Julia, you might be wondering why we focus on its Python API.
According to the 2024 Stack Overflow Developer Survey3, Python is the most popular programming language among respondents who are learning to code and the fourth most popular among professional developers. This is not surprising, since Python is known for its simplicity, readability, and versatility and is widely used in data science, machine learning, web development, and more.
This popularity is reflected in the Polars community, where the Python API is the most complete, most used, and most updated API. Furthermore, Python is widely regarded as the language of choice for data analysis and data processing, and most data scientists and data engineers are familiar with it.
How This Book Is Organized
This book contains 18 chapters, spread over five parts. Each chapter starts with a short introduction of what we’ll discuss and concludes with key takeaways.
- Part I, “Begin”
This first part, Begin, contains the first three chapters of the book. These chapters are meant to introduce you to Polars, to get you up and running, and to help you start using it yourself.
Chapter 1, this chapter, discusses what Polars is, explains why you should use it, and demonstrates its capabilities through a showcase. Chapter 2 covers everything you need to get started with Polars yourself, including instructions on how to install Polars and how to get the code and data used in this book. If you have any experience using Pandas, then Chapter 3 will help you transition to Polars by explaining and showing the differences between the two.
- Part II, “Form”
The name of the second part, Form, has two meanings: it’s about the form of data structures and data types, and about forming DataFrames from some source. In other words, you’ll learn how to read and write data, and how this data is stored and handled in Polars.
Chapter 4 provides an overview of the data structures and data types that Polars supports and how missing data is handled. Chapter 5 explains the difference between the eager API, which is used for quick results, and the lazy API, which is used for optimized performance. Chapter 6 shows how to read and write data from and to various file formats, such as CSV, Parquet, and Arrow.
- Part III, “Express”
Expressions play a central role within Polars, so it’s only fitting that this third part, Express, is in the middle of the book.
Chapter 7 starts with examples of where expressions are used, provides a formal definition of an expression, and explains how you can create them. Chapter 8 enumerates the many methods for continuing expressions, including mathematical operations, working with missing values, applying smoothing, and summarizing. Chapter 9 shows how to combine multiple expressions using, for example, arithmetic and Boolean logic.
- Part IV, “Transform”
Once you understand expressions, you can incorporate them into functions and methods to transform your data, which is what this fourth part, Transform, is all about.
Chapter 10 explains how to select and create columns and work with column names and selectors. Chapter 11 shows the different ways of filtering and sorting rows. Chapter 12 covers how to work with textual, temporal, and nested data types. Chapter 13 goes into grouping, aggregating, and summarizing data. Chapter 14 explains how to combine different DataFrames using joins and concatenations. Chapter 15 shows how to reshape data through (un)pivoting, stacking, and extending.
- Part V, “Advance”
The last part of this book, Advance, contains a variety of more advanced topics.
Chapter 16 explains how to visualize data using a variety of visualization packages, including Altair, hvplot, and Plotnine. Chapter 17 shows how you can extend Polars with custom Python functions and your own Rust plugins. Chapter 18 looks behind the curtains of Polars, explaining how it’s built, how it works under the hood, and why it is so fast.
The book concludes with an appendix that covers how to leverage the power of GPUs to accelerate Polars, offering insights into maximizing performance.
An ETL Showcase
Now that you’ve learned where Polars comes from and how it will benefit you, it’s time to see it in action. We’ve prepared an extract-transform-load (ETL) showcase, in which we’re going to demonstrate the capabilities of Polars by transforming, analyzing, and visualizing data related to bike trips in New York City.
Strictly speaking, ETL is about extracting, transforming, and loading data, but we have added two data-visualization bonuses.
The outline of this showcase is as follows. First, we import the required packages. Second, we download the raw data. Third, we clean this raw data and enrich it with new columns. Finally, we’re going to write the data to Parquet files so that we can reuse it in later chapters.
Don’t Worry About the Syntax
The purpose of this showcase is to give you a taste of what Polars looks like. There will be lots of new syntax and concepts that you’re not yet familiar with. Don’t worry about this; everything will be explained throughout the course of this book. You don’t have to run these code snippets yourself. Instead, just read and enjoy the ride.
Let’s get started.
Extract
The first step of this ETL showcase is to extract the data. We are going to use two different sources: one is about the bike trips themselves, and the other is about New York City’s neighborhoods and boroughs. However, we first need to import the packages that we’re going to use for this showcase.
Import Packages
Obviously, we will need Polars itself. For the geographical operations in this showcase, we’ve made a custom plugin that you can import as polars_geo. We’ll explain how to compile and install this plugin in Chapter 17. We also need the Plotnine package, which we’re going to use to create a couple of data visualizations.

We’ll start by compiling the polars_geo plugin. Be warned, this can take a while.
!cd plugins/polars_geo && maturin develop --release

# And reset the kernel to make the new plug-in available
from IPython.display import display, Javascript

display(Javascript("Jupyter.notebook.kernel.restart()"))
Now that it’s compiled and installed, we’re ready to import the packages we need:
import polars as pl
import polars_geo
from plotnine import *
It’s customary to import Polars as the alias pl. In Python scripts, it’s not recommended to import all the functions of a package into the global namespace. In ad-hoc notebooks, and in this showcase, it’s OK because it allows us to use the functions of the Plotnine package without having to type the package name, which is much more convenient.
Let’s move on to the next step, which is to download and extract the bike trips.
Download and Extract Citi Bike Trips
The data that we’re going to use comes from Citi Bike, New York City’s public bike rental system. This system offers bikes that you can hire for short trips up to 30 or 45 minutes, depending on whether you’re a member. The data is freely available from their website4.
The following commands download the ZIP file, extract the CSV file, and remove the ZIP file (as it’s no longer needed).
!curl -sO https://s3.amazonaws.com/tripdata/202403-citibike-tripdata.csv.zip
!unzip -o 202403-citibike-tripdata.csv.zip "*.csv" -x "*/*" -d data/citibike/
!rm -f 202403-citibike-tripdata.csv.zip
Shell Commands
These shell commands are not Python code.
In Jupyter, the exclamation mark (!) causes these commands to be executed by a shell rather than the Python interpreter.
If you are on Windows, or if you’re not comfortable running commands like this, you can download and extract the data manually:
- Visit the Citi Bike website
- Click on the link “downloadable files of Citi Bike trip data”
- Download the ZIP file 202403-citibike-tripdata.csv.zip
- Extract the ZIP file
- Move the CSV file to the data/citibike subdirectory
Let’s continue to the next step, which is to load this CSV file into a Polars DataFrame.
Read Citi Bike Trips into a Polars DataFrame
Before we read any raw data into a Polars DataFrame, we always like to inspect it first.
We’ll count the number of lines in this CSV file using wc and print the first six lines using head:
!wc -l data/citibike/202403-citibike-tripdata.csv
!head -n 6 data/citibike/202403-citibike-tripdata.csv
2663296 data/citibike/202403-citibike-tripdata.csv
"ride_id","rideable_type","started_at","ended_at","start_station_name","start_s…
"62021B31AF42943E","electric_bike","2024-03-13 15:57:41.800","2024-03-13 16:07:…
"EC7BE9D296FFD072","electric_bike","2024-03-16 10:25:46.114","2024-03-16 10:30:…
"EC85C0EEC95157BB","classic_bike","2024-03-20 19:20:49.818","2024-03-20 19:28:0…
"9DDE9AF5606B4E0F","classic_bike","2024-03-13 20:31:12.599","2024-03-13 20:40:3…
"E4446F457328C5FE","electric_bike","2024-03-16 10:50:11.535","2024-03-16 10:53:…
It appears that we have over 2.6 million rows, where each row is one bike trip. The CSV file seems to be well-formatted, with a header, and a comma as the separator.
When we first tried to read this CSV file into Polars, we found two problematic columns.
The values stored in the columns start_station_id and end_station_id are in fact strings, but Polars assumes that they are numbers, because in the first few rows they look like numbers. Specifying the types manually for these two columns solves this. Let’s read the CSV file into a DataFrame called trips and print the number of rows:
trips = pl.read_csv(
    "data/citibike/202403-citibike-tripdata.csv",
    try_parse_dates=True,
    schema_overrides={
        "start_station_id": pl.String,
        "end_station_id": pl.String,
    },
).sort("started_at")

trips.height
2663295
You’ll learn about reading data in Chapter 6.
You’ll learn about sorting rows in Chapter 11.
Here’s what the DataFrame looks like. Because it’s too wide to comfortably show on the page, we use print() three times to show all the columns:
print(trips[:, :4])
print(trips[:, 4:8])
print(trips[:, 8:])
shape: (2_663_295, 4) ┌──────────────────┬───────────────┬───────────────────┬───────────────────┐ │ ride_id │ rideable_type │ started_at │ ended_at │ │ --- │ --- │ --- │ --- │ │ str │ str │ datetime[μs] │ datetime[μs] │ ╞══════════════════╪═══════════════╪═══════════════════╪═══════════════════╡ │ 9EC2AD5F3F8C8B57 │ classic_bike │ 2024-02-29 00:20… │ 2024-03-01 01:20… │ │ C76D82D96516BDC2 │ classic_bike │ 2024-02-29 07:54… │ 2024-03-01 08:54… │ │ … │ … │ … │ … │ │ D8B20517A4AB7D60 │ classic_bike │ 2024-03-31 23:56… │ 2024-03-31 23:57… │ │ 6BC5FAFEAC948FB1 │ electric_bike │ 2024-03-31 23:57… │ 2024-03-31 23:59… │ └──────────────────┴───────────────┴───────────────────┴───────────────────┘ shape: (2_663_295, 4) ┌───────────────────┬──────────────────┬───────────────────┬────────────────┐ │ start_station_na… │ start_station_id │ end_station_name │ end_station_id │ │ --- │ --- │ --- │ --- │ │ str │ str │ str │ str │ ╞═══════════════════╪══════════════════╪═══════════════════╪════════════════╡ │ 61 St & 39 Ave │ 6307.07 │ null │ null │ │ E 54 St & 1 Ave │ 6608.09 │ null │ null │ │ … │ … │ … │ … │ │ Division St & Bo… │ 5270.08 │ Division St & Bo… │ 5270.08 │ │ Montrose Ave & B… │ 5068.02 │ Humboldt St & Va… │ 4956.02 │ └───────────────────┴──────────────────┴───────────────────┴────────────────┘ shape: (2_663_295, 5) ┌───────────┬────────────┬───────────┬────────────┬───────────────┐ │ start_lat │ start_lng │ end_lat │ end_lng │ member_casual │ │ --- │ --- │ --- │ --- │ --- │ │ f64 │ f64 │ f64 │ f64 │ str │ ╞═══════════╪════════════╪═══════════╪════════════╪═══════════════╡ │ 40.7471 │ -73.9028 │ null │ null │ member │ │ 40.756265 │ -73.964179 │ null │ null │ member │ │ … │ … │ … │ … │ … │ │ 40.714193 │ -73.996732 │ 40.714193 │ -73.996732 │ member │ │ 40.707678 │ -73.940297 │ 40.703172 │ -73.940636 │ member │ └───────────┴────────────┴───────────┴────────────┴───────────────┘
Not a bad start. The DataFrame trips has a variety of columns, including timestamps, categories, names, and coordinates. This will allow us to produce plenty of interesting analyses and data visualizations.
Read in Neighborhoods from GeoJSON
New York City is a large place, with many neighborhoods spread over five boroughs: The Bronx, Brooklyn, Manhattan, Staten Island, and Queens.
Our trips DataFrame lacks this information. If we were to add the neighborhood and borough where each trip starts and ends, we would be able to compare boroughs with each other or answer questions such as “What is the busiest neighborhood in Manhattan?”
To add this information, we are going to read a GeoJSON file that contains all the boroughs and neighborhoods of New York City. The raw data, which is on GitHub, looks like this5:
!python -m json.tool data/citibike/nyc-neighborhoods.geojson
{
    "type": "FeatureCollection",
    "crs": {
        "type": "name",
        "properties": {
            "name": "urn:ogc:def:crs:OGC:1.3:CRS84"
        }
    },
    "features": [
        {
            "type": "Feature",
            "properties": {
                "neighborhood": "Allerton",
                "boroughCode": "2",
                "borough": "Bronx",
                "X.id": "http://nyc.pediacities.com/Resource/Neighborhood/Aller…
            },
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [
                        [
                            -73.84859700000018,
                            40.871670000000115
                        ],
                        [
                            -73.84582253683678,
                            40.870239076236174
                        ],
… with 134239 more lines
This deeply nested structure contains all the information we need. The areas are stored as polygons, which are sequences of coordinates. We transform this deeply nested structure into a rectangular form: that is, a DataFrame.
neighborhoods = (
    pl.read_json("data/citibike/nyc-neighborhoods.geojson")
    .select("features")
    .explode("features")
    .unnest("features")
    .unnest("properties")
    .select("neighborhood", "borough", "geometry")
    .unnest("geometry")
    .with_columns(polygon=pl.col("coordinates").list.first())
    .select("neighborhood", "borough", "polygon")
    .filter(pl.col("borough") != "Staten Island")
    .sort("neighborhood")
)
neighborhoods
You’ll learn about reshaping nested data structures in Chapter 15.
Staten Island doesn’t have any Citi Bike stations.
shape: (258, 3) ┌─────────────────┬──────────┬─────────────────────────────────────────────────┐ │ neighborhood │ borough │ polygon │ │ --- │ --- │ --- │ │ str │ str │ list[list[f64]] │ ╞═════════════════╪══════════╪═════════════════════════════════════════════════╡ │ Allerton │ Bronx │ [[-73.848597, 40.87167], [-73.845823, 40.87023… │ │ Alley Pond Park │ Queens │ [[-73.743333, 40.738883], [-73.743714, 40.7394… │ │ Arverne │ Queens │ [[-73.789535, 40.599972], [-73.789541, 40.5999… │ │ Astoria │ Queens │ [[-73.901603, 40.76777], [-73.902696, 40.76688… │ │ Bath Beach │ Brooklyn │ [[-73.99381, 40.60195], [-73.99962, 40.596469]… │ │ … │ … │ … │ │ Williamsburg │ Brooklyn │ [[-73.957572, 40.725097], [-73.952998, 40.7222… │ │ Windsor Terrace │ Brooklyn │ [[-73.980061, 40.660753], [-73.979878, 40.6607… │ │ Woodhaven │ Queens │ [[-73.86233, 40.695962], [-73.856544, 40.69707… │ │ Woodlawn │ Bronx │ [[-73.859468, 40.900517], [-73.85926, 40.90033… │ │ Woodside │ Queens │ [[-73.900866, 40.757674], [-73.90014, 40.75615… │ └─────────────────┴──────────┴─────────────────────────────────────────────────┘
We now have a clean DataFrame with 258 neighborhoods, the boroughs in which they are located, and their polygons.
If a neighborhood consists of multiple separate areas (that is, multiple polygons), it will appear multiple times in this DataFrame.
Before we use this neighborhoods DataFrame to add information to the trips DataFrame, we first want to visualize it so that we have some context.
Bonus: Visualizing Neighborhoods and Stations
To visualize the neighborhoods of New York City and all the Citi Bike stations, we are going to use the Plotnine package. Plotnine expects the DataFrame in a long format—that is, one row per coordinate—so we have some wrangling to do:
neighborhoods_coords = (
    neighborhoods
    .with_row_index("id")
    .explode("polygon")
    .with_columns(
        lon=pl.col("polygon").list.first(),
        lat=pl.col("polygon").list.last(),
    )
    .drop("polygon")
)
neighborhoods_coords
shape: (27_569, 5) ┌─────┬──────────────┬─────────┬────────────┬───────────┐ │ id │ neighborhood │ borough │ lon │ lat │ │ --- │ --- │ --- │ --- │ --- │ │ u32 │ str │ str │ f64 │ f64 │ ╞═════╪══════════════╪═════════╪════════════╪═══════════╡ │ 0 │ Allerton │ Bronx │ -73.848597 │ 40.87167 │ │ 0 │ Allerton │ Bronx │ -73.845823 │ 40.870239 │ │ 0 │ Allerton │ Bronx │ -73.854559 │ 40.859954 │ │ 0 │ Allerton │ Bronx │ -73.854665 │ 40.859586 │ │ 0 │ Allerton │ Bronx │ -73.856389 │ 40.857594 │ │ … │ … │ … │ … │ … │ │ 257 │ Woodside │ Queens │ -73.910618 │ 40.755476 │ │ 257 │ Woodside │ Queens │ -73.90907 │ 40.757565 │ │ 257 │ Woodside │ Queens │ -73.907828 │ 40.756999 │ │ 257 │ Woodside │ Queens │ -73.90737 │ 40.756988 │ │ 257 │ Woodside │ Queens │ -73.900866 │ 40.757674 │ └─────┴──────────────┴─────────┴────────────┴───────────┘
To get the coordinates of the stations, we calculate, per station, the median coordinates of the start location of each bike trip:
stations = (
    trips
    .group_by(station=pl.col("start_station_name"))
    .agg(
        lon=pl.col("start_lng").median(),
        lat=pl.col("start_lat").median(),
    )
    .sort("station")
    .drop_nulls()
)
stations
You’ll learn about aggregation in Chapter 13.
shape: (2_143, 3) ┌──────────────────────────────┬────────────┬───────────┐ │ station │ lon │ lat │ │ --- │ --- │ --- │ │ str │ f64 │ f64 │ ╞══════════════════════════════╪════════════╪═══════════╡ │ 1 Ave & E 110 St │ -73.938203 │ 40.792327 │ │ 1 Ave & E 16 St │ -73.981656 │ 40.732219 │ │ 1 Ave & E 18 St │ -73.980544 │ 40.733876 │ │ 1 Ave & E 30 St │ -73.975361 │ 40.741457 │ │ 1 Ave & E 38 St │ -73.971822 │ 40.746202 │ │ … │ … │ … │ │ Wyckoff Ave & Stanhope St │ -73.917914 │ 40.703545 │ │ Wyckoff St & 3 Ave │ -73.982586 │ 40.682755 │ │ Wythe Ave & Metropolitan Ave │ -73.963198 │ 40.716887 │ │ Wythe Ave & N 13 St │ -73.957099 │ 40.722741 │ │ Yankee Ferry Terminal │ -74.016756 │ 40.687066 │ └──────────────────────────────┴────────────┴───────────┘
The following code snippet contains the Plotnine code to produce Figure 1-3. Each dot is a bike station. The four colors indicate the boroughs. The shading of the neighborhoods is only used to make them visually more separate; it has no meaning.
(
    ggplot(neighborhoods_coords, aes(x="lon", y="lat", group="id"))
    + geom_polygon(aes(alpha="neighborhood", fill="borough"), color="white")
    + geom_point(stations, size=0.1)
    + scale_x_continuous(expand=(0, 0))
    + scale_y_continuous(expand=(0, 0, 0, 0.01))
    + scale_alpha_ordinal(range=(0.3, 1))
    + scale_fill_brewer(type="qual", palette=2)
    + guides(alpha=False)
    + labs(
        title="New York City Neighborhoods and Citi Bike Stations",
        subtitle="2143 stations across 106 neighborhoods",
        caption="Source: https://citibikenyc.com/system-data",
        fill="Borough",
    )
    + theme_void(base_family="Guardian Sans", base_size=14)
    + theme(
        dpi=200,
        figure_size=(7, 9),
        plot_background=element_rect(fill="white", color="white"),
        plot_caption=element_text(style="italic"),
        plot_title=element_text(ha="left"),
    )
)
Isn’t New York City beautiful?
Transform
No dataset is perfect and neither is ours. That’s why the second step of this ETL showcase is to transform the data. We’ll start with the columns, and subsequently clean up the rows. We will also be adding some new columns along the way.
Clean Up Columns
The snippet below cleans up the columns of our trips DataFrame in the following ways:
- It gets rid of the columns ride_id, start_station_id, and end_station_id, because we don’t need them.
- It shortens the column names so that they’re easier to work with.
- It turns bike_type and rider_type into categories, which better reflects the data types of these columns.
- It adds a new column called duration, which is based on the start and end times of the bike trip.
trips = trips.select(
    bike_type=pl.col("rideable_type").str.split("_").list.get(0).cast(pl.Categorical),
    rider_type=pl.col("member_casual").cast(pl.Categorical),
    datetime_start=pl.col("started_at"),
    datetime_end=pl.col("ended_at"),
    station_start=pl.col("start_station_name"),
    station_end=pl.col("end_station_name"),
    lon_start=pl.col("start_lng"),
    lat_start=pl.col("start_lat"),
    lon_end=pl.col("end_lng"),
    lat_end=pl.col("end_lat"),
).with_columns(duration=(pl.col("datetime_end") - pl.col("datetime_start")))

trips.columns
You’ll learn about expressions in Chapter 7.
You’ll learn about selecting and creating columns in Chapter 10.
['bike_type', 'rider_type', 'datetime_start', 'datetime_end', 'station_start', 'station_end', 'lon_start', 'lat_start', 'lon_end', 'lat_end', 'duration']
Let’s continue with the rows of the trips DataFrame.
Clean Up Rows
You may have noticed that some of the rows are missing values. Because we have plenty of data anyway, it doesn’t hurt to remove those rows. If you have very little data, then you may want to use a different strategy, such as imputing the missing values with, say, the average value or the most common value.
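If you did want to impute instead of drop, a minimal sketch of such a strategy could look like this (we don’t actually run it in this showcase):

# Hypothetical alternative to drop_nulls(): fill missing values instead.
trips_imputed = trips.with_columns(
    # numeric columns: replace nulls with the column mean
    pl.col("lat_end").fill_null(strategy="mean"),
    pl.col("lon_end").fill_null(strategy="mean"),
    # string column: replace nulls with the most common value
    pl.col("station_end").fill_null(pl.col("station_end").mode().first()),
)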
There are a few bike trips that started in February and ended in March. It’ll make our analyses and visualizations cleaner if we remove those trips as well. Finally, let’s also remove all bike rides that started and ended at the same bike station and had a duration of less than five minutes, as those are not actually trips:
from datetime import date

trips = (
    trips
    .drop_nulls()
    .filter(
        (pl.col("datetime_start") >= date(2024, 3, 1))
        & (pl.col("datetime_end") < date(2024, 4, 1))
    )
    .filter(
        ~(
            (pl.col("station_start") == pl.col("station_end"))
            & (pl.col("duration").dt.total_seconds() < 5 * 60)
        )
    )
)
trips.height
You’ll learn about filtering rows in Chapter 11.
2639170
Once we’ve done that, the DataFrame trips still has more than 2.6 million rows, which is plenty.
Add Trip Distance
The distance of a bike trip would be interesting to have because we could then correlate it with, say, the duration. We don’t have the actual bike routes available to us, so the best that we can do is take the start and end coordinates and then calculate what is known as the Haversine distance.6
The Haversine distance can be calculated using the methods that Polars provides, but we would like to use an existing package called geo. There’s just one thing: this package is created in Rust, not in Python. So we have created a custom plugin, specifically for this book, that turns the geo package into a Polars plugin. This allows us to calculate the Haversine distance as if it were a Polars method. The method Expr.geo.haversine_distance() expects a coordinate, meaning a longitude-latitude pair:
trips = trips.with_columns(
    distance=pl.concat_list("lon_start", "lat_start").geo.haversine_distance(
        pl.concat_list("lon_end", "lat_end")
    )
    / 1000
)

trips.select(
    "lon_start",
    "lon_end",
    "lat_start",
    "lat_end",
    "distance",
    "duration",
)
The result of the geo Haversine method is reported in meters. Then we divide by a thousand to get kilometers. You’ll learn more about our custom plugin in Chapter 17.
shape: (2_639_170, 6) ┌────────────┬────────────┬───────────┬───────────┬──────────┬───────────────┐ │ lon_start │ lon_end │ lat_start │ lat_end │ distance │ duration │ │ --- │ --- │ --- │ --- │ --- │ --- │ │ f64 │ f64 │ f64 │ f64 │ f64 │ duration[μs] │ ╞════════════╪════════════╪═══════════╪═══════════╪══════════╪═══════════════╡ │ -73.995071 │ -74.007319 │ 40.749614 │ 40.707065 │ 4.842569 │ 27m 36s 805ms │ │ -73.896576 │ -73.927311 │ 40.816459 │ 40.810893 │ 2.659582 │ 9m 25s 264ms │ │ -73.988559 │ -73.989186 │ 40.746424 │ 40.742869 │ 0.398795 │ 3m 29s 483ms │ │ -73.995208 │ -74.013219 │ 40.749653 │ 40.705945 │ 5.09153 │ 30m 56s 960ms │ │ -73.957559 │ -73.979881 │ 40.69067 │ 40.668663 │ 3.08728 │ 11m 32s 483ms │ │ … │ … │ … │ … │ … │ … │ │ -73.974552 │ -73.977724 │ 40.729848 │ 40.729387 │ 0.272175 │ 1m 41s 374ms │ │ -73.971092 │ -73.965269 │ 40.763505 │ 40.763126 │ 0.492269 │ 3m 30s 363ms │ │ -73.959621 │ -73.955151 │ 40.808625 │ 40.81 │ 0.406138 │ 1m 46s 248ms │ │ -73.965971 │ -73.962644 │ 40.712996 │ 40.712605 │ 0.283781 │ 1m 43s 906ms │ │ -73.940297 │ -73.940636 │ 40.707678 │ 40.703172 │ 0.501835 │ 2m 6s 109ms │ └────────────┴────────────┴───────────┴───────────┴──────────┴───────────────┘
Keep in mind that the Haversine distance is “as the crow flies”, not as the biker rides. Still, this gives us a decent approximation of the trip distance.
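As mentioned, the same distance can also be computed with native Polars expressions, without the plugin. Here is a minimal sketch of that alternative, using the column names from our trips DataFrame:

import polars as pl

# The haversine formula expressed with native Polars expressions (no plugin required).
def haversine_km(lat1: pl.Expr, lon1: pl.Expr, lat2: pl.Expr, lon2: pl.Expr) -> pl.Expr:
    r = 6371.0  # mean Earth radius in kilometers
    dlat = (lat2 - lat1).radians()
    dlon = (lon2 - lon1).radians()
    a = (dlat / 2).sin() ** 2 + lat1.radians().cos() * lat2.radians().cos() * (dlon / 2).sin() ** 2
    return 2 * r * a.sqrt().arcsin()

trips.with_columns(
    distance_native=haversine_km(
        pl.col("lat_start"), pl.col("lon_start"), pl.col("lat_end"), pl.col("lon_end")
    )
)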
Add Borough and Neighborhood
Previously, we obtained the coordinates of the stations and the polygons of the neighborhoods. To determine in which neighborhood each station lies, we test each station’s coordinates against each neighborhood’s polygon. This method is not perfect: some stations do not match any polygon, and some match more than one, because they lie on or near neighborhood borders.
Again, we can use our custom plugin for this, because it also has a method called Expr.geo.point_in_polygon():
stations = (
    stations
    .with_columns(point=pl.concat_list("lon", "lat"))
    .join(neighborhoods, how="cross")
    .with_columns(
        in_neighborhood=pl.col("point").geo.point_in_polygon(pl.col("polygon"))
    )
    .filter(pl.col("in_neighborhood"))
    .unique("station")
    .select(
        "station",
        "borough",
        "neighborhood",
    )
)
stations
shape: (2_133, 3) ┌──────────────────────────────┬───────────┬──────────────────┐ │ station │ borough │ neighborhood │ │ --- │ --- │ --- │ │ str │ str │ str │ ╞══════════════════════════════╪═══════════╪══════════════════╡ │ 1 Ave & E 110 St │ Manhattan │ East Harlem │ │ 1 Ave & E 16 St │ Manhattan │ Stuyvesant Town │ │ 1 Ave & E 18 St │ Manhattan │ Stuyvesant Town │ │ 1 Ave & E 30 St │ Manhattan │ Kips Bay │ │ 1 Ave & E 38 St │ Manhattan │ Murray Hill │ │ … │ … │ … │ │ Wyckoff Ave & Stanhope St │ Brooklyn │ Bushwick │ │ Wyckoff St & 3 Ave │ Brooklyn │ Gowanus │ │ Wythe Ave & Metropolitan Ave │ Brooklyn │ Williamsburg │ │ Wythe Ave & N 13 St │ Brooklyn │ Williamsburg │ │ Yankee Ferry Terminal │ Manhattan │ Governors Island │ └──────────────────────────────┴───────────┴──────────────────┘
We can add this information to the trips DataFrame by joining on the station column twice: once with station_start and once with station_end:
trips = (
    trips
    .join(stations.select(pl.all().name.suffix("_start")), on="station_start")
    .join(stations.select(pl.all().name.suffix("_end")), on="station_end")
    .select(
        "bike_type",
        "rider_type",
        "datetime_start",
        "datetime_end",
        "duration",
        "station_start",
        "station_end",
        "neighborhood_start",
        "neighborhood_end",
        "borough_start",
        "borough_end",
        "lat_start",
        "lon_start",
        "lat_end",
        "lon_end",
        "distance",
    )
)
Here’s what the final DataFrame looks like:
print(trips[:, :4])
print(trips[:, 4:7])
print(trips[:, 7:11])
print(trips[:, 11:])
shape: (2_638_971, 4) ┌───────────┬────────────┬─────────────────────────┬─────────────────────────┐ │ bike_type │ rider_type │ datetime_start │ datetime_end │ │ --- │ --- │ --- │ --- │ │ cat │ cat │ datetime[μs] │ datetime[μs] │ ╞═══════════╪════════════╪═════════════════════════╪═════════════════════════╡ │ electric │ member │ 2024-03-01 00:00:02.490 │ 2024-03-01 00:27:39.295 │ │ electric │ member │ 2024-03-01 00:00:04.120 │ 2024-03-01 00:09:29.384 │ │ … │ … │ … │ … │ │ electric │ member │ 2024-03-31 23:55:41.173 │ 2024-03-31 23:57:25.079 │ │ electric │ member │ 2024-03-31 23:57:16.025 │ 2024-03-31 23:59:22.134 │ └───────────┴────────────┴─────────────────────────┴─────────────────────────┘ shape: (2_638_971, 3) ┌───────────────┬──────────────────────────────┬────────────────────────┐ │ duration │ station_start │ station_end │ │ --- │ --- │ --- │ │ duration[μs] │ str │ str │ ╞═══════════════╪══════════════════════════════╪════════════════════════╡ │ 27m 36s 805ms │ W 30 St & 8 Ave │ Maiden Ln & Pearl St │ │ 9m 25s 264ms │ Longwood Ave & Southern Blvd │ Lincoln Ave & E 138 St │ │ … │ … │ … │ │ 1m 43s 906ms │ S 4 St & Wythe Ave │ S 3 St & Bedford Ave │ │ 2m 6s 109ms │ Montrose Ave & Bushwick Ave │ Humboldt St & Varet St │ └───────────────┴──────────────────────────────┴────────────────────────┘ shape: (2_638_971, 4) ┌────────────────────┬────────────────────┬───────────────┬─────────────┐ │ neighborhood_start │ neighborhood_end │ borough_start │ borough_end │ │ --- │ --- │ --- │ --- │ │ str │ str │ str │ str │ ╞════════════════════╪════════════════════╪═══════════════╪═════════════╡ │ Chelsea │ Financial District │ Manhattan │ Manhattan │ │ Longwood │ Mott Haven │ Bronx │ Bronx │ │ … │ … │ … │ … │ │ Williamsburg │ Williamsburg │ Brooklyn │ Brooklyn │ │ Williamsburg │ Williamsburg │ Brooklyn │ Brooklyn │ └────────────────────┴────────────────────┴───────────────┴─────────────┘ shape: (2_638_971, 5) ┌───────────┬────────────┬───────────┬────────────┬──────────┐ │ lat_start │ lon_start │ lat_end │ lon_end │ distance │ │ --- │ --- │ --- │ --- │ --- │ │ f64 │ f64 │ f64 │ f64 │ f64 │ ╞═══════════╪════════════╪═══════════╪════════════╪══════════╡ │ 40.749614 │ -73.995071 │ 40.707065 │ -74.007319 │ 4.842569 │ │ 40.816459 │ -73.896576 │ 40.810893 │ -73.927311 │ 2.659582 │ │ … │ … │ … │ … │ … │ │ 40.712996 │ -73.965971 │ 40.712605 │ -73.962644 │ 0.283781 │ │ 40.707678 │ -73.940297 │ 40.703172 │ -73.940636 │ 0.501835 │ └───────────┴────────────┴───────────┴────────────┴──────────┘
Before we continue with the third and final step of the ETL showcase, we would like to share one more data visualization.
Bonus: Visualizing Daily Trips per Borough
Now that we have this information, we can analyze and visualize all sorts of interesting things, such as the number of trips per day per borough:
trips_per_hour = trips.group_by_dynamic(
    "datetime_start", group_by="borough_start", every="1d"
).agg(num_trips=pl.len())
trips_per_hour
shape: (124, 3) ┌───────────────┬─────────────────────┬───────────┐ │ borough_start │ datetime_start │ num_trips │ │ --- │ --- │ --- │ │ str │ datetime[μs] │ u32 │ ╞═══════════════╪═════════════════════╪═══════════╡ │ Manhattan │ 2024-03-01 00:00:00 │ 56434 │ │ Manhattan │ 2024-03-02 00:00:00 │ 17450 │ │ Manhattan │ 2024-03-03 00:00:00 │ 69195 │ │ Manhattan │ 2024-03-04 00:00:00 │ 63734 │ │ Manhattan │ 2024-03-05 00:00:00 │ 33309 │ │ … │ … │ … │ │ Queens │ 2024-03-27 00:00:00 │ 6232 │ │ Queens │ 2024-03-28 00:00:00 │ 3770 │ │ Queens │ 2024-03-29 00:00:00 │ 6637 │ │ Queens │ 2024-03-30 00:00:00 │ 6583 │ │ Queens │ 2024-03-31 00:00:00 │ 6237 │ └───────────────┴─────────────────────┴───────────┘
Again, we will be using Plotnine to create the visualization (see Figure 1-4).
(
    ggplot(
        trips_per_hour,
        aes(x="datetime_start", y="num_trips", fill="borough_start"),
    )
    + geom_area()
    + scale_fill_brewer(type="qual", palette=2)
    + scale_x_datetime(date_labels="%-d", date_breaks="1 day", expand=(0, 0))
    + scale_y_continuous(expand=(0, 0))
    + labs(
        x="March 2024",
        fill="Borough",
        y="Trips per day",
        title="Citi Bike Trips Per Day In March 2024",
        subtitle="On March 23, nearly 10cm of rain fell in NYC",
    )
    + theme_tufte(base_family="Guardian Sans", base_size=14)
    + theme(
        axis_ticks_major=element_line(color="white"),
        figure_size=(8, 5),
        legend_position="top",
        plot_background=element_rect(fill="white", color="white"),
        plot_caption=element_text(style="italic"),
        plot_title=element_text(ha="left"),
    )
)
There will be many more data visualizations made with many other packages in Chapter 16.
Load
The third and final step of this ETL showcase is to load the data. In other words, we are going to write the data back to disk.
Write Partitions
Instead of writing back a CSV file, we use the Parquet file format. Parquet provides several advantages over CSV:
- It includes the data type for each column, known as the schema.
- It uses columnar storage instead of row-based, enabling faster, optimized reads.
- Data is organized into chunks with embedded statistics, allowing for efficient skipping of unnecessary data. (A short sketch after this list shows what that buys you.)
- It applies compression, reducing the overall storage footprint.
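These properties pay off as soon as you query the files. Here is a minimal sketch (using the daily files we’re about to write below) in which Polars only reads the two columns it needs and can skip chunks that cannot match the filter:

# Lazy scan over the daily Parquet files we create below:
# projection pushdown reads only two columns, predicate pushdown skips chunks.
long_trips = (
    pl.scan_parquet("data/citibike/trips-*.parquet")
    .filter(pl.col("distance") > 10)
    .select("datetime_start", "distance")
    .collect()
)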
Instead of writing a single Parquet file, we are going to write one file for each day. This way, each file is small enough to be hosted on GitHub, which is necessary to share the data easily with you.
Each file name starts with the string trips, followed by a dash (-) and the date.
trips_parts = (
    trips
    .sort("datetime_start")
    .with_columns(date=pl.col("datetime_start").dt.date().cast(pl.String))
    .partition_by(["date"], as_dict=True, include_key=False)
)

for key, df in trips_parts.items():
    df.write_parquet(f"data/citibike/trips-{key[0]}.parquet")
Verify
Let’s verify that the previous code snippet produced 31 Parquet files using ls:
!ls -1 data/citibike/*.parquet
data/citibike/trips-2024-03-01.parquet
data/citibike/trips-2024-03-02.parquet
data/citibike/trips-2024-03-03.parquet
data/citibike/trips-2024-03-04.parquet
data/citibike/trips-2024-03-05.parquet
… with 26 more lines
Using globbing,7 we can easily read all the Parquet files into a single DataFrame:
pl.read_parquet("data/citibike/*.parquet").height
2638971
Excellent. We’ll use this data later, in Chapter 16, to create many exciting data visualizations. Now that the data has been loaded, we could conclude this ETL showcase. However, there is one more bonus that we would like to share, which enables you to make the entire ETL showcase faster.
Bonus: Becoming Faster by Being Lazy
Up till now, we have been using Polars’ so-called eager API. Eager in this context means that commands are executed straightaway. Polars is already fast relative to its competitors. If we use Polars’ lazy API, our calculations can sometimes be even faster.
With the lazy API, we’re not operating directly on a DataFrame. Instead, we’re building a recipe of instructions. When we are ready, we tell Polars to execute it. Before Polars actually does so, it will first optimize this recipe.
As for our showcase, being completely lazy is slightly less efficient than being eager. That’s because certain parts would need to be computed twice, and Polars doesn’t know how to cache those results. The code snippet below fixes this by using lazy execution in some parts and then turning the intermediate results into DataFrames so that they are properly cached. It’s all of the code from above, with just a few minor changes:
trips = (
    pl.scan_csv(
        "data/citibike/202403-citibike-tripdata.csv",
        try_parse_dates=True,
        schema_overrides={
            "start_station_id": pl.String,
            "end_station_id": pl.String,
        },
    )
    .select(
        bike_type=pl.col("rideable_type").str.split("_").list.get(0),
        rider_type=pl.col("member_casual"),
        datetime_start=pl.col("started_at"),
        datetime_end=pl.col("ended_at"),
        station_start=pl.col("start_station_name"),
        station_end=pl.col("end_station_name"),
        lon_start=pl.col("start_lng"),
        lat_start=pl.col("start_lat"),
        lon_end=pl.col("end_lng"),
        lat_end=pl.col("end_lat"),
    )
    .with_columns(duration=(pl.col("datetime_end") - pl.col("datetime_start")))
    .drop_nulls()
    .filter(
        ~(
            (pl.col("station_start") == pl.col("station_end"))
            & (pl.col("duration").dt.total_seconds() < 5 * 60)
        )
    )
    .with_columns(
        distance=pl.concat_list("lon_start", "lat_start").geo.haversine_distance(
            pl.concat_list("lon_end", "lat_end")
        )
        / 1000
    )
).collect()

neighborhoods = (
    pl.read_json("data/citibike/nyc-neighborhoods.geojson")
    .lazy()
    .select("features")
    .explode("features")
    .unnest("features")
    .unnest("properties")
    .select("neighborhood", "borough", "geometry")
    .unnest("geometry")
    .with_columns(polygon=pl.col("coordinates").list.first())
    .select("neighborhood", "borough", "polygon")
    .sort("neighborhood")
    .filter(pl.col("borough") != "Staten Island")
)

stations = (
    trips
    .lazy()
    .group_by(station=pl.col("station_start"))
    .agg(
        lat=pl.col("lat_start").median(),
        lon=pl.col("lon_start").median(),
    )
    .with_columns(point=pl.concat_list("lon", "lat"))
    .drop_nulls()
    .join(neighborhoods, how="cross")
    .with_columns(
        in_neighborhood=pl.col("point").geo.point_in_polygon(pl.col("polygon"))
    )
    .filter(pl.col("in_neighborhood"))
    .unique("station")
    .select(
        pl.col("station"),
        pl.col("borough"),
        pl.col("neighborhood"),
    )
).collect()

trips = (
    trips
    .join(stations.select(pl.all().name.suffix("_start")), on="station_start")
    .join(stations.select(pl.all().name.suffix("_end")), on="station_end")
    .select(
        "bike_type",
        "rider_type",
        "datetime_start",
        "datetime_end",
        "duration",
        "station_start",
        "station_end",
        "neighborhood_start",
        "neighborhood_end",
        "borough_start",
        "borough_end",
        "lat_start",
        "lon_start",
        "lat_end",
        "lon_end",
        "distance",
    )
)
trips.height
2639179
The function pl.scan_csv() returns a LazyFrame, making all the subsequent methods lazy. The method lf.collect() turns a LazyFrame into a DataFrame. The method df.lazy() turns a DataFrame into a LazyFrame.
For a single month of bike trips, Polars doesn’t speed up much, because the point-in-polygon test dominates the timing. However, when we take a year’s worth of bike trips, the lazy approach is 33% faster than the eager approach. That’s a substantial speedup for just a couple of code changes.
You’ll learn more about eager and lazy APIs in Chapter 5.
Takeaways
- Polars is a blazingly fast DataFrame package with a focus on performance and ease of use through an intuitive API.
- Polars is written in Rust and has bindings for Python, R, JavaScript, and Julia.
- The Python version is the most mature and most used version of Polars.
- Polars is a very popular Python package, as measured by the number of GitHub stars.
- Polars is, in many cases, faster than its competitors.
- When using the lazy API, Polars can be even faster.
- Polars is great for transforming, analyzing, and visualizing data.
In the next chapter, we will show how to install Polars and how to get started with it. Additionally, we’ll explain how you can follow along with the code examples in this book.
1 See https://github.com/pola-rs/polars/releases.
2 You can join the Discord server at https://discord.gg/4qf7UVDZmd.
3 See https://survey.stackoverflow.co/2024/ for the full report.
4 https://citibikenyc.com/system-data
5 The original filename is custom-pedia-cities-nyc-Mar2018.geojson.
6 The haversine distance is the shortest distance between two points on a sphere.
7 Globbing is a pattern-matching technique used to select file names based on wildcard characters like * and ?.