Chapter 4. Using DuckDB with Polars
Most data scientists and data analysts are familiar with the pandas library. With pandas, you can organize your dataset into Series or DataFrame structures and employ the diverse array of functions provided by the pandas library for data manipulation. However, one of the main complaints about pandas is its slow speed and inefficiencies when dealing with large datasets. This is because pandas was originally designed to work with tabular data that fits in memory. When dealing with large datasets, it becomes slow because it needs to swap data in and out of memory.
To address the inefficiencies of pandas in working with large datasets, there is a competing library—Polars. The first part of this chapter provides an introduction to Polars and how you can work with it (just like with pandas). The second part of this chapter shows how you can query Polars DataFrames using DuckDB.
Introduction to Polars
Polars is a DataFrame library that is completely written in Rust. Polars is designed with the following in mind:
- Speed
Polars leverages Rust, a system programming language known for its performance.
- Parallelism
Polars can take advantage of multicore processors, which provide substantial speed improvements for CPU-bound operations.
- Memory efficiency
Polars uses lazy evaluation, which means an operation is not performed until it is needed. In addition, queries can be chained and optimized before execution, resulting in much more efficient execution.
- Efficient storage of data
Polars stores data in columnar format, which is more efficient than the row-based storage in pandas.
- Ease of use
Polars supports a SQL-like syntax for data manipulation, making it immediately accessible to a large group of users. In addition, it has many methods that are similar to pandas, making it very easy for pandas users to migrate to.
To install Polars, use the pip command:

!pip install polars
Note
The version of Polars used in this book is 1.8.2.
In the following sections, you’ll learn how to get started with the Polars library and understand the magic behind the efficiencies of Polars—lazy evaluation.
Creating a Polars DataFrame
Let’s start off with the basics by creating a Polars DataFrame from a Python dictionary. The following code snippet creates a Polars DataFrame containing six columns and eight rows (see Figure 4-1):
import polars as pl

df = pl.DataFrame(
    {
        'Model': ['Camry', 'Corolla', 'RAV4', 'Mustang',
                  'F-150', 'Escape', 'Golf', 'Tiguan'],
        'Year': [1982, 1966, 1994, 1964, 1975, 2000, 1974, 2007],
        'Engine_Min': [2.5, 1.8, 2.0, 2.3, 2.7, 1.5, 1.0, 1.4],
        'Engine_Max': [3.5, 2.0, 2.5, 5.0, 5.0, 2.5, 2.0, 2.0],
        'AWD': [False, False, True, False, True, True, True, True],
        'Company': ['Toyota', 'Toyota', 'Toyota', 'Ford',
                    'Ford', 'Ford', 'Volkswagen', 'Volkswagen'],
    }
)
df
Note
Just like pandas, Jupyter Notebook will pretty-print the Polars DataFrame when you print it out.
If you observe the output, you should find it similar to a pandas DataFrame, except:
- A Polars DataFrame does not have an index. This is one of the design philosophies behind Polars: an index in a DataFrame is not useful and seldom needed.
- Below the headers of the DataFrame, Polars displays the data type of each column (str, i64, f64, and bool).
To display the full name of the data type of each column in the Polars DataFrame, use the dtypes property:

df.dtypes
For the DataFrame in Figure 4-1, this statement displays the following result:
[String, Int64, Float64, Float64, Boolean, String]
To get the column names, use the columns property:

df.columns
# ['Model', 'Year', 'Engine_Min', 'Engine_Max', 'AWD', 'Company']
If you want to get all the rows of the DataFrame, use the rows() method:

df.rows()
The rows are returned as a list of tuples:
[('Camry', 1982, 2.5, 3.5, False, 'Toyota'), ('Corolla', 1966, 1.8, 2.0, False, 'Toyota'), ('RAV4', 1994, 2.0, 2.5, True, 'Toyota'), ('Mustang', 1964, 2.3, 5.0, False, 'Ford'), ('F-150', 1975, 2.7, 5.0, True, 'Ford'), ('Escape', 2000, 1.5, 2.5, True, 'Ford'), ('Golf', 1974, 1.0, 2.0, True, 'Volkswagen'), ('Tiguan', 2007, 1.4, 2.0, True, 'Volkswagen')]
With the DataFrame loaded, the next few sections will show you how to select parts (such as columns and rows) of the DataFrame.
Selecting columns
To select a particular column in the DataFrame, use the select() method:

df.select('Model')

This returns the column named Model (see Figure 4-2).
Tip
If you’re familiar with pandas, you might be wondering whether square bracket indexing still works. Well, df['Model'] works just like the select() method. However, the Polars documentation specifically calls square bracket indexing an anti-pattern for Polars because it can be confusing. So, while df['Model'] works today, it may be removed in a future version of Polars.
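For example, both of the following retrieve the Model column; the first form is the one Polars recommends:

df.select('Model')   # explicit select(); the recommended form
df['Model']          # square bracket indexing; works, but an anti-pattern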
If you need to retrieve more than one column, enclose the column names in a list (or simply specify the additional column names):

df.select(
    ['Model', 'Company']   # or 'Model', 'Company'
)
If you want to retrieve all the string columns (that is, columns of type pl.String) in the DataFrame, you can use an expression within the select() method:

df.select(
    pl.col(pl.String)
)
Tip
The statement pl.col(pl.String) is known as an expression in Polars. You can interpret this expression as “get me all the columns whose data type is String”.
This statement prints out the Model and Company columns (see Figure 4-3).
Expressions are powerful in Polars. For example, you can pipe together multiple expressions:
df.select(
    pl.col(['Year', 'Model', 'Engine_Max'])
    .sort_by(['Engine_Max', 'Year'], descending=[False, True])
)
In this code snippet, the first expression selects the three columns Year, Model, and Engine_Max. The result of the first expression is then piped to the second expression, which sorts the Engine_Max column in ascending order and the Year column in descending order. The result is shown in Figure 4-4.
You can also group multiple expressions in a list. For example, the following code snippet lists all the string columns plus the Year column:

df.select(
    [pl.col(pl.String), 'Year']
)
Selecting rows
If you want to get a particular row in a Polars DataFrame, you can use the row() method and pass in a row number. For example, the following statement retrieves the first row in the table:

df.row(0)
# ('Camry', 1982, 2.5, 3.5, False, 'Toyota')
If you want to get multiple rows, you can use the square bracket indexing method, though this is not recommended:

df[1:3]   # returns the second and third rows
Tip
Instead of using square bracket indexing, Polars encourages the use of more explicit forms of querying and functions to manipulate data. In the real world, you often retrieve rows based on certain criteria, instead of specific row numbers. Despite this, Polars still provides support for square bracket indexing (at least for now).
Like pandas, Polars supports common methods like head(), tail(), and sample().
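As a quick illustration using the DataFrame created earlier:

df.head(3)     # first three rows
df.tail(2)     # last two rows
df.sample(2)   # two randomly sampled rows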
To select rows, Polars recommends using the filter() method. For example, if you want to select all the rows that contain cars from Toyota, use the filter() method with the following expression:

df.filter(
    pl.col('Company') == 'Toyota'
)
Figure 4-5 shows all rows containing cars from Toyota.
You can also specify multiple conditions using logical operators. The following example retrieves all cars that are from Toyota or Ford:
df.filter(
    (pl.col('Company') == 'Toyota') | (pl.col('Company') == 'Ford')
)
Note
Remember to use a pair of parentheses to enclose each condition.
If you want to match multiple brands of cars, it is easier to use the is_in() method:

df.filter(
    pl.col('Company').is_in(['Toyota', 'Ford'])
)
The following example retrieves all cars from Toyota that were launched after 1980:
df.filter(
    (pl.col('Company') == 'Toyota') & (pl.col('Year') > 1980)
)
The following example retrieves all the cars other than Toyota:
df.filter(
    ~(pl.col('Company') == 'Toyota')
)
Alternatively, you can also use the != operator:

df.filter(
    pl.col('Company') != 'Toyota'
)
Selecting rows and columns
Now that you have seen how to use the select() method to select columns and the filter() method to select rows from a Polars DataFrame, let’s see how you can chain them together to select specific rows and columns.
For example, if you want to get all the various models from Toyota, you can chain the filter() and select() methods:

df.filter(
    pl.col('Company') == 'Toyota'
).select('Model')
This code snippet prints out all the models from Toyota (see Figure 4-6).
If you want to select multiple columns, simply enclose the column names in a list:

df.filter(
    pl.col('Company') == 'Toyota'
).select(['Model', 'Year'])
Using SQL on Polars
While you can use the various methods in Polars to select rows and columns from the DataFrame, you can also use SQL to directly query a Polars DataFrame. This is done through the SQLContext class, which provides a way to execute SQL statements against Polars DataFrames.
Here is an example:
ctx = pl.SQLContext(cars=df)
ctx.execute("SELECT * FROM cars", eager=True)
SQLContext takes a named parameter (cars in this case) whose value is set to the Polars DataFrame; the parameter name becomes the table name you use in your SQL statements. You then use the SQLContext object to execute SQL statements against the Polars DataFrame. Figure 4-7 shows the result returned by the SQL statement.
Here’s another example using SQL to find the average minimum and maximum engine capacities for each company:
ctx.execute(
    '''
    SELECT Company,
           AVG(Engine_Min) AS avg_engine_min,
           AVG(Engine_Max) AS avg_engine_max
    FROM cars
    GROUP BY Company;
    ''',
    eager=True
)
Figure 4-8 shows the output of this query.
Up to this point, you have seen the techniques to select rows and columns from a Polars DataFrame. But you have not seen the most compelling reason to use Polars yet—lazy evaluation. In the next section, you’ll see how Polars uses this technique to improve performance when manipulating DataFrames.
Understanding Lazy Evaluation in Polars
One of the key features of Polars is its support for lazy evaluation. Lazy evaluation is a technique that allows for the construction of query plans that represent a sequence of operations without immediately executing them. Rather, the operations are executed only when the final result is explicitly requested. This approach makes it very efficient when dealing with large datasets or complex transformations because it avoids unnecessary computations.
To really understand why this efficiency measure is so important, you first need to understand how things are done in pandas. In pandas, you usually use the read_csv() function to read a CSV file into a pandas DataFrame:

import pandas as pd

df = pd.read_csv('flights.csv')
df
If your CSV file is large, you will spend a long time (and a lot of memory) to load all the rows in the CSV file into the pandas DataFrame. The flights.csv file has more than 5.8 million rows; hence it takes a significant amount of memory to load the entire file into memory (see Figure 4-9).
A typical operation with pandas is to load the CSV file into a DataFrame and then perform some filtering on it:
df = pd.read_csv('flights.csv')
df = df[(df['MONTH'] == 5) &
        (df['ORIGIN_AIRPORT'] == 'SFO') &
        (df['DESTINATION_AIRPORT'] == 'SEA')]
df
This is inefficient because you must load the entire CSV file into memory only to filter out a subset of it. In Polars, there is a much more efficient way of loading a DataFrame, known as lazy evaluation. There are two types of lazy evaluation:
- Implicit lazy evaluation
This is where you use functions that inherently support lazy evaluation (such as the scan_csv() function, which you’ll see in the next section).
- Explicit lazy evaluation
This is where you use functions that do not inherently support lazy evaluation (such as the read_csv() function), and you explicitly make them use lazy evaluation.
Let’s dig a little deeper into each of these.
Implicit lazy evaluation
To understand how lazy evaluation works, let’s walk through an example. Instead of using the read_csv() function (which also works with Polars), you use the scan_csv() function:

import polars as pl

q = pl.scan_csv('flights.csv')
type(q)
The scan_csv() function returns an object of type polars.lazyframe.frame.LazyFrame, which is a representation of a lazy computation graph/query against a DataFrame. Put simply, when you use the scan_csv() function to load a CSV file, the contents of the CSV file are not loaded immediately. Instead, the function waits for further queries so that it can optimize the entire set of queries before loading the contents of the CSV file.
Contrast this with using the read_csv() function to load a CSV file in Polars:

df = pl.read_csv('flights.csv')
type(df)
The read_csv() function returns a polars.dataframe.frame.DataFrame object, which is similar to a pandas DataFrame. However, unlike the scan_csv() function, read_csv() uses eager execution mode, which means that it immediately loads the entire dataset into the DataFrame before you perform any other queries.
Once you have obtained a LazyFrame object, you can apply your queries to it:

q = pl.scan_csv('flights.csv')
q = q.select(['MONTH', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT'])
q = q.filter(
    (pl.col('MONTH') == 5) &
    (pl.col('ORIGIN_AIRPORT') == 'SFO') &
    (pl.col('DESTINATION_AIRPORT') == 'SEA')
)
Tip
The select() and filter() methods work on Polars DataFrames as well as on LazyFrame objects.
For readability, you should ideally use a pair of parentheses to chain the various methods in Polars:

q = (
    pl.scan_csv('flights.csv')
    .select(['MONTH', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT'])
    .filter(
        (pl.col('MONTH') == 5) &
        (pl.col('ORIGIN_AIRPORT') == 'SFO') &
        (pl.col('DESTINATION_AIRPORT') == 'SEA')
    )
)
You can call the show_graph() method to show the execution graph:

q.show_graph(optimized=True)
Figure 4-10 shows the execution graph of your query. You can see that it first scans the CSV file (top of the graph), and then performs a filter (bottom of the graph).
In contrast, if you call the show_graph() method with the optimized argument set to False, you will see that it performs a scan of the CSV file, loads all 31 columns, and only then performs the filters one by one (see Figure 4-11):

q.show_graph(optimized=False)
Note
By default, show_graph() displays the query plan in its optimized form. However, if you print out the q object, it displays the plan in non-optimized mode.
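If you prefer a textual query plan to a rendered graph, a LazyFrame also offers an explain() method, which, like show_graph(), accepts an optimized argument:

# print the optimized query plan as text
print(q.explain(optimized=True))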
To execute the queries, call the collect() method:

q.collect()
The collect() method returns the result of the queries as a Polars DataFrame (see Figure 4-12).
Explicit lazy evaluation
Earlier I mentioned that if you use the read_csv() function to read a CSV file, Polars will use eager execution and immediately load the DataFrame. Consider the following code snippet:
df = (
    pl.read_csv('flights.csv')
    .select(['MONTH', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT'])
    .filter(
        (pl.col('MONTH') == 5) &
        (pl.col('ORIGIN_AIRPORT') == 'SFO') &
        (pl.col('DESTINATION_AIRPORT') == 'SEA')
    )
)
df
Observe that after loading the CSV file, we perform a selection of columns followed by filtering of rows. All these queries are cascaded and performed one after another. This is because the read_csv() function does not inherently support lazy evaluation.
To ensure that all the subsequent queries can be optimized after the CSV is loaded, call the lazy() method immediately after the read_csv() function to explicitly switch to lazy evaluation:
q = (
    pl.read_csv('flights.csv')
    .lazy()
    .select(['MONTH', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT'])
    .filter(
        (pl.col('MONTH') == 5) &
        (pl.col('ORIGIN_AIRPORT') == 'SFO') &
        (pl.col('DESTINATION_AIRPORT') == 'SEA')
    )
)
df = q.collect()
display(df)
The lazy() method returns a LazyFrame object, and with that you can chain further queries using methods such as select(), filter(), and so on. All the queries will now be optimized before execution.
Now that you are familiar with the basics of Polars, it’s time to see how it can be used together with DuckDB so that we can have the best of both worlds: the efficiency of Polars DataFrames, together with the use of SQL for querying the dataset.
Querying Polars DataFrames Using DuckDB
Despite its ease of use, manipulating Polars DataFrames still requires a bit of practice and has a relatively steep learning curve for beginners. But since most developers are already familiar with SQL, isn’t it more convenient to manipulate the DataFrames directly using SQL? Using this approach, developers have the best of both worlds:
- The ability to query Polars DataFrames using the library’s various functions and methods.
- The ability to use SQL for cases where it is much more natural and easier to extract the data they want.
The good news is that DuckDB has support for Polars DataFrames through Apache Arrow. This means that you can use SQL to directly query a Polars DataFrame.
Note
Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to store, process, and move data quickly. PyArrow is the Python implementation of Arrow.
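As a minimal sketch of this Arrow bridge (the tiny DataFrame here is purely illustrative), a Polars DataFrame can be converted to a PyArrow table with to_arrow(), and DuckDB can then query that table directly by its Python variable name:

import duckdb
import polars as pl

pl_df = pl.DataFrame({'a': [1, 2, 3]})
arrow_tbl = pl_df.to_arrow()   # convert the Polars DataFrame to a PyArrow table
duckdb.sql('SELECT SUM(a) AS total FROM arrow_tbl').show()

In practice, DuckDB handles this handoff for you, which is why you can reference a Polars DataFrame directly in a query, as you’ll see next.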
Using the sql() Function
Let’s now cover how you can use DuckDB to query a Polars DataFrame. For this, we’ll use the Polars DataFrame that we created earlier:
import polars as pl

df = pl.DataFrame(
    {
        'Model': ['Camry', 'Corolla', 'RAV4', 'Mustang',
                  'F-150', 'Escape', 'Golf', 'Tiguan'],
        'Year': [1982, 1966, 1994, 1964, 1975, 2000, 1974, 2007],
        'Engine_Min': [2.5, 1.8, 2.0, 2.3, 2.7, 1.5, 1.0, 1.4],
        'Engine_Max': [3.5, 2.0, 2.5, 5.0, 5.0, 2.5, 2.0, 2.0],
        'AWD': [False, False, True, False, True, True, True, True],
        'Company': ['Toyota', 'Toyota', 'Toyota', 'Ford',
                    'Ford', 'Ford', 'Volkswagen', 'Volkswagen'],
    }
)
To use DuckDB to query a Polars DataFrame, you need to install the PyArrow library:
pip install pyarrow
Tip
You can perform the installation either in Jupyter Notebook or in Terminal/Command Prompt. In Jupyter Notebook, after the installation, remember to restart the kernel.
To select all the rows from df, use the sql() function from the duckdb module:

import duckdb

result = duckdb.sql('''
    SELECT *
    FROM df
''')
result
The sql() function returns a duckdb.DuckDBPyRelation object, which is displayed as a table when printed in Jupyter Notebook (see Figure 4-13).
A DuckDBPyRelation object is part of DuckDB’s Relational API, which can be used to construct queries. Later in this chapter, we’ll discuss this object in more detail.
To convert the DuckDBPyRelation object to a Polars DataFrame, use the pl() method:

result.pl()
Tip
To convert the DuckDBPyRelation object to a pandas DataFrame, use the df() method.
With the DuckDBPyRelation object, you can perform several tasks. For example, you can use the describe() method to generate some basic statistics (e.g., min, max, median, count) for each column in the DataFrame:

result.describe()
The result of describe() is yet another DuckDBPyRelation object, which you can convert to a Polars or pandas DataFrame if you wish.
Figure 4-14 shows the output of the describe() method when called on the result object.
You can sort the result using the order() method:

result.order('Year')
In this example, the result is sorted by year in ascending order (see Figure 4-15).
If you want to sort by year in descending order, use the DESC keyword:

result.order('Year DESC')
You can use the apply() method to apply a function to a particular column, for example, to get the minimum value in the Year column:

result.apply('min', 'Year')
Figure 4-16 shows the output.
While you can use the various methods from the DuckDBPyRelation object to extract data, there are always instances where it is easier to accomplish the same task using SQL. For example, say you want to sort the rows based on company followed by model. It would be very easy to accomplish this using SQL:

duckdb.sql('''
    SELECT Company, Model
    FROM df
    ORDER BY Company, Model
''').pl()
Figure 4-17 shows the result of this query.
Or, if you want to count the number of models for each company, you can use the SQL GROUP BY statement:

duckdb.sql('''
    SELECT Company, count(Model) AS count
    FROM df
    GROUP BY Company
''').pl()
Figure 4-18 shows the output of this query.
You can perform the same query in Polars using the following statement:

result.pl().select(
    pl.col('Company').value_counts()
).unnest('Company')
In the next section, you’ll learn more about the DuckDBPyRelation object and how you can use it to perform various DataFrame operations.
Using the DuckDBPyRelation Object
In the previous sections, you saw several mentions of the DuckDBPyRelation object. This object represents an alternative way for you to construct queries to extract data from your databases. Typically, you create DuckDBPyRelation objects from SQL queries or directly from a connection object.
Let’s first create a DuckDB connection and then use the connection to create three tables: customers, products, and sales. The following code snippet shows how this is done:
import duckdb

conn = duckdb.connect()

conn.execute('''
    CREATE TABLE customers
    (customer_id INTEGER PRIMARY KEY, name STRING)
''')
conn.execute('''
    CREATE TABLE products
    (product_id INTEGER PRIMARY KEY, product_name STRING)
''')
conn.execute('''
    CREATE TABLE sales
    (customer_id INTEGER, product_id INTEGER, qty INTEGER,
     PRIMARY KEY(customer_id, product_id))
''')
Now that the tables are created in DuckDB, you can load a specific table using the table() method of the conn object:

customers_relation = conn.table('customers')
The result from the table() method is a duckdb.DuckDBPyRelation object. As you learned earlier in this chapter, you can convert this object to a pandas or Polars DataFrame:

# convert to a pandas DataFrame
customers_relation.df()

# convert to a Polars DataFrame
customers_relation.pl()
Inserting rows
Using the DuckDBPyRelation object, you can call the insert() method to insert a new row into the table. The following code snippet inserts three rows into the customers table:

customers_relation.insert([1, 'Alice'])
customers_relation.insert([2, 'Bob'])
customers_relation.insert([3, 'Charlie'])
At the same time, let’s also insert rows into the products and sales tables:
products_relation = conn.table('products')
products_relation.insert([10, 'Paperclips'])
products_relation.insert([20, 'Staple'])
products_relation.insert([30, 'Notebook'])

sales_relation = conn.table('sales')
sales_relation.insert([1, 20, 1])
sales_relation.insert([1, 10, 2])
sales_relation.insert([2, 30, 7])
sales_relation.insert([3, 10, 3])
sales_relation.insert([3, 20, 2])
Joining tables
Now that we have three DuckDBPyRelation objects representing the three tables, we can perform joins between the tables using the join() method:

result = (
    customers_relation
    .join(sales_relation, condition="customer_id", how="inner")
    .join(products_relation, condition="product_id", how="inner")
)
In this code snippet, the customers table is joined to the sales table, and the result is then joined with the products table.
To see the resultant table after the joins, simply print out the value of result. Figure 4-19 shows the result.
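If you are running this code in a plain Python script rather than in Jupyter Notebook, you can print the relation explicitly with the show() method:

result.show()   # prints the joined table to the console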
Filtering rows
After you have performed a join on the tables, you can use the result to extract the rows that you want using the filter() method:

result.filter('customer_id = 1')
Figure 4-20 shows the products bought by Alice (customer ID 1).
Alternatively, you can use the execute() method and pass in a SQL statement:

# execute a query on the result to fetch and print the joined data
conn.execute('''
    SELECT *
    FROM result
    WHERE customer_id = 1
''').pl()
Aggregating rows
You can also perform aggregation using a DuckDBPyRelation object. Suppose you want to sum up the purchases for all customers. You can use the aggregate() method like this:

result.aggregate(
    'customer_id, MAX(name) AS Name, SUM(qty) AS "Total Qty"',
    'customer_id'
)
The first argument takes in an aggregate expression, while the second argument takes in a group expression. The result is shown in Figure 4-21.
This aggregate() call is identical to the following GROUP BY statement:

SELECT customer_id AS 'Customer ID',
       MAX(name) AS Name,
       SUM(qty) AS 'Total Qty'
FROM result
GROUP BY customer_id
Projecting columns
Using the DuckDBPyRelation object, you can select specific columns to display using the project() method:

result.project('name, qty, product_name')
This statement displays the three columns (name, qty, and product_name) in the result (see Figure 4-22).
Limiting rows
To limit the number of rows returned, use the limit() method:

result.limit(3)
Figure 4-23 shows the first three rows in the result object.
If you want to start with the third row and display the next three rows, specify the number of rows to display, followed by the offset:

result.limit(3, 2)   # display 3 rows, starting at offset 2 (the third row)
The result is shown in Figure 4-24.
Summary
In this chapter, you have learned about the Polars DataFrame library, including the basics of extracting rows and columns, followed by an explanation of how lazy evaluation works in Polars. More importantly, you also learned how DuckDB and Polars can be used together to query DataFrames. Utilizing both libraries gives you the best of both worlds—you can manipulate your data using methods that you are already familiar with and you can use a familiar querying language (SQL) to query an efficient DataFrame.
In the next chapter, you’ll learn how to perform exploratory data analysis with DuckDB using a real-world dataset!