Chapter 4. Using DuckDB with Polars
Most data scientists and data analysts are familiar with the pandas library. With pandas, you can organize your dataset into Series or DataFrame structures and employ the diverse array of functions provided by the pandas library for data manipulation. However, one of the main complaints about pandas is its slow speed and inefficiencies when dealing with large datasets. This is because pandas was originally designed to work with tabular data that fits in memory. When dealing with large datasets, it becomes slow because it needs to swap data in and out of memory.
To address the inefficiencies of pandas in working with large datasets, there is a competing library—Polars. The first part of this chapter provides an introduction to Polars and how you can work with it (just like with pandas). The second part of this chapter shows how you can query Polars DataFrames using DuckDB.
Introduction to Polars
Polars is a DataFrame library that is completely written in Rust. Polars is designed with the following in mind:
- Speed
Polars leverages Rust, a system programming language known for its performance.
- Parallelism
Polars can take advantage of multicore processors, which provide substantial speed improvements for CPU-bound operations.
- Memory efficiency
Polars uses lazy evaluation, which means an operation is not performed until it is needed. In addition, queries can be chained and optimized before execution, resulting in much more efficient execution.
- Efficient storage of data
Polars stores data in columnar format, which is more efficient than the row-based storage in pandas.
- Ease of use
Polars supports a SQL-like syntax for data manipulation, making it immediately accessible to a large group of users. In addition, it has many methods that are similar to pandas, making it very easy for pandas users to migrate to.
To install Polars, use the pip command:

!pip install polars
Note
The version of Polars used in this book is 1.8.2.
In the following sections, you’ll learn how to get started with the Polars library and understand the magic behind the efficiencies of Polars—lazy evaluation.
Creating a Polars DataFrame
Let’s start off with the basics by creating a Polars DataFrame from a Python dictionary. The following code snippet creates a Polars DataFrame containing six columns and eight rows (see Figure 4-1):
import polars as pl

df = pl.DataFrame(
    {
        'Model': ['Camry', 'Corolla', 'RAV4', 'Mustang',
                  'F-150', 'Escape', 'Golf', 'Tiguan'],
        'Year': [1982, 1966, 1994, 1964, 1975, 2000, 1974, 2007],
        'Engine_Min': [2.5, 1.8, 2.0, 2.3, 2.7, 1.5, 1.0, 1.4],
        'Engine_Max': [3.5, 2.0, 2.5, 5.0, 5.0, 2.5, 2.0, 2.0],
        'AWD': [False, False, True, False, True, True, True, True],
        'Company': ['Toyota', 'Toyota', 'Toyota', 'Ford',
                    'Ford', 'Ford', 'Volkswagen', 'Volkswagen'],
    }
)
df
Note
Just like pandas, Jupyter Notebook will pretty-print the Polars DataFrame when you print it out.
If you observe the output, you should find it similar to a pandas DataFrame, except:
- A Polars DataFrame does not have an index. This is one of the design philosophies behind Polars: an index in a DataFrame is not useful and seldom needed.
- Below the headers of the DataFrame, Polars displays the data type of each column (str, i64, f64, and bool).
To display the full name of the data type of each column in the Polars DataFrame, use the dtypes property:

df.dtypes
For the DataFrame in Figure 4-1, this statement displays the following result:
[String, Int64, Float64, Float64, Boolean, String]
To get the column names, use the columns property:

df.columns
# ['Model', 'Year', 'Engine_Min', 'Engine_Max', 'AWD', 'Company']
If you want to get all the rows of the DataFrame, use the rows() method:

df.rows()
The rows are returned as a list of tuples:
[('Camry', 1982, 2.5, 3.5, False, 'Toyota'), ('Corolla', 1966, 1.8, 2.0, False, 'Toyota'), ('RAV4', 1994, 2.0, 2.5, True, 'Toyota'), ('Mustang', 1964, 2.3, 5.0, False, 'Ford'), ('F-150', 1975, 2.7, 5.0, True, 'Ford'), ('Escape', 2000, 1.5, 2.5, True, 'Ford'), ('Golf', 1974, 1.0, 2.0, True, 'Volkswagen'), ('Tiguan', 2007, 1.4, 2.0, True, 'Volkswagen')]
With the DataFrame loaded, the next few sections will show you how to select parts (such as columns and rows) of the DataFrame.
Selecting columns
To select a particular column in the DataFrame, use the select() method:

df.select('Model')

This returns the column named Model (see Figure 4-2).
Tip
If you’re familiar with pandas, you might be wondering whether square bracket indexing still works. Well, df['Model'] works just like the select() method. However, the Polars documentation specifically calls square bracket indexing an anti-pattern for Polars because it can be confusing. So, while df['Model'] works today, it may be removed in a future version of Polars.
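For example, both of the following retrieve the Model column; the first form is the one Polars recommends:

df.select('Model')   # explicit select(); the recommended form
df['Model']          # square bracket indexing; works, but an anti-pattern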
If you need to retrieve more than one column, enclose the column names in a list (or simply specify the additional column names):

df.select(
    ['Model', 'Company']   # or 'Model', 'Company'
)
If you want to retrieve all the string columns (that is, columns of type pl.String) in the DataFrame, you can use an expression within the select() method:

df.select(
    pl.col(pl.String)
)
Tip
The statement pl.col(pl.String) is known as an expression in Polars. You can interpret this expression as “get me all the columns whose data type is String”.
This statement prints out the Model and Company columns (see Figure 4-3).
Expressions are powerful in Polars. For example, you can pipe together multiple expressions:
df.select(
    pl.col(['Year', 'Model', 'Engine_Max'])
    .sort_by(['Engine_Max', 'Year'], descending=[False, True])
)
In this code snippet, the first expression selects the three columns Year, Model, and Engine_Max. The result of the first expression is then piped to the second expression, which sorts the Engine_Max column in ascending order and the Year column in descending order. The result is shown in Figure 4-4.
You can also group multiple expressions in a list. For example, the following code snippet lists all the string columns plus the Year column:

df.select(
    [pl.col(pl.String), 'Year']
)
Selecting rows
If you want to get a particular row in a Polars DataFrame, you can use the row() method and pass in a row number. For example, the following statement retrieves the first row in the table:

df.row(0)
# ('Camry', 1982, 2.5, 3.5, False, 'Toyota')
If you want to get multiple rows, you can use the square bracket indexing method, though this is not recommended:

df[1:3]   # returns the second and third rows
Tip
Instead of using square bracket indexing, Polars encourages the use of more explicit forms of querying and functions to manipulate data. In the real world, you often retrieve rows based on certain criteria, instead of specific row numbers. Despite this, Polars still provides support for square bracket indexing (at least for now).
Like pandas, Polars supports common methods like head(), tail(), and sample().
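As a quick illustration using the DataFrame created earlier:

df.head(3)     # first three rows
df.tail(2)     # last two rows
df.sample(2)   # two randomly sampled rows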
To select rows, Polars recommends using the filter() method. For example, if you want to select all the rows that contain cars from Toyota, use the filter() method with the following expression:

df.filter(
    pl.col('Company') == 'Toyota'
)
Figure 4-5 shows all rows containing cars from Toyota.
You can also specify multiple conditions using logical operators. The following example retrieves all cars that are from Toyota or Ford:
df.filter(
    (pl.col('Company') == 'Toyota') | (pl.col('Company') == 'Ford')
)
Note
Remember to use a pair of parentheses to enclose each condition.
If you want to match multiple brands of cars, it is easier to use the is_in() method:

df.filter(
    pl.col('Company').is_in(['Toyota', 'Ford'])
)
The following example retrieves all cars from Toyota that were launched after 1980:
df.filter(
    (pl.col('Company') == 'Toyota') & (pl.col('Year') > 1980)
)
The following example retrieves all the cars other than Toyota:
df.filter(
    ~(pl.col('Company') == 'Toyota')
)
Alternatively, you can also use the != operator:

df.filter(
    pl.col('Company') != 'Toyota'
)
Selecting rows and columns
Now that you have seen how to use the select() method to select columns and the filter() method to select rows from a Polars DataFrame, let’s see how you can chain them together to select specific rows and columns.
For example, if you want to get all the various models from Toyota, you can chain the filter() and select() methods:

df.filter(
    pl.col('Company') == 'Toyota'
).select('Model')
This code snippet prints out all the models from Toyota (see Figure 4-6).
If you want to select multiple columns, simply enclose the column names in a list:

df.filter(
    pl.col('Company') == 'Toyota'
).select(['Model', 'Year'])
Using SQL on Polars
While you can use the various methods in Polars to select rows and columns from the DataFrame, you can also use SQL to directly query a Polars DataFrame. This is done through the SQLContext class, which provides a way to execute SQL statements against Polars DataFrames.
Here is an example:
ctx = pl.SQLContext(cars=df)
ctx.execute("SELECT * FROM cars", eager=True)
SQLContext takes a named parameter (cars in this case) whose value is set to the Polars DataFrame; the parameter name becomes the table name you use in your SQL statements. You then use the SQLContext object to execute SQL statements against the Polars DataFrame. Figure 4-7 shows the result returned by the SQL statement.
Here’s another example using SQL to find the average minimum and maximum engine capacities for each company:
ctx.execute(
    '''
    SELECT Company,
           AVG(Engine_Min) AS avg_engine_min,
           AVG(Engine_Max) AS avg_engine_max
    FROM cars
    GROUP BY Company;
    ''',
    eager=True
)
Figure 4-8 shows the output of this query.
Up to this point, you have seen the techniques to select rows and columns from a Polars DataFrame. But you have not seen the most compelling reason to use Polars yet—lazy evaluation. In the next section, you’ll see how Polars uses this technique to improve performance when manipulating DataFrames.
Understanding Lazy Evaluation in Polars
One of the key features of Polars is its support for lazy evaluation. Lazy evaluation is a technique that allows for the construction of query plans that represent a sequence of operations without immediately executing them. Rather, the operations are executed only when the final result is explicitly requested. This approach makes it very efficient when dealing with large datasets or complex transformations because it avoids unnecessary computations.
To really understand why this efficiency measure is so important, you first need to understand how things are done in pandas. In pandas, you usually use the read_csv() function to read a CSV file into a pandas DataFrame:

import pandas as pd

df = pd.read_csv('flights.csv')
df
If your CSV file is large, you will spend a long time (and a lot of memory) to load all the rows in the CSV file into the pandas DataFrame. The flights.csv file has more than 5.8 million rows; hence it takes a significant amount of memory to load the entire file into memory (see Figure 4-9).
A typical operation with pandas is to load the CSV file into a DataFrame and then perform some filtering on it:
df = pd.read_csv('flights.csv')
df = df[(df['MONTH'] == 5) &
        (df['ORIGIN_AIRPORT'] == 'SFO') &
        (df['DESTINATION_AIRPORT'] == 'SEA')]
df
This is inefficient because you must load the entire CSV file into memory only to filter out a subset of it. In Polars, there is a much more efficient way of loading a DataFrame, known as lazy evaluation. There are two types of lazy evaluation:
- Implicit lazy evaluation
This is where you use functions that inherently support lazy evaluation (such as the scan_csv() function, which you’ll see in the next section).
- Explicit lazy evaluation
This is where you use functions that do not inherently support lazy evaluation (such as the read_csv() function), and you explicitly make them use lazy evaluation.
Let’s dig a little deeper into each of these.
Implicit lazy evaluation
To understand how lazy evaluation works, let’s walk through an example. Instead of using the read_csv() function (which also works with Polars), you use the scan_csv() function:

import polars as pl

q = pl.scan_csv('flights.csv')
type(q)
The scan_csv() function returns an object of type polars.lazyframe.frame.LazyFrame, which is a representation of a lazy computation graph/query against a DataFrame. Put simply, when you use the scan_csv() function to load a CSV file, the contents of the CSV file are not loaded immediately. Instead, the function waits for further queries so that it can optimize the entire set of queries before loading the contents of the CSV file.
Contrast this with using the read_csv() function to load a CSV file in Polars:

df = pl.read_csv('flights.csv')
type(df)
The read_csv() function returns a polars.dataframe.frame.DataFrame object, which is similar to a pandas DataFrame. However, unlike the scan_csv() function, read_csv() uses eager execution mode, which means that it immediately loads the entire dataset into the DataFrame before you perform any other queries.
Once you have obtained a LazyFrame object, you can apply your queries to it:

q = pl.scan_csv('flights.csv')
q = q.select(['MONTH', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT'])
q = q.filter(
    (pl.col('MONTH') == 5) &
    (pl.col('ORIGIN_AIRPORT') == 'SFO') &
    (pl.col('DESTINATION_AIRPORT') == 'SEA')
)
Tip
The select() and filter() methods work on Polars DataFrames as well as on LazyFrame objects.
For readability, you should ideally use a pair of parentheses to chain the various methods in Polars:

q = (
    pl.scan_csv('flights.csv')
    .select(['MONTH', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT'])
    .filter(
        (pl.col('MONTH') == 5) &
        (pl.col('ORIGIN_AIRPORT') == 'SFO') &
        (pl.col('DESTINATION_AIRPORT') == 'SEA')
    )
)
You can call the show_graph() method to show the execution graph:

q.show_graph(optimized=True)
Figure 4-10 shows the execution graph of your query. You can see that it first scans the CSV file (top of the graph), and then performs a filter (bottom of the graph).
In contrast, if you call the show_graph() method with the optimized argument set to False, you will see that it performs a scan of the CSV file, loads all 31 columns, and only then performs the filters one by one (see Figure 4-11):

q.show_graph(optimized=False)
Note
By default, show_graph() displays the query plan in its optimized form. However, if you print out the q object, it displays the plan in non-optimized mode.
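If you prefer a textual query plan to a rendered graph, a LazyFrame also offers an explain() method, which, like show_graph(), accepts an optimized argument:

# print the optimized query plan as text
print(q.explain(optimized=True))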
To execute the queries, call the collect() method:

q.collect()
The collect() method returns the result of the queries as a Polars DataFrame (see Figure 4-12).
Explicit lazy evaluation
Earlier I mentioned that if you use the read_csv() function to read a CSV file, Polars will use eager execution and immediately load the DataFrame. Consider the following code snippet:
df = (
    pl.read_csv('flights.csv')
    .select(['MONTH', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT'])
    .filter(
        (pl.col('MONTH') == 5) &
        (pl.col('ORIGIN_AIRPORT') == 'SFO') &
        (pl.col('DESTINATION_AIRPORT') == 'SEA')
    )
)
df
Observe that after loading the CSV file, we perform a selection of columns followed by filtering of rows. All these queries are cascaded and performed one after another. This is because the read_csv() function does not inherently support lazy evaluation.
To ensure that all the subsequent queries can be optimized after the CSV is loaded, call the lazy() method immediately after the read_csv() function to explicitly switch to lazy evaluation:
q = (
    pl.read_csv('flights.csv')
    .lazy()
    .select(['MONTH', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT'])
    .filter(
        (pl.col('MONTH') == 5) &
        (pl.col('ORIGIN_AIRPORT') == 'SFO') &
        (pl.col('DESTINATION_AIRPORT') == 'SEA')
    )
)
df = q.collect()
display(df)
The lazy() method returns a LazyFrame object, and with that you can chain further queries using methods such as select(), filter(), and so on. All the queries will now be optimized before execution.
Now that you are familiar with the basics of Polars, it’s time to see how it can be used together with DuckDB so that we can have the best of both worlds: the efficiency of Polars DataFrames, together with the use of SQL for querying the dataset.
Querying Polars DataFrames Using DuckDB
Despite its ease of use, manipulating Polars DataFrames still requires a bit of practice and has a relatively steep learning curve for beginners. But since most developers are already familiar with SQL, isn’t it more convenient to manipulate the DataFrames directly using SQL? Using this approach, developers have the best of both worlds:
- The ability to query Polars DataFrames using the library’s various functions and methods.
- The ability to use SQL for cases where it is much more natural and easier to extract the data they want.
The good news is that DuckDB has support for Polars DataFrames through Apache Arrow. This means that you can use SQL to directly query a Polars DataFrame.
Note
Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to store, process, and move data quickly. PyArrow is the Python implementation of Arrow.
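As a minimal sketch of this Arrow bridge (the tiny DataFrame here is purely illustrative), a Polars DataFrame can be converted to a PyArrow table with to_arrow(), and DuckDB can then query that table directly by its Python variable name:

import duckdb
import polars as pl

pl_df = pl.DataFrame({'a': [1, 2, 3]})
arrow_tbl = pl_df.to_arrow()   # convert the Polars DataFrame to a PyArrow table
duckdb.sql('SELECT SUM(a) AS total FROM arrow_tbl').show()

In practice, DuckDB handles this handoff for you, which is why you can reference a Polars DataFrame directly in a query, as you’ll see next.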
Using the sql() Function
Let’s now cover how you can use DuckDB to query a Polars DataFrame. For this, we’ll use the Polars DataFrame that we created earlier:
import polars as pl

df = pl.DataFrame(
    {
        'Model': ['Camry', 'Corolla', 'RAV4', 'Mustang',
                  'F-150', 'Escape', 'Golf', 'Tiguan'],
        'Year': [1982, 1966, 1994, 1964, 1975, 2000, 1974, 2007],
        'Engine_Min': [2.5, 1.8, 2.0, 2.3, 2.7, 1.5, 1.0, 1.4],
        'Engine_Max': [3.5, 2.0, 2.5, 5.0, 5.0, 2.5, 2.0, 2.0],
        'AWD': [False, False, True, False, True, True, True, True],
        'Company': ['Toyota', 'Toyota', 'Toyota', 'Ford',
                    'Ford', 'Ford', 'Volkswagen', 'Volkswagen'],
    }
)
To use DuckDB to query a Polars DataFrame, you need to install the PyArrow library:
pip install pyarrow
Tip
You can perform the installation either in Jupyter Notebook or in Terminal/Command Prompt. In Jupyter Notebook, after the installation, remember to restart the kernel.
To select all the rows from df, use the sql() function from the duckdb module:

import duckdb

result = duckdb.sql('''
    SELECT *
    FROM df
''')
result
The sql() function returns a duckdb.DuckDBPyRelation object, which is displayed as a table when printed in Jupyter Notebook (see Figure 4-13).
A DuckDBPyRelation object is part of DuckDB’s Relational API, which can be used to construct queries. Later in this chapter, we’ll discuss this object in more detail.
To convert the DuckDBPyRelation object to a Polars DataFrame, use the pl() method:

result.pl()
Tip
To convert the DuckDBPyRelation object to a pandas DataFrame, use the df() method.
With the DuckDBPyRelation object, you can perform several tasks. For example, you can use the describe() method to generate some basic statistics (e.g., min, max, median, count) for each column in the DataFrame:

result.describe()
The result of describe() is yet another DuckDBPyRelation object, which you can convert to a Polars or pandas DataFrame if you wish.
Figure 4-14 shows the output of the describe() method when called on the result object.
You can sort the result using the order() method:

result.order('Year')
In this example, the result is sorted by year in ascending order (see Figure 4-15).
If you want to sort by year in descending order, use the DESC keyword:

result.order('Year DESC')
You can use the apply() method to apply a function to a particular column, for example, to get the minimum value in the Year column:

result.apply('min', 'Year')
Figure 4-16 shows the output.
While you can use the various methods from the DuckDBPyRelation object to extract data, there are always instances where it is easier to accomplish the same task using SQL. For example, say you want to sort the rows based on company followed by model. It would be very easy to accomplish this using SQL:

duckdb.sql('''
    SELECT Company, Model
    FROM df
    ORDER BY Company, Model
''').pl()
Figure 4-17 shows the result of this query.
Or, if you want to count the number of models for each company, you can use the SQL GROUP BY statement:

duckdb.sql('''
    SELECT Company, count(Model) AS count
    FROM df
    GROUP BY Company
''').pl()
Figure 4-18 shows the output of this query.
You can perform the same query in Polars using the following statement:

result.pl().select(
    pl.col('Company').value_counts()
).unnest('Company')
In the next section, you’ll learn more about the DuckDBPyRelation object and how you can use it to perform various DataFrame operations.
Using the DuckDBPyRelation Object
In the previous sections, you saw several mentions of the DuckDBPyRelation object. This object represents an alternative way for you to construct queries to extract data from your databases. Typically, you create DuckDBPyRelation objects from SQL queries or directly from a connection object.
Let’s first create a DuckDB connection and then use the connection to create three tables: customers, products, and sales. The following code snippet shows how this is done:
import duckdb

conn = duckdb.connect()

conn.execute('''
    CREATE TABLE customers
    (customer_id INTEGER PRIMARY KEY, name STRING)
''')
conn.execute('''
    CREATE TABLE products
    (product_id INTEGER PRIMARY KEY, product_name STRING)
''')
conn.execute('''
    CREATE TABLE sales
    (customer_id INTEGER, product_id INTEGER, qty INTEGER,
     PRIMARY KEY(customer_id, product_id))
''')
Now that the tables are created in DuckDB, you can load a specific table using the table() method of the conn object:

customers_relation = conn.table('customers')
The result from the table() method is a duckdb.DuckDBPyRelation object. As you learned earlier in this chapter, you can convert this object to a pandas or Polars DataFrame:

# convert to a pandas DataFrame
customers_relation.df()

# convert to a Polars DataFrame
customers_relation.pl()
Inserting rows
Using the DuckDBPyRelation object, you can call the insert() method to insert a new row into the table. The following code snippet inserts three rows into the customers table:

customers_relation.insert([1, 'Alice'])
customers_relation.insert([2, 'Bob'])
customers_relation.insert([3, 'Charlie'])
At the same time, let’s also insert rows into the products and sales tables:
products_relation = conn.table('products')
products_relation.insert([10, 'Paperclips'])
products_relation.insert([20, 'Staple'])
products_relation.insert([30, 'Notebook'])

sales_relation = conn.table('sales')
sales_relation.insert([1, 20, 1])
sales_relation.insert([1, 10, 2])
sales_relation.insert([2, 30, 7])
sales_relation.insert([3, 10, 3])
sales_relation.insert([3, 20, 2])
Joining tables
Now that we have three DuckDBPyRelation objects representing the three tables, we can perform joins between the tables using the join() method:

result = (
    customers_relation
    .join(sales_relation, condition="customer_id", how="inner")
    .join(products_relation, condition="product_id", how="inner")
)
In this code snippet, the customers table is joined to the sales table, and the result is then joined with the products table.
To see the resultant table after the joins, simply print out the value of result. Figure 4-19 shows the result.
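If you are running this code in a plain Python script rather than in Jupyter Notebook, you can print the relation explicitly with the show() method:

result.show()   # prints the joined table to the console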
Filtering rows
After you have performed a join on the tables, you can use the result to extract the rows that you want using the filter() method:

result.filter('customer_id = 1')
Figure 4-20 shows the products bought by Alice (customer ID 1).
Alternatively, you can use the execute() method and pass in a SQL statement:

# execute a query on the result to fetch and print the joined data
conn.execute('''
    SELECT *
    FROM result
    WHERE customer_id = 1
''').pl()
Aggregating rows
You can also perform aggregation using a DuckDBPyRelation object. Suppose you want to sum up the purchases for all customers. You can use the aggregate() method like this:

result.aggregate(
    'customer_id, MAX(name) AS Name, SUM(qty) AS "Total Qty"',
    'customer_id'
)
The first argument takes in an aggregate expression, while the second argument takes in a group expression. The result is shown in Figure 4-21.
This aggregate() call is identical to the following GROUP BY statement:

SELECT customer_id AS 'Customer ID',
       MAX(name) AS Name,
       SUM(qty) AS 'Total Qty'
FROM result
GROUP BY customer_id
Projecting columns
Using the DuckDBPyRelation object, you can select specific columns to display using the project() method:

result.project('name, qty, product_name')
This statement displays the three columns (name, qty, and product_name) in the result (see Figure 4-22).
Limiting rows
To limit the number of rows returned, use the limit() method:

result.limit(3)
Figure 4-23 shows the first three rows in the result object.
If you want to start with the third row and display the next three rows, specify the number of rows to display, followed by the offset:

result.limit(3, 2)   # display 3 rows, starting at offset 2 (the third row)
The result is shown in Figure 4-24.
Summary
In this chapter, you have learned about the Polars DataFrame library, including the basics of extracting rows and columns, followed by an explanation of how lazy evaluation works in Polars. More importantly, you also learned how DuckDB and Polars can be used together to query DataFrames. Utilizing both libraries gives you the best of both worlds—you can manipulate your data using methods that you are already familiar with and you can use a familiar querying language (SQL) to query an efficient DataFrame.
In the next chapter, you’ll learn how to perform exploratory data analysis with DuckDB using a real-world dataset!