Chapter 1. Getting Started with DuckDB
When it comes to data analytics, pandas is often the go-to library for many developers. Recently, Polars has emerged as a faster and more efficient alternative for handling DataFrames. However, despite the popularity of these libraries, SQL (Structured Query Language) remains the most widely recognized and used language among developers. If your data is stored in a database that supports SQL, using SQL to query and manipulate that data is often the most intuitive and effective approach.
While Python has become the dominant language in data science—particularly for working with data in tabular formats through DataFrame objects—SQL continues to be the universal language of data. Given that most developers are already comfortable with SQL, wouldn’t it be more efficient to use SQL directly for data manipulation?
This is where DuckDB shines. DuckDB was initially conceptualized in 2018 as an OLAP (online analytical processing) database optimized for fast analytical queries. Its aim was to bridge the gap between fully-fledged database systems and the simplicity of embedded DBs like SQLite, but with a focus on analytical rather than transactional workloads. The first stable release of DuckDB was in 2019, and its ease of integration with Python and R made it a very popular choice among the data science and analytics communities. While DuckDB is open source, DuckDB Labs was founded in 2021 to provide commercial support and further development. To bring DuckDB to the cloud, MotherDuck was built around DuckDB, enabling users to access it as a SaaS (software as a service). With MotherDuck, developers can now use DuckDB in a distributed and managed environment, making it much easier to scale for larger datasets and collaborative use cases (more on this in Chapter 9).
In this chapter, we’ll dive into what DuckDB is, why it’s a powerful tool for data analytics, and how you can harness its capabilities to streamline your data analysis tasks. DuckDB offers the performance and flexibility of SQL right within your Python environment, making it an invaluable tool for any data scientist or analyst.
Introduction to DuckDB
DuckDB is a relational database management system (RDBMS) that supports SQL and is specifically engineered for OLAP, making it ideal for data analytics tasks.
Unlike traditional database systems that require a separate server installation and setup, DuckDB operates entirely in-process, so there is no separate server to install, configure, or maintain. One of the most compelling features of DuckDB is its ability to run SQL queries directly on pandas data without the need for importing or duplicating the data. This seamless integration with pandas makes DuckDB an exceptionally powerful tool for data scientists and analysts who are already familiar with the pandas ecosystem.
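As a small taste of what this looks like, here is a minimal sketch (the DataFrame and its columns are made up for illustration) that uses duckdb.sql() to query a pandas DataFrame in place:

import duckdb
import pandas as pd

# a small, made-up DataFrame
df = pd.DataFrame({'city': ['Oslo', 'Lima', 'Hanoi'],
                   'population_m': [0.7, 10.9, 8.1]})

# DuckDB can refer to local pandas DataFrames by name inside SQL queries
result = duckdb.sql("SELECT city FROM df WHERE population_m > 5").df()
print(result)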
Moreover, DuckDB is built with vectorized data processing, significantly boosting its efficiency by processing data in CPU-friendly chunks within a single machine. This contrasts with big data frameworks like Spark or Flink, which distribute data and computation across multiple nodes to achieve scalability through parallelism in large clusters.
Additionally, instead of using the traditional row-based storage format found in databases like MySQL and SQLite, DuckDB employs a columnar storage format. This columnar structure is key to its high performance—particularly for large-scale analytical queries—enabling DuckDB to excel in scenarios where speed and efficiency are critical.
Why Use DuckDB?
Today, your datasets typically come from one or more of the following sources:
- CSV (comma-separated values) files
- Excel spreadsheets
- XML files
- JSON files
- Parquet files
- Databases
If you want to use SQL in an ELT (extract, load, transform) process, you’d typically first load the dataset (such as a CSV file) into a database server. From there, you would load the data into a pandas DataFrame through an application (such as one written in Python) using SQL (see Figure 1-1).
Note
ELT is a data integration process used in data pipelines to move and prepare data for analysis. Its three main steps are:
- Extract data from multiple sources such as databases, APIs, or flat files.
- Load the extracted data directly into a target system, such as a data warehouse or data lake.
- Transform the data, such as by filtering, aggregating, cleaning, etc.
DuckDB eliminates the need to load the dataset into a database server, allowing you to directly manipulate the dataset using SQL. This streamlined process simplifies data manipulation and analysis, enabling you to work more efficiently with your data (see Figure 1-2).
Once the pandas DataFrame is loaded, you can use DuckDB and SQL to further slice and dice the data. This allows for powerful and flexible data manipulation directly within your Python environment, leveraging the strengths of SQL without the overhead of a separate database server (see Figure 1-3).
In the following sections, you will learn about the various features of DuckDB and what makes it so powerful.
High-Performance Analytical Queries
One of DuckDB’s strengths lies in its ability to execute fast analytical queries, making it a powerful tool for data-intensive tasks. This performance is driven by several key design features:
- Columnar storage format
Unlike traditional databases and file systems that store data in a row-based format (in which all fields of a row are stored together), DuckDB uses a columnar storage format. In a columnar format, data is stored column by column, rather than row by row. This design is particularly beneficial for analytical workloads, where queries often require reading and analyzing a small subset of columns over many rows (e.g., summing or filtering one or two columns across a large dataset).
By reading only the necessary columns from disk, DuckDB significantly reduces the amount of data that needs to be transferred into memory, speeding up query execution. For example, if a query needs data from only two columns out of ten, DuckDB can ignore the rest, whereas a row-based database would need to load all columns of every row.
- Vectorized execution engine
DuckDB processes data in vectors, operating with chunks of rows rather than processing one row at a time. This technique, known as vectorized execution, allows for more efficient use of the CPU. By working on multiple rows in one go, DuckDB minimizes the overhead that comes with handling data row-by-row, such as memory access and instruction dispatch, which can slow down processing.
Additionally, vectorized execution makes better use of the CPU cache, reducing the frequency of cache misses (when the CPU has to access slower memory). This design optimizes the use of modern hardware, leading to faster execution times, especially for complex analytical queries.
- Efficient memory usage
DuckDB is designed to work directly on in-memory data structures, meaning that it doesn’t need to create unnecessary copies of data that could potentially slow down operations. This allows DuckDB to handle large datasets without requiring excessive amounts of memory, and it manages memory intelligently to prevent bottlenecks during query execution.
DuckDB’s ability to process data in chunks also plays a role here, as it can operate on data that’s too large to fit into memory all at once by processing it piece-by-piece, further optimizing resource usage.
- Parallel execution
Modern CPUs typically have multiple cores, allowing them to perform multiple operations at the same time. DuckDB takes full advantage of this by running queries in parallel across different CPU cores. This parallel execution allows it to process large datasets more quickly, as parts of the query can be run simultaneously on different portions of the data.
For example, if you are performing an aggregation or a join across a large dataset, DuckDB can break this task into smaller chunks and process them concurrently, leveraging all available processing power to complete the task more quickly.
- Late materialization
DuckDB uses a technique called late materialization, where data is only fetched or processed when absolutely necessary. In traditional databases, materializing data (i.e., fetching and loading full rows into memory) is done early in query execution, even if only a subset of columns is needed for the final result. DuckDB, however, postpones this materialization step as much as possible, working with metadata (e.g., column indices) rather than actual row data until it needs to materialize only the specific columns required for the query result.
This approach minimizes unnecessary data movement and processing, leading to substantial performance improvements, especially for complex queries that involve filtering or joining large datasets.
- Optimized query planner
DuckDB features an optimized query planner that analyzes and restructures queries before they are executed. The query planner’s job is to find the most efficient way to execute a query, especially for operations that are typically resource-intensive, such as joins, aggregations, and filtering operations.
By reorganizing the query plan and applying advanced optimization techniques like predicate pushdown (pushing filters as close as possible to the data source) and join reordering (choosing the most efficient order in which to join tables), DuckDB reduces the computational load, making query execution faster and more efficient. You can inspect the plan DuckDB chooses with the EXPLAIN statement; see the short sketch after this list.
- Portability
One of DuckDB’s standout features is its portability. Unlike many traditional database systems that require complex server setups or external dependencies, DuckDB is an in-process database, meaning it runs directly within the application without needing a separate server. This makes it highly portable, as it can be embedded into a wide range of environments, from local applications to data science notebooks, without any special configuration.
DuckDB’s portability is particularly beneficial for data scientists and developers who want to analyze data on their own machines without relying on heavy infrastructure. It can be embedded in Python, R, or even inside other applications with minimal effort. Additionally, DuckDB’s small footprint and ability to work seamlessly with various file formats like CSV and Parquet mean it can be used across different platforms (Windows, Linux, macOS) with ease, allowing users to take their analytical workflows anywhere.
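To make the query planner item above concrete: prefixing a query with EXPLAIN returns the optimized physical plan DuckDB intends to execute, without actually running the query. The following is a minimal sketch; the people.parquet file and its columns are hypothetical stand-ins, but with a real Parquet file you would see the filter pushed down into the scan operator:

import duckdb

conn = duckdb.connect()

# EXPLAIN returns the optimized physical plan instead of running the query.
# 'people.parquet' and its columns are hypothetical, for illustration only.
plan = conn.execute('''
    EXPLAIN
    SELECT name, age
    FROM 'people.parquet'
    WHERE age > 30
''').fetchall()

# each row is a (key, value) pair; the value holds the rendered plan tree
for key, value in plan:
    print(value)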
The powerful performance features of DuckDB are complemented by its versatility and ease of use across multiple programming environments, which you will learn about in the next section.
Versatile Integration and Ease of Use Across Multiple Programming Languages
DuckDB offers full support for standard SQL syntax, including SELECT, INSERT, UPDATE, and DELETE statements. It integrates smoothly with various data formats, such as CSV, Parquet, JSON, and pandas DataFrames. By running directly within the same process as your application—whether it’s a Jupyter Notebook or a Python script—DuckDB eliminates the need for complex setups or network communications.
Its ease of handling basic operations makes DuckDB an excellent choice for both beginners and experienced users. Whether you’re executing simple queries, loading data, or performing quick transformations, DuckDB provides a fast, efficient, and user-friendly experience that enhances productivity and supports a broad range of data processing tasks.
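For instance, here is a minimal sketch (the file names are placeholders) showing how the same SQL interface reads different formats directly:

import duckdb

conn = duckdb.connect()

# the file names below are placeholders; point them at your own data
csv_result = conn.execute("SELECT * FROM read_csv_auto('data.csv')").df()
parquet_result = conn.execute("SELECT * FROM read_parquet('data.parquet')").df()
json_result = conn.execute("SELECT * FROM read_json_auto('data.json')").df()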
DuckDB is also designed to work seamlessly with several programming languages, making it a versatile option for data analysis and processing in various environments. Here are some of the languages that DuckDB supports:
- Python
- R
- C/C++
- Julia
- Java
- Go
- Node.js
- Rust
In addition to its broad language support, DuckDB is open source, which greatly enhances its appeal.
Open Source
DuckDB is open source, making it freely accessible to anyone who wants to use, modify, or contribute to its development. As an open source project, DuckDB’s source code is available to the public, allowing developers and data professionals to inspect, enhance, and tailor the software to their specific needs. This brings several key advantages:
- Transparency
Users can view exactly how DuckDB is implemented, fostering trust and confidence among developers through its open and transparent design.
- Rapid iteration and updates
The open source nature of DuckDB enables quick iteration and the continuous addition of new features. The community can propose, test, and implement improvements swiftly, ensuring the software stays at the forefront of technological advancements.
- Cost-effective
DuckDB is completely free, with no licensing fees, allowing users to deploy it in any environment without concerns about cost.
- Strong ecosystem
The open source model nurtures the growth of a vibrant ecosystem of tools, libraries, and extensions that enhance DuckDB’s functionality. Users gain access to a wealth of community-contributed resources, including documentation, tutorials, and plug-ins.
Now that you have seen the features that make DuckDB so useful and powerful, it is time to dive in and see how it works.
A Quick Look at DuckDB
In the following sections, we will walk through a few examples of how to use DuckDB to:
- Create a database
- Create a table
- Insert records into the table
- Retrieve records from the table
- Perform aggregation on the records
- Perform joins on multiple tables
- Load data directly from pandas DataFrames
Note
For this book, we’ll be using Jupyter Notebook for coding, unless stated otherwise. Jupyter Notebook is available for Windows, macOS, and Linux.
To use DuckDB, you first need to install the duckdb package. You can do so via the pip command in Jupyter Notebook:
!pip install duckdb
To create a DuckDB database, you can use the connect() function of the duckdb package:
import duckdb

# create a connection to a new DuckDB database file
conn = duckdb.connect('my_duckdb_database.db')
This creates a persistent database file named my_duckdb_database.db in the current directory where you launched your Jupyter Notebook.
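Because the database lives in a file on disk, you can close the connection and reconnect to the same file later. A minimal sketch:

# close the connection; my_duckdb_database.db remains on disk
conn.close()

# reconnect later; any tables you create in this file persist between sessions
conn = duckdb.connect('my_duckdb_database.db')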
Alternatively, you can create an in-memory database by passing the :memory: argument to the connect() function:
# alternatively, to create an in-memory database:
conn = duckdb.connect(':memory:')
Warning
Any changes you make to an in-memory database will be lost when you close the connection or shut down the process. To retain data between sessions, use a persistent DuckDB database file instead of an in-memory database.
In the next section, you will learn how to create a table within the database that you have just created.
Loading Data into DuckDB
Once you have created the database, you can create a table by passing a CREATE TABLE SQL statement to the connection’s execute() method:
# create a table
conn.execute('''
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name VARCHAR,
        age INTEGER,
        department VARCHAR
    )
''')
To verify that the table was created correctly, use the SHOW TABLES statement:
conn.execute('SHOW TABLES').df()
Because the execute() method runs the SQL query and returns a DuckDB result set, you need to convert it to a DataFrame (with the df() method) so that you can view the result.
Figure 1-4 shows the table name returned as a DataFrame.
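The df() method is just one way to pull results back into Python. If you don’t need a DataFrame, the result object also exposes methods such as fetchall() and fetchone(), which return plain Python tuples. A quick sketch:

# fetch all rows as a list of tuples
rows = conn.execute('SHOW TABLES').fetchall()
print(rows)

# fetch a single row (or None if the result is empty)
first = conn.execute('SHOW TABLES').fetchone()
print(first)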
Now that the table is created, it is time to insert a few records into the table. The next section shows you how.
Inserting a Record
Now let’s insert a few rows into the table using the INSERT INTO statement:
# insert data into the table
conn.execute('''
    INSERT INTO employees VALUES
        (1, 'Alice', 30, 'HR'),
        (2, 'Bob', 35, 'Engineering'),
        (3, 'Charlie', 28, 'Marketing'),
        (4, 'David', 40, 'Engineering')
''')
In this statement, I added four rows to the employees table. To verify that the records were correctly inserted into the table, we’ll perform a query, which you will see demonstrated in the next section.
Querying a Table
Now that the records are inserted into the table, we can retrieve them by using the SELECT statement:
conn.execute('''
    SELECT * FROM employees
''').df()
Figure 1-5 shows the result.
Performing Aggregation
A common operation performed on a table is aggregation, which involves summarizing data by grouping it based on one or more columns and then applying functions such as COUNT, SUM, AVG, MIN, and MAX. Aggregation is essential for extracting insights, as it condenses large datasets into meaningful summaries, enabling more straightforward analysis.
Let’s perform some aggregation on the records in our table. First, let’s count the number of employees in each department using the COUNT function and the GROUP BY clause in SQL:
conn.execute('''
    SELECT
        department,
        COUNT(*) AS employee_count
    FROM
        employees
    GROUP BY
        department
''').df()
Figure 1-6 shows the result returned as a DataFrame.
You can calculate the average age of employees in the company using the AVG function:
conn.execute('''
    SELECT
        AVG(age) AS average_age
    FROM
        employees
''').df()
Figure 1-7 shows the result as a DataFrame.
If you want to find the age of the oldest employee in each department, use the MAX function in SQL:
conn.execute('''
    SELECT
        department,
        MAX(age) AS oldest_age
    FROM
        employees
    GROUP BY
        department
''').df()
Figure 1-8 shows the result as a DataFrame.
Finally, you can find the average age of employees in each department:
conn.execute('''
    SELECT
        department,
        AVG(age) AS average_age
    FROM
        employees
    GROUP BY
        department
''').df()
Figure 1-9 shows the result.
Now that you have seen how to perform aggregation on your table, let’s see in the next section how to perform joins, another common operation involving multiple tables in a database.
Joining Tables
In addition to working with single tables, DuckDB enables you to perform joins on multiple tables. Let’s illustrate this by opening a new in-memory database, creating two tables, and populating them with some records:
# create a new in-memory database
conn = duckdb.connect()

# create first table - orders
conn.execute('''
    CREATE TABLE orders (
        order_id INTEGER,
        customer_id INTEGER,
        amount FLOAT)
''')

# add some records to the orders table
conn.execute('''
    INSERT INTO orders
    VALUES (1, 1, 100.0),
           (2, 2, 200.0),
           (3, 1, 150.0)
''')

# create second table - customers
conn.execute('''
    CREATE TABLE customers (
        customer_id INTEGER,
        name VARCHAR)
''')

conn.execute('''
    INSERT INTO customers
    VALUES (1, 'Alice'),
           (2, 'Bob')
''')
Let’s display the contents of the two tables we just created:
display(conn.execute('''
    SELECT * FROM orders
''').df())

display(conn.execute('''
    SELECT * FROM customers
''').df())
Figure 1-10 shows the contents of the two tables.
Suppose you want a list of amounts spent by each customer. You can achieve this by joining the orders and customers tables based on the customer_id field in each table:
# join the two tables
conn.execute('''
    SELECT
        customers.customer_id,
        customers.name,
        orders.amount
    FROM
        orders
    JOIN
        customers
    ON
        orders.customer_id = customers.customer_id
    ORDER BY
        customers.customer_id
''').df()
The result is shown as a DataFrame (see Figure 1-11).
Suppose you now want to know the total amount spent by each customer. You can achieve this by aggregating the amount spent using the SUM function in SQL. In addition, you need to use the GROUP BY clause to aggregate the total amount spent per customer:
# join the two tables and aggregate
conn.execute('''
    SELECT
        customers.customer_id,
        customers.name,
        SUM(orders.amount) AS total_spent
    FROM
        orders
    JOIN
        customers
    ON
        orders.customer_id = customers.customer_id
    GROUP BY
        customers.customer_id,
        customers.name
    ORDER BY
        customers.customer_id
''').df()
The result is shown as a DataFrame (see Figure 1-12).
The next section will show how to use DuckDB to directly manipulate pandas DataFrames.
Reading Data from pandas
All the examples up to this point have involved creating the database directly in DuckDB. What if your data is already in a pandas DataFrame? Well, DuckDB can work directly with the pandas DataFrames that you already have in memory.
Suppose you have the following DataFrames:
import pandas as pd

# Employee DataFrame
employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [30, 35, 28, 40],
    'department': ['HR', 'Engineering', 'Marketing', 'Engineering']
})

# Sales DataFrame
sales = pd.DataFrame({
    'sale_id': [101, 102, 103, 104, 105],
    'employee_id': [1, 2, 1, 3, 4],
    'sale_amount': [200, 500, 150, 300, 700],
    'sale_date': ['2023-01-01', '2023-01-03', '2023-01-04',
                  '2023-01-05', '2023-01-07']
})

display(employees)
display(sales)
Figure 1-13 shows the contents of the employees and sales DataFrames.
Suppose you want to find the total sales for each department, as well as find out the average sales per employee for each department. To do this, you’ll need to join the two DataFrames and perform some aggregations. Most importantly, in DuckDB you simply refer to the DataFrames by their names, as this code snippet shows:
# create an in-memory database using DuckDB
conn = duckdb.connect()

# join the DataFrames, group by department, and perform aggregations
query = '''
    SELECT
        e.department,
        SUM(s.sale_amount) AS total_sales,
        AVG(s.sale_amount) AS average_sale_per_employee,
        COUNT(DISTINCT e.employee_id) AS number_of_employees
    FROM
        employees e
    LEFT JOIN
        sales s ON e.employee_id = s.employee_id
    GROUP BY
        e.department
'''
conn.execute(query).df()
The result is shown in Figure 1-14.
How about finding the top performers in the company and listing their departments? The following code snippet shows how this is done:
query = '''
    SELECT
        e.department,
        e.name AS top_employee,
        MAX(s.sale_amount) AS top_sale_amount
    FROM
        employees e
    LEFT JOIN
        sales s ON e.employee_id = s.employee_id
    GROUP BY
        e.department,
        e.name
    ORDER BY
        top_sale_amount DESC
'''
conn.execute(query).df()
Figure 1-15 shows the result of the query.
As demonstrated, DuckDB enables direct reference to pandas DataFrames within SQL statements.
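By default, DuckDB finds DataFrames such as employees and sales by looking them up in the surrounding Python scope. If you prefer to be explicit, or if the DataFrame is not a local variable, you can register it under a view name of your choosing. A minimal sketch (the view name sales_view is arbitrary):

# register the sales DataFrame under an explicit view name
conn.register('sales_view', sales)

conn.execute('''
    SELECT employee_id, SUM(sale_amount) AS total_sales
    FROM sales_view
    GROUP BY employee_id
''').df()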
Why DuckDB Is More Efficient
Earlier in this chapter, we mentioned that DuckDB is both efficient and high-performing. When working with CSV files, for example, it does not need to load the entire CSV file into memory before it can process it. Rather, DuckDB can read and process data from the file on the fly. To see this in action, let’s use the 2015 Flight Delays and Cancellations dataset.
Note
We’ll use this dataset more in Chapter 2, where you learn how to download the various CSV files in the dataset.
For this example, we will use the flights.csv file in the dataset, which contains the details of all the flights in the US for 2015. This file is a good candidate for evaluating the efficiency of DuckDB, as it is relatively large (nearly 600 MB) and has more than 5.8 million rows of data. It has the following fields: YEAR, MONTH, DAY, DAY_OF_WEEK, AIRLINE, FLIGHT_NUMBER, TAIL_NUMBER, ORIGIN_AIRPORT, DESTINATION_AIRPORT, SCHEDULED_DEPARTURE, DEPARTURE_TIME, DEPARTURE_DELAY, TAXI_OUT, WHEELS_OFF, SCHEDULED_TIME, ELAPSED_TIME, AIR_TIME, DISTANCE, WHEELS_ON, TAXI_IN, SCHEDULED_ARRIVAL, ARRIVAL_TIME, ARRIVAL_DELAY, DIVERTED, CANCELLED, CANCELLATION_REASON, AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY, WEATHER_DELAY.
There are two aspects that we will examine in this example:
- The speed of execution of DuckDB
- The memory usage of DuckDB
Execution Speed
Let’s examine the traditional approach of manipulating the CSV file using pandas. First, you need to load the CSV file into a pandas DataFrame:
import pandas as pd

# load the CSV file and time it
%timeit df = pd.read_csv('flights.csv')
This code uses the %timeit magic command in Jupyter Notebook to measure the time it takes to load the CSV file into a DataFrame. On my machine, it took about 7.5 seconds to load the 5.8 million rows of data:
7.46 s ± 568 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tip
The percent symbol (%), when used in Jupyter Notebook, is a prefix to denote a magic command. Magic commands are special commands that provide various functionalities and utilities for working within the Jupyter environment.
Next, we will perform a simple aggregation on the data by calculating the mean arrival delay time for each airline:
df = pd.read_csv('flights.csv')
%timeit df.groupby('AIRLINE')['ARRIVAL_DELAY'].mean().reset_index()
Running the aggregation without the %timeit magic command should yield the result shown in Figure 1-16:
df.groupby('AIRLINE')['ARRIVAL_DELAY'].mean().reset_index()
On average, it took pandas about 186 milliseconds to perform the aggregation:
186 ms ± 8.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In total, the pandas approach (loading plus aggregation) took about 7.6 seconds, most of it spent loading the CSV file.
Let’s now try the aggregation using DuckDB. DuckDB has a function named read_csv_auto() to read the CSV file:
import duckdb

conn = duckdb.connect()

query = '''
    SELECT
        AIRLINE,
        AVG(ARRIVAL_DELAY) AS MEAN_ARRIVAL_DELAY
    FROM
        read_csv_auto('flights.csv')
    GROUP BY
        AIRLINE
    ORDER BY
        AIRLINE;
'''
%timeit df = conn.execute(query).df()
The read_csv_auto() function does not need to load the entire CSV file into memory; rather, it processes the data on the fly, allowing for efficient querying without the memory overhead of loading the entire dataset. This enables DuckDB to handle larger datasets seamlessly, streaming data from disk for analytical operations while keeping memory usage low. Note that this single statement also performs the data aggregation.
These statements took about half a second to complete:
496 ms ± 29.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
From this simple example, you can draw the following observations:
- Using DuckDB dramatically shortens the time needed to perform analytics on your data. This is because DuckDB doesn’t need to spend extra time loading the CSV into memory before it starts to perform the data aggregation. This is useful if you usually perform one-off operations on your CSV file.
- If you need to perform multiple operations on your data, it might be more efficient to load your data into a pandas DataFrame (if you have enough memory on your system to store the data) or into a DuckDB table, as sketched after this list.
- Overall, DuckDB works efficiently on large datasets.
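If you do plan to run many queries against the same file, one option is to materialize the CSV into a DuckDB table once and query that table from then on. The following is a minimal sketch of that pattern, assuming the same flights.csv file and the conn connection from the previous example:

# read the CSV once and materialize it as a DuckDB table
conn.execute('''
    CREATE TABLE flights AS
    SELECT * FROM read_csv_auto('flights.csv')
''')

# subsequent queries run against the table, not the CSV file
conn.execute('''
    SELECT AIRLINE, AVG(ARRIVAL_DELAY) AS MEAN_ARRIVAL_DELAY
    FROM flights
    GROUP BY AIRLINE
''').df()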
Now that we have examined the performance aspect of DuckDB, let’s examine its memory usage.
Memory Usage
To examine the memory usage of DuckDB, let’s first create a function that calculates the memory used by the current process using the psutil package:
import psutil

def memory_usage():
    process = psutil.Process()
    return process.memory_info().rss / (1024 ** 2)  # convert bytes to MB
Let’s measure the memory used by the current process before and after loading the CSV file into a DataFrame:
import pandas as pd

# measure memory before loading the CSV
memory_before = memory_usage()
print(f"Memory used before query: {memory_before:.2f} MB")

# load the CSV file
df = pd.read_csv('flights.csv')

# measure memory after loading the CSV
memory_after = memory_usage()
print(f"Memory used after query: {memory_after:.2f} MB")
You’ll see something like the following:
Memory used before query: 130.64 MB
Memory used after query: 4362.61 MB
Tip
Be sure to restart the kernel in your Jupyter Notebook to get a more accurate view of the memory used by the DataFrame.
The memory used is a whopping 4.2 GB! All of this memory was used just to load the CSV file into a DataFrame in memory. Let’s now compare it with DuckDB, where we don’t have to load the entire CSV file into memory before we can perform processing:
import duckdb

conn = duckdb.connect()

query = '''
    SELECT
        AIRLINE,
        AVG(ARRIVAL_DELAY) AS MEAN_ARRIVAL_DELAY
    FROM
        read_csv_auto('flights.csv')
    GROUP BY
        AIRLINE
    ORDER BY
        AIRLINE;
'''

# measure memory before query execution
memory_before = memory_usage()
print(f"Memory used before query: {memory_before:.2f} MB")

# run the query
df = conn.execute(query).df()

# measure memory after query execution
memory_after = memory_usage()
print(f"Memory used after query: {memory_after:.2f} MB")
The result looks like the following:
Memory used before query: 72.19 MB
Memory used after query: 348.48 MB
As you can see, DuckDB used only about 280 MB of memory, compared to the 4.2 GB used by pandas.
Summary
In this chapter, I introduced DuckDB and provided a quick overview of its key capabilities and features. I began with what DuckDB is and why it stands out in the realm of data management and analytics. Its high-performance analytical queries—combined with versatile integration across multiple programming languages—make DuckDB a powerful tool for various data processing tasks.
I also highlighted DuckDB’s open source nature, which not only makes it cost-effective but also fosters a robust, community-driven ecosystem that continuously enhances its functionality.
Through a quick look at DuckDB, we covered essential operations including loading data, inserting records, querying tables, performing aggregations, and joining tables. We also demonstrated how to seamlessly read data from pandas, showcasing DuckDB’s compatibility with popular data science tools.
Overall, DuckDB offers a unique blend of performance, flexibility, and ease of use, making it an excellent choice for both simple and complex data processing needs. Whether you’re a beginner looking for a straightforward solution or an experienced user seeking a high-performance analytics tool, DuckDB provides a versatile and powerful platform to support your data-driven projects.
In the next chapter, you’ll learn how to use DuckDB to work with various data formats—CSV, Parquet, Excel, and MySQL databases.