Chapter 1. Getting Started with DuckDB
When it comes to data analytics, pandas is often the go-to library for many developers. Recently, Polars has emerged as a faster and more efficient alternative for handling DataFrames. However, despite the popularity of these libraries, SQL (Structured Query Language) remains the most widely recognized and used language among developers. If your data is stored in a database that supports SQL, using SQL to query and manipulate that data is often the most intuitive and effective approach.
While Python has become the dominant language in data science—particularly for working with data in tabular formats through DataFrame objects—SQL continues to be the universal language of data. Given that most developers are already comfortable with SQL, wouldn’t it be more efficient to use SQL directly for data manipulation?
This is where DuckDB shines. DuckDB was initially conceptualized in 2018 as an OLAP (online analytical processing) database optimized for fast analytical queries. Its aim was to bridge the gap between fully-fledged database systems and the simplicity of embedded DBs like SQLite, but with a focus on analytical rather than transactional workloads. The first stable release of DuckDB was in 2019, and its ease of integration with Python and R made it a very popular choice among the data science and analytics communities. While DuckDB is open source, DuckDB Labs was founded in 2021 to provide commercial support and further development. To bring DuckDB to the cloud, MotherDuck was built around DuckDB, enabling users to access it as a SaaS (software as a service). With MotherDuck, developers can now use DuckDB in a distributed and managed environment, making it much easier to scale for larger datasets and collaborative use cases (more on this in Chapter 9).
In this chapter, we’ll dive into what DuckDB is, why it’s a powerful tool for data analytics, and how you can harness its capabilities to streamline your data analysis tasks. DuckDB offers the performance and flexibility of SQL right within your Python environment, making it an invaluable tool for any data scientist or analyst.
Introduction to DuckDB
DuckDB is a relational database management system (RDBMS) that supports SQL and is specifically engineered for OLAP, making it ideal for data analytics tasks.
Unlike traditional database systems that require a separate server installation and setup, DuckDB operates entirely in-process, so there is no separate server to install, configure, or maintain. One of the most compelling features of DuckDB is its ability to run SQL queries directly on pandas data without the need for importing or duplicating the data. This seamless integration with pandas makes DuckDB an exceptionally powerful tool for data scientists and analysts who are already familiar with the pandas ecosystem.
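As a small taste of what this looks like, here is a minimal sketch (the DataFrame and its columns are made up for illustration) that uses duckdb.sql() to query a pandas DataFrame in place:

import duckdb
import pandas as pd

# a small, made-up DataFrame
df = pd.DataFrame({'city': ['Oslo', 'Lima', 'Hanoi'],
                   'population_m': [0.7, 10.9, 8.1]})

# DuckDB can refer to local pandas DataFrames by name inside SQL queries
result = duckdb.sql("SELECT city FROM df WHERE population_m > 5").df()
print(result)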
Moreover, DuckDB is built with vectorized data processing, significantly boosting its efficiency by processing data in CPU-friendly chunks within a single machine. This contrasts with big data frameworks like Spark or Flink, which distribute data and computation across multiple nodes to achieve scalability through parallelism in large clusters.
Additionally, instead of using the traditional row-based storage format found in databases like MySQL and SQLite, DuckDB employs a columnar storage format. This columnar structure is key to its high performance—particularly for large-scale analytical queries—enabling DuckDB to excel in scenarios where speed and efficiency are critical.
Why Use DuckDB?
Today, your datasets typically come from one or more of the following sources:
- CSV (comma-separated values) files
- Excel spreadsheets
- XML files
- JSON files
- Parquet files
- Databases
If you want to use SQL in an ELT (extract, load, transform) process, you’d typically first load the dataset (such as a CSV file) into a database server. From there, you would load the data into a pandas DataFrame through an application (such as one written in Python) using SQL (see Figure 1-1).
Note
ELT is a data integration process used in data pipelines to move and prepare data for analysis. Its three main steps are:
- Extract data from multiple sources such as databases, APIs, or flat files.
- Load the extracted data directly into a target system, such as a data warehouse or data lake.
- Transform the data, such as by filtering, aggregating, cleaning, etc.
DuckDB eliminates the need to load the dataset into a database server, allowing you to directly manipulate the dataset using SQL. This streamlined process simplifies data manipulation and analysis, enabling you to work more efficiently with your data (see Figure 1-2).
Once the pandas DataFrame is loaded, you can use DuckDB and SQL to further slice and dice the data. This allows for powerful and flexible data manipulation directly within your Python environment, leveraging the strengths of SQL without the overhead of a separate database server (see Figure 1-3).
In the following sections, you will learn about the various features of DuckDB and what makes it so powerful.
High-Performance Analytical Queries
One of DuckDB’s strengths lies in its ability to execute fast analytical queries, making it a powerful tool for data-intensive tasks. This performance is driven by several key design features:
- Columnar storage format
Unlike traditional databases and file systems that store data in a row-based format (in which all fields of a row are stored together), DuckDB uses a columnar storage format. In a columnar format, data is stored column by column, rather than row by row. This design is particularly beneficial for analytical workloads, where queries often require reading and analyzing a small subset of columns over many rows (e.g., summing or filtering one or two columns across a large dataset).
By reading only the necessary columns from disk, DuckDB significantly reduces the amount of data that needs to be transferred into memory, speeding up query execution. For example, if a query needs data from only two columns out of ten, DuckDB can ignore the rest, whereas a row-based database would need to load all columns of every row.
- Vectorized execution engine
DuckDB processes data in vectors, operating with chunks of rows rather than processing one row at a time. This technique, known as vectorized execution, allows for more efficient use of the CPU. By working on multiple rows in one go, DuckDB minimizes the overhead that comes with handling data row-by-row, such as memory access and instruction dispatch, which can slow down processing.
Additionally, vectorized execution makes better use of the CPU cache, reducing the frequency of cache misses (when the CPU has to access slower memory). This design optimizes the use of modern hardware, leading to faster execution times, especially for complex analytical queries.
- Efficient memory usage
DuckDB is designed to work directly on in-memory data structures, meaning that it doesn’t need to create unnecessary copies of data that could potentially slow down operations. This allows DuckDB to handle large datasets without requiring excessive amounts of memory, and it manages memory intelligently to prevent bottlenecks during query execution.
DuckDB’s ability to process data in chunks also plays a role here, as it can operate on data that’s too large to fit into memory all at once by processing it piece-by-piece, further optimizing resource usage.
- Parallel execution
Modern CPUs typically have multiple cores, allowing them to perform multiple operations at the same time. DuckDB takes full advantage of this by running queries in parallel across different CPU cores. This parallel execution allows it to process large datasets more quickly, as parts of the query can be run simultaneously on different portions of the data.
For example, if you are performing an aggregation or a join across a large dataset, DuckDB can break this task into smaller chunks and process them concurrently, leveraging all available processing power to complete the task more quickly.
- Late materialization
DuckDB uses a technique called late materialization, where data is only fetched or processed when absolutely necessary. In traditional databases, materializing data (i.e., fetching and loading full rows into memory) is done early in query execution, even if only a subset of columns is needed for the final result. DuckDB, however, postpones this materialization step as much as possible, working with metadata (e.g., column indices) rather than actual row data until it needs to materialize only the specific columns required for the query result.
This approach minimizes unnecessary data movement and processing, leading to substantial performance improvements, especially for complex queries that involve filtering or joining large datasets.
- Optimized query planner
DuckDB features an optimized query planner that analyzes and restructures queries before they are executed. The query planner’s job is to find the most efficient way to execute a query, especially for operations that are typically resource-intensive, such as joins, aggregations, and filtering operations.
By reorganizing the query plan and applying advanced optimization techniques like predicate pushdown (pushing filters as close as possible to the data source) and join reordering (choosing the most efficient order in which to join tables), DuckDB reduces the computational load, making query execution faster and more efficient. You can inspect the plan DuckDB chooses with the EXPLAIN statement; see the short sketch after this list.
- Portability
One of DuckDB’s standout features is its portability. Unlike many traditional database systems that require complex server setups or external dependencies, DuckDB is an in-process database, meaning it runs directly within the application without needing a separate server. This makes it highly portable, as it can be embedded into a wide range of environments, from local applications to data science notebooks, without any special configuration.
DuckDB’s portability is particularly beneficial for data scientists and developers who want to analyze data on their own machines without relying on heavy infrastructure. It can be embedded in Python, R, or even inside other applications with minimal effort. Additionally, DuckDB’s small footprint and ability to work seamlessly with various file formats like CSV and Parquet mean it can be used across different platforms (Windows, Linux, macOS) with ease, allowing users to take their analytical workflows anywhere.
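To make the query planner item above concrete: prefixing a query with EXPLAIN returns the optimized physical plan DuckDB intends to execute, without actually running the query. The following is a minimal sketch; the people.parquet file and its columns are hypothetical stand-ins, but with a real Parquet file you would see the filter pushed down into the scan operator:

import duckdb

conn = duckdb.connect()

# EXPLAIN returns the optimized physical plan instead of running the query.
# 'people.parquet' and its columns are hypothetical, for illustration only.
plan = conn.execute('''
    EXPLAIN
    SELECT name, age
    FROM 'people.parquet'
    WHERE age > 30
''').fetchall()

# each row is a (key, value) pair; the value holds the rendered plan tree
for key, value in plan:
    print(value)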
The powerful performance features of DuckDB are complemented by its versatility and ease of use across multiple programming environments, which you will learn about in the next section.
Versatile Integration and Ease of Use Across Multiple Programming Languages
DuckDB offers full support for standard SQL syntax, including SELECT, INSERT, UPDATE, and DELETE statements. It integrates smoothly with various data formats, such as CSV, Parquet, JSON, and pandas DataFrames. By running directly within the same process as your application—whether it’s a Jupyter Notebook or a Python script—DuckDB eliminates the need for complex setups or network communications.
Its ease of handling basic operations makes DuckDB an excellent choice for both beginners and experienced users. Whether you’re executing simple queries, loading data, or performing quick transformations, DuckDB provides a fast, efficient, and user-friendly experience that enhances productivity and supports a broad range of data processing tasks.
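For instance, here is a minimal sketch (the file names are placeholders) showing how the same SQL interface reads different formats directly:

import duckdb

conn = duckdb.connect()

# the file names below are placeholders; point them at your own data
csv_result = conn.execute("SELECT * FROM read_csv_auto('data.csv')").df()
parquet_result = conn.execute("SELECT * FROM read_parquet('data.parquet')").df()
json_result = conn.execute("SELECT * FROM read_json_auto('data.json')").df()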
DuckDB is also designed to work seamlessly with several programming languages, making it a versatile option for data analysis and processing in various environments. Here are some of the languages that DuckDB supports:
- Python
- R
- C/C++
- Julia
- Java
- Go
- Node.js
- Rust
In addition to its broad language support, DuckDB is open source, which greatly enhances its appeal.
Open Source
DuckDB is open source, making it freely accessible to anyone who wants to use, modify, or contribute to its development. As an open source project, DuckDB’s source code is available to the public, allowing developers and data professionals to inspect, enhance, and tailor the software to their specific needs. This brings several key advantages:
- Transparency
Users can view exactly how DuckDB is implemented, fostering trust and confidence among developers through its open and transparent design.
- Rapid iteration and updates
The open source nature of DuckDB enables quick iteration and the continuous addition of new features. The community can propose, test, and implement improvements swiftly, ensuring the software stays at the forefront of technological advancements.
- Cost-effective
DuckDB is completely free, with no licensing fees, allowing users to deploy it in any environment without concerns about cost.
- Strong ecosystem
The open source model nurtures the growth of a vibrant ecosystem of tools, libraries, and extensions that enhance DuckDB’s functionality. Users gain access to a wealth of community-contributed resources, including documentation, tutorials, and plug-ins.
Now that you have seen the features that make DuckDB so useful and powerful, it is time to dive in and see how it works.
A Quick Look at DuckDB
In the following sections, we will walk through a few examples of how to use DuckDB to:
- Create a database
- Create a table
- Insert records into the table
- Retrieve records from the table
- Perform aggregation on the records
- Perform joins on multiple tables
- Load data directly from pandas DataFrames
Note
For this book, we’ll be using Jupyter Notebook for coding, unless stated otherwise. Jupyter Notebook is available for Windows, macOS, and Linux.
To use DuckDB, you first need to install the duckdb package. You can do so via the pip command in Jupyter Notebook:
!pip install duckdb
To create a DuckDB database, you can use the connect() function of the duckdb package:
import duckdb

# create a connection to a new DuckDB database file
conn = duckdb.connect('my_duckdb_database.db')
This creates a persistent database file named my_duckdb_database.db in the current directory where you launched your Jupyter Notebook.
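Because the database lives in a file on disk, you can close the connection and reconnect to the same file later. A minimal sketch:

# close the connection; my_duckdb_database.db remains on disk
conn.close()

# reconnect later; any tables you create in this file persist between sessions
conn = duckdb.connect('my_duckdb_database.db')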
Alternatively, you can create an in-memory database by passing the :memory: argument to the connect() function:
# alternatively, to create an in-memory database:
conn = duckdb.connect(':memory:')
Warning
Any changes you make to an in-memory database will be lost when you close the connection or shut down the process. To retain data between sessions, use a persistent DuckDB database file instead of an in-memory database.
In the next section, you will learn how to create a table within the database that you have just created.
Loading Data into DuckDB
Once you have created the database, you can create a table by passing a CREATE TABLE SQL statement to the connection’s execute() method:
# create a table
conn.execute('''
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name VARCHAR,
        age INTEGER,
        department VARCHAR
    )
''')
To verify that the table was created correctly, use the SHOW TABLES statement:
conn.execute('SHOW TABLES').df()
Because the execute() method runs the SQL query and returns a DuckDB result set, you need to convert it to a DataFrame (with the df() method) so that you can view the result.
Figure 1-4 shows the table name returned as a DataFrame.
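The df() method is just one way to pull results back into Python. If you don’t need a DataFrame, the result object also exposes methods such as fetchall() and fetchone(), which return plain Python tuples. A quick sketch:

# fetch all rows as a list of tuples
rows = conn.execute('SHOW TABLES').fetchall()
print(rows)

# fetch a single row (or None if the result is empty)
first = conn.execute('SHOW TABLES').fetchone()
print(first)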
Now that the table is created, it is time to insert a few records into the table. The next section shows you how.
Inserting a Record
Now let’s insert a few rows into the table using the INSERT INTO statement:
# insert data into the table
conn.execute('''
    INSERT INTO employees VALUES
        (1, 'Alice', 30, 'HR'),
        (2, 'Bob', 35, 'Engineering'),
        (3, 'Charlie', 28, 'Marketing'),
        (4, 'David', 40, 'Engineering')
''')
In this statement, I added four rows to the employees table. To verify that the records were correctly inserted into the table, we’ll perform a query, which you will see demonstrated in the next section.
Querying a Table
Now that the records are inserted into the table, we can retrieve them by using the SELECT statement:
conn.execute('''
    SELECT * FROM employees
''').df()
Figure 1-5 shows the result.
Performing Aggregation
A common operation performed on a table is aggregation, which involves summarizing data by grouping it based on one or more columns and then applying functions such as COUNT, SUM, AVG, MIN, and MAX. Aggregation is essential for extracting insights, as it condenses large datasets into meaningful summaries, enabling more straightforward analysis.
Let’s perform some aggregation on the records in our table. First, let’s count the number of employees in each department using the COUNT function and the GROUP BY clause in SQL:
conn.execute('''
    SELECT
        department,
        COUNT(*) AS employee_count
    FROM
        employees
    GROUP BY
        department
''').df()
Figure 1-6 shows the result returned as a DataFrame.
You can calculate the average age of employees in the company using the AVG function:
conn.execute('''
    SELECT
        AVG(age) AS average_age
    FROM
        employees
''').df()
Figure 1-7 shows the result as a DataFrame.
If you want to find the age of the oldest employee in each department, use the MAX function in SQL:
conn.execute('''
    SELECT
        department,
        MAX(age) AS oldest_age
    FROM
        employees
    GROUP BY
        department
''').df()
Figure 1-8 shows the result as a DataFrame.
Finally, you can find the average age of employees in each department:
conn.execute('''
    SELECT
        department,
        AVG(age) AS average_age
    FROM
        employees
    GROUP BY
        department
''').df()
Figure 1-9 shows the result.
Now that you have seen how to perform aggregation on your table, let’s see in the next section how to perform joins, another common operation involving multiple tables in a database.
Joining Tables
In addition to working with single tables, DuckDB enables you to perform joins on multiple tables. Let’s illustrate this by opening a new in-memory database, creating two tables, and populating them with some records:
# create a new in-memory database
conn = duckdb.connect()

# create first table - orders
conn.execute('''
    CREATE TABLE orders (
        order_id INTEGER,
        customer_id INTEGER,
        amount FLOAT)
''')

# add some records to the orders table
conn.execute('''
    INSERT INTO orders
    VALUES (1, 1, 100.0),
           (2, 2, 200.0),
           (3, 1, 150.0)
''')

# create second table - customers
conn.execute('''
    CREATE TABLE customers (
        customer_id INTEGER,
        name VARCHAR)
''')

conn.execute('''
    INSERT INTO customers
    VALUES (1, 'Alice'),
           (2, 'Bob')
''')
Let’s display the contents of the two tables we just created:
display(conn.execute('''
    SELECT * FROM orders
''').df())

display(conn.execute('''
    SELECT * FROM customers
''').df())
Figure 1-10 shows the contents of the two tables.
Suppose you want a list of amounts spent by each customer. You can achieve this by joining the orders and customers tables based on the customer_id field in each table:
# join the two tables
conn.execute('''
    SELECT
        customers.customer_id,
        customers.name,
        orders.amount
    FROM
        orders
    JOIN
        customers
    ON
        orders.customer_id = customers.customer_id
    ORDER BY
        customers.customer_id
''').df()
The result is shown as a DataFrame (see Figure 1-11).
Suppose you now want to know the total amount spent by each customer. You can achieve this by aggregating the amount spent using the SUM function in SQL. In addition, you need to use the GROUP BY clause to aggregate the total amount spent per customer:
# join the two tables and aggregate
conn.execute('''
    SELECT
        customers.customer_id,
        customers.name,
        SUM(orders.amount) AS total_spent
    FROM
        orders
    JOIN
        customers
    ON
        orders.customer_id = customers.customer_id
    GROUP BY
        customers.customer_id,
        customers.name
    ORDER BY
        customers.customer_id
''').df()
The result is shown as a DataFrame (see Figure 1-12).
The next section will show how to use DuckDB to directly manipulate pandas DataFrames.
Reading Data from pandas
All the examples up to this point have involved creating the database directly in DuckDB. What if your data is already in a pandas DataFrame? Well, DuckDB can work directly with the pandas DataFrames that you already have in memory.
Suppose you have the following DataFrames:
import pandas as pd

# Employee DataFrame
employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [30, 35, 28, 40],
    'department': ['HR', 'Engineering', 'Marketing', 'Engineering']
})

# Sales DataFrame
sales = pd.DataFrame({
    'sale_id': [101, 102, 103, 104, 105],
    'employee_id': [1, 2, 1, 3, 4],
    'sale_amount': [200, 500, 150, 300, 700],
    'sale_date': ['2023-01-01', '2023-01-03', '2023-01-04',
                  '2023-01-05', '2023-01-07']
})

display(employees)
display(sales)
Figure 1-13 shows the contents of the employees and sales DataFrames.
Suppose you want to find the total sales for each department, as well as find out the average sales per employee for each department. To do this, you’ll need to join the two DataFrames and perform some aggregations. Most importantly, in DuckDB you simply refer to the DataFrames by their names, as this code snippet shows:
# create an in-memory database using DuckDB
conn = duckdb.connect()

# join the DataFrames, group by department, and perform aggregations
query = '''
    SELECT
        e.department,
        SUM(s.sale_amount) AS total_sales,
        AVG(s.sale_amount) AS average_sale_per_employee,
        COUNT(DISTINCT e.employee_id) AS number_of_employees
    FROM
        employees e
    LEFT JOIN
        sales s ON e.employee_id = s.employee_id
    GROUP BY
        e.department
'''
conn.execute(query).df()
The result is shown in Figure 1-14.
How about finding the top performers in the company and listing their departments? The following code snippet shows how this is done:
query = '''
    SELECT
        e.department,
        e.name AS top_employee,
        MAX(s.sale_amount) AS top_sale_amount
    FROM
        employees e
    LEFT JOIN
        sales s ON e.employee_id = s.employee_id
    GROUP BY
        e.department,
        e.name
    ORDER BY
        top_sale_amount DESC
'''
conn.execute(query).df()
Figure 1-15 shows the result of the query.
As demonstrated, DuckDB enables direct reference to pandas DataFrames within SQL statements.
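By default, DuckDB finds DataFrames such as employees and sales by looking them up in the surrounding Python scope. If you prefer to be explicit, or if the DataFrame is not a local variable, you can register it under a view name of your choosing. A minimal sketch (the view name sales_view is arbitrary):

# register the sales DataFrame under an explicit view name
conn.register('sales_view', sales)

conn.execute('''
    SELECT employee_id, SUM(sale_amount) AS total_sales
    FROM sales_view
    GROUP BY employee_id
''').df()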
Why DuckDB Is More Efficient
Earlier in this chapter, we mentioned that DuckDB is both efficient and high-performing. When working with CSV files, for example, it does not need to load the entire CSV file into memory before it can process it. Rather, DuckDB can read and process data from the file on the fly. To see this in action, let’s use the 2015 Flight Delays and Cancellations dataset.
Note
We’ll use this dataset more in Chapter 2, where you learn how to download the various CSV files in the dataset.
For this example, we will use the flights.csv file in the dataset, which contains the details of all the flights in the US for 2015. This file is a good candidate for evaluating the efficiency of DuckDB, as it is relatively large (nearly 600 MB) and has more than 5.8 million rows of data. It has the following fields: YEAR, MONTH, DAY, DAY_OF_WEEK, AIRLINE, FLIGHT_NUMBER, TAIL_NUMBER, ORIGIN_AIRPORT, DESTINATION_AIRPORT, SCHEDULED_DEPARTURE, DEPARTURE_TIME, DEPARTURE_DELAY, TAXI_OUT, WHEELS_OFF, SCHEDULED_TIME, ELAPSED_TIME, AIR_TIME, DISTANCE, WHEELS_ON, TAXI_IN, SCHEDULED_ARRIVAL, ARRIVAL_TIME, ARRIVAL_DELAY, DIVERTED, CANCELLED, CANCELLATION_REASON, AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY, WEATHER_DELAY.
There are two aspects that we will examine in this example:
- The speed of execution of DuckDB
- The memory usage of DuckDB
Execution Speed
Let’s examine the traditional approach of manipulating the CSV file using pandas. First, you need to load the CSV file into a pandas DataFrame:
import pandas as pd

# load the CSV file and time it
%timeit df = pd.read_csv('flights.csv')
This code uses the %timeit magic command in Jupyter Notebook to measure the time it takes to load the CSV file into a DataFrame. On my machine, it took about 7.5 seconds to load the 5.8 million rows of data:
7.46 s ± 568 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tip
The percent symbol (%), when used in Jupyter Notebook, is a prefix to denote a magic command. Magic commands are special commands that provide various functionalities and utilities for working within the Jupyter environment.
Next, we will perform a simple aggregation on the data by calculating the mean arrival delay time for each airline:
df = pd.read_csv('flights.csv')
%timeit df.groupby('AIRLINE')['ARRIVAL_DELAY'].mean().reset_index()
Running the aggregation without the %timeit magic command should yield the result shown in Figure 1-16:
df.groupby('AIRLINE')['ARRIVAL_DELAY'].mean().reset_index()
On average, it took pandas about 186 milliseconds to perform the aggregation:
186 ms ± 8.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In total, the pandas approach (loading plus aggregation) took about 7.6 seconds, most of it spent loading the CSV file.
Let’s now try the aggregation using DuckDB. DuckDB has a function named read_csv_auto() to read the CSV file:
import duckdb

conn = duckdb.connect()

query = '''
    SELECT
        AIRLINE,
        AVG(ARRIVAL_DELAY) AS MEAN_ARRIVAL_DELAY
    FROM
        read_csv_auto('flights.csv')
    GROUP BY
        AIRLINE
    ORDER BY
        AIRLINE;
'''
%timeit df = conn.execute(query).df()
The read_csv_auto() function does not need to load the entire CSV file into memory; rather, it processes the data on the fly, allowing for efficient querying without the memory overhead of loading the entire dataset. This enables DuckDB to handle larger datasets seamlessly, streaming data from disk for analytical operations while keeping memory usage low. Note that this single statement also performs the data aggregation.
These statements took about half a second to complete:
496 ms ± 29.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
From this simple example, you can draw the following observations:
- Using DuckDB dramatically shortens the time needed to perform analytics on your data. This is because DuckDB doesn’t need to spend extra time loading the CSV into memory before it starts to perform the data aggregation. This is useful if you usually perform one-off operations on your CSV file.
- If you need to perform multiple operations on your data, it might be more efficient to load your data into a pandas DataFrame (if you have enough memory on your system to store the data) or into a DuckDB table, as sketched after this list.
- Overall, DuckDB works efficiently on large datasets.
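If you do plan to run many queries against the same file, one option is to materialize the CSV into a DuckDB table once and query that table from then on. The following is a minimal sketch of that pattern, assuming the same flights.csv file and the conn connection from the previous example:

# read the CSV once and materialize it as a DuckDB table
conn.execute('''
    CREATE TABLE flights AS
    SELECT * FROM read_csv_auto('flights.csv')
''')

# subsequent queries run against the table, not the CSV file
conn.execute('''
    SELECT AIRLINE, AVG(ARRIVAL_DELAY) AS MEAN_ARRIVAL_DELAY
    FROM flights
    GROUP BY AIRLINE
''').df()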
Now that we have examined the performance aspect of DuckDB, let’s examine its memory usage.
Memory Usage
To examine the memory usage of DuckDB, let’s first create a function that calculates the memory used by the current process using the psutil package:
import psutil

def memory_usage():
    process = psutil.Process()
    return process.memory_info().rss / (1024 ** 2)  # convert bytes to MB
Let’s measure the memory used by the current process before and after loading the CSV file into a DataFrame:
import pandas as pd

# measure memory before loading the CSV
memory_before = memory_usage()
print(f"Memory used before query: {memory_before:.2f} MB")

# load the CSV file
df = pd.read_csv('flights.csv')

# measure memory after loading the CSV
memory_after = memory_usage()
print(f"Memory used after query: {memory_after:.2f} MB")
You’ll see something like the following:
Memory used before query: 130.64 MB
Memory used after query: 4362.61 MB
Tip
Be sure to restart the kernel in your Jupyter Notebook to get a more accurate view of the memory used by the DataFrame.
The memory used is a whopping 4.2 GB! All of this memory was used just to load the CSV file into a DataFrame in memory. Let’s now compare it with DuckDB, where we don’t have to load the entire CSV file into memory before we can perform processing:
import duckdb

conn = duckdb.connect()

query = '''
    SELECT
        AIRLINE,
        AVG(ARRIVAL_DELAY) AS MEAN_ARRIVAL_DELAY
    FROM
        read_csv_auto('flights.csv')
    GROUP BY
        AIRLINE
    ORDER BY
        AIRLINE;
'''

# measure memory before query execution
memory_before = memory_usage()
print(f"Memory used before query: {memory_before:.2f} MB")

# run the query
df = conn.execute(query).df()

# measure memory after query execution
memory_after = memory_usage()
print(f"Memory used after query: {memory_after:.2f} MB")
The result looks like the following:
Memory used before query: 72.19 MB
Memory used after query: 348.48 MB
As you can see, DuckDB used only about 280 MB of memory, compared to the 4.2 GB used by pandas.
Summary
In this chapter, I introduced DuckDB and provided a quick overview of its key capabilities and features. I began with what DuckDB is and why it stands out in the realm of data management and analytics. Its high-performance analytical queries—combined with versatile integration across multiple programming languages—make DuckDB a powerful tool for various data processing tasks.
I also highlighted DuckDB’s open source nature, which not only makes it cost-effective but also fosters a robust, community-driven ecosystem that continuously enhances its functionality.
Through a quick look at DuckDB, we covered essential operations including loading data, inserting records, querying tables, performing aggregations, and joining tables. We also demonstrated how to seamlessly read data from pandas, showcasing DuckDB’s compatibility with popular data science tools.
Overall, DuckDB offers a unique blend of performance, flexibility, and ease of use, making it an excellent choice for both simple and complex data processing needs. Whether you’re a beginner looking for a straightforward solution or an experienced user seeking a high-performance analytics tool, DuckDB provides a versatile and powerful platform to support your data-driven projects.
In the next chapter, you’ll learn how to use DuckDB to work with various data formats—CSV, Parquet, Excel, and MySQL databases.