DuckDB: Up and Running

Book description

DuckDB, an open source in-process database created for OLAP workloads, provides key advantages over more mainstream OLAP solutions: It's embeddable and optimized for analytics. It also integrates well with Python and is compatible with SQL, giving you the performance and flexibility of SQL right within your Python environment. This handy guide shows you how to get started with this versatile and powerful tool.

Author Wei-Meng Lee takes developers and data professionals through DuckDB's primary features and functions, best practices, and practical examples of how you can use DuckDB for a variety of data analytics tasks. You'll also dive into specific topics, including how to import data into DuckDB, work with tables, perform exploratory data analysis, visualize data, perform spatial analysis, and use DuckDB with JSON files, Polars, and JupySQL.

    Understand the purpose of DuckDB and its main functions
  • Conduct data analytics tasks using DuckDB
  • Integrate DuckDB with pandas, Polars, and JupySQL
  • Use DuckDB to query your data
  • Perform spatial analytics using DuckDB's spatial extension
  • Work with a diverse range of data including Parquet, CSV, and JSON

Publisher resources

View/Submit Errata

Table of contents

  1. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. O’Reilly Online Learning
    4. How to Contact Us
    5. Acknowledgements
  2. 1. Getting Started with DuckDB
    1. Introduction to DuckDB
      1. Why Use DuckDB?
      2. High-Performance Analytical Queries
      3. Versatile Integration and Ease of Use Across Multiple Programming Languages
      4. Open Source
    2. A Quick Look at DuckDB
      1. Loading Data into DuckDB
      2. Inserting a Record
      3. Querying a Table
      4. Performing Aggregation
      5. Joining Tables
      6. Reading Data from pandas
    3. Why DuckDB Is More Efficient
      1. Execution Speed
      2. Memory Usage
    4. Summary
  3. 2. Importing Data into DuckDB
    1. Creating DuckDB Databases
    2. Loading Data from Different Data Sources and Formats
      1. Working with CSV Files
      2. Working with Parquet Files
      3. Working with Excel Files
      4. Working with MySQL
    3. Summary
  4. 3. A Primer on SQL
    1. Using the DuckDB CLI
      1. Importing Data into DuckDB
      2. Dot Commands
      3. Persisting the In-Memory Database on Disk
    2. DuckDB SQL Primer
      1. Creating a Database
      2. Creating Tables
      3. Viewing the Schemas of Tables
      4. Dropping a Table
    3. Working with Tables
      1. Populating Tables with Rows
      2. Updating Rows
      3. Deleting Rows
      4. Querying Tables
      5. Joining Tables
      6. Aggregating Data
      7. Analytics
    4. Summary
  5. 4. Using DuckDB with Polars
    1. Introduction to Polars
      1. Creating a Polars DataFrame
      2. Understanding Lazy Evaluation in Polars
    2. Querying Polars DataFrames Using DuckDB
      1. Using the sql() Function
      2. Using the DuckDBPyRelation Object
    3. Summary
  6. 5. Performing EDA with DuckDB
    1. Our Dataset: The 2015 Flight Delays Dataset
    2. Geospatial Analysis
      1. Displaying a Map
      2. Displaying All Airports on the Map
      3. Using the spatial Extension in DuckDB
    3. Performing Descriptive Analytics
      1. Finding the Airports for Each State and City
      2. Aggregating the Total Number of Airports in Each State
      3. Obtaining the Flight Counts for Each Pair of Origin and Destination Airports
      4. Getting the Canceled Flights from Airlines
      5. Getting the Flight Count for Each Day of the Week
      6. Finding the Most Common Timeslot for Flight Delays
      7. Finding the Airlines with the Most and Fewest Delays
    4. Summary
  7. 6. Using DuckDB with JSON Files
    1. Primer on JSON
      1. Object
      2. String
      3. Boolean
      4. Number
      5. Nested Object
      6. Array
      7. null
    2. Loading JSON Files into DuckDB
      1. Using the read_json_auto() Function
      2. Using the read_json() Function
      3. Using the COPY-FROM Statement
    3. Exporting Tables to JSON
    4. Summary
  8. 7. Using DuckDB with JupySQL
    1. What Is JupySQL?
      1. Installing JupySQL
      2. Loading the sql Extension
      3. Integrating with DuckDB
      4. Performing Queries
      5. Storing Snippets
    2. Visualization
      1. Histograms
      2. Box Plots
      3. Pie Charts
      4. Bar Plots
    3. Integrating with MySQL
      1. Using Environment Variables
      2. Using an .ini File
      3. Using keyring
    4. Summary
  9. 8. Accessing Remote Data Using DuckDB
    1. DuckDB’s httpfs Extension
    2. Querying CSV and Parquet Files Remotely
      1. Accessing CSV Files
      2. Accessing Parquet Files
    3. Querying Hugging Face Datasets
      1. Using Hugging Face Datasets
      2. Reading the Dataset Using hf:// Paths
      3. Accessing Files Within a Folder
      4. Querying Multiple Files Using the Glob Syntax
      5. Working with Private Hugging Face Datasets
    4. Summary
  10. 9. Using DuckDB in the Cloud with MotherDuck
    1. Introduction to MotherDuck
      1. Signing Up for MotherDuck
      2. MotherDuck Plans
    2. Getting Started with MotherDuck
      1. Adding Tables
      2. Creating Schemas
      3. Sharing Databases
      4. Creating a Database
      5. Detaching a Database
    3. Using the Databases in MotherDuck
      1. Querying Your Database
      2. Writing SQL Using AI
    4. Using MotherDuck Through the DuckDB CLI
      1. Connecting to MotherDuck
      2. Querying Databases on MotherDuck
      3. Creating Databases on MotherDuck
      4. Performing Hybrid Queries
    5. Summary
  11. Index
  12. About the Author

Product information

  • Title: DuckDB: Up and Running
  • Author(s): Wei-Meng Lee
  • Release date: December 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098159696