Chapter 5. Performing EDA with DuckDB

By this point, you should have a good grasp of the basics of DuckDB. You have seen how to load data into DuckDB from file formats such as CSV and Parquet, as well as from database servers such as MySQL. In this chapter, we’ll apply DuckDB to a practical scenario: exploratory data analysis (EDA).

EDA is an approach to analyzing and visualizing datasets to summarize their main characteristics. The key goal of EDA is to understand the patterns, trends, and relationships within the data. In EDA, we often use the following techniques on our data:

Data summarization

Uses descriptive statistics (such as mean, median, standard deviation, and more) to understand the distribution of the dataset.

Data visualization

Uses libraries such as Matplotlib and Seaborn to plot various types of charts (such as bar charts, pie charts, and more) to visually inspect the distribution of data and the relationships between different types of data.

Trend identification

Identifies the patterns, trends, and anomalies within the data and provides insights into potential factors affecting these observations.

In this chapter, you will learn how to use DuckDB to explore and visualize the 2015 Flight Delays dataset. In particular, you will learn about geospatial analysis, including how to:

  • Display a map

  • Display all the airports on a map

  • Use the spatial extension in DuckDB

  • Convert ...
