Chapter 7. Using DuckDB with JupySQL

Traditionally, data scientists use Jupyter Notebook to pull data from database servers or from external datasets (such as CSV or JSON files) and store it in pandas DataFrames (see Figure 7-1).

Figure 7-1. Traditional way of querying data as pandas DataFrames and then using them for data visualization
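The workflow in Figure 7-1 can be sketched as follows. This is an illustrative example, not code from the chapter: the file name, column names, and values are made up, and a small CSV is generated inline so the snippet is self-contained.

```python
import csv
import os
import tempfile

import pandas as pd

# Create a small sample CSV standing in for an external dataset
# (file name and columns are hypothetical, for illustration only).
path = os.path.join(tempfile.mkdtemp(), "sales.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["region", "amount"])
    writer.writerows([["east", 100], ["west", 250], ["east", 175]])

# Traditional approach: the *entire* file is read into memory as a
# DataFrame, even if only one aggregate is needed for a chart.
df = pd.read_csv(path)
totals = df.groupby("region")["amount"].sum()
print(totals.to_dict())  # {'east': 275, 'west': 250}
```

The `totals` Series would then typically be handed to a plotting library — the visualization step shown on the right of Figure 7-1.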

They then use the DataFrames for visualization. This approach has several drawbacks:

  • Querying a database server directly may degrade its performance, since operational database servers are often not optimized for analytical workloads.

  • Loading the data into DataFrames consumes precious resources, including memory and compute. For example, even if the intention is to visualize only certain aspects of the dataset, you still need to load the entire dataset into memory before you can visualize it.

  • Plotting visualizations using Matplotlib also uses a significant amount of memory. Behind the scenes, Matplotlib maintains various objects such as figures, axes, lines, text, and other graphical elements in memory. Each of these elements consumes resources as it is created and rendered. Additionally, Matplotlib handles the data arrays used for plotting and temporarily stores them in memory for processing. If you’re creating multiple plots or figures, each figure and its associated data remain in memory until explicitly closed or cleared, leading to increased memory consumption over time.
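The last point can be demonstrated directly: `pyplot` keeps a registry of open figures, and each one holds its objects and data until it is closed. The snippet below is a minimal sketch (it assumes Matplotlib is installed and uses the non-interactive Agg backend so it runs headless).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt

# Each plt.figure() call allocates a Figure object that pyplot
# keeps alive in its internal registry.
for _ in range(3):
    fig = plt.figure()
    fig.add_subplot().plot([1, 2, 3], [4, 5, 6])

print(len(plt.get_fignums()))  # 3 -- all three figures still in memory

# Explicitly closing figures releases them from the registry.
plt.close("all")
print(len(plt.get_fignums()))  # 0
```

In a long-running notebook session, forgetting `plt.close()` after each plot is a common way for memory usage to creep up.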
