Chapter 8. Accessing Remote Data Using DuckDB

In all the previous chapters, you have used DuckDB to work with local data, whether stored in MySQL databases or in CSV, JSON, and Parquet files. In practice, however, the data you work with often resides on remote servers and is frequently sourced from multiple locations. Fortunately, DuckDB provides the httpfs extension to let you access remote datasets. DuckDB also supports accessing datasets hosted by Hugging Face, a platform where users share pretrained machine learning models. Hugging Face also hosts a large repository of datasets that developers can download to train their own models.

In this chapter, you’ll learn how to use the httpfs extension in DuckDB to work with remote datasets, as well as use DuckDB to access the vast datasets hosted by Hugging Face.

DuckDB’s httpfs Extension

DuckDB’s httpfs extension is an autoloadable extension that implements a file system for reading and writing remote files. It lets DuckDB read and write files directly over the HTTP and HTTPS protocols, without first downloading them locally. This is particularly helpful when handling large datasets that exceed local storage, accessing real-time or frequently updated data, querying distributed data from multiple remote sources, or integrating seamlessly with cloud storage. It enables efficient remote data analysis, making it ideal for scenarios involving ...
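As a minimal sketch, the following query reads a remote Parquet file directly over HTTPS. The URL is a hypothetical placeholder; recent DuckDB versions autoload httpfs when a query references an http(s):// URL, but you can also install and load it explicitly:

    -- install and load httpfs explicitly (optional where autoloading is enabled)
    INSTALL httpfs;
    LOAD httpfs;

    -- query a remote Parquet file without downloading it first
    -- (the URL below is a hypothetical placeholder)
    SELECT *
    FROM read_parquet('https://example.com/data/flights.parquet')
    LIMIT 10;

When the server supports HTTP range requests, DuckDB fetches only the portions of the Parquet file the query actually needs, which is what makes this practical for large remote files.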
