6

Using the Arrow Datasets API

In the current ecosystem of data lakes and lakehouses, many datasets are large collections of files in partitioned directory structures rather than a single file. To support this workflow, the Arrow libraries provide an API for interacting with this kind of data, whether structured or unstructured. This is the Datasets API, and it is designed to do much of the heavy lifting of querying such datasets for you.

The Datasets API provides a set of utilities for working with large, distributed, and possibly partitioned datasets that are spread across multiple files. It also builds on the Compute APIs and integrates with Acero, which we covered previously in Chapter ...
