Chapter 7. Using Apache Polaris with Apache Spark
With your experimental Apache Polaris environment successfully set up on your laptop, you’re now ready to start exploring its integration with Apache Spark. If you’ve followed the previous chapter, you should have your environment running and be able to access Jupyter Notebook at http://localhost:8888. Although we’ll be working in this local setup, the steps and concepts covered in this chapter apply equally to any Spark environment, whether a local cluster or a cloud-based deployment.
Apache Spark is a powerful, open source, unified analytics engine for processing large-scale data. Its in-memory computation and distributed architecture make it incredibly fast and efficient for handling complex workloads, from batch processing to real-time analytics and machine learning tasks.
In this chapter, we’ll dive into the practical steps to connect your Polaris catalog to Spark, explore the Spark DataFrame API, execute SQL queries on Polaris-managed data, and even use Spark Streaming to interact with Polaris in real time. By the end of this chapter, you’ll have a comprehensive understanding of how to harness the combined power of Apache Polaris and Apache Spark in your data workflows. Let’s get started!
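As a preview of what that first step involves, the sketch below shows one way to configure a Spark session against a Polaris catalog through Apache Iceberg’s REST catalog support. Treat it as a minimal, hypothetical example rather than the chapter’s definitive setup: the catalog name (polaris_demo), the Polaris endpoint, the warehouse name, and the client credentials are placeholders you would replace with values from your own environment, and the Iceberg runtime version should match your Spark installation.

from pyspark.sql import SparkSession

# Minimal sketch: connect Spark to a Polaris catalog via Iceberg's REST catalog.
# All names, URLs, versions, and credentials below are placeholders.
spark = (
    SparkSession.builder
    .appName("polaris-spark-example")
    # Iceberg Spark runtime; choose the version matching your Spark/Scala build
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a Spark catalog named "polaris_demo" backed by the Iceberg REST protocol
    .config("spark.sql.catalog.polaris_demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris_demo.type", "rest")
    .config("spark.sql.catalog.polaris_demo.uri", "http://localhost:8181/api/catalog")
    # Catalog (warehouse) name as created in Polaris, plus OAuth client credentials
    .config("spark.sql.catalog.polaris_demo.warehouse", "my_polaris_catalog")
    .config("spark.sql.catalog.polaris_demo.credential", "<client-id>:<client-secret>")
    .config("spark.sql.catalog.polaris_demo.scope", "PRINCIPAL_ROLE:ALL")
    .getOrCreate()
)

# Quick smoke test: list namespaces in the Polaris-managed catalog
spark.sql("SHOW NAMESPACES IN polaris_demo").show()

Once the session is created, the Polaris-managed catalog behaves like any other Spark catalog, which is the foundation for the DataFrame, SQL, and streaming examples that follow.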
You can also find all of these code snippets in the book’s GitHub repository:
Connecting Your Apache Polaris Catalog to Apache Spark
To use Apache ...