Use Iceberg with AWS
AWS analytics services such as Amazon EMR, AWS Glue, Amazon Athena, and Amazon Redshift include native support for Apache Iceberg, so you can build transactional data lakes on top of Amazon Simple Storage Service (Amazon S3). All of these services integrate with AWS Glue Data Catalog and use it as the Iceberg catalog.
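To make the catalog integration concrete, here is a minimal sketch of the Spark properties that point an Iceberg catalog at AWS Glue Data Catalog. The helper function `glue_iceberg_spark_conf`, the catalog name `glue_catalog`, and the bucket path are hypothetical and chosen only for illustration. The property keys and class names are the ones Iceberg documents for its Spark and AWS integrations. The sketch assumes the Iceberg Spark runtime and AWS libraries are already on the classpath.

```python
def glue_iceberg_spark_conf(catalog_name: str, warehouse: str) -> dict:
    """Build Spark properties that register an Iceberg catalog backed by AWS Glue.

    catalog_name and warehouse are illustrative inputs, not values from the text.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg SQL extensions (for example, MERGE INTO and CALL procedures).
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the named catalog as an Iceberg catalog in Spark.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Stores table metadata pointers in AWS Glue Data Catalog.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        # Reads and writes data and metadata files through Iceberg's S3 FileIO.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        # Default S3 location for tables created in this catalog.
        f"{prefix}.warehouse": warehouse,
    }


# Hypothetical catalog name and bucket, used only to show the resulting keys.
conf = glue_iceberg_spark_conf("glue_catalog", "s3://example-bucket/warehouse/")
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Each key/value pair can be passed to `SparkSession.builder.config(key, value)` or supplied as a `--conf` argument to `spark-submit`. Tables are then addressed as `glue_catalog.<database>.<table>`.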
The following figure illustrates a data pipeline architecture on AWS that uses Apache Iceberg for data management. Raw data comes from Amazon S3 or from streaming services such as Amazon MSK (Kafka) and Amazon Kinesis. Ingestion tools such as Amazon EMR, AWS Glue, or Kinesis Data Analytics process the data, while AWS Glue Data Catalog or AWS Lake Formation handles metadata management. The processed data is stored in Apache Iceberg tables on Amazon S3, where consumer tools including Amazon Redshift, Amazon Athena, Amazon EMR, and Amazon SageMaker can query and analyze it.
Figure 0.
Let’s highlight some of these services here:
- Amazon Athena
An interactive query service that lets users analyze data directly in Amazon S3 using standard SQL, without first loading the data or building ETL processes.
- Amazon EMR
A managed cluster platform that simplifies running big data frameworks like Apache Hadoop and Apache Spark on AWS to process ...