Amazon SageMaker Lakehouse Supercharges Apache Iceberg at AWS Pi Day 2025

5 min read · Mar 18, 2025

Forget the “two pizzas per team” parties and dessert buffets, folks. AWS Pi Day isn’t some carb-loading celebration. No, it’s a birthday bash for Amazon S3, the granddaddy of AWS services (alongside fellow elders EC2 and SQS), now a ripe 19 years young. Since 2021, it has become AWS’s annual ‘look what we’ve been cooking up’ extravaganza.

And let me tell you, they’ve been busy! Just 98 days after re:Invent, AWS unleashed a data deluge, announcing the general availability (GA) of SageMaker Unified Studio, the next generation of SageMaker that unifies all your data, analytics, and AI.

This blog is your technical decoder ring for those announcements. We’re skipping the ‘Intro to Lakehouses 101’ lecture here; if you want the background, please check my re:Invent 2024 deep dive. This blog focuses on the ‘what’ and the ‘how’ of the latest additions to the AWS lakehouse ecosystem.

S3 Metadata

In my post-re:Invent 2024 analysis, I overlooked S3 Metadata, a significant offering. This feature provides real-time, object-level metadata (distinct from bucket-level metadata), encompassing details like bucket information, prefixes, keys, and object size.

It integrates seamlessly with S3 object tags, enabling users to define up to 10 custom tags, such as “sensitive data/PII,” for fine-grained IAM policy enforcement. This allows for logical data grouping, object lifecycle management, and selective cross-region replication.
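As a sketch of what that tagging looks like in practice, the snippet below attaches a small tag set to an object with boto3 so that IAM policies, lifecycle rules, or replication filters can match on it. The bucket, key, and tag names are hypothetical placeholders, not anything from the announcement:

```python
# Sketch: tagging an S3 object for fine-grained policy matching.
# Bucket, key, and tag names are illustrative assumptions.

MAX_OBJECT_TAGS = 10  # S3 allows at most 10 tags per object

TAG_SET = [
    {"Key": "classification", "Value": "pii"},
    {"Key": "content-source", "Value": "synthetic"},
]

def tag_object(bucket: str, key: str, tag_set=TAG_SET) -> None:
    """Attach the tag set to one object; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the module loads without AWS deps
    s3 = boto3.client("s3")
    s3.put_object_tagging(Bucket=bucket, Key=key, Tagging={"TagSet": tag_set})

assert len(TAG_SET) <= MAX_OBJECT_TAGS
```

An IAM policy condition such as `s3:ExistingObjectTag/classification` can then gate access to objects tagged `pii` without enumerating keys.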

Metadata examples include “content-source,” indicating data origin (e.g., Bedrock model or synthetic data). Notably, this metadata is stored as an Iceberg table, ensuring accessibility via any Iceberg-compatible analytical tool.

Furthermore, AWS has reduced S3 object tagging pricing by 35%, lowering the cost of capturing and querying custom metadata stored as object tags.

S3 Tables

S3 Tables, optimized for Iceberg at the storage layer, deliver significant performance gains. Continual table optimization automatically scans and rewrites table data in the background, achieving up to 3x faster query performance compared to unmanaged Iceberg tables. Additionally, S3 Tables include optimizations specific to Iceberg workloads that deliver up to 10x higher transactions per second compared to Iceberg tables stored in general purpose S3 buckets.

Since their re:Invent launch in three regions, S3 Tables have expanded to eleven regions, and the limit has been raised to 10,000 tables per table bucket.

Additionally, S3 Tables now offer table management APIs that are compatible with the Apache Iceberg REST Catalog standard, enabling any Iceberg-compatible application to easily create, update, list, and delete tables in an S3 table bucket.
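A minimal PyIceberg sketch of what connecting to that REST endpoint might look like follows. The region, account ID, table-bucket ARN, and endpoint path are assumptions drawn from the documented pattern; verify them against your own account before use:

```python
# Sketch: pointing PyIceberg at the S3 Tables Iceberg REST Catalog.
# Region, account ID, and table-bucket ARN are hypothetical.

REGION = "us-east-1"
TABLE_BUCKET_ARN = "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket"

CATALOG_CONFIG = {
    "type": "rest",
    "uri": f"https://s3tables.{REGION}.amazonaws.com/iceberg",
    "warehouse": TABLE_BUCKET_ARN,
    # SigV4 request signing so the REST calls are authenticated as AWS API calls
    "rest.sigv4-enabled": "true",
    "rest.signing-name": "s3tables",
    "rest.signing-region": REGION,
}

def list_tables(namespace: str):
    """Load the catalog and list tables; requires pyiceberg and AWS credentials."""
    from pyiceberg.catalog import load_catalog  # lazy: not needed at import time
    catalog = load_catalog("s3tables", **CATALOG_CONFIG)
    return catalog.list_tables(namespace)
```

Because the endpoint speaks the standard Iceberg REST Catalog protocol, the same configuration shape should carry over to any IRC-aware engine, not just PyIceberg.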

Partner integrations include DuckDB, Starburst/Trino, Snowflake, PuppyGraph, StreamNative, Daft, Dremio, and HighByte. The DuckDB demonstration highlighted its new notebook-style IDE, enabling users to “attach” to S3 buckets via Glue and execute queries.
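To give a flavor of the DuckDB flow from the demo, here is a rough sketch: install the `iceberg` extension, attach a table bucket, and query it. The `ATTACH` options follow the preview duckdb-iceberg syntax and the ARN is a hypothetical placeholder, so treat this as an outline rather than copy-paste-ready SQL:

```python
# Sketch of the DuckDB demo flow; ATTACH syntax is from the preview
# duckdb-iceberg extension and the table-bucket ARN is hypothetical.

TABLE_BUCKET_ARN = "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket"

SETUP_SQL = f"""
INSTALL iceberg;
LOAD iceberg;
ATTACH '{TABLE_BUCKET_ARN}' AS tb (TYPE iceberg, ENDPOINT_TYPE s3_tables);
"""

def run_query(sql: str):
    """Run a query against the attached catalog; needs duckdb + AWS credentials."""
    import duckdb  # lazy: not needed at import time
    con = duckdb.connect()
    con.execute(SETUP_SQL)
    return con.execute(sql).fetchall()
```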

SageMaker Lakehouse

Introduced at re:Invent 2024, SageMaker Lakehouse provides unified access across Amazon S3 data lakes, Amazon Redshift data warehouses, external data sources and applications, and federated data sources. Its integration with S3 Tables is now generally available (GA). You can access SageMaker Lakehouse from Amazon SageMaker Unified Studio, a single data and AI development environment that brings together functionality and tools from AWS analytics and AI/ML services and engines such as Amazon Athena, Amazon EMR, Amazon Redshift, and Apache Iceberg-compatible engines like Apache Spark or PyIceberg.

SageMaker Lakehouse leverages AWS Glue, which maintains its proprietary API for services like Amazon EMR. However, the key advancement is its support for the Iceberg REST Catalog (IRC) specification, enabling interoperability with any Apache Iceberg-compatible engine, in addition to traditional use cases with Redshift, Athena, and EMR.

This blog demonstrates Snowflake’s ability to query S3 Tables, facilitated by the AWS Glue IRC mapping of Iceberg tables to S3 Tables and AWS Lake Formation credential vending.

SageMaker Lakehouse spans Amazon S3 data lakes (including S3 Tables) and Amazon Redshift data warehouses, and ingests data through zero-ETL integrations from operational databases like Amazon RDS, Aurora, and DynamoDB, and from applications such as Salesforce, Facebook Ads, Instagram Ads, ServiceNow, SAP, Zendesk, and Zoho CRM. The lakehouse also supports querying data in place via federated queries, with SageMaker Lakehouse Federation connectors for sources like Google BigQuery and Snowflake. These complement the mid-2024 general availability (GA) of the Salesforce Data Cloud integration, which enables bidirectional data sharing between Salesforce and customer data lakes.

Real-time data streams from Amazon Managed Streaming for Apache Kafka (Amazon MSK) or Amazon Kinesis Data Streams can now be written directly to Iceberg tables, alongside traditional S3 storage, through integration with Amazon Data Firehose. Features like schema evolution are currently in preview.
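A rough sketch of that Firehose-to-Iceberg setup with boto3 follows. The configuration keys follow the shape of boto3's `IcebergDestinationConfiguration`, but every ARN, database, and table name here is a hypothetical placeholder:

```python
# Sketch: a Firehose delivery stream writing into an Iceberg table via the
# Glue catalog. All ARNs and names are illustrative assumptions.

ICEBERG_DESTINATION = {
    "RoleARN": "arn:aws:iam::111122223333:role/firehose-iceberg-role",
    "CatalogConfiguration": {
        "CatalogARN": "arn:aws:glue:us-east-1:111122223333:catalog",
    },
    # Route incoming records to a Glue database/table pair
    "DestinationTableConfigurationList": [
        {"DestinationDatabaseName": "clickstream", "DestinationTableName": "events"},
    ],
    # Backup/error output location for records that fail to deliver
    "S3Configuration": {
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-iceberg-role",
        "BucketARN": "arn:aws:s3:::my-firehose-errors",
    },
}

def create_stream(name: str):
    """Create the delivery stream; requires boto3 and AWS credentials."""
    import boto3  # lazy: not needed at import time
    firehose = boto3.client("firehose")
    return firehose.create_delivery_stream(
        DeliveryStreamName=name,
        DeliveryStreamType="DirectPut",
        IcebergDestinationConfiguration=ICEBERG_DESTINATION,
    )
```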

SageMaker Catalog

To facilitate the effective utilization of data assets, machine learning models, and AI applications within a defined business context, organizations require a unified catalog. Amazon SageMaker Catalog, leveraging the capabilities of Amazon DataZone, fulfills this requirement.

It provides a centralized repository for discovering and accessing approved data and models, utilizing semantic search capabilities enhanced by generative AI-driven metadata enrichment. Users can engage in collaborative data and AI asset management through streamlined publishing and subscription workflows. SageMaker Catalog also offers automated data quality monitoring, sensitive data identification, and detailed data and ML lineage tracking.

SageMaker Catalog abstracts operational complexities, enabling users to focus on data utilization. It provides a single permission model within Amazon SageMaker Unified Studio, allowing for the consistent definition and enforcement of fine-grained access control policies, ensuring data and AI governance and security.

SageMaker Unified Studio

Amazon SageMaker Unified Studio provides a unified development environment for data scientists, analysts, engineers, and developers engaged in data, analytics, and AI/ML workflows. It aggregates a suite of tools from AWS analytics and AI/ML services, encompassing data processing, SQL analytics, ML model development, and generative AI application development, within a cohesive user interface.

The general availability (GA) of SageMaker Unified Studio, together with its integrations with Amazon Q and Amazon Bedrock, accelerates AI application development.

The Studio supports the rapid prototyping and deployment of custom generative AI applications, providing access to a heterogeneous model ecosystem, including native Amazon Nova models, Anthropic Claude, and partner models like DeepSeek. It also integrates with a range of Amazon Bedrock capabilities, such as Knowledge Bases, Guardrails, Agents, and Flows.

Furthermore, Amazon Q Developer can be leveraged within the unified studio for natural language assistance with tasks like query development for Amazon Redshift and Athena data sources, code authoring in Jupyter Lab notebooks, data discovery in SageMaker Catalog, and troubleshooting.

The integration of Amazon Athena within the SageMaker Unified Studio’s user interface facilitates a streamlined data management experience, eliminating context switching between disparate tools. For example, users can directly leverage Athena’s query capabilities to inspect S3 Metadata and execute S3 Table creation commands within the Studio environment. This integration enhances operational efficiency and data accessibility.
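To make that concrete, here is a hedged sketch of submitting such an inspection query through the Athena API. The metadata database/table names, column selection, and results location are hypothetical; the actual names depend on how your S3 Metadata configuration was set up:

```python
# Sketch: inspecting an S3 Metadata table via Athena. Database, table,
# columns, and the results bucket are illustrative assumptions.

METADATA_SQL = """
SELECT key, size, last_modified_date
FROM "s3_metadata_db"."my_bucket_metadata"
WHERE key LIKE 'raw/%'
ORDER BY last_modified_date DESC
LIMIT 20
"""

def run_athena_query(sql: str = METADATA_SQL) -> str:
    """Submit the query and return its execution id; requires boto3 + credentials."""
    import boto3  # lazy: not needed at import time
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return resp["QueryExecutionId"]
```

Since the metadata lives in an Iceberg table, the same SQL works from any of the Studio's Iceberg-compatible engines, not just Athena.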

AWS provides multiple paths to these services: graphical interfaces such as the AWS Management Console, plus programmatic access through SDKs, CLIs, and APIs, catering to a wide range of user preferences and technical requirements.

Written by Sanjeev Mohan

Sanjeev researches the space of data and analytics. Most recently he was a research vice president at Gartner. He is now a principal with SanjMo.