Apache Iceberg Support in Collate
Apache Iceberg Metadata in Collate: Why We Support It Through Your Query Engine
Apache Iceberg has become the dominant open table format for data lakes. If you're running analytics at any meaningful scale, chances are Iceberg is somewhere in your stack, whether you put it there intentionally or inherited it from a platform like Snowflake or Databricks. Getting visibility into those Iceberg tables inside a data catalog is something data teams have been asking about for years, and Collate's approach is worth explaining, because the design decision is not intuitive at first glance. A companion video is also available for additional perspective.
A Brief History of Iceberg Catalog Support
When Iceberg gained traction, early access came through query engines that served as an abstraction layer over the Iceberg catalog. Trino was one of the first engines to support it in a practical way, and AWS Glue served as the initial catalog implementation for many teams. Then the Iceberg REST Catalog API arrived, and the ecosystem moved in that direction, with engines like Athena, Dremio, Snowflake, and BigQuery adding their own integrations on top.
Collate's initial Iceberg integration followed the same path the ecosystem took early on. It relied on the PyIceberg client to fetch metadata and bring it into the platform. That worked well enough when Iceberg was accessed primarily through dedicated tooling. But as adoption broadened and more teams started working with Iceberg tables through their existing query engines, a gap emerged: users could ingest metadata via the Iceberg connector, but they couldn't run data quality tests or profiling on those tables. The connector got the metadata in, but it couldn't delegate computation back to a query engine.
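For contrast, here is roughly what that direct-client model looks like in PyIceberg. This is a sketch, not Collate's original connector code, and the catalog name, endpoint, and credentials are placeholder assumptions:

```python
# Sketch of the direct-client model: PyIceberg talks to the Iceberg
# catalog itself, with no query engine in the loop. All names and
# connection details here are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

# Connect to a REST catalog (could equally be Glue, Hive, etc.)
catalog = load_catalog(
    "demo",
    **{
        "uri": "https://iceberg-rest.example.com",
        "token": "<credentials>",
    },
)

# Metadata is fetched directly: namespaces, tables, schemas.
for namespace in catalog.list_namespaces():
    for table_id in catalog.list_tables(namespace):
        table = catalog.load_table(table_id)
        print(table_id, table.schema())

# What this path cannot do is delegate computation: there is no
# engine behind it to run a profiling query or a quality test.
```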
The Design Shift: Use the Compute Engine, Not the Format
The rethinking that led to the current approach starts with a simple observation. Nobody works directly with raw Iceberg. You're not manually traversing manifest files or poking around in the underlying Parquet. You're using Snowflake, Trino, Databricks, Athena, ClickHouse, Doris, or some other compute layer that knows how to talk to your Iceberg catalog.
Those engines already handle the catalog connection, the metadata resolution, the partition pruning, and all the lower-level mechanics. So rather than building a separate integration path for every Iceberg catalog variant (Polaris, LakeKeeper, Nessie, and whatever comes next), Collate takes a different approach: surface Iceberg tables through the native connectors you're already using.
If you're on Snowflake, your external Iceberg tables are accessed via the Snowflake connector. If you're on Trino, they come through the Trino connector. BigQuery, Athena, same idea. The connector you use to bring in your regular tables also brings in the Iceberg-backed ones, and because those connectors have a live query path back to the engine, data quality tests, profiling, lineage, and usage statistics all work the same way they do for any other table.
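To make that concrete, here is a minimal sketch using the trino Python client. The host, user, and schema names are hypothetical; the point is simply that Iceberg-backed tables answer standard SQL like any other table, with Trino resolving the catalog and manifests behind the scenes:

```python
# Sketch: a connector with a live query path sees Iceberg-backed tables
# the same way it sees any other table. Host, user, and schema names
# are hypothetical; Trino handles the Iceberg mechanics itself.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="collate",
    catalog="iceberg",  # the Trino catalog configured against your Iceberg catalog
    schema="analytics",
)

# Standard information_schema queries work; no Iceberg-specific client code.
cur = conn.cursor()
cur.execute(
    "SELECT table_name FROM information_schema.tables "
    "WHERE table_schema = 'analytics'"
)
for (table_name,) in cur.fetchall():
    print(table_name)
```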
This matters because it's not just about metadata ingestion. When you run a data quality check against an Iceberg table in Collate, the actual computation is delegated to your query engine, which is the only component in your stack that knows how to efficiently scan Iceberg data. Collate doesn't need to reinvent that.
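As an illustration of that delegation (a sketch of the pattern, not Collate's internal code), a null check against an Iceberg table reduces to a query the engine runs; the connection details, table, and column names below are hypothetical:

```python
# Illustrative sketch of compute delegation: a quality check is expressed
# as SQL and executed by the query engine, the one component that knows
# how to scan Iceberg efficiently. Names here are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="collate",
    catalog="iceberg", schema="analytics",
)

def null_count(table: str, column: str) -> int:
    """Count NULLs in a column; the scan happens engine-side."""
    cur = conn.cursor()
    cur.execute(f"SELECT count_if({column} IS NULL) FROM {table}")
    return cur.fetchone()[0]

# A "column values must not be null" style test then becomes a threshold
# check on a number the engine computed, not a client-side Parquet scan.
assert null_count("orders", "customer_id") == 0
```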
Ingesting Iceberg Metadata Through Trino: A Quick Walkthrough
Once Trino is connected and metadata ingestion has run, the Collate Explore page displays every data asset catalog for that Trino connection. The Iceberg catalog is listed with its schemas, and the tables within are tagged with their table type, so it's clear which ones are Iceberg-backed. From there, every standard Collate feature is available: data observability, profiling, lineage, usage tracking, and schema-level ER diagrams.
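For teams that prefer to script it, the same ingestion can be run with the openmetadata-ingestion package that Collate builds on. This is a hedged sketch; the hostnames, credentials, and token are placeholders, and the exact connection fields should be checked against the Trino connector docs:

```python
# Hedged sketch of running Trino metadata ingestion programmatically
# via the openmetadata-ingestion package. All hosts, credentials, and
# the JWT token are placeholders.
from metadata.workflow.metadata import MetadataWorkflow

config = {
    "source": {
        "type": "trino",
        "serviceName": "trino_prod",
        "serviceConnection": {
            "config": {
                "type": "Trino",
                "hostPort": "trino.example.com:8080",
                "username": "collate",
                "catalog": "iceberg",  # the Trino catalog backed by Iceberg
            }
        },
        "sourceConfig": {"config": {"type": "DatabaseMetadata"}},
    },
    "sink": {"type": "metadata-rest", "config": {}},
    "workflowConfig": {
        "openMetadataServerConfig": {
            "hostPort": "https://collate.example.com/api",
            "authProvider": "openmetadata",
            "securityConfig": {"jwtToken": "<token>"},
        }
    },
}

workflow = MetadataWorkflow.create(config)
workflow.execute()
workflow.raise_from_status()
workflow.stop()
```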
That last point is worth a brief note. ER diagrams for warehouse-style engines often appear empty by default, because systems like Trino don't enforce primary or foreign key constraints the way a relational database does. Collate handles this by letting you define those relationships manually through the catalog interface. You can link columns, specify foreign key types, and document relationships directly, making the catalog a source of truth for organizational data semantics rather than just a reflection of what the engine enforces.
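In principle, the same relationships can also be set programmatically, since they are stored as table constraints on the table entity. Here is a hedged sketch against the metadata REST API, with the server URL, token, table id, and column FQNs all as placeholder assumptions:

```python
# Hedged sketch: table constraints (which drive ER relationships) can
# also be set via a JSON Patch on the table entity. The server URL,
# token, table id, and column FQNs below are all hypothetical.
import requests

server = "https://collate.example.com/api"
headers = {
    "Authorization": "Bearer <token>",
    "Content-Type": "application/json-patch+json",
}

# Declare orders.customer_id as a foreign key to customers.id.
patch = [
    {
        "op": "add",
        "path": "/tableConstraints",
        "value": [
            {
                "constraintType": "FOREIGN_KEY",
                "columns": ["customer_id"],
                "referredColumns": [
                    "trino_prod.iceberg.analytics.customers.id"
                ],
            }
        ],
    }
]

resp = requests.patch(
    f"{server}/v1/tables/<table-id>", headers=headers, json=patch
)
resp.raise_for_status()
```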
What This Means in Practice
If you have Iceberg tables in your environment and you're already using one of Collate's 120-plus connectors, you likely have Iceberg support without any additional configuration. Connect your query engine, run ingestion, and those Iceberg tables show up alongside everything else.
The benefit is consistency. Whether a table lives in a traditional relational database, a columnar warehouse, or an Iceberg-backed data lake, the interaction model inside Collate is the same. You get the same observability features, the same quality testing interface, and the same lineage tracking. The underlying storage format becomes an implementation detail rather than something data teams need to think about at the catalog level.
For teams managing mixed environments, this is a meaningful simplification. The query engine you're already paying for and already trust handles the heavy lifting. Collate handles the governance layer on top.
Conclusion
Collate has significantly simplified supporting Apache Iceberg within your metadata ecosystem. Data engineers don't need any extra integration work to bring Iceberg tables in, which reduces both workload and complexity. That's governance doing its job across the full stack, not just the parts that feel familiar.
To explore further, consider the Collate Free Tier for managed OpenMetadata or the Product Sandbox with demo data.