Snowflake data catalog: two native options, one data estate

What is a Snowflake data catalog?

Running Snowflake at scale without a catalog creates costs that compound over time. Engineers rebuild tables that already exist because nobody can find them, and permission changes drift out of step with who needs access. Analysts working from different starting points produce different answers to the same customer-count question, and AI agents built on the same inconsistencies return contradictions that erode trust in every downstream system. The mismatch tends to surface during an audit rather than before one, and the cleanup runs up against budget and time nobody planned for.

A Snowflake data catalog is the inventory that stops those failures. It records every object in the account (databases, schemas, tables, columns, views, pipes, and functions) and pairs each one with business context, lineage, and access rules. Our first article in this series walked through the three layers of Snowflake metadata. Technical metadata describes what the data looks like, business metadata explains what it means, and operational metadata tracks how it is used. A catalog is what turns those three layers into something people and AI agents can search, trust, and govern.

The question for most teams today is not whether to have a catalog. Snowflake now ships two of its own. Snowflake Horizon Catalog handles the platform's own objects. Snowflake Open Catalog handles Apache Iceberg tables shared across engines. The real question is which catalog surfaces you need, how far they take you, and where you still need a layer on top.

Article Contents

Snowflake's two native catalogs
Key features that a data catalog delivers in Snowflake
Snowflake native vs. third-party data catalogs
Best practices for Snowflake data catalog management
Frequently asked questions

Snowflake's two native catalogs

Each of Snowflake's two catalogs serves a different class of data. Horizon Catalog is the governance and discovery layer for objects stored in Snowflake. Open Catalog is the Iceberg REST catalog that lets Snowflake, Spark, Trino, and other engines read and write the same Apache Iceberg tables (Iceberg is the open table format for cloud data lakes). Teams that mix Snowflake-managed data with Iceberg tables end up using both catalogs, and knowing which surface owns which part of the estate is the first step in any Snowflake catalog strategy.

Snowflake Horizon Catalog

Snowflake Horizon Catalog covers the objects Snowflake itself manages. It includes centralized object tagging and classification, masking policies that attach to those tags, row-access policies set on tables and views, Universal Search across every database in the account, Cortex Search for natural-language discovery, and column-level lineage derived from ACCESS_HISTORY. Snowflake Intelligence, the agent interface layered on top of Horizon, lets analysts and engineers ask questions in natural language and get answers grounded in the catalog.

Data platform teams often reach for ACCESS_HISTORY. The view exposes column-level lineage through the objects_modified JSON blob. Each record carries directSources (the columns a query reads immediately) and baseSources (the columns the data traces back to through upstream transformations). A query against the view answers the "if I change this column, what breaks" question without leaving SQL.

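-- Queries from the last 7 days that wrote to at least one object.
-- objects_modified carries the column-level lineage as JSON.
SELECT query_id, query_start_time, objects_modified
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY
WHERE query_start_time > DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND ARRAY_SIZE(objects_modified) > 0
ORDER BY query_start_time DESC;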
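
Answering the "what breaks" question for one specific column means flattening that JSON. The sketch below assumes a hypothetical source column, ANALYTICS.SALES.CUSTOMERS.EMAIL, and uses the field names (objectName, columns, columnName, baseSources) as Snowflake documents them for ACCESS_HISTORY. Treat it as a starting point and check the output against your own account.

```sql
-- Sketch: target columns written in the last 7 days whose data traces back
-- to one source column. The source object and column names are hypothetical.
SELECT DISTINCT
    om.value:"objectName"::STRING  AS target_object,
    col.value:"columnName"::STRING AS target_column,
    src.value:"objectName"::STRING AS source_object,
    src.value:"columnName"::STRING AS source_column
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY AS ah,
     LATERAL FLATTEN(input => ah.objects_modified) AS om,
     LATERAL FLATTEN(input => om.value:"columns") AS col,
     LATERAL FLATTEN(input => col.value:"baseSources") AS src
WHERE ah.query_start_time > DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND src.value:"objectName"::STRING = 'ANALYTICS.SALES.CUSTOMERS'
  AND src.value:"columnName"::STRING = 'EMAIL'
ORDER BY target_object, target_column;
```

Swapping baseSources for directSources narrows the result to columns that read the source in a single hop.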

Horizon capabilities ship with Snowflake editions. Tagging, classifications, masking, row-access, and ACCESS_HISTORY require Enterprise Edition or higher, and Cortex Search and Snowflake Intelligence bill credits separately from standard warehouse compute.

Snowflake Open Catalog

Snowflake Open Catalog is a managed service for Apache Polaris, the Iceberg REST catalog project that Snowflake open-sourced in July 2024 and that graduated to an Apache top-level project in February 2026. Open Catalog governs Apache Iceberg tables, allowing multiple engines to read and write the same tables without proprietary lock-in. The official Apache Polaris list of supported engines includes Doris, Flink, Spark, Dremio OSS, StarRocks, and Trino. Snowflake, Starburst, Amazon Athena, and other engines interoperate through the Iceberg REST specification.

Registering an external Iceberg REST catalog inside Snowflake uses CREATE CATALOG INTEGRATION with CATALOG_SOURCE = ICEBERG_REST. A minimal integration for Open Catalog looks like this.

CREATE OR REPLACE CATALOG INTEGRATION polaris_catalog
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'default'
  REST_CONFIG = (
    CATALOG_URI = 'https://<account>.snowflakecomputing.com/polaris/api/catalog'
    CATALOG_NAME = 'my_catalog'
    CATALOG_API_TYPE = PUBLIC
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<client_id>'
    OAUTH_CLIENT_SECRET = '<client_secret>'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
  )
  ENABLED = TRUE;
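
Once the integration exists, a table already registered in the remote catalog can be surfaced in Snowflake. The following is a minimal sketch, assuming an external volume named my_external_volume and a remote table named customers. Both names, and the local table name customers_iceberg, are hypothetical placeholders.

```sql
-- Inspect the integration created above.
DESCRIBE CATALOG INTEGRATION polaris_catalog;

-- Register a table that already exists in the remote catalog.
-- my_external_volume, customers_iceberg, and 'customers' are hypothetical.
CREATE ICEBERG TABLE customers_iceberg
  EXTERNAL_VOLUME = 'my_external_volume'
  CATALOG = 'polaris_catalog'
  CATALOG_TABLE_NAME = 'customers';

-- Read it like any other table.
SELECT COUNT(*) FROM customers_iceberg;
```

Because the integration sets CATALOG_NAMESPACE = 'default', the table lookup in this sketch resolves against that namespace.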

Why both catalogs matter now

Iceberg has moved from an edge case to a default pattern for enterprises that want multiple query engines to work against the same tables. Horizon covers the Snowflake-managed slice of that picture, and Open Catalog coordinates the Iceberg slice, including tables that live outside Snowflake. Most enterprise data estates use both surfaces, and the rest of this article covers the features they deliver and where the native catalogs hit their boundary.

Key features that a data catalog delivers in Snowflake

With both native catalogs in view, the next question is what they deliver. Catalog features in Snowflake handle three jobs. The catalog inventories what exists, provides the context on what each object means and whether to trust it, and enforces the access rules that govern its use. The features that matter most serve all three at once.

Inventory and discovery

A catalog starts as an inventory. Every database, schema, table, column, view, pipe, and function the account contains becomes a searchable record with ownership and classification attached. Horizon's Universal Search scans across databases, and Cortex Search layers natural-language discovery on top, so a user can ask "where is revenue data sliced by region" without knowing which schema or column names to type. Tags drive much of that discovery, so the first step for most teams is to define and apply tags that reflect how people search for data.

CREATE TAG data_domain
  ALLOWED_VALUES 'sales', 'finance', 'marketing', 'product', 'hr';

ALTER TABLE sales.customers SET TAG data_domain = 'sales';
ALTER TABLE sales.orders SET TAG data_domain = 'sales';

Once tags are in place, search results can be filtered by domain, owner, or sensitivity, which makes the catalog useful rather than merely exhaustive.
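
The same filter is available in SQL. As a sketch, the query below lists every object carrying data_domain = 'sales' by reading the ACCOUNT_USAGE.TAG_REFERENCES view. ACCOUNT_USAGE views lag behind real time, so a freshly applied tag may not appear immediately.

```sql
-- Sketch: find every live object tagged with data_domain = 'sales'.
SELECT object_database,
       object_schema,
       object_name,
       domain,        -- object type, for example TABLE or COLUMN
       column_name,   -- populated only for column-level tags
       tag_value
FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES
WHERE tag_name = 'DATA_DOMAIN'
  AND tag_value = 'sales'
  AND object_deleted IS NULL
ORDER BY object_database, object_schema, object_name;
```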

Meaning and trust

Inventory alone is not enough. A list of tables and columns without business context still leaves users guessing what each field means. A catalog closes the distance by attaching descriptions, glossary definitions, classifications, and quality signals to each object. In Snowflake, table and column COMMENT clauses carry the descriptions Horizon exposes in search results, and classifications flag sensitive data (PII, PHI, confidentiality levels) so downstream tools know to apply the right policies. Data quality signals from Snowflake's own checks or a third-party tool add freshness and validity context so users can tell whether a given number is fit to act on. Lineage belongs in this cluster too, since understanding where a column came from is part of knowing whether to trust it.
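
A short sketch shows how the three signals can be attached to one table. It reuses the sales.customers table and email column from the examples in this article. The description text is illustrative, and both classification and data metric functions depend on the account's edition and privileges.

```sql
-- Meaning: a table-level description that Horizon can show in search.
-- The wording of the comment is illustrative only.
COMMENT ON TABLE sales.customers IS
  'One row per customer account. Excludes prospects and test accounts.';

-- Classification: ask Snowflake to detect sensitive columns and
-- apply its system classification tags automatically.
CALL SYSTEM$CLASSIFY('sales.customers', {'auto_tag': true});

-- Quality: call a system data metric function ad hoc to count NULL emails.
SELECT SNOWFLAKE.CORE.NULL_COUNT(SELECT email FROM sales.customers);
```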

Governance and access

Governance and access turn the catalog into something regulated industries can rely on. Role-based access control grants privileges at the database, schema, table, or column level. Masking policies attach to tags, so a policy written once applies to every column carrying that tag, while row-access policies attach to the tables and views they filter. Integration across Snowflake-managed tables, Iceberg tables registered through Open Catalog, and external sources accessed through external stages keeps governance consistent regardless of where the data sits. The catalog is what lets policy, access, and discovery operate on the same set of records.
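
The access side can be sketched in a few statements. Everything named here is an assumption for illustration: the sales_analyst and SALES_ADMIN roles, a region column on sales.orders, and a governance.region_entitlements mapping table with role_name and region columns.

```sql
-- Sketch: role-based grants. Assumes the role already has USAGE
-- on the database that contains the sales schema.
CREATE ROLE IF NOT EXISTS sales_analyst;
GRANT USAGE ON SCHEMA sales TO ROLE sales_analyst;
GRANT SELECT ON TABLE sales.orders TO ROLE sales_analyst;

-- Sketch: row-access policy driven by a hypothetical mapping table.
CREATE ROW ACCESS POLICY sales.orders_region_policy
  AS (order_region STRING) RETURNS BOOLEAN ->
    IS_ROLE_IN_SESSION('SALES_ADMIN')
    OR EXISTS (
      SELECT 1
      FROM governance.region_entitlements AS e
      WHERE e.role_name = CURRENT_ROLE()
        AND e.region = order_region
    );

-- Attach the policy to the table, bound to its region column.
ALTER TABLE sales.orders
  ADD ROW ACCESS POLICY sales.orders_region_policy ON (region);
```

Keeping entitlements in a mapping table means access changes become row updates rather than policy rewrites.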

Snowflake native vs. third-party data catalogs

Horizon Catalog and Open Catalog cover a lot of ground inside the Snowflake plane, and the question most teams eventually hit is what happens outside it. What native catalogs cover well is data that lives in Snowflake or is shaped as Apache Iceberg. Horizon governs Snowflake-managed objects, their tags, classifications, policies, and in-platform lineage. Open Catalog coordinates Iceberg tables across the engines that support the Iceberg REST spec. For a team whose analytics stack terminates in Snowflake and whose pipelines never leave it, the native catalogs are often enough.

Where they stop is the rest of the stack. Tracing a column from a Snowflake table through a dbt model into a Tableau dashboard and then into an ML feature store is out of scope for both Horizon and Open Catalog. Each surface only sees the queries and objects within its own plane. A unified search across Snowflake, Databricks, Postgres, and SaaS sources sits outside their field of view. A business glossary maintained by a governance council that works across every system cannot live in either catalog. AI agents that pull from the entire data estate need a single semantic layer to avoid contradictions, and that layer has to span every source, which is the job of a cross-platform catalog.

A third-party data catalog makes sense when any of those cases apply. Multi-warehouse and multi-cloud estates need a layer above the individual catalogs. Regulated industries running governance committees rely on workflow systems that native catalogs do not provide. AI agent initiatives that depend on consistent semantics across all sources require a single place to define the meaning of a term. The decision rule is simple: choose native catalogs when data stays within Snowflake, and add a third-party catalog when the estate spans multiple systems.

Collate is an AI-for-data semantic intelligence platform built for that work. Built on the Apache 2.0 OpenMetadata foundation that serves as the open context layer for AI agents and human users, Collate sits above Snowflake's native catalogs rather than replacing them. It pulls metadata from Horizon and Open Catalog alongside dbt, Fivetran, Tableau, feature stores, and every other source in the data estate, and exposes a unified semantic intelligence graph that people and AI agents can rely on.

Best practices for Snowflake data catalog management

Whatever catalog mix a team lands on, a small set of practices decides whether the investment compounds or decays. Five of them show up repeatedly in teams that keep Snowflake catalogs healthy at scale. Each addresses a failure mode that costs real time or credits when ignored.

Standardize naming conventions across every object

Start with a naming standard that covers databases, schemas, tables, columns, and tags. Encode environment (prod, stage, dev), domain (sales, finance, marketing), and owner where it fits the object. Without a standard, the same customer table appears under three spellings across schemas, and nobody can tell which is authoritative. Naming matters more once Open Catalog enters the picture, since Iceberg tables registered through Polaris get queried from Spark, Trino, and other engines that will carry those names outside Snowflake itself.
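
A naming standard is easier to keep when it can be checked. The query below is a sketch under an assumed convention: database names start with PROD_, STAGE_, or DEV_, and table names are upper snake case. Both rules are hypothetical, so substitute the patterns your standard defines.

```sql
-- Sketch: list live tables that break an assumed naming convention.
-- REGEXP_LIKE must match the whole string, so the patterns are not anchored.
SELECT table_catalog,
       table_schema,
       table_name
FROM SNOWFLAKE.ACCOUNT_USAGE.TABLES
WHERE deleted IS NULL
  AND table_schema <> 'INFORMATION_SCHEMA'
  AND (
        NOT REGEXP_LIKE(table_catalog, '(PROD|STAGE|DEV)_[A-Z0-9_]+')
     OR NOT REGEXP_LIKE(table_name, '[A-Z][A-Z0-9]*(_[A-Z0-9]+)*')
      )
ORDER BY table_catalog, table_schema, table_name;
```

Quoted mixed-case identifiers show up in this list as violations, which is usually the intent.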

Add business-friendly descriptions to every high-value dataset

Every high-value table and column deserves a description that tells a reader what the object represents, who owns it, and what it does not cover. Snowflake carries that content in COMMENT clauses that Horizon exposes in search results, and a one-line comment becomes the authoritative definition every tool in the estate can read.

COMMENT ON COLUMN sales.customers.customer_id IS
  'Primary customer identifier sourced from Salesforce CRM. Unique per account; does not represent end-user contacts.';

Descriptions are the cheapest trust signal a catalog can carry, and teams that adopt the habit during schema changes save hours of Slack DMs every week.
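
Coverage can be tracked with a query. As a sketch, the statement below lists columns in the current database's SALES schema that still have no description. The schema name follows the examples in this article and should be adjusted to your own.

```sql
-- Sketch: columns in the SALES schema that lack a COMMENT.
SELECT table_schema,
       table_name,
       column_name
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = 'SALES'
  AND (comment IS NULL OR TRIM(comment) = '')
ORDER BY table_name, ordinal_position;
```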

Apply tagging and classification consistently

Tags are how a catalog filters, governs, and reports. Define a short set of categories (sensitivity, domain, ownership, environment) and commit to applying them during every new object creation rather than as a cleanup project six months later. Pair classification tags with masking policies so the policy itself is written once, and every newly tagged object inherits it.

CREATE TAG pii_classification
  ALLOWED_VALUES 'public', 'internal', 'confidential', 'restricted';

ALTER TABLE sales.customers MODIFY COLUMN email
  SET TAG pii_classification = 'confidential';

Tagging breaks down when categories proliferate, so keep the vocabulary short and publish it where developers can find it.
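
Pairing the tag with a masking policy might look like the sketch below. The policy name and the DATA_STEWARD role are hypothetical, and the policy covers STRING columns only, since a tag takes one masking policy per data type. Depending on where the tag lives, the tag name passed to SYSTEM$GET_TAG_ON_CURRENT_COLUMN may need to be fully qualified.

```sql
-- Sketch: one policy for every STRING column tagged pii_classification.
CREATE MASKING POLICY mask_by_pii_classification
  AS (val STRING) RETURNS STRING ->
    CASE
      -- Low-sensitivity values pass through unchanged.
      WHEN SYSTEM$GET_TAG_ON_CURRENT_COLUMN('pii_classification')
           IN ('public', 'internal') THEN val
      -- A hypothetical steward role may see confidential values.
      WHEN IS_ROLE_IN_SESSION('DATA_STEWARD') THEN val
      ELSE '***MASKED***'
    END;

-- Attach the policy to the tag, not to individual columns.
ALTER TAG pii_classification
  SET MASKING POLICY mask_by_pii_classification;
```

After this, tagging a new STRING column with pii_classification is enough to protect it, with no per-column policy statement.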

Document data ownership on every object that matters

Attach a named owner to every high-value table and view. An owner tag that points to an individual or team gives auditors and AI assistants a deterministic answer to "who is accountable for this data," rather than the "whoever wrote the pipeline left the company" answer that shows up during governance reviews.
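
An owner tag and a gap report can be sketched as follows. The tag name data_owner and the value 'sales-data-team' are assumptions, and the report relies on ACCOUNT_USAGE views, which lag behind real time.

```sql
-- Sketch: a free-text owner tag (name and value are hypothetical).
CREATE TAG data_owner
  COMMENT = 'Team or individual accountable for the object';

ALTER TABLE sales.customers SET TAG data_owner = 'sales-data-team';

-- Sketch: live tables that have no data_owner tag set directly on them.
SELECT t.table_catalog,
       t.table_schema,
       t.table_name
FROM SNOWFLAKE.ACCOUNT_USAGE.TABLES AS t
LEFT JOIN SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES AS r
       ON r.object_id = t.table_id
      AND r.domain = 'TABLE'
      AND r.tag_name = 'DATA_OWNER'
      AND r.object_deleted IS NULL
WHERE t.deleted IS NULL
  AND t.table_schema <> 'INFORMATION_SCHEMA'
  AND r.tag_name IS NULL
ORDER BY t.table_catalog, t.table_schema, t.table_name;
```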

Monitor lineage as a program, not an incident response

Lineage belongs in a monthly platform review, not only a post-outage investigation. Pull ACCESS_HISTORY against the last thirty days, look for tables with upstream sources that changed without warning, and catch schema drift before a dashboard starts returning zeros. ACCESS_HISTORY stops at the Snowflake boundary, so teams running dbt, Fivetran, or BI tools downstream need a lineage layer that continues past the warehouse (the Collate Snowflake integration is one example). Teams that treat lineage as continuous make small corrections every week instead of writing long postmortems every quarter.
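
One way to start the monthly review is to list the DDL that touched tables in the window. The sketch below reads the object_modified_by_ddl column of ACCESS_HISTORY, with field names as Snowflake documents them. The filter on ALTER and REPLACE is an assumption about which operations count as drift for your team.

```sql
-- Sketch: table-level DDL from the last 30 days, newest first.
SELECT query_start_time,
       user_name,
       object_modified_by_ddl:"objectName"::STRING    AS object_name,
       object_modified_by_ddl:"operationType"::STRING AS operation_type,
       query_id
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY
WHERE query_start_time > DATEADD(day, -30, CURRENT_TIMESTAMP())
  AND object_modified_by_ddl IS NOT NULL
  AND UPPER(object_modified_by_ddl:"objectDomain"::STRING) = 'TABLE'
  AND UPPER(object_modified_by_ddl:"operationType"::STRING) IN ('ALTER', 'REPLACE')
ORDER BY query_start_time DESC;
```

Each row carries a query_id, which can be looked up in query history to see the exact statement that changed the table.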

The five practices compound into a single discipline. Naming standards make tags readable, allowing classifications to be reused across dozens of objects. Once classifications are reusable, ownership reviews move from manual spreadsheet work to a tag query, and lineage monitoring starts catching schema drift before it causes an outage. Treated as a program with an owner and a monthly review cadence, the Snowflake catalog becomes a reliable foundation for everything downstream.

For teams whose data stays inside Snowflake, these practices are the whole job. Estates that extend into dbt, Fivetran, Tableau, or ML feature stores need the same discipline applied to every source. Collate picks up the thread there, carrying naming, tagging, ownership, and lineage across the full stack.

Frequently asked questions

What is a Snowflake data catalog?

A Snowflake data catalog is an inventory of every object in a Snowflake account (databases, schemas, tables, columns, views, pipes, and functions) paired with business context, lineage, and access rules. It turns the three layers of Snowflake metadata (technical, business, operational) into something people and AI agents can search, trust, and govern. Snowflake ships two catalog products today: Horizon Catalog for Snowflake-managed objects and Open Catalog for Apache Iceberg tables shared across engines.

Does Snowflake have a built-in data catalog?

Yes, Snowflake ships two of them. Horizon Catalog is the governance and discovery layer for Snowflake-managed objects. It includes object tagging, classification, masking and row-access policies, Universal Search, Cortex Search for natural-language discovery, and column-level lineage through ACCESS_HISTORY. Snowflake Open Catalog is a managed Apache Polaris service that enables multiple query engines to read and write the same Apache Iceberg tables via the Iceberg REST specification. Most enterprise Snowflake estates end up using both.

What is the difference between Snowflake Horizon Catalog and Snowflake Open Catalog?

Horizon Catalog governs the objects Snowflake itself manages, including tags, classifications, masking and row-access policies, search, and in-platform lineage through ACCESS_HISTORY. Snowflake Open Catalog is a managed Apache Polaris service for Apache Iceberg tables, built on the Iceberg REST specification so that Snowflake, Spark, Trino, Flink, Dremio, and other engines can all read and write the same tables without proprietary lock-in. Horizon focuses on the Snowflake-native side, while Open Catalog handles multi-engine Iceberg data.

Do I still need a third-party data catalog if I use Horizon?

Not necessarily. If everything lives inside Snowflake, Horizon and Open Catalog together cover Snowflake-managed objects and Iceberg tables across engines, which is enough for teams whose analytics stack terminates in Snowflake. A third-party catalog becomes necessary when the estate extends to dbt, Fivetran, Tableau, ML feature stores, Databricks, or SaaS sources, or when an AI agent needs a single semantic layer across all sources. Cross-platform lineage, unified search, and governance workflows all sit outside Snowflake's plane.

How does Snowflake Open Catalog work with Apache Iceberg?

Snowflake Open Catalog is a managed service for Apache Polaris, the Iceberg REST catalog project that Snowflake open-sourced in July 2024. Polaris graduated to an Apache top-level project in February 2026. Open Catalog implements the Iceberg REST Catalog specification, which means query engines like Apache Spark, Trino, Flink, Dremio, and StarRocks can all read and write the same Iceberg tables Snowflake sees. In Snowflake, Open Catalog is registered through CREATE CATALOG INTEGRATION with CATALOG_SOURCE = ICEBERG_REST.

Is Snowflake Horizon Catalog free?

Horizon Catalog is not sold as a separate SKU, and its capabilities ship with Snowflake editions. Tagging, classifications, masking policies, row-access policies, and ACCESS_HISTORY require Enterprise Edition or higher. Cortex Search and Snowflake Intelligence consume credits separately from standard warehouse compute, with Cortex Search billing across embedding tokens, serving compute, and storage. Plan for Enterprise Edition pricing plus Cortex Search credit consumption if the deployment depends on natural language discovery or AI agent interactions.