Snowflake Data Warehouse: the architecture that broke the data stack open

What is the Snowflake data warehouse?

Snowflake's architecture began with a single decision that no major data warehouse had made before. Compute and storage would scale independently, on the same data, without copies. The founding team described the choice in a 2016 SIGMOD paper, "The Snowflake Elastic Data Warehouse", and that paper is the reason most data professionals know Snowflake by name today. Every capability data teams now take for granted (workload isolation, secure data sharing, usage-based billing, near-instant scaling, time travel without backups) traces back to that one architectural idea.

The Snowflake Data Warehouse is a fully managed cloud data platform built on what Snowflake calls a multi-cluster, shared data architecture. Storage holds data once on cloud object storage as columnar micro-partitioned files. Compute runs in virtual warehouses that scale independently, come in T-shirt sizes from XS to 6XL, and are billed per second of running time. A cloud services layer holds metadata and orchestrates the work between the other two layers.

The platform serves the workloads enterprises run on a SQL-first warehouse, as well as those that have sat outside warehouses for years. Business intelligence and analytics are the historic core. Data engineering pipelines load and transform data through SQL or Snowpark. The data lake pattern arrives through native Apache Iceberg support and Snowflake Open Catalog. Machine learning and agentic AI run through Snowpark Container Services, plus a set of AI capabilities Snowflake calls Cortex. Snowflake offers four editions, with most enterprise deployments running on Enterprise or Business Critical because the governance primitives that make Horizon Catalog useful require those tiers.

The article also defines the boundary of what Snowflake handles. The platform separated compute from storage. The layers above separate the meaning from the substrate. This piece is the foundational article in a three-part series. The two sister articles cover the metadata and catalog layers that ride on top.

The architecture that broke the data stack open

The 2016 SIGMOD paper described the design as multi-cluster, shared data. Traditional data warehouses ran on shared-nothing clusters where each node owned a slice of the storage. Adding more compute meant rebalancing data across nodes. Storing more data meant adding nodes, and therefore compute, whether or not the workload needed it. Running two workloads against the same data meant ETL pipelines and scheduled jobs competed for the same compute and storage.

Snowflake's architecture removed those constraints by making storage external and compute disposable. Multiple virtual warehouses can read the same data simultaneously without interfering with one another because the data they access resides in cloud object storage rather than on the compute nodes. Spinning up a new warehouse for a new workload takes seconds. Suspending it when the work finishes takes the same time.
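
As a rough illustration of how little it takes to declare a warehouse per workload, the sketch below builds CREATE WAREHOUSE statements in Python. The helper function, warehouse names, and sizes are hypothetical. The options it emits (WAREHOUSE_SIZE, AUTO_SUSPEND, AUTO_RESUME, INITIALLY_SUSPENDED) are standard Snowflake DDL, but check the current documentation before running the output.

```python
def create_warehouse_sql(name: str, size: str = "XSMALL", auto_suspend_s: int = 60) -> str:
    """Return a CREATE WAREHOUSE statement for one workload-specific warehouse.

    Names and defaults are illustrative assumptions, not recommendations.
    """
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {name} "
        f"WITH WAREHOUSE_SIZE = '{size}' "
        f"AUTO_SUSPEND = {auto_suspend_s} "  # seconds of inactivity before suspending
        "AUTO_RESUME = TRUE "                # wake up when the next query arrives
        "INITIALLY_SUSPENDED = TRUE"         # do not start running on creation
    )


# Two workloads, two warehouses, one copy of the data (hypothetical names).
statements = [
    create_warehouse_sql("FINANCE_WH", size="MEDIUM"),
    create_warehouse_sql("PIPELINE_WH", size="LARGE", auto_suspend_s=120),
]

for stmt in statements:
    print(stmt + ";")
```

Both warehouses would read the same tables. Neither statement mentions storage, which is the point of the design.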

Six capabilities follow from that architecture, and most of what data teams reach for in Snowflake can be traced back to one of them.

  1. Independent scaling. Storage grows because the business loads more data, not because the warehouse needs more memory.
  2. Workload isolation. A finance team's quarterly close runs on a different warehouse from the data engineering team's pipelines, on the same data.
  3. Per-second billing. A warehouse running only when a query is in flight changes the cost model from rack-sized appliances to a metered utility.
  4. Secure data sharing. Two Snowflake accounts can read the same table without copying it, because storage is the same physical layer.
  5. Time Travel. Up to 90 days of history on Enterprise edition and above (one day on Standard) through AT and BEFORE clauses, without restoring backups, because the storage layer keeps versioned micro-partitions.
  6. Marketplace economics. The same shared-storage primitive lets vendors publish data products that customers query in place.
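
To make the AT and BEFORE clauses from the Time Travel item concrete, here is a minimal sketch that builds two Time Travel queries as strings. The function names, the table name, and the query ID are placeholders invented for illustration. The clause forms follow Snowflake's documented AT(OFFSET => ...) and BEFORE(STATEMENT => ...) syntax.

```python
def time_travel_at_offset(table: str, seconds_ago: int) -> str:
    """Query a table as it looked `seconds_ago` seconds in the past."""
    if seconds_ago <= 0:
        raise ValueError("seconds_ago must be positive")
    # OFFSET is expressed in seconds relative to now, so it is negative.
    return f"SELECT * FROM {table} AT(OFFSET => -{seconds_ago})"


def time_travel_before_statement(table: str, query_id: str) -> str:
    """Query a table as it looked just before a given statement ran."""
    return f"SELECT * FROM {table} BEFORE(STATEMENT => '{query_id}')"


# Hypothetical table and query ID.
print(time_travel_at_offset("orders", 3600))
print(time_travel_before_statement("orders", "<query-id-of-bad-update>"))
```

Both queries only succeed while the requested point in time is still inside the table's retention period.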

Each capability was hard or impossible on shared-nothing architectures. Snowflake's three-layer design is what made each one routine. The next architectural shift in this stack will not affect the storage or compute layers. It will change the layer above, where meaning, semantics, and AI-ready context live.

What Snowflake handles by design

Storage. Tables live as immutable columnar micro-partitions on cloud object storage. The platform handles compression, clustering, and partitioning, with the option to override clustering keys on the largest tables. Every micro-partition is versioned, which is what makes Time Travel and zero-copy clones possible. Cloning a multi-terabyte table is metadata-only, takes seconds, and adds no extra storage cost until the clone diverges from the source.
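
A zero-copy clone is a single DDL statement. The sketch below builds one, optionally pinned to a past point in time. The helper and the table names are hypothetical, and the output uses Snowflake's CREATE TABLE ... CLONE form with an optional AT clause.

```python
from typing import Optional


def clone_table_sql(source: str, target: str, seconds_ago: Optional[int] = None) -> str:
    """Return a CREATE TABLE ... CLONE statement.

    If `seconds_ago` is given, the clone reflects the source as it was
    that many seconds in the past (within the retention period).
    """
    stmt = f"CREATE TABLE {target} CLONE {source}"
    if seconds_ago is not None:
        if seconds_ago <= 0:
            raise ValueError("seconds_ago must be positive")
        stmt += f" AT(OFFSET => -{seconds_ago})"
    return stmt


# Hypothetical names: a dev copy of production, and a copy from one hour ago.
print(clone_table_sql("prod.sales.orders", "dev.sales.orders"))
print(clone_table_sql("prod.sales.orders", "dev.sales.orders_1h", seconds_ago=3600))
```

Because the clone initially points at the same micro-partitions as the source, no data is copied when the statement runs.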

Compute. Virtual warehouses are the unit of compute. Sizes run from XS at one server to 6XL at 512 servers, with credit consumption doubling at each step. Multi-cluster warehouses, available on Enterprise edition and above, scale horizontally by spinning up additional clusters during periods of high concurrency. Auto-suspend and auto-resume remove the operational burden of capacity management. A warehouse with auto-suspend set to 60 seconds suspends after 60 seconds without queries and resumes within seconds when the next query arrives.
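
The sizing and billing arithmetic can be sketched in a few lines. The doubling rule comes from the paragraph above. The 60-second minimum charge each time a warehouse starts or resumes is an added assumption drawn from Snowflake's published billing rules for standard warehouses, so verify it against your contract. Function names are made up for the example.

```python
# T-shirt sizes in order; each step doubles the credits consumed per hour.
SIZES = ["XS", "S", "M", "L", "XL", "2XL", "3XL", "4XL", "5XL", "6XL"]


def credits_per_hour(size: str) -> int:
    """XS burns 1 credit per hour; every larger size doubles the previous one."""
    return 2 ** SIZES.index(size)


def credits_for_run(size: str, seconds_running: int) -> float:
    """Credits for one continuous run of a single-cluster warehouse.

    Assumes per-second billing with a 60-second minimum per start or resume.
    """
    billed_seconds = max(seconds_running, 60)
    return credits_per_hour(size) * billed_seconds / 3600


print(credits_per_hour("6XL"))      # 512
print(credits_for_run("M", 90))     # 4 credits/hour for 90 seconds
print(credits_for_run("XS", 10))    # billed as 60 seconds
```

The sketch ignores multi-cluster warehouses, where each running cluster consumes credits at the rate for its size.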

Sharing. Secure data sharing allows two accounts to query the same physical data without ETL or data duplication, governed by access controls on the provider side. The Snowflake Marketplace exposes the same primitives at scale, enabling customers to query third-party data in place. Reader accounts let providers grant access to consumers who do not run Snowflake themselves.
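
On the provider side, sharing a table comes down to a handful of grants. The sketch below assembles them in order. The helper and every object and account name are hypothetical. The statement forms (CREATE SHARE, GRANT ... TO SHARE, ALTER SHARE ... ADD ACCOUNTS) are Snowflake's documented sharing DDL.

```python
from typing import List


def share_table_statements(
    share: str, database: str, schema: str, table: str, consumer_account: str
) -> List[str]:
    """Return the provider-side statements that expose one table through a share."""
    return [
        f"CREATE SHARE IF NOT EXISTS {share}",
        # The consumer needs USAGE on the containers before it can see the table.
        f"GRANT USAGE ON DATABASE {database} TO SHARE {share}",
        f"GRANT USAGE ON SCHEMA {database}.{schema} TO SHARE {share}",
        f"GRANT SELECT ON TABLE {database}.{schema}.{table} TO SHARE {share}",
        # Finally, name the account that is allowed to consume the share.
        f"ALTER SHARE {share} ADD ACCOUNTS = {consumer_account}",
    ]


# Hypothetical names throughout.
for stmt in share_table_statements(
    "SALES_SHARE", "SALES_DB", "PUBLIC", "ORDERS", "PARTNER_ORG.PARTNER_ACCT"
):
    print(stmt + ";")
```

No statement in the list moves or copies data. The consumer account queries the provider's storage directly with its own compute.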

Governance primitives. Snowflake Horizon Catalog ships with the platform on Enterprise and above. Object tagging and classifications attach business context to Snowflake-managed objects. Masking and row-access policies enforce rules at query time. Universal Search finds objects across every database in the account, and Cortex Search adds natural language discovery on top. The full walkthrough of Horizon and the Open Catalog companion that handles Iceberg lives in the data catalog article in this series.
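
As one example of a rule enforced at query time, the sketch below builds a column masking policy and the statement that attaches it to a column. The policy name, role name, table, and mask string are hypothetical. The DDL follows Snowflake's CREATE MASKING POLICY and ALTER TABLE ... SET MASKING POLICY forms.

```python
def masking_policy_sql(policy: str, allowed_role: str, mask: str = "***MASKED***") -> str:
    """Return DDL for a string masking policy that reveals values to one role only."""
    return (
        f"CREATE MASKING POLICY IF NOT EXISTS {policy} "
        "AS (val STRING) RETURNS STRING -> "
        f"CASE WHEN CURRENT_ROLE() IN ('{allowed_role}') THEN val "
        f"ELSE '{mask}' END"
    )


def apply_masking_policy_sql(table: str, column: str, policy: str) -> str:
    """Return the statement that attaches an existing policy to a column."""
    return f"ALTER TABLE {table} MODIFY COLUMN {column} SET MASKING POLICY {policy}"


# Hypothetical policy, role, and table.
print(masking_policy_sql("EMAIL_MASK", "PII_READER") + ";")
print(apply_masking_policy_sql("CUSTOMERS", "EMAIL", "EMAIL_MASK") + ";")
```

Once attached, the policy is evaluated on every query that touches the column, so the same SELECT returns clear text to one role and the mask to everyone else.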

AI. Snowflake's AI surface includes Cortex for managed access to LLMs and embedding models, plus Cortex Search for semantic search over Snowflake data. Snowpark Container Services hosts custom model serving, and Snowflake Intelligence agents answer natural language questions across the platform. All four capabilities run on Snowflake's compute and read from Snowflake's storage. The metadata those capabilities can read defines the boundary of what they can answer, which is where the layers above the warehouse start to matter.

How Snowflake compares to other data warehouses

The compute-storage separation is the architectural decision that distinguishes Snowflake from earlier warehouses. The contrast clarifies what the platform's design optimizes for.

Legacy on-premises warehouses (Teradata, Oracle Exadata, Vertica). Coupled compute and storage. Capacity planning was required, often years in advance. Hardware appliances were dedicated to specific workloads. Independent scaling of storage and compute was not possible by design.

Google BigQuery. Serverless compute model with no warehouse sizing. Queries draw from a shared pool of slots on Google Cloud, billed by data scanned or by reservation. Compute and storage are separated, but the abstraction differs from Snowflake's virtual warehouses, and the platform is integrated tightly with Google Cloud-native services.

Databricks SQL Warehouse. Lakehouse-first architecture, built around Delta Lake (and now Iceberg) on cloud object storage. Compute lives in clusters that run on Apache Spark. Snowflake and Databricks have converged on Iceberg as a shared table format, and the architectural distinction has narrowed over the past two years.

Amazon Redshift. Tighter integration with AWS services. RA3 nodes added compute-storage separation in 2019, several years after Snowflake's design landed in the field. Concurrency scaling on Redshift requires more capacity planning than Snowflake's multi-cluster model.

The architectural pattern introduced by Snowflake has now influenced every cloud warehouse. Compute-storage separation is the default. The remaining question is how the platform's engineering decisions translate to your team's workloads, your existing cloud commitments, and your governance requirements across the rest of the stack.

Where the platform's responsibility ends

Snowflake's architecture defines what the platform is responsible for: storage, compute, the cloud services that orchestrate them, and the governance primitives that operate on Snowflake-managed objects. Each is engineered to a high standard. The architecture also defines what falls outside the platform's responsibility, which is most of what shapes how trustworthy your data estate is for AI and governance.

Cross-platform metadata. Horizon Catalog inventories what Snowflake manages. It does not inventory the dbt models that load the warehouse, the Tableau and Looker dashboards that read from it, the Kafka topics that feed it, the S3 buckets that hold pre-load files, or the Salesforce records the data originated from. Each of those systems carries its own metadata, and a unified view requires a layer that ingests all of them.

Business meaning. A column called customer_id exists in the schema. What customer_id means for your business (whether it represents a household or an individual, whether it is the same customer across the orders system and the CRM, and whether the definition has changed over the past two years) is not in the schema. Glossary definitions, ownership records, and business context become necessary inputs whenever an analyst or AI workload needs to interpret the data the way the business does.

AI-ready context. Cortex can run an LLM. Without business glossary terms and ownership records, plus lineage and policy context from outside Snowflake, the LLM substitutes its training data for your business logic. The cohort retention failure described in the text-to-SQL benchmark blog is one example. A query ran without error and returned the wrong answer because the model had no encoded definition of what "month" meant in the business.
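
One way to picture the difference context makes is to look at what actually reaches the model. The sketch below prepends glossary definitions to a question and wraps the result in a call to the SNOWFLAKE.CORTEX.COMPLETE SQL function. The glossary entry, the helper names, and the model name are assumptions for illustration. Model availability varies by region and account, and this is not a description of how any particular product assembles context.

```python
from typing import Dict


def build_prompt(question: str, glossary: Dict[str, str]) -> str:
    """Prefix the question with business definitions the model cannot know."""
    lines = ["Use these business definitions when answering:"]
    for term, definition in sorted(glossary.items()):
        lines.append(f"- {term}: {definition}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def cortex_complete_sql(model: str, prompt: str) -> str:
    """Wrap a prompt in a SNOWFLAKE.CORTEX.COMPLETE call, escaping the literal."""
    # Escape backslashes first, then single quotes, for a SQL string literal.
    escaped = prompt.replace("\\", "\\\\").replace("'", "''")
    return f"SELECT SNOWFLAKE.CORTEX.COMPLETE('{model}', '{escaped}') AS answer"


# Hypothetical glossary entry; a real one would come from a governed glossary.
glossary = {"month": "A fiscal month as defined by the finance team's calendar."}
prompt = build_prompt("What was cohort retention last month?", glossary)
print(cortex_complete_sql("mistral-large", prompt))
```

Without the glossary lines, the same call still runs and still returns an answer, just one built on the model's own idea of "month".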

Governance across the data estate. Tags applied to a Snowflake table do not propagate to the same data when it lands in Databricks or BigQuery. Data classified as PII in Snowflake can re-enter your estate from a SaaS source without that classification attached. A governance program that operates only inside Snowflake leaves the rest of the estate uncovered.

None of this is a critique of Snowflake's design. The architecture delivers what it was built to deliver, and the rest of the data estate's needs sit on different layers by design. Snowflake decoupled compute from storage, an architectural breakthrough that defined the platform. The next breakthrough decouples business meaning from any one workload, so glossary definitions, lineage, and governance sit on a shared layer that every tool can read.

The two layers above: metadata and catalog

Two layers rise above the Snowflake substrate, and the two sister articles in this series cover each in depth.

The metadata layer is what data is, who owns it, and how it changes. Snowflake exposes the technical slice of metadata through INFORMATION_SCHEMA, ACCOUNT_USAGE, SHOW, DESCRIBE, and metadata functions. The business and operational slices (definitions, ownership, classifications, lineage across the estate) require a layer that sits above the warehouse and connects to the rest of your stack. The first sister article, Snowflake metadata: the three layers and how to manage them, walks through the access patterns and where the boundary sits.

The catalog layer is where data lives, what it means, and who can use it. Snowflake ships Horizon Catalog for Snowflake-managed objects and Open Catalog for Apache Iceberg tables. Both catalogs serve their purpose well. The estate-wide catalog (one search, one glossary, one lineage view across Snowflake, dbt, Tableau, Databricks, ML feature stores, and SaaS sources) needs a layer that sits above both. The second sister article, Snowflake data catalog: two native options, one data estate, covers Horizon, Open Catalog, and where a third-party layer takes over.
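
For a sense of what the technical slice looks like, the sketch below builds a query against INFORMATION_SCHEMA.COLUMNS. The helper name and the database and schema names are hypothetical. The view and the columns selected are part of Snowflake's standard information schema.

```python
def columns_query(database: str, schema: str) -> str:
    """Return a query listing every column in a schema with its type and nullability."""
    return (
        "SELECT table_name, column_name, data_type, is_nullable "
        f"FROM {database}.INFORMATION_SCHEMA.COLUMNS "
        # Unquoted identifiers are stored in upper case, so normalise the filter.
        f"WHERE table_schema = '{schema.upper()}' "
        "ORDER BY table_name, ordinal_position"
    )


# Hypothetical database and schema.
print(columns_query("SALES_DB", "public"))
```

The result tells you that a column exists and what type it holds. It carries no owner, definition, or cross-system lineage, which is the gap the paragraph above describes.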

Together with semantic primitives like glossary definitions, knowledge centers, and metric specifications, the metadata and catalog layers form what an AI workload reads before it answers. Collate is a semantic intelligence platform built on the Apache 2.0 OpenMetadata foundation, and it delivers those layers across the entire data estate so Snowflake's substrate sits below a layer that any workload can rely on.

What comes after compute-storage separation

Snowflake's compute-storage separation made the modern data warehouse possible. What comes next is not about storage or compute. The next decade of data work happens on the layers above, where meaning is encoded, governance extends across the whole estate, and AI workloads pick up the business context they need to return defensible answers.

Every breakthrough architecture defines a new boundary. Snowflake defines its boundaries at the platform level. You choose the layer above, and the choice you make in 2026 shapes what your AI workloads will run on for the next decade.

See Collate's Snowflake learning center for the full picture of what a semantic intelligence platform covers and how it operates above Snowflake.


Frequently asked questions

What is the Snowflake Data Warehouse?

The Snowflake Data Warehouse is a fully managed cloud data platform built on a multi-cluster, shared-data architecture. Storage holds data once on cloud object storage as columnar micro-partitioned files, compute runs in virtual warehouses that scale independently from XS to 6XL, and a cloud services layer holds metadata and orchestrates the work between them. The platform supports business intelligence, data engineering, data lake patterns with Apache Iceberg, machine learning with Snowpark, and AI workloads with Cortex.

How is Snowflake different from a traditional data warehouse?

Traditional warehouses such as Teradata, Oracle Exadata, and Vertica ran on shared-nothing clusters where compute and storage scaled together. Adding more storage required adding compute, and multiple workloads on the same data competed for the same hardware. Snowflake separated compute from storage and put both behind a cloud services layer, enabling independent scaling, workload isolation, secure data sharing, time travel, and per-second billing. The 2016 SIGMOD paper that described the architecture set the pattern that every cloud warehouse now follows.

What workloads is Snowflake designed for?

Snowflake is designed for the workloads enterprises run on a SQL-first platform. Business intelligence and analytics are the historic core. Data engineering pipelines load and transform data through SQL or Snowpark. Apache Iceberg support and Snowflake Open Catalog cover the data lake pattern. Machine learning runs through Snowpark Container Services, and agentic AI runs through Cortex LLM functions, Cortex Search, and Snowflake Intelligence agents. The multi-cluster architecture lets each workload run on its own compute without interfering with the others.

How does Snowflake handle governance?

Snowflake Horizon Catalog ships with the platform and governs Snowflake-managed objects. It includes object tagging and classifications for business context, plus masking and row-access policies that enforce rules at query time. Universal Search finds objects across every database in the account, and Cortex Search adds natural language discovery on top. These capabilities require Enterprise edition or higher. Governance that spans Snowflake, Databricks, Tableau, Salesforce, and other systems requires a layer above Horizon to ingest metadata from all sources.

What does Snowflake leave to other layers?

Snowflake's architecture handles the substrate, including storage and compute, as well as cloud services and governance for Snowflake-managed objects. It leaves cross-platform metadata and business meaning to the layers above, along with AI-ready context and governance across the rest of the estate. A column called customer_id exists in the Snowflake schema; what it means for the business does not. Cortex can run an LLM, but the model's answers depend on metadata, glossary terms, and lineage that often live outside Snowflake. The layer above the warehouse provides those inputs.

How does AI work on Snowflake?

Snowflake's AI surface includes Cortex for managed access to LLMs and embedding models, plus Cortex Search for semantic search over Snowflake data. Snowpark Container Services hosts custom model serving, and Snowflake Intelligence agents answer natural language questions across the platform. All four capabilities run on Snowflake's compute and read from Snowflake's storage. The accuracy of the answers depends on the quality of the metadata and business context the model can read, which is why a semantic context layer above the warehouse matters for production AI workloads.
