Why Semantic Intelligence Is the Missing Foundation for Trusted AI
A frequent question we hear from data leaders is:
“We invested in a data catalog. Why don’t our teams trust the data?”
The question stems from how our industry has approached metadata over the past decade. We treated it as an inventory problem. Build a catalog, index the data assets, assign owners, and track lineage. The assumption was that if people could find data, they would use it. But what if the real problem was never about finding data? What if the real problem is that nobody agrees on what the data means?
I’ve spent over a decade building metadata systems at Hortonworks and Uber. At every stop, the same pattern emerged. Teams could find data, but they just couldn’t agree on what it meant. At Uber, Finance, Marketing, and Product each had their own definition of “monthly active users.” Engineering maintained five different “trips” tables. When executives asked “how many miles did drivers drive last month,” they got different answers because everyone worked from different definitions with no way to know which was authoritative.
We tried better documentation, data quality tools, and governance workflows. None of it worked because we were treating symptoms. The root cause was that the metadata had no meaning. Metadata tells you what exists, but it doesn’t tell you what it means. And you can’t solve problems of meaning, relationships, and context by adding more fields to a catalog.
AI turns semantic ambiguity from a slow problem into a dangerous one
Humans work around semantic ambiguity every day. A data analyst sees five customer tables and messages a colleague: “Which one for the EMEA quarterly report?” Problem solved, albeit slowly. LLMs can’t do this. When an AI agent encounters ambiguous data, it picks something and proceeds at machine speed. That speed is what changes the calculus.
Consider a logistics company processing nine million package-tracking records daily with 18,000 dashboards across the organization. Most contain overlapping, contradictory views of the same metrics. When humans were the only consumers, the confusion was manageable. People built informal networks: “Ask Sarah, she knows which table is right.” Those informal networks don’t scale to AI agents making hundreds of decisions per minute. The same ambiguity that caused dashboard conflicts in the BI era (remember the “single version of the truth” mantra?) now leads to AI hallucinations, incorrect recommendations, and compliance violations.
We see this across the organizations we work with. AI initiatives stall because teams spend months feeding LLMs “context” that lacks shared definitions. Outputs are unreliable because the model doesn’t know which customer definition matters for which use case. Governance policies written in documents can’t be operationalized because the AI agent has no semantic understanding of what constitutes EU data or where it flows.
The organizations succeeding with AI have solved this by building what we call semantic intelligence: the layer that gives both humans and machines a shared understanding of what data means.
What actually changes when you add meaning to metadata
Let me be specific. Traditional metadata for an email column includes the column name, data type, owner, and a PII tag. That’s useful but incomplete. Semantic intelligence gives you that same column mapped to a standard type (schema.org/Person/email), connected to a meaning (“primary contact email for customer communications”), linked to related concepts (Person, Customer, Contact Information), governed by specific policies (PII.Email, subject to GDPR, can’t be copied to non-EU regions), and annotated with usage context (feeds 12 dashboards and 3 ML models).
The first tells you it’s an email field. The second tells you what it means in your business context, how it relates to other concepts, how it should be governed, and who depends on it. That distinction makes three things possible that cataloged metadata alone can’t achieve.
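The contrast is easier to see side by side. The sketch below is purely illustrative: the field names and values are hypothetical, not the OpenMetadata schema, but they capture the structural difference between an inventory record and a semantic one.

```python
# Illustrative sketch only: field names and values are hypothetical,
# not the actual OpenMetadata schema.

traditional_metadata = {
    "column": "email",
    "type": "VARCHAR(255)",
    "owner": "crm-team",
    "tags": ["PII"],
}

semantic_metadata = {
    "column": "email",
    "type": "VARCHAR(255)",
    "owner": "crm-team",
    # A standard semantic type anchors the column to a shared vocabulary.
    "semantic_type": "schema.org/Person/email",
    "meaning": "primary contact email for customer communications",
    "related_concepts": ["Person", "Customer", "Contact Information"],
    # Policies become machine-readable constraints, not prose in a document.
    "policies": {
        "classification": "PII.Email",
        "regulations": ["GDPR"],
        "residency": "must-not-replicate-outside-EU",
    },
    # Usage context tells consumers (human or AI) who depends on this field.
    "used_by": {"dashboards": 12, "ml_models": 3},
}

def is_safe_to_copy(meta: dict, target_region: str) -> bool:
    """Return False if a residency constraint forbids copying to the region."""
    policy = meta.get("policies", {})
    if policy.get("residency") == "must-not-replicate-outside-EU":
        return target_region.startswith("eu-")
    return True

print(is_safe_to_copy(semantic_metadata, "us-east-1"))  # → False
```

With the traditional record, a consumer (human or agent) can only see that the column exists; with the semantic record, it can check a concrete question like "may this be copied to us-east-1?" without asking anyone.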
AI can reason, not just retrieve. When you ask “show me customer lifetime value,” the system understands CLV as a calculated metric with a specific business definition. It knows the right tables to use, their quality scores, and why those are the right choice. We tested this directly. In our initial internal benchmarks against a complex, multi-database environment with cross-table joins, semantic intelligence improved text-to-SQL accuracy from 10.8% to 54.0%, a 43-percentage-point gain driven by grounded semantic context rather than raw metadata. The absolute score stays modest because the benchmark itself is hard, but the semantic layer proved its value in resolving ambiguity.
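A minimal sketch of what “grounding” means here, under stated assumptions: the metric registry, table names, scores, and matching logic below are invented for illustration, not our production implementation. The idea is that the question is resolved to a governed definition before any SQL is generated, so the model never has to guess among look-alike tables.

```python
# Hypothetical sketch of grounding a text-to-SQL request in a semantic layer.
# The registry contents and matching logic are illustrative assumptions.

METRICS = {
    "customer lifetime value": {
        "abbrev": "CLV",
        "definition": "sum of gross margin per customer over the relationship",
        "source_table": "finance.customer_margin_monthly",  # authoritative table
        "quality_score": 0.97,
    }
}

def ground_query(question: str):
    """Resolve a natural-language question to a governed metric definition."""
    q = question.lower()
    for name, metric in METRICS.items():
        if name in q or metric["abbrev"].lower() in q:
            return metric
    return None  # no semantic grounding: the caller should ask, not guess

metric = ground_query("Show me customer lifetime value by customer")
if metric:
    # The LLM prompt now carries the definition and the authoritative table,
    # instead of letting the model pick among similarly named tables.
    prompt_context = (
        f"Metric definition: {metric['definition']}\n"
        f"Authoritative table: {metric['source_table']} "
        f"(quality score {metric['quality_score']})"
    )
    print(prompt_context)
```

The important design point is the `None` branch: when no governed definition matches, the system can decline or ask for clarification instead of proceeding at machine speed with a wrong guess.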
Governance becomes automatic. Classify a field as PII.SensitivePersonal, and that classification propagates everywhere it appears across your entire data estate, with access policies updating automatically. Carrefour Brazil operates at this scale today, where governance changes propagate across 133 petabytes of daily data and 33,000+ tables without human intervention.
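Mechanically, this kind of propagation is a traversal of the lineage graph. The sketch below is a simplified assumption about how it could work, not Collate’s implementation; the graph shape and column names are invented.

```python
# Minimal sketch of classification propagation over a metadata graph.
# The lineage graph and policy names below are illustrative assumptions.

from collections import deque

# Edges point from a column to the columns derived from it (lineage).
LINEAGE = {
    "raw.users.ssn": ["staging.users_clean.ssn"],
    "staging.users_clean.ssn": ["marts.customer_360.ssn", "ml.features.ssn_hash"],
}

def propagate(column: str, classification: str) -> dict:
    """Apply a classification to a column and everything downstream of it."""
    applied = {}
    queue = deque([column])
    while queue:
        current = queue.popleft()
        if current in applied:
            continue  # already visited; lineage graphs can contain diamonds
        applied[current] = classification
        queue.extend(LINEAGE.get(current, []))
    return applied

result = propagate("raw.users.ssn", "PII.SensitivePersonal")
# Every downstream copy now carries the tag, so access policies keyed on
# the classification update everywhere at once.
for col, tag in sorted(result.items()):
    print(col, "->", tag)
```

The human alternative, finding every derived copy by hand, is exactly the step that fails at 33,000-table scale.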
Semantic ambiguity becomes structurally impossible. When Marketing and Finance both say “customer,” the system knows whether they mean the same thing. Kansai Airports had exactly this problem with “passenger count”: some departments included crew and counted transfers, while others didn’t. Now there’s one semantic definition that both humans and AI understand in context.
The architectural decisions behind OpenMetadata
When we designed OpenMetadata, we made architectural choices based on what failed at Uber and what we learned at Yahoo and Hortonworks. These aren’t features we added later but rather decisions made on day one. In particular, three stand out:
- One semantic metadata graph, not modules bolted together. Traditional catalogs added lineage as a module, quality as another, and governance as a third. Each creates synchronization problems. We built one graph that makes every relationship between data, people, policies, and processes native.
- Metadata as executable infrastructure, not documentation. Business glossaries map directly to physical tables. Policies propagate to source systems. Quality tests are generated from semantic types and business rules. This is the difference between a document that says “this data is sensitive” and infrastructure that enforces it.
- Open source is the foundation. We built OpenMetadata as an open source project because it takes a community to get metadata right. No single vendor sees every data system, every edge case, every organizational pattern. The 3,000+ deployments and 400+ contributors aren’t just adoption metrics. They’re how the semantic model improves. This is the same lesson we learned at Hortonworks with Hadoop: open source wins when the community builds for real users, not for one company’s internal needs.
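“Quality tests generated from semantic types” can be sketched concretely. Everything below is a hedged illustration of the idea, not our actual test-generation engine: the check registry, type names, and result shape are assumptions.

```python
# Hedged sketch of deriving quality tests from semantic type annotations,
# the idea behind "metadata as executable infrastructure". Illustrative only.

import re

# Validity checks keyed by semantic type; a real system would have many more.
SEMANTIC_CHECKS = {
    "schema.org/Person/email": lambda v: re.fullmatch(
        r"[^@\s]+@[^@\s]+\.[^@\s]+", v
    ) is not None,
    "percentage": lambda v: 0.0 <= float(v) <= 100.0,
}

def generate_test(column: str, semantic_type: str):
    """Turn a semantic type annotation into an executable column test."""
    check = SEMANTIC_CHECKS[semantic_type]
    def test(values):
        failures = [v for v in values if not check(v)]
        return {"column": column, "passed": not failures, "failures": failures}
    return test

email_test = generate_test("customers.email", "schema.org/Person/email")
print(email_test(["a@example.com", "not-an-email"]))
# The test exists because the annotation exists; nobody wrote it by hand.
```

The point is the direction of the dependency: annotate the column once, and the enforcement machinery follows from the annotation rather than from a separately maintained document.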
We designed around open standards (JSON Schema mapped to RDF, DCAT, and DPROD) so your semantic model stays portable. And we built AI automation into the architecture from the start. Agents can document assets, propose quality tests, classify sensitive data, and enforce governance across 120+ integrated systems. This is work that would take human teams months, and it only functions because the meaning is machine-readable by design.
What this looks like in production
Our customers and community members see measurable results when they build on semantic intelligence across industries, geographies, and team sizes.
Wix, the Israel-founded, NASDAQ-listed global SaaS platform powering over 200 million users, built an AI-ready data foundation that supports in-production AI agents with governed semantic metadata. By centralizing 25,000+ data assets, 130,000+ lineage connections, and 6,000 data quality signals, Wix enables agents to reliably identify the right tables, understand schemas, and generate accurate queries in about one minute. With trusted, continuously updated metadata delivered via APIs and MCP, Wix improved AI agent accuracy and reduced on-call engineering toil by 675 hours per month, turning metadata into a production backbone for AI-driven product decisions.
VRT, the public broadcaster serving 2.3 million daily viewers in Flanders, built a single source of truth for metadata across its CMS and streaming platforms. Business users now configure custom data quality alerts in ~30 seconds without engineering support, dramatically reducing time-to-resolution. By grounding automation in consistent, governed metadata, VRT laid the foundation for AI-driven analysis and automation built on shared semantic meaning.
Mango, the global fashion retailer with over 1,000 data users across product, retail, ecommerce, and supply chain, replaced scattered in-house tooling with Collate to establish a single source of truth. Data contracts replaced email agreements between teams, reducing new data integration time threefold. Higher data quality feeds ML models for pricing and discounting, where even a 1% improvement in accuracy has a measurable revenue impact across thousands of stores.
Turning Semantic Intelligence into Action
A semantic foundation on its own is not the end goal. It is the prerequisite.
Once meaning is explicit, machine-readable, and governed, it becomes programmable. That is what enables the next step in our platform: AI agents that can reason over your data environment and take action safely.
With the 1.12 release, we are introducing Collate AI Studio and AI SDK, which give data teams the ability to build, tune, and deploy agents that operate on top of the semantic metadata graph. These agents can document assets, enrich semantic context, design quality tests, enforce governance policies, and automate repetitive data workflows, all grounded in shared, governed meaning.
Harsha Chintalapani, our Co-Founder and CTO, walks through the architecture, capabilities, and real-world use cases of AI Studio and AI SDK in detail in his blog.
If this post explains why semantic intelligence is required, his post shows how it becomes operational.
→ Read Harsha’s deep dive on AI Studio and AI SDK.
Where to start
You don’t need to model your entire data estate semantically on day one. Start with the 1,000 to 3,000 tables that drive actual business decisions (we have AI agents to automate asset tiering): the ones people query repeatedly, that feed critical dashboards, that power your most important models. Make business definitions executable by mapping them to physical tables, quality tests, and governance policies through the semantic graph. Automate from the start; manual documentation always decays. Our community learned this across 3,000+ deployments.
Woop’s two-person data team started with discovery and lineage for 1,600+ assets, expanded to governance and data contracts, and today serves 100+ internal users without adding headcount. Fundcraft onboarded new team members to full productivity in three hours. The pattern is consistent: prove value on one high-impact use case, then expand.
Start laying the semantic foundation now
Every organization we work with that successfully deployed AI built the semantic foundation first. The ones that struggled tried to bolt on meaning after the fact, and the cost of retrofitting always exceeded the cost of building it right. If your AI gives different answers to the same question, the problem isn’t the model. It’s that your organization never agreed on what the question means, or that meaning lives in someone’s head instead of in executable infrastructure.
We built OpenMetadata because metadata should be a solved problem by now. It shouldn’t take a decade-long career building three separate metadata systems to arrive at this conclusion, but here we are. The good news is that you don’t have to learn these lessons the hard way.
The semantic intelligence foundation is production-ready, open source, and backed by 3,000+ deployments and a community of 12,000+ members. Explore the platform at getcollate.io. Join the community on Slack. Dive into the code on GitHub. If you have feature requests, file a GitHub issue. If you’re stuck with a legacy system, let’s build a migration path together.