Introducing Data Quality as Code: Validate Data Before It Lands, Not After
Every data team knows the scenario: By the time quality tests catch errors, bad data has already reached production and hit user dashboards and business reports. The problem isn't that organizations don't test their data; it's that they test it too late. Traditional approaches validate data after it lands in destination tables, turning data quality into damage control rather than prevention.
OpenMetadata’s new Data Quality as Code capability changes this. This developer-first approach shifts validation left in your pipelines, letting data engineers integrate quality checks directly into ETL workflows using Python code. The result: bad data never makes it to production, and data quality becomes proactive prevention rather than reactive firefighting. These capabilities complement the business-user-friendly UI-based approach already available in OpenMetadata, with shared data quality dashboarding, reporting, and incident management across both. Together, they deliver a comprehensive end-to-end solution for both technical and non-technical users, providing an industry-leading approach to data quality.
The Challenge: Siloed Quality Checks That Come Too Late
Organizations invest heavily in data quality tools and testing frameworks. Yet three fundamental challenges persist:
1. Decentralized Testing Creates Inconsistency and Silos
When different teams define quality tests independently, inconsistency creeps in. Consider a simple example: checking for null values. One team uses a not null test that checks column IS NOT NULL. Another team uses a missing value test that also accounts for empty strings, "NA" values, and the string "null" — all of which SQL treats as non-null.
Both teams think they're checking for the same thing, but they're not. When asked to compare data quality across datasets, the results don't align because they're using different tests for the same validation concept. Making matters worse, these inconsistent tests are scattered across different pipelines with no central inventory. Teams can't see what validations already exist, leading to duplicate work and coverage gaps.
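To make the divergence concrete, here is a minimal illustration in plain pandas (not the OpenMetadata SDK), assuming a column that mixes a true NULL with an empty string and placeholder text:

import pandas as pd

emails = pd.Series(["a@example.com", None, "", "NA", "null"])

# "Not null" semantics: only the true NULL counts as missing
print(emails.isna().sum())                                # 1

# "Missing values" semantics: NULL, "", "NA", and "null" all count
missing = emails.isna() | emails.isin(["", "NA", "null"])
print(missing.sum())                                      # 4

The two teams report one and four missing values on identical data, which is exactly the kind of mismatch that surfaces when quality comparisons are attempted across datasets.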
Data engineers need centralized control — standard validation rules that apply uniformly across all pipelines. They need a shared approach where tests are defined once and reused consistently, ensuring that when two teams check for missing values, they're using the exact same logic. Additionally, these data teams need a single pane of glass for data quality testing and results across different pipelines, or they risk creating further quality metadata silos across their data landscape.
2. Quality Testing Lives Outside the Development Workflow
Data engineers build and maintain pipelines in code: Python scripts, Spark jobs, dbt transformations. But configuring quality tests often requires switching contexts to a separate UI, learning different workflows, and managing tests outside version control. This context switching slows development and creates a disconnect between data transformation logic and data validation logic.
Most data teams want quality checks to live alongside their transformation code: defined programmatically, version-controlled with the pipeline, and executed as part of the same workflow.
3. By the Time Quality Tests Run, It's Too Late
Here's the problem with traditional data quality architectures: the standard pattern loads transformed data into model tables first, then runs quality tests. If tests fail, the data is already accessible to BI tools, data scientists, and business users. You catch the problem after the fact, when the downstream impact has already occurred.
What data teams really need is a circuit breaker — a way to test data quality during transformation and prevent bad data from landing in production tables in the first place.
Introducing Data Quality as Code
OpenMetadata’s Data Quality as Code transforms how organizations validate data by embedding quality checks directly into data pipelines through the Python SDK. This developer-first approach gives data engineers the ability to:
- Define tests programmatically using Python code alongside transformation logic
- Execute validation during transformation, before data loads to destination tables
- Act as a circuit breaker to stop, roll back, or filter bad data automatically
- Leverage centralized test definitions created by data stewards in the OpenMetadata UI
- Publish results back to OpenMetadata for unified visibility and alerting
Instead of bolting quality checks onto the end of pipelines, data engineers can integrate them as a native part of the data engineering workflow: validating early, failing fast, and ensuring only clean data reaches production.
Four Pillars of Data Quality as Code
Pillar 1: A Developer-Native SDK for Quality Testing
Data engineers can use the SDK to import and run any test type supported by OpenMetadata — null checks, uniqueness validation, value ranges, custom SQL queries, and more — without leaving their development environment. The SDK automatically leverages the service connections, schemas, and metadata you've already configured. Tests execute against live data sources with the same validation engine that powers OpenMetadata's UI-based testing, ensuring consistency across all validation methods.
from metadata.sdk.data_quality import TestRunner, TableRowCountToBeBetween

# Point the runner at a table already cataloged in OpenMetadata
runner = TestRunner.for_table("MySQL.default.db.table")
# Fail if the table holds fewer than 100 or more than 1000 rows
runner.add_test(TableRowCountToBeBetween(min_count=100, max_count=1000))
results = runner.run()  # Publishes results to OpenMetadata
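Because tests are just Python objects, several checks can be stacked on one runner. A minimal sketch, assuming add_test can be called once per check and reusing only the two test classes that appear in this post:

from metadata.sdk.data_quality import (
    TestRunner,
    TableRowCountToBeBetween,
    ColumnValuesToBeNotNull,
)

runner = TestRunner.for_table("MySQL.default.db.table")
runner.add_test(TableRowCountToBeBetween(min_count=100, max_count=1000))  # volume check
runner.add_test(ColumnValuesToBeNotNull(column="id"))                     # completeness check
results = runner.run()  # executes against the live source and publishes to OpenMetadata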
Pillar 2: Centralized Governance, Decentralized Execution
Data quality checks happen at two levels, and complete coverage requires both:
- Structural quality checks are standardized rules that data engineers want to enforce uniformly across all pipelines. These include fundamental validations such as "ID columns must not be null" and "columns should have no missing values."
- Business quality checks are domain-specific rules that teams close to the data implement based on their understanding of the business context. Examples include checking that price fields are never negative or that discount percentages fall within expected ranges.
Data Quality as Code supports both approaches:
- For structural checks, data engineers create a centralized library of test definitions: essentially a shared Python module that all pipelines import and execute (see the sketch after this list). This ensures that when Team A and Team B both check for missing values, they're using the exact same test logic.
Here's how missing-value consistency works in practice: OpenMetadata provides both a "not null" test and a "missing values" test. The not null test simply checks column IS NOT NULL. The missing values test also accounts for NULL values, empty strings (""), "NA" strings, and the string "null": all values that teams might consider "missing" even though SQL treats some as non-null. With a centralized library that uses the missing-values test, both teams get consistent results because they're using the same test definition.
- For business checks, data stewards define tests through the OpenMetadata UI, adding domain knowledge and validation rules directly to dataset metadata. Data engineers then load these UI-defined tests into their code with a single call:

# Automatically loads all tests defined in the UI
runner = TestRunner.for_table("BigQuery.analytics.customer_360")
results = runner.run()

This bridges the gap between centralized governance (everyone can see and manage test definitions in OpenMetadata) and decentralized execution (teams run tests in their own pipelines on their own schedules).
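One way to realize the centralized library described in the structural-checks bullet above is a small shared Python module that every pipeline imports. This is a sketch under assumptions: the module name quality_standards and the helper standard_missing_value_checks are illustrative, and ColumnValuesToBeNotNull stands in for whichever missing-values test class your SDK version exposes.

# quality_standards.py: shared module (hypothetical name)
from metadata.sdk.data_quality import ColumnValuesToBeNotNull

def standard_missing_value_checks(columns):
    # Assumption: substitute your organization's preferred missing-values test class;
    # ColumnValuesToBeNotNull is used here because it appears earlier in this post.
    return [ColumnValuesToBeNotNull(column=c) for c in columns]

# In any pipeline that imports the shared module:
from metadata.sdk.data_quality import TestRunner
from quality_standards import standard_missing_value_checks

runner = TestRunner.for_table("MySQL.default.db.table")
for test in standard_missing_value_checks(["id", "email"]):
    runner.add_test(test)
results = runner.run()  # same definitions, same results, for every team

Because every pipeline pulls its checks from the same module, the definition of "missing" cannot drift between teams.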
Pillar 3: Keep Data Clean with Circuit Breaker Validation
One of the core features of Data Quality as Code is its ability to prevent bad data from reaching production. Traditional architectures test data after loading:
Traditional Approach: Extract → Transform → Load to destination tables → Run quality tests (bad data has already landed)
Data Quality as Code uses the DataFrame validator, which shifts the process left, letting you test during transformation:
Data Quality as Code: Extract → Transform → Run quality tests → Load to destination tables (only validated data lands)
Here's a concrete example using DataFrame validation:
import pandas as pd
from metadata.sdk.data_quality.dataframes import DataFrameValidator
from metadata.sdk.data_quality import ColumnValuesToBeNotNull

# Validate the DataFrame in memory, before it is loaded anywhere
df = pd.read_csv('path/to/data.csv')
validator = DataFrameValidator()
validator.add_test(ColumnValuesToBeNotNull(column="email"))
result = validator.validate(df)

if result.success:
    load_to_destination(df)  # load only when validation passes
result.publish("MySQL.default.db.table")  # record the outcome in OpenMetadata
This circuit breaker pattern gives you complete control over how to handle quality failures:
- Stop processing entirely: Prevent any data from landing if critical validations fail
- Rollback transactions: Revert partial loads when issues are detected
- Filter bad records: Load only the rows that pass validation and quarantine failures for review (see the sketch after this list)
- Alert and delay: Notify teams of issues and delay the pipeline until resolved
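As an illustration of the filter-and-quarantine strategy, here is a sketch that reuses only the DataFrameValidator calls shown above; load_to_destination and the quarantine file path are hypothetical placeholders, and the pandas filtering simply mirrors the not-null check the validator applies.

import pandas as pd
from metadata.sdk.data_quality.dataframes import DataFrameValidator
from metadata.sdk.data_quality import ColumnValuesToBeNotNull

df = pd.read_csv("path/to/data.csv")

validator = DataFrameValidator()
validator.add_test(ColumnValuesToBeNotNull(column="email"))
result = validator.validate(df)

if result.success:
    load_to_destination(df)                      # hypothetical loader, as in the example above
else:
    good_rows = df[df["email"].notna()]          # mirror the validator's predicate in pandas
    bad_rows = df[df["email"].isna()]
    load_to_destination(good_rows)               # load only the rows that pass
    bad_rows.to_csv("quarantined_rows.csv", index=False)  # park failures for review

result.publish("MySQL.default.db.table")         # results are visible in OpenMetadata either way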
No matter which strategy you choose, the outcome is the same: bad data never reaches downstream consumers, dashboards stay accurate, and trust in your data platform remains intact.
Pillar 4: Your Data, Your Infrastructure, Your Security
Data Quality as Code doesn't lock you into a specific infrastructure or deployment model. Unlike SaaS-only solutions that force your data through external systems, Data Quality as Code executes validation wherever your data lives: on your infrastructure, using your compute resources, with your security controls.
Run Tests Where Your Data Lives
The Python SDK executes tests directly against your data sources using the service connections you've already configured in OpenMetadata. No data leaves your environment. Whether your data warehouse runs in AWS, GCP, Azure, or on-premises, validation happens within your security perimeter:
# This code will run in your infrastructure of choice, directly against your data sources.
runner = TestRunner.for_table("Snowflake.production.analytics.customers")
results = runner.run()
This architecture means:
- Data never moves: Tests query your systems directly; no data replication to external services
- Data sampling stays in your control: Unless explicitly configured, OpenMetadata keeps only your metadata; your data never leaves your infrastructure
- Support for multiple secret managers: If you're already using a cloud secret management solution, OpenMetadata supports it out of the box
- Existing security applies: Authentication, authorization, and access controls work the same way they do for your other tools
- Compute costs stay predictable: Test execution uses your existing infrastructure budget, not surprise charges from external validation services
The Path Forward
Data quality testing shouldn't happen after the damage is done. With Data Quality as Code, you validate during transformation, prevent bad data from landing in production, and turn quality checks into a natural part of your development workflow.
This is Part 1 in our Data Quality as Code series. We've covered what it is and why it matters. In Part 2, we'll show you exactly how to implement it: setting up the SDK, building validation patterns, and integrating quality checks into your existing pipelines.
Continue to the implementation guide →
Want to try it now? Data Quality as Code is available in OpenMetadata 1.11+. View the documentation or explore our sandbox with demo data.