Data Contracts in Practice: A Field Guide for Analytics Engineers
Data contracts have been discussed as a concept in the analytics engineering community for several years. The actual practice of implementing them — in production, across a real team with real time pressure — turns out to be more specific and more learnable than the high-level discussion suggests. This is a field guide, not a philosophy post. We'll walk through what a contract actually contains, how to introduce one into an existing pipeline, what enforcement looks like in CI and in incremental runs, and where the common failure modes are.
What a data contract actually specifies
A data contract is a machine-readable specification of the guarantees an upstream data producer makes to downstream consumers about the structure and quality of a specific table or dataset. It covers four categories: three kinds of constraint (structural, semantic, and freshness) plus ownership metadata.
Structural constraints
These define which columns must be present, their data types, and whether they may contain nulls. A structural contract violation means the table shape changed in a way that downstream consumers didn't agree to. Example: account_id must be present, must be a non-null VARCHAR(36), and must match the pattern of a UUID.
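As a minimal sketch of the account_id example, the check below validates a single row represented as a Python dict. The function name check_account_id and the message strings are hypothetical, and a real pipeline would more likely run an equivalent check in the warehouse than row by row in Python.

```python
import re
from typing import Any

# Canonical 8-4-4-4-12 hexadecimal UUID layout (36 characters including hyphens).
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def check_account_id(row: dict[str, Any]) -> list[str]:
    """Return the structural violations for one row (empty list if valid)."""
    if "account_id" not in row:
        return ["account_id: column missing"]
    value = row["account_id"]
    if value is None:
        return ["account_id: null in required column"]
    if not isinstance(value, str):
        return [f"account_id: expected string, got {type(value).__name__}"]
    problems = []
    if len(value) > 36:
        problems.append("account_id: longer than 36 characters")
    # fullmatch requires the whole string to be a UUID, not just a prefix.
    if not UUID_PATTERN.fullmatch(value):
        problems.append("account_id: does not match UUID pattern")
    return problems
```

Returning a list of messages rather than raising on the first problem lets the caller count and report every violation in a batch.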
Semantic constraints
These define acceptable value ranges, cardinality, and referential relationships. Example: subscription_status must be one of: active, churned, paused, trial. monthly_recurring_revenue must be >= 0. customer_id must have a corresponding row in the customers table.
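The three example rules can be sketched as a single pass over a batch. This is an illustration only: check_semantics, the constraint labels, and the in-memory set of customer IDs are assumptions, and it presumes structural checks have already confirmed the columns exist and are non-null.

```python
ALLOWED_STATUSES = {"active", "churned", "paused", "trial"}


def check_semantics(rows, known_customer_ids):
    """Return (row_index, column, constraint) for every semantic violation.

    Assumes each row is a dict that already passed structural checks, so the
    three columns are present and non-null.
    """
    violations = []
    for i, row in enumerate(rows):
        if row["subscription_status"] not in ALLOWED_STATUSES:
            violations.append((i, "subscription_status", "accepted_values"))
        if row["monthly_recurring_revenue"] < 0:
            violations.append((i, "monthly_recurring_revenue", "min_value"))
        # Referential check: the customer must exist in the customers table.
        if row["customer_id"] not in known_customer_ids:
            violations.append((i, "customer_id", "referential"))
    return violations
```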
Freshness constraints
These define how recently the data must have been updated for it to be considered valid for consumption. Example: the fct_daily_revenue table must have a row with event_date = CURRENT_DATE - 1 by 08:00 UTC each morning.
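One way to express the fct_daily_revenue rule is as a function of the latest event_date in the table and the current time. The sketch below is hypothetical (the function name and the three status labels are ours), and it assumes the caller passes a timezone-aware datetime.

```python
from datetime import date, datetime, time, timedelta, timezone
from typing import Optional

DEADLINE_UTC = time(8, 0)  # yesterday's row must exist by 08:00 UTC


def freshness_status(latest_event_date: Optional[date], now: datetime) -> str:
    """Return 'fresh', 'pending' (deadline not reached yet), or 'stale'.

    `now` is assumed to be timezone-aware; it is converted to UTC here.
    """
    now_utc = now.astimezone(timezone.utc)
    expected = now_utc.date() - timedelta(days=1)
    if latest_event_date is not None and latest_event_date >= expected:
        return "fresh"
    if now_utc.time() < DEADLINE_UTC:
        return "pending"
    return "stale"
```

Separating "pending" from "stale" keeps the check from alerting during the hours before the deadline, when yesterday's row is legitimately still missing.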
Ownership metadata
These specify who is responsible for the contract: the team or person who produces the data, the point-of-contact for contract violations, and the escalation path. This turns a contract from a passive spec into an actionable communication channel.
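To make ownership actionable, the metadata has to be structured enough for tooling to route on. A minimal sketch, assuming a hypothetical Ownership record and a simple rule that walks the escalation path one step per unanswered alert:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ownership:
    producer_team: str               # team that produces the data
    contact: str                     # first point of contact for violations
    escalation_path: tuple[str, ...]  # who to notify next, in order


def escalation_target(owner: Ownership, unanswered_alerts: int) -> str:
    """Pick who to notify: the contact first, then each escalation step."""
    chain = (owner.contact, *owner.escalation_path)
    # Stay on the last step once the chain is exhausted.
    return chain[min(unanswered_alerts, len(chain) - 1)]
```

The team name, addresses, and escalation rule are placeholders; the point is only that a violation can be mapped to a named recipient without a human looking anything up.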
The producer-consumer relationship: who writes what
The most common question when introducing contracts is: who is responsible for writing them? In our experience, the pattern that actually gets maintained is: producers write contracts, consumers ratify them.
The upstream team closest to the source data has the most authoritative knowledge of what guarantees they can reliably make. They write the initial contract specification. Downstream consumers — typically the analytics engineers who build models on top — review the contract, flag any missing guarantees they need, and negotiate the final spec before the contract is enforced in CI.
This framing shifts the dynamic from "analytics engineers policing upstream teams" (which creates friction and resistance) to "upstream teams explicitly owning their output quality" (which creates accountability aligned with their work). It also means contract violations are routed to the right person: the producer, who can actually fix the source, rather than the consumer, who can only work around it.
Introducing a contract into an existing pipeline
Starting with a greenfield contract is easier than retrofitting one onto a pipeline that has been running for two years. Here is the sequence we recommend for a retrofit:
- Audit the current schema. Run a full column inventory of the target table. Note data types, null rates, value distributions, and cardinality for each column. This is your baseline — the "what it actually is" before you codify "what it should be."
- Identify the highest-stakes downstream consumers. Which metrics, reports, or dashboards depend on this table? Who reviews them for business decisions? These consumers define the minimum contract surface you need to protect.
- Write the contract as a YAML spec. Start with the columns those high-stakes consumers actually use. Don't try to contract every column in the first version. Four to six columns with clear constraints is better than 40 columns with vague ones.
- Run the contract against current data. Before enforcement, run the contract as a check against your actual data. How many rows violate the null constraints? Are there unexpected values in enumerated columns? Use this as a gap analysis, not a production block.
- Fix violations at the source or update the contract. Some violations are bugs in the source data — fix them. Some violations are expectations that don't match reality — update the contract. This negotiation phase typically takes one to two weeks.
- Promote to enforcement. Once the contract passes against current data, add it to CI and to the incremental load promotion gate. From this point, upstream changes that violate the contract block data from reaching the governed layer.
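The "run the contract against current data" step can be sketched as a report that counts violations per constraint without blocking anything. The function below is illustrative only: gap_analysis and its arguments are hypothetical, and rows are assumed to be dicts sampled from the existing table.

```python
from collections import Counter


def gap_analysis(rows, required, accepted_values):
    """Count violations per (column, constraint) without raising.

    required: iterable of column names that must be non-null.
    accepted_values: dict mapping column name to its set of allowed values.
    Returns {(column, constraint): (violating_rows, percent_of_rows)}.
    """
    counts = Counter()
    for row in rows:
        for col in required:
            if row.get(col) is None:
                counts[(col, "not_null")] += 1
        for col, allowed in accepted_values.items():
            value = row.get(col)
            # Nulls are reported by the not_null rule, not here.
            if value is not None and value not in allowed:
                counts[(col, "accepted_values")] += 1
    total = len(rows)
    return {key: (n, round(100 * n / total, 1)) for key, n in counts.items()}
```

The output feeds the negotiation step directly: each entry is either a source bug to fix or an expectation to relax in the contract.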
Enforcement in CI versus enforcement at load time
There are two enforcement points, and they serve different purposes.
CI enforcement runs contract checks when a model change is proposed. This catches cases where a schema migration in a pull request would break a downstream contract before the change is merged. It's the "shift left" version — catch violations before they reach production data.
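A CI check of this kind reduces to comparing the schema a pull request would produce against the contracted columns. A minimal sketch, assuming both schemas are available as name-to-type dicts (how you extract them from your tooling is left out, and schema_breaks is a hypothetical name):

```python
def schema_breaks(contract_columns: dict[str, str],
                  proposed_columns: dict[str, str]) -> list[str]:
    """List contracted columns that the proposed schema removes or retypes.

    Extra columns in the proposed schema are ignored: adding a column does
    not break consumers who rely only on the contracted ones.
    """
    problems = []
    for name, dtype in contract_columns.items():
        if name not in proposed_columns:
            problems.append(f"{name}: removed")
        elif proposed_columns[name] != dtype:
            problems.append(
                f"{name}: type changed from {dtype} to {proposed_columns[name]}"
            )
    return problems
```

A CI job would fail the build when the returned list is non-empty.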
Load-time enforcement runs contract checks on each incremental load before data is promoted to the governed output layer. This catches cases where upstream source data violates a contract that was valid when it was written — a new type of event that doesn't match the accepted_values list, a batch of rows with unexpected nulls in a required field.
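The promotion gate itself can be very small. The sketch below assumes each check is a function that takes a row and returns a list of violation messages, and that promote is whatever callable writes a batch to the governed layer; both are hypothetical stand-ins for your own tooling.

```python
from typing import Any, Callable, Iterable

Row = dict[str, Any]
Check = Callable[[Row], list[str]]


def promote_if_valid(batch: list[Row],
                     checks: Iterable[Check],
                     promote: Callable[[list[Row]], None]) -> list[str]:
    """Run every check on every row; promote the batch only if all pass.

    Returns the collected violation messages (empty when the batch was promoted).
    """
    checks = list(checks)  # allow a generator to be reused across rows
    violations = [msg for row in batch for check in checks for msg in check(row)]
    if not violations:
        promote(batch)
    return violations
```

The batch is promoted as a unit: one violating row holds back the whole load, which matches the "block data from reaching the governed layer" behavior described for hard violations.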
Both enforcement points are necessary. CI without load-time enforcement misses runtime violations from data sources you don't control. Load-time enforcement without CI catches breaking model changes only after they have already been merged into the transformation code.
A contract that only runs in CI is a static check on your SQL. A contract that runs at load time is a live guarantee to your consumers. You need both, but if you're introducing contracts for the first time, start at load time — it catches the violations that actually break dashboards.
What contract violations look like in practice
When a contract check fails at load time, the violation should produce a structured report, not just an error log line. Specifically, it should tell you:
- Which contract was violated (table name + contract version)
- Which constraint failed (column name + constraint type + specific value that failed)
- The number of affected rows and the percentage of the batch they represent
- The contract owner (producer team) and the point-of-contact for the violation
- Whether the violation blocked promotion or was flagged as a warning
This structure matters because it turns a contract failure into an actionable engineering task, not a vague "something broke" incident. The on-call engineer knows exactly what failed, who owns the upstream source, and whether it's a hard block or a soft warning.
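The fields listed above map naturally onto a small record type. This is a sketch under our own naming (ViolationReport and its field names are hypothetical), serialized to JSON so that alerting and ticketing tools can consume it.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ViolationReport:
    table: str
    contract_version: str
    column: str
    constraint: str
    failing_value: str
    affected_rows: int
    batch_rows: int
    producer_team: str
    contact: str
    blocked_promotion: bool  # True for a hard block, False for a warning

    @property
    def affected_pct(self) -> float:
        """Share of the batch that violated the constraint, in percent."""
        if not self.batch_rows:
            return 0.0
        return round(100 * self.affected_rows / self.batch_rows, 2)

    def to_json(self) -> str:
        payload = asdict(self)
        # asdict() skips properties, so add the derived percentage explicitly.
        payload["affected_pct"] = self.affected_pct
        return json.dumps(payload, sort_keys=True)
```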
Common failure modes when introducing contracts
Contracts fail to stick in three predictable ways:
Over-specification on the first pass. Teams write contracts that cover every column with tight constraints, then spend the next month fighting false positives from edge cases in source data. Start narrow. Cover the four to six columns that actually matter to downstream consumers. Add coverage incrementally as you understand the data better.
No clear producer ownership. A contract with no named producer is a contract nobody enforces. Every contract must have an explicit owner — a team and a person — who is responsible for keeping the source data within spec. Without this, violations generate Slack messages that go to nobody.
Blocking on every violation regardless of severity. Not all contract violations are equally critical. A null in a column that's always been nullable in practice is a different severity from a missing required UUID. Configure severity tiers: hard blocks that stop promotion, soft warnings that log and alert, and informational notes that record but don't alert. Treating all violations as hard blocks creates noise that engineers learn to ignore.
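The three tiers can be sketched as an enum plus one decision function. The names Severity and decide are hypothetical; the behavior follows the tiers described above (block stops promotion, warn alerts, info only records).

```python
from enum import Enum


class Severity(Enum):
    BLOCK = "block"  # stop promotion and alert
    WARN = "warn"    # promote, but log and alert
    INFO = "info"    # promote and record only


def decide(severities):
    """Return (promote, alert) for the severities found in one batch."""
    severities = list(severities)
    promote = Severity.BLOCK not in severities
    alert = any(s in (Severity.BLOCK, Severity.WARN) for s in severities)
    return promote, alert
```

Keeping the tier on the constraint, rather than deciding it at alert time, means the producer and consumers agree on severity during contract negotiation instead of during an incident.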
Data contracts, done right, shift the cost of schema changes from the consumers (who discover breakage reactively) to the producers (who own their output quality proactively). That shift is the mechanism that makes a data platform trustworthy — and trustworthy data platforms are the ones that get used for real decisions.