Freshness SLAs for Data: Setting and Enforcing Expectations Across Your Metric Catalog

At 8:47am on a Tuesday, the head of product sends a message to the #data-support channel: "DAU chart looks off — is the data from yesterday loaded?" The analytics engineer checks the pipeline. The nightly load ran. The dbt models show green. The Airflow task completed without error. But the dashboard is showing data from 36 hours ago. The load ran, technically. It just ran slowly enough that only the previous day's data made it through before the dashboard's cache refreshed. No alert was sent. Nobody knew until someone looked at a chart and noticed the date.

This is a freshness failure that a generic pipeline monitor doesn't catch. The pipeline didn't fail — it just delivered stale data. Catching it requires something more specific: a per-metric freshness SLA with explicit thresholds and metric-level context in the alert.

The difference between pipeline health and metric freshness

Pipeline health monitoring tells you whether your ETL jobs ran successfully. This is necessary but not sufficient. A pipeline that ran successfully and loaded data that's 36 hours old has met the pipeline's operational SLA without meeting the business's data freshness expectation.

Metric freshness monitoring asks a different question: is the metric current enough to be used for the decisions it supports? A daily active users metric that's 36 hours old is not usable for a product team making decisions at their 9am stand-up. A revenue metric that's 25 hours old is borderline for a finance team reviewing yesterday's bookings. A real-time fraud score that's 90 seconds old may be completely adequate.

The threshold is metric-specific. It's determined by the business cadence that consumes the metric, not by a single uniform SLA applied across all tables. This is why freshness SLAs belong on individual metrics in a semantic catalog, not as a blanket rule in your orchestration platform.
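To make the contrast concrete, here is a minimal Python sketch that checks the three ages mentioned above against per-metric thresholds and against a single blanket rule. The metric names, the threshold values, and the 48-hour blanket rule are illustrative assumptions, not recommendations.

```python
from datetime import timedelta

# Illustrative per-metric thresholds; names and values are hypothetical.
MAX_AGE = {
    "daily_active_users": timedelta(hours=18),
    "daily_revenue": timedelta(hours=24),
    "fraud_score": timedelta(minutes=5),
}

# Data ages taken from the examples in the text above.
observed_age = {
    "daily_active_users": timedelta(hours=36),
    "daily_revenue": timedelta(hours=25),
    "fraud_score": timedelta(seconds=90),
}

# A single uniform rule applied to every metric, for contrast (assumed value).
BLANKET_MAX_AGE = timedelta(hours=48)


def stale_metrics(ages, thresholds):
    """Return the names of metrics whose age exceeds their own threshold."""
    return sorted(name for name, age in ages.items() if age > thresholds[name])


per_metric = stale_metrics(observed_age, MAX_AGE)
blanket = stale_metrics(observed_age, {name: BLANKET_MAX_AGE for name in observed_age})

print("stale under per-metric thresholds:", per_metric)
print("stale under the blanket rule:", blanket)
```

Under these assumed numbers the blanket rule reports nothing, while the per-metric thresholds flag both the DAU and the revenue metric.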

Defining a freshness SLA for a metric

A freshness SLA for a metric specifies:

  • Maximum acceptable age: The metric must have data as of at most X hours ago (or as of a specific time in each calendar day)
  • Measurement point: The timestamp that defines freshness — last load timestamp, maximum event_time in the output table, or the last successful reconciliation against the source
  • Breach definition: The condition under which the SLA is considered violated, typically data age strictly greater than the maximum age. Crossing the degraded zone threshold is a warning, not a breach
  • Alert routing: Who gets notified when the SLA is breached: the metric owner, the on-call analytics engineer, or a shared channel
  • Degraded zone: An optional warning threshold before the SLA is fully breached (e.g., warn at 20 hours, breach at 24 hours) that allows proactive remediation before the hard threshold is hit

Here's an example SLA specification for a revenue metric:

metric: daily_net_arr
freshness_sla:
  max_age_hours: 24
  measurement: last_load_timestamp
  degraded_zone_hours: 20
  breach_action: page
  alert_channel: "#data-sla-alerts"
  owner: finance-eng
  escalation_after_minutes: 30
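One way to evaluate a spec like this is to compare the age of the measurement timestamp against the two thresholds. The following is a minimal Python sketch under stated assumptions, not a reference implementation: the `FreshnessSla` class, the `FreshnessStatus` enum, and the `evaluate` function are hypothetical names, the timestamps are illustrative, and the measurement timestamp is assumed to be available as a timezone-aware datetime.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class FreshnessStatus(Enum):
    FRESH = "fresh"
    DEGRADED = "degraded"
    BREACHED = "breached"


@dataclass(frozen=True)
class FreshnessSla:
    # Field names mirror the YAML spec above; the class itself is hypothetical.
    metric: str
    max_age_hours: float
    degraded_zone_hours: float


def evaluate(sla: FreshnessSla, last_load: datetime, now: datetime) -> FreshnessStatus:
    """Classify a metric by the age of its measurement timestamp."""
    age = now - last_load
    if age > timedelta(hours=sla.max_age_hours):
        return FreshnessStatus.BREACHED  # strictly older than the maximum age
    if age > timedelta(hours=sla.degraded_zone_hours):
        return FreshnessStatus.DEGRADED  # warning zone, SLA not yet breached
    return FreshnessStatus.FRESH


sla = FreshnessSla(metric="daily_net_arr", max_age_hours=24, degraded_zone_hours=20)
now = datetime(2024, 1, 2, 20, 17, tzinfo=timezone.utc)  # illustrative clock time
status = evaluate(sla, last_load=now - timedelta(hours=22), now=now)
print(sla.metric, status.value)
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.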

Why metric-level context in alerts matters

When a generic pipeline alert fires, it tells you a job failed or a table is stale. It does not tell you which downstream metrics are affected, how critical they are, or what business impact the staleness creates. The on-call engineer has to investigate all of that before they can even decide how urgent the issue is.

When a metric-level freshness alert fires, it tells you that daily_net_arr — the metric the CFO uses for daily revenue reconciliation — is 22 hours old with a 24-hour SLA, currently in the degraded zone, and that the last successful load was at 10:17pm last night. The on-call engineer knows immediately what is affected, who cares about it, and how much time they have before the SLA is breached. The triage happens in 30 seconds instead of 20 minutes.
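A sketch of how such an alert body might be assembled is shown below. The function name `format_freshness_alert`, its parameters, and the message layout are assumptions made for illustration; the timestamps are made up to match the 22-hour example.

```python
from datetime import datetime, timedelta, timezone


def format_freshness_alert(metric, description, owner, max_age_hours,
                           degraded_zone_hours, last_load, now):
    """Build an alert message that carries metric-level context."""
    age = now - last_load
    max_age = timedelta(hours=max_age_hours)
    if age > max_age:
        state = "BREACHED"
    elif age > timedelta(hours=degraded_zone_hours):
        state = "DEGRADED"
    else:
        state = "FRESH"
    age_hours = age.total_seconds() / 3600
    # Clamp at zero so an already-breached metric does not show negative time.
    remaining_hours = max((max_age - age).total_seconds() / 3600, 0.0)
    return (
        f"[{state}] {metric} ({description})\n"
        f"owner: {owner}\n"
        f"data age: {age_hours:.1f}h against a {max_age_hours}h SLA\n"
        f"time until breach: {remaining_hours:.1f}h\n"
        f"last successful load: {last_load.isoformat(timespec='minutes')}"
    )


last_load = datetime(2024, 1, 1, 22, 17, tzinfo=timezone.utc)  # illustrative
message = format_freshness_alert(
    metric="daily_net_arr",
    description="CFO daily revenue reconciliation",
    owner="finance-eng",
    max_age_hours=24,
    degraded_zone_hours=20,
    last_load=last_load,
    now=last_load + timedelta(hours=22),
)
print(message)
```

The point of the design is that everything needed for triage (what is affected, who owns it, how long until breach) travels in the alert itself.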

A generic pipeline failure alert tells you something broke. A metric freshness SLA breach tells you which business capability is degraded, how severely, and who needs to be in the conversation. The difference determines whether your on-call response takes minutes or hours.

Setting realistic SLA thresholds

The most common mistake when setting freshness SLAs is applying uniform thresholds across all metrics. Not all metrics have the same freshness requirements. A framework we've found useful:

| Metric category | Typical business cadence | Recommended max age | Degraded zone |
|---|---|---|---|
| Operational dashboards (DAU, active sessions) | Reviewed at daily stand-up (9am) | 18 hours | 14 hours |
| Finance metrics (ARR, bookings, churn) | Reviewed at daily close (EOD) | 24 hours | 20 hours |
| Weekly planning metrics (cohort retention, feature adoption) | Reviewed at weekly planning meetings | 48 hours | 36 hours |
| Board/investor metrics | Reviewed quarterly | As agreed with finance, min 24 hours | 18 hours |
| Real-time operational metrics (fraud scores, API error rates) | Monitored continuously | 5-15 minutes | 2-5 minutes |

The SLA audit before you set thresholds

Before codifying SLA thresholds, run a two-week measurement period on your current loads. For each critical metric, record the actual load completion time (not the scheduled time — the actual time data was available in the output table) for each run. Plot the distribution. You'll typically find:

  • Most loads complete within 30-60 minutes of their scheduled start time
  • 5-10% of loads have extended runtimes due to compute contention, large incremental batches, or upstream delays
  • Occasional outliers where loads run 2-4x the normal duration

Set your degraded zone threshold at roughly the p90 of your actual load completion times. Set your SLA breach threshold at the maximum latency your business consumers can tolerate, less whatever buffer you want before their deadline. If your typical daily load finishes by 6am but p90 is 7:30am, and the finance team needs fresh data by 9am, your SLA should have a degraded zone at 7:30am and a breach threshold at 8:30am. That gives you an hour between the warning and the breach to remediate, and a 30-minute buffer between the breach and the 9am deadline.
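The threshold derivation can be sketched in a few lines of Python. The completion times below are invented sample data for fourteen daily runs, and `propose_thresholds` and `to_clock` are hypothetical helper names. The 9am deadline and 30-minute buffer come from the example above.

```python
from statistics import quantiles

# Hypothetical completion times for 14 daily loads, in minutes after midnight.
completion_minutes = [352, 345, 360, 440, 350, 357, 348,
                      362, 355, 520, 340, 365, 358, 450]


def to_clock(minutes_after_midnight: float) -> str:
    """Render minutes after midnight as HH:MM."""
    total = round(minutes_after_midnight)
    return f"{total // 60:02d}:{total % 60:02d}"


def propose_thresholds(completions, consumer_deadline, buffer_minutes):
    """Return (degraded, breach) thresholds in minutes after midnight."""
    # quantiles(n=10) returns 9 cut points; the last one is the 90th percentile.
    p90 = quantiles(completions, n=10, method="inclusive")[-1]
    breach = consumer_deadline - buffer_minutes
    if p90 >= breach:
        # Normal variance already reaches the breach threshold: the load needs
        # to get faster or the deadline needs renegotiating.
        raise ValueError("p90 completion is at or past the breach threshold")
    return p90, breach


degraded, breach = propose_thresholds(
    completion_minutes,
    consumer_deadline=9 * 60,  # finance needs data by 9am
    buffer_minutes=30,
)
print("degraded zone starts at", to_clock(degraded))
print("breach threshold at", to_clock(breach))
```

If the p90 lands at or past the breach threshold, the sketch raises instead of returning numbers, since an SLA that routine variance already violates would page on most slow days.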

Connecting SLA monitoring to remediation

An SLA alert that fires with no remediation path is just noise. Well-designed freshness monitoring links the alert directly to the specific load job that owns the metric and missed the SLA, rather than to the pipeline run log (which shows raw execution output). The link should surface the last successful timestamp, the current data age, and the load step that produced the last successful run.

This means your on-call engineer can jump from the SLA alert to a remediation action in under 2 minutes: they see which load step failed, they run a targeted retry or triage the upstream source, and they know when the SLA will be satisfied without checking a separate dashboard.

Freshness SLAs done well transform data reliability from a reactive debugging practice into a proactive, measurable commitment. Not just "the pipeline ran" but "the metric is current" — which is the thing your business consumers actually need to know.