Incremental ETL Without the Footguns: High-Watermark State at Scale

Incremental ETL Without the Footguns: High-Watermark State at Scale

Full table scans are the original sin of analytics engineering. Every pipeline eventually starts with them — they're simple to write, trivial to reason about, and completely correct. They're also expensive, slow, and fragile at any meaningful data volume. The shift to incremental loads sounds obvious. It's the implementation where teams consistently introduce the bugs that take months to fully surface. This post covers the specific failure modes of high-watermark incremental loads and the patterns that prevent them.

Why full scans stay too long

Full table scans survive longer than they should because they work. The pipeline runs at 2am, scans 50 million rows, costs $4 in compute, and produces correct results. That's fine until your data volume doubles and the nightly scan starts competing with production workloads, blowing past cost budgets, or timing out before the next run starts. Teams often make the switch to incremental loads reactively, under time pressure, which is the worst time to think carefully about watermark management.

How high-watermark state works and where it breaks

A high-watermark load tracks the maximum value of a timestamp or sequence column from the last successful run, then processes only rows where that column is greater than the watermark. In theory, this is straightforward. In practice, four things reliably go wrong:

Late-arriving records

Your warehouse receives event data that was generated before the watermark but arrives after it. This happens constantly in real systems: mobile clients that were offline, batch syncs from third-party APIs that run on 24-48 hour delays, event queues that back up under load. A naive watermark discards these records permanently. The pipeline ran correctly from its own perspective; the data is just silently missing.

The fix is a configurable late-arrival window. Instead of loading rows where event_time > last_watermark, you load rows where event_time > last_watermark - INTERVAL '2 hours' (or 24 hours, or 72 hours, depending on your upstream SLAs). You then deduplicate on merge to avoid double-counting records that fall within the overlap window and were already processed.

Schema drift mid-run

The source table changes schema in the middle of an incremental run. The first batch of rows from the run has the old schema; the second batch has the new schema. If your pipeline reads the schema once at the start of the run and applies it to all rows, you'll either fail mid-run (best case) or silently produce null columns for new fields in the second batch (worst case).

The fix is schema validation at the start of each run against the last-known schema. If the schema has changed, the run surfaces a structured schema diff before processing any data — not after half the rows are loaded.

Watermark loss after infrastructure failure

The run succeeds, the watermark is updated in memory, and then the process crashes before writing the new watermark to persistent state. The next run starts from the old watermark and reprocesses rows that were already loaded. If your merge logic is idempotent (deduplicate by primary key before insert), this is recoverable. If not, you've double-counted.

The fix is storing watermark state in a durable, transactional store — not in a config file, not in an environment variable, not in a local SQLite database on the scheduler node. A dedicated watermark table in your warehouse, updated atomically with the data load in the same transaction, is the most reliable pattern.

Deleted rows that the watermark never sees

High-watermark loads are inherently insert-and-update optimized. They don't see hard deletes in the source system because deleted rows don't update the timestamp column — they disappear. If your analytics layer needs to reflect source-system deletions (GDPR erasure, account deactivation, test data cleanup), a pure watermark approach will silently diverge from the source of truth over time.

The fix depends on your deletion semantics. If your source system supports soft deletes (a deleted_at timestamp), the watermark will capture those naturally. If it uses hard deletes, you need a periodic reconciliation run — typically weekly or monthly — that does a full primary key comparison between the warehouse and the source, identifies orphaned rows, and removes them. This is a separate job from the incremental load, not a workaround built into it.

Idempotency: the property that saves you at 3am

The most important property of a well-designed incremental load is idempotency: running the same load twice should produce the same result as running it once. Idempotent loads allow you to retry safely without investigating whether partial state was written before the failure.

Idempotency in practice requires:

  • A deterministic, deduplicated merge key that is consistent across runs
  • Watermark state stored and updated atomically with the data
  • Late-arrival windows wide enough that a re-run of the last N intervals produces the same output as the original run
  • Explicit MERGE or UPSERT semantics rather than bare INSERT — so reprocessed rows update in place rather than creating duplicates

An incremental load that isn't idempotent isn't safe for production. You will retry it eventually — after a timeout, after a network hiccup, after a scheduler restart. If retrying produces different data than the original run, you have a correctness bug that only surfaces under failure conditions.

Partition granularity and its tradeoffs

Choosing the right partition granularity for an incremental load is a tradeoff between reprocessing cost and precision. Daily partitions are common for event data; hourly partitions are appropriate for high-volume streams where a 24-hour late-arrival window is too wide and too expensive to scan. The wrong choice produces one of two failure modes:

  • Granularity too coarse: Late arrivals within the partition window are handled, but the late-arrival window itself is expensive (scanning 24 hours of data for 2% of rows that actually changed).
  • Granularity too fine: Hourly partitions with a 2-hour late-arrival window means you're reading 3 partitions per run. Fine for high-volume streams; wasteful overhead for low-volume event tables that change 500 rows per day.

Our practical guidance: start with daily partitions and a 6-hour late-arrival window for most event tables. Profile your actual late-arrival distribution after two weeks of running. If 99% of late arrivals land within 2 hours, shrink the window. If you're seeing arrivals 18+ hours late, expand it. Don't pick granularity a priori — measure first.

Configuration as the interface

The best incremental load pipelines have all the parameters we've discussed — late-arrival window, partition granularity, error thresholds, watermark store location, schema drift behavior — expressed as configuration, not code. This matters operationally: when late arrivals suddenly spike because a third-party API changed its sync schedule, the fix should be a YAML change and a run, not a code change, PR, code review, and deployment.

Analytics teams at the 50-500 employee stage typically don't have dedicated data engineers who can turn around pipeline code changes in hours. They have analytics engineers who need to be able to tune pipeline behavior in minutes. Configuration-first design is not a nice-to-have at that team size — it's the difference between a load that can be tuned by the person on-call and one that requires a wake-up call to a specific engineer who wrote the code two years ago.