Building data pipelines that don't page you at 3am

By Arshad Ansari

Most data problems in financial services aren't modelling problems. They're plumbing problems. A file lands late, a schema drifts, a vendor silently changes a column, and suddenly your "automated" pipeline needs a human at 3am.

After fifteen years building these systems — including the data backbone at Stockopedia and the macro pipeline behind Quantamental — I've found the difference between a pipeline you babysit and one you forget about comes down to a handful of habits.

1. Make every step idempotent

If you can't safely re-run a step, you can't recover from failure without a human deciding what's safe. Design every stage so that running it twice produces the same result as running it once. Upserts over inserts. Partition overwrites over appends. Deterministic output paths keyed by the input partition.

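# Bad: every re-run appends the same rows again
df.write.mode("append").saveAsTable("prices")

# Good: re-running a date replaces that date's rows instead of duplicating them.
# replaceWhere is a Delta Lake option, so the table must be stored as Delta.
(df.write.format("delta")
   .mode("overwrite")
   .option("replaceWhere", f"date = '{run_date}'")
   .saveAsTable("prices"))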
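
The same idea applies outside Spark. Here is a minimal sketch of a deterministic output path keyed by the input partition, using only the Python standard library. The directory layout, the file name `part-0000.json`, and the function names are illustrative assumptions, not a prescribed convention.

```python
import json
import os
import tempfile
from pathlib import Path


def output_path(root: Path, dataset: str, run_date: str) -> Path:
    """The same inputs always map to the same file (hypothetical layout)."""
    return root / dataset / f"date={run_date}" / "part-0000.json"


def write_partition(root: Path, dataset: str, run_date: str, rows: list[dict]) -> Path:
    """Write one partition so that a re-run overwrites it rather than adding to it."""
    target = output_path(root, dataset, run_date)
    target.parent.mkdir(parents=True, exist_ok=True)

    # Write to a temp file in the same directory, then swap it into place.
    # A crash mid-write then never leaves a half-written file at the target path.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(rows, fh)
    os.replace(tmp_name, target)
    return target
```

Running `write_partition` twice for the same date leaves exactly one file with the same contents, which is the property the rest of the pipeline relies on.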

2. Validate at the boundary, not in the dashboard

The cheapest place to catch bad data is the moment it enters your system. Assert row counts, null rates, and value ranges before you let data flow downstream. A pipeline that fails loudly on ingestion is worth ten that quietly serve wrong numbers for a week.
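
As a rough sketch of what a boundary check can look like, the function below asserts row count, null rate, and value range on a batch before anything downstream sees it. The `price` column, the `BoundaryRules` thresholds, and the `ValidationError` name are assumptions for illustration; real feeds will need their own rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundaryRules:
    """Hypothetical thresholds; tune these per feed."""
    min_rows: int
    max_null_rate: float
    price_range: tuple[float, float]


class ValidationError(Exception):
    """Raised when an incoming batch breaks a boundary rule."""


def validate_batch(rows: list[dict], rules: BoundaryRules) -> None:
    """Raise ValidationError listing every rule the batch breaks."""
    problems = []

    # Row count: a truncated file usually shows up as too few rows.
    if len(rows) < rules.min_rows:
        problems.append(f"row count {len(rows)} < {rules.min_rows}")

    # Null rate on the (assumed) 'price' column. An empty batch counts as all null.
    nulls = sum(1 for r in rows if r.get("price") is None)
    null_rate = nulls / len(rows) if rows else 1.0
    if null_rate > rules.max_null_rate:
        problems.append(f"price null rate {null_rate:.2%} > {rules.max_null_rate:.2%}")

    # Value range on the non-null prices.
    low, high = rules.price_range
    out_of_range = [
        r["price"] for r in rows
        if r.get("price") is not None and not low <= r["price"] <= high
    ]
    if out_of_range:
        problems.append(f"{len(out_of_range)} prices outside [{low}, {high}]")

    # Fail loudly, with every problem in one message.
    if problems:
        raise ValidationError("; ".join(problems))
```

Collecting every failure before raising means one alert describes the whole batch, rather than revealing problems one re-run at a time.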

3. Separate "late" from "broken"

A vendor file arriving two hours late is normal. A vendor file that never arrives is an incident. If your alerting can't tell these apart, you'll either get paged for nothing or miss the real failures. Encode expected arrival windows and only escalate when they're genuinely breached.
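
One way to encode an arrival window is an expected time plus a grace period, with a small classifier that the alerting layer consults. This is a sketch under assumed names (`ArrivalStatus`, `classify_arrival`, `should_page`); the right grace period depends on each vendor's track record.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class ArrivalStatus(Enum):
    ON_TIME = "on_time"   # arrived at or before the expected time
    PENDING = "pending"   # not here yet, but not expected yet either
    LATE = "late"         # past the expected time, still tolerable
    MISSING = "missing"   # grace period breached with no file


def classify_arrival(
    expected: datetime,
    grace: timedelta,
    arrived_at: Optional[datetime],
    now: datetime,
) -> ArrivalStatus:
    """Classify a feed given its expected time and a grace period."""
    if arrived_at is not None:
        return ArrivalStatus.ON_TIME if arrived_at <= expected else ArrivalStatus.LATE
    if now <= expected:
        return ArrivalStatus.PENDING
    if now <= expected + grace:
        return ArrivalStatus.LATE
    return ArrivalStatus.MISSING


def should_page(status: ArrivalStatus) -> bool:
    """Only a genuinely breached window wakes someone up."""
    return status is ArrivalStatus.MISSING
```

With an expected time of 06:00 and a three-hour grace period, a file still absent at 08:00 is classified as late and only logged, while one still absent at 09:30 is missing and escalates.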

4. Backfill is a first-class feature, not an afterthought

You will need to reprocess history — after a bug fix, a schema change, or a new derived column. If backfilling means hand-editing scripts and praying, it will go wrong under pressure. Build the same code path for "today's run" and "the last three years," parameterised by date range.
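
A minimal sketch of a single code path: the daily run and the backfill call the same function, and only the date range differs. `run_pipeline` and `process_partition` are hypothetical names, and the sketch assumes each partition step is idempotent, as in habit 1.

```python
from datetime import date, timedelta
from typing import Callable, Iterator


def daterange(start: date, end: date) -> Iterator[date]:
    """Yield each date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def run_pipeline(
    start: date,
    end: date,
    process_partition: Callable[[date], None],
) -> list[date]:
    """Run process_partition for every date in the range, oldest first."""
    done = []
    for day in daterange(start, end):
        # Assumed idempotent, so re-running a date is always safe.
        process_partition(day)
        done.append(day)
    return done
```

Today's run is then `run_pipeline(today, today, step)` and a three-year backfill is `run_pipeline(three_years_ago, today, step)`, so the backfill exercises exactly the code that runs every day.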

5. Observability beats heroics

You can't fix what you can't see. Every run should emit: what it processed, how many rows, how long it took, and whether the data-quality checks passed. When something breaks, the answer should be in a dashboard, not in someone's memory of how the system works.
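
One lightweight way to get there is to wrap each step in a context manager that always emits a structured record, whether the step succeeds or fails. The sketch below uses only the standard library; the field names and the `pipeline.runs` logger name are assumptions, and in practice the record would be shipped to whatever backs your dashboard.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("pipeline.runs")


@contextmanager
def run_record(step: str, partition: str):
    """Emit one JSON log line per run: what ran, row count, duration, check result."""
    record = {
        "step": step,
        "partition": partition,
        "rows": 0,
        "checks_passed": None,
        "status": "ok",
    }
    started = time.monotonic()
    try:
        # The step fills in 'rows' and 'checks_passed' as it goes.
        yield record
    except Exception as exc:
        record["status"] = "failed"
        record["error"] = repr(exc)
        raise
    finally:
        # Runs on success and on failure, so every run leaves a trace.
        record["duration_s"] = round(time.monotonic() - started, 3)
        logger.info(json.dumps(record))
```

A step would use it as `with run_record("load_prices", "2024-01-02") as rec:` and set `rec["rows"]` and `rec["checks_passed"]` inside the block.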

The payoff

None of this is glamorous. But the result is infrastructure your team can trust and forget about — which is the entire point. The pipeline runs, the checks pass, and nobody thinks about it until they want to add something new.

That's the bar I build to. If your data infrastructure isn't there yet, let's talk.
