Do you actually need a data warehouse?

"We need a data warehouse" is one of those decisions that gets made by default. Someone reads that the modern data stack starts with Snowflake or BigQuery, a contract gets signed, and six months later you're maintaining infrastructure nobody can quite justify. Before you do that, it's worth actually asking the question.

Here's the framework I use with clients.

Start with the data, not the tool

The warehouse decision should fall out of your data's properties — not the other way around. Four questions, in order:

1. How large is the data you actively query? Your archive can be huge; that's a storage question. What matters is the hot slice analysts touch. Under ~100GB, a single machine running DuckDB over Parquet will out-run a warehouse and cost you nothing per query. You probably don't need one.

2. How many people (and jobs) query at once? A warehouse earns its keep on concurrency — dozens of users and pipelines hitting shared data simultaneously. A handful of analysts and a nightly job is not that. That's a laptop-shaped problem wearing a warehouse-shaped price tag.

3. Do you need a single governed source of truth? If many teams must share one authoritative, access-controlled dataset with row-level governance, that's a genuine warehouse strength. If it's really one team and a few dashboards, you're buying governance you won't use.

4. How fast is the data growing — really? Plan for the next 18 months, not the hypothetical exabyte future. "We might scale" is how teams end up paying today for problems they never have.

The honest scorecard

Mostly "small, few, one team, slow growth"? You don't need a warehouse yet. A files-first stack (Parquet + DuckDB + Arrow) will be faster, cheaper, and simpler — and you can always graduate later.
Mostly "large, many, shared, fast"? Buy the warehouse. It's the right tool and you'll be glad you did.
A mix? The best answer is usually both, deliberately: keep the bulk of the workload local and cheap, and reserve the warehouse for the slice that genuinely needs shared, governed, high-concurrency access.

The real failure mode

The expensive mistake isn't choosing wrong once. It's choosing by reflex and never revisiting — running (and paying for) a warehouse for years to serve a workload that outgrew the need for it the moment it stopped being warehouse-scale. Architecture should match the weight of the problem, and the problem changes.

If your stack feels heavier than your data, it probably is.

Free book: Local-First Analytics walks through the whole files-first approach — DuckDB, Parquet, Arrow — with runnable code and real datasets. Grab it free here.

Want a second opinion before you sign a five-figure contract? Let's talk — a 30-minute architecture sanity-check is cheaper than a year of the wrong warehouse.

Start with the data, not the tool

The honest scorecard

The real failure mode

Building something data-heavy?