Do you actually need a data warehouse?
By Arshad Ansari
"We need a data warehouse" is one of those decisions that gets made by default. Someone reads that the modern data stack starts with Snowflake or BigQuery, a contract gets signed, and six months later you're maintaining infrastructure nobody can quite justify. Before you do that, it's worth actually asking the question.
Here's the framework I use with clients.
Start with the data, not the tool
The warehouse decision should fall out of your data's properties — not the other way around. Four questions, in order:
1. How large is the data you actively query? Your archive can be huge; that's a storage question. What matters is the hot slice analysts touch. Under ~100GB, a single machine running DuckDB over Parquet will out-run a warehouse and cost you nothing per query. You probably don't need one.
2. How many people (and jobs) query at once? A warehouse earns its keep on concurrency — dozens of users and pipelines hitting shared data simultaneously. A handful of analysts and a nightly job is not that. That's a laptop-shaped problem wearing a warehouse-shaped price tag.
3. Do you need a single governed source of truth? If many teams must share one authoritative, access-controlled dataset with row-level governance, that's a genuine warehouse strength. If it's really one team and a few dashboards, you're buying governance you won't use.
4. How fast is the data growing — really? Plan for the next 18 months, not the hypothetical exabyte future. "We might scale" is how teams end up paying today for problems they never have.
The honest scorecard
- Mostly "small, few, one team, slow growth"? You don't need a warehouse yet. A files-first stack (Parquet + DuckDB + Arrow) will be faster, cheaper, and simpler — and you can always graduate later.
- Mostly "large, many, shared, fast"? Buy the warehouse. It's the right tool and you'll be glad you did.
- A mix? The best answer is usually both, deliberately: keep the bulk of the workload local and cheap, and reserve the warehouse for the slice that genuinely needs shared, governed, high-concurrency access.
The real failure mode
The expensive mistake isn't choosing wrong once. It's choosing by reflex and never revisiting — running (and paying for) a warehouse for years to serve a workload that outgrew the need for it the moment it stopped being warehouse-scale. Architecture should match the weight of the problem, and the problem changes.
If your stack feels heavier than your data, it probably is.
Free book: Local-First Analytics walks through the whole files-first approach — DuckDB, Parquet, Arrow — with runnable code and real datasets. Grab it free here.
Want a second opinion before you sign a five-figure contract? Let's talk — a 30-minute architecture sanity-check is cheaper than a year of the wrong warehouse.
Building something data-heavy? Let's talk.