Local-first analytics in practice: DuckDB, Parquet, and killing the round-trip
By Arshad Ansari
In the previous post I argued that a lot of analytics infrastructure exists to solve laptop-sized problems. Here's what the alternative actually looks like in code.
The setup: data as files, not endpoints
The local-first pattern starts by treating your dataset as a file — ideally Parquet, a columnar format that's compact, typed, and splittable. You can sit it in object storage, ship it in a release, or cache it on disk. No connection string, no credentials rotation, no warehouse to keep warm.
Then you point an in-process engine at it. DuckDB reads Parquet natively and runs the query where you are.
import duckdb
# No server. No connection pool. Just a query against a file.
con = duckdb.connect()
result = con.execute("""
SELECT
symbol,
date_trunc('month', trade_date) AS month,
avg(close) AS avg_close,
count(*) AS days
FROM 'prices/*.parquet' -- glob straight over partitioned files
WHERE trade_date >= '2024-01-01'
GROUP BY 1, 2
ORDER BY 1, 2
""").df() # straight into a pandas DataFrame
print(result.head())
That query reads only the columns it needs (columnar), only the row groups that match the predicate (pushdown), and never leaves the machine. On a few gigabytes of price data it returns in milliseconds — on the same laptop you're reading this on.
It runs in the browser too
The part that surprises people: the same engine compiles to WebAssembly. With duckdb-wasm you can ship a Parquet file and a query to the browser and render an interactive analytics view with no backend at all.
import * as duckdb from '@duckdb/duckdb-wasm'
const db = await initDuckDB() // wasm engine in the tab
const con = await db.connect()
await con.query(`
SELECT regime, count(*) AS n
FROM 'https://cdn.example.com/regimes.parquet'
GROUP BY regime
`)
The user's machine does the work. Your "API" is a static file on a CDN. Your hosting bill is whatever a CDN charges to serve a few megabytes — and there's no query endpoint to attack, rate-limit, or scale.
Where this wins
This pattern quietly replaces a surprising amount of infrastructure:
- Internal dashboards over data that updates daily, not per-second.
- Embedded analytics in a product, without standing up a query service.
- Data exploration where analysts want SQL speed without warehouse round-trips.
- Reproducible reports — the data is the artifact; pin the Parquet, pin the result.
The honest boundaries
You give up live concurrent writes, centralized governance, and real-time freshness. If you need those, keep your warehouse. But for the broad middle — read-mostly, modest-sized, latency-sensitive analytics — local-first is faster to build, cheaper to run, and easier to reason about.
This is one chapter's worth of an idea; Local-First Analytics works through the full architecture, including caching, freshness, and when to graduate back to a warehouse. And if you're staring at a data stack that's heavier than it needs to be, that's exactly the kind of problem I help with.
Building something data-heavy? Let's talk.