Local-first analytics in practice: DuckDB, Parquet, and killing the round-trip

In the previous post I argued that a lot of analytics infrastructure exists to solve laptop-sized problems. Here's what the alternative actually looks like in code.

The setup: data as files, not endpoints

The local-first pattern starts by treating your dataset as a file — ideally Parquet, a columnar format that's compact, typed, and splittable. You can sit it in object storage, ship it in a release, or cache it on disk. No connection string, no credentials rotation, no warehouse to keep warm.

Then you point an in-process engine at it. DuckDB reads Parquet natively and runs the query where you are.

import duckdb

# No server. No connection pool. Just a query against a file.
con = duckdb.connect()

result = con.execute("""
    SELECT
        symbol,
        date_trunc('month', trade_date) AS month,
        avg(close) AS avg_close,
        count(*) AS days
    FROM 'prices/*.parquet'      -- glob straight over partitioned files
    WHERE trade_date >= '2024-01-01'
    GROUP BY 1, 2
    ORDER BY 1, 2
""").df()                         # straight into a pandas DataFrame

print(result.head())

That query reads only the columns it needs (columnar), only the row groups that match the predicate (pushdown), and never leaves the machine. On a few gigabytes of price data it returns in milliseconds — on the same laptop you're reading this on.

It runs in the browser too

The part that surprises people: the same engine compiles to WebAssembly. With duckdb-wasm you can ship a Parquet file and a query to the browser and render an interactive analytics view with no backend at all.

import * as duckdb from '@duckdb/duckdb-wasm'

const db = await initDuckDB()            // wasm engine in the tab
const con = await db.connect()
await con.query(`
  SELECT regime, count(*) AS n
  FROM 'https://cdn.example.com/regimes.parquet'
  GROUP BY regime
`)

The user's machine does the work. Your "API" is a static file on a CDN. Your hosting bill is whatever a CDN charges to serve a few megabytes — and there's no query endpoint to attack, rate-limit, or scale.

Where this wins

This pattern quietly replaces a surprising amount of infrastructure:

Internal dashboards over data that updates daily, not per-second.
Embedded analytics in a product, without standing up a query service.
Data exploration where analysts want SQL speed without warehouse round-trips.
Reproducible reports — the data is the artifact; pin the Parquet, pin the result.

The honest boundaries

You give up live concurrent writes, centralized governance, and real-time freshness. If you need those, keep your warehouse. But for the broad middle — read-mostly, modest-sized, latency-sensitive analytics — local-first is faster to build, cheaper to run, and easier to reason about.

This is one chapter's worth of an idea; Local-First Analytics works through the full architecture, including caching, freshness, and when to graduate back to a warehouse. And if you're staring at a data stack that's heavier than it needs to be, that's exactly the kind of problem I help with.

The setup: data as files, not endpoints

It runs in the browser too

Where this wins

The honest boundaries

Building something data-heavy?