Blog

Notes on data engineering

Pipelines, analytics infrastructure, ML systems, and lessons from building them in production.

July 14, 2026

What it actually costs to build a data pipeline (real numbers)

The real drivers behind the cost of a data pipeline or platform, honest market ranges, and why the recurring bill matters more than the build price.

data-engineering, data-pipeline, cost, data-platform

July 14, 2026

What data engineering actually costs, and how I price it

Why a day rate is the wrong way to price data-engineering work, how I actually structure engagements, and how to think about what a data platform is worth to you.

data-engineering, consulting, pricing, data-platform

July 14, 2026

DuckDB vs Snowflake: which one does your team actually need

A practical comparison, not a benchmark war. When a single-node engine like DuckDB is enough, when you actually need Snowflake, and how to tell the difference.

duckdb, snowflake, data-warehouse, local-first, data-engineering

July 14, 2026

Data engineering consultant vs. full-time hire: an honest breakdown

The real fully-loaded cost of a full-time data engineer, when a consultant is the better call, and how to tell which one your situation actually needs.

data-engineering, hiring, consulting, fractional

July 14, 2026

How to connect an LLM to your database without a horror story

The demo is easy; the production version is where people get hurt. The guardrails that let an LLM query your database without dropping a table or leaking data.

llm, database, ai-automation, security, natural-language-to-sql

July 14, 2026

ClickHouse vs Snowflake: a practical comparison, not a benchmark war

Managed elastic warehouse vs. a columnar engine you run yourself. When ClickHouse wins, when Snowflake is worth the bill, and what changes when you actually operate it.

clickhouse, snowflake, data-warehouse, olap, data-engineering

July 14, 2026

Chat with your database: a practical guide, not a product pitch

What it takes to make natural-language querying actually useful for non-technical teams — the schema semantics, guardrails and verification the demos skip.

natural-language-to-sql, llm, analytics, ai-automation, database

July 13, 2026

How I built a paid-grade data product on 100% free public data

A teardown of a macro data product: 171 countries scored daily from 10 free public data sources, point-in-time correct, with the licensing and cost work done.

data-engineering, data-products, point-in-time-data, public-data, dagster

July 13, 2026

LLM agents in production: a teardown of AEGIS, my open-source automation layer

Four LLM agents, 28 Temporal workflows, every risky action behind a human approval gate. An architecture teardown of AEGIS — and what it teaches about production AI automation.

ai-automation, llm-agents, production, temporal, open-source, aegis

July 13, 2026

Building a data platform solo: the Ansaar teardown

NSE and crypto ingestion, ClickHouse, LightGBM, backtests and a Prolog compliance layer — a self-hosted data platform built and run by one person.

data-engineering, data-platform, clickhouse, dagster, machine-learning

July 10, 2026

AEGIS Is Open Source

Three months ago I wrote about my personal AI orchestration system and ended with a section titled 'Why It's Not Open Source Yet.' That's fixed. AEGIS is on GitHub, MIT-licensed — here's what it actually took to get there.

ai, automation, agents, self-hosted, local-llm, open-source

July 5, 2026

Behavior Is Data, Not Code

The biggest refactor of the AEGIS open-sourcing sprint: removing every branch on an agent's identity so behavior lives in the database as capability tags — resolved at runtime, edited from a UI, and safe even when a tag has no owner.

ai, agents, architecture, refactoring, self-hosted

July 4, 2026

A To-Do Is a Tweet: Social Publishing With Approval Cards

Scheduling a social post doesn't need a new UI. A to-do already has copy, a time, and labels — so in AEGIS, a Todoist task with a publish label is a scheduled post, and nothing goes out until a card in chat gets a human tap.

automation, social-media, temporal, self-hosted, workflows

July 3, 2026

Bring Your Own Cloud: The Infrastructure Registry

For a self-hosted system that's meant to be forked, there's exactly one honest way to handle infrastructure credentials: the user brings their own, the system stores them encrypted, and nothing in the code assumes a vendor. Here's how AEGIS got there — and what it cost me to cut my own vendors out.

infrastructure, self-hosted, kubernetes, security, open-source

July 2, 2026

Labels, Not Projects: Rethinking GTD

The day Next and Someday stopped being Todoist projects and became labels — and why that small modelling decision is what made AEGIS's GTD layer, and its bidirectional sync, actually work.

gtd, productivity, automation, todoist, self-hosted

July 1, 2026

Day One of Opening the Box

AEGIS is going open source, and today the first public commits landed. Day one of turning a private system into a shippable one: twenty-seven migrations squashed into a baseline, credentials evicted from the build, and an admin panel redesigned around decisions.

ai, open-source, self-hosted, refactoring, automation

June 27, 2026

Your Snowflake bill is mostly overhead

If you're paying thousands a month to query tens of gigabytes, you're renting a freight train to carry a backpack. Here's how to tell, and what to do instead.

snowflake, cost, duckdb, data-engineering, analytics

June 27, 2026

When the Bot Learns to Stay Quiet

AEGIS watches my homelab and GitHub, but the interesting part is what happens between an alert firing and me hearing about it. Most of the time, the correct answer is nothing.

ai, automation, temporal, observability, self-hosted, homelab

June 27, 2026

DuckDB in production: what it's actually good at (and what it isn't)

DuckDB is having a moment, and the hype outruns the nuance. Here's an honest map of where an in-process analytical engine belongs in a real production system — and where it doesn't.

duckdb, production, data-engineering, parquet, analytics

June 27, 2026

Do you actually need a data warehouse?

A short, honest decision framework for the question every data team eventually faces — before you sign a five-figure annual contract for infrastructure your data might not need.

data-warehouse, architecture, duckdb, data-engineering, cost

June 27, 2026

Building data pipelines that don't page you at 3am

Most financial data pipelines break for boring, predictable reasons. Here are the five habits that separate a pipeline you babysit from one you forget about.

data-engineering, pipelines, reliability, financial-data

June 27, 2026

Ask your database in plain English — locally

A small SQL-tuned model running on your machine can convert plain English questions into correct SQL. No API keys, no data leaving your laptop, no vector database.

text-to-sql, duckdb, local-llm, ollama, data-engineering

June 26, 2026

Local-first analytics in practice: DuckDB, Parquet, and killing the round-trip

A concrete walkthrough of the local-first pattern: query Parquet directly with DuckDB, no warehouse, no server, no per-query bill — from a notebook or the browser.

duckdb, parquet, local-first-analytics, data-engineering, python

June 26, 2026

Calling an LLM from inside a SQL query

Run a model directly in SQL to classify rows, extract text fields, or summarize data without exporting. Here's where it works, what it costs, and when not to bother.

llm, sql, duckdb, data-engineering, ai, local-first

June 25, 2026

Why I wrote Local-First Analytics

Most analytics stacks are cloud round-trips solving problems that fit on a laptop. Local-first analytics is the case for bringing the compute back to the data — and to the user.

local-first-analytics, duckdb, data-engineering, analytics

June 25, 2026

Turning messy PDFs into clean tables, on your own machine

Parse PDFs to Markdown, then extract structured data with a local model and a strict validation schema. No API needed.

pdf-extraction, data-engineering, local-llm, pydantic, docling

June 24, 2026

Connecting an AI assistant to your database with MCP — without letting it do damage

Use the Model Context Protocol to let Claude explore your data in plain English. The whole game is permissions: read-only role, standard server, 90% of the value with almost none of the risk.

mcp, database, claude, duckdb, postgres, permissions

June 23, 2026

Search your documents without a vector database

Most teams reach for Pinecone or Weaviate the moment they hear semantic search. For a team-sized knowledge base, you don't need one. Your database already has what you need.

local-first, semantic-search, rag, duckdb, sqlite, embeddings

June 22, 2026

Letting Claude Code write your dbt models — and why you still read every line

Claude Code can run dbt in a loop, read errors, and fix them. It needs you to steer grain, business logic, and naming — and to review the tests it writes.

dbt, claude-code, data-engineering, sql, ai-assisted-development

June 21, 2026

Using an LLM to catch the data quality problems your tests can't

Structural tests catch NOT NULLs and schema errors. Semantic problems—does this description match its category?—need a judge. How to use one responsibly.

data-quality, llm, semantic-validation, pipeline, cost-control

June 20, 2026

Should there be an LLM in your data pipeline?

An LLM is probabilistic, slow, and costly. Here's how to decide whether it belongs in your pipeline — and where it clearly doesn't.

llm, data-pipeline, engineering, over-engineering, determinism

June 19, 2026

Local model or paid API? The honest math for data work

When does running an open-weight model on your own hardware beat paying per token? The answer depends on volume, frequency, and whether you have real privacy constraints.

llm, cost-analysis, data-engineering, budgeting

June 18, 2026

Generating realistic test data with a local model

How to harden a data pipeline by feeding it carefully generated edge cases, using a local LLM when simpler generators aren't enough.

test-data, local-models, data-quality, pipeline-testing

June 17, 2026

When local-first runs out of road

Local-first, file-based data has a real ceiling. Most teams never reach it. When you do, here's what actually changes.

local-first, data-warehouses, scaling, duckdb, dbt, anti-over-engineering

June 16, 2026

Running a local model over millions of rows without it falling over

Batching, checkpointing, and idempotency are what make local inference scale. Get these right and a single GPU chews through millions of rows overnight.

batch-inference, local-models, mlops, data-engineering, checkpointing

May 19, 2026

One Primitive for Every Interruption

The design decision in AEGIS I'm proudest of isn't an agent or a model — it's a single interactions primitive. How one Postgres table, five kinds of card, and a Temporal workflow replaced every per-domain approval pattern.

ai, automation, temporal, workflows, self-hosted

April 21, 2026

Meet AEGIS: My Weird Little Operating Layer

Every week there’s another agent demo, another workflow canvas, another pitch about software running software for us. OpenClaw, browser agents, Zapier, n8n, Claude Code, MCP s

llm, automation, knowledge-graph, local-llm

April 18, 2026

Building with Claude Code: Faster Iteration, Faster Drift

I have been rebuilding a personal project called AEGIS into its third version, and on paper this rewrite should have dragged on for weeks. It did not. Claude Code helped me mo

ai, developer-tools, claude-code, engineering

April 18, 2026

Building AEGIS: How Prolog + a Knowledge Graph Make Local LLMs Actually Useful

I’ve been building a personal AI orchestration system. The first version — Jarvis — worked, but became a 20,000-line monolith I dreaded touching. So I rebuilt it from scratch

llm, prolog, knowledge-graph, local-llm

October 11, 2025

Market Regime Detection: From Hidden Markov Models to Wasserstein Clustering

Financial markets move through distinct phases — bullish rallies, sharp crashes, quiet consolidations, and volatile swings. These market regimes differ not just in price direc

machine-learning, quantitative-finance, data-science, python