Should there be an LLM in your data pipeline?

By Arshad Ansari

You're building a data pipeline. The data's messy. Someone suggests running each row through an LLM — for classification, extraction, judgment, whatever. It sounds smart. Everyone's doing it.

Don't, unless you've answered a specific set of questions first.

An LLM is a probabilistic, slow, costly, non-deterministic step. Your pipeline's whole value is usually being correct and repeatable. Those are nearly opposite properties. The trade might be worth it. Often it's not.

The default: stay deterministic

Start here. If a rule, regex, lookup, join, or constraint can produce the answer, use that instead. Always.

A rule is faster by orders of magnitude. A rule is effectively free to run. A rule is testable: you know exactly what it does on Thursday's data and Friday's. A rule is reproducible: run it again next month and you get the same answer. A rule scales without API quotas or rate limits.

Yes, rules have limits. They require you to know the logic upfront. They can't handle "everything that kind of looks like this." They break when the input changes shape. But those limits are features, not bugs. They force you to think about what you actually want.

An example: you have customer descriptions — "Joe's a trader from Mumbai", "hedge fund manager located in Hong Kong", "retired doctor, Singapore". You want to extract their role and country.

A rule path: regex for country names, then a CASE statement or lookup table for the role keywords. Is that enough? Maybe. You'll miss edge cases. But you'll know which ones. You'll be able to list them, fix them, test them.
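As a minimal sketch of that rule path, assuming small hand-written lookup tables: the table contents, the Mumbai-to-India mapping, and the names `COUNTRY_BY_PLACE`, `ROLE_KEYWORDS`, and `extract_role_and_country` are illustrative, not taken from any real system.

```python
import re

# Hypothetical lookup tables; a real pipeline would load these from reference data.
COUNTRY_BY_PLACE = {
    "mumbai": "India",
    "hong kong": "Hong Kong",
    "singapore": "Singapore",
}
# Longer phrases come first so they win over shorter keywords.
ROLE_KEYWORDS = {
    "hedge fund manager": "fund manager",
    "trader": "trader",
    "doctor": "doctor",
}


def _first_match(text, table):
    """Return the value for the first key found as a whole word or phrase, else None."""
    lowered = text.lower()
    for key, value in table.items():
        if re.search(r"\b" + re.escape(key) + r"\b", lowered):
            return value
    return None


def extract_role_and_country(description):
    """Extract role and country with rules only; None marks a miss you can list and fix."""
    return {
        "role": _first_match(description, ROLE_KEYWORDS),
        "country": _first_match(description, COUNTRY_BY_PLACE),
    }
```

A miss comes back as None rather than a guess, so the unmatched rows can be counted, listed, and turned into new table entries or test cases.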

An LLM path: prompt it to extract both, call it for every row. It'll catch more edge cases. It'll also hallucinate roles for people who don't have one. It'll sometimes return "country: null" for Mumbai. It'll give you slightly different answers if you run it twice, or if someone changes the sampling temperature, or if the provider updates the model. And you're paying for it.

The honest answer is usually: start with rules. When rules fail systematically and visibly, add an LLM for the cases that break.
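One way to read "add an LLM for the cases that break" is a fallback: run the rule on every row and send only the misses to the model. The sketch below assumes the rule returns None on a miss; `country_rule`, `extract_with_fallback`, and the `llm_fn` argument are hypothetical names, and `llm_fn` stands in for whatever model call you would actually make.

```python
import re


def country_rule(text):
    """Toy rule: return a country if a known place name appears, else None."""
    known_places = {"mumbai": "India", "singapore": "Singapore"}  # hypothetical table
    lowered = text.lower()
    for place, country in known_places.items():
        if re.search(r"\b" + re.escape(place) + r"\b", lowered):
            return country
    return None


def extract_with_fallback(rows, rule_fn, llm_fn):
    """Apply rule_fn to every row; call llm_fn only where the rule returns None."""
    results = []
    for row in rows:
        answer = rule_fn(row)
        source = "rule"
        if answer is None:
            answer = llm_fn(row)  # only the rows the rules could not handle
            source = "llm"
        # Recording the source keeps LLM-derived answers visible downstream.
        results.append({"input": row, "answer": answer, "source": source})
    return results
```

Tagging each answer with its source lets you measure how much of the data the rules already cover and audit the LLM-derived share separately.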

When an LLM actually belongs

There are real cases where the input is genuinely fuzzy and no clean rule exists. These are an LLM's native home.

Unstructured text extraction is one: a customer complaint, a doctor's note, a research paper. The structure varies. The language is natural. You need semantic judgment, not pattern matching.

Messy classification with endless edge cases is another: is this news article about inflation a buy signal or a sell signal? Is this filing ambiguous enough to be a red flag? Does this company's narrative match its numbers? Rules get unwieldy. An LLM can handle the ambiguity.

Semantic judgment — "is this content spam?" "does this review sound authentic?" — lives here too. You need a model that reasons about meaning, not form.

Those cases exist. If that's genuinely your problem, an LLM might be the right tool. Not the only tool, but the right one.

The mistake is using an LLM where a CASE statement or a simple rule would do.

The four costs before you commit

Money. At $1 per million tokens, processing 1 million rows a day isn't free. The bill scales with tokens per row, so a few hundred tokens per row comes to a few hundred dollars a day. Add up the full cost of ownership for the column the LLM produces. Does the improvement justify it?

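Latency. An LLM call typically takes 500 milliseconds to several seconds. A rule takes a millisecond or less. If your pipeline has a deadline, multiply the per-call time by the row count, divide by the number of calls your rate limits let you run in parallel, and see whether it still fits.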
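The money and latency costs above are simple arithmetic, and it is worth writing them down before committing. This is a back-of-the-envelope sketch: the 200 tokens per row and 50 concurrent calls are assumptions chosen for illustration, and the function names are hypothetical.

```python
def estimate_daily_cost_usd(rows_per_day, tokens_per_row, usd_per_million_tokens):
    """Daily spend: total tokens processed, priced per million tokens."""
    return rows_per_day * tokens_per_row / 1_000_000 * usd_per_million_tokens


def estimate_wall_clock_hours(rows_per_day, seconds_per_call, concurrent_calls):
    """Hours to process one day's rows if calls run concurrent_calls at a time."""
    return rows_per_day * seconds_per_call / concurrent_calls / 3600


# 1 million rows a day at $1 per million tokens, assuming 200 tokens per row.
daily_cost = estimate_daily_cost_usd(1_000_000, 200, 1.0)  # 200.0 dollars a day

# The same rows at 0.5 seconds per call, assuming 50 calls in flight at once.
hours = estimate_wall_clock_hours(1_000_000, 0.5, 50)  # roughly 2.8 hours
```

Plug in your own token counts, price, and rate limits; the point is to see the order of magnitude before the step goes in.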

Non-determinism. The same input can yield different outputs. The provider can update the model. Sampling settings such as temperature can be changed. You'll find yourself re-running a pipeline and getting different numbers. That's usually unacceptable.

Reproducibility. When you re-run the pipeline six months later, the LLM's output may be different. This breaks historical comparisons, audit trails, and trust. If you ever need to explain why a decision was made on row 47,000 in March, and it's different now, you're in trouble.

These aren't disqualifiers. They're costs you need to quantify and approve before you add the step.

If you do add one: contain it

Put the LLM at the edge. Ingestion layer, enrichment layer — somewhere the rest of the pipeline can stay deterministic.

Make the step idempotent. Hash the input, cache the output. Same input always returns the same cached output. No re-computation, no drift.

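# Sketch: cache pattern for LLM reproducibility
import hashlib

# In-memory cache for illustration only; a real pipeline needs a persistent
# store so the cache survives between runs.
cache = {}

def classify_with_cache(text, model_fn):
    """Return the cached output for text, calling model_fn only on a cache miss."""
    # Key on a hash of the exact input so identical text maps to one entry.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = model_fn(text)
    return cache[key]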

Materialize the output. Write the LLM's decision to a table or file. The rest of your pipeline reads from that table, not from a live LLM call. A downstream re-run doesn't recompute.

This way, the LLM step is isolated. The rest of your pipeline stays reproducible. A re-run doesn't silently rewrite history.
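A minimal sketch of materializing the output, using SQLite from the Python standard library as a stand-in for whatever table store you use. The table name `llm_output`, its columns, and the functions `open_store`, `materialize`, and `read_decision` are hypothetical; `model_fn` is a placeholder for the real model call.

```python
import hashlib
import sqlite3


def _hash(text):
    """Stable key for an input: SHA-256 of its UTF-8 bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def open_store(path=":memory:"):
    """Open the store and create the output table if needed (pass a file path to persist)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS llm_output ("
        " input_hash TEXT PRIMARY KEY,"
        " input_text TEXT NOT NULL,"
        " output TEXT NOT NULL,"
        " model_version TEXT NOT NULL)"
    )
    return conn


def materialize(conn, text, model_fn, model_version):
    """Enrichment step: call the model once per distinct input and store the result."""
    key = _hash(text)
    row = conn.execute(
        "SELECT output FROM llm_output WHERE input_hash = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # already decided; do not recompute
    output = model_fn(text)
    conn.execute(
        "INSERT INTO llm_output VALUES (?, ?, ?, ?)",
        (key, text, output, model_version),
    )
    conn.commit()
    return output


def read_decision(conn, text):
    """Downstream read: looks only at the table and never calls the model."""
    row = conn.execute(
        "SELECT output FROM llm_output WHERE input_hash = ?", (_hash(text),)
    ).fetchone()
    return None if row is None else row[0]
```

Storing the model version next to each decision is a design choice in this sketch: the stored answer does not change when the model does, and the column records which model produced it if someone asks later.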

The honest lists

Where LLMs clearly belong:

  • Extracting structured data from unstructured documents
  • Classifying text where the rules are fuzzy or contextual
  • Summarizing or translating content
  • Semantic similarity or intent detection

Where they're a liability:

  • Anything a lookup table or CASE statement can handle
  • High-volume deterministic classification (category, status, type)
  • Simple text cleaning or normalization
  • Any step whose output must be reproducible run after run
  • Anything cost-sensitive at scale

The fashionable mistake right now is adding an LLM for deterministic work. The opposite mistake — refusing one where the input really is unstructured — is less common but still real.

The rule

Match the weight of the tool to the weight of the problem. An LLM is the heaviest tool in your box. Most problems are lighter than that.

Start deterministic. Be explicit about the costs. Contain the LLM at the edge if you add one. Cache its output so the rest of the pipeline stays reproducible.

That's it. Simple.

If you're building a data system and thinking through the architecture, we've helped dozens of teams sort these decisions. Read through our free data engineering book, or reach out if you want to talk through your specific case.
