Using an LLM to catch the data quality problems your tests can't

Your data tests work hard. A dbt not_null test catches missing addresses. A constraint catches duplicate IDs. A range check catches prices below zero. These are structural problems: the schema is broken, the format is wrong, the value is outside bounds.

Semantic problems are different. Your address column is populated and well-formatted. But is it a real address? Your product description is a string—but does it actually describe the product? Your summary is present—but is it faithful to the source article?

Tests can't answer those. An LLM can.

The gap between structural and semantic

I ran a 50-million-row product feed for a fintech platform. Our validation pipeline was tight: NOT NULL checks, regex for SKU format, price ranges, category constraints. Everything passed. The schema was clean.

But semantic problems were everywhere. Product names paired with mismatched categories. Descriptions that were template boilerplate, not real content. Prices that fell inside the allowed range but contradicted the product details. A regex can't catch those.

There's a layer of quality—meaning—that sits above schema. Tests don't reach it.

The practical judge: instructor + Pydantic

Use an LLM to surface candidates for review. Don't auto-correct. Don't auto-delete. Just tell a human, "row 47,382 looks suspect—check it."

Here's the shape:

import instructor
from pydantic import BaseModel

class Verdict(BaseModel):
    ok: bool
    reason: str

# Provider string is "provider/model": swap for "ollama/llama3.2",
# "anthropic/claude-haiku-4-5", etc. Built once and reused across calls.
client = instructor.from_provider("openai/gpt-4o-mini")

def judge_description(product_name: str, description: str) -> Verdict:
    """Ask the model: is this description relevant to the product?"""
    verdict = client.create(
        response_model=Verdict,
        messages=[
            {
                "role": "user",
                "content": f"""Does this description match the product?

Product: {product_name}
Description: {description}

Answer with only ok (true/false) and a one-line reason."""
            }
        ]
    )
    return verdict

# Test it
result = judge_description("iPhone 15", "A flagship smartphone with camera and processor")
print(result.ok, result.reason)  # e.g. True plus a short reason; exact wording varies by model and run

The response_model parameter tells instructor to parse the LLM's output into a Pydantic object. No string extraction, no brittle JSON parsing. It's structured.

Now sample your data instead of judging everything:

import pandas as pd

def audit_descriptions(df: pd.DataFrame, sample_size: int = 500) -> list:
    """Judge a random sample, report the failure rate, return suspect rows."""
    sample = df.sample(n=min(sample_size, len(df)))
    if sample.empty:
        print("Nothing to audit.")
        return []

    results = []
    for _, row in sample.iterrows():
        verdict = judge_description(row['product_name'], row['description'])
        results.append((row, verdict))

    failures = [row for row, v in results if not v.ok]
    fail_rate = 100 * len(failures) / len(sample)

    # Report the rows actually judged, which can be fewer than sample_size
    print(f"Sample size: {len(sample)}")
    print(f"Failures: {len(failures)} ({fail_rate:.1f}%)")

    return failures  # Hand to humans

Route the failures to a human. They review and decide whether it's bad data or a bad rule. The LLM is not the final judge. It surfaces. Humans decide.
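One way to do the routing is a plain CSV review queue. This is a minimal sketch, not a prescribed workflow: it assumes each flagged item has been turned into a dict that carries the judge's reason, and the helper name write_review_queue and the column names are hypothetical.

```python
import csv
import io

# Hypothetical column layout for the review file
REVIEW_COLUMNS = ["product_name", "description", "judge_reason", "human_decision"]

def write_review_queue(flagged, out) -> int:
    """Write flagged rows to a CSV for human review; return the number of rows written."""
    writer = csv.DictWriter(out, fieldnames=REVIEW_COLUMNS)
    writer.writeheader()
    for item in flagged:
        writer.writerow({
            "product_name": item["product_name"],
            "description": item["description"],
            "judge_reason": item["reason"],
            "human_decision": "",  # left blank: the reviewer fills this in
        })
    return len(flagged)

# Illustrative rows, not real data
flagged = [
    {"product_name": "iPhone 15", "description": "purple blob of infinite wonder",
     "reason": "Description does not describe a phone"},
    {"product_name": "USB-C cable", "description": "Lorem ipsum product text",
     "reason": "Template boilerplate"},
]

buffer = io.StringIO()  # stands in for an open file
written = write_review_queue(flagged, buffer)
print(buffer.getvalue())
```

The empty human_decision column keeps the judge's opinion and the reviewer's decision in separate fields, so the two never get confused downstream.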

Cost and sampling

LLM calls cost tokens. Two moves.

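First: sample, don't score everything. A random 500-row sample is enough to expose the pattern. If 7% of the sample fails, the full 50 million very likely has a failure rate close to 7%: at n=500 the 95% margin of error is roughly plus or minus 2 percentage points, provided the sample is drawn at random. You don't need to judge all rows.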
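To put a number on how far the sample rate can be from the true rate, a confidence interval helps. The sketch below uses the Wilson score interval, which is my choice for illustration and assumes a simple random sample; wilson_interval is a hypothetical helper name. For 35 failures out of 500 it gives a 95% interval of roughly 5% to 10%.

```python
import math
from typing import Tuple

def wilson_interval(failures: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a failure proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# 35 failures in a 500-row sample is a 7% observed failure rate
low, high = wilson_interval(35, 500)
print(f"Observed 7.0%, plausible range {low:.1%} to {high:.1%}")
```

If that range is too wide for the decision you need to make, that is the argument for a larger sample, not for judging every row.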

Second: cache by content. Same product name and description appearing in multiple rows? Judge once, reuse.

import hashlib

verdict_cache = {}

def judge_cached(product_name: str, description: str) -> Verdict:
    # Null-byte separator so ("a:b", "c") and ("a", "b:c") never share a key
    key = hashlib.md5(f"{product_name}\x00{description}".encode()).hexdigest()
    if key not in verdict_cache:
        verdict_cache[key] = judge_description(product_name, description)
    return verdict_cache[key]

And don't use a judge for what a free check already does. If the rule is "description must be longer than 20 characters," use len(description) > 20. No tokens. Judge only at the semantic layer where nothing cheaper works.
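A sketch of that ordering: run the free check first and only call the judge when it passes. The names cheap_check and check_row are hypothetical, and the judge is passed in as a plain callable (here a stub that counts calls) so the example runs without a model.

```python
from typing import Callable, Optional, Tuple

def cheap_check(description: str) -> Optional[str]:
    """Return a failure reason from free checks, or None if they all pass."""
    if len(description) <= 20:
        return "description is 20 characters or shorter"
    return None

def check_row(product_name: str, description: str,
              judge: Callable[[str, str], bool]) -> Tuple[bool, str]:
    """Apply free checks first; spend tokens on the judge only if they pass."""
    reason = cheap_check(description)
    if reason is not None:
        return False, reason  # rejected without an LLM call
    return judge(product_name, description), "judge verdict"

# Stub judge standing in for the real LLM call; counts how often it is used
judge_calls = 0
def stub_judge(product_name: str, description: str) -> bool:
    global judge_calls
    judge_calls += 1
    return True

short_result = check_row("iPhone 15", "A phone", stub_judge)
long_result = check_row("iPhone 15", "A flagship smartphone with camera and processor", stub_judge)
print(short_result, long_result, judge_calls)
```

Only the second row reaches the judge, so the short description costs nothing.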

The failure mode I learned from

Here's the catch: your judge can silently produce meaningless results if its output gets truncated.

Say you use a reasoning model (Claude with extended thinking, or Qwen with explicit reasoning chains) with max_tokens=512. The model spends most of that budget on reasoning, then tries to emit your JSON verdict, and the response gets cut off mid-JSON. The response_model parser can't extract the Verdict. Instructor retries and still fails. If your wrapper catches that error and substitutes None or a default, the failure disappears from view.

You don't notice. You compute "all rows passed" and push to production.
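The truncation itself is easy to reproduce without a model. This is a stdlib-only sketch, assuming the judge's raw response is a JSON string; parse_verdict is a hypothetical helper and not part of Instructor. The point is that a cut-off response should raise, never quietly turn into a default verdict.

```python
import json

def parse_verdict(raw: str) -> dict:
    """Parse a raw judge response strictly; raise rather than fall back to a default."""
    data = json.loads(raw)  # raises json.JSONDecodeError on truncated JSON
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    if not isinstance(data.get("ok"), bool) or not isinstance(data.get("reason"), str):
        raise ValueError("verdict needs a boolean 'ok' and a string 'reason'")
    return data

complete = '{"ok": false, "reason": "Description does not mention the product"}'
truncated = complete[:25]  # simulate a response cut off at the token limit

parsed = parse_verdict(complete)
try:
    parse_verdict(truncated)
    truncated_rejected = False
except json.JSONDecodeError:
    truncated_rejected = True

print(parsed["ok"], truncated_rejected)
```

Whatever wraps the real call should count these errors separately from genuine pass and fail verdicts.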

I've seen this. The signal was a suspiciously flat result: every row passed (or every row failed), or every verdict was None. That's not reality. That's a broken judge.
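That signal can be checked mechanically before any aggregate is trusted. A rough heuristic sketch, assuming verdicts have been collected as True, False, or None for a missing result; looks_degenerate and the min_rows threshold are hypothetical and should be tuned to your data.

```python
from typing import Optional, Sequence

def looks_degenerate(verdicts: Sequence[Optional[bool]], min_rows: int = 20) -> bool:
    """True if a batch is suspiciously flat: all missing, or all the same answer."""
    if len(verdicts) < min_rows:
        return False  # too few rows to call anything flat
    parsed = [v for v in verdicts if v is not None]
    if not parsed:
        return True  # every verdict is missing
    return len(set(parsed)) == 1  # every parsed verdict is identical

all_missing = looks_degenerate([None] * 50)
all_pass = looks_degenerate([True] * 50)
mixed = looks_degenerate([True] * 45 + [False] * 5)
print(all_missing, all_pass, mixed)
```

A flat batch is not proof of a broken judge, since clean data can legitimately pass everywhere, but it is a reason to rerun known good and bad cases before believing the number.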

Always validate before scale:

# Before you run 50,000 rows, test with known cases
good = judge_description("iPhone 15", "Flagship smartphone with advanced camera")
bad = judge_description("iPhone 15", "purple blob of infinite wonder")

# Print whatever came back, including None
print(f"Good: {good!r}")
print(f"Bad: {bad!r}")

# If both pass, both fail, or either is None: stop. Fix the judge first.
if good is not None and bad is not None and good.ok and not bad.ok:
    print("Judge looks sane. Proceed.")
else:
    print("Judge is broken. Debug before going further.")

Honest boundaries

The judge is an LLM. It's unreliable. It drifts. It can be swayed by formatting or prompt injection. Never auto-delete or auto-correct rows on a judge's verdict alone. Use it to flag candidates. Humans review.

It costs tokens. Each call is a real expense. Test on 100 rows first, calculate cost, then decide if sampling 1000 or 5000 makes sense for your budget.
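The arithmetic is simple enough to script. In this sketch the token counts and per-million-token prices are placeholders, not any provider's real rates: measure the tokens on your own 100-row test and plug in your provider's current pricing. The function name estimate_cost is hypothetical.

```python
def estimate_cost(rows: int, in_tokens_per_row: float, out_tokens_per_row: float,
                  usd_per_million_in: float, usd_per_million_out: float) -> float:
    """Estimate the cost in USD of judging `rows` rows at the given token usage and prices."""
    total_in = rows * in_tokens_per_row
    total_out = rows * out_tokens_per_row
    return (total_in * usd_per_million_in + total_out * usd_per_million_out) / 1_000_000

# Placeholder measurements and prices (assumptions, not real figures)
IN_TOKENS, OUT_TOKENS = 200, 40
PRICE_IN, PRICE_OUT = 1.0, 4.0

cost_1000 = estimate_cost(1000, IN_TOKENS, OUT_TOKENS, PRICE_IN, PRICE_OUT)
cost_5000 = estimate_cost(5000, IN_TOKENS, OUT_TOKENS, PRICE_IN, PRICE_OUT)
print(f"1000 rows: ${cost_1000:.2f}, 5000 rows: ${cost_5000:.2f}")
```

Reasoning models can emit far more output tokens than the verdict itself, so take the per-row output figure from real usage rather than from the size of the JSON.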

It's not your final authority. It's one lens. The decision—which rule to judge, how many to sample, what to escalate—is still yours.

The real anti-pattern

Don't LLM-judge what schema and constraints already catch. Don't judge every row because it feels more accurate. Don't assume a flat score is real data.

Use the judge where it matters: at the semantic layer, for human review, on a sample. It's a tool for surfacing, not solving.

If you're building data pipelines and want to think about where LLMs actually help (and where they don't), the book I wrote covers local-first, cost-conscious approaches to LLM use. Or get in touch if you're dealing with similar semantic quality problems across large datasets—we work on this.