Generating realistic test data with a local model

The cheapest way to harden a pipeline is to feed it the ugly rows before production does. Random data will find some bugs, but sometimes you need semantic coherence: a support ticket that reads like a real one, a product description with realistic length and tone. That is when a local model becomes useful, and only then.

Start with Faker and Hypothesis

Before spinning up an LLM, reach for what you already have.

Faker generates structured fakes: names, emails, addresses, phone numbers, dates. It is fast, runs offline, and is reproducible once you seed it. For most columns (IDs, emails, timestamps) Faker is the right tool.

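from faker import Faker

Faker.seed(42)  # fix the seed so every run produces the same records
fake = Faker()

# Collect the records instead of discarding them on each iteration
records = [
    {
        "user_id": fake.uuid4(),
        "email": fake.email(),
        "phone": fake.phone_number(),
    }
    for _ in range(100)
]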

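Hypothesis does property-based testing. You describe the valid input space as strategies and state a property that must hold; Hypothesis then generates many inputs (100 per test by default), leans toward boundary values, and shrinks any failure to a minimal reproducing example.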

from hypothesis import given, strategies as st

# register_user stands in for your own function under test.
@given(email=st.emails(), age=st.integers(min_value=0, max_value=150))
def test_registration(email, age):
    assert register_user(email, age).success

Both are simple and fast. Use them unless you need semantic coherence.

When semantic coherence matters

Faker generates independent columns. If you need fields that hang together — a city name matching a postal code, a ticket title describing the actual problem — you need semantic understanding. That's where a local model helps.
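
To make the independence problem concrete, here is a minimal sketch using only the standard library. The lookup table and the helper names (`independent_row`, `coherent_row`, `is_coherent`) are illustrative assumptions, not part of Faker. It samples city and postal code separately, then as a pair, and counts how often the fields disagree.

```python
import random

# Small sample table of (city, postal code) pairs that belong together.
CITY_POSTCODES = [
    ("Berlin", "10115"),
    ("Paris", "75001"),
    ("Madrid", "28001"),
    ("Rome", "00118"),
]

def independent_row(rng: random.Random) -> dict:
    """Sample each column on its own, the way independent providers do."""
    city = rng.choice([c for c, _ in CITY_POSTCODES])
    postcode = rng.choice([p for _, p in CITY_POSTCODES])
    return {"city": city, "postcode": postcode}

def coherent_row(rng: random.Random) -> dict:
    """Sample the pair as one unit so the fields always agree."""
    city, postcode = rng.choice(CITY_POSTCODES)
    return {"city": city, "postcode": postcode}

def is_coherent(row: dict) -> bool:
    return (row["city"], row["postcode"]) in CITY_POSTCODES

rng = random.Random(0)  # seeded so the run is repeatable
independent = [independent_row(rng) for _ in range(1000)]
coherent = [coherent_row(rng) for _ in range(1000)]

mismatched = sum(not is_coherent(r) for r in independent)
print(f"independent sampling: {mismatched} of 1000 rows mismatched")
print(f"paired sampling: {sum(not is_coherent(r) for r in coherent)} of 1000 rows mismatched")
```

When the relationship can be enumerated like this, a lookup table is enough and no model is needed. The model earns its place when the relationship lives in free text, such as a title that has to describe the same problem as the description.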

Define your schema as a Pydantic model, ask the model to generate records, let Pydantic validate. No OpenAI needed. A local model via Ollama stays on your machine. No real data leaves. No API cost.

Schema-constrained generation with instructor

Here's a support ticket schema:

from pydantic import BaseModel, Field, field_validator

# Field descriptions are hints passed to the model through the JSON schema.
# Only the validators below are actually enforced.
class SupportTicket(BaseModel):
    ticket_id: str = Field(description="Unique ticket identifier (e.g., TKT-1234)")
    title: str = Field(description="Issue title in 10–100 characters")
    description: str = Field(description="Detailed problem statement, 50–500 words")
    customer_email: str = Field(description="Valid email address")
    priority: str = Field(description="One of: low, medium, high, critical")
    status: str = Field(description="One of: open, in_progress, resolved, closed")
    
    @field_validator('priority')
    @classmethod
    def validate_priority(cls, v):
        if v not in ['low', 'medium', 'high', 'critical']:
            raise ValueError('priority must be low, medium, high, or critical')
        return v
    
    @field_validator('status')
    @classmethod
    def validate_status(cls, v):
        if v not in ['open', 'in_progress', 'resolved', 'closed']:
            raise ValueError('status must be open, in_progress, resolved, or closed')
        return v

Now use instructor to ask a local model to generate 20 valid records:

import instructor

# Runs against a local model via Ollama — nothing leaves your machine.
# Pull it first: `ollama pull llama3.2`. The string is "provider/model".
client = instructor.from_provider("ollama/llama3.2")

# Generate a stream of validated tickets
prompt = """
Generate 20 realistic support tickets. 
Each ticket should have a believable title and description, 
with the priority and status matching the content 
(e.g., a critical ticket should have urgent language in the description).
"""

tickets = client.create_iterable(
    response_model=SupportTicket,
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ]
)

# Iterate and collect
generated_tickets = []
for ticket in tickets:
    print(f"Generated ticket: {ticket.ticket_id} - {ticket.title}")
    generated_tickets.append(ticket)

print(f"Total generated: {len(generated_tickets)}")

Instructor wraps your LLM calls and validates the output against your Pydantic schema. If the model returns invalid JSON or a field fails validation, instructor can retry (up to the configured max_retries) with the validation error included in the follow-up message, prompting the model to correct itself.
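
The retry idea is easier to see with the model taken out of the picture. The following is a sketch of a validate-and-retry loop, not instructor's actual implementation: `stub_model`, `validate`, and `generate_with_retries` are hypothetical names, and the stub simply returns a bad value until it sees the error message in the conversation.

```python
import json

ALLOWED_PRIORITIES = {"low", "medium", "high", "critical"}

def validate(raw: str) -> dict:
    """Parse JSON and check one field; raise ValueError with a readable message."""
    record = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if record.get("priority") not in ALLOWED_PRIORITIES:
        raise ValueError(f"priority must be one of {sorted(ALLOWED_PRIORITIES)}")
    return record

def stub_model(messages: list[dict]) -> str:
    """Hypothetical stand-in for a model call: wrong first, right after feedback."""
    saw_feedback = any("priority must be" in m["content"] for m in messages)
    priority = "high" if saw_feedback else "urgent"
    return json.dumps({"ticket_id": "TKT-1", "priority": priority})

def generate_with_retries(prompt: str, max_retries: int = 2) -> tuple[dict, int]:
    """Return the first valid record and the number of attempts it took."""
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_retries + 1):
        raw = stub_model(messages)
        try:
            return validate(raw), attempt + 1
        except ValueError as err:
            # Feed the rejected output and the error back for the next attempt.
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user", "content": f"Validation failed: {err}. Try again."})
    raise RuntimeError("model never produced a valid record")

record, attempts = generate_with_retries("Generate one support ticket.")
print(record, "after", attempts, "attempts")
```

The point of the sketch is the feedback step: the error text goes back into the conversation, so a clear validator message doubles as a correction prompt.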

Adversarial edge cases

More useful than happy-path data: deliberately generating the rows that crash pipelines at 3am.

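Ask the model to break things. Use a looser schema for this: SupportTicket's validators reject invalid priorities, so validation would fail on exactly the rows you asked for.

from pydantic import BaseModel

class RawTicket(BaseModel):
    # Same fields as SupportTicket, but no validators: garbage is allowed through.
    ticket_id: str
    title: str
    description: str
    customer_email: str
    priority: str
    status: str

edge_prompt = """
Generate 30 edge-case support tickets designed to break a pipeline.
Include: titles that are empty or 10k chars, emails missing @, 
priority fields with invalid values, mixed encodings, control characters.
These test that the pipeline handles garbage gracefully.
"""
edge_cases = client.create_iterable(
    response_model=RawTicket,
    messages=[{"role": "user", "content": edge_prompt}]
)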

for ticket in edge_cases:
    try:
        result = my_pipeline.process(ticket.model_dump())
        print(f"✓ Handled: {ticket.ticket_id}")
    except Exception as e:
        print(f"✗ Failed: {e}")

This surfaces missing empty-value checks, encoding issues, unescaped regex input, and length off-by-one bugs that Faker's well-formed output rarely triggers.
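
A model may not reliably produce some of these on request, a literal 10,000-character title for instance, so it can help to pair the generated cases with a few hand-built ones. The sketch below is one way to do that with the standard library. The value lists and the `adversarial_tickets` helper are illustrative assumptions; the field names follow the ticket schema above.

```python
import itertools

# Hand-written hostile values per field; extend these for your own pipeline.
BAD_TITLES = ["", " ", "A" * 10_000, "line\nbreak", "null\x00byte", "tab\tseparated"]
BAD_EMAILS = ["", "no-at-sign.example.com", "two@@example.com", "user@", "üser@example.com"]
BAD_PRIORITIES = ["", "URGENT", "Critical ", None, 3]

def adversarial_tickets():
    """Yield plain dicts with one hostile field each, so a failure points at its cause."""
    base = {
        "ticket_id": "TKT-0000",
        "title": "Cannot log in",
        "description": "Login fails with a 500 error.",
        "customer_email": "user@example.com",
        "priority": "low",
        "status": "open",
    }
    fields = {
        "title": BAD_TITLES,
        "customer_email": BAD_EMAILS,
        "priority": BAD_PRIORITIES,
    }
    counter = itertools.count(1)
    for field, values in fields.items():
        for value in values:
            # Override exactly one field and give each case its own ID.
            yield {**base, "ticket_id": f"TKT-{next(counter):04d}", field: value}

cases = list(adversarial_tickets())
print(len(cases), "adversarial tickets")
```

Because these are plain dicts rather than validated models, they can carry values such as `None` or an integer priority that a typed schema would refuse to hold.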

Use it in fixtures

Save the generated data to a test fixture file:

import json

# Generate once, save forever
all_tickets = list(client.create_iterable(...))

# Serialize to JSON
with open("tests/fixtures/support_tickets.json", "w") as f:
    json.dump(
        [t.model_dump() for t in all_tickets],
        f,
        indent=2,
        default=str
    )

# In your test
def test_pipeline_with_realistic_data():
    with open("tests/fixtures/support_tickets.json") as f:
        tickets = json.load(f)
    
    for ticket_dict in tickets:
        result = my_pipeline.process(ticket_dict)
        assert result.success, f"Failed on {ticket_dict['ticket_id']}"

The generation happens once. Tests run against the same data every time, so they're deterministic and reproducible.
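
Since regenerating would give different records, it may be worth storing how the fixture was produced next to the data. The following sketch assumes a wrapper layout with `meta` and `records` keys and two hypothetical helpers, `save_fixture` and `load_fixture`; none of this comes from instructor or pytest. Sorted keys keep diffs small if the file is ever regenerated.

```python
import json
import tempfile
from datetime import date
from pathlib import Path

def save_fixture(path: Path, records: list[dict], model: str, prompt: str) -> None:
    """Write records plus the settings that produced them, with stable key order."""
    payload = {
        "meta": {
            "model": model,
            "prompt": prompt,
            "generated_on": date.today().isoformat(),
        },
        "records": records,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(payload, indent=2, sort_keys=True, default=str),
        encoding="utf-8",
    )

def load_fixture(path: Path) -> list[dict]:
    """Return only the records; tests do not need the metadata."""
    return json.loads(path.read_text(encoding="utf-8"))["records"]

# Demo in a temporary directory with one hand-written record.
with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "fixtures" / "support_tickets.json"
    save_fixture(
        fixture,
        [{"ticket_id": "TKT-1", "title": "Cannot log in"}],
        model="llama3.2",
        prompt="Generate 20 realistic support tickets.",
    )
    loaded = load_fixture(fixture)
    print(loaded)
```

In a real project the file would live under version control, so a change to the test data shows up in review like any other change.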

Honest boundaries

Synthetic data has real limits.

It doesn't match real distributions. A model can generate a plausible email, but it won't respect the actual frequency of Gmail vs corporate domains in your user base. Never use synthetic data to benchmark performance, train ML models, or estimate latencies as if it were real.
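
If you have aggregate counts from production, the gap can be measured rather than guessed. This is a small sketch under that assumption: `domain_shares` and `total_variation` are hypothetical helpers, and both email samples are made up purely for illustration.

```python
from collections import Counter

def domain_shares(emails: list[str]) -> dict[str, float]:
    """Return each email domain's share of the sample."""
    domains = [e.rsplit("@", 1)[-1].lower() for e in emails if "@" in e]
    counts = Counter(domains)
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()}

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two share tables: 0 is identical, 1 is disjoint."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

# Made-up samples for illustration only.
synthetic = ["a@example.com", "b@example.com", "c@example.com", "d@gmail.com"]
production = ["a@gmail.com", "b@gmail.com", "c@gmail.com", "d@acme.test"]

gap = total_variation(domain_shares(synthetic), domain_shares(production))
print(f"total variation distance: {gap:.2f}")
```

A large distance is a reminder that the synthetic set exercises code paths, not realistic load or realistic proportions.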

It encodes the model's biases. If you ask a model to generate customer names, the output will skew toward names the model saw most during training. It won't match your actual customer demographics. Again: for testing plumbing and edge cases only, not for fidelity.

Faker and Hypothesis solve most problems. If you're generating IDs, emails, phone numbers, or random dates, you don't need a model. A model adds latency, an HTTP round trip to the local server, and complexity. Reach for it only when you need semantic coherence, meaning records whose fields hang together, or when you want to deliberately generate weird edge cases that pure randomness won't surface.

The goal is to surface bugs before users do. A local model generates realistic test records offline and deliberately breaks your pipeline with edge cases. But start with Faker. Only spin up a model when you need semantic coherence.
