Running a local model over millions of rows without it falling over
By Arshad Ansari
You've built a prompt that works. Tested it on 100 rows in your notebook. Got the output format right. Now your manager says "run this over 5 million rows and have it done by tomorrow."
The jump from notebook to millions is not about a bigger model. It's about throughput and not losing work. Get batching, checkpointing, and idempotency right, and a single GPU finishes overnight. Get them wrong and you're debugging hung processes and re-running four hours of work after a crash.
The naive loop kills throughput
Most people's first pass at scale looks like this:
results = []
for row in data:
    response = model(row.text)  # One call per row
    results.append(response)
If your model runs on a GPU and each inference takes 0.5 seconds, 5 million rows take 2.5 million seconds, or about 29 days. The GPU sits idle between calls, and a single prompt uses only a sliver of it while it runs. The CPU waits. Everything waits.
The fix is batching. Instead of one prompt at a time, pack many prompts into a single inference call and let the GPU handle them together.
batch_size = 64
for batch_start in range(0, len(data), batch_size):
    batch = data[batch_start:batch_start + batch_size]
    prompts = [row.text for row in batch]
    responses = model.generate(prompts, sampling_params)  # One call, many rows
    results.extend(responses)
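If a batch of 64 takes about as long as a single call, those same 5 million rows now take 29 days ÷ 64, or about 11 hours. That is the best case, since bigger batches do take somewhat longer per call, but the GPU is busy almost the entire time. One machine, one GPU, overnight.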
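The arithmetic behind both estimates fits in a few lines. This is a back-of-the-envelope sketch, not a benchmark: `estimated_hours` is a name made up for illustration, and it assumes the idealised case where a call costs the same 0.5 seconds whether it carries 1 row or 64.

```python
import math

def estimated_hours(num_rows: int, seconds_per_call: float, batch_size: int = 1) -> float:
    """Best-case wall-clock hours, assuming call time does not grow with batch size."""
    num_calls = math.ceil(num_rows / batch_size)
    return num_calls * seconds_per_call / 3600

naive = estimated_hours(5_000_000, 0.5)         # one row per call
batched = estimated_hours(5_000_000, 0.5, 64)   # 64 rows per call

print(f"naive:   {naive / 24:.1f} days")    # naive:   28.9 days
print(f"batched: {batched:.1f} hours")      # batched: 10.9 hours
```

Swap in your own measured seconds per call at each batch size and the estimate gets a lot more honest.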
Batching without stopping for breath
If you're using vLLM (which handles batching for you), the offline inference API is straightforward:
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-2-7b-hf", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = [text1, text2, text3, ...]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
vLLM schedules the batch internally, keeps the GPU fed, and returns results when they're all done. You don't think about tokens or scheduling; you hand it a list and get a list back.
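Batch size depends on your GPU memory. As rough starting points, a 7B model on an A100 (40GB) handles batch size 64–128 comfortably, a 13B model might need batch size 32, and an RTX 4090 (24GB) is more like 8–16. Start at 64, watch memory with nvidia-smi and rows per second in your own logs, and scale up or down. One caveat: vLLM reserves the gpu_memory_utilization fraction of the card at startup, so nvidia-smi reads about 90% from the first batch onward. With vLLM, throughput is the more useful signal.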
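Before measuring, you can get a rough ceiling from the model's shape. The sketch below is an estimate under stated assumptions, not how any particular engine allocates memory: 16-bit weights and cache, standard multi-head attention (one key and one value vector per layer per token), a Llama-2-7B-like shape of 32 layers and hidden size 4096, and no allowance for activations or framework overhead. The function name and the 768-tokens-per-row figure are made up for illustration.

```python
def max_concurrent_sequences(
    gpu_gb: float,
    params_billion: float,
    num_layers: int,
    hidden_size: int,
    tokens_per_sequence: int,
    bytes_per_value: int = 2,       # fp16 / bf16
    memory_fraction: float = 0.9,   # mirrors gpu_memory_utilization=0.9
) -> int:
    """Rough upper bound on sequences whose KV cache fits next to the weights."""
    budget = gpu_gb * 1e9 * memory_fraction
    weights = params_billion * 1e9 * bytes_per_value
    # Each token stores one key and one value vector in every layer.
    kv_per_token = 2 * num_layers * hidden_size * bytes_per_value
    kv_per_sequence = kv_per_token * tokens_per_sequence
    return max(0, int((budget - weights) // kv_per_sequence))

# Assumed workload: about 512 prompt tokens plus 256 generated tokens per row.
a100 = max_concurrent_sequences(40, 7, 32, 4096, tokens_per_sequence=768)
rtx4090 = max_concurrent_sequences(24, 7, 32, 4096, tokens_per_sequence=768)
print(a100, rtx4090)
```

Real numbers will come in lower because of the overhead this ignores, so treat the result as a ceiling to start below, then measure.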
Checkpoints and idempotency save time
Running 5 million rows for 11 hours and crashing at hour 10 is a different kind of pain. You either run all 11 hours again or figure out which 4.5 million or so rows you already finished and resume from there.
The pattern is simple: before processing a batch, check what you've already done.
import json
from pathlib import Path
checkpoint_dir = Path("checkpoints")
checkpoint_dir.mkdir(exist_ok=True)
checkpoint_file = checkpoint_dir / "processed_ids.json"
# Load what we've already done
processed_ids = set()
if checkpoint_file.exists():
    with open(checkpoint_file) as f:
        processed_ids = set(json.load(f))
for batch_start in range(0, len(data), batch_size):
    batch = data[batch_start:batch_start + batch_size]
    # Skip rows we've already processed
    batch = [row for row in batch if row.id not in processed_ids]
    if not batch:
        continue
    prompts = [row.text for row in batch]
    outputs = llm.generate(prompts, sampling_params)
    # Write results incrementally (don't wait until the end)
    with open(f"results_batch_{batch_start}.jsonl", "a") as f:
        for row, output in zip(batch, outputs):
            result = {
                "id": row.id,
                "text": output.outputs[0].text
            }
            f.write(json.dumps(result) + "\n")
    # Update checkpoint only after the results are on disk
    processed_ids.update(row.id for row in batch)
    with open(checkpoint_file, "w") as f:
        json.dump(list(processed_ids), f)
Key moves:
- Stable IDs: Each row has an immutable identifier. Use a database primary key or hash of the raw input, not the array index.
- Skip already done: Before processing, filter out rows in processed_ids. This is idempotency: running the job twice with the same input is safe.
- Write incrementally: Don't accumulate everything in memory and flush at the end. Write each batch to disk immediately. If the process dies, you keep what you wrote.
- Checkpoint separately: Keep a simple JSON file of the IDs you've processed. Load it on restart and skip those rows.
This survives crashes. A crash at hour 10 costs you at most the one batch that was in flight, which gets recomputed. On restart, you reload the checkpoint, skip the roughly 4.5 million rows already done, and process the remaining 450,000 or so in about an hour.
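Two of those moves are easy to get subtly wrong, so here is one way they might look. This is a sketch with hypothetical helper names, and it swaps the single JSON file for an append-only log (one ID per line) so that each batch writes only its own IDs instead of the whole set. It assumes fixed-length IDs like the hashes below, so a torn line left by a crash can never equal a real ID.

```python
import hashlib
import os
from pathlib import Path

def stable_row_id(text: str) -> str:
    """Derive an ID from the raw input so it is the same on every run."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def load_processed_ids(path: Path) -> set[str]:
    """Read one ID per line; a missing file means nothing has been processed yet."""
    if not path.exists():
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def append_processed_ids(path: Path, ids: list[str]) -> None:
    """Append this batch's IDs and force them to disk before moving on."""
    with open(path, "a", encoding="utf-8") as f:
        f.writelines(f"{row_id}\n" for row_id in ids)
        f.flush()
        os.fsync(f.fileno())
```

Note that hashing the input means two rows with identical text share an ID and get processed once. If duplicates need separate results, mix a source key into the hash.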
Structured output avoids cleanup debt
Free-text generation is a gift to downstream debugging. "The model said the sentiment was ' happy ' sometimes and 'Happy' other times." Now your data pipeline has to guess or fail.
Use constrained or guided decoding to force structured output.
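from vllm import LLM, SamplingParams
# Assumes a vLLM release that ships GuidedDecodingParams; other releases name this differently
from vllm.sampling_params import GuidedDecodingParams
# A JSON Schema: three allowed labels, confidence between 0 and 1
schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
    },
    "required": ["sentiment", "confidence"],
}
llm = LLM(model="...", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(
    temperature=0.0, max_tokens=256, guided_decoding=GuidedDecodingParams(json=schema)
)
prompts = [text1, text2, ...]
outputs = llm.generate(prompts, sampling_params)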
This constrains the model to output only JSON that matches your schema. As long as generation finishes before max_tokens cuts it off, every output is parseable; no cleaning. (The exact API, whether it's guided_json, guided_choice, or a grammar, varies by vLLM version and backend, so check your docs, but the idea is the same.)
Structured output does have a cost: the model can't deviate from the schema, even if it's the right move. If your schema doesn't fit the task, the outputs get subtly distorted or truncated to fit. Build your schema to be permissive enough that the model has room to think.
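Even with constrained decoding, it is cheap to check each record before it goes downstream, since an output that hits max_tokens can stop mid-object. A minimal sketch, assuming the sentiment and confidence fields from the example above (`parse_result` and `ALLOWED_SENTIMENTS` are names invented here):

```python
import json
from typing import Optional

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}

def parse_result(raw: str) -> Optional[dict]:
    """Return the parsed record if it has the expected shape, else None."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None  # e.g. output cut off before the closing brace
    if not isinstance(record, dict):
        return None
    sentiment = record.get("sentiment")
    confidence = record.get("confidence")
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        return None
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return record
```

Rows that come back as None can go to a separate file for a retry with a larger max_tokens or a look by hand.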
Handle failures gracefully
Not every row succeeds. A prompt might be malformed, too long for the model, or hit a transient error.
try:
    outputs = llm.generate(prompts, sampling_params)
except ValueError as e:
    # Prompt format error or length issue
    print(f"Batch failed: {e}")
    # Process one row at a time to isolate the bad one
    for row in batch:
        try:
            output = llm.generate([row.text], sampling_params)
            # Handle success
        except ValueError:
            # Log and skip the row
            print(f"Row {row.id} is unparseable, skipping")
            continue
This is defensive but cheap: 99.8% of your rows will succeed in bulk; the slow path is just for the 0.2% that break the mold.
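The same fallback can be wrapped in a helper that returns failures instead of only printing them, which makes it testable without a GPU. This is a sketch: `generate_with_fallback` and `fake_generate` are hypothetical names, and `generate` stands in for whatever call produces one output string per prompt.

```python
from typing import Callable, Sequence

def generate_with_fallback(
    prompts: Sequence[str],
    generate: Callable[[list[str]], list[str]],
) -> tuple[dict[int, str], dict[int, str]]:
    """Try the whole batch; on ValueError, retry row by row.

    Returns (results, errors), both keyed by position in `prompts`.
    """
    try:
        outputs = generate(list(prompts))
        return dict(enumerate(outputs)), {}
    except ValueError:
        pass  # fall through to the slow path

    results: dict[int, str] = {}
    errors: dict[int, str] = {}
    for i, prompt in enumerate(prompts):
        try:
            results[i] = generate([prompt])[0]
        except ValueError as exc:
            errors[i] = str(exc)  # keep the reason instead of dropping the row silently
    return results, errors

def fake_generate(batch: list[str]) -> list[str]:
    # Stand-in for the model call: rejects empty prompts.
    if any(not p for p in batch):
        raise ValueError("empty prompt")
    return [p.upper() for p in batch]

results, errors = generate_with_fallback(["good", "", "fine"], fake_generate)
print(results)  # {0: 'GOOD', 2: 'FINE'}
print(errors)   # {1: 'empty prompt'}
```

Writing the errors to their own file gives you a list of rows to inspect later, rather than a scrollback of print statements.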
Also cap your batch size to your VRAM. If you run out of memory, the job usually dies with a CUDA out-of-memory error, or the engine starts evicting and recomputing in-flight requests and throughput collapses. Monitor with nvidia-smi during your first few batches and dial it in. Better to finish in 12 hours at batch 32 than hang at batch 128.
Honest boundaries
Throughput is not correctness. Running a flawed prompt over 5 million rows doesn't make it good; it multiplies the flaw. You still need a validation or judge step afterward—perhaps sampling 1000 outputs and checking them manually, or using a smaller judge model to score quality.
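Pulling that review sample does not require loading 5 million results into memory. One option is reservoir sampling, which picks k records uniformly in a single pass over the result lines. A sketch, assuming JSONL results like the ones written earlier (`sample_for_review` is a made-up name):

```python
import json
import random
from typing import Iterable

def sample_for_review(lines: Iterable[str], k: int = 1000, seed: int = 0) -> list[dict]:
    """Reservoir-sample k records from a stream of JSONL lines in one pass."""
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    records = (json.loads(line) for line in lines if line.strip())
    reservoir: list[dict] = []
    for n, record in enumerate(records):
        if n < k:
            reservoir.append(record)   # fill the reservoir first
        else:
            j = rng.randint(0, n)      # inclusive on both ends
            if j < k:
                reservoir[j] = record  # replace with probability k / (n + 1)
    return reservoir
```

In use, pass it an open file handle (or several chained together with itertools.chain) and hand the resulting list to whoever is doing the manual check.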
GPU memory caps both batch size and model size. You're not moving the constraint, just sliding the dial. A 70B model won't fit on a single 24GB card, no matter how well you batch. A 7B model can finish in one night on a 4090; a 13B model, at about 26GB of 16-bit weights, needs quantization just to fit on that card and can take three nights.
Constrained decoding can distort outputs. If your schema forces all sentiments into three buckets, a nuanced answer gets hammered into the nearest bucket. Schema design matters.
None of this fixes a bad prompt. Batching scales whatever you already have—good or bad.
Don't over-engineer for scale
Resist the urge to reach for Spark, Ray, or a distributed cluster. A single machine with one GPU and these patterns—batching, checkpointing, idempotency, and incremental writes—covers far more than people assume. 5 million rows in 11 hours is not a distributed-systems problem. It's a patience problem.
Scale out when one machine truly can't keep up: when you have 100 million rows and need it done in an hour, or when you're running inference every few seconds and need low latency. Until then, the setup is simpler, the debugging is faster, and the cost is lower.
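To put rough numbers on that threshold, compare the throughput a deadline demands with what one GPU delivers. The sketch below reuses the best-case rate from earlier (64 rows every 0.5 seconds); `gpus_needed` is a name made up for illustration, and it ignores startup time and uneven load.

```python
import math

def gpus_needed(num_rows: int, deadline_hours: float, rows_per_second_per_gpu: float) -> int:
    """GPUs required to hit the deadline, assuming throughput scales linearly."""
    required_rate = num_rows / (deadline_hours * 3600)
    return math.ceil(required_rate / rows_per_second_per_gpu)

single_gpu_rate = 64 / 0.5  # 128 rows per second, best case

print(gpus_needed(5_000_000, 11, single_gpu_rate))    # 1
print(gpus_needed(100_000_000, 1, single_gpu_rate))   # 218
```

The first job is an overnight run on one card. The second is a real cluster problem.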
If you're building this, start with one batch on your laptop. Get checkpointing working. Then push it to a GPU machine and let it run. A working, boring pipeline that finishes overnight beats a clever one that you're still debugging next week.
For more on the math of local models versus APIs, see Local model or paid API—the honest math. For validating quality at scale, check LLM as judge for data quality.
If you're building a data pipeline and need help thinking through these tradeoffs, reach out.