Local model or paid API? The honest math for data work

You're sitting in a standup. Someone says, "We could just use GPT for that." Another voice: "No way, it's cheaper to run Llama locally." Both people are right, and both might be wrong. The decision isn't a preference—it's arithmetic.

For chatbot-style work, the hosted API is usually correct. You don't know when your users will ask questions. You don't want to manage infrastructure. You pay a few cents per conversation.

But if you're doing data work—classifying a million rows, generating embeddings for a closed dataset, running inference repeatedly over the same pipeline—the math changes. That's when an open model on your own GPU starts winning. Not always. But often enough that the question deserves a real calculation, not cargo-culting.

The two real paths

Hosted API. You send text to someone else's servers. They run the model, they bill you per token, you get a result in milliseconds. Claude, GPT-4, Qwen via Together.AI—all the same model. Zero infrastructure. You pay for convenience and consistency.

Cost per token is fixed. You don't know your volume in advance, and you don't care—you pay as you go.

Self-hosted open weights. You rent or own a GPU. You run something like Llama, Qwen, or Mistral on your own hardware via vLLM or Ollama. You own the latency, the throughput, the data (it never leaves your network). You pay for the compute, whether you use it or not. You also pay in ops—keeping the thing running, managing batches, handling GPU memory.

What actually decides the choice

Volume. How many rows? How many tokens per row (input + output)? A hundred rows is an API call. A million rows is a different conversation.

Frequency. Does the job run once, or every night? A one-off analysis stays with the API. A daily batch job might move to local.

Latency. Do you need an answer in 50ms, or is 5 seconds fine? APIs are faster for spiky, low-volume work. Local GPUs batch better.

Privacy. Does your data leave your office? If you have contractual or regulatory constraints, local wins by default, even if the API is cheaper.

Your tolerance for ops. Running a GPU means monitoring it. Restarting when it locks up. Tuning batch sizes. This doesn't show up in the pricing formula, but it's real.

The honest math

Let's calculate. Given:

Your dataset: N rows
Average tokens per row (input + output): T
How often you run the job: F times per month

API cost: (N × T × F) tokens × (price per 1M tokens) / 1,000,000

Local GPU cost: (GPU hourly rate) × (compute hours to process N rows) × F + setup/ops overhead

Here's a Python calculator you can run right now with realistic 2026 numbers:

def compare_costs(rows, tokens_per_row, runs_per_month):
    """
    Compare API vs local GPU cost.
    
    Assumptions (illustrative 2026 figures):
    - API: Llama 3.3 70B via Together.AI at $1.04/1M tokens (in+out)
    - GPU: H100 rental at $6.50/hour; vLLM batch mode ~500 tokens/sec
    - Local: run only when needed, batched efficiently
    """
    
    # API cost: pay per token every single run
    api_price_per_million = 1.04  # Together.AI Llama 3.3 70B
    total_tokens = rows * tokens_per_row * runs_per_month
    api_cost = (total_tokens / 1_000_000) * api_price_per_million
    
    # Local GPU cost: pay for time actually using it
    gpu_hourly_rate = 6.50  # H100 rental
    tokens_per_second_batch = 500  # vLLM batch achieves this
    
    # One run processes all rows once
    total_tokens_one_run = rows * tokens_per_row
    seconds_per_run = total_tokens_one_run / tokens_per_second_batch
    hours_per_run = seconds_per_run / 3600
    
    # Monthly: only pay when running + small ops overhead
    gpu_cost_monthly = hours_per_run * gpu_hourly_rate * runs_per_month
    overhead_per_run = 0.5  # 30min setup/monitoring per run
    overhead_cost = overhead_per_run * gpu_hourly_rate * runs_per_month
    
    local_cost = gpu_cost_monthly + overhead_cost
    
    # Results
    print(f"Dataset: {rows:,} rows, {tokens_per_row} tokens/row, {runs_per_month} runs/month")
    print(f"Total tokens/run: {total_tokens_one_run:,}")
    print()
    print(f"API cost (Together.AI):       ${api_cost:,.2f}/month")
    print(f"Local GPU cost (H100, vLLM):  ${local_cost:,.2f}/month")
    print(f"  ({hours_per_run:.1f} hrs/run × {runs_per_month} runs × ${gpu_hourly_rate}/hr + overhead)")
    print()
    
    if api_cost > local_cost:
        savings = api_cost - local_cost
        pct = (savings / api_cost) * 100
        print(f"✓ Local wins by ${savings:,.2f}/month ({pct:.0f}% savings)")
    else:
        overage = local_cost - api_cost
        pct = (overage / api_cost) * 100
        print(f"✗ API wins by ${overage:,.2f}/month ({pct:.0f}% cheaper)")
    print()
    return api_cost, local_cost

# Example 1: Small volume, spiky use
print("=== Small batch, low frequency ===")
compare_costs(rows=50_000, tokens_per_row=250, runs_per_month=2)

# Example 2: Medium volume, regular runs
print("=== Medium batch, monthly inference ===")
compare_costs(rows=500_000, tokens_per_row=200, runs_per_month=4)

# Example 3: Large volume, daily runs
print("=== Large batch, daily inference ===")
compare_costs(rows=2_000_000, tokens_per_row=150, runs_per_month=30)

Run this and you'll see:

=== Small batch, low frequency ===
Dataset: 50,000 rows, 250 tokens/row, 2 runs/month
Total tokens/run: 12,500,000

API cost (Together.AI):       $26.00/month
Local GPU cost (H100, vLLM):  $96.78/month

✗ API wins by $70.78/month (272% cheaper)

=== Medium batch, monthly inference ===
Dataset: 500,000 rows, 200 tokens/row, 4 runs/month
Total tokens/run: 100,000,000

API cost (Together.AI):       $416.00/month
Local GPU cost (H100, vLLM):  $1,457.44/month

✗ API wins by $1,041.44/month (250% cheaper)

=== Large batch, daily inference ===
Dataset: 2,000,000 rows, 150 tokens/row, 30 runs/month
Total tokens/run: 300,000,000

API cost (Together.AI):       $9,360.00/month
Local GPU cost (H100, vLLM):  $32,597.50/month

✗ API wins by $23,237.50/month (248% cheaper)

Wait—the API wins in all three scenarios. That's not wrong, it's important. At current GPU rental rates ($6.50/hour is typical for H100), you'd need either much higher volume, owned hardware (amortized capex instead of hourly rental), or privacy constraints that make the API impossible.

When local actually wins

You own the hardware. If you've already bought an A100 or H100 for other work, the marginal cost of running inference is almost zero. Capex is sunk, electricity is cheap. That $32k/month GPU cost drops to a few hundred in amortized hardware + power. Suddenly local beats the API decisively.

Volume becomes extreme. At 100 million tokens per day (sustained, not spiky), you cross into territory where rental rates don't make sense and ownership becomes rational. But that's the data backbone of a serious company, not a batch job.

Your data can't leave. Privacy regulations, contractual obligations, or internal policy mean you cannot send rows to an external API. Local isn't cheaper—it's mandatory. Calculate the cost and pay it. This is where local wins on principle, not math.

You're CPU-constrained, not memory-constrained. If you're already running a GPU cluster for other workloads (training, graphics, simulation), adding inference is nearly free. Your utilization goes up, your per-unit cost goes down.

For everyone else: the API is your baseline. It's correct until the data tells you otherwise.

The lazy middle path

You don't have to choose immediately. Start with the API. It's zero friction. Run your first million rows through an API and see what you actually spend. If it's more than a few hundred dollars a month and you're doing this regularly, that's your signal to build the local path.

When you do move to local, use vLLM to serve the model, batch your requests, and let it handle the concurrency:

vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 2

Then send batches to the local server. The GPU stays busy, throughput climbs, per-token cost vanishes.

Honest boundaries

Running a local model looks cheap until you add up the real costs. GPU rental (if you don't own one) or GPU capex (if you do). Batching and serving infrastructure—you'll write code or buy a tool. Your time keeping it running, tuning it, restarting it when it locks up at 2am. The API wins decisively for anything spiky or low-volume. It also wins for anything where you need the model to improve faster than you can update a local deployment.

Don't rent a GPU to classify 500 rows a day. The API will cost you 50 dollars a month. The GPU will cost you 200, plus time you don't have.

Volume, not fashion, moves you to local. Let the math decide.

Have questions about LLM costs in your own pipeline? Talk to us.