Turning messy PDFs into clean tables, on your own machine

The hard part of data engineering isn't moving structured data around. It's getting structure out of the mess in the first place — invoices with floating headers, statements with inconsistent layouts, reports where "total" means three different things on three different pages.

You don't need an expensive document-AI API or a fine-tuned vision model to handle this. In 2026 you can do it locally: parse the document to clean text, feed it to a small local model, and use a schema to force the output into validated records. The real value is the validation, not the model.

Stage 1: parse the document

The first step is turning the PDF into clean, readable text. Open-source tools have gotten good at this. The main contenders are:

Docling (IBM) is structurally rich — it understands tables, headers, sections, and builds a semantic tree before rendering to markdown. Slower but accurate. Use when the document has complex layout or you need to preserve structure.

Marker (Datalab) aims for both speed and accuracy. It handles table detection well but is GPU-hungry. An optional --use_llm flag adds an LLM pass on top if you want extra accuracy. Use it when you have a GPU and care about OCR quality.

MarkItDown (Microsoft) is the fastest and simplest. Shallow extraction, minimal structure, good for straightforward documents. Use when you need a quick pass and don't need perfect layout preservation.

For most work, Docling is a safe choice. Here's a minimal example:

from docling.document_converter import DocumentConverter

# Initialize once and reuse
converter = DocumentConverter()

# Convert a local PDF to markdown
result = converter.convert("invoice.pdf")
markdown_text = result.document.export_to_markdown()

print(markdown_text)

That's it. You get clean markdown with tables, headers, and readable text. Run this locally, no API calls.

Stage 2: extract and validate

Now you have readable text, but you need structured data — an Invoice object with vendor, date, line items, total. This is where a local model and a schema come in.

Use the instructor library with Pydantic. Instructor wraps your local model (via Ollama or any OpenAI-compatible endpoint) and enforces the schema you define. If the model returns invalid data, instructor retries automatically.

First, define your target schema:

from pydantic import BaseModel, Field
from typing import List

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float

class Invoice(BaseModel):
    vendor_name: str = Field(..., description="Name of the vendor/supplier")
    invoice_date: str = Field(..., description="Invoice date in YYYY-MM-DD format")
    invoice_number: str
    line_items: List[LineItem]
    total_amount: float = Field(..., description="Total invoice amount")

Now, initialize instructor and point it at your local model:

import instructor

# Use instructor with Ollama (or any OpenAI-compatible local endpoint)
client = instructor.from_provider("ollama/llama2")

# Extract structured data from the markdown text
invoice_data = client.create(
    response_model=Invoice,
    messages=[
        {
            "role": "user",
            "content": f"Extract invoice data from this text:\n\n{markdown_text}"
        }
    ],
)

print(invoice_data)
# Example of the result's shape (actual values depend on your document):
# Invoice(vendor_name='ABC Corp', invoice_date='2026-06-15', ...)

Instructor validates the response against your Pydantic model. If the model omits a required field or returns the wrong type, instructor retries, passing the validation errors back so the model can correct its answer. Small local models (llama2, mistral) often succeed on the first try for clean documents.

The schema is your real reliability lever. A strict Pydantic model with type hints and field descriptions catches bad output before it gets to your database.
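To make "strict" concrete, here is a minimal sketch of a tighter version of the schema, assuming Pydantic v2. The names StrictLineItem and StrictInvoice and the specific constraints (non-empty strings, positive quantity, at least one line item, a real date type) are illustrative assumptions, not rules from the schema above. Pick the ones that match your documents.

```python
from datetime import date

from pydantic import BaseModel, Field, ValidationError


class StrictLineItem(BaseModel):
    description: str = Field(min_length=1)
    quantity: float = Field(gt=0)      # assumption: zero or negative quantities are errors
    unit_price: float = Field(ge=0)
    total: float = Field(ge=0)


class StrictInvoice(BaseModel):
    vendor_name: str = Field(min_length=1)
    # A date type rejects strings that are not ISO dates (YYYY-MM-DD)
    invoice_date: date
    invoice_number: str = Field(min_length=1)
    # assumption: an invoice with no line items is a failed extraction
    line_items: list[StrictLineItem] = Field(min_length=1)
    total_amount: float = Field(ge=0)


good = {
    "vendor_name": "Example Vendor",
    "invoice_date": "2026-06-15",
    "invoice_number": "INV-001",
    "line_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 10.0, "total": 20.0}
    ],
    "total_amount": 20.0,
}
invoice = StrictInvoice.model_validate(good)

# Same payload with a non-ISO date and an empty item list
bad = {**good, "invoice_date": "15/06/2026", "line_items": []}
try:
    StrictInvoice.model_validate(bad)
    rejected = False
except ValidationError as exc:
    rejected = True
    # Top-level field names that failed validation
    error_fields = {err["loc"][0] for err in exc.errors()}
```

A model like this can be passed as response_model in place of Invoice. The trade-off is that every added constraint is another reason for a retry, so tighten only the fields where bad values have actually shown up.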

Why this works

You're not relying on the model's knowledge or accuracy for the hard stuff. You're relying on the schema. The model reads the markdown text, spots the relevant fields, and returns data that fits your shape. Validation of shape and types happens automatically; checking that the values themselves are right is a separate step, covered under Honest boundaries.

This is different from prompt engineering or fine-tuning. You're just being explicit about what you want.

Honest boundaries

This approach has real limits. Know them before you start:

Parsing is the bottleneck. If the PDF is scanned or heavily corrupted, markdown comes out garbled. No amount of schema validation fixes that. Test your parse output before tuning extraction.
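One cheap way to test parse output is a handful of text heuristics run before any extraction. The sketch below is an assumption-heavy starting point, not a standard check: the function names, the list of markdown syntax characters, and the thresholds (200 characters, 60% readable) are all hypothetical and should be tuned against your own documents.

```python
# Characters that are markdown syntax rather than document content
MARKDOWN_SYNTAX = set("|-#*_:>`")


def parse_quality(markdown_text: str) -> dict:
    """Cheap heuristics for spotting a garbled parse before extraction."""
    content = [
        ch for ch in markdown_text
        if not ch.isspace() and ch not in MARKDOWN_SYNTAX
    ]
    readable = sum(ch.isalnum() for ch in content)
    lines = markdown_text.splitlines()
    return {
        "chars": len(content),
        # Share of content characters that are letters or digits (any script)
        "readable_ratio": readable / len(content) if content else 0.0,
        # Markdown table rows start with a pipe
        "table_rows": sum(1 for line in lines if line.lstrip().startswith("|")),
        # U+FFFD marks text that could not be decoded
        "replacement_chars": markdown_text.count("\ufffd"),
    }


def looks_garbled(
    markdown_text: str, min_chars: int = 200, min_ratio: float = 0.6
) -> bool:
    """Flag output that is too short, mostly symbols, or has decode errors."""
    stats = parse_quality(markdown_text)
    return (
        stats["chars"] < min_chars
        or stats["readable_ratio"] < min_ratio
        or stats["replacement_chars"] > 0
    )
```

Calling looks_garbled(markdown_text) on the Stage 1 output gives a quick yes/no before spending model time. It will not catch subtle OCR errors such as a 1 read as a 7, so it complements reading a few parsed documents by eye rather than replacing it.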

Tables and multi-column layouts are still hard. A neatly formatted invoice works. A chaotic three-column statement with merged cells or irregular spacing is a gamble. Marker with --use_llm does better, but costs compute.

Numbers need reconciliation. Never trust extracted totals without spot-checking. If an invoice says total=$1000 but sum(line_items)=$999.50, something went wrong in the parse or extract. Always validate before loading into your system.
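The reconciliation itself can be a few lines of arithmetic. The following is a minimal sketch that assumes the invoice total is the plain sum of the line items (no separate tax, discount, or shipping lines) and allows a one-cent tolerance. SimpleLine and SimpleInvoice are hypothetical stand-ins with the same field names as the schema from Stage 2, and reconcile is an illustrative helper, not part of any library.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class SimpleLine:  # stand-in for LineItem
    quantity: float
    unit_price: float
    total: float


@dataclass
class SimpleInvoice:  # stand-in for Invoice
    line_items: list
    total_amount: float


def money(value) -> Decimal:
    # Go through str() so 999.5 becomes Decimal("999.5"),
    # not the long binary expansion of the float
    return Decimal(str(value))


def reconcile(invoice, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return a list of arithmetic mismatches; an empty list means it adds up."""
    problems = []
    for index, item in enumerate(invoice.line_items):
        expected = money(item.quantity) * money(item.unit_price)
        if abs(expected - money(item.total)) > tolerance:
            problems.append(
                f"line {index}: {item.quantity} x {item.unit_price} != {item.total}"
            )
    line_sum = sum((money(item.total) for item in invoice.line_items), Decimal("0"))
    if abs(line_sum - money(invoice.total_amount)) > tolerance:
        problems.append(
            f"total: line items sum to {line_sum}, invoice says {invoice.total_amount}"
        )
    return problems
```

Because reconcile only reads attributes, it should also accept the Pydantic Invoice object directly. Records with a non-empty result are good candidates for the human review queue rather than the database.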

Small models miss fields. A frontier LLM like Claude or GPT-4 catches nuances a local llama2 might skip. If extraction is failing for domain-specific fields, try a bigger model or fall back to a paid API for that subset.

Scanned PDFs with OCR errors compound every problem. Parsing produces gibberish, the model can't read it, validation fails. If you're dealing with ancient paper, consider a specialized OCR tool (Tesseract, Surya) before Docling.

Start with clean PDFs or documents with native text. Test on a handful before scaling to thousands.

When to stop here

This stack (Docling + instructor + Pydantic) handles invoices, statements, expense reports, and most tabular documents where the layout is consistent. If it's not working, the issue is usually in parsing (bad PDF quality), not extraction (bad model).

Don't add a fine-tuned model, a retrieval layer, or a complex retry pipeline unless you can measure that the schema-based approach is actually failing. Most of the time it won't be. Add weight only when you hit a wall.
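Measuring can stay small too. Here is a rough sketch of a tally over a sample of documents. The names measure, failure_rate, and run_pipeline are hypothetical, and run_pipeline stands for whatever function wraps your parse and extract steps and raises an exception on failure.

```python
from collections import Counter
from typing import Callable, Iterable


def measure(paths: Iterable[str], run_pipeline: Callable[[str], object]) -> Counter:
    """Run the pipeline on a sample and count outcomes by exception type."""
    outcomes = Counter()
    for path in paths:
        try:
            run_pipeline(path)
        except Exception as exc:  # broad on purpose: we are tallying, not handling
            outcomes[type(exc).__name__] += 1
        else:
            outcomes["ok"] += 1
    return outcomes


def failure_rate(outcomes: Counter) -> float:
    """Share of documents that did not complete successfully."""
    total = sum(outcomes.values())
    return 0.0 if total == 0 else 1 - outcomes["ok"] / total
```

Grouping by exception type shows whether failures cluster in parsing or in validation, which is the distinction that decides what, if anything, to add next.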

If you're stuck on a specific document type, consider whether you actually need this at all. Sometimes the right answer is a human review step, not a better model.

For a working walkthrough, see Generating test data with a local model — it covers similar ground with synthetic data.

For more on building reliable pipelines without over-engineering, grab the free book or reach out.
