Calling an LLM from inside a SQL query

By Arshad Ansari

There's a moment in data work when the cleanest place to run an LLM over your data is inside the SQL query itself. One model call per row, to classify a support ticket, summarize a review, or pull a field out of free text.

This is now practical. MotherDuck and the DuckDB community have built SQL functions that send rows to LLMs and get results back. The catch is cost and non-determinism. Here's how.

The managed path: MotherDuck's prompt()

MotherDuck exposes an LLM call directly in SQL:

SELECT 
  customer_id,
  feedback,
  prompt(
    'Classify this customer feedback as: positive, negative, or neutral. Feedback: ' || feedback,
    model := 'gpt-4o-mini',
    temperature := 0.1,
    return_type := 'VARCHAR'
  ) AS sentiment
FROM customer_feedback
WHERE created_at > current_date - interval '7 days'
LIMIT 100;

The function takes a prompt, optional LLM config (model, temperature), and a return type. For structured output — category and confidence score together — use struct:

prompt(
  'Extract category and confidence from: ' || text,
  struct := '{category: VARCHAR, confidence: FLOAT}',
  struct_descr := '{category: "positive/negative/neutral", confidence: "0–1 score"}'
)

Models include GPT-4o, GPT-4o mini, GPT-5, and smaller variants. Use this if you're in MotherDuck and want metered billing and rate-limiting handled for you. Results go into a DuckDB column, so you stay in SQL.

The local-first path: openprompt

If you want your LLM calls to stay local, the DuckDB community extension openprompt talks to any OpenAI-compatible endpoint, including a local Ollama instance.

Install it once:

INSTALL open_prompt FROM community;
LOAD open_prompt;

Point it at your local Ollama server:

SET VARIABLE openprompt_api_url = 'http://localhost:11434/v1/chat/completions';
SET VARIABLE openprompt_model_name = 'qwen2.5:0.5b';

Then query:

SELECT 
  order_id,
  shipping_address,
  open_prompt(
    'Extract the country from this address: ' || shipping_address
  ) AS country
FROM orders
WHERE country_extracted IS NULL;

The function is simpler than prompt() — fewer options, but it gets the job done. For structured output, pass a json_schema parameter. For a system prompt, use system_prompt.

This is the local-first angle: the model runs on your hardware, no API costs, no external calls. If you have a homelab or a spare GPU, this is zero-cost at scale.

The cost reality

Here's the trap. prompt() and open_prompt() run once per row in the SELECT. If you query a million rows, you make a million LLM calls.

-- This is expensive (slow and/or costly)
SELECT 
  id,
  text,
  prompt('Classify: ' || text) AS classification
FROM articles;  -- 1M rows = 1M API calls

Never do this. Instead, filter first:

-- Better: only classify where needed
SELECT 
  id,
  text,
  prompt('Classify: ' || text) AS classification
FROM articles
WHERE classification IS NULL  -- Only unclassified rows
LIMIT 1000;  -- Test on a sample first

Then materialize the result so you don't recompute it on every query:

CREATE TABLE article_classifications AS
SELECT 
  id,
  text,
  prompt('Classify: ' || text) AS classification
FROM articles
WHERE classification IS NULL
ORDER BY id;

-- Later queries just read the table
SELECT * FROM article_classifications WHERE classification = 'urgent';

This is the discipline: filter aggressively, materialize the results, and never let a model call run inside a joined loop. If you find yourself calling prompt() in a subquery used by every dashboard query, you've over-engineered.

When not to use this

Don't reach for a model when SQL already solves it:

  • LIKE for detecting "refund"
  • JSON functions for structured fields
  • CASE with BETWEEN for spend categories
  • Full-text search for deduplication

Models are for genuine language understanding: parsing messy text, extracting meaning from prose, classifying subjective content. Everything else is over-engineering.

Honest boundaries

This approach has real limits.

Cost compounds fast. Every row costs money (if using a managed service) or wall-clock time (if local). A million rows is a million calls. If your table grows, costs or latency grow with it. This only works on finite, filtered datasets or on data that's classified once and cached.

Results are non-deterministic. The model may classify the same text differently on different days, especially with temperature > 0. If you recompute the query, you get different results. This breaks reproducibility unless you materialize and never re-query the model.

Not for interactive dashboards. If you're querying a million rows on-demand, you'll wait minutes for the LLM calls to finish. This is a batch operation, not a BI tool.

Non-determinism breaks dbt. If you use prompt() in a dbt model, running the same model twice produces different data. dbt assumes idempotency. Materialize the results outside dbt, or treat the dbt model as a write-once operation.

Use this when you're classifying a finite batch (1000–100k rows), caching the result in a table, and querying from that table later. Not for real-time, not for large loops, not when SQL already solves it.

In practice

The cleanest use case: you have a backlog of customer feedback (10k rows), classification rules are fuzzy, and you want to tag them once, store the tags, and query from there. SQL call, materialize, done.

The worst use case: you're building a dashboard that queries a live table, and every load kicks off a million LLM calls. That's not a feature, that's a spike in your bill.

For the local-first path with Ollama, this is lightweight — you're not paying per call, just CPU. Use it for batch work on data you own, where latency is okay and privacy matters.

If you're reaching for this, start small. Filter to 100 rows, test the prompt, measure the cost or time, then scale up. The discipline of filtering and materializing is what keeps this from becoming a money sink.

Learn more about local-first data workflows in our free Local-First Analytics book, or contact us if you're building a data platform and want to talk through where models fit.

Building something data-heavy? Let's talk.