Building AEGIS: How Prolog + a Knowledge Graph Make Local LLMs Actually Useful

I’ve been building a personal AI orchestration system. The first version — Jarvis — worked, but became a 20,000-line monolith I dreaded touching. So I rebuilt it from scratch as AEGIS (Autonomous Executive Guild Intelligence System).

Before I go further — AEGIS is a custom-built personal tool. It’s not a SaaS product, not multi-tenant, not an open-source chat interface like Open WebUI, not a general-purpose assistant framework. It’s purpose-built for one user (me), with hardcoded personalities, hardcoded integrations, and workflows tuned to my specific life. There’s no user management, no onboarding flow, no plugin marketplace. If you’re looking for something you can deploy and use yourself out of the box, this isn’t it. The value is in the ideas, not the artifact.

One thing I should be upfront about: almost all of the code was written by Claude Code, Anthropic’s agentic coding tool. I did the architecture, design documents, RFC specs, and decisions about what to build; Claude Code wrote the Python, TypeScript, SQL migrations, and tests. Sometimes I ran OpenAI Codex as a second pass to cross-check specific pieces. This post is as much about the design model as the development model.

But the thing I actually want to talk about is a specific architectural bet that’s been paying off: using Prolog and a knowledge graph to dramatically shrink what you need the LLM to do, which in turn makes it feasible to run most of the system on a cheap local model.

The Core Problem With LLM-Driven Systems

When you build an AI system that runs autonomously — processing emails, managing tasks, investigating alerts, running briefings — you end up hitting the LLM constantly. Every classification, every routing decision, every piece of context retrieval, every “should I do X?” question gets sent to the model.

That’s expensive. My original Jarvis had heartbeats firing every 1–2 hours per personality, email triage every 6 hours, intelligence scans, memory consolidation, council deliberations. With 15 personalities and a dozen scheduled workflows, the API cost compounds fast.

The naive solution is “just use a cheaper model”. But cheaper models make more mistakes on structured reasoning. You get wrong classifications, bad routing decisions, misremembered facts. The system becomes unreliable in exactly the places where reliability matters most.

The real solution is to not ask the LLM to do things it’s bad at. Structured reasoning, routing decisions, and fact lookups aren’t what LLMs are for. Prolog is.

The Hybrid Architecture: Prolog Handles Structure, LLM Handles Language

┌─────────────────────────────────────────────────┐  
│  Layer 1: Prolog Knowledge Graph                │  
│  Deterministic · Fast · Free · Auditable        │  
│  "What do we already know about this sender?"   │  
│  "What routing rule applies to this alert?"     │  
└────────────────────┬────────────────────────────┘  
                     │ no result  
┌────────────────────▼───────────────────────────┐  
│  Layer 2: SPARQL (DBpedia / Wikidata)          │  
│  External facts · Claim verification           │  
│  "Is this entity a known organization?"        │  
└────────────────────┬───────────────────────────┘  
                     │ no result  
┌────────────────────▼────────────────────────────┐  
│  Layer 3: LLM (local or cloud)                  │  
│  Language understanding · Fuzzy reasoning       │  
│  "Given everything above, what should we do?"   │  
└─────────────────────────────────────────────────┘

The insight is that most decisions in a personal assistant system are actually deterministic once you have the right facts asserted. You don’t need a language model to decide that emails from your bank should be archived — you need that rule stated once and executed reliably every time. You don’t need a model to know that a Docker container restart alert should trigger a remote script — you need a Prolog predicate that says so.

The LLM gets invoked only when the structured layers have nothing to say.
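
A minimal sketch of that fall-through, assuming each layer can be modeled as a function that either answers or abstains by returning None. The layer functions, the question strings, and the KNOWN_RULES table are hypothetical stand-ins for illustration, not AEGIS code; real layers would query SWI-Prolog, a SPARQL endpoint, and a model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Answer:
    value: str
    source: str  # name of the layer that produced the answer


# A layer takes a question and returns an answer, or None to abstain.
Layer = Callable[[str], Optional[str]]


def resolve(question: str, layers: list[tuple[str, Layer]]) -> Optional[Answer]:
    """Try each layer in order; the first non-None result wins."""
    for name, layer in layers:
        result = layer(question)
        if result is not None:
            return Answer(value=result, source=name)
    return None


# Hypothetical stand-in layers for illustration only.
KNOWN_RULES = {"alert:docker:warning": "remote_script"}


def prolog_layer(question: str) -> Optional[str]:
    return KNOWN_RULES.get(question)  # deterministic lookup, may abstain


def sparql_layer(question: str) -> Optional[str]:
    return None  # no external fact available in this toy example


def llm_layer(question: str) -> Optional[str]:
    return "telegram_only"  # pretend the model always produces some answer


LAYERS = [("prolog", prolog_layer), ("sparql", sparql_layer), ("llm", llm_layer)]
```

Calling resolve("alert:docker:warning", LAYERS) is answered by the first layer and never reaches the model; a question no rule covers falls through to the last layer.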

What Lives in the Knowledge Graph

AEGIS uses PostgreSQL + pgvector as the backing store, but the reasoning layer talks to it via SWI-Prolog (through janus-swi). Facts are asserted into the KG through daily sync workflows and explicit user assertions via Telegram:

% Email routing preferences (asserted from user feedback)  
:- module(kg_routing, [  
    email_sender_priority/2,  
    email_domain_priority/2  
]).  
  
% Dynamically asserted facts:  
% fact(kg_routing, email_sender_priority, ["[email protected]", "archive"])  
% fact(kg_routing, email_domain_priority, ["substack.com", "newsletter"])  
  
email_sender_priority(Sender, Category) :-  
    kg_core:fact(kg_routing, email_sender_priority, [Sender, Category]).  
  
email_domain_priority(Domain, Category) :-  
    kg_core:fact(kg_routing, email_domain_priority, [Domain, Category]).


% Alert execution routing (static rules, no LLM needed)  
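alert_execution_mode(Service, Severity, _Host, remote_script) :-
    known_infra_service(Service),
    memberchk(Severity, [warning, error]),
    !.  % commit, so the telegram_only fallback is not also returned on backtracking

alert_execution_mode(_, critical, _, claude_code) :-
    !.

% Fallback: anything not matched above only notifies via Telegram
alert_execution_mode(_, _, _, telegram_only).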
  
known_infra_service(worker).  
known_infra_service(docker).  
known_infra_service(redis).  
known_infra_service(nginx).

These rules are fast (microseconds), deterministic, and completely free. No API call, no latency, no cost.
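
One way to sanity-check a rule table like this is a plain-Python mirror that encodes the intended first answer of each query. The sketch below is an assumption about how the rules are meant to resolve (critical goes to claude_code, warning or error on a known service goes to remote_script, everything else to telegram_only). It drops the Host argument because the rules shown do not use it.

```python
# Mirror of the known_infra_service/1 facts from the Prolog module above.
KNOWN_INFRA_SERVICES = frozenset({"worker", "docker", "redis", "nginx"})


def alert_execution_mode(service: str, severity: str) -> str:
    """Return the execution mode the Prolog rules are intended to pick first."""
    if severity == "critical":
        return "claude_code"
    if service in KNOWN_INFRA_SERVICES and severity in ("warning", "error"):
        return "remote_script"
    return "telegram_only"  # catch-all, same as the final Prolog clause
```

A mirror like this is only useful as a test oracle: the Prolog module stays the source of truth, and any disagreement between the two points at a rule that needs a second look.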

Email Triage: A Concrete Example

Email triage is the workflow I most wanted to be reliable. It runs every 6 hours per personality. Here’s the decision chain:

async def _classify_message(self, message: dict) -> tuple[str, float]:  
    sender = extract_email(message["from"])  
    domain = extract_domain(sender)  
  
    # 1. Check KG for known sender preference (Prolog)  
    result = await engine.query(  
        "kg_routing:email_sender_priority(Sender, Category)",  
        {"Sender": sender}  
    )  
    if result:  
        return result[0]["Category"], 1.0  # certainty = 1.0, no LLM needed  
  
    # 2. Check KG for domain-level preference (Prolog)  
    result = await engine.query(  
        "kg_routing:email_domain_priority(Domain, Category)",  
        {"Domain": domain}  
    )  
    if result:  
        return result[0]["Category"], 0.95  
  
    # 3. Check static routing rules (Prolog)  
    result = await engine.query(  
        "email_routing:should_archive(Sender, Reason)",  
        {"Sender": sender}  
    )  
    if result:  
        return "archive", 0.9  
  
    # 4. LLM fallback — only if all else fails  
    response = await personality.think(classify_prompt(message))  
    return parse_category(response), response.confidence

In practice, once the system has been running for a few weeks, the vast majority of emails hit KG rules at step 1 or 2. The LLM fallback fires mainly for genuinely novel senders. The system learns over time — when the LLM classifies something, I can confirm or correct via Telegram, and that fact gets asserted into the KG for next time.
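
The confirm-and-remember step can be sketched with an in-memory stand-in for the sender facts. SenderRules, classify, and the fallback callable are hypothetical names for illustration; in AEGIS the facts live in the knowledge graph and are queried through Prolog.

```python
from typing import Callable, Optional

# The fallback takes a sender and returns (category, confidence).
LlmFallback = Callable[[str], tuple[str, float]]


class SenderRules:
    """Toy in-memory stand-in for sender-preference facts in the KG."""

    def __init__(self) -> None:
        self._by_sender: dict[str, str] = {}

    def lookup(self, sender: str) -> Optional[str]:
        return self._by_sender.get(sender.lower())

    def confirm(self, sender: str, category: str) -> None:
        # A confirmed (or corrected) classification becomes a stored fact.
        self._by_sender[sender.lower()] = category


def classify(sender: str, rules: SenderRules, llm_fallback: LlmFallback) -> tuple[str, float, str]:
    """Return (category, confidence, source); the model runs only on a miss."""
    known = rules.lookup(sender)
    if known is not None:
        return known, 1.0, "kg"
    category, confidence = llm_fallback(sender)
    return category, confidence, "llm"
```

The first message from a new sender goes to the fallback. After rules.confirm(...) records the outcome, the same sender resolves from the stored fact with certainty 1.0 and the fallback is not called again.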

The same pattern repeats across the system:

Workflow split (Prolog vs LLM):

Email triage: Prolog handles known senders and domain rules; LLM handles novel senders and ambiguous content
Alert investigation: Prolog handles service routing and execution mode; LLM handles unknown service classification
Task triage: Prolog handles assignee and priority rules; LLM handles open-ended task descriptions
Memory consolidation: Prolog handles dedup and known entity linking; LLM handles new entity extraction and summary generation

Why This Makes Local LLMs Viable

The LLM in AEGIS only needs to handle the residual — the cases where structured rules genuinely don’t apply. That residual is:

Smaller in volume (most requests are handled by Prolog)
Qualitatively different (requires language understanding, not rule application)
More tolerant of occasional errors (Prolog handles the mission-critical routing)

This means the bar for “good enough local model” drops significantly. I’m currently routing the LLM residual to Kimi K2 (Moonshot’s large MoE model) via LiteLLM. It handles task classification, memory consolidation, knowledge graph extraction, and briefing generation without issue.

The expensive cloud model gets reserved for:

Council workflows: multiple personalities debating a strategic question
Code execution tasks: where the output directly drives a Claude Code run
Low-confidence escalations: when Prolog and the LLM disagree, or both abstain

The LiteLLM gateway makes this routing transparent to the application. AEGIS Core calls litellm.acompletion(model="aegis-fast") or litellm.acompletion(model="aegis-smart"), and the gateway handles the rest. Swapping Kimi K2 for a different local model requires no code changes.
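
A minimal sketch of how the application side might pick between the two aliases, following the escalation criteria listed above. The alias names come from this post; the task labels, the confidence threshold, and the choose_model function are assumptions for illustration.

```python
from typing import Optional

FAST_MODEL = "aegis-fast"    # gateway alias for the cheap model
SMART_MODEL = "aegis-smart"  # gateway alias for the expensive cloud model

# Hypothetical task labels and threshold, chosen only for this example.
ESCALATED_TASKS = frozenset({"council", "code_execution"})
MIN_CONFIDENCE = 0.6


def choose_model(task_kind: str, prior_confidence: Optional[float] = None) -> str:
    """Pick a gateway alias; escalate for heavy tasks or low-confidence results."""
    if task_kind in ESCALATED_TASKS:
        return SMART_MODEL
    if prior_confidence is not None and prior_confidence < MIN_CONFIDENCE:
        return SMART_MODEL
    return FAST_MODEL
```

The returned alias would be passed as the model argument to litellm.acompletion, so the mapping from alias to concrete model stays in the gateway configuration rather than in application code.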

The Rule Learning Loop

The other half of this is keeping the knowledge graph fresh. AEGIS has a RuleLearning workflow that runs monthly: it reviews recent LLM classifications, identifies patterns, and generates new Prolog rules as candidates.

The generated rules land in a generated_modules table with status draft. I review them in the admin panel, promote them to active, and they’re compiled into SWI-Prolog at runtime via assertz. There’s a feedback loop:

LLM classifies something → logged to trigger_history
Outcome confirmed or corrected → logged to rule_feedback
Monthly: patterns in feedback → candidate rules generated
Reviewed and promoted → asserted into Prolog → fires on future invocations
LLM fallback fires less often over time

The system gets cheaper to run as it learns. Every rule that gets asserted is one fewer LLM call per email / alert / task for the rest of time.
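
The pattern-mining step in that loop can be sketched as follows, assuming feedback arrives as (sender, confirmed category) pairs and that a domain qualifies only when every confirmation agrees. The CandidateRule type, the min_support threshold, and the unanimity criterion are assumptions for illustration, not the RuleLearning workflow itself. The emitted fact text follows the email_domain_priority format shown earlier.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateRule:
    domain: str
    category: str
    support: int           # number of confirmations backing the rule
    status: str = "draft"  # candidates start as drafts pending human review

    def as_prolog_fact(self) -> str:
        return (
            "fact(kg_routing, email_domain_priority, "
            f'["{self.domain}", "{self.category}"]).'
        )


def propose_domain_rules(
    feedback: list[tuple[str, str]], min_support: int = 3
) -> list[CandidateRule]:
    """Propose a domain rule when enough confirmations agree on one category."""
    by_domain: dict[str, Counter] = defaultdict(Counter)
    for sender, category in feedback:
        domain = sender.rsplit("@", 1)[-1].lower()
        by_domain[domain][category] += 1

    candidates = []
    for domain, counts in sorted(by_domain.items()):
        if len(counts) != 1:
            continue  # conflicting feedback: leave this domain to the LLM
        category, support = next(iter(counts.items()))
        if support >= min_support:
            candidates.append(CandidateRule(domain, category, support))
    return candidates
```

Keeping the output as draft candidates matches the review step described above: nothing is asserted into Prolog until a human promotes it.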

The Broader Architecture

Telegram Bot  ←→  AEGIS Core (FastAPI)  ←→  Workflow Engine (Temporal)  
Admin Panel   ←→       ↕                          ↕  
                  PostgreSQL + Redis          Notion (task state)  
                  (memory, KG, RAG)

AEGIS Core owns the knowledge: personalities, memory, reasoning, connectors (Gmail, Calendar, Notion, remote scripts). 30+ REST endpoints.

Temporal workflows own the orchestration: when things run, retry logic, approval gates, durable state. 16 workflow types across heartbeats, email triage, alert investigation, task execution, council, intelligence scans, area planning.

The Telegram bot is the primary UI: conversations, inline keyboards for approvals, proactive briefings.

Workflows never import Core code — they call Core over HTTP. This boundary keeps the systems independently deployable and testable.
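
A minimal sketch of what that boundary can look like from the workflow side, assuming a thin client with an injectable transport so workflow code can be tested without a running Core. The CoreClient class, the /emails/classify path, and the transport signature are all hypothetical; the post does not list the actual endpoints.

```python
import json
from typing import Callable

# Transport signature: (method, url, json_body) -> (status_code, response_text)
Transport = Callable[[str, str, str], tuple[int, str]]


class CoreClient:
    """Workflow-side client: talks to Core over HTTP, imports no Core code."""

    def __init__(self, base_url: str, transport: Transport) -> None:
        self._base_url = base_url.rstrip("/")
        self._transport = transport

    def classify_email(self, message: dict) -> dict:
        # Hypothetical endpoint path, used only for this example.
        url = f"{self._base_url}/emails/classify"
        status, body = self._transport("POST", url, json.dumps(message))
        if status != 200:
            raise RuntimeError(f"Core returned HTTP {status}")
        return json.loads(body)
```

In tests the transport can be a plain function returning canned responses; in production it would wrap a real HTTP library. Either way the workflow only depends on the HTTP contract.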

Sprint Delivery Model

The actual development workflow: I write an RFC document defining the problem, data model, API contract, workflow changes, and acceptance criteria. Claude Code implements the sprint: it reads the codebase, then writes migrations, route handlers, workflow code, tests, and frontend pages. I review, catch issues, ask for revisions, and run the suite. When everything passes, we ship. Codex occasionally cross-checks specific pieces, especially concurrency logic.

Sprint 0 - Running skeleton (config, models, DB, health): 69 tests
Sprint 1 - Conversation API, Telegram bot: 130 tests
Sprint 2 - Memory + reasoning (RAG, Prolog, scheduler): 234 tests
Sprint 3 - Observability + events: 318 tests
Sprint 4 - Temporal workflows + connectors: 401 tests
Sprint 5 - Council + advanced workflows: 444 tests
Sprint 6 - Maintenance workflows (schedules): 512 tests
Sprint 7 - Admin panel dashboard: 539 tests
Sprint 8 - Why-How Narrative Model: 568 tests
Sprint 9 - Alert execution modes + remote remediation: 605 tests
Sprint 10 - Code execution connector + task execution: 654 tests
Sprint 11 - Email triage full rewrite: 685 tests
Sprint 12 - Advanced reasoning modularization: 768 tests
Sprint 13 - Notion work orchestration: 866 tests

866 tests. All passing.

What’s Next

The thesis has held up. The system runs reliably, the Prolog layer handles the bulk of routing decisions, and the local model handles the rest without degrading quality noticeably.

Next planned work:

RFC-0009, Personality Domain Workflows: persistent long-running per-personality workflows that own their domain proactively rather than reacting to schedules
Connector expansion: search integration, so the reasoning layer can pull in live context before falling back to the LLM
More KG coverage: the more facts in the graph, the less the LLM has to infer

The system is live at <https://aegis.hikmahtech.in>, though the admin panel sits behind a VPN — I haven’t done any serious security hardening yet and the system has direct access to email, calendar, and a remote script executor, so I’m not comfortable leaving it open to the internet. The code is in a private repo for the same reason. At some point I’ll do a proper security pass, but right now it’s a personal tool and “locked behind VPN” is good enough.

The design documents and RFC process are what I’d share if someone wanted to build something similar — the architecture is more transferable than the code.

Admin panel: https://aegis.hikmahtech.in (VPN locked for now)

Building AEGIS: How Prolog + a Knowledge Graph Make Local LLMs Actually Useful was originally published in Hikmah Techstack on Medium, where people are continuing the conversation by highlighting and responding to this story.
