# KB: LLM Knowledge Bases (Karpathy Pattern)

**Source:** @karpathy, April 2–4 2026
**Original tweet:** https://x.com/karpathy/status/2039805659525644595 (40k likes, 73k bookmarks, 10.7M views)
**Follow-up + idea gist:** https://x.com/karpathy/status/2040470801506541998
**Gist:** https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f

---

## The Core Idea

Stop using LLMs as retrieval engines (RAG). Use them as **wiki maintainers**.

RAG: LLM rediscovers knowledge from scratch on every query. No accumulation.
This pattern: LLM **compiles** raw sources into a persistent, interlinked wiki of .md files — once — and keeps it current as new sources arrive. Every query, every exploration **compounds** into the knowledge base.

Karpathy's analogy: "The LLM is the programmer; the wiki is the codebase; Obsidian is the IDE."

---

## Architecture (3 Layers)

### 1. Raw sources (`raw/`)
- Articles, papers, images, data files
- **Immutable** — LLM reads, never modifies
- Source of truth

### 2. Wiki (`wiki/`)
- LLM-generated .md files: summaries, entity pages, concept pages, comparisons, synthesis
- **LLM owns this entirely** — you just read it
- Two special files:
  - `index.md` — content catalog, one-line summary per page, organized by category. LLM reads this first on every query. Works fine at ~100 sources / ~hundreds of pages without needing RAG embeddings.
  - `log.md` — append-only chronological record of ingests/queries/lints. Parseable: `grep "^## \[" log.md | tail -5`
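
The `grep` one-liner above implies that every log entry begins with a `## [` heading. A minimal Python sketch of appending and reading entries follows. The heading layout `## [YYYY-MM-DD] kind | title` is an assumption (the source only shows the `## [` prefix), and `append_log_entry` / `recent_entries` are hypothetical helper names.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def append_log_entry(log_path: Path, kind: str, title: str,
                     body: str = "", day: Optional[date] = None) -> str:
    """Append one entry to the log and return its heading line."""
    day = day or date.today()
    # Assumed heading layout; only the "## [" prefix comes from the source.
    heading = f"## [{day.isoformat()}] {kind} | {title}"
    entry = heading + "\n"
    if body.strip():
        entry += body.strip() + "\n"
    entry += "\n"
    # Append mode: existing entries are never rewritten.
    with log_path.open("a", encoding="utf-8") as f:
        f.write(entry)
    return heading


def recent_entries(log_path: Path, n: int = 5) -> list:
    # Same filter as: grep "^## \[" log.md | tail -n
    lines = log_path.read_text(encoding="utf-8").splitlines()
    headings = [line for line in lines if line.startswith("## [")]
    return headings[-n:]
```

Opening the file in append mode keeps the log append-only by construction, which matches the role the file plays in the pattern.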

### 3. Schema (`AGENTS.md` / `CLAUDE.md`)
- Config doc telling the LLM wiki conventions, workflows, page formats
- You and the LLM co-evolve this over time
- Converts the LLM from "generic chatbot" → "disciplined wiki maintainer"
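
The three layers can be laid out on disk with a few lines of Python. This is a sketch under stated assumptions: placing `index.md` and `log.md` inside `wiki/`, the starter text of each file, and the function name `scaffold` are all illustrative and not taken from the gist.

```python
from pathlib import Path

# Starter contents are placeholders (assumptions), meant to be rewritten with the LLM.
STARTER_FILES = {
    "wiki/index.md": "# Index\n\n",
    "wiki/log.md": "# Log\n\n",
    "AGENTS.md": (
        "# Wiki schema\n\n"
        "- raw/ is immutable: read, never modify.\n"
        "- wiki/ is maintained by the LLM.\n"
        "- Update wiki/index.md and append to wiki/log.md on every ingest.\n"
    ),
}


def scaffold(root: Path) -> list:
    """Create raw/, wiki/, and starter files; return the files newly created."""
    (root / "raw").mkdir(parents=True, exist_ok=True)
    (root / "wiki").mkdir(parents=True, exist_ok=True)
    created = []
    for rel, text in STARTER_FILES.items():
        path = root / rel
        if not path.exists():  # never overwrite an existing wiki
            path.write_text(text, encoding="utf-8")
            created.append(path)
    return created
```

Calling `scaffold(Path("my-wiki"))` twice is safe: the second call creates nothing and returns an empty list.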

---

## Three Operations

### Ingest
Drop source → LLM reads it → discusses with you → writes summary page → updates index → updates 10–15 entity/concept pages → appends to log.
- Can do one at a time (stay involved) or batch (less supervision)
- Karpathy prefers one at a time
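
The "updates index" step of an ingest is plain text manipulation and can be sketched without any LLM call. The example below assumes that `index.md` uses `## Category` headings with one `- [[page]]: summary` line per page; that layout and the helper name `add_index_entry` are hypothetical, since the source only says the index holds one-line summaries organized by category.

```python
def add_index_entry(index_text: str, category: str, page: str, summary: str) -> str:
    """Return index text with a one-line entry for `page` filed under `category`."""
    entry = f"- [[{page}]]: {summary}"
    heading = f"## {category}"
    lines = index_text.splitlines()
    if heading not in lines:
        # New category: append the heading and the entry at the end.
        if lines and lines[-1] != "":
            lines.append("")
        lines += [heading, entry]
        return "\n".join(lines) + "\n"
    # Existing category: find where its section ends (next "## " heading or EOF).
    start = lines.index(heading)
    end = start + 1
    while end < len(lines) and not lines[end].startswith("## "):
        end += 1
    # Step back over trailing blank lines so the entry joins the existing list.
    insert_at = end
    while insert_at > start + 1 and lines[insert_at - 1] == "":
        insert_at -= 1
    lines.insert(insert_at, entry)
    return "\n".join(lines) + "\n"
```

In practice the LLM would write this edit itself; a deterministic helper like this is mainly useful for batch ingests or for checking that the index stayed well-formed.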

### Query
Ask question → LLM reads index → drills into relevant pages → synthesizes answer.
Output formats: markdown page, comparison table, Marp slides, matplotlib chart.
**Key insight:** file good answers back into the wiki. Explorations compound.

### Lint (periodic health checks)
Ask LLM to find:
- Contradictions between pages
- Stale claims superseded by newer sources
- Orphan pages (no inbound links)
- Concepts mentioned but lacking a dedicated page
- Missing cross-references
- Data gaps fillable by web search
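
Two of these checks (orphan pages and concepts without a page) are mechanical and can be run without the LLM. A minimal sketch, assuming pages link to each other with Obsidian-style `[[wikilinks]]` and live as flat `.md` files in one directory. The function name `lint_links` and the choice to ignore links coming from `index.md` and `log.md` are assumptions made for this example.

```python
import re
from pathlib import Path

# Matches [[target]], [[target|alias]], and [[target#heading]]; captures the target only.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def lint_links(wiki_dir: Path, exempt=("index", "log")) -> dict:
    """Report orphan pages and link targets that have no page yet."""
    pages = {p.stem: p.read_text(encoding="utf-8") for p in wiki_dir.glob("*.md")}
    linked = set()  # targets linked from at least one content page
    for name, text in pages.items():
        if name in exempt:
            continue  # the index links to everything, so it would hide orphans
        for target in WIKILINK.findall(text):
            target = target.strip()
            if target != name:  # self-links do not count as inbound
                linked.add(target)
    orphans = sorted(n for n in pages if n not in exempt and n not in linked)
    missing = sorted(t for t in linked if t not in pages)
    return {"orphans": orphans, "missing_pages": missing}
```

The remaining checks (contradictions, stale claims, data gaps) need judgment about content and stay with the LLM. Note that this sketch does not special-case embeds such as `![[image.png]]`, which would show up under `missing_pages`.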

---

## Tooling Stack (Karpathy's)
- **Obsidian** — IDE frontend; graph view shows shape of wiki (hubs, orphans)
- **Obsidian Web Clipper** — converts web articles → .md for `raw/`
- **Local image download hotkey** — Ctrl+Shift+D after clipping
- **Marp** — markdown → slide decks, viewable in Obsidian
- **Dataview plugin** — dynamic tables/lists over YAML frontmatter
- **qmd** (https://github.com/tobi/qmd) — local search over .md files, hybrid BM25/vector + LLM re-ranking, has CLI + MCP server. Use when wiki outgrows index.md
- **Git** — wiki is just a git repo; version history + collaboration for free

---

## Use Cases
- **Research** — go deep over weeks/months; evolving thesis across papers + articles
- **Personal** — journal entries, health, goals → structured self-picture over time
- **Book reading** — per-chapter filing → character/theme wiki (like Tolkien Gateway, built personally)
- **Business/team** — internal wiki fed by Slack threads, meeting transcripts, customer calls; LLM does maintenance no one wants to do
- **Competitive analysis, due diligence, trip planning, course notes**

---

## Why It Works (vs. RAG)

| RAG | LLM Wiki |
|-----|----------|
| Rediscovers knowledge every query | Knowledge compiled once, kept current |
| Semantic similarity retrieval | Structured index + entity graph |
| No accumulation | Every ingest/query compounds |
| Needs embedding infrastructure | Just markdown files + index.md |
| Scales poorly for multi-doc synthesis | Cross-references pre-computed |

Humans abandon wikis because the maintenance burden grows faster than the value. **LLMs don't get bored.** Marginal cost of maintenance ≈ zero.

---

## The "Idea File" Concept (meta-insight)

Karpathy introduced a new sharing format: instead of sharing code/apps, share an **idea file** — an abstract description of the pattern designed to be copy-pasted to your own LLM agent, which then builds the specific implementation for your needs.

This reflects a broader shift: in the LLM agent era, the valuable artifact is the **idea + schema**, not the code.

---

## Relevance to Moon / Bo

This is essentially what Moon already does at a small scale with `MEMORY.md`, `memory/YYYY-MM-DD.md`, and `AGENTS.md`. The Karpathy pattern is a more principled, domain-specific extension of the same concept:

- `MEMORY.md` ≈ wiki synthesis layer
- `memory/YYYY-MM-DD.md` ≈ log.md (append-only chronological)
- `AGENTS.md` ≈ schema file

**Gap:** No raw source layer, no entity/concept pages, no structured ingest workflow, no lint pass.

**Potential application for Bo:**
- Bo works at the AI × bio intersection; an LLM wiki over papers/preprints in Virtual Cell, scRNA-seq, perturbation biology would be high-value
- Could replace ad-hoc `arxiv-scout` runs with a compounding knowledge base
- Obsidian already popular in research; low friction to adopt

**Open question for Bo:** Would you want me to set up a Karpathy-style wiki scaffold for any of your research areas? Would be straightforward — `raw/`, `wiki/`, `AGENTS.md` schema, index + log. Could seed it with recent arXiv papers you care about.

---

## Further Directions (from gist)
- Synthetic data generation + fine-tuning: have the LLM "know" the wiki in its weights, not just in context
- Karpathy thinks there's room for "an incredible new product instead of a hacky collection of scripts"
- qmd for search as wiki grows past index.md scale
