basemind

Full AI context layer for coding agents — code-map, document RAG, shared memory, web crawl,
git history. 300+ languages, one MCP server.

The four pillars

Code — Tree-sitter outlines, symbol search, reference + caller + implementation graphs,
call chains, git history per symbol, blame at symbol-level resolution.

Documents — Ingest + semantic search over PDFs, Office (Word/Excel/iWork), HTML, email,
archives. Built-in OCR, layout detection, keyword + NER extraction, cross-encoder reranking.
All ONNX bundled — no system install needed.

Memory — Per-repo scoped key-value + semantic vector storage. Clones of the same git
origin automatically share memory; unrelated repos stay isolated.

Web — On-demand HTTP scrape + follow-link crawl. Pages chunk, embed, and land in the
documents store under scope web:<host> for unified search.
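As a rough illustration of the Memory pillar's sharing rule, the sketch below reduces a git origin URL to one scope key, so SSH and HTTPS clones of the same repository land in the same scope. The function name `memory_scope` and the normalization rules are assumptions made for illustration, not basemind's actual implementation.

```python
import re


def memory_scope(origin_url: str) -> str:
    """Hypothetical helper: reduce a git origin URL to a host/path scope key."""
    url = origin_url.strip()
    # scp-like SSH form, e.g. git@github.com:Owner/repo.git
    m = re.match(r"^[\w.-]+@([\w.-]+):(.+)$", url)
    if not m:
        # URL form, e.g. https://github.com/Owner/repo or ssh://git@host/Owner/repo.git
        m = re.match(r"^\w+://(?:[^@/]+@)?([^/:]+)(?::\d+)?/(.+)$", url)
    if not m:
        raise ValueError(f"unrecognized origin URL: {origin_url!r}")
    host, path = m.group(1), m.group(2).rstrip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    # Only the host is lowercased; path case is kept because not every host ignores it.
    return f"{host.lower()}/{path}"


ssh_key = memory_scope("git@github.com:Goldziher/basemind.git")
https_key = memory_scope("https://github.com/Goldziher/basemind")
print(ssh_key == https_key)
```

Two clones that differ only in transport resolve to the same key here, while a fork under another owner would get its own key and stay isolated.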

Feature table

Pillar	What it does	MCP tools	Backend
Code intelligence	Outlines, symbol search, refs/callers/callees, call graphs, impl lookup, dependents, in-tree regex	`outline`, `search_symbols`, `workspace_grep`, `find_references`, `find_callers`, `call_graph`, `find_implementations`, `dependents`, `list_files`, `status`, `repo_info`	tree-sitter × 300+ langs · Fjall LSM index · content-addressed blob store
Git intelligence	Symbol-level history, blame, churn, recent changes, structural diffs across revs	`symbol_history`, `blame_file`, `blame_symbol`, `hot_files`, `recent_changes`, `commits_touching`, `find_commits_by_path`, `diff_outline`, `diff_file`, `working_tree_status`	gix + sha-keyed disk cache
Document RAG	Ingest + semantic search over PDFs, Office (Excel/Word/HWP/iWork), HTML, XML, email, archives. Adds OCR (Tesseract + PaddleOCR), cross-encoder reranker, keyword extraction (YAKE/RAKE), NER (gline-rs ONNX + LLM), extractive + abstractive summarization, layout detection, page auto-rotate, redaction, language detection. All ONNX models bundled — no system install needed.	`search_documents`	kreuzberg + LanceDB
Shared memory	Per-repo scoped key-value + semantic memory. Clones of the same git origin URL automatically share memory; unrelated repos stay isolated.	`memory_put`, `memory_get`, `memory_list`, `memory_search`, `memory_delete`	LanceDB + Fjall, scope-keyed
Web crawl	On-demand HTTP scrape + link-following crawl. Crawled pages route through the documents pipeline (chunk → embed → LanceDB) under scope `web:<host>`.	`web_scrape`, `web_crawl`, `web_map`	kreuzcrawl (native HTTP, no chromium)
Admin	Live rescan + telemetry dashboard	`rescan`, `telemetry_summary`	—
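The `web:<host>` scope in the Web crawl row can be pictured with a small sketch that derives the scope string from a page URL. The helper name `web_scope` is hypothetical, and how basemind treats ports or subdomains is not stated in this README, so the behavior here is an assumption.

```python
from urllib.parse import urlsplit


def web_scope(url: str) -> str:
    """Hypothetical helper: map a crawled page URL to a `web:<host>` scope."""
    host = urlsplit(url).hostname  # lowercased, port and credentials stripped
    if not host:
        raise ValueError(f"URL has no host: {url!r}")
    return f"web:{host}"


print(web_scope("https://docs.rs/tokio/latest/tokio/"))
```

Every page crawled from one host shares a scope under this scheme, which is what lets a single document search cover a whole crawled site.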

Quickstart

Claude Code

/plugin marketplace add Goldziher/basemind
/plugin install basemind@basemind

Restart the session. Optional: add a live statusline to ~/.claude/settings.json:

{
  "statusLine": {
    "type": "command",
    "command": "$HOME/.claude/plugins/basemind/.claude-plugin/statusline.sh",
    "refreshInterval": 5
  }
}

Output: ▲ basemind 144 files · scanned 2d ago ● 0 calls · 0 tok saved. The freshness dot is
green (< 1 h), yellow (1–24 h), or red (> 24 h).
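The freshness thresholds above amount to a three-way comparison on the age of the last scan. A minimal sketch follows, assuming the age is available in seconds; the function name `freshness_color` is made up here, and the shipped statusline.sh may compute this differently.

```python
HOUR = 3600


def freshness_color(age_seconds: float) -> str:
    """Map the age of the last scan to the statusline dot color."""
    if age_seconds < HOUR:
        return "green"
    if age_seconds <= 24 * HOUR:
        return "yellow"
    return "red"


# "scanned 2d ago" from the sample output falls in the red band.
print(freshness_color(2 * 24 * HOUR))
```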

Any MCP client

cargo install basemind --features full --locked

Then add to your MCP config:

{
  "mcpServers": {
    "basemind": {
      "command": "basemind",
      "args": ["serve"]
    }
  }
}
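If the client already has an MCP config with other servers in it, the entry above needs to be merged in rather than pasted over the file. The sketch below shows one way to do that; the helper name `add_basemind_server` is hypothetical, and the location of the config file depends on the client, so the path is left to the caller.

```python
import json
from pathlib import Path

# The server entry exactly as given in the config snippet above.
BASEMIND_ENTRY = {"command": "basemind", "args": ["serve"]}


def add_basemind_server(config_path: Path) -> dict:
    """Add (or overwrite) the basemind entry, keeping any other configured servers."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("mcpServers", {})["basemind"] = dict(BASEMIND_ENTRY)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `add_basemind_server(Path("path/to/your/mcp-config.json"))` is idempotent: running it twice leaves a single `basemind` entry.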

Supported harnesses: Claude Code · Cursor · Codex (CLI + App) · Gemini · OpenCode · Factory Droid ·
GitHub Copilot CLI · Continue · Cline. Install instructions are in the Harness-specific setup
section below; harnesses without a dedicated row there use the generic MCP config above.

CLI only

basemind scan                        # index the working tree
basemind query outline path/file.rs  # inspect structure
basemind query symbol "parseQuery"   # find by name
basemind watch                       # live re-index on file change

Why basemind, specifically

vs grep / ripgrep

What ripgrep does well: blazing-fast line matching. What it misses:

Grep can return 50+ hits across docs, tests, comments, and variable names, so the agent wastes context filtering noise.
No scope awareness: a parseQuery() call and a "parseQuery" string literal both match, so semantic signals are lost.
Every query re-scans the disk; there are no pre-computed structures to leverage.

basemind: semantic-quality answers at grep speed via tree-sitter + indexed call sites.

vs vector-only RAG (LangChain / LlamaIndex DIY stacks)

What vector RAG does well: fuzzy document semantic search. What it misses:

Pure embeddings lose exact structure — which function calls which, which class implements which interface.
No line/column resolution — agent can't map vector hits back to code symbols.
No git history integration — "what changed recently?" and "who wrote this?" require separate systems.

basemind: code structure + git history + vector memory + document search all in one, unified scope.

vs context7 / openai-codex / Aider's repo-map

What these do well: generate code-map summaries. What they miss:

Static snapshots — stale after the first edit.
No semantic indexing — every lookup re-parses or re-scans.
Human-focused output (markdown) instead of agent-facing structure (JSON tools).

basemind: live-updated index with sub-millisecond code-map lookups over MCP, built for agents, not humans.

vs GitHub native search

What GitHub does well: repository-wide fuzzy text search. What it misses:

Cloud-only — your code leaves the machine, latency is network-bound.
No local-editor integration — agent can't query in-progress edits before commit.
No polyglot support: each language's search is tuned separately.

basemind: local-only, always-fresh index of your working tree, 300+ languages in one sweep.

Performance

Measured on Apple Silicon, release build, --features full, default eager_l2 = true. A cold
filesystem cache adds roughly 50% to the first scan; the numbers below are warm steady-state.

Scan throughput

Repo	Files	Language mix	Time
tokio	859	Rust	0.2 s
react	7 061	TS / JSX	2.2 s
django	7 061	Python	2.5 s
requests	2 195	Python	0.7 s
gin	1 217	Go	1.0 s
ripgrep	12 851	Rust	4.0 s
ripgrep-shallow	12 851	Rust	0.16 s
TypeScript compiler	81 324	TS / JS / JSON	~22 s

The TypeScript compiler is the worst case: 81k files scanned in about 22 seconds. Most real repos
sit between tokio and ripgrep. Re-scans skip files whose content hash is unchanged, so the cost of
a warm rescan on an edited working tree is typically dominated by the changed-set size, not repo size.
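The rescan behavior described above can be sketched as a hash comparison against the previous scan. This is a simplified illustration, not the scanner's code: `plan_rescan` and `content_hash` are hypothetical names, files are passed in as an in-memory dict, and BLAKE2b from the standard library stands in for BLAKE3.

```python
import hashlib


def content_hash(data: bytes) -> str:
    # Stand-in: basemind hashes with BLAKE3, which is not in the standard library.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def plan_rescan(files: dict[str, bytes], previous: dict[str, str]):
    """Return (paths needing a re-parse, updated path -> hash map)."""
    current = {path: content_hash(data) for path, data in files.items()}
    changed = sorted(p for p, h in current.items() if previous.get(p) != h)
    return changed, current


tree = {"src/a.rs": b"fn a() {}", "src/b.rs": b"fn b() {}"}
first, hashes = plan_rescan(tree, {})       # cold scan: everything is new
tree["src/b.rs"] = b"fn b() { a(); }"       # edit one file
second, hashes = plan_rescan(tree, hashes)  # warm rescan: only the edit is re-parsed
print(first, second)
```

The work in the second call scales with the one edited file, which is the property the paragraph above relies on.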

Per-tool MCP latency

Against the 81k-file TypeScript index:

Latency	Tools
< 1 ms	`outline`, `list_files`, `find_references`, `find_callers`, `find_implementations`, `hot_files`, `repo_info`
3–6 ms	`search_symbols`, `call_graph`
4–10 ms	`recent_changes`, `commits_touching`, `find_commits_by_path`, `symbol_history`, `diff_outline`, `diff_file`
20–25 ms	`status`
30–40 ms	`blame_file`, `blame_symbol`
40–200 ms	`workspace_grep`
~200 ms	`search_documents`
350–600 ms	`working_tree_status`

basemind preloads L1 outlines into RAM on serve start, so code-map queries hit no disk. The Fjall
LSM inverted index handles ref/caller/impl lookups without scanning blobs. Git tool latency tracks
the gix walk cost, which dominates only on enormous histories.
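To show why an inverted index makes caller lookups cheap, here is a toy in-memory version. It is only a sketch: a Python dict stands in for the Fjall LSM partitions, and the class and method names are invented for this example.

```python
from collections import defaultdict


class CallIndex:
    """Toy inverted index: callee name -> set of (path, line, caller) call sites."""

    def __init__(self) -> None:
        self._callers = defaultdict(set)

    def add_call(self, caller: str, callee: str, path: str, line: int) -> None:
        # Written once at scan time, when the call site is extracted from the file.
        self._callers[callee].add((path, line, caller))

    def find_callers(self, callee: str) -> list:
        # One keyed lookup; no per-file blob is opened or re-parsed.
        return sorted(self._callers.get(callee, ()))


index = CallIndex()
index.add_call("handle_request", "parse_query", "src/server.rs", 42)
index.add_call("run_cli", "parse_query", "src/main.rs", 17)
print(index.find_callers("parse_query"))
```

The scan pays the indexing cost up front, so a lookup is a single key read regardless of how many files the repository holds.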

Configuration

The full config schema lives at schema/basemind-config-v1.schema.json. Minimal example:

# .basemind/basemind.toml
file_watch_glob = "**/*.{rs,ts,tsx,py,go}"
eager_l2 = true

[documents]
enabled = true

Per-query MCP overrides:

{
  "query": "what does kreuzberg do?",
  "reranker_enabled": true,
  "reranker_preset": "bge-reranker-base"
}
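On the wire, overrides like these would travel as the arguments of an MCP `tools/call` request. The sketch below builds such a request as one JSON-RPC 2.0 line; pairing these fields with `search_documents` is an assumption based on the feature table, and `tool_call_request` is a hypothetical helper.

```python
import json


def tool_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize one MCP `tools/call` request as a single JSON-RPC 2.0 line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # MCP's stdio transport is newline-delimited; json.dumps emits no raw newlines.
    return json.dumps(message)


request = tool_call_request(
    1,
    "search_documents",  # assumed target tool for the reranker overrides
    {
        "query": "what does kreuzberg do?",
        "reranker_enabled": True,
        "reranker_preset": "bge-reranker-base",
    },
)
print(request)
```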

Environment variables map mechanically: --llm-api-key ↔ BASEMIND_LLM_API_KEY. Every MCP tool
accepts per-query overrides that win over file/env/CLI layers.
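The flag-to-variable mapping can be sketched as a one-line transformation. Only the `--llm-api-key` pair is confirmed by this README; applying the same rule to other flags is an assumption, and `flag_to_env` is a hypothetical name.

```python
def flag_to_env(flag: str, prefix: str = "BASEMIND_") -> str:
    """Derive the environment variable name that mirrors a CLI flag."""
    return prefix + flag.lstrip("-").replace("-", "_").upper()


print(flag_to_env("--llm-api-key"))
```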

Architecture

source files
  → tree-sitter parsers (300+ langs, pack name dispatch)
  → L1 outlines + L2 calls + L3 structural hash blobs (content-addressed)
  → Fjall LSM inverted index (symbols / calls / imports / impls)
  → MCP server (rmcp) + documents pipeline (kreuzberg) → LanceDB
  → 32 MCP tools across 8 coding-agent harnesses

Scanner (src/scanner.rs) — rayon-parallel walker over the gitignore-aware file set.
Extracts L1 (symbols + imports + implementations), L2 (calls + docs), L3 (structural hashes)
per file.
Content-addressed blobs (src/store.rs) — msgpack at
.basemind/blobs/<blake3>.{l1,l2,l3}.msgpack. Two files with identical content share the
same blob.
Inverted index (src/index/) — Fjall LSM keyspace at
.basemind/views/<view>/index.fjall/. Nine partitions drive symbol search, references,
implementations, and dependents.
MCP surface (src/mcp/) — stdio JSON-RPC via rmcp. Tool descriptions are routing surface
for agents; each one states its semantics honestly (substring vs prefix, scope-aware vs name-only, capped).
Git layer (src/git.rs, src/git_cache.rs) — gix-backed blame, log, diff, status.
A SHA-keyed disk cache makes warm queries nearly free.
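The content-addressed blob layout described above can be approximated in a few lines. This is a sketch under stated substitutions: BLAKE2b replaces BLAKE3 and JSON replaces msgpack, because neither real format ships with Python, and `put_blob` is a hypothetical helper rather than the API of src/store.rs.

```python
import hashlib
import json
from pathlib import Path


def put_blob(root: Path, source: bytes, level: str, payload: dict) -> Path:
    """Store `payload` under a name derived from the source bytes; reuse it if present."""
    # Stand-ins: BLAKE2b for BLAKE3, .json for .msgpack.
    digest = hashlib.blake2b(source, digest_size=32).hexdigest()
    path = root / "blobs" / f"{digest}.{level}.json"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(payload))
    return path
```

Because the file name depends only on the source bytes and the level, two files with identical content resolve to the same path and the second write is skipped.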

Installation

Channel	Command	Platforms	Features
Homebrew	`brew install Goldziher/tap/basemind`	macOS, Linux	base
npm	`npm install -g basemind`	any Node 14+ platform	base
pip	`pip install basemind`	any Python 3.8+ platform	base
cargo	`cargo install basemind --locked`	any Rust platform	base
cargo (full)	`cargo install basemind --features full --locked`	any Rust platform	documents + memory + crawl
GH releases	Download binary from releases	macOS · Linux · Windows	base

Harness-specific setup

Harness	Install command
Claude Code	`/plugin marketplace add Goldziher/basemind` then `/plugin install basemind@basemind`
Cursor	See Cursor docs for plugin install flow; `basemind` manifest at `.cursor-plugin/plugin.json`
Codex CLI	`/plugins` then search for `basemind`
Codex App	Plugins panel → Coding category → basemind → `+`
Gemini CLI	`gemini extensions install https://github.com/Goldziher/basemind`
OpenCode	Add `{ "plugin": ["basemind-opencode@latest"] }` to `opencode.json`
Factory Droid	`droid plugin --help` (manifest at `.claude-plugin/marketplace.json`)
GitHub Copilot CLI	`copilot plugin --help` (same manifest)
Generic MCP	See "Any MCP client" section above

Differentiators

Content-addressed dedup — Blake3-hashed L1/L2/L3 blobs deduplicated across files and
views. Edit a file, rescan, skip unchanged hashes.
Secret-masking SecretString — api_key fields redacted in Debug/Display/Serialize.
Tracing spans and panic messages never leak the value.
Provenance ledger — every config value's origin tracked via ConfigSource (MCP > CLI >
env > TOML > defaults). Audit trail for debugging.
Schema-driven config — Rust types in src/config/ drive
schema/basemind-config-v1.schema.json via schemars; snapshot is asserted byte-equal.
Config is code.
Zero-system-dep ONNX — ort-bundled ships the runtime in the binary. No
apt install onnxruntime, no system complexity.
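The provenance ledger's precedence (MCP > CLI > env > TOML > defaults) can be illustrated with a small resolver that returns each value together with the layer it came from. The names `Resolved` and `resolve` are invented for this sketch; the real `ConfigSource` type lives in the Rust code and may look quite different.

```python
from dataclasses import dataclass
from typing import Any

# Highest-priority layer first, matching MCP > CLI > env > TOML > defaults.
PRECEDENCE = ["mcp", "cli", "env", "toml", "default"]


@dataclass(frozen=True)
class Resolved:
    value: Any
    source: str  # which layer supplied the value


def resolve(key: str, layers: dict[str, dict[str, Any]]) -> Resolved:
    """Return the winning value for `key` and record where it came from."""
    for source in PRECEDENCE:
        layer = layers.get(source, {})
        if key in layer:
            return Resolved(layer[key], source)
    raise KeyError(key)


layers = {
    "default": {"eager_l2": True},
    "toml": {"eager_l2": True},
    "env": {"eager_l2": False},
}
print(resolve("eager_l2", layers))
```

Keeping the source next to the value is what makes the audit trail possible: a surprising setting can be traced to the layer that set it.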

Project state

Real-OSS hardening: tests/harden.rs runs the full tool sweep against 8 upstream repos
(ripgrep, tokio, TypeScript, React, Django, requests, gin, ripgrep-shallow) on every release.
Canary assertions catch regressions.
CHANGELOG.md — release history and migration notes.
Contributing guide — development workflow: task setup, task check,
task build. Pre-commit hooks via prek.
License: MIT