CodeSage

CodeSage: structural and semantic code intelligence for AI agents

CodeSage is a code intelligence engine for AI coding agents. It combines structural graph queries (symbols, references, dependencies) and semantic search (embedding retrieval with cross-encoder reranking) in a single Rust binary, usable as a CLI or over MCP. Eight languages today (PHP, Python, C, C++, Rust, JavaScript, TypeScript, Go), ~250ms median query latency, ~50K-LoC PHP repos indexed in seconds.

🔍 What you can do with it

Find code by natural-language query: "where does auth happen?", "error handling in the GC".
Look up symbol definitions by name across a codebase.
Trace imports, calls, and inheritance for any symbol.
Map import and include relationships between files.
Estimate which files a change breaks (change impact analysis).
Build curated code bundles for LLM consumption in JSON, markdown, or flat-text (gitingest-style) form.
Read per-file git history: churn, fix ratio, historical co-change, risk score.
Browse the project as behavior-keyed feature slices: each slice bundles an entrypoint + owned files + context files + tests + crossed trust boundaries, mapped deterministically from build manifests and framework routing (Cargo bins, Laravel routes, php-src ext/*, Next.js app/**, Python __main__, Go cmd/*, etc.).
Inspect trust boundaries per file (network, filesystem, process-exec, secrets, database, user-input, external-api, serialization, auth, concurrency), derived from imports/includes/calls. The same signal feeds into assess_risk and surfaces as a security-review note when a file crosses ≥3 boundaries.
Expose all of the above over MCP so Claude Code, Codex, or Cursor can call them.

Capability summary

Concrete answers to the questions a code-intelligence tool earns its keep on. The axes are the ones the broader ecosystem (GitNexus, SocratiCode, code-review-graph, claude-context, repowise) converges on; the right-hand column is what CodeSage actually ships.

Capability	CodeSage
Natural-language semantic search	✓ MiniLM embeddings + cross-encoder reranker, sub-100 ms warm
Symbol-level lookup (definitions, references, callers/callees, inheritance)	✓ tree-sitter, 8 languages, exact line/column ranges
File-level dependency mapping (imports / imported-by)	✓ via `list_dependencies`
Change impact / blast-radius analysis	✓ via `impact_analysis`, configurable depth, symbol or file target
Call-flow / "who-touches-X" tracing	✓ via `find_references` + `impact_analysis` composition
Per-file risk score (churn, fix ratio, blast radius, coupling, test gap, cycles)	✓ via `assess_risk`, six-signal blend
Patch-level risk aggregation (max/mean, hotspots, test-gap files)	✓ via `assess_risk_diff`; per-file batch via `assess_risk_batch`
Historical co-change / coupling	✓ via `find_coupling`, decay-weighted with τ=180d
Test-recommendation for a changed file set	✓ via `recommend_tests`, sibling conventions for 7 frameworks + co-change
Curated context bundle for downstream LLM	✓ via `export_context`, callers + callees optional
Session-baseline diff (did this session decay the index?)	✓ via `session_start` / `session_end`, cycle + risk regressions
Cycle / SCC detection in the import graph	✓ folded into `assess_risk` and `assess_risk_diff.cycles_touching_patch`
Feature-slice mapping (behavior-keyed bundles)	✓ via `codesage map` / `features-list` / `feature-show` / `feature-for`, MCP `list_features` / `find_feature`
Curated feature bundle (entry + owned + tests + context for one slice)	✓ via `codesage feature-bundle <id>` and MCP `feature_bundle`
Trust-boundary derivation (network / fs / secrets / process-exec / db / etc.)	✓ per-file table from imports/includes/calls, aggregated per feature, feeds `assess_risk`
Host-agnostic deployment (no Docker, no managed services)	✓ single static Rust binary + one SQLite file per project
Auto-refresh on commit/merge/checkout/rebase	✓ git hooks installed by `codesage install-hooks`
Symbol-level edits (rename, move, replace_symbol_body)	— read-only by design; pair with Serena or your editor
Multimodal ingest (images / audio / video / PDFs)	— out of scope, code-intel only
Cross-repo queries	— single-project routing today; on the roadmap, not shipped

Supported languages

PHP, Python, C, C++, Rust, JavaScript, TypeScript, Go.

Why a single Rust binary

CodeSage ships as one static Rust binary plus a local SQLite database under .codesage/ per project. No Docker container, no external vector DB server, no embedding service, and no service manager. CLI commands run directly. MCP clients use codesage mcp, a stdio shim that starts or reuses a user-local Unix-socket daemon so concurrent agent sessions share one project cache, embedding model pool, reranker pool, and CUDA context.

The trade-off: CUDA-accelerated embeddings need the nvidia-*-cu12 pip packages on the host (see CUDA setup below). In exchange, you install once and run everywhere, with no orchestration layer and no systemd unit to manage. Tools in the same category that take the other side of this trade (SocratiCode with managed Qdrant + Ollama, GitNexus with external Qdrant) are valid for different user profiles. If your team already runs Docker Compose for everything, use those. If you want cargo install, codesage init, and an on-demand local daemon hidden behind stdio MCP, use CodeSage.

📊 Benchmarks

Ground-truth retrieval on git-mined corpora, 30 cases per repo, search top-10:

repo	miss rate	mean recall@10
BurntSushi/ripgrep @ `4519153e5e46` (101 files, 52K LoC)	13%	0.79
nestjs/nest @ `8eec029772fa` (1,672 files, 110K LoC)	3%	0.94
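
For readers who want to reproduce these two columns on their own corpus, here is a minimal sketch of how they could be computed. It assumes a miss means that none of a case's expected files appears in the top 10, and that recall@10 is the fraction of expected files found there. The function names, dict keys, and sample data are hypothetical and not taken from the bench harness.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked: Sequence[str], expected: Iterable[str], k: int = 10) -> float:
    """Fraction of expected files that appear in the top-k results."""
    expected_set = set(expected)
    if not expected_set:
        return 0.0
    hits = expected_set & set(ranked[:k])
    return len(hits) / len(expected_set)


def summarize(cases: list, k: int = 10) -> dict:
    """Aggregate miss rate and mean recall@k over a list of cases.

    Each case is a dict with 'ranked' (search output, best first) and
    'expected' (ground-truth files). Both keys are illustrative.
    """
    recalls = [recall_at_k(c["ranked"], c["expected"], k) for c in cases]
    misses = sum(1 for r in recalls if r == 0.0)
    return {
        "miss_rate": misses / len(cases),
        "mean_recall": sum(recalls) / len(cases),
    }


# Made-up cases: one full hit, one partial hit, one miss.
cases = [
    {"ranked": ["a.rs", "b.rs", "c.rs"], "expected": ["a.rs", "c.rs"]},
    {"ranked": ["x.rs", "y.rs"], "expected": ["y.rs", "z.rs"]},
    {"ranked": ["p.rs"], "expected": ["q.rs"]},
]
summary = summarize(cases)
```

Under this definition a case with partial recall still counts as a hit, which is why miss rate and mean recall are reported separately.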

Head-to-head against code-review-graph 2.3.2 (same corpora, same queries, code-review-graph configured with matching test-directory exclusions for fairness):

repo	CodeSage miss	code-review-graph miss	CodeSage per-query wall-clock	code-review-graph per-query wall-clock
ripgrep	13%	17%	~0.25 s	0.80 s
nest	3%	40%	~0.25 s	1.10 s

The nest gap is architectural: CodeSage embeds chunks (~50-line regions), code-review-graph embeds nodes (functions). Commit-style queries that describe behavior spanning multiple functions match chunks more reliably than individual function bodies.

External-corpus benchmark (semble)

semble ships a published retrieval-evaluation corpus: 1,251 queries × 63 repos × 19 languages, with file-level ground truth in benchmarks/annotations/. It is a cleaner target than the git-mined "files-changed-in-same-commit" proxy, and it is externally defined: codesage's authors did not write it.

Running codesage search (jina-embeddings-v2-base-code + ms-marco-MiniLM-L6-v2 reranker, GPU) on the corpus at its pinned SHAs:

Sample	n queries	recall@10 (primary)	NDCG@10	mean first-hit rank
Supported-language repos (30 of 63)	602	0.932	0.788	1.79
Full corpus (63 repos, missing parsers = miss)	1,251	0.448	0.379	—

The headline number is the 602-query / 8-language slice — that's what compares apples-to-apples against the languages codesage actually parses. The full-corpus number reflects the parser-coverage gap (36% of corpus targets Java, Ruby, Kotlin, Scala, C#, Swift, Elixir, Haskell, Lua, Zig, or Bash — none currently supported); it is a language-coverage number, not a retrieval-quality number.
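
A quick arithmetic check makes the relationship between the two rows concrete. The sketch below assumes every query outside the 602-query slice scores zero (the "missing parsers = miss" rule in the table) and rescales the supported-slice figures to the full corpus. The variable names are illustrative.

```python
# Figures copied from the table above.
supported_queries = 602
total_queries = 1251
supported_recall = 0.932
supported_ndcg = 0.788

# Assumption: every query outside the supported slice contributes 0.
full_recall = supported_recall * supported_queries / total_queries
full_ndcg = supported_ndcg * supported_queries / total_queries

print(f"{full_recall:.3f} {full_ndcg:.3f}")  # prints "0.448 0.379"
```

Both values match the full-corpus row, which is consistent with reading that row as a coverage figure.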

By-language headline (8 supported): JavaScript 0.892, Go 0.887, PHP 0.885 lead; TypeScript 0.595 trails (zod + vitest specifically — test-file flood dominates top-10 on phrase-matched queries).

This is not a "codesage > semble" claim. A head-to-head would require running semble end-to-end on the same 63 repos under matched conditions, which is out of scope here. The number is codesage measured against semble's published ground truth.

Run it yourself with bench/codesage-bench-runner <corpus.yaml> (corpus format: project_root plus a cases list of {id, query, expected_files}). Scorecards from these runs live under bench/history/; corpora are not bundled so private-repo names don't leak by accident. These numbers are not a statement about every workload; bring your own corpus for your codebase.

🚀 Getting started

# Build with GPU support
cargo build --release -p codesage --features cuda

# Initialize and index a project
cd /path/to/your/project
codesage init
codesage index

# Search
codesage search "authentication handler"
codesage search --json --limit 20 "database connection pooling"

# Structural queries
codesage find-symbol MyClass
codesage find-references some_function --kind call
codesage dependencies src/main.py

# Change impact analysis (who breaks if you touch this?)
codesage impact DocumentRepository --depth 2 --source-only
codesage impact src/auth/session.ts --json

# Context bundle for LLM consumption
codesage export "authentication flow" --limit 5 --callers
codesage export MyClass --symbol --format md
codesage export "auth flow" --format ingest    # gitingest-style flat-text bundle

# Git history: churn, fix ratio, co-change, risk score
codesage git-index                                          # initial populate; hooks keep it fresh
codesage git-index --full                                   # force full rescan (weekly hygiene)
codesage coupling src/auth/session.ts --limit 5             # files that historically change with this
codesage risk src/auth/session.ts                           # score with decomposition

# MCP for Claude Code / Codex / Cursor (stdio shim starts/reuses one local daemon)
claude mcp add --scope user codesage -- codesage mcp

# Auto-reindex on git operations
codesage install-hooks

# Diagnose installation
codesage doctor

⚙️ Recipes

Common pipelines that combine codesage with git. Each recipe gives the shell command(s), then explains how to read the output.

Risk check before committing

git diff --cached --name-only | codesage risk-diff

Pipes the staged file list through assess_risk_diff. Output shows the max risk score, files in each risk bucket (hotspot, fix-heavy, test-gap, wide blast radius), and paste-ready summary notes for the commit message or PR description. If max_score >= 0.6 or test_gap_files is non-empty, add tests, split the patch, or call it out in the PR description.
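
As a sketch of how that rule could gate a pre-commit hook, the function below applies it to the parsed JSON from codesage risk-diff --json. The names max_score and test_gap_files come from the paragraph above; treating them as top-level keys of the JSON output is an assumption, and the sample payload is made up.

```python
import json


def needs_attention(report: dict, threshold: float = 0.6) -> bool:
    """Return True when the patch should get tests, a split, or a PR callout."""
    max_score = float(report.get("max_score", 0.0))
    test_gap_files = report.get("test_gap_files") or []
    return max_score >= threshold or len(test_gap_files) > 0


# Hypothetical payload standing in for real risk-diff output.
sample = json.loads('{"max_score": 0.42, "test_gap_files": ["src/auth/session.ts"]}')
flagged = needs_attention(sample)  # low score, but a test gap is present
clean = needs_attention({"max_score": 0.2, "test_gap_files": []})
```

A hook could exit non-zero when the function returns True, leaving the final decision to the author.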

Tests to run after editing

git diff --cached --name-only | codesage tests-for

Returns sibling tests (resolved by language convention) plus tests that historically change with the edited files (from co-change history). Replaces "I'll run all tests" with a focused list.

Audit a feature branch before opening a PR

git diff origin/main...HEAD --name-only | codesage risk-diff

Same as the pre-commit check, but scoped to everything on the branch instead of just the staged diff. Useful as the last step before gh pr create.

What changed in the last week, ranked by risk

git log --since='1 week ago' --name-only --pretty='' | sort -u | codesage risk-diff --json | jq '.files[] | select(.score >= 0.5) | .file'

Lists high-risk files touched in recent history. Good signal during a retrospective or a "where should we focus refactoring?" discussion.
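
The jq filter keeps files at or above 0.5 but does not reorder them itself. If you want an explicit ranking, a small script can sort the same JSON. This sketch assumes each entry under files carries file and score keys, as the jq expression implies; the sample data is made up.

```python
def rank_risky(report: dict, min_score: float = 0.5) -> list:
    """Return (file, score) pairs with score >= min_score, highest risk first."""
    risky = [
        (entry["file"], entry["score"])
        for entry in report.get("files", [])
        if entry["score"] >= min_score
    ]
    return sorted(risky, key=lambda pair: pair[1], reverse=True)


# Hypothetical report shape, mirroring the jq path '.files[] | .score, .file'.
sample = {"files": [
    {"file": "src/a.rs", "score": 0.55},
    {"file": "src/b.rs", "score": 0.30},
    {"file": "src/c.rs", "score": 0.81},
]}
ranked = rank_risky(sample)
```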

Trifecta for one file

codesage risk path/to/file.rs
codesage tests-for path/to/file.rs
codesage coupling path/to/file.rs --limit 5

Use this when you're about to dive into one specific file. The risk score, suggested tests, and historical co-change partners help calibrate how cautious to be before you start editing.

Browse the project as feature slices

codesage map                                 # populate feature tables
codesage features-list --kind route --json   # all HTTP/router routes
codesage feature-for app/Http/Controllers/UserController.php
codesage feature-show feat_<id> --json       # one slice + its file refs + trust boundaries
codesage feature-bundle feat_<id> --json     # bundle the slice's code for an LLM

Use when answering "what slice owns this file?" or "give me the whole flow behind /users". The bundle is the same shape as export_context but anchored on the feature's curated file list instead of semantic search results.

Trust-boundary inspection

codesage trust-boundaries crates/cli/src/main.rs --json

Per-file capability tags (network, filesystem, process-exec, secrets, database, user-input, external-api, serialization, auth, concurrency) derived from imports / includes / calls. The same signal contributes to assess_risk and surfaces a "crosses N trust boundaries — security review recommended" note when a file touches three or more.
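
The three-boundary rule is simple enough to sketch. The function below takes a list of tags for one file and returns a review note when three or more distinct known boundaries are present. The tag vocabulary is copied from the list above; the function name, the input shape, and the exact note wording are assumptions, not the tool's actual output.

```python
from typing import Optional

BOUNDARIES = {
    "network", "filesystem", "process-exec", "secrets", "database",
    "user-input", "external-api", "serialization", "auth", "concurrency",
}


def review_note(tags: list, threshold: int = 3) -> Optional[str]:
    """Return a security-review note if enough distinct boundaries are crossed."""
    crossed = set(tags) & BOUNDARIES  # ignore duplicates and unknown tags
    if len(crossed) >= threshold:
        return f"crosses {len(crossed)} trust boundaries, security review recommended"
    return None


note = review_note(["network", "secrets", "auth"])
quiet = review_note(["network", "auth"])
```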

🔌 Claude Code plugin

plugins/codesage-tools/ wraps everything above into one command per task. The marketplace manifest lives at the repo root.

claude plugin marketplace add /path/to/codesage
claude plugin install codesage-tools@codesage
/codesage-onboard /path/to/project

Slash commands: /codesage-onboard, /codesage-reset, /codesage-reindex, /codesage-bench, /codesage-eval. The plugin handles global MCP registration, per-project init, indexing, git hook install (Husky-aware), and writes a .claude/CLAUDE.md hint teaching the agent how to route MCP calls.

🔍 Feature-slice review

CodeSage maps a project into behavior-keyed feature slices (routes, CLIs, libraries, test suites, jobs). The codesage-tools plugin ships a four-command workflow that dispatches read-only subagent reviews (one per slice, in parallel batches) and persists findings to gitignored JSON under .codesage/findings/. Each finding gets a stable fnd_<hex> ID so it can be referenced in commit messages and PR comments. Re-running keeps prior triage (status + audit-trail history) intact and merges new defects into the same per-feature file.

The subagent is read-only (autoApprove: read); it consumes the existing MCP surface (feature_bundle, assess_risk, find_references, find_coupling) plus Read. CodeSage's core stays read-only; findings are output that other tooling can consume.

`/codesage-review`

Dispatches subagents in parallel batches over the project's mapped feature slices.

/codesage-review <project> [--limit N] [--jobs N] [--feature <id>]
                           [--kind <k>] [--severity <s>] [--categories <c,c,...>]

<project> — absolute path to an onboarded codesage project (must contain .codesage/index.db)
--limit N — cap the number of features reviewed in one run (default 50)
--jobs N — parallel subagents per batch (default 4, hard ceiling 8)
--feature <id> — review one specific feat_<hex>, skipping discovery
--kind <k> — filter features by kind: route, cli-command, service, library, test-suite, config, job
--severity <s> — minimum severity to report: low / medium / high (default medium)
--categories <c,c,...> — comma-separated list (default bug,security); other values include perf, maintainability

Features whose .codesage/findings/<feature_id>.json is newer than the feature's updated_at AND whose last run was complete are skipped (already up-to-date). Sort order: route > cli-command > service > library > rest, then high confidence first.
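
The skip rule and sort order above can be sketched as a small planner. This is an illustration under stated assumptions: the Feature fields, the numeric confidence value, and applying --limit after skipping are all hypothetical choices, and the feature IDs are placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Lower number sorts first; anything not listed falls into "rest".
KIND_PRIORITY = {"route": 0, "cli-command": 1, "service": 2, "library": 3}


@dataclass
class Feature:
    id: str
    kind: str
    confidence: float                # hypothetical numeric confidence
    updated_at: float                # epoch seconds
    findings_mtime: Optional[float]  # None when no findings file exists yet
    last_run_complete: bool


def is_up_to_date(f: Feature) -> bool:
    """Findings file is newer than the feature AND the last run completed."""
    return (
        f.findings_mtime is not None
        and f.findings_mtime > f.updated_at
        and f.last_run_complete
    )


def plan(features: list, limit: int = 50) -> list:
    """Drop up-to-date features, then order by kind priority and confidence."""
    pending = [f for f in features if not is_up_to_date(f)]
    pending.sort(key=lambda f: (KIND_PRIORITY.get(f.kind, len(KIND_PRIORITY)), -f.confidence))
    return pending[:limit]


features = [
    Feature("feat_lib", "library", 0.9, 100.0, None, False),
    Feature("feat_job", "job", 0.9, 100.0, None, False),
    Feature("feat_route_lo", "route", 0.4, 100.0, None, False),
    Feature("feat_route_hi", "route", 0.8, 100.0, 50.0, True),   # findings are stale
    Feature("feat_cli_done", "cli-command", 0.7, 100.0, 200.0, True),  # skipped
]
order = [f.id for f in plan(features)]
```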

`/codesage-triage`

Pure local state edit — appends a history entry on the named finding and updates its status. No LLM call, no re-review.

/codesage-triage <project> --finding <fnd_id> --status <open|false-positive|wont-fix|fixed> [--note <text>]

--finding <fnd_id> — the fnd_<hex> ID from .codesage/findings/<feature_id>.json
--status <s> — new status: open, false-positive, wont-fix, or fixed
--note <text> — optional free-form note stored alongside the history entry

`/codesage-revalidate`

Re-runs the subagent against a specific feature slice (or a single finding's owning slice) and reconciles. Auto-flips open → fixed when the defect no longer surfaces. Never mass-reopens false-positive or wont-fix.

/codesage-revalidate <project> [--feature <id>] [--finding <fnd_id>]

--feature <id> — re-review one feature slice
--finding <fnd_id> — re-review the slice that owns this finding (and check whether it's still present)
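
The reconciliation rule described for /codesage-revalidate can be sketched as a pure function over finding statuses. The function name and data shapes are hypothetical. The text above does not say what happens when a fixed finding resurfaces, so this sketch leaves every non-open status untouched.

```python
def reconcile(previous: dict, resurfaced: set) -> dict:
    """Map finding ID -> new status after a re-review.

    previous:   finding ID -> current status
    resurfaced: IDs the re-review still reports
    """
    updated = {}
    for finding_id, status in previous.items():
        if status == "open" and finding_id not in resurfaced:
            updated[finding_id] = "fixed"   # defect no longer surfaces
        else:
            updated[finding_id] = status    # never reopen triaged findings
    return updated


# Placeholder IDs; real ones follow the fnd_<hex> pattern.
before = {
    "fnd_a": "open",
    "fnd_b": "open",
    "fnd_c": "false-positive",
    "fnd_d": "wont-fix",
}
after = reconcile(before, resurfaced={"fnd_b", "fnd_c"})
```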

`/codesage-report`

Deterministic Markdown render of the findings JSON. No LLM call.

/codesage-report <project> [--status <s>] [--severity <s>] [--category <c>] [--feature <id>]

--status <s> — filter to one status (default: all except false-positive and wont-fix)
--severity <s> — minimum severity to render
--category <c> — filter to one category
--feature <id> — render findings for a single feature

State paths

Path	Content
`.codesage/findings/<feature_id>.json`	Per-feature findings + audit-trail `history[]` per finding (status, action, run_id, timestamp)
`.codesage/findings/history/<feature_id>-<run_id>.json`	Per-run snapshot of the feature's findings — never modified after write
`.codesage/reviews/<run_id>.json`	Run record: filters used, features planned, completion stats by severity/category, top features by finding count, severity-high list

Both directories are added to .gitignore by /codesage-onboard (or its hint).

Example workflow

# Initial sweep over every mapped feature
/codesage-review /path/to/project

# Look at the result
/codesage-report /path/to/project

# Triage a false positive
/codesage-triage /path/to/project --finding fnd_b3a1c4e7 --status false-positive --note "regex is anchored, not exploitable"

# Fix a real bug, then re-check
$EDITOR src/server.ts
/codesage-revalidate /path/to/project --finding fnd_9c80fa62

Indexing pipeline

codesage index walks the project, parses every supported file, extracts structural data and embeddings, and writes both into the same SQLite database.

flowchart LR
    A[Project files] --> B[Discover<br/>walk + excludes]
    B --> C[Tree-sitter parse]
    C --> D[Extract symbols<br/>and references]
    C --> E[Chunk text<br/>recursive splitter]
    D --> F[(SQLite<br/>files, symbols, refs)]
    E --> G[Embed via ONNX<br/>MiniLM-L6-v2]
    G --> H[(sqlite-vec<br/>chunks_minilm_384)]

Parsing happens in parallel via Rayon; SQLite writes are batched. Re-running codesage index is incremental: only files whose content hash changed are re-parsed and re-embedded.
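
The incremental check can be pictured as a comparison between stored and freshly computed content hashes. The sketch below is an illustration only: SHA-256 is an assumption (the hash function is not stated here), and the function names and in-memory dicts stand in for whatever the index actually stores.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex digest of a file's bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def files_to_reindex(current: dict, stored_hashes: dict) -> list:
    """Paths that are new or whose content hash differs from the stored one."""
    return sorted(
        path
        for path, data in current.items()
        if stored_hashes.get(path) != content_hash(data)
    )


stored = {
    "a.py": content_hash(b"print(1)\n"),
    "b.py": content_hash(b"x = 1\n"),
}
current = {
    "a.py": b"print(1)\n",  # unchanged
    "b.py": b"x = 2\n",     # edited
    "c.py": b"new\n",       # added
}
changed = files_to_reindex(current, stored)
```

Hashing content rather than comparing modification times means a checkout that rewrites files without changing them triggers no re-embedding.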

Search pipeline

A query flows through five stages:

flowchart LR
    Q[Query string] --> E[Embed<br/>MiniLM-L6-v2]
    E --> K[KNN retrieval<br/>sqlite-vec<br/>overfetch 5x]
    K --> B[Symbol boost<br/>+0.1 per token match]
    B --> R[Cross-encoder rerank<br/>ms-marco<br/>blend 50/50]
    R --> A[Symbol annotation]
    A --> T[Top-N results]

1. Embed the query with MiniLM-L6-v2 (22M params, 384d) via ONNX Runtime.
2. Retrieve the nearest chunks from sqlite-vec by KNN, overfetching 5x. Chunks are embedded at index time with their file path and symbol context prepended.
3. Boost chunks whose content matches known symbol names (+0.1 per token match).
4. Re-score the top candidates with ms-marco-MiniLM-L6-v2 and blend 50/50 with the semantic score.
5. Annotate each result with overlapping function and class names.

The reranker is optional. Set or remove it in config.toml; stages 1-3 and the annotation still run without it.
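
The scoring arithmetic in the pipeline can be sketched in a few lines. The constants (5x overfetch, +0.1 per token match, 50/50 blend) come from the flowchart above. Everything else is an assumption: the whitespace tokenizer, case-insensitive exact matching against symbol names, and both scores sharing a comparable scale.

```python
from typing import Optional


def overfetch_count(limit: int, factor: int = 5) -> int:
    """Number of KNN candidates to fetch before boosting and reranking."""
    return limit * factor


def symbol_boost(query: str, chunk_symbols: set, per_match: float = 0.1) -> float:
    """Add per_match for each query token that equals a known symbol name."""
    tokens = {t.lower() for t in query.split()}
    symbols = {s.lower() for s in chunk_symbols}
    return per_match * len(tokens & symbols)


def final_score(semantic: float, rerank: Optional[float], weight: float = 0.5) -> float:
    """Blend reranker and semantic scores; fall back to semantic if no reranker."""
    if rerank is None:
        return semantic
    return weight * rerank + (1 - weight) * semantic


candidates = overfetch_count(10)
boosted = 0.62 + symbol_boost("parse config loader", {"parse", "Loader", "run"})
with_reranker = final_score(boosted, rerank=0.9)
without_reranker = final_score(boosted, rerank=None)
```

With the reranker removed from config.toml, the last step reduces to the boosted semantic score.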

Configuration

codesage init generates .codesage/config.toml:

[project]
name = "my-project"

[embedding]
model = "sentence-transformers/all-MiniLM-L6-v2"
device = "gpu"                                        # "gpu" or "cpu"
reranker = "cross-encoder/ms-marco-MiniLM-L6-v2"     # optional, remove to disable

[index]
exclude_patterns = [
  "**/tests/**", "**/vendor/**", "**/node_modules/**",
  "**/*.test.ts", "**/*Test.php", "**/*.phpt",
]

Models download from HuggingFace the first time you use them.

🏗️ Architecture

A Rust workspace with six crates:

flowchart TD
    cli[cli<br/>binary + CLI + MCP shim]
    daemon[MCP daemon<br/>shared project/model pools]
    gr[graph<br/>indexing + query pipeline]
    parser[parser<br/>tree-sitter + discovery]
    storage[storage<br/>SQLite + sqlite-vec + FTS5]
    embed[embed<br/>ONNX + reranker + chunking]
    protocol[protocol<br/>shared types]

    cli --> daemon
    cli --> gr
    daemon --> gr
    gr --> parser
    gr --> storage
    gr --> embed
    parser --> protocol
    storage --> protocol
    embed --> protocol
    gr --> protocol

Crate	Role
`protocol`	Shared types (Symbol, Reference, SearchResult)
`parser`	File discovery, tree-sitter parsing, symbol and reference extraction
`storage`	SQLite with sqlite-vec KNN and FTS5
`embed`	ONNX embedding inference, cross-encoder reranking, chunking
`graph`	Indexing orchestration and search pipeline
`cli`	Binary with CLI subcommands, stdio MCP shim, and Unix-socket MCP daemon

Storage is a single SQLite database per project at .codesage/index.db: structural tables (symbols, refs, files) plus model-specific vector tables for embeddings.

Retrieval benchmarks

bench/ holds the harness:

codesage-bench-runner runs a YAML corpus of ground-truth cases through codesage search and reports miss rate, median first-hit, recall@5, and recall@10.
extract-eval-cases.py mines eval cases from Claude Code session transcripts and git commit history.

Corpora aren't bundled. Bring your own, or point the plugin at $CODESAGE_BENCH_CORPUS_DIR.

⚠️ Known limitations

Honest inventory of what CodeSage does not do well, measured on our canary corpora and from 30 days of real Claude Code session logs (the harness in bench/analyze-codesage-quality.py produces the same numbers locally).

Language surface is narrower than competitors'. Eight languages today (added C++ in 0.4.5). Graphify ships 25, code-review-graph 23, SocratiCode 18+. The gap matters most if your stack is Ruby, Java, Kotlin, Swift, or Scala. Measured cost: on the semble retrieval corpus (1,251 queries × 63 repos × 19 languages), 36% of queries target a language codesage does not parse — zero recall on those. The tree-sitter query files live under crates/parser/src/queries/ and contributions there are the cleanest way to extend coverage.

Retrieval misses on cross-file refactor queries. On the ripgrep corpus, 13% of cases miss top-10; four of those six misses are commit subjects like printer: drop dependency on serde_derive that describe a rename spanning multiple files without a distinctive literal signal. Single-identifier lookups (find_symbol, find_references) are reliable. Pure semantic searches (search) are reliable. Diffuse multi-file refactor descriptions expressed in prose are the failure mode.

impact_analysis biases toward over-prediction. The tool walks reference edges up to a configurable depth and reports every reachable file. Agents get false positives but almost never false negatives (short of a stale index). We picked that side of the precision/recall trade because an agent can filter a list of 20 candidates faster than it can recover from a missed dependency that bites in review. If you want high precision at the cost of recall, set --depth to 1 and pass --source-only.
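
The depth-bounded walk can be sketched as a breadth-first search over "referenced by" edges. This is a simplified illustration, not the tool's implementation: the edge map, file names, and function name are hypothetical.

```python
from collections import deque


def impacted(target: str, referenced_by: dict, depth: int) -> set:
    """Everything reachable from target via 'referenced by' edges within depth hops."""
    seen = {target}
    frontier = deque([(target, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand past the configured depth
        for dependant in referenced_by.get(node, ()):
            if dependant not in seen:
                seen.add(dependant)
                frontier.append((dependant, dist + 1))
    seen.discard(target)
    return seen


# Hypothetical graph: key is referenced by each file in the value set.
edges = {
    "session.ts": {"login.ts", "middleware.ts"},
    "login.ts": {"routes.ts"},
    "routes.ts": {"app.ts"},
}
shallow = impacted("session.ts", edges, depth=1)
deeper = impacted("session.ts", edges, depth=2)
```

Each extra hop can only grow the result set, which is why lowering the depth trades recall for precision.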

MCP tool-selection rate is low today. When CodeSage MCP tools are available in a Claude Code session alongside Grep, the agent picks Grep on code-identifier queries: 1.1% CodeSage-pick rate over 30 days of sessions, 0/10 on a controlled active harness. We sharpened tool descriptions and per-project CLAUDE.md guidance to call this out; the next measurement cycle will show whether the intervention landed. For a hook-level workaround today, see the LSP enforcement kit in the Complementary tools section.

find_coupling returns empty on young files. Measured 59% empty-response rate in real usage. Each empty result now carries a note field ("no commits tracked", "below min-count=3 threshold", "path shape mismatch") so the agent can tell the cause. The underlying data just doesn't exist for recently-added files; the tool reports that honestly instead of inventing signal.
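
To see why young files come back empty, consider a sketch of decay-weighted co-change scoring. The τ=180d constant and the min-count=3 threshold come from this README; the exponential form exp(-age/τ), the input shape, and the function name are assumptions made for illustration.

```python
import math

TAU_DAYS = 180.0  # decay constant from the capability table
MIN_COUNT = 3     # minimum shared commits before a partner is reported


def coupling(co_change_ages_days: dict) -> dict:
    """Score partner files by decay-weighted co-change.

    co_change_ages_days maps a partner file to the ages (in days) of the
    commits in which it changed together with the target file.
    """
    if not co_change_ages_days:
        return {"results": [], "note": "no commits tracked"}
    scored = []
    for partner, ages in co_change_ages_days.items():
        if len(ages) < MIN_COUNT:
            continue  # too little history to call it coupling
        weight = sum(math.exp(-age / TAU_DAYS) for age in ages)
        scored.append((partner, weight))
    if not scored:
        return {"results": [], "note": "below min-count=3 threshold"}
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return {"results": scored, "note": None}


young = coupling({"helper.rs": [1.0, 2.0]})
mature = coupling({"old.rs": [400, 500, 600], "new.rs": [1, 2, 3]})
```

A file with only two shared commits produces an empty result with a note, while recent co-changes outrank older ones of equal count.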

🔗 Pairs with

whetstone: agents, commands, and skills that tell coding agents how to work. CodeSage is the intelligence layer (what the code is); whetstone is the discipline layer (how to investigate, review, and ship). Install both for the full stack.

Complementary tools

These address different layers than CodeSage and work well alongside it:

rtk: static compression proxy for noisy CLI output (git diff, pytest, cargo build). Different layer than CodeSage: CodeSage narrows what the agent reads for code questions, rtk compresses how much it reads for command output. Token-reduction claims from the two tools are additive, not overlapping; measure them separately when quoting.
claude-code-lsp-enforcement-kit: hook pack that blocks Grep on code-symbol patterns and steers agents toward LSP / MCP tool calls. Provider-agnostic; auto-detects CodeSage's MCP alongside cclsp and Serena. Worth pairing if your tool-selection-rate numbers (see bench/analyze-codesage-quality.py) stay low after description-level interventions.

Contributing

See CONTRIBUTING.md. In short: file an issue first, add a test, update CHANGELOG.md under [Unreleased] for user-visible changes.

License

MIT

Follow @iliaa on X • Blog • If this gave your AI agent a real model of your code, ⭐ star it!