vex



SUMMARY

Hybrid structural + semantic code search for LLMs — compact output, MCP server, 19 languages. Tree-sitter + FST + HNSW in a zero-copy mmap'd index.

README.md

Vex

License: MIT

Fast hybrid structural + semantic code search. Vector + index.

Why Vex? · How It Compares · Installation · Quick Start · Commands · Configuration · How Search Works · Benchmarks · Supported Languages · Integration · Testing · Architecture

$ vex search "TelemetryProcessor"          # 4ms — find symbol definitions
$ vex search "timeout retry"               # NEW: BM25 finds rare body terms
$ vex show "TelemetryProcessor"            # extract just the class body (not the whole file)
$ vex search "handle alert" --semantic     # find by meaning, not just name
$ vex pattern 'fn $NAME($$$) -> Result'    # AST pattern matching (like ast-grep)
$ vex usages "Config"                      # who references this symbol?
$ vex implementations "BaseService"        # who extends/implements this?
$ vex callers "process_event"              # who calls this function? (~4ms — FST lookup)
$ vex similar "PaymentService"             # NEW: semantically close symbols
$ vex duplicates --threshold 0.95          # NEW: near-duplicate pairs
$ vex check "Foo" "Bar" "Baz"              # fast existence check

Why Vex?

  • ~4ms search after indexing — FST-based O(query_len) lookup, not O(symbols). Requires a pre-built index (indexing takes 20ms-600ms+ depending on project size)
  • 3-channel hybrid search — structural FST (names) + BM25 (rare body terms) + semantic HNSW (meaning), fused via Reciprocal Rank Fusion. Find symbols when you don't know the exact name AND when generic semantic-only search would be too noisy
  • Persistent call graph: vex callers / vex callees read from an FST built at index time (~4ms), not a live tree-sitter scan (seconds)
  • Pluggable embedder: Embedder trait + registry; swap MiniLM-L6-v2 for future code-specific models (BGE, CodeBERT) without touching call sites
  • Token-efficient — compact output uses 6-88x fewer tokens than grep, vex show extracts just the symbol body instead of the whole file
  • 19 languages out of the box — Rust, Python, Go, Java, C/C++, C#, Ruby, Swift, Kotlin, TypeScript, SQL, Markdown, PHP, Bash, Lua, CSS, HTML, YAML, TOML
  • Single binary, zero config — no LSP servers, no databases, no Docker. Just vex index && vex search

How It Compares

| | vex | ripgrep | ast-index | ast-grep | Serena |
|---|---|---|---|---|---|
| What it searches | Symbol definitions | All text | Symbol definitions | AST patterns | Symbols (via LSP) |
| Requires indexing? | Yes (20ms-600ms+) | No | Yes | No | No |
| Search speed | ~4ms (pre-built FST) | 75-120ms (disk scan) | 22-60ms (SQLite) | ~30ms (scan) | LSP-dependent |
| Semantic search | HNSW + embeddings | -- | -- | -- | -- |
| Pattern matching | fn $NAME($$$) | regex only | -- | fn $NAME($$$) | regex only |
| Index size | 5 MB / 20K syms | no index | 190 MB / 20K syms | no index | no index |
| Token efficiency | 6-88x fewer than rg | baseline | ~3x fewer than rg | N/A | N/A |
| Symbol body extraction | vex show | -- | -- | -- | -- |
| Languages | 19 | any | 10+ | 10+ | 40+ (LSP) |
| Refactoring | -- | -- | -- | -- | rename, move, inline |
| Runtime deps | none | none | none | none | Python + LSP |

Note: vex search speed assumes a pre-built index. Ripgrep and ast-grep require no upfront indexing and work immediately on any directory. The tradeoff is amortized: if you search the same codebase many times (typical in agent workflows), the one-time indexing cost pays for itself.

Best for: fast symbol search in AI agent workflows where token efficiency matters. Not a replacement for LSP-based tools (no refactoring, no go-to-definition in dependencies).

Installation

# Homebrew (macOS/Linux)
brew tap tenatarika/tap
brew install vex

# From source (any platform with a Rust toolchain)
git clone https://github.com/tenatarika/vex.git
cd vex
cargo build --release
cp target/release/vex ~/.local/bin/

Windows

Pre-built vex.exe ships in every GitHub Release.

  1. Download vex-x86_64-pc-windows-msvc.zip from the latest release
  2. Extract vex.exe somewhere stable (e.g. C:\Users\<you>\bin\)
  3. Add that folder to PATH (System Properties → Environment Variables → edit Path → add the folder)
  4. Open a fresh terminal and run vex --version

To update, run vex self-update — it fetches the latest release, picks the right archive for your platform, and replaces the binary in-place. Same command works on macOS and Linux too.

Quick Start

# Index a project (structural only — fast)
vex index --path /path/to/project

# Index with semantic embeddings (slower first time, downloads 86 MB model)
vex index --path /path/to/project --semantic

# Search by symbol name
vex search "PaymentService"

# Search by meaning (requires --semantic index)
vex search "payment processing" --semantic

# Find all usages of a symbol
vex usages "IndexReader"

# File structure outline
vex outline src/main.rs

# Find implementations of a trait/interface
vex implementations "Iterator"

# Callgraph: who calls / is called by a function (fast path via persistent index)
vex callers "process_event"
vex callees "process_event"

# Multi-hop call graph (v1.7)
vex paths "main" "process_event"          # all caller chains from main → process_event
vex reachable "process_event"             # everything that transitively reaches it

# Symbol-level diff against a branch (v1.7)
vex diff --base main                      # what symbols did this branch change?

# Semantic similarity by existing symbol — explain what's actually similar (v1.7)
vex similar "PaymentService" --limit 5 --min-score 0.7 --explain

# Near-duplicate pairs with reasoning (v1.7)
vex duplicates --min-score 0.95 --min-body-lines 5 --explain

# Search with per-call scope + metadata filters (v1.7)
vex search "Repository" --include 'src/**' --exclude '**/*.gen.*' --visibility public --async-only

# Why did the search return these results? (v1.7)
vex search "Foo" --why 2>trace.json

# Fast existence check
vex check "Foo" "Bar" "Baz"

# Incremental update (re-parses only changed files, reuses unchanged from index)
vex update

# Watch mode (re-indexes on file changes)
vex watch

# Show index stats
vex status

# Shell completions
vex completions zsh > ~/.zfunc/_vex

Commands

| Command | Description |
|---|---|
| vex index [--path .] [--semantic] [--embedder ID] | Build full index. --semantic generates embeddings + HNSW + BM25. --embedder selects embedding model (default minilm-l6-v2). |
| vex search <query> [--semantic] [--no-bm25] [--limit N] [--kind def,fn,…] [--visibility V] [--async-only] [--why] | Hybrid search: structural + BM25 + semantic (when --semantic). 3-way RRF fusion. Multi-value --kind (canonical names + meta-selectors def/comment/test/ref). Metadata post-filters narrow by signature keywords. --why appends a JSON trace to stderr. |
| vex show <symbol> [--limit N] [--context N] [--kind fn] [--visibility V] [--async-only] | Extract symbol body from source (saves tokens vs full file read). Same metadata + kind filters as search. |
| vex similar <name> [--limit N] [--min-score T] [--explain] | Find symbols semantically close to an existing one (HNSW nearest neighbors). --explain adds identifier-Jaccard + truncated unified diff per match. --min-score is an alias for --threshold. |
| vex duplicates [--min-score T] [--min-body-lines N] [--explain] | List near-duplicate symbol pairs by embedding similarity. --explain shows what's actually different between the bodies. |
| vex usages <name> [--limit N] | Find all references/usages of a symbol (FST lookup). |
| vex pattern '<pat>' --lang <lang> [--why] | AST pattern matching with metavariables ($NAME, $_, $$$, plus the v6 named multi-line forms $$$BODY / $$ARGS). Repeated metavars enforce back-references. Space-flanked && / ` |
| vex outline <file> [--kind fn] | Show file structure, optionally filter by symbol kind. |
| vex implementations <name> | Find types that extend/implement a base class, trait, or interface (incl. generic-parameterised: class Foo : Repository<T>). |
| vex callers <name> | Direct callers of a function (fast path via persistent call graph; falls back to live tree-sitter scan when the index is missing). |
| vex callees <name> | Direct callees of a function (same fast path). |
| vex paths <from> <to> [--max-hops N] | NEW. Enumerate all caller chains from from to to over the persistent call graph. Bounded DFS with cycle prevention; default --max-hops 6. |
| vex reachable <target> [--max-hops N] [--limit N] | NEW. Transitive set of symbols whose callees reach target, with the BFS depth labelled per row. Blast-radius analysis. |
| vex diff --base <rev> [--limit N] | NEW. Symbol-level diff between an arbitrary git revision and the working tree: added / removed / moved-within-file / body-changed entries. git diff --no-renames semantics so a git mv surfaces both halves. |
| vex bundle --mode <symbol\|pr-impact\|project> [...] | NEW (v1.9, Phase 13.2). Unified multi-source bundle — replaces 4 round-trips (show → callers → callees → similar) with one. --mode symbol --symbol Foo returns body + callers + callees + semantic similar. --mode pr-impact --base origin/main returns changed symbols + transitive callers (depth=2 default) + tests. --mode project [--top-n 30] returns top-N by reverse call-graph indegree (experimental — see docs/MCP-SCHEMA.md#bundle-modes-v19 for the response shape and mode_hints per-mode keys). Always emits the v1 envelope { protocol_version, capabilities, _meta, results }. |
| vex check <name> [name...] | Fast existence check — which symbols exist in the index? |
| vex grep <pattern> [--filter path/] | Regex content search (no index needed). |
| vex update [--path .] [--semantic] [--embedder ID] | Incremental update — re-parse only changed files, reuse unchanged symbols from existing index. |
| vex watch [--path .] [--semantic] [--embedder ID] | Watch filesystem, auto re-index on changes. |
| vex status [--path .] | Show index stats: symbol count, size, embeddings, call graph, BM25. |
| vex completions <shell> | Generate shell completions (bash, zsh, fish). |
| vex init | Create a default .vex.toml config file in the project root. |
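
The vex paths and vex reachable rows describe a bounded DFS with cycle prevention and a depth-labelled BFS. A minimal Python sketch of both traversals is given here. The CALLS graph, the function names, and the hop-counting convention are illustrative assumptions, not vex's on-disk call-graph format or its actual implementation.

```python
from collections import deque

# Hypothetical call graph: caller -> list of direct callees.
CALLS = {
    "main": ["run", "init"],
    "run": ["process_event", "log"],
    "init": ["load_config"],
    "process_event": ["log"],
    "log": [],
    "load_config": [],
}

def paths(graph, src, dst, max_hops=6):
    """Enumerate simple call chains src -> dst using a bounded DFS."""
    found = []

    def dfs(node, chain):
        if node == dst:
            found.append(chain)
            return
        if len(chain) - 1 >= max_hops:  # hop budget exhausted
            return
        for callee in graph.get(node, []):
            if callee not in chain:  # cycle prevention
                dfs(callee, chain + [callee])

    dfs(src, [src])
    return found

def reachable(graph, target, max_hops=6):
    """Map each transitive caller of target to its BFS depth."""
    callers = {}
    for caller, callees in graph.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)
    depth = {}
    queue = deque([(target, 0)])
    while queue:
        node, d = queue.popleft()
        if d >= max_hops:
            continue
        for caller in callers.get(node, []):
            if caller not in depth and caller != target:
                depth[caller] = d + 1
                queue.append((caller, d + 1))
    return depth
```

With this toy graph, paths(CALLS, "main", "process_event") yields the single chain main → run → process_event, and reachable(CALLS, "log") labels run and process_event at depth 1 and main at depth 2.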

Per-query filters (every search-shaped command)

All search-shaped commands (search, usages, pattern, show, grep, implementations, callers, callees, paths, reachable, similar, duplicates, diff, bundle) accept:

  • --include <glob> / --exclude <glob> (repeatable, gitignore syntax) — per-call path scoping that doesn't require re-indexing. --exclude wins over --include. Example: vex search Foo --include 'src/**' --exclude '**/*.gen.*'.
  • --filter <substring> — older path-substring filter, still supported. Composes AND with the globs.
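
The precedence rules above (exclude beats include, and --filter composes with AND) can be sketched in a few lines of Python. This is an approximation only: it uses fnmatch-style globs from the standard library rather than true gitignore syntax, and path_in_scope is a hypothetical helper, not a vex API.

```python
from fnmatch import fnmatchcase

def path_in_scope(path, include=(), exclude=(), substring=None):
    """Return True when path survives the per-call scope filters."""
    if any(fnmatchcase(path, pat) for pat in exclude):
        return False  # --exclude wins over --include
    if include and not any(fnmatchcase(path, pat) for pat in include):
        return False  # include globs given, none matched
    if substring is not None and substring not in path:
        return False  # --filter composes with AND
    return True
```

For example, with include=["src/**"] and exclude=["**/*.gen.*"], the path src/db/repo.rs stays in scope while src/db/repo.gen.rs and tests/repo.rs are dropped.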

vex search / vex show additionally accept:

  • --visibility <public|private|protected|internal> — keep only symbols whose signature carries the explicit keyword. Defaults aren't inferred (bare Rust fn foo() does NOT match --visibility private).
  • --async-only / --no-async — keep or exclude async / Kotlin-suspend symbols.
  • --static-only, --sealed-only — restrict to static class members or sealed (or Java-final) types.

Reasoning flags

  • vex search --why prints a JSON trace to stderr (the result list stays on stdout): normalized_query, per-channel hit counts (FST / BM25 / semantic), fallbacks engaged (fuzzy), and the active filter snapshot.
  • vex pattern --why prints a JSON ScanTrace to stderr after the result list: mode (indexed / live_scan), root_kind_inferred, candidate_files / total_files, and fallback_reason when the indexed prefilter was skipped (no-index, no-skeleton-section, empty-section, grammar-drift, partial-section, index-open-error). MCP callers see the same JSON under _meta.why.
  • vex similar --explain / vex duplicates --explain add a jaccard overlap score plus a truncated unified diff between the two bodies, so you can decide whether two semantically-clustered symbols are actually duplicates before acting.
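
To make the --explain output more concrete, here is a rough Python sketch of an identifier-level Jaccard score plus a truncated unified diff. The identifier regex, the max_lines cutoff, and both function names are assumptions chosen for illustration; the README does not specify how vex tokenizes bodies or where it truncates.

```python
import difflib
import re

# Assumed identifier shape; vex's real tokenizer is not documented here.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def identifier_jaccard(body_a, body_b):
    """Jaccard overlap of the identifier sets of two symbol bodies."""
    a, b = set(IDENT.findall(body_a)), set(IDENT.findall(body_b))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def truncated_diff(body_a, body_b, max_lines=20):
    """Unified diff between two bodies, cut off after max_lines lines."""
    diff = difflib.unified_diff(
        body_a.splitlines(), body_b.splitlines(),
        fromfile="a", tofile="b", lineterm="",
    )
    return list(diff)[:max_lines]
```

A high embedding similarity combined with a low Jaccard score suggests the two bodies only resemble each other in purpose, while a high Jaccard score with a short diff points at a genuine copy.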

Configuration

Create a .vex.toml in your project root to customize vex behavior:

vex init  # generates .vex.toml with commented defaults

# .vex.toml

# Glob patterns to exclude from indexing (gitignore syntax, on top of .gitignore)
exclude = [
    "vendor/**",
    "node_modules/**",
    "*.generated.go",
]

# Default output format: "text", "json", or "compact"
format = "compact"

# Enable semantic embeddings by default
semantic = true

# Automatically update index before search if stale
# auto_update = false

CLI flags always override config values. Use --no-semantic to explicitly disable semantic mode when the config enables it.

Staleness Detection

Vex detects when the index is stale and warns before search:

$ vex search "Config"
Warning: index may be stale (HEAD changed). Run `vex update`.

How it works: on every search, vex compares the git HEAD stored at index time with the current HEAD (~0.1ms, single git rev-parse). If HEAD changed → stale. For non-git repos, falls back to mtime comparison.
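
The check described above can be sketched as follows. This is a simplified illustration, not vex's code: the function names are hypothetical, and the mtime fallback shown here (any source file newer than the index) is an assumed interpretation of "mtime comparison".

```python
import subprocess

def current_head(repo_root):
    """Return the current commit hash via `git rev-parse HEAD`."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_root, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def is_stale(indexed_head, head_now):
    """Git repos: the index is stale when HEAD moved since indexing."""
    return indexed_head != head_now

def is_stale_by_mtime(index_mtime, source_mtimes):
    """Non-git fallback (assumed): any source newer than the index."""
    return any(m > index_mtime for m in source_mtimes)
```

Note that a HEAD comparison alone does not notice uncommitted edits, which is one reason to run vex update (or enable auto-update) after modifying files.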

Auto-update: skip the warning and update inline:

# Per-command
vex search "Config" --auto-update

# Always (in .vex.toml)
auto_update = true

# Disable staleness check entirely
vex search "Config" --no-stale-check

Output Formats

# Human-readable (default)
vex search "Foo"

# JSON (for MCP/tool integration)
vex search "Foo" --format json

# Compact (token-efficient, optimized for LLM context)
vex search "Foo" --format compact

How Search Works

Structural Search (default)

Searches by symbol name using an inverted index with CamelCase splitting:

  • "PaymentService" — exact match
  • "Payment" — prefix match, finds PaymentService, PaymentGateway
  • "payment" — case-insensitive, also finds via CamelCase tokens
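
As an illustration of how CamelCase splitting makes the three match types above fall out of one lookup, here is a small Python sketch. The splitting regex and the dictionary-based index are assumptions for readability; vex stores its keys in an FST, so its prefix lookup does not scan every key the way this toy version does.

```python
import re
from collections import defaultdict

def camel_tokens(name):
    """Split 'PaymentService' or 'parse_file' into lowercase tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]

def build_index(symbols):
    """Map the case-folded full name and each token to its symbols."""
    index = defaultdict(set)
    for sym in symbols:
        index[sym.lower()].add(sym)
        for tok in camel_tokens(sym):
            index[tok].add(sym)
    return index

def search(index, query):
    """Case-insensitive prefix match over index keys (linear scan)."""
    q = query.lower()
    hits = set()
    for key, syms in index.items():
        if key.startswith(q):
            hits |= syms
    return sorted(hits)

index = build_index(["PaymentService", "PaymentGateway", "UserService"])
print(search(index, "Payment"))  # ['PaymentGateway', 'PaymentService']
print(search(index, "service"))  # ['PaymentService', 'UserService']
```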

Semantic Search (--semantic)

Embeds your query with MiniLM-L6-v2 (384-dim vectors) and finds symbols with similar meaning:

  • "parse source code files" finds parse_file, extract_refs, parse_file_symbols
  • "database storage" finds populate_db, create_10k_db, add_root_persists_to_db
  • "find implementations of an interface" finds find_implementations, test_interface_extends

BM25 Channel (auto-on when index has BM25 data)

A classic Okapi BM25 (K1=1.2, B=0.75) over symbol body tokens — identifiers, signatures, docstrings. Closes the gap between "exact name" (structural) and "general meaning" (semantic): finds rare body terms like timeout, retry, singlestore, idempotency_key that aren't part of any symbol name. Pass --no-bm25 to disable per-call.
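
For readers unfamiliar with BM25, the following Python sketch scores tokenized symbol bodies with the K1 and B values quoted above. It uses one common IDF variant (with a +1 inside the logarithm so scores stay non-negative); which variant vex uses is not stated in this README, and the function name and the assumption of a non-empty document list are illustrative.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # parameters quoted in the text above

def bm25_scores(query_terms, docs):
    """Score each tokenized doc against the query with Okapi BM25.

    Assumes docs is a non-empty list of non-empty token lists.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms (low document frequency) get a larger IDF.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + K1 * (1 - B + B * len(d) / avgdl)
            score += idf * tf[term] * (K1 + 1) / norm
        scores.append(score)
    return scores
```

The IDF term is what lets a rare body token such as retry outrank a common one such as fetch, even when both occur once in the same symbol.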

Hybrid Search (3-way RRF)

When the index has all three channels (built with --semantic), vex search fuses structural + BM25 + semantic using Reciprocal Rank Fusion. Symbols hit by ≥2 channels rank as Hybrid; symbols unique to one keep their original match type. Cuts both structural-noise and semantic-blur in the same query.
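
Reciprocal Rank Fusion itself is short enough to sketch. In the Python below, each channel contributes 1 / (k + rank) per symbol and symbols seen by two or more channels are labelled hybrid. The constant k = 60 is the value commonly used in the literature and is an assumption here, as are the function name and the channel labels; this README does not state the constant vex uses.

```python
def rrf_fuse(channels, k=60):
    """Fuse ranked lists. channels: {name: [symbol, ...] best-first}."""
    scores, seen_in = {}, {}
    for name, ranking in channels.items():
        for rank, sym in enumerate(ranking, start=1):
            scores[sym] = scores.get(sym, 0.0) + 1.0 / (k + rank)
            seen_in.setdefault(sym, []).append(name)
    # Highest fused score first; ties broken by symbol name.
    fused = sorted(scores, key=lambda s: (-scores[s], s))
    return [
        (sym, "hybrid" if len(seen_in[sym]) >= 2 else seen_in[sym][0])
        for sym in fused
    ]

results = rrf_fuse({
    "structural": ["retry_request"],
    "bm25": ["retry_request", "fetch_with_timeout"],
    "semantic": ["fetch_with_timeout", "backoff"],
})
```

Because only ranks are used, the three channels never need comparable raw scores, which is what makes it practical to mix an FST lookup, BM25, and cosine similarity in one list.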

Usages (FST)

References stored in an FST (Finite State Transducer) — zero-copy lookup from mmap with prefix search support.

Type-aware refs (--strict)

vex usages --strict <name> reads the v5 reference_edges section
written by an LSP-style scope binder. For the languages with a
binder (Rust, TypeScript, Python, C#, C++) every ref is resolved at
index time against an in-file scope chain plus an import/use graph,
then serialised against the global symbol the user actually meant —
not just any line that mentions the spelling.

What this changes for the user:

  • Identifiers inside comments, doc-strings, string literals, and
    regex bodies are dropped (this filter is on for everyone, not just
    --strict).
  • A name shadowed by a let / const / fn param resolves to the
    inner scope, not the outer.
  • A use ext::Foo; / import { Foo } from './ext' / from ext import Foo makes a ref to Foo resolve cross-file to whatever defines it
    in the index.
  • A name imported but never defined in the index stays Unresolved
    and produces no edge — better than a coincidental match.

Without --strict vex usages still works for every supported
language via the legacy refs FST; --strict simply trades recall
breadth for precision on the five binder languages. v3 / v4 indexes
predating the binder bail with a "re-run vex index" message.

Structural Patterns (vex pattern)

Match code by shape rather than text. Live-scan today for every
language vex parses; indexed prefilter (via the v6 pattern_skeletons
section) for Rust, TypeScript, and Python.

Syntax:

  • $NAME — capture a single identifier or balanced expression. Same
    name appearing twice enforces a back-reference: record($X, $X)
    matches record(state, state) and rejects record(state, other).
  • $_ — wildcard (matches without capturing).
  • $$$ — anonymous ellipsis (matches anything up to the next literal;
    spans newlines).
  • $$$BODY / $$ARGS: named multi-line ellipsis. Functionally
    identical to $$$ but captures the consumed text under the given
    name; $$$BODY reads naturally for block bodies, $$ARGS for
    parameter lists. Back-reference equality also applies.
  • && (space-flanked) — AND composition. Both sub-patterns must
    match in the same file, and shared metavar names must capture the
    same text in both: struct $S && impl $S matches files that have
    both shapes for the same $S.
  • || (space-flanked) — OR composition (union, deduped by
    (path, line)). && binds tighter than ||.
  • Composition operators only fire at bracket / quote depth 0, so
    record($X, $X) and f($X && $Y) stay single patterns.
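
The back-reference rule for repeated metavariables can be demonstrated with a deliberately simplified Python sketch. It translates a pattern into a regular expression over plain text, which is not how vex works (vex matches against tree-sitter ASTs), and it only handles $NAME and $_ where each metavar stands for a single word. The ellipsis forms and the && / || operators are out of scope, and compile_pattern is a hypothetical name.

```python
import re

def compile_pattern(pattern):
    """Translate $NAME / $_ metavars into a regex with back-references."""
    out, seen = [], set()
    pos = 0
    for m in re.finditer(r"\$([A-Z_][A-Z0-9_]*)", pattern):
        out.append(re.escape(pattern[pos:m.start()]))  # literal text
        name = m.group(1)
        if name == "_":
            out.append(r"\w+")               # wildcard, no capture
        elif name in seen:
            out.append(f"(?P={name})")       # must equal first capture
        else:
            seen.add(name)
            out.append(f"(?P<{name}>\\w+)")  # first occurrence captures
        pos = m.end()
    out.append(re.escape(pattern[pos:]))
    return re.compile("".join(out))

pat = compile_pattern("record($X, $X)")
print(bool(pat.fullmatch("record(state, state)")))  # True
print(bool(pat.fullmatch("record(state, other)")))  # False
```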

Indexed prefilter: when a v6 index is present, the leading literal
keyword of the pattern (fn, struct, class, def, impl, …) is
mapped to a tree-sitter node kind, and vex pattern walks only the
files whose persisted skeletons contain that kind. Visibility / async
/ export modifiers in front of the keyword are stripped before the
match (pub async fn $F infers function_item correctly). Falls
back to live-scan on grammar drift, missing section, or a partial
section after vex update; --why reports the exact reason.

Examples:

# Multi-line function body with named captures
vex pattern 'fn $NAME($$ARGS) -> Result<$T, $E> { $$$BODY }' --lang rust

# Both struct and impl for the same type in one file
vex pattern 'struct $S && impl $S' --lang rust

# Interface OR class with the same name
vex pattern 'interface $N || class $N' --lang typescript

# See which mode and what narrowing happened
vex pattern 'fn $N($$$)' --lang rust --why 2>trace.json

Benchmarks

Compared against ast-index v3.31.0 (SQLite + FTS5) and ripgrep 14.x.

Indexing

| Project | vex | ast-index | Speedup | vex size | ast-index size |
|---|---|---|---|---|---|
| Small (2K lines Rust) | 16 ms | 48 ms | 3.0x | 43 KB | 490 KB |
| Medium (31K lines Rust) | 37 ms | 112 ms | 3.0x | 314 KB | 3.4 MB |
| Large (1247 Python files) | 183 ms | 633 ms | 3.5x | 1.8 MB | 15.9 MB |

Index size: roughly 9-11x smaller than ast-index across the three projects above (mmap binary + FST vs SQLite + FTS5).

Note: projects with --semantic indexing are slower due to ONNX embedding generation.

Search: vex vs ast-index vs ripgrep

Medium project (31K lines Rust, avg 10 runs)

| Query | vex | ast-index | rg -w | vex vs rg |
|---|---|---|---|---|
| Query A | 4.9 ms | 9.5 ms | 54.2 ms | 11x |
| Query B | 4.6 ms | 9.5 ms | 8.9 ms | 1.9x |
| Query C | 4.5 ms | 9.2 ms | 8.6 ms | 1.9x |
| Query D | 5.0 ms | 12.1 ms | 9.3 ms | 1.9x |

Large project (20K symbols, Python/JS/SQL, avg 10 runs)

| Query | vex | ast-index | rg -w | vex vs rg | Results (def/text) |
|---|---|---|---|---|---|
| Symbol 1 | 6.0 ms | 59.7 ms | 84.6 ms | 14x | 1 / 4 |
| Symbol 2 | 3.7 ms | 44.5 ms | 78.5 ms | 21x | 2 / 5 |
| Symbol 3 | 3.9 ms | 22.7 ms | 76.7 ms | 20x | 1 / 20 |
| Symbol 4 | 3.8 ms | 43.1 ms | 77.5 ms | 21x | 1 / 2 |
| Symbol 5 | 3.6 ms | 33.7 ms | 77.3 ms | 21x | 1 / 22 |
| Symbol 6 | 3.8 ms | 43.3 ms | 76.9 ms | 20x | 1 / 8 |
| Symbol 7 | 4.0 ms | 42.5 ms | 74.9 ms | 19x | 1 / 6 |
| Symbol 8 | 3.7 ms | 42.8 ms | 78.4 ms | 21x | 1 / 2 |

Key takeaway: vex search is constant ~4 ms (FST O(query_len)), regardless of project size — but this assumes a pre-built index. The comparison with ripgrep is not apples-to-apples: rg scans raw text with no indexing, while vex looks up a pre-built index. The real advantage is amortized: vex returns only symbol definitions (precise, token-efficient), while rg returns all text occurrences (noisy, expensive in LLM contexts).
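
The amortization argument can be made concrete with figures from the tables above. The Python sketch below computes how many searches it takes before the one-time indexing cost is recovered on latency alone. It assumes that the "Large" indexing row and the large-project search table describe the same project, picks one representative query per project, and ignores the difference in what the two tools return; break_even_queries is a hypothetical helper.

```python
import math

def break_even_queries(index_ms, scan_ms, lookup_ms):
    """Queries needed before one-time indexing beats repeated scans."""
    saving = scan_ms - lookup_ms
    if saving <= 0:
        return None  # indexing never pays off on latency alone
    return math.ceil(index_ms / saving)

# Medium project: 37 ms to index, Query B (rg 8.9 ms vs vex 4.6 ms).
medium = break_even_queries(index_ms=37, scan_ms=8.9, lookup_ms=4.6)

# Large project: 183 ms to index, Symbol 4 (rg 77.5 ms vs vex 3.8 ms).
large = break_even_queries(index_ms=183, scan_ms=77.5, lookup_ms=3.8)

print(medium, large)  # 9 3
```

Under these assumptions indexing pays for itself after about nine searches on the medium project and three on the large one, before counting any token savings.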

Pattern Matching (vex only)

| Pattern | Time | Matches |
|---|---|---|
| fn $NAME($$$) -> Result | 31 ms | 50 |
| pub struct $NAME | 32 ms | 45 |
| fn $NAME($$$) | 31 ms | 50 |

ast-index and ripgrep do not support AST pattern matching.

Semantic Search

Queries where structural search returns 0 results but semantic finds relevant symbols:

| Query | Structural | Semantic |
|---|---|---|
| "parse source code files" | 0 | 19 |
| "database storage" | 0 | 20 |
| "find implementations of an interface" | 0 | 20 |
| "file system directory walker" | 0 | 20 |
| "handle errors and exceptions" | 0 | 20 |

HNSW vs Brute-Force (semantic vector search)

Semantic search embeds the query via ONNX (~55ms) then searches stored vectors. HNSW (usearch) replaces brute-force O(N) scan with O(log N) approximate nearest neighbor search:

| Symbols | Brute-force | HNSW | Speedup |
|---|---|---|---|
| 333 | ~3 ms | ~3 ms | 1x |
| 11K | ~8 ms | ~3 ms | 2.3x |
| 20K | ~11 ms | ~3 ms | 4x |
| 100K (projected) | ~55 ms | ~3 ms | ~18x |

HNSW stays at roughly 3 ms across the index sizes measured, while brute-force grows linearly with the number of symbols. Total semantic search latency is dominated by ONNX embedding (~55ms), so end-to-end speedup is modest for small codebases but critical at scale.
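
For comparison, the brute-force baseline is simply a cosine-similarity scan over every stored vector. The pure-Python sketch below shows why its cost grows linearly with the symbol count. The toy 2-dimensional vectors, symbol names, and function names are illustrative assumptions; real embeddings here are 384-dimensional and vex delegates the approximate search to usearch.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def brute_force_top_k(query, vectors, k=3):
    """vectors: {symbol: embedding}. Touches every vector: O(N)."""
    scored = ((cosine(query, v), name) for name, v in vectors.items())
    return heapq.nlargest(k, scored)

vectors = {
    "parse_file": [1.0, 0.0],
    "populate_db": [0.0, 1.0],
    "extract_refs": [0.9, 0.1],
}
top = brute_force_top_k([1.0, 0.0], vectors, k=2)
print([name for _, name in top])  # ['parse_file', 'extract_refs']
```

An HNSW index avoids this full scan by walking a layered proximity graph, trading exactness for a lookup time that barely changes as the index grows.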

| Mode | Latency |
|---|---|
| Structural only | ~4 ms |
| Hybrid (structural + semantic) | ~58 ms (HNSW) / ~66 ms (brute-force) |

LLM Token Efficiency

When an AI agent searches code, the output goes directly into the context window. Grep-based tools return every text occurrence — including comments, strings, variable usage, and matches in minified files — consuming tokens without adding signal.

vex returns only symbol definitions in a compact one-line format, drastically reducing token consumption:

| | vex compact | rg (grep) | Reduction |
|---|---|---|---|
| 7 symbol lookups (typical) | ~220 tokens | ~1,300 tokens | 6x |
| Queries hitting minified JS/CSS | ~270 tokens | ~58,700 tokens | 217x |

Example — searching for a class name on a large project:

# rg: 20 matches across imports, usage sites, comments, tests (2,045 chars)
$ rg -w "PreAggregatedConfig" .
./models.py:3602:class PreAggregatedConfig(models.Model):
./models.py:3610:    pre_aggregated_config = PreAggregatedConfig.objects.get(...)
./serializers.py:48:from .models import PreAggregatedConfig
./tests.py:12:    config = PreAggregatedConfig(...)
... (16 more lines)

# vex: 1 definition (93 chars)
$ vex search "PreAggregatedConfig" --format compact
C PreAggregatedConfig models.py:3602 class PreAggregatedConfig(models.Model):

For an agent making 10-20 code lookups per task, vex saves 5,000-20,000 tokens per session compared to grep — reducing cost and leaving more context window for reasoning.

Supported Languages

19 languages indexed via tree-sitter. The capability columns:

  • Binder — does vex usages --strict resolve refs through an
    LSP-style scope chain (Phase 11.1)? cross-file includes
    use / import resolution; in-file resolves within a file but
    treats imports as unresolved. The remaining languages fall back to
    the line-based scanner used by plain vex usages.
  • Patterns — does vex pattern get the v6 indexed prefilter
    (Phase 11.4)? indexed means a persisted skeleton section narrows
    candidate files at query time; live-scan means tree-sitter walks
    every lang-matching file on each query. All 19 languages work with
    vex pattern syntax ($NAME, $$$BODY, && / ||); the
    prefilter just speeds up discovery for the three T1 languages.

| Language | Extensions | Symbols | Imports | Binder | Patterns |
|---|---|---|---|---|---|
| Rust | .rs | functions, structs, enums, traits, impls, types, constants | use declarations | cross-file | indexed |
| TypeScript/JS | .ts, .tsx, .js, .jsx | classes, interfaces, enums, functions, arrows, type aliases | import | cross-file | indexed |
| Python | .py | classes, functions (incl. async, decorated) | import, from..import | cross-file | indexed |
| C# | .cs | classes, interfaces, structs, enums, methods, properties | | in-file | live-scan |
| C/C++ | .cpp, .cc, .cxx, .hpp, .hxx, .h | classes, structs, functions, methods, templates, enums | #include | in-file | live-scan |
| Go | .go | functions, methods, structs, interfaces | import | | live-scan |
| Java | .java | classes, interfaces, enums, methods, constructors | import | | live-scan |
| Kotlin | .kt, .kts | classes, interfaces, objects, functions, properties | import | | live-scan |
| Ruby | .rb | classes, modules, methods | | | live-scan |
| Swift | .swift | classes, structs, enums, actors, protocols, functions | import | | live-scan |
| PHP | .php, .phtml | classes, interfaces, traits, methods, functions | use, require | | live-scan |
| SQL | .sql | tables, views, functions, triggers, indexes, schemas, types, sequences | ALTER TABLE refs | | live-scan |
| Markdown | .md, .markdown | headings (section structure) | | | live-scan |
| Bash | .sh, .bash | functions | | | live-scan |
| Lua | .lua | functions, local functions, tables | require | | live-scan |
| CSS | .css | rules, selectors, @keyframes | | | live-scan |
| HTML | .html, .htm | custom elements (hyphenated tag names) | | | live-scan |
| YAML | .yaml, .yml | top-level keys | | | live-scan |
| TOML | .toml | bare keys, dotted keys, tables | | | live-scan |

See docs/SUPPORTED_LANGUAGES.md for grammar
versions, ABI level, and the runbook for adding a language or upgrading a
grammar. Adding a language to the indexed-Patterns tier is one
allowlist edit in src/pattern/skeleton.rs — see the Phase 11.4
follow-up notes for the planned Go → Java → Kotlin → C# → C++ → Swift
→ PHP → Ruby promotion order.

Index Location

macOS:   ~/Library/Caches/vex/<hash>/index.vex
Linux:   $XDG_CACHE_HOME/vex/<hash>/index.vex

Each project gets its own index based on a hash of the project root path.

Known limitations

vex is a static-analysis tool — some real call sites and references are
invisible by construction. The headline gaps:

  • vex callers is function-scoped. Module-level expressions
    (app = create_app() at the top of a file) and decorator-based
    dispatch (@router.get("/foo")) do NOT register as callers.
  • vex usages quality depends on language. Rust / TypeScript /
    Python / C# / C++ get --strict (binder-resolved refs from the
    v5 reference_edges section, Phase 11.1). Other languages use a
    line-based identifier scan with a higher false-positive rate.
  • Dynamic dispatch is invisible. String-resolved factories
    (uvicorn.run("main:app")), task queues (celery_task.delay()),
    reflection (getattr(obj, name)()) — none of these produce edges.
  • Workaround: vex grep '\bname\b' is the exhaustive textual
    fallback. Slower (~50 ms) but never misses a hit.

See docs/LIMITATIONS.md for the full coverage
matrix, repros, and recommendations per query type.

Troubleshooting

Surfacing internal warnings

Vex emits structured logs via the tracing crate at parse/store
boundaries — failed grammar loads, mmap reopens, manifest mismatches,
and so on. By default RUST_LOG is unset, so only the most critical
diagnostics make it to stderr.

When a search returns surprising results or an index command behaves
oddly, raise the log level:

RUST_LOG=vex=warn vex search Foo
RUST_LOG=vex=info vex index   # noisier — file-level progress

For what the search engine actually did (per-channel hit counts,
fuzzy fallback engagement, applied filters), use the structured trace
instead:

vex search Foo --why 2>trace.json   # trace lands on stderr as JSON

See docs/MCP-SCHEMA.md for the --why /
why: true JSON shape.

Integration

Claude Code (CLI Integration)

The recommended way to integrate vex with Claude Code is via CLAUDE.md rules (see below). Vex runs as a CLI tool — Claude Code calls it directly via Bash, no MCP server needed.

Setup:

# Install vex
brew tap tenatarika/tap && brew install vex

# In your project
cd /path/to/project
vex init              # create .vex.toml
vex index             # build index (add --semantic for meaning-based search)

Then add .vex.toml config for auto-update so Claude always searches a fresh index:

# .vex.toml
auto_update = true
format = "compact"

Claude Code (MCP Server)

Alternatively, vex includes an MCP server (vex-mcp) that exposes all commands as MCP tools:

# Build MCP server
cargo build --release -p vex-mcp

# Add to Claude Code MCP config (~/.claude/claude_desktop_config.json)
{
  "mcpServers": {
    "vex": {
      "command": "/path/to/vex-mcp",
      "env": {
        "VEX_ROOT": "/path/to/your/project"
      }
    }
  }
}

MCP Tools (20):

  • search — 3-way hybrid (structural + BM25 + semantic); accepts --why trace, metadata filters
  • find_symbol — exact name lookup
  • find_similar — semantic search by free-form description
  • similar — nearest neighbors of an existing symbol (explain adds Jaccard + diff)
  • duplicates — near-duplicate symbol pairs (explain shows what differs)
  • show — extract symbol body from source
  • outline — file structure
  • usages — find all references to a symbol
  • grep — regex content search
  • pattern — AST pattern matching with metavar back-references
  • implementations — find types extending a base class/trait/interface (incl. generics)
  • callers / callees — direct callgraph navigation (fast path via persistent index)
  • paths — enumerate caller chains between two functions
  • reachable — transitive callers of a target
  • diff — symbol-level diff between a git revision and the working tree
  • check — fast symbol existence check
  • index / update — build/rebuild index
  • status — index statistics

The schemas follow a canonical vocabulary (query / symbol / symbols / path / pattern / filter / include / exclude); pre-v1.7 aliases (name, file, names, etc.) still work and emit _meta.deprecated_args: [...] in the JSON-RPC response. See docs/MCP-SCHEMA.md.

Shell Integration

# Shell completions (tab-completion for commands and flags)
vex completions bash > ~/.bash_completion.d/vex   # Bash
vex completions zsh > ~/.zfunc/_vex               # Zsh (add ~/.zfunc to fpath)
vex completions fish > ~/.config/fish/completions/vex.fish  # Fish

# Aliases — add to .zshrc / .bashrc
alias vx="vex search"
alias vxu="vex usages"
alias vxi="vex index --path ."
alias vxs="vex index --path . --semantic"
alias vxw="vex watch"

CLAUDE.md Integration

Add this to your project's CLAUDE.md to make Claude Code use vex instead of grep:

## Code Search

Before first use in a project, run `vex init` to generate `.vex.toml`, then `vex index` to build the index.
Set `auto_update = true` in `.vex.toml` so the index stays fresh automatically.

Use vex for code search instead of grep or manual file reading:

- `vex search "SymbolName"` — find symbol definitions (~4ms)
- `vex show "SymbolName"` — extract symbol body (use INSTEAD of Read for specific symbols)
- `vex show "A" "B" "C"` — extract multiple symbols at once
- `vex usages "SymbolName"` — find all references
- `vex grep "pattern"` — regex content search (when you need text, not symbols)
- `vex search "description" --semantic` — search by meaning
- `vex search "rare_term"` — BM25 channel finds rare terms in symbol bodies (auto-on when index has BM25 data)
- `vex pattern 'class $NAME(BaseModel):' --lang python` — AST pattern matching with metavariables
- `vex pattern 'fn $N($$ARGS) -> Result<$T, $E> { $$$BODY }' --lang rust` — multi-line `$$$BODY` / `$$ARGS` capture
- `vex pattern 'struct $S && impl $S' --lang rust` — AND composition (back-ref `$S` must agree across both shapes)
- `vex pattern 'interface $N || class $N' --lang typescript` — OR composition (union, deduped by `(path, line)`)
- `vex pattern '<pat>' --lang <lang> --why` — emit ScanTrace on stderr (mode / candidate vs total / fallback reason)
- `vex outline path/to/file.py` — file structure overview
- `vex implementations "BaseService"` — find types extending a class/interface
- `vex callers "function_name"` — find all callers (~4ms via persistent call graph)
- `vex callees "function_name"` — find all callees (~4ms via persistent call graph)
- `vex similar "SymbolName"` — semantically close symbols (requires --semantic index)
- `vex duplicates --threshold 0.95` — near-duplicate symbol pairs
- `vex check "A" "B" "C"` — fast symbol existence check

All commands support `--filter "path/"` to narrow results to a directory.

### Rules
- **Always prefer `vex show` over `Read`** when you need a specific function or class
- **Always prefer `vex search` over `Grep`** when looking for symbol definitions
- **Use `vex grep` instead of `Grep`** for searching inside string literals, comments, or config values
- **Use `--format compact`** for token-efficient output in automated workflows
- **Use `--kind fn`** to boost results matching a specific symbol kind (fn, struct, trait, class, etc.)
- **Use `--context-path`** with the path of the file you are currently editing to boost nearby results
- **Run `vex update` after modifying source files** if `auto_update` is not enabled in `.vex.toml`
- **Use `vex pattern ... --why`** to debug match counts — the trace tells you whether the indexed prefilter ran or fell back to live-scan, and why
- **Indexed pattern prefilter requires a full `vex index`** — after `vex update` the section is partial and `vex pattern` automatically degrades to live-scan (reason `partial-section` in `--why`)

### Indexing
- `vex index` — full structural index + pattern skeleton section (v6)
- `vex index --semantic` — with embeddings (slower, enables semantic search)
- `vex update` — incremental update (only changed files)
- `vex index --no-pattern-index` — skip the v6 pattern skeleton section if you don't use `vex pattern` (sticky across `vex update`)

Testing

Unit & Integration Tests

cargo test                    # 1172 tests — unit, integration, property-based, adversarial
cargo clippy -- -D warnings   # zero warnings policy

Test coverage includes:

  • Per-language grammar regression (NEW): tests/<lang>_query_test.rs for all 19 supported languages — catches ABI mismatches and AST node renames when a tree-sitter grammar crate is upgraded
  • Binary format: roundtrip, corrupted/truncated/wrong-version rejection, out-of-bounds access, string pool dedup, empty index
  • Adversarial format: 20 crafted index tests — overflow offsets, bad magic/version, alignment attacks, truncated records
  • Vectors: write/read roundtrip for 384-dim f32 embeddings
  • FST: refs FST roundtrip, prefix search, symbol FST exact/prefix/fuzzy search
  • Search: structural, fuzzy (Levenshtein), RRF fusion, reranking with kind/path/proximity boosts
  • Reranking stress: NaN/Infinity/zero scores, 10K results, edge context paths
  • Property-based (proptest): rerank preserves length, sorted output, no NaN/negative scores, fusion commutativity
  • Incremental update: unchanged reuse, deleted removal, file rename, symbol move between files, empty file
  • Concurrency: parallel index/update (lock serialization), concurrent readers, read during reindex
  • Multi-language: Rust, Python, Go, Kotlin, TypeScript, C++, cross-language same-name, wrong extension, 1K-symbol file, deep nesting, error recovery
  • Unicode: BOM, mixed CRLF, unicode identifiers, null bytes, empty/whitespace files
  • Path edges: spaces in paths, deep nesting (20 levels), symlinks, absolute vs relative, Windows backslashes
  • Callgraph: callers/callees for Rust, Python, Go, TypeScript, Java
  • Persistent call graph (v1.5): format v4 roundtrip, callers/callees FST lookup, dedup, same-name-across-files isolation, same-name-within-file disambiguation, incremental update preserves edges, fallback to live scan for v3
  • Similar/duplicates (v1.5): self-exclusion, threshold filtering, canonical pair dedup, body-length filter, empty-index handling
  • Pluggable embedder (v1.5): registry lookup, mismatch detection (incl. back-compat for pre-9.1 manifests), config + CLI priority, writer variable vector_dim
  • BM25 channel (v1.5): writer/reader roundtrip, pipeline emission, IDF discrimination, short-doc preference, 3-way RRF with Hybrid labeling, MatchType tagging, unicode tokens
  • Staleness: git HEAD comparison, dirty file detection, mtime fallback
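
The "IDF discrimination" and "short-doc preference" properties in the BM25 bullet above can be illustrated with the textbook Okapi BM25 term score. This is a minimal sketch, not Vex's implementation: the function name, the parameter struct, and the defaults `k1 = 1.2`, `b = 0.75` are assumptions chosen for illustration.

```rust
/// Tuning parameters for Okapi BM25. The defaults below are common
/// textbook values, assumed here; Vex's actual values are not documented
/// in this section.
#[derive(Debug, Clone, Copy)]
pub struct Bm25Params {
    pub k1: f64, // term-frequency saturation
    pub b: f64,  // strength of document-length normalization
}

impl Default for Bm25Params {
    fn default() -> Self {
        Bm25Params { k1: 1.2, b: 0.75 }
    }
}

/// Score contribution of a single query term for a single document.
///
/// * `tf`      - occurrences of the term in the document
/// * `doc_len` - document length in tokens
/// * `avg_len` - average document length across the corpus
/// * `n_docs`  - number of documents in the corpus
/// * `df`      - number of documents containing the term
pub fn bm25_term_score(
    tf: f64,
    doc_len: f64,
    avg_len: f64,
    n_docs: f64,
    df: f64,
    p: Bm25Params,
) -> f64 {
    // Rare terms (small df) get a large IDF; the +1 keeps it non-negative.
    let idf = ((n_docs - df + 0.5) / (df + 0.5) + 1.0).ln();
    // Longer-than-average documents are penalized through `b`.
    let length_norm = 1.0 - p.b + p.b * doc_len / avg_len;
    idf * tf * (p.k1 + 1.0) / (tf + p.k1 * length_norm)
}
```

With equal term frequency, a term found in 1 of 100 documents outscores one found in 90 of 100, and a 10-token body outscores a 100-token body. Those are the two behaviors the tests above name.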

Fuzz Testing

Fuzz tests exercise the binary format reader with arbitrary/corrupted data using cargo-fuzz (libFuzzer + AddressSanitizer):

# Install (once)
cargo install cargo-fuzz

# Generate seed corpus from local vex cache
bash fuzz/generate_seeds.sh

# Run (requires nightly)
RUSTUP_TOOLCHAIN=nightly cargo fuzz run fuzz_index_reader -- -max_total_time=120
RUSTUP_TOOLCHAIN=nightly cargo fuzz run fuzz_refs_fst -- -max_total_time=60
RUSTUP_TOOLCHAIN=nightly cargo fuzz run fuzz_symbol_fst -- -max_total_time=60

Three fuzz targets cover all unsafe code paths in the reader:

Target              What it fuzzes                  Unsafe paths exercised
fuzz_index_reader   Arbitrary bytes as .vex file    header(), symbol(), vector(), read_string(), file_paths()
fuzz_refs_fst       Arbitrary FST + posting bytes   RefReader::find(), find_by_prefix()
fuzz_symbol_fst     Arbitrary FST + posting bytes   SymbolFstReader::find(), find_fuzzy(), search_with_fallback()

Fuzzing uncovered 3 bugs, all now fixed: an out-of-bounds read on a crafted symbol_count, a misaligned pointer dereference on an odd symbols_offset, and unchecked section offsets exceeding the file size.
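
The three bug classes named here (a crafted symbol_count, an odd symbols_offset, section offsets past the end of the file) all come down to validating header fields before any unsafe read. The sketch below shows one way to do that. The header layout, magic bytes, record size, and every identifier in it are hypothetical; the actual .vex format is not specified in this section.

```rust
// Hypothetical layout: magic[4] | version u32 | symbol_count u64 | symbols_offset u64
const MAGIC: &[u8] = b"VEX0"; // assumed magic bytes, for illustration only
const HEADER_LEN: usize = 24;
const RECORD_SIZE: u64 = 32; // assumed fixed size of one symbol record
const RECORD_ALIGN: u64 = 8; // assumed alignment required by the record type

#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    Truncated,
    BadMagic,
    Misaligned,
    OutOfBounds,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Header {
    pub version: u32,
    pub symbol_count: u64,
    pub symbols_offset: u64,
}

fn read_u32(buf: &[u8], at: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[at..at + 4]);
    u32::from_le_bytes(b)
}

fn read_u64(buf: &[u8], at: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[at..at + 8]);
    u64::from_le_bytes(b)
}

/// Validate the header against the buffer before any record is touched.
pub fn parse_header(buf: &[u8]) -> Result<Header, HeaderError> {
    if buf.len() < HEADER_LEN {
        return Err(HeaderError::Truncated);
    }
    if !buf.starts_with(MAGIC) {
        return Err(HeaderError::BadMagic);
    }
    let version = read_u32(buf, 4);
    let symbol_count = read_u64(buf, 8);
    let symbols_offset = read_u64(buf, 16);

    // An odd offset would make a typed pointer into the mapping misaligned
    // (this assumes the mapping itself starts on an aligned address).
    if symbols_offset % RECORD_ALIGN != 0 {
        return Err(HeaderError::Misaligned);
    }
    // Checked arithmetic: a crafted symbol_count must not wrap around.
    let table_len = symbol_count
        .checked_mul(RECORD_SIZE)
        .ok_or(HeaderError::OutOfBounds)?;
    let table_end = symbols_offset
        .checked_add(table_len)
        .ok_or(HeaderError::OutOfBounds)?;
    if table_end > buf.len() as u64 {
        return Err(HeaderError::OutOfBounds);
    }
    Ok(Header { version, symbol_count, symbols_offset })
}
```

The checked multiplication and addition matter most: with wrapping arithmetic, a huge symbol_count can produce a small table_end that passes the final bounds check.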

Architecture

CLI (clap) → Pipeline (rayon) → Tree-sitter → Binary Format v4 (mmap)
                                      ↓
                               Embedder trait (fastembed/MiniLM)
                                      ↓
                               HNSW Index (usearch)
                                      ↓
Search:    Symbol FST (structural) + BM25 (body) + HNSW (semantic) → 3-way RRF
Callers/Callees: Callers FST + Callees FST (persistent edges) → ~4ms
Usages:    Refs FST + Posting Lists → zero-copy refs lookup
Show:      Tree-sitter node boundaries → symbol body extraction
Similar:   HNSW nearest neighbors over stored embeddings
  • No SQLite — custom binary format v4 with zero-copy mmap reads (v3 still readable)
  • Symbol FST — persistent inverted index, O(query_len) lookup
  • Refs FST — symbol references in Finite State Transducer, prefix search
  • Persistent call graph — CallEdge records + callers/callees FSTs built at index time, ~4ms lookup vs seconds of live tree-sitter scan
  • BM25 channel — Okapi BM25 over body identifiers, auto-on when section present
  • HNSW — approximate nearest neighbor via usearch, O(log N) semantic search
  • Pluggable embedder — Embedder trait + registry, identity recorded in manifest with mismatch detection at search
  • Parallel parsing — rayon with 500-file chunks
  • Incremental updates — content hashing via xxh3, only re-parse changed files (unchanged symbols + call edges reconstructed from existing index)
  • Watch mode — notify crate with 500ms debouncing
  • 3-way RRF fusion — merges structural + BM25 + semantic ranked lists, marks cross-channel hits as Hybrid
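
The 3-way RRF fusion described above can be sketched as follows. This is an illustrative version under stated assumptions, not Vex's code: the `rrf_fuse` function, the variant names of the `MatchType` enum, and the smoothing constant `k` are hypothetical (60 is a commonly used value for `k`). Each channel contributes `1 / (k + rank)` per hit, with ranks starting at 1, and a symbol seen in more than one channel is relabeled `Hybrid`.

```rust
use std::collections::HashMap;

/// Channel that produced a hit. `Hybrid` marks a hit seen in several channels.
/// Variant names are assumed for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MatchType {
    Structural,
    Bm25,
    Semantic,
    Hybrid,
}

/// Fuse ranked lists of symbol ids (best first) with Reciprocal Rank Fusion.
/// Returns (id, fused score, label), sorted by descending score.
pub fn rrf_fuse(
    channels: &[(MatchType, Vec<&str>)],
    k: f64,
) -> Vec<(String, f64, MatchType)> {
    let mut acc: HashMap<&str, (f64, MatchType)> = HashMap::new();
    for (kind, ranked) in channels {
        for (rank, id) in ranked.iter().enumerate() {
            // `enumerate` is 0-based, RRF ranks are 1-based.
            let contribution = 1.0 / (k + rank as f64 + 1.0);
            acc.entry(*id)
                .and_modify(|(score, seen)| {
                    *score += contribution;
                    if *seen != *kind {
                        *seen = MatchType::Hybrid;
                    }
                })
                .or_insert((contribution, *kind));
        }
    }
    let mut fused: Vec<(String, f64, MatchType)> = acc
        .into_iter()
        .map(|(id, (score, label))| (id.to_string(), score, label))
        .collect();
    // Descending score; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Because RRF uses only ranks, the raw FST, BM25, and cosine scores never need to be normalized against each other. That is the usual reason to prefer it over weighted score sums.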

License

MIT
