🦖 VelociRAG

Lightning-fast RAG for AI agents.

Four-layer retrieval fusion powered by ONNX Runtime. No PyTorch. Sub-200ms warm search. Incremental graph updates. MCP-ready.

Most RAG solutions either drag in 2GB+ of PyTorch or limit you to single-layer vector search. VelociRAG gives you four retrieval methods — vector similarity, BM25 keyword matching, knowledge graph traversal, and metadata filtering — fused through reciprocal rank fusion with cross-encoder reranking. All running on ONNX Runtime, no GPU, no API keys. Comes with an MCP server for agent integration, a Unix socket daemon for warm queries, and a CLI that just works.

🚀 Quick Start

MCP Server (Claude, Cursor, Windsurf)

pip install "velocirag[mcp]"
velocirag index ./my-docs
velocirag mcp

Claude Code — add to .mcp.json in your project root:

{
  "mcpServers": {
    "velocirag": {
      "command": "velocirag",
      "args": ["mcp"],
      "env": { "VELOCIRAG_DB": "/path/to/data" }
    }
  }
}

Then open /mcp in Claude Code and enable the velocirag server. If using a virtualenv, use the full path to the binary (e.g. .venv/bin/velocirag).

Claude Desktop — add to claude_desktop_config.json:

{
  "mcpServers": {
    "velocirag": {
      "command": "velocirag",
      "args": ["mcp", "--db", "/path/to/data"]
    }
  }
}

Cursor — add to .cursor/mcp.json:

{
  "mcpServers": {
    "velocirag": {
      "command": "velocirag",
      "args": ["mcp", "--db", "/path/to/data"]
    }
  }
}

Python API

from velocirag import Embedder, VectorStore, Searcher

embedder = Embedder()
store = VectorStore('./my-db', embedder)
store.add_directory('./my-docs')
searcher = Searcher(store, embedder)
results = searcher.search('query', limit=5)

CLI

pip install velocirag
velocirag index ./my-docs
velocirag search "your query here"

Search Daemon (warm engine for CLI users)

velocirag serve --db ./my-data        # start daemon (background)
velocirag search "query"              # auto-routes through daemon
velocirag status                      # check daemon health
velocirag stop                        # stop daemon

The daemon keeps the ONNX model + FAISS index warm over a Unix socket. First query loads the engine (~1s), subsequent queries return in ~180ms with full 4-layer fusion.

🎯 Why VelociRAG?

4-layer search — vector + BM25 keyword + knowledge graph + metadata, fused with RRF
No LLM needed — search runs entirely on local models (MiniLM + TinyBERT, ~80MB total)
No GPU needed — pure ONNX inference, runs on any machine
~3ms warm search — daemon keeps models + indices warm over Unix socket
Incremental indexing — add files without rebuilding the whole index
MCP server — plug into Claude, Cursor, Windsurf, any MCP client

Related Projects

Memkoshi — Agent memory system. Uses VelociRAG as its search engine.
Stelline — Session intelligence. Crafts memories from conversation logs.
Glyph — MCP security scanner and runtime protection.

🏗️ How It Works

The 4-layer pipeline:

Query → expand (acronyms, variants)
      → [Vector]   FAISS cosine similarity (384d, MiniLM-L6-v2 via ONNX)
      → [Keyword]  BM25 via SQLite FTS5
      → [Graph]    Knowledge graph traversal
      → [Metadata] Structured SQL filters (tags, status, project)
      → RRF Fusion → Cross-encoder rerank → Results

What each layer catches:

Query type	Vector	Keyword	Graph	Metadata
Conceptual ("improve error handling")	✅	—	—	—
Exact match ("ERR_CONNECTION_REFUSED")	—	✅	—	—
Connected concepts	—	—	✅	—
Filtered ("#python status:active")	—	—	—	✅
Combined ("React state management")	✅	✅	✅	✅

✨ Features

ONNX Runtime — 184ms cold start, 3ms cached. No PyTorch, no GPU
Four-layer fusion — FAISS vector similarity + SQLite FTS5 (BM25) + knowledge graph + metadata filtering, merged via reciprocal rank fusion
Cross-encoder reranking — TinyBERT reranker via ONNX Runtime — included in base install, no PyTorch needed. Downloads ~17MB model on first use
Incremental graph updates — file-centric provenance tracking detects what changed and only rebuilds affected nodes/edges. Cascading deletes maintain consistency across all stores (vector, graph, metadata). Multi-source support with isolated provenance per source
MCP server — Five tools (search, index, add_document, health, list_sources) for Claude, Cursor, Windsurf
Search daemon — Unix socket server keeps ONNX model + FAISS index warm between queries
Knowledge graph — Analyzers build entity, temporal, topic, and explicit-link edges from markdown. Optional GLiNER NER. 418 files in 2.1s
Smart chunking — Header-aware splitting preserves document structure and parent context
Query expansion — Acronym registry, casing/spacing variants, underscore-aware tokenization
Runs anywhere — CPU-only, 8GB RAM, no API keys, no external services

🤖 MCP Server

VelociRAG exposes a Model Context Protocol server for seamless agent integration:

Available tools:

search — 4-layer fusion search with reranking
index — Add documents to the knowledge base
add_document — Insert single document
health — System diagnostics
list_sources — Show indexed document sources

The MCP server process stays alive between queries, so models load once and every subsequent search is warm. Works with any MCP-compatible client.

🐍 Python API

Full 4-layer unified search:

from velocirag import (
    Embedder, VectorStore, Searcher,
    GraphStore, MetadataStore, UnifiedSearch,
    GraphPipeline
)

# Build the full stack
embedder = Embedder()
store = VectorStore('./search-db', embedder)
graph_store = GraphStore('./search-db/graph.db')
metadata_store = MetadataStore('./search-db/metadata.db')

# Index with graph + metadata
store.add_directory('./docs')
pipeline = GraphPipeline(graph_store, embedder, metadata_store)
pipeline.build('./docs', source_name='my-docs')

# Unified search across all layers
searcher = Searcher(store, embedder)
unified = UnifiedSearch(searcher, graph_store, metadata_store)
results = unified.search(
    'machine learning algorithms',
    limit=5,
    enrich_graph=True,
    filters={'tags': ['python'], 'status': 'active'}
)

Quick semantic search:

from velocirag import Embedder, VectorStore, Searcher

embedder = Embedder()
store = VectorStore('./db', embedder)
store.add_directory('./docs')
searcher = Searcher(store, embedder)
results = searcher.search('neural networks', limit=10)

Incremental graph updates:

from velocirag import Embedder, GraphStore, GraphPipeline

# First run — full build, populates provenance
gs = GraphStore('./db/graph.db')
pipeline = GraphPipeline(gs, embedder=Embedder())
pipeline.build('./docs', source_name='my-docs')  # full build

# Subsequent runs — only changed files get reprocessed
pipeline.build('./docs', source_name='my-docs')  # incremental (automatic)

# Force full rebuild
pipeline.build('./docs', source_name='my-docs', force_rebuild=True)

# Multi-source graphs
pipeline.build('./project-a', source_name='project-a')
pipeline.build('./project-b', source_name='project-b')  # isolated provenance

# Deleted files automatically cascade across all stores
# (vector, FTS5, graph, metadata) on next build

💻 CLI Reference

# Index documents (graph + metadata built by default)
velocirag index <path> [--no-graph] [--no-metadata] [--gliner] [--full-graph] [--force]
                       [--source NAME] [--db PATH]

# Search across all layers (auto-routes through daemon if running)
velocirag search <query> [--limit N] [--threshold F] [--format text|json]

# Search daemon
velocirag serve [--db PATH] [-f]         # start daemon (-f for foreground)
velocirag stop                            # stop daemon
velocirag status                          # check daemon health

# Metadata queries
velocirag query [--tags TAG] [--status S] [--project P] [--recent N]

# System health and status
velocirag health [--format text|json]

# Start MCP server
velocirag mcp [--db PATH] [--transport stdio|sse]

Options:

--no-graph — Skip knowledge graph build
--no-metadata — Skip metadata extraction
--full-graph — Build graph WITH semantic similarity edges (~2GB extra RAM)
--source NAME — Label for multi-source provenance isolation
--force — Clear and rebuild from scratch
--gliner — Use GLiNER for entity extraction (requires pip install "velocirag[ner]")

📊 Performance

Real benchmarks on ByteByteGo/system-design-101 (418 files, 1,001 chunks):

Metric	Value
Index (418 files)	13.6s
Search (warm, 5 results)	35–90ms
Graph build (light)	2.1s → 2,397 nodes, 8,717 edges
Incremental update (1 file)	1.3s
Reranker	Cross-encoder TinyBERT via ONNX
Install size	~80MB (no PyTorch)
RAM usage	<1GB with all models loaded

Production deployment (6,300+ chunks, 3 sources, 950 files):

Metric	Value
Full search (warm)	16ms avg, 2ms min
Full search (first run)	22ms avg, 4ms min
Search P50 / P95	17ms / 55ms
Hit rate (100-query benchmark)	99/100
Graph	3,125 nodes, 132,320 edges
Reranker	Cross-encoder TinyBERT via ONNX
RAM	<1GB with all models loaded

⚙️ Configuration

Environment Variable	Default	Description
`VELOCIRAG_DB`	`./.velocirag`	Database directory
`VELOCIRAG_SOCKET`	`/tmp/velocirag-daemon.sock`	Daemon socket path
`NO_COLOR`	—	Disable colored output

Dependencies (all included in base install):

onnxruntime — ONNX inference (embedder + reranker)
tokenizers + huggingface-hub — model loading
faiss-cpu — vector similarity search
networkx + scikit-learn — knowledge graph + topic clustering
numpy, click, pyyaml, python-frontmatter

Optional extras:

pip install "velocirag[mcp]" — MCP server (adds fastmcp)
pip install "velocirag[ner]" — GLiNER entity extraction (adds gliner, requires PyTorch)

📚 References

VelociRAG builds on these foundational works:

Core Fusion & Retrieval

Reciprocal Rank Fusion — Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods." SIGIR '09.
Core fusion algorithm for merging results across retrieval layers.

BM25 — Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M., & Gatford, M. (1994). "Okapi at TREC-3." TREC-3.
Keyword search foundation via SQLite FTS5.

Embeddings & Neural IR

Sentence-BERT — Reimers, N., & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP 2019. paper
Dense embedding architecture using all-MiniLM-L6-v2.

MiniLM — Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., & Zhou, M. (2020). "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers." NeurIPS 2020. paper
Efficient transformer distillation for production embedding models.

Reranking & Neural Models

Cross-Encoder Reranking — Nogueira, R., & Cho, K. (2019). "Passage Re-ranking with BERT." arXiv:1901.04085. paper
Cross-attention reranking with TinyBERT on MS MARCO.

TinyBERT — Jiao, X., et al. (2020). "TinyBERT: Distilling BERT for Natural Language Understanding." Findings of EMNLP 2020. paper
Compressed BERT for fast reranking inference.

Vector Search & Systems

FAISS — Johnson, J., Douze, M., & Jégou, H. (2019). "Billion-scale similarity search with GPUs." IEEE Transactions on Big Data. paper
High-performance vector similarity search engine.

GLiNER — Zaratiana, U., Nzeyimana, A., & Holat, P. (2023). "GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer." arXiv:2311.08526. paper
Generalist NER for knowledge graph entity extraction (optional dependency).

📄 License

MIT — Use it anywhere, build anything.

Need agent integration help? Check AGENTS.md for machine-readable project context.

Built for agents who think fast and remember faster.