helix-agent

Cut Claude Code's token usage by 82–97% — automatically. One MCP server that detects retry loops, compresses screenshots & DOM via local LLM, and auto-selects the optimal model for your GPU.

日本語README: README.ja.md

License: MIT · Python 3.12+ · MCP · Tests · v0.15.1 · MCP 3-Primitive · Works on 8GB VRAM

Token Savings — Real Numbers

"My Max plan 5-hour quota vanished in 19 minutes." (Claude Code users, 666+ 👍)

helix-agent tackles the #1 pain point of Claude Code: token waste.

| What helix-agent does | Without | With | Reduction |
|---|---|---|---|
| Screenshot analysis (vision_compress) | ~15,000 tokens | ~400 tokens | 97% |
| DOM/HTML processing (dom_compress) | ~114,000 tokens | ~500 tokens | 99% |
| Browser automation (agent-browser) | ~15,000 tokens/action | ~1,000–2,700 | 82–93% |
| Retry loop prevention (retry_guard) | ∞ (until quota dies) | Stopped at 3rd repeat | 100% |
| Routine tasks (think/agent_task) | Opus tokens ($$$) | Local LLM ($0) | 100% |

All compression runs on your local GPU via Ollama — zero cloud API cost.

The problem in numbers

A typical Claude Code session burns tokens in ways you don't see (source: 926-session audit):

| Where tokens go | Tokens per turn | % of total |
|---|---|---|
| System prompt + MCP tool schemas | 45,000 | ~60% |
| Screenshot / DOM from Playwright MCP | 15,000–114,000 | variable |
| Conversation history rebuild | 10,000+ | grows each turn |
| Your actual prompt | ~500 | <1% |

After 22 turns (average session), that's ~1M+ tokens — most of it overhead.
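The arithmetic behind that estimate, using only the fixed per-turn costs from the table (a rough lower bound, since the variable screenshot/DOM costs are excluded):

```python
# Fixed per-turn overhead from the audit table above (illustrative;
# the history-rebuild cost actually grows each turn, so this understates it).
system_and_schemas = 45_000   # system prompt + MCP tool schemas
history_rebuild = 10_000      # conversation history rebuild
turns = 22                    # average session length

fixed_overhead = (system_and_schemas + history_rebuild) * turns
print(fixed_overhead)  # → 1210000
```

Even without a single screenshot, the fixed overhead alone exceeds a million tokens per session.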

helix-agent attacks each layer:

  • Tool schemas → use defer_loading: true (we document how)
  • Screenshots/DOM → vision_compress / dom_compress (97-99% cut)
  • Browser actions → agent-browser backend (82-93% cut)
  • Retry loops → retry_guard (infinite → 0)
  • Routine delegation → local LLM via think ($0 vs ~$0.04/call on Opus)

Who is this for?

| If you... | helix-agent helps by... |
|---|---|
| Hit Max plan rate limits within 1–2 hours | Compressing screenshots/DOM 97–99% before Claude sees them |
| Watch Claude repeat the same failing command 10+ times | retry_guard stops loops at the 3rd repeat — automatically |
| Pay for Opus tokens on tasks a local model could handle | Delegating reads, summaries, reviews to Ollama ($0) |
| Only have an 8GB GPU and think local LLMs won't help | Auto-selecting gemma4:e2b — proven to work at 2.7× speed |
| Want your agent to remember patterns across sessions | Self-evolving memory saves skills & preferences locally |

What helix-agent does that nothing else does

| Capability | helix-agent | Alternatives |
|---|---|---|
| Screenshot → text (97% token cut) | vision_compress via local LLM | ❌ No MCP server does this |
| DOM → text (99% token cut) | dom_compress via local LLM | ❌ Playwright MCP sends raw DOM |
| Retry loop detection | retry_guard (sub-ms, no LLM) | ❌ Claude Code has no built-in detection |
| GPU auto-detect → model selection | ✅ 8GB to 96GB+ tiers | ❌ Other tools require manual config |
| Self-evolving memory | ✅ hermes-style SKILL.md + Qdrant | ❌ Unique to helix-agent |
| Browser 82–93% token reduction | ✅ agent-browser + fallback chain | △ agent-browser alone (no fallback) |
| All 3 MCP primitives | ✅ 27 Tools + 3 Resources + 3 Prompts | △ Most MCPs only implement Tools |

Why retry_guard?

Claude Code's Opus sometimes gets stuck calling the same tool with identical args when it misreads an error (anthropics/claude-code#41659). A Max plan 5-hour quota can vanish in 19 minutes.

There is no built-in loop detection. The community best practice is "write your own hook". retry_guard packages that hook as a reusable MCP tool.

retry_guard_check(tool_name="navigate", args={"url": "..."})
# → {"loop_detected": true, "repeat_count": 3,
#    "recommendation": "Tool 'navigate' called 3 times with identical args.
#                       Likely stuck in retry loop. Vary args or escalate."}

Three tools, one purpose:

| Tool | Purpose |
|---|---|
| retry_guard_check | Called before a risky tool — warns if this exact call is looping |
| retry_guard_status | Session stats: total_calls / unique_calls / max_repeats |
| retry_guard_reset | Clear history after resolving a loop |

Per-session histories, SHA1-hashed call fingerprints, sliding time window. No LLM required for the guard itself — pure logic, sub-millisecond.
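That mechanism can be sketched in a few lines (an illustrative simplification, not the actual implementation — the real guard adds per-session histories and a sliding time window):

```python
import hashlib
import json

class RetryGuard:
    """Detect identical repeated tool calls via hashed call fingerprints."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = {}  # fingerprint -> repeat count

    def check(self, tool_name, args):
        # SHA1 fingerprint of the exact (tool, args) pair
        payload = json.dumps([tool_name, args], sort_keys=True)
        fp = hashlib.sha1(payload.encode()).hexdigest()
        self.counts[fp] = self.counts.get(fp, 0) + 1
        count = self.counts[fp]
        return {"loop_detected": count >= self.threshold, "repeat_count": count}

guard = RetryGuard()
guard.check("navigate", {"url": "https://example.com"})
guard.check("navigate", {"url": "https://example.com"})
result = guard.check("navigate", {"url": "https://example.com"})
print(result)  # → {'loop_detected': True, 'repeat_count': 3}
```

Because the check is a dictionary lookup on a hash, it costs well under a millisecond and never touches an LLM.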

Bundled extras

GPU Auto-Detection & Model Tiers

helix-agent detects your GPU at startup and selects the best model for each task. Works on any NVIDIA GPU from 8GB to 96GB+.

| Your GPU | VRAM | Model Selected | DOM Compress | Memory Review |
|---|---|---|---|---|
| RTX 4060 | 8GB | gemma4:e2b | 10.2s | 9.4s |
| RTX 4070 Ti / 5070 Ti | 16GB | gemma4:e4b | 11.8s | 12.3s |
| RTX 4090 / 3090 | 24GB | gemma4:26b (MoE) | 14.7s | 14.4s |
| RTX PRO 6000 / A6000 | 48GB+ | gemma4:31b | 27.5s | 18.7s |

Key finding: gemma4:e2b on 8GB VRAM runs 2.7× faster than 31b with comparable output quality for compression tasks. You don't need a $2,000 GPU to save tokens.

# No configuration needed — just install a model that fits your GPU:
ollama pull gemma4:e2b   # 8GB GPU
ollama pull gemma4:e4b   # 16GB GPU
ollama pull gemma4:26b   # 24GB GPU
ollama pull gemma4:31b   # 48GB+ GPU
# helix-agent picks the right one automatically.
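A rough sketch of how VRAM-based tier selection can work (thresholds taken from the table above; `detect_vram_gb` and `pick_model` are illustrative names, not the project's API):

```python
import subprocess

# VRAM tier -> model, matching the table above (threshold values are assumptions)
TIERS = [(48, "gemma4:31b"), (24, "gemma4:26b"), (16, "gemma4:e4b"), (8, "gemma4:e2b")]

def pick_model(vram_gb):
    """Return the largest model tier that fits the detected VRAM."""
    for min_gb, model in TIERS:
        if vram_gb >= min_gb:
            return model
    return "gemma4:e2b"  # fall back to the smallest tier

def detect_vram_gb():
    """Query total VRAM of GPU 0 via nvidia-smi (NVIDIA GPUs only)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"], text=True)
    return int(out.splitlines()[0]) / 1024  # MiB -> GiB

print(pick_model(8))   # → gemma4:e2b
print(pick_model(24))  # → gemma4:26b
```

Doing the detection once at startup means every subsequent tool call gets the right model with no per-call overhead.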

Token savers — screenshot-to-text pipeline

The core idea: never send raw images or HTML to Claude. Compress them locally first.

┌──────────────┐     ┌─────────────────┐     ┌──────────────┐
│ Screenshot   │────→│ vision_compress │────→│ ~400 tokens  │
│ (15K tokens) │     │ (local gemma4)  │     │ (text only)  │
└──────────────┘     └─────────────────┘     └──────────────┘

┌──────────────┐     ┌─────────────────┐     ┌──────────────┐
│ DOM/HTML     │────→│ dom_compress    │────→│ ~500 tokens  │
│ (114K tokens)│     │ (local gemma4)  │     │ (text only)  │
└──────────────┘     └─────────────────┘     └──────────────┘

When computer_use(action="screenshot", analyze=True) is called, the raw image is automatically deleted from the response — Claude only receives the text summary. This happens transparently, no extra configuration needed.

  • vision_compress — screenshot → local vision LLM → JSON (page_type, interactive_elements, state_flags). 97% reduction.
  • dom_compress — HTML → local LLM → JSON (forms, links, buttons, next_action_candidates). 99% reduction.
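For intuition, the kind of structured extract dom_compress produces can be approximated even without an LLM — a stdlib-only sketch (the real tool uses local gemma4 and returns richer fields such as next_action_candidates):

```python
from html.parser import HTMLParser

class DomExtract(HTMLParser):
    """Crude structural extract: forms, links, buttons (LLM-free sketch)."""

    def __init__(self):
        super().__init__()
        self.out = {"forms": [], "links": [], "buttons": []}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self.out["forms"].append(a.get("action", ""))
        elif tag == "a" and "href" in a:
            self.out["links"].append(a["href"])
        elif tag == "button":
            self.out["buttons"].append(a.get("id", ""))

html = '<form action="/login"><button id="submit">Go</button></form><a href="/help">Help</a>'
p = DomExtract()
p.feed(html)
print(p.out)
# → {'forms': ['/login'], 'links': ['/help'], 'buttons': ['submit']}
```

A few dozen tokens of JSON like this is what Claude sees instead of 114K tokens of raw markup.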

Real example (tested on RTX PRO 6000):

Input:  1920×1048 screenshot of X.com (would cost ~15,000 tokens)
Output: "X home feed, Japanese UI, 'For You' tab, post by @Suryansh777
         about Claude Code Resource Bible visible" (~400 tokens)
Saved:  7,362 tokens in one call

Browser automation (v0.12.0)

computer_use routes browser actions through Vercel's agent-browser (Rust/CDP) by default, falling back to helix-pilot → Playwright.

Measured on 50 identical automation flows:

| Backend | Tokens per action | React controlled components |
|---|---|---|
| Playwright (screenshot+DOM) | ~15,000 | ⚠️ setValue silently reverts |
| agent-browser (accessibility tree) | ~1,000–2,700 | ✅ native keyboard events work |

Autonomous screen verification (v0.14.0, NEW)

Claude Code's computer_use normally sends raw screenshots (~15,000 tokens each) back to the model. helix-agent intercepts this:

Action: computer_use(action="click", target="#submit")
  ↓
Verify: computer_use(action="screenshot", analyze=True)
  ↓ (raw image auto-deleted, local gemma4 analyzes)
Result: "Form submitted, success toast visible" (~400 tokens)

The instructions field in the MCP server tells Claude Code to:

  1. Always use vision_compress instead of sending raw screenshots
  2. Always verify actions with analyze=True screenshots
  3. Always run retry_guard_check before repeating any tool call
  4. Delegate routine tasks to local LLM via think at $0 cost

This means Claude Code autonomously saves tokens without any user intervention — just connect the MCP server and it works.

Self-evolving memory (v0.14.0, NEW)

Inspired by NousResearch/hermes-agent: helix-agent reviews conversations every N turns using a local LLM and automatically saves reusable skills and insights — at $0 cost.

  • Memory nudge: Every 5 turns, gemma4 reviews for saveable preferences/corrections
  • Skill auto-generation: Successful task patterns → SKILL.md files (hermes-compatible)
  • The agent gets smarter the more you use it — all running locally

4-Layer Code Review Pipeline (v0.15.0, NEW)

Automated multi-LLM code review that caught every issue any single reviewer found, at ~¥30 (~$0.20) total:

Layer 2: gemma4 ReAct review ($0, with web_search + RAG)
  ↓ findings + context
Layer 3: Sonnet 4.6 verification + cross-file analysis (~¥10)
  ↓ merged findings
Layer 4: Opus 4.6 meta-review (~¥5, reads summary only — no source code)
  ↓ final verdict
Codex:   Consultant (P1 issues only, on-demand)

Empirical results (5-model comparison on real codebase):

| Reviewer | Findings | Unique | Cost |
|---|---|---|---|
| gemma4+RAG (local) | 7 | 1 | $0 |
| Codex GPT-5.3 | 5 | 0 | ~¥50 |
| Sonnet 4.6 | 14 | 1 | ~¥20 |
| Opus 4.6 | 16 | 4 | ~¥100 |
| 4-Layer Combined | 16+ | all | ~¥30 |

Key finding: gemma4 + RAG ($0) outperforms Codex GPT-5.3 (~¥50) in code review.

# Daily review (gemma4 only, $0)
code_review(target="src/", skip_sonnet=True)

# Pre-release (gemma4 + Sonnet, ~¥10)
code_review(target="src/", context="payment module")

# P1 emergency (+ Codex consultant)
code_review(target="src/", codex_consult=True)

# Control Codex reasoning depth explicitly
code_review(target="src/", codex_consult=True, codex_effort="xhigh")

Codex reasoning effort control (v0.15.0):

  • codex_effort="none|minimal|low|medium|high|xhigh" — overrides Codex reasoning depth
  • Default (empty) → high
  • Auto-escalation: when the pipeline detects ≥3 P1 issues, Codex is invoked with xhigh automatically (no manual tuning needed)
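The resolution order above can be sketched as (hypothetical helper, not the project's API):

```python
EFFORT_LEVELS = ("none", "minimal", "low", "medium", "high", "xhigh")

def resolve_codex_effort(requested="", p1_count=0):
    """Explicit override > auto-escalation on 3+ P1 findings > default."""
    if requested:
        if requested not in EFFORT_LEVELS:
            raise ValueError(f"unknown effort: {requested}")
        return requested
    if p1_count >= 3:   # auto-escalate when the pipeline flags 3+ P1 issues
        return "xhigh"
    return "high"       # documented default

print(resolve_codex_effort())                    # → high
print(resolve_codex_effort(p1_count=3))          # → xhigh
print(resolve_codex_effort("low", p1_count=5))   # → low
```

An explicit `codex_effort` always wins, so manual tuning is never fought by the auto-escalation rule.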

gemma4 Context Expansion (v0.15.0, NEW)

gemma4 now operates as a 12-tool ReAct agent with external knowledge access:

  • web_search — Qdrant RAG search + SearXNG web search
  • search_memory — enhanced with source/category filters
  • add_memory — auto-categorizes into 9 categories (vtuber/coding/mcp/genai/llm/security/infra/x_ops/job)
  • Security: 5 injection defense rules prevent execution of instructions found in search results

Qwen3-VL 32B Vision/OCR (v0.15.0, NEW)

Dedicated vision model for 95%+ OCR accuracy on Japanese text:

| Model | Phone number | Postal code | Cost |
|---|---|---|---|
| gemma4:31b | ❌ 0565-2016 | ❌ 446-8700 | $0 |
| Qwen3-VL 32B | ✅ 0566-76-2316 | ✅ 446-8799 | $0 |

Auto-selected for 48GB+ GPUs. Role separation: gemma4 = code/reasoning/RAG, Qwen3-VL = vision/OCR.

Parallel Task Execution (v0.15.1, NEW)

Run multiple tasks simultaneously with automatic model routing:

parallel_tasks(tasks='[
    {"task": "Summarize this code", "type": "summarize", "context": "..."},
    {"task": "Translate to English: ...", "type": "translate"},
    {"task": "Classify these items", "type": "classify"},
    {"task": "Search for best practices", "type": "search"},
    {"task": "Security review", "type": "review", "context": "...code..."}
]')

2-axis automatic model selection — task type × input complexity:

| Input size | summarize/translate/classify | search/code_gen | review |
|---|---|---|---|
| Short (<3K chars) | gemma4:e2b (3–6s) | gemma4:e4b (26s) | gemma4:31b |
| Medium (3–8K) | gemma4:e4b (12s) | gemma4:31b | gemma4:31b |
| Long (>8K) | gemma4:31b (21s) | gemma4:31b | gemma4:31b |

Benchmark (5 tasks simultaneous, clip-bridge 501 lines):

| Config | Time | VRAM | Quality |
|---|---|---|---|
| e2b+e4b mixed parallel | 51s | 10GB | All 5 tasks OK |
| e4b×3 specialist parallel | 85s | 6GB | P1=2 detected |
| 31b single | 130s | 20GB | P1=2, P2=1, P3=2 |

Light tasks (e2b/e4b) run in parallel via asyncio.gather. Heavy tasks (31b+) run sequentially to avoid GPU contention.
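A sketch of the 2-axis routing plus the light/heavy split (thresholds from the table above; the real Ollama dispatch is stubbed out, and function names are illustrative):

```python
import asyncio

def route(task_type, input_chars):
    """2-axis model choice: task type x input size (thresholds from the table)."""
    if task_type == "review":
        return "gemma4:31b"
    if task_type in ("search", "code_gen"):
        return "gemma4:e4b" if input_chars < 3_000 else "gemma4:31b"
    # summarize / translate / classify
    if input_chars < 3_000:
        return "gemma4:e2b"
    return "gemma4:e4b" if input_chars <= 8_000 else "gemma4:31b"

LIGHT = {"gemma4:e2b", "gemma4:e4b"}

async def run_task(task):
    model = route(task["type"], len(task.get("context", "")))
    return (task["type"], model)  # placeholder for the real Ollama call

async def dispatch(tasks):
    light = [t for t in tasks if route(t["type"], len(t.get("context", ""))) in LIGHT]
    heavy = [t for t in tasks if t not in light]
    results = await asyncio.gather(*(run_task(t) for t in light))  # parallel
    for t in heavy:                     # sequential, to avoid GPU contention
        results.append(await run_task(t))
    return results

out = asyncio.run(dispatch([
    {"type": "summarize", "context": "short"},
    {"type": "review", "context": "code"},
]))
print(out)  # → [('summarize', 'gemma4:e2b'), ('review', 'gemma4:31b')]
```

The split keeps small models saturating the GPU in parallel while 31b-class jobs queue behind them.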

Autonomous operations & growth loop (v0.15.0, NEW)

helix-agent ships a scripts-layer automation harness that keeps the agent itself healthy. The audit → dispatch → heal chain runs under Windows Task Scheduler so Claude Code stays on a self-healing substrate.

| Script | Purpose |
|---|---|
| scripts/system_auditor.py | Periodic integrity & drift audit across memory, hooks, services |
| scripts/anomaly_dispatcher.py | Routes detected anomalies to the right department / agent |
| scripts/env_self_heal.py | Auto-repairs common environment regressions (services, paths, deps) |
| scripts/critical_files_guard.py | Protects CLAUDE.md, settings.json, core configs from accidental loss (SHA-256 snapshots, 30 generations) |
| scripts/helix_overview.py | Single-command 9-domain overview (corp / memory / RAG growth / anomalies / config / startup / projects / security / maintenance) |
| scripts/dept_feed_bridge.py | Feeds per-department Qdrant RAGs (dept_hr/research/design/build/qa) from live signals |
| scripts/dept_dataset_builder.py | Builds instruction-tuning datasets from department RAG growth |
| scripts/dept_ft_advisor.py | Advises when a department is ready for LoRA fine-tuning |
| scripts/supervisor.py | Watches 9 resident daemons and restarts as needed |
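The snapshot idea behind critical_files_guard.py can be sketched as follows (an illustrative simplification; function name and pruning policy are assumptions, not the script's actual interface):

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def snapshot(path, backup_dir, keep=30):
    """Copy `path` into backup_dir when its SHA-256 changes; keep `keep` generations."""
    path, backup_dir = Path(path), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()[:12]
    dest = backup_dir / f"{path.name}.{digest}"
    if not dest.exists():                     # new content -> new generation
        shutil.copy2(path, dest)
    backups = sorted(backup_dir.glob(f"{path.name}.*"),
                     key=lambda p: p.stat().st_mtime)
    for old in backups[:-keep]:               # prune the oldest beyond `keep`
        old.unlink()
    return dest

# Demo against a throwaway config file
d = Path(tempfile.mkdtemp())
(d / "CLAUDE.md").write_text("# config v1")
dest = snapshot(d / "CLAUDE.md", d / "backups")
print(dest.exists())  # → True
```

Hashing before copying means unchanged files cost one digest and no disk writes per audit cycle.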

Delegation & agents

ReAct loop with tool access, context-inheriting sub-agents, background workers, Qdrant shared memory, JSONL tracing, PathGuard safety, OOM auto-fallback.

  • think / agent_task / parallel_tasks / fork_task — local LLM delegation
  • see / browse / computer_use / vision_compress / dom_compress — vision + browser
  • spawn_agent / send_agent_input / wait_agent / list_agents / close_agent — background workers
  • dept_search / dept_store — per-department Qdrant (dept_hr/research/design/build/qa, mem0_shared)
  • evolving_memory_review / list_learned_skills / get_skill — self-evolving memory
  • retry_guard_check / retry_guard_status / retry_guard_reset — loop detection
  • code_review — 4-layer review pipeline
  • providers / models / config / agent_types — meta

Quick Start

git clone https://github.com/tsunamayo7/helix-agent.git
cd helix-agent
uv sync
ollama pull gemma4:e2b   # 8GB GPU (or e4b/26b/31b for larger GPUs)
uv run python server.py

Add to Claude Code (~/.claude/settings.json):

{
  "mcpServers": {
    "helix-agent": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/helix-agent", "python", "server.py"]
    }
  }
}

Restart Claude Code. retry_guard_check, vision_compress, and friends are now available.

Japanese users — 日本語ユーザー向け

helix-agent ships opt-in Japanese helpers for Claude Code:

  • helix-agent-ja-input — floating input window for Windows that sidesteps the React Ink + IME incompatibility (known issue)
  • ja_screen_read (coming in v1.2) — Japanese UI screenshot parsing via PaddleOCR + gemma4

See README.ja.md for details.

Security

Claude Code has documented prompt-injection vulnerabilities
(CVE-2025-59536)
where malicious content in project files can exfiltrate API tokens. helix-agent
ships PathGuard — path allowlists and sanitization — so delegated tools
cannot access sensitive locations outside the workspace. See SECURITY.md.

Not a Claude Code wrapper

helix-agent is an MCP server that Claude Code connects to — it does not
wrap, proxy, or re-host Claude Code or the Anthropic API. Fully compliant with
Anthropic's Terms of Service.

Requirements

  • Python 3.12+
  • uv
  • Ollama + any Gemma 4 model (auto-selected by GPU):
    • 8GB VRAM: ollama pull gemma4:e2b (2.3B effective, 4GB)
    • 16GB VRAM: ollama pull gemma4:e4b (4.5B effective, 6GB)
    • 24GB VRAM: ollama pull gemma4:26b (MoE 3.8B active, 12GB)
    • 48GB+ VRAM: ollama pull gemma4:31b (30.7B dense, 20GB)

Optional:

  • Qdrant (shared memory)
  • Playwright (browser automation fallback)
  • agent-browser (recommended for 82-93% browser token savings)

MCP 3-Primitive Architecture

helix-agent implements all three MCP primitives as defined by Anthropic Academy:

| Primitive | Control | Count | Examples |
|---|---|---|---|
| Tools | Model-controlled (Claude decides) | 27 | retry_guard_check, think, computer_use, vision_compress, code_review, parallel_tasks, dept_search |
| Resources | App-controlled (read-only data) | 3 | helix://status, helix://models, helix://config |
| Prompts | User-controlled (workflows) | 3 | retry_report, optimize_tokens, setup_guide |

Claude Code (Opus 4.6 — decides what to do)
  │
  ├─ Resources (read-only)
  │   ├─ helix://status       → runtime state, backend, retry-guard stats
  │   ├─ helix://models       → available Ollama/provider models
  │   └─ helix://config       → current configuration
  │
  ├─ Prompts (user-triggered workflows)
  │   ├─ retry_report         → loop detection analysis (Japanese)
  │   ├─ optimize_tokens      → token saving recommendations
  │   └─ setup_guide          → first-run setup walkthrough (Japanese)
  │
  ├─ Tools (27 total)
  │   ├─ retry_guard_check    → is this tool call looping? (pure logic, no LLM)
  │   ├─ vision_compress      → gemma4 vision → ~400-token summary
  │   ├─ dom_compress         → gemma4 text → ~500-token structured extract
  │   ├─ think / agent_task   → ReAct loop with local model
  │   ├─ fork_task            → parent-context inheriting sub-agent
  │   ├─ computer_use / browse → agent-browser → helix-pilot → Playwright
  │   └─ spawn/send/wait/list/close → background agent workers
  │
  └─ Infrastructure
      ├─ Qdrant shared memory
      ├─ JSONL tracing
      ├─ PathGuard path safety
      └─ OOM auto-fallback chain

Contributing

See CONTRIBUTING.md.

Related Projects

  • helix-ai-studio — All-in-one AI chat studio with 7 providers, RAG, MCP tools, and pipeline
  • helix-pilot — GUI automation MCP server — AI controls Windows desktop via local Vision LLM
  • claude-code-codex-agents — MCP bridge to Codex CLI with structured JSONL traces
  • helix-sandbox — Secure sandbox MCP server — Docker + Windows Sandbox

License

MIT
