hermes-claude-code-local
Health Uyari
- No license — Repository has no license file
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Low visibility — Only 5 GitHub stars
Code Gecti
- Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions Gecti
- Permissions — No dangerous permissions requested
Bu listing icin henuz AI raporu yok.
Run Hermes Agent + Claude Code locally on llama.cpp — zero API costs. A 4h / 7M-token session that would have cost $94 on Claude Opus 4.7
Hermes Local Stack — Local Agentic Coding with Claude Code, llama.cpp, Zero API Costs
A 4-hour autonomous coding session — 7,256,671 tokens — would have cost $94.34 on Claude Opus 4.7. With this stack: $0.
This repo wires Hermes Agent directly to llama.cpp and optionally bridges Claude Code through a local LiteLLM proxy — so your agent calls tools, edits files, and runs task loops 100% on your own hardware.
Real session · 4 hours · 7,256,671 tokens
Same session with Claude Opus 4.7 would have cost ~$94.34 · Cost with this stack: less than a coffee
Session title: "Claude Code Migration im Hintergrund läuft #7" — Hermes autonomously ran a multi-step code migration in the background, 7th iteration, while the phone stayed in my pocket.
Quick Start
Windows (PowerShell)
./setup_hermes_local.ps1
Linux / macOS
curl -fsSL https://raw.githubusercontent.com/KaiFelixBennett/hermes-claude-code-local/main/setup.sh | bash
That's it. The setup script verifies your environment, configures Hermes with llama.cpp, and launches everything in one command.
What You Get
- Agentic coding, fully local — Hermes plans multi-step tasks, calls tools (web search, file edits, browser), and executes autonomously without cloud round-trips
- Hermes Agent running locally with llama.cpp as the inference engine
- Claude Code bridge via LiteLLM (optional) — Claude Code talks to your local model instead of Anthropic's API
- Telegram integration — control Hermes from any device via Telegram, including voice messages
- Zero API costs — everything runs on your hardware
- MTP / speculative decoding — faster inference with Qwen3.6-27B-MTP GGUF out of the box
- Self-healing scripts — LiteLLM restarts automatically if it crashes
- Model tuning notes per GGUF under
docs/models/
Architecture
The core path looks like this:
Hermes -> llama.cpp
and for local Claude Code tasks:
Hermes -> claude-code skill -> Claude Code -> LiteLLM -> llama.cpp
Why This Hybrid Setup?
At first glance, the architecture looks more complicated than just pointing everything straight at one local model. In practice, the split is what makes the system stable:
- Hermes talks directly to
llama.cpp— fewer moving parts, lower overhead, simpler debugging - Claude Code goes through LiteLLM — it expects Anthropic-style API behavior and is pickier about compatibility
- The wrapper scripts auto-heal the bridge when LiteLLM is not already running
The honest trade-off on Claude Code: Claude Code is a powerful tool but has a real runtime overhead — system prompt, tool schemas, and scaffolding can consume 60k+ tokens before your first message on a 64k context window. With Hermes talking directly to llama.cpp, that overhead disappears. Claude Code becomes worthwhile again once you run 128k+ context reliably; until then, Hermes direct is the better default for agent loops.
Why llama.cpp and not Ollama?
Ollama has a friendlier setup experience, but llama.cpp was chosen here for three reasons:
- MTP / speculative decoding support for the Qwen3.6-27B-MTP GGUF (measurable speed boost)
- Fine-grained control over GPU backends (HIP, Vulkan, CUDA) and context windows
- Direct API compatibility — llama.cpp speaks the OpenAI
/v1format natively without an extra translation layer
If you prefer Ollama, the LiteLLM bridge in this stack can route to an Ollama endpoint with a one-line config change.
Why Dense over MoE for Agentic Tasks?
MoE models (e.g. Qwen 35B-A3B) activate only a fraction of parameters per token — great for throughput, but they can drift on long agent loops where consistent multi-step reasoning matters. A dense 27B model activates all parameters on every step:
- More stable over long task chains — fewer reasoning errors when Hermes is 5–10 tool calls deep
- Less hallucination on sequential code edits — the model holds intent across the full loop
- Verified in practice — the 4-hour session was iteration #7 of an autonomous code migration, 9+ user turns, running unattended via Telegram
The trade-off: MoE models can handle larger context windows faster on the same hardware. Once you can run 128k+ context reliably on your setup, re-evaluating a larger MoE makes sense.
Who This Is For
Use this repo if you want one of these outcomes:
- run autonomous agentic coding tasks locally with zero API costs
- control Hermes from Telegram while it works in the background
- test Claude Code against a local model instead of the Anthropic API
- keep working launch scripts and config in one place
- reuse the setup later on another machine with minimal changes
Prerequisites
| Requirement | Minimum | Recommended |
|---|---|---|
| OS | Windows 10/11 (WSL2) or Linux/macOS | Windows 11 + WSL2 Ubuntu |
| RAM | 16 GB | 32 GB |
| Disk | ~10 GB for model + tools | SSD |
| GPU | Strongly recommended | AMD Radeon (ROCm/HIP) or NVIDIA (CUDA) |
On GPU: For the Qwen3.6-27B Q4_K_M model used here, plan for 24+ GB VRAM — the model weights alone are ~14.5 GB, and the KV cache at 64k context adds several GB on top. 16 GB VRAM will be tight or require CPU offload (noticeably slower). CPU-only works but is very slow for interactive agent loops; consider a 7B Q4 model if GPU is not available.
You should already have:
- Hermes installed in WSL (Windows) or natively (Linux/macOS)
claudeCLI installed (for the Claude Code bridge path)- a local GGUF model available on disk
- a working
llama.cppbinary (or letsetup.shinstall it)
This repo gives you the wiring, launch scripts, and tested configuration. It does not ship model weights or llama.cpp binaries.
Setup
Option A: One-Command Setup (Recommended)
Windows:
./setup_hermes_local.ps1
Linux / macOS:
bash setup.sh
What these scripts do:
- Verify required local files exist
- Check your configured GGUF path in
hermes_config.yaml - Prompt for a GGUF path if the configured one is missing
- Write the new path back to
model.path - Start the normal Hermes launcher
With Claude Code bridge:
# Windows
./setup_hermes_local.ps1 -WithClaudeBridge
# Linux / macOS
bash setup.sh --with-claude-bridge
Config-only (no launch):
# Windows
./setup_hermes_local.ps1 -SkipLaunch
# Linux / macOS
bash setup.sh --skip-launch
Option B: Manual Start
Hermes only:
# Windows
./start_hermes.bat
# Linux / macOS
bash start_hermes.sh
Hermes + Claude Code bridge:
# Windows
./start_hermes_claude_local.bat
This path checks or starts llama.cpp, starts LiteLLM on 127.0.0.1:4000, launches Hermes, and ensures Claude Code subprocesses use the local gateway.
Claude Code check without Hermes:
./claude_local.sh -p 'Reply with exactly OK.' --output-format json
The wrapper auto-starts LiteLLM on demand. Verify success when JSON output contains modelUsage.qwen-local-anthropic.
What This Looks Like In Practice
Hermes inspects the current state, switches into the claude-code skill when deeper coding work is needed, and delegates through the local Claude Code bridge:
Telegram Integration
Hermes is not tied to one terminal window — you can control it from Telegram:
- Run Hermes on your local machine or in WSL
- Keep the model and tooling running in the background
- Message Hermes from Telegram instead of sitting in front of the terminal
- Hermes uses the same local stack while you interact from chat
Connect Your Own Telegram Bot
The shortest stable setup flow:
- Create a bot with @BotFather
- Get your numeric Telegram user ID
- Run
hermes gateway setupin WSL and choose Telegram - Start the gateway with
hermes gateway - Send your bot a message and verify Hermes replies
Official references:
Note: Telegram group privacy mode is the most common reason bots appear "silent" in groups. If you change privacy mode in BotFather, remove and re-add the bot to the group afterwards.
Configuration
The main config file is hermes_config.yaml. The only values you typically need to change:
| Field | What It Does | Example |
|---|---|---|
model.path |
Path to your GGUF model file | E:\models\qwen3.6.gguf |
model.backend |
GPU backend for llama.cpp | hip, vulkan, cuda, cpu |
model.binary_dir |
(optional) Pin a specific llama.cpp build | — |
Everything else — Hermes, LiteLLM, Claude Code — is pre-configured and works out of the box.
Current Verified Defaults
ANTHROPIC_BASE_URL=http://127.0.0.1:4000
ANTHROPIC_AUTH_TOKEN=***
ANTHROPIC_MODEL=qwen-local-anthropic
ANTHROPIC_CUSTOM_MODEL_OPTION=qwen-local-anthropic
Model tuning details are documented in docs/models/qwen3.6-27b-mtp-gguf-llamacpp.md.
Files That Matter
| File | Purpose |
|---|---|
hermes_config.yaml |
Main Hermes provider and model config |
setup_hermes_local.ps1 |
First-run setup wizard (recommended entry point) |
start_hermes.bat |
Quick start Hermes + llama.cpp |
start_hermes_claude_local.bat |
Start with Claude Code bridge |
start_llamacpp.ps1 |
Starts llama.cpp from repo config |
start_litellm.ps1 |
Starts LiteLLM for the Claude bridge |
claude_local.sh |
Local Claude Code entry point (standalone) |
ensure_claude_local_bridge.sh |
On-demand LiteLLM self-healing wrapper |
litellm.proxy.yaml |
LiteLLM bridge config |
HERMES_README.md |
Deeper operator notes for this setup |
Reusing on Another Machine
This repo does not ship the GGUF model or llama.cpp binaries. Usually only these parts need to change:
model.pathinhermes_config.yaml- The installed llama.cpp backend and binary folder
- Machine-specific paths to GGUFs or tools
- Optional GPU backend tuning
The launch scripts resolve paths dynamically — you typically don't need to edit them manually.
Troubleshooting
llama.cpp won't start on Windows
Check that your GPU backend matches your hardware in hermes_config.yaml:
- AMD Radeon → set
model.backend: "hip" - NVIDIA → set
model.backend: "cuda"or"vulkan" - No GPU → set
model.backend: "cpu"(slower but works)
LiteLLM bridge is down
The wrapper scripts auto-heal. If it's still not working:
# Windows
./start_litellm.ps1
# Linux / macOS
make litellm
Hermes can't reach llama.cpp
On Windows with WSL2, make sure mirrored networking is enabled (default in WSL 2.0+). Test from inside WSL:
curl http://127.0.0.1:8080/v1/models
Telegram bot appears silent in groups
- Check that privacy mode is disabled in BotFather
- Remove and re-add the bot to the group after changing settings
- Hermes can also stay DM-only for the simplest setup
Why This Setup?
Local LLMs are great for privacy and cost control, but Claude Code expects an Anthropic-compatible API. This repo solves that with a hybrid architecture:
- Hermes talks directly to llama.cpp (minimal overhead)
- Claude Code goes through LiteLLM (Anthropic API compatibility layer)
- Both share the same local model — no duplicate inference servers needed
A real session on an AMD Radeon AI PRO R9700 (32 GB VRAM) ran 7 million tokens for €0 across an evening of agent-driven coding. The architecture described here is exactly what was running.
Contributing
Issues, PRs, and config contributions for different hardware setups (NVIDIA CUDA, Apple Silicon, CPU-only builds) are very welcome.
The most useful contributions right now:
- Hardware configs — got it running on NVIDIA / Apple Silicon / CPU-only? Open a Hardware Config issue or add a file under
docs/models/ - Bug reports — especially for WSL2 edge cases and GPU backend quirks
- Script improvements — portability, error messages, edge cases
If this stack saves you money or solves a problem, a ⭐ on the repo helps others find it.
Community
This project exists because local LLMs are finally good enough to replace cloud APIs for real agentic work. If you've used it, broken it, or improved it — share it with others who might benefit:
- r/LocalLLaMA — the best community for local model setups
- r/ClaudeAI — for Claude Code users
- Hermes Discord — for Hermes-specific questions
License
MIT
Yorumlar (0)
Yorum birakmak icin giris yap.
Yorum birakSonuc bulunamadi