GeneralStaff

Verification-gate discipline for autonomous coding agents.
Your code. Your keys. Your audit log.

GeneralStaff treats agentic AI as an adversarial input to your codebase. Every cycle runs through a Boolean verification gate before producing a commit: tests must pass, the diff must be non-empty, and a separate reviewer must confirm scope match. Hands-off file lists are enforced by the dispatcher. Every prompt, response, tool call, and diff lands in PROGRESS.jsonl. Open source, BYOK, no SaaS layer.

Status: v0.5.0, 2,052 passing tests, 30+ managed projects. Cross-platform (Windows, macOS, Linux). v0.5.0 fixes two long-standing bugs (a pre-cycle advisor inert since v0.4.0, a heartbeat watchdog that could spin in a kill-loop) and adds four opt-in features. Release notes: CHANGELOG.md.

The problem

Autonomous coding agents fail in one predictable way: they are industrious without judgment. They mark tasks done when tests fail. They produce empty diffs and call them complete. They edit files you told them not to touch. They write confident summaries of work they didn't do.

These aren't edge cases. They're the equilibrium when agent loops rely on instructions the model can drift from instead of locks the model can't bypass. Closed SaaS platforms charge per credit whether the project ships or not. Polsia's top Trustpilot complaint is false task completions. Nobody checks the bot's work against reality, so the damage compounds where you won't see it until next week.

Better prompts won't fix this. Structure will.

What GeneralStaff does instead

Five mechanisms enforced by the dispatcher:

Verification gate. After every cycle: tests pass, diff non-empty, separate reviewer confirms scope match. A cycle is not done until all three hold. Failure rolls the cycle back. The gate is code, not a prompt, and it fires on every cycle.
Hands-off lists. Per-project glob patterns the bot cannot touch. Reviewer checks every diff against the list. Violation → rollback. Empty list = no registration.
Worktree isolation. The bot works in .bot-worktree on a bot/work branch. Your master is untouched until you merge. Bot pushes to bot/work on your remote, nowhere else.
BYOK billing. You pay Anthropic, OpenRouter, or whoever directly. No platform credits, no SaaS middleman, no revenue share.
Open audit log. Full prompts, responses, tool calls, and diffs in state/<project>/PROGRESS.jsonl. Grep-able, reviewable. Closed SaaS tools can't show you theirs.
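
The first two mechanisms reduce to a pure function over a cycle's observable results. What follows is a minimal TypeScript sketch, not GeneralStaff's internals: the CycleResult shape, the function names, and the "verified" verdict string are assumptions, and hands-off patterns are simplified to exact paths or directory prefixes ending in "/" rather than full globs.

```typescript
// Hypothetical shapes for illustration; not GeneralStaff's real types.
interface CycleResult {
  testsPassed: boolean;           // did the project's test command succeed?
  changedFiles: string[];         // paths touched by the cycle's diff
  reviewerConfirmsScope: boolean; // verdict from the separate reviewer
}

type GateVerdict =
  | { verdict: "verified" } // assumed label for an accepted cycle
  | { verdict: "verification_failed"; reason: string; handsOffViolations: string[] };

// Simplified matcher: a pattern ending in "/" matches everything under that
// directory; any other pattern must equal the path exactly. Real glob
// patterns would need a proper glob library.
function matchesHandsOff(path: string, pattern: string): boolean {
  return pattern.endsWith("/") ? path.startsWith(pattern) : path === pattern;
}

function handsOffViolations(changedFiles: string[], handsOff: string[]): string[] {
  return changedFiles.filter((f) => handsOff.some((p) => matchesHandsOff(f, p)));
}

// The gate is a conjunction: every check must hold or the cycle is rejected.
function runGate(cycle: CycleResult, handsOff: string[]): GateVerdict {
  const violations = handsOffViolations(cycle.changedFiles, handsOff);
  const failures: string[] = [];
  if (!cycle.testsPassed) failures.push("tests failed");
  if (cycle.changedFiles.length === 0) failures.push("empty diff");
  if (!cycle.reviewerConfirmsScope) failures.push("reviewer rejected scope");
  if (violations.length > 0) failures.push("hands-off violation");
  if (failures.length === 0) return { verdict: "verified" };
  return {
    verdict: "verification_failed",
    reason: failures.join("; "),
    handsOffViolations: violations,
  };
}

const verdict = runGate(
  { testsPassed: true, changedFiles: ["src/safety.ts", "src/cli.ts"], reviewerConfirmsScope: true },
  ["src/safety.ts", "src/prompts/"],
);
console.log(verdict.verdict); // "verification_failed"
```

Passing tests do not rescue the cycle here: one hands-off hit is enough to fail the conjunction, which is the property a prompt-level instruction cannot guarantee.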

What it catches

This is a real rejection from this repo's audit log:

{
  "event": "reviewer_verdict",
  "cycle_id": "20260417161301_juzs",
  "data": {
    "verdict": "verification_failed",
    "reason": "The diff contains hands-off violations by modifying src/safety.ts and src/reviewer.ts which are explicitly restricted.",
    "hands_off_violations": [
      "src/safety.ts",
      "src/reviewer.ts",
      "src/prompts/"
    ]
  }
}

The bot tried to edit three hands-off paths: two safety-critical files and the src/prompts/ directory. The reviewer caught all three. Cycle rolled back. The entry above is a line from state/generalstaff/PROGRESS.jsonl. Grep for "verdict":"verification_failed" and count the rest.

Dogfooding numbers since 2026-04-15:

223 verified + 27 rejected reviewer verdicts — the gate caught ~10.8% of what the engineer proposed.
2,030 passing tests across 69 test files.
Two pre-launch security audits. First fixed five HIGH/MEDIUM findings. Second caught a symlink bypass on the hands-off check.
Every verified commit in this repo passed the same gate the tool ships with.

To check the count yourself, run grep -c '"verdict":"verification_failed"' state/generalstaff/PROGRESS.jsonl. The gate makes the velocity trustworthy.
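
The same tally can be done programmatically, which also makes the rejection-rate arithmetic explicit: 27 / (223 + 27) = 10.8%. This is a sketch under two assumptions: each JSONL line is one event object shaped like the entry above, and accepted cycles carry the verdict string "verified" (only "verification_failed" appears in the excerpt). The "cycle_start" event in the sample data is a made-up placeholder.

```typescript
// Assumed event shape, inferred from the log excerpt above.
interface ProgressEvent {
  event: string;
  data?: { verdict?: string };
}

// Count reviewer verdicts by value in a JSONL document (one JSON object per line).
function tallyVerdicts(jsonl: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // tolerate blank lines
    const ev = JSON.parse(line) as ProgressEvent;
    if (ev.event !== "reviewer_verdict" || !ev.data?.verdict) continue;
    counts.set(ev.data.verdict, (counts.get(ev.data.verdict) ?? 0) + 1);
  }
  return counts;
}

// Share of reviewer verdicts that were rejections, as a percentage.
function rejectionRate(verified: number, rejected: number): number {
  const total = verified + rejected;
  return total === 0 ? 0 : (rejected / total) * 100;
}

// Sample data for illustration only; "cycle_start" is a hypothetical event name.
const sample = [
  '{"event":"reviewer_verdict","data":{"verdict":"verified"}}',
  '{"event":"reviewer_verdict","data":{"verdict":"verification_failed"}}',
  '{"event":"cycle_start","data":{}}',
].join("\n");

console.log(tallyVerdicts(sample).get("verification_failed")); // 1
console.log(rejectionRate(223, 27).toFixed(1)); // "10.8"
```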

What the gate doesn't catch

Real failure modes from the audit log:

Engineer crashes before producing a diff. Mode-B projects with stub engineer_command, Windows worktree-junction races, missing toolchains — the gate has nothing to verify. Set interactive_only: true on Mode-B projects and declare expected_touches.
Empty-diff cycles. When a project's bot-pickable inventory is thin, the engineer runs cleanly and reports nothing-to-do. The cycle returns verified_weak. Watch substantive landings vs. verified_weak, not raw cycle count. The gs inventory-audit command surfaces this at fleet level.
Scope-match is not correctness. The reviewer confirms the diff matches declared expected_touches and respects hands_off. It does not check correctness. The engineer's tests are the correctness signal — if they pass for the wrong reason, the gate ratifies the cycle.
Push is best-effort. The gate runs at commit time. Pushing to origin is opportunistic and fails silently on offline or auth-expired states. The final-sweep step is load-bearing.
Picker rotation can starve projects. Round-robin within the ready set means some projects may not get selected across a session.
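
The starvation case in the last item follows from simple counting. This sketch assumes a plain round-robin over a fixed ready set and a session that runs fewer cycles than there are ready projects; the real picker's ordering and ready-set rules are not described here, and pickRoundRobin and starved are hypothetical names.

```typescript
// Hypothetical picker: walk the ready set in order, one project per cycle.
function pickRoundRobin(ready: string[], cycles: number): string[] {
  const picked: string[] = [];
  if (ready.length === 0) return picked;
  for (let i = 0; i < cycles; i++) {
    picked.push(ready[i % ready.length]);
  }
  return picked;
}

// Projects in the ready set that never received a cycle this session.
function starved(ready: string[], picked: string[]): string[] {
  const seen = new Set(picked);
  return ready.filter((p) => !seen.has(p));
}

const ready = ["alpha", "beta", "gamma", "delta", "epsilon"];
const picked = pickRoundRobin(ready, 3); // the session only has room for 3 cycles
console.log(starved(ready, picked)); // ["delta", "epsilon"]
```

If every session starts from the same position and ends early, the same projects lose out every time, which is why per-project landings are worth watching alongside the raw cycle count.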

File counterexamples on the issue tracker.

What GeneralStaff is not

Not a Claude wrapper. Multi-provider: claude -p, aider + OpenRouter, Ollama for unattended runs.
Not an alignment tool. It does not make the agent smarter. It catches the agent at cycle boundaries.
Not a SaaS. No hosted offering, no credits, no telemetry, no GeneralStaff server. Export = git clone.
Not a chat UI. Dispatched labor: you write work orders, the dispatcher runs cycles, you read SITREPs.

Why this over the alternatives

vs. Polsia / Devin / closed SaaS: your code lives on their infra, you pay per credit, you can't verify what the bot actually did. GeneralStaff is local-first, BYOK, audit-log-first.
vs. Naive claude -p loops: prompts can be ignored; Boolean gates cannot. The verification gate catches the tail where the engineer goes stupid+industrious (~10.8% of reviewer verdicts in this repo's dogfooding log).
vs. Hand-rolled nightly scripts: what GeneralStaff started as. This is that script, hardened and made inspectable.

Origin

Named for Kurt von Hammerstein-Equord's officer typology: clever/stupid × industrious/lazy. The "general staff" quadrant (clever and industrious) handles execution on behalf of command. The stupid-industrious quadrant (confident officers without judgment) causes unbounded damage. Autonomous coding agents without verification gates live there.

The architecture is the philosophy: gate, hands-off lists, default-off creative roles, open audit log. Built by a wargame designer thinking about AI failure modes the way wargames think about adversarial conditions — structurally, with explicit failure-mode enumeration, with discipline encoded as rules.

Why the gate matters

The failure mode isn't unique to AI. Both the operator and the agent are vulnerable to confident industriousness without judgment; helper syndrome cuts both ways. The verification gate exists because instructions can be ignored by either party, and the bot's enthusiasm tends to amplify the operator's optimism. The gate fires regardless, but it protects only when the operator reads the verdict and acts on it: the gate is necessary, not sufficient.

Hard rules

Enforced in code or by convention. Relaxing any requires a RULE-RELAXATION-<date>.md log committed alongside the change.

1. No creative work delegation by default. Correctness work only. Creative agents are opt-in plugins.
2. File-based state SSOT. No databases. Local desktop UI permitted as a viewer/controller.
3. Sequential cycles for MVP. Parallel worktrees opt-in.
4. Auto-merge off by default. Opt in per-project after 5 clean cycles.
5. Mandatory hands-off lists. Empty list = no registration.
6. Verification gate is load-bearing. Cycle not done until tests pass, diff non-empty, reviewer confirms scope.
7. Code ownership. Bot pushes to bot/work on your remote only.
8. BYOK for LLM providers. API-key default; subscription support for personal use.
9. Open audit log. Full prompts, responses, tool calls, diffs in PROGRESS.jsonl per cycle.
10. Local-first. No SaaS tier, no managed offering.

Full rationale: docs/internal/RULE-RELAXATION-2026-04-15.md.

Quickstart

One-line installer

# macOS / Linux
curl -fsSL https://raw.githubusercontent.com/lerugray/generalstaff/master/install.sh | bash

# Windows PowerShell
irm https://raw.githubusercontent.com/lerugray/generalstaff/master/install.ps1 | iex

The installer clones into ./GeneralStaff/, installs bun if missing (to $HOME/.bun, no root), runs bun install, prints next steps. Safe to re-run.

First-run wizard

gs welcome

Guided setup: provider config, register your first project, run one verified cycle so you see dispatcher → engineer → verification → reviewer end-to-end before trusting it with real work.

Manual flow

Requires git, bash (Git Bash works on Windows), bun 1.2+, claude CLI in PATH.

generalstaff bootstrap /path/to/project "what this project is" --id=myproject
# review .generalstaff-proposal/ output, move hands_off.yaml into place
generalstaff register myproject --path=/path/to/project
generalstaff cycle --project=myproject --dry-run
generalstaff session --budget=90
generalstaff history --lines=20

Bot pushes to bot/work on your remote only. Full config: projects.yaml.example.

Tested configurations

Primary dogfood trail (223 verified cycles) on Windows 11 + Claude Code. macOS bootstrap validated end-to-end 2026-05-01. Real-cycle mileage on macOS/Linux is lighter than on Windows; expect rougher edges on less-trodden paths.

Works alongside

Runtime enforcement at cycle boundaries. Stacks with instruction-layer tools:

AGENTS.md / agents-md — drop-in rules file teaching coding agents to push back on bad requests and verify before claiming done.
lean-ctx — context runtime compressing file reads and search results into compact wire format.
aider + OpenRouter — set engineer_provider: aider to route cycles through Qwen3 Coder (~40× cheaper than Claude Sonnet). Bulk scaffolding; complex work stays on claude.

Strategic-reasoning companion

GeneralStaff gates execution. For pre-queue work (auditing a plan, picking what to ship next, getting an adversarial second opinion), Hammerstein is the companion CLI (Python, MIT). It offers a provider fallback chain (OpenRouter → DeepSeek → Ollama), sub-cent-per-call typical cost, and plain-English summaries.

h audit "<plan>"     # catch scope creep before queueing
h next "<options>"   # strategic ranking when queue depth alone isn't enough
h worth "<proposal>" # opportunity-cost check before committing Claude tokens

Over time, ~/.hammerstein/logs/ accumulates your strategic decisions for curation into your personal corpus.

Wire it into the dispatcher (added in v0.4.0; the advisor was inert until the v0.5.0 fix, so use v0.5.0+). Set advisor.enabled: true per project and GS calls h audit automatically between picker and engineer, passing the proposed task plan plus bounded cycle history. The verdict lands in PROGRESS.jsonl as advisor_verdict. The feature is opt-in (default off, zero overhead). With gate: true, a block verdict skips the cycle (cycle_skipped: advisor_gated). Full setup: docs/ADVISOR.md.
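
The gating rule in that paragraph is small enough to state as code. This is a sketch: it assumes the advisor returns either "pass" or "block" (only "block" is named above), that a block verdict without gate: true is recorded but does not stop the cycle, and the type and function names are placeholders.

```typescript
type AdvisorVerdict = "pass" | "block"; // "pass" is an assumed label

interface AdvisorConfig {
  enabled: boolean; // advisor.enabled, default false
  gate: boolean;    // with gate: true, a block verdict skips the cycle
}

type Decision =
  | { action: "run_cycle" }
  | { action: "cycle_skipped"; reason: "advisor_gated" };

// Decide whether the engineer runs once the advisor has answered.
// When the advisor is disabled it is never consulted, so verdict is undefined.
function decide(cfg: AdvisorConfig, verdict?: AdvisorVerdict): Decision {
  if (cfg.enabled && cfg.gate && verdict === "block") {
    return { action: "cycle_skipped", reason: "advisor_gated" };
  }
  // Assumption: without gate: true the verdict is advisory only.
  return { action: "run_cycle" };
}

console.log(decide({ enabled: true, gate: true }, "block").action);  // "cycle_skipped"
console.log(decide({ enabled: true, gate: false }, "block").action); // "run_cycle"
```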

24/7 heartbeat dispatch (v0.4.0+)

Anthropic separates claude -p and SDK billing into a dedicated credit bucket on 2026-06-15. Scheduled-task launchers that ran on the regular subscription move to that bucket.

GS heartbeat mode sidesteps it: keep an interactive Claude Code session alive via the Stop-hook contract, watch io/inbox.jsonl for action messages, restart-per-message for fresh context (same property -p provides), bill against the Max subscription. Architecture inspired by Siigari/claude-heartbeat; GS port adds an action vocabulary (run_cycle, run_session, digest, status, manual) and structured outbox responses.

# Start the supervisor
# Windows PowerShell (opens a visible cmd window)
.\scripts\heartbeat-run.ps1
# macOS / Linux (run inside tmux/screen)
./scripts/heartbeat-run.sh

# Queue work from any other shell
bun scripts/heartbeat-inbox.ts run_cycle myproject
bun scripts/heartbeat-inbox.ts run_session --max-cycles=3
bun scripts/heartbeat-inbox.ts status

Additive over the existing scheduled-task path — rollback is "stop the supervisor." Full setup, action protocol, ToS framing, latency math: docs/HEARTBEAT.md.
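
For readers scripting against the inbox, the action vocabulary maps naturally onto a validated union type. This is only a sketch: the message fields (action, project) and the function names are assumptions made for illustration, and the authoritative protocol is the one in docs/HEARTBEAT.md.

```typescript
// The five actions named above; the message schema itself is assumed.
const ACTIONS = ["run_cycle", "run_session", "digest", "status", "manual"] as const;
type Action = (typeof ACTIONS)[number];

interface InboxMessage {
  action: Action;
  project?: string; // hypothetical field, e.g. the target of run_cycle
}

function isAction(value: unknown): value is Action {
  return typeof value === "string" && (ACTIONS as readonly string[]).indexOf(value) !== -1;
}

// Parse one line of an inbox JSONL file. Anything malformed or outside the
// vocabulary yields null, so a bad line cannot crash a supervisor loop.
function parseInboxLine(line: string): InboxMessage | null {
  let raw: unknown;
  try {
    raw = JSON.parse(line);
  } catch {
    return null;
  }
  if (typeof raw !== "object" || raw === null) return null;
  const obj = raw as Record<string, unknown>;
  const action = obj["action"];
  if (!isAction(action)) return null;
  const msg: InboxMessage = { action };
  const project = obj["project"];
  if (typeof project === "string") msg.project = project;
  return msg;
}

console.log(parseInboxLine('{"action":"run_cycle","project":"myproject"}'));
console.log(parseInboxLine('{"action":"rm_rf"}')); // null
```

Rejecting unknown actions at parse time keeps the vocabulary closed: a message either names one of the five actions or is dropped.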

Configuration

Defaults stay conservative. Flip per-project in projects.yaml; full schema in projects.yaml.example.

engineer_provider: aider — route to OpenRouter Qwen3 Coder (~$0.05-0.10/cycle).
creative_work_allowed: true — Hard Rule 1 carve-out for creative-draft cycles.
auto_merge: true — auto-merge bot/work after clean cycles. Opt in after 5.
dispatcher.session_budget — cap on USD, tokens, or cycles.
dispatcher.max_parallel_slots: N — N cycles per round in parallel.
advisor.enabled: true — pre-cycle Hammerstein audit (opt-in, v0.4.0+). With gate: true, a block verdict skips the cycle. See docs/ADVISOR.md.
engineer_claim_timeout_minutes: N — kill a stuck engineer early if it emits no task-claim signal within N minutes (v0.5.0+).
customer_facing_smoke — shell probe run after verification on public_facing projects; a non-zero exit fails the cycle (v0.5.0+).

Hard Rules hold regardless of knob state. Every cycle still lands in PROGRESS.jsonl.
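
One way to read the per-project knobs is as overrides layered on conservative defaults. The TypeScript sketch below illustrates that reading only. Field names mirror the knobs above, but the types, the "claude" default for engineer_provider, the nesting of gate under advisor, and withDefaults are assumptions; the real schema is in projects.yaml.example.

```typescript
// Illustrative only: field names follow the per-project knobs listed above.
interface ProjectKnobs {
  engineer_provider: "claude" | "aider"; // assumed default: "claude"
  creative_work_allowed: boolean;        // Hard Rule 1 carve-out, off by default
  auto_merge: boolean;                   // off by default
  advisor: { enabled: boolean; gate: boolean };
  engineer_claim_timeout_minutes?: number;
}

type KnobOverrides = Partial<Omit<ProjectKnobs, "advisor">> & {
  advisor?: Partial<ProjectKnobs["advisor"]>;
};

const CONSERVATIVE_DEFAULTS: ProjectKnobs = {
  engineer_provider: "claude",
  creative_work_allowed: false,
  auto_merge: false,
  advisor: { enabled: false, gate: false },
};

// Merge a project's overrides onto the defaults; anything not set stays conservative.
function withDefaults(overrides: KnobOverrides): ProjectKnobs {
  return {
    ...CONSERVATIVE_DEFAULTS,
    ...overrides,
    advisor: { ...CONSERVATIVE_DEFAULTS.advisor, ...overrides.advisor },
  };
}

const knobs = withDefaults({ engineer_provider: "aider", advisor: { enabled: true } });
console.log(knobs.auto_merge);   // false
console.log(knobs.advisor.gate); // false
```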

Who this is for

Any project you point it at: a SaaS, a research tool, an art piece, a satirical anti-startup, a blog. The dispatcher doesn't care what the project is. It runs correctness work on what you tell it.

Polsia assumes you want to build a profitable SaaS. GeneralStaff doesn't. Bring your own imagination; the tool runs the execution. LLMs asked for "a startup idea" return the mode of their training distribution — generic SaaS. The tool is a GM, not a writer. GMs run the rules; players write the characters.

Hard Rule 1 still holds: the bot does correctness work (tests, infra, pipelines, bug grinding); you do the creative part.

Sister projects

Three open-source projects in the fleet, same posture (your data, your keys, no SaaS):

mission-brain — citation-grounded RAG retrieval over your own writing.
mission-bullet-oss — AI-assisted bullet journal (Ryder Carroll method).
mission-swarm — swarm-sim engine for smoke-testing launch copy.

Documentation

DESIGN.md — architecture (v1–v8, append-only)
CHANGELOG.md — release notes, phase narratives, recently shipped
projects.yaml.example — config schema reference
docs/conventions/ — usage-budget, roadmap, integrations
docs/internal/ — design decisions, phase closures, research notes
docs/HEARTBEAT.md — 24/7 inbox-driven dispatcher mode (experimental, 2026-05-14)
AGENTS.md — cross-platform agent-config (Claude Code, Cursor, Aider, Codex, Zed)
scripts/orchestration/README.md — multi-agent spawn primitives

Contributing

See CONTRIBUTING.md. Correctness PRs welcome. Taste-work PRs need a conversation first (Hard Rule 1). The best bug report is a snippet of your PROGRESS.jsonl showing the failed cycle.

Support

Maintained by one person alongside a day job. No company layer. Support via GitHub Sponsors. SUPPORTERS.md.

License

AGPL-3.0-or-later. Running GeneralStaff as a hosted service requires offering source to users — to prevent the SaaS-fork attack the project positions against.