swarm-orchestrator (agent)

Security Audit: Warn

Health — Pass
  • License — ISC
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Community trust — 70 GitHub stars
Code — Warn
  • process.env — Environment variable access in .github/plugin/hooks/hooks.json
  • fs module — File system access in .github/plugin/hooks/hooks.json
  • process.env — Environment variable access in .github/plugin/hooks/post-tool-use.json
  • fs module — File system access in .github/plugin/hooks/post-tool-use.json
  • process.env — Environment variable access in .github/plugin/hooks/pre-tool-use.json
  • network request — Outbound network request in config/default-agents.yaml
Permissions — Pass
  • Permissions — No dangerous permissions requested
Purpose

This tool acts as a verification and governance layer for AI coding agents (like Copilot, Claude Code, and Codex). Instead of generating code autonomously, it orchestrates external AI agents across isolated git branches and uses evidence-based quality gates—such as build checks, tests, and git diffs—to ensure code safety before merging.

Security Assessment

The tool accesses the file system and reads environment variables via hook configurations, which is expected behavior since it needs to manage git branches and authenticate with external AI APIs (via `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`). It also makes outbound network requests, primarily defined in its default agent configurations, to communicate with these third-party AI services. No hardcoded secrets were detected, and it does not request inherently dangerous OS permissions. Because it orchestrates external AI tools that write and modify code, running it inherently means shell commands execute on your machine.

Overall Risk Rating: Medium. The tool itself is designed around safety and verification, but the underlying architecture requires handing over local file system access and code execution to third-party AI models.

Quality Assessment

The project is actively maintained (last push today) and has gained solid early traction with 70 GitHub stars. It is covered by a CI pipeline and includes 1,159 passing tests, indicating a high standard of engineering rigor. It uses the permissive ISC license, making it safe for commercial and open-source use.

Verdict

Use with caution — the orchestrator is well-engineered and secure, but you must trust the external AI providers and CLI tools it connects to, as they will actively execute and write code on your machine.
SUMMARY

Verification and governance layer for AI coding agents. Parallel orchestration with evidence-based quality gates for Copilot, Claude Code, and Codex.

README.md

Swarm Orchestrator

Verification and governance layer for AI coding agents. Parallel execution with evidence-based quality gates, not autonomous code generation.

This is not an autonomous system builder. It orchestrates external AI agents (Copilot, Claude Code, Codex) across isolated branches, verifies every step with outcome-based checks (git diff, build, test), and only merges work that proves itself. The value is trust in the output, not speed of generation.


License: ISC · CI · Tests: 1159 passing · Node.js 20+ · TypeScript 5.x


Quick Start · What Is This · Quality Benchmarks · Usage · GitHub Action · Recipes · Architecture · Contributing


[Screenshot: Swarm Orchestrator TUI dashboard showing parallel agent execution across waves]


Quick Start

# Install globally
npm install -g swarm-orchestrator
# Or clone and build from source
git clone https://github.com/moonrunnerkc/swarm-orchestrator.git
cd swarm-orchestrator
npm install && npm run build && npm link
# Run against your project with any supported agent
swarm bootstrap ./your-repo "Add JWT auth and role-based access control"

# Use Claude Code instead of Copilot
swarm bootstrap ./your-repo "Add JWT auth" --tool claude-code

# Use Codex
swarm bootstrap ./your-repo "Add JWT auth" --tool codex

See it work before pointing it at your code:

swarm demo demo-fast    # two parallel agents, ~1 min

Requires Node.js 20+, Git, and at least one supported agent CLI installed.

| Agent | Install | Auth |
| --- | --- | --- |
| GitHub Copilot CLI | `npm install -g @github/copilot` | Launch `copilot` and run `/login` (requires Node.js 22+) |
| Claude Code | `npm install -g @anthropic-ai/claude-code` | `ANTHROPIC_API_KEY` |
| Codex | `npm install -g @openai/codex` | `OPENAI_API_KEY` |



What Is This

AI coding agents generate code fast, but without verification, you're merging untested assumptions into your codebase. This orchestrator provides the evidence layer: it runs agents in parallel, checks whether the generated code actually works, and blocks anything that can't prove itself.

What it does: You define a goal. The orchestrator builds a dependency graph, launches steps as dependencies resolve, and manages the full lifecycle: branch creation, agent execution, outcome verification, failure repair, and merge. Every agent runs on its own isolated git branch. Every step is verified by what actually happened: did files change, does the build pass, do tests pass. Steps that can't prove their work don't merge.

What it does not do: This tool does not generate code. It delegates code generation to external agent CLIs (Copilot, Claude Code, Codex) and focuses entirely on orchestration, verification, and quality governance. It is not a replacement for autonomous coding tools; it is a trust layer that wraps them.

Works with Copilot CLI, Claude Code, Codex, or any CLI agent via the adapter interface. Select your tool with --tool globally or per-step in your plan. The orchestrator doesn't care which agent writes the code; it cares whether the code works.

Verification is outcome-based. The engine runs git diff against the branch baseline, executes the project's build and test commands in the worktree, and checks for expected output files. Transcript analysis (parsing what the agent claimed) runs as a supplementary signal, not the primary gate. When a step fails, the RepairAgent receives structured failure context (which checks failed and why, ordered by actionability) instead of blindly retrying the same prompt.
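The core `git_diff` check can be reproduced by hand. The shell sketch below runs it in a disposable repository; it is illustrative only, not the orchestrator's implementation, and the `build_exec`/`test_exec` checks apply the same idea with `npm run build` and `npm test`:

```shell
# Sketch of the required git_diff outcome check in a throwaway repo.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email "demo@example.com"
git config user.name "demo"
git commit -q --allow-empty -m "base"
base=$(git rev-parse HEAD)              # recorded base SHA
echo "export const x = 1;" > util.ts    # simulated agent output
git add util.ts
git commit -q -m "step 1"
# Required check: the step must have produced file changes vs the base SHA.
changed=$(git diff --name-only "$base" HEAD)
[ -n "$changed" ] && echo "git_diff: pass"
```

If `changed` is empty, the step produced no evidence of work and fails regardless of what the transcript claims.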

Also available as a GitHub Action for CI/CD integration and with built-in recipes for common tasks.




Quality Benchmarks

The orchestrator's prompt injection and quality gates front-load requirements that developers normally discover through iterative reprompting. The same goal run through the orchestrator produces output that would take 17-25 follow-up prompts to reach with a standalone agent.

The following comparison used the same goal run through Claude Code unassisted and through the orchestrator. Both outputs were evaluated by an independent reviewer against identical criteria.

Goal: "Create a simple browser-based tic-tac-toe game with HTML, CSS, and vanilla JavaScript. Include a 3x3 grid, alternating X and O turns, win detection, and a reset button."

Results: Orchestrator vs Claude Code (unassisted)

| Category | Claude Code | Orchestrator |
| --- | --- | --- |
| Architecture | A (factory pattern, logic/DOM separation) | A+ (pure ES module + DOM controller, new-array-per-move state) |
| Tests | A- (11 tests, custom harness, storage mock required) | A+ (19 tests, zero dependencies, edge case + error coverage) |
| Accessibility | F (no ARIA, no focus management, no keyboard support) | A+ (skip link, aria-live, positional labels, focus-visible) |
| Responsive design | F (fixed 100px cells, no handling) | A (clamp on all sizes, dvh, edge padding) |
| CSS architecture | C (hardcoded colors, no variables, no media queries) | A+ (20+ custom properties, dark mode, reduced-motion) |
| HTML semantics | C+ (buttons, no landmarks, no meta tags) | A+ (meta description, dual theme-color, SVG favicon, landmarks) |
| Project scaffolding | F (no package.json, no README) | A (zero-dep test runner, structured README) |
| Audio feedback | None | A (Web Audio API, lazy init, per-event frequencies) |

What the orchestrator included that Claude Code did not

17 specific quality attributes were present in orchestrator output and absent from Claude Code output: skip link, aria-live region, positional aria-labels (row/column), focus-visible styles, responsive clamp sizing, CSS custom properties (50+ variable references), prefers-reduced-motion media query, prefers-color-scheme dark mode with full variable overrides, <meta name="description">, dual <meta name="theme-color"> (light and dark), inline SVG favicon, pure logic module separation, copy-on-move game state, audio feedback via Web Audio, separate DOM controller, zero-dependency Node test runner, and structured README with file table.

Each attribute requires at least one follow-up prompt to add when using a standalone agent. Several (full dark mode variable overrides, responsive clamp system, module extraction) require 2-3 rounds. Conservative total: 17-25 prompts eliminated per project.

Note: These results are from a representative run. The underlying agent is non-deterministic, so exact grades and counts may vary between runs. The quality attributes are enforced by prompt injection and gate verification, so they are reliably present, but the specific implementation details (e.g., test count, number of CSS variables) can differ.

How it works

The orchestrator injects quality requirements into every agent prompt before execution begins: accessibility standards (ARIA labels, keyboard navigation, focus-visible, skip links), CSS requirements (custom properties, reduced-motion, color-scheme), HTML metadata (description, theme-color, viewport), and code structure rules (pure logic separation, DOM controller pattern, semantic HTML). Quality gates then verify the output and reject work that doesn't meet the bar, triggering targeted repair with specific failure context.

Standalone agents optimize for "correct and working." The orchestrator adds "accessible, responsive, themed, and structured" before the agent writes a single line. The quality bar comes from the system, not from the user's prompt.
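The injection step described above can be sketched in a few lines. The function and variable names below are illustrative guesses, not the orchestrator's actual API:

```typescript
// Hypothetical sketch of prompt injection: prepend the system-level quality
// bar (from the requirements listed above) to the user's task prompt.
const qualityRequirements: string[] = [
  "Accessibility: ARIA labels, keyboard navigation, focus-visible, skip link",
  "CSS: custom properties, prefers-reduced-motion, prefers-color-scheme dark mode",
  "HTML metadata: meta description, theme-color, viewport",
  "Structure: pure logic module, separate DOM controller, semantic HTML",
];

function injectRequirements(taskPrompt: string): string {
  const bar = qualityRequirements.map((r) => `- ${r}`).join("\n");
  return `${taskPrompt}\n\nQuality requirements (verified by gates):\n${bar}`;
}

console.log(injectRequirements("Build a tic-tac-toe game"));
```

Because the bar is appended by the system, the user's prompt can stay focused on functionality.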

Note: This benchmark covers frontend web projects using Claude Code as the baseline. Copilot CLI and Codex comparisons are in progress and will be added here. Backend, API, and CLI project benchmarks are planned.




Usage

Commands

| Command | Description |
| --- | --- |
| `swarm bootstrap ./repo "goal"` | Analyze repo and generate a plan |
| `swarm run --goal "goal"` | Generate plan and execute in one step |
| `swarm swarm plan.json` | Execute a plan with parallel agents |
| `swarm quick "task"` | Single-agent quick task |
| `swarm use <recipe>` | Run a built-in recipe against current project |
| `swarm recipes` | List available recipes |
| `swarm recipe-info <n>` | Show recipe details and parameters |
| `swarm gates [path]` | Run quality gates on a project |

Key Flags

| Flag | Effect |
| --- | --- |
| `--tool <n>` | Agent to use: `copilot` (default), `claude-code`, `codex` |
| `--governance` | Enable Critic review wave with scoring and auto-pause |
| `--lean` | Enable Delta Context Engine (KB-backed prompt references) |
| `--cost-estimate-only` | Print cost estimate and exit without running |
| `--max-premium-requests <n>` | Abort if estimated premium requests exceed budget |
| `--wrap-fleet` | Use Copilot CLI's native `/fleet` for parallel subagent dispatch |
| `--strict-isolation` | Restrict cross-step context to verified entries only |
| `--pm` | Enable PM Agent plan review before execution |
| `--param key=value` | Set recipe parameters (with `use` command) |
| `--pr auto\|review\|none` | PR behavior after execution |

Examples

# Full-featured run with Claude Code
swarm swarm plan.json --tool claude-code --governance --lean

# Recipe: add tests with vitest targeting 90% coverage
swarm use add-tests --tool codex --param framework=vitest --param coverage-target=90

# Preview cost before committing
swarm swarm plan.json --cost-estimate-only

# Per-step agent selection in plan.json
# { "steps": [
#   { "id": 1, "task": "...", "agentName": "BackendMaster", "cliAgent": "claude-code" },
#   { "id": 2, "task": "...", "agentName": "TesterElite", "cliAgent": "codex" }
# ]}



GitHub Action

Run the orchestrator in CI without installing anything. Outcome-based verification provides the trust layer for unattended execution.

Security note: always pass credentials via the `env:` block, never via `with:` inputs, since GitHub Actions may expose input values in workflow logs. Set a minimal `permissions:` block to limit `GITHUB_TOKEN` scope. See SECURITY.md for full credential handling guidance.

name: AI Swarm - Add Tests
on:
  workflow_dispatch:
    inputs:
      goal:
        description: 'What should the swarm do?'
        default: 'Add comprehensive unit tests for all untested modules'

permissions:
  contents: write
  pull-requests: write

jobs:
  swarm:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: moonrunnerkc/swarm-orchestrator@main
        id: swarm
        with:
          goal: ${{ github.event.inputs.goal }}
          tool: claude-code
          pr: review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          # Add other adapter keys as needed:
          # OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check Results
        if: always()
        run: echo "${{ steps.swarm.outputs.result }}"
| Input | Default | Description |
| --- | --- | --- |
| `goal` | (required) | What the swarm should accomplish |
| `tool` | `copilot` | Agent CLI: `copilot`, `claude-code`, `codex` |
| `recipe` | | Run a built-in recipe instead of a goal |
| `plan` | | Path to a pre-built plan JSON |
| `pr` | `review` | PR behavior: `auto`, `review` (draft), `none` |
| `max-retries` | `3` | Max retry attempts per step |
| `model` | | Model to pass to the agent CLI |

The Action outputs result (JSON with per-step verification status), plan-path, and pr-url. Session artifacts are automatically redacted for known secret values (API keys, tokens) at the end of every run. The agent CLI must be available in the runner; the Action does not install it. See docs/github-action.md for setup instructions.




Recipes

Reusable, parameterized plans for common tasks. Recipes modify existing projects (unlike templates, which create new ones).

swarm recipes                           # list all
swarm recipe-info add-tests             # show details
swarm use add-tests                     # run with defaults
swarm use add-auth --param strategy=session --tool claude-code
| Recipe | Steps | Description | Key Parameters |
| --- | --- | --- | --- |
| `add-tests` | 3 | Add unit tests for untested modules | `framework` (jest/vitest/mocha), `coverage-target` |
| `add-auth` | 4 | Add authentication | `strategy` (jwt/session) |
| `add-ci` | 3 | Add GitHub Actions CI pipeline | |
| `migrate-to-ts` | 4 | Migrate JavaScript to TypeScript | `strict` (true/false) |
| `add-api-docs` | 3 | Generate OpenAPI spec and docs | `format` (openapi/markdown) |
| `security-audit` | 3 | Run security audit and fix findings | |
| `refactor-modularize` | 4 | Break monolithic code into modules | |

Create custom recipes by adding JSON files to templates/recipes/. See docs/recipes.md for the schema and examples.
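docs/recipes.md defines the actual schema; the sketch below is only a guess at the shape implied by the recipe table above, and every field name in it is hypothetical:

```json
{
  "name": "add-tests",
  "description": "Add unit tests for untested modules",
  "parameters": {
    "framework": { "default": "jest", "options": ["jest", "vitest", "mocha"] },
    "coverage-target": { "default": "80" }
  },
  "steps": [
    { "id": 1, "task": "Identify untested modules and plan coverage" },
    { "id": 2, "task": "Write {{framework}} tests targeting {{coverage-target}}% coverage" },
    { "id": 3, "task": "Run the suite and fix failing tests" }
  ]
}
```

Consult the documented schema before relying on any of these field names.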




Architecture

Goal ──> Plan ──> Waves ──> Branches ──> Agents ──> Verify ──> Repair? ──> Merge
  1. Plan generation. A goal becomes numbered steps with declared dependencies, each assigned to a specialized agent. Plans can be generated from a goal, loaded from a template, run from a recipe, or bootstrapped from repo analysis.

  2. Greedy scheduling. Steps launch the moment their dependencies are satisfied. Adaptive concurrency adjusts based on success rates.

  3. Branch isolation. Each step runs on its own git worktree and branch. With --strict-isolation, cross-step context is restricted to verified entries only.

  4. Agent execution. The orchestrator spawns the selected agent CLI (--tool) as a subprocess, injecting the prompt plus dependency context. Transcripts are captured for supplementary analysis.

  5. Outcome verification. The engine checks what actually happened: git diff against the recorded base SHA, build execution, test execution, and expected file existence. Transcript parsing runs as a secondary signal. Steps must prove their work with outcomes, not claims.

  6. Failure repair. Failed steps are classified (build failure, test failure, missing files, no changes) and retried up to three times. Each retry receives structured failure context: which checks failed, the relevant build/test output, and what to fix. The RepairAgent uses outcome-based root causes, not guesswork.

  7. Merge. Verified branches merge to main. Quality gates check the result for scaffold leftovers, duplicate blocks, hardcoded config, README accuracy, test isolation, runtime correctness, accessibility, and test coverage.
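The greedy scheduling rule in step 2 can be sketched compactly. The types and field names below are illustrative, not the orchestrator's real internals:

```typescript
// Greedy scheduling sketch: a step launches the moment it is not done,
// not already running, and every declared dependency has completed.
interface PlanStep {
  id: number;
  dependsOn: number[];
}

function readySteps(
  steps: PlanStep[],
  done: Set<number>,
  running: Set<number>
): PlanStep[] {
  return steps.filter(
    (s) =>
      !done.has(s.id) &&
      !running.has(s.id) &&
      s.dependsOn.every((d) => done.has(d))
  );
}
```

Calling this after every step completion (rather than waiting for a whole wave to finish) is what makes the scheduling greedy.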

Verification checks
| Check | Type | Required | What It Verifies |
| --- | --- | --- | --- |
| Git diff | `git_diff` | Yes | Agent produced file changes vs base SHA |
| File existence | `file_existence` | If specified | Expected output files exist in worktree |
| Build execution | `build_exec` | If script exists | `npm run build` (or detected equivalent) passes |
| Test execution | `test_exec` | If script exists | `npm test` (or detected equivalent) passes |
| Transcript evidence | `transcript` | No | Agent claimed completion (supplementary) |

When outcome checks are present, transcript-based checks are demoted to non-required. A step passes when all required checks pass.
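That pass rule can be stated as code. The types below are assumptions for illustration, not the `verifier-engine.ts` implementation:

```typescript
// Sketch of the pass rule: a step passes when all required checks pass,
// and transcript checks are demoted to non-required whenever at least one
// outcome-based check is present.
interface CheckResult {
  type: "git_diff" | "file_existence" | "build_exec" | "test_exec" | "transcript";
  required: boolean;
  passed: boolean;
}

function stepPasses(checks: CheckResult[]): boolean {
  const hasOutcome = checks.some((c) => c.type !== "transcript");
  return checks.every((c) => {
    const required = c.required && !(hasOutcome && c.type === "transcript");
    return !required || c.passed;
  });
}
```

The demotion means an agent's claim of success can never rescue a step whose build or tests failed, and a failed claim cannot block a step whose outcomes all passed.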

Key modules
| Module | Responsibility |
| --- | --- |
| `swarm-orchestrator.ts` | Greedy scheduler, dependency resolution, merge delegation, cost tracking |
| `worktree-manager.ts` | Git worktree lifecycle: creation, removal, branch operations |
| `branch-merger.ts` | Branch merge strategies: rebase-and-merge, conflict resolution, wave merges |
| `verifier-engine.ts` | Outcome-based verification (git diff, build, test, file existence) + transcript analysis |
| `session-executor.ts` | Agent adapter integration, AgentResult-to-SessionResult mapping |
| `adapters/` | Pluggable agent adapters (copilot, claude-code, codex) |
| `recipe-loader.ts` | Recipe loading, parameterization, listing |
| `repair-agent.ts` | Failure classification, targeted retry with outcome context |
| `plan-generator.ts` | Plan creation, dependency validation, recipe-to-plan conversion |
| `cost-estimator.ts` | Pre-execution cost prediction with model multipliers |
| `knowledge-base.ts` | Cross-run pattern storage, recipe run tracking, cost history |
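The pluggable adapter contract in `adapters/` might look roughly like the sketch below. `AgentResult` is named in the module table above; everything else here is an illustrative guess, not the real interface:

```typescript
// Hypothetical shape of a pluggable agent adapter.
interface AgentResult {
  exitCode: number;
  transcriptPath: string;
}

interface AgentAdapter {
  name: string; // e.g. "copilot", "claude-code", "codex", or a custom CLI
  // Spawn the agent CLI in the step's worktree with the injected prompt.
  run(prompt: string, worktreeDir: string): Promise<AgentResult>;
}

// A trivial stand-in adapter, useful only to show the contract.
const echoAdapter: AgentAdapter = {
  name: "echo",
  async run(_prompt: string, worktreeDir: string): Promise<AgentResult> {
    return { exitCode: 0, transcriptPath: `${worktreeDir}/share.md` };
  },
};
```

Because verification is outcome-based, an adapter only needs to run the CLI and report where the transcript landed; it never has to judge the code itself.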
Output artifacts
runs/<execution-id>/
  session-state.json          # full execution state (resumable)
  metrics.json                # timing, commit count, verification stats
  cost-attribution.json       # per-step estimated vs actual premium requests
  steps/
    step-N/share.md           # raw agent transcript
  verification/
    step-N-verification.md    # outcome-based pass/fail report



Demos

Six built-in scenarios for verifying your setup or seeing the pipeline end-to-end.

Cost note: Demos run real agent sessions against real APIs. Each step consumes at least one premium request (or API call for Claude Code / Codex). Larger demos with expensive models can use significant budget. For example, saas-mvp with o3 (20x multiplier) could consume 160+ premium requests. Use --cost-estimate-only to preview costs before committing.

swarm demo list
swarm demo demo-fast     # ~1 min, two parallel agents
swarm demo <n>        # any scenario

# Preview cost before running
swarm demo api-server --cost-estimate-only
| Scenario | Agents | What Gets Built | Time | Est. Requests (1x model) |
| --- | --- | --- | --- | --- |
| `demo-fast` | 2 | Two independent utility modules | ~1 min | 2 |
| `dashboard-showcase` | 4 | React + Chart.js dashboard, Express API | ~8 min | 4-5 |
| `todo-app` | 4 | React todo with Express backend | ~15 min | 4-5 |
| `api-server` | 6 | REST API with JWT, PostgreSQL, Docker | ~25 min | 6-8 |
| `full-stack-app` | 7 | Full-stack with auth, E2E tests, CI/CD | ~30 min | 7-10 |
| `saas-mvp` | 8 | SaaS MVP with Stripe, analytics, security | ~40 min | 8-12 |



Contributing

npm install && npm run build && npm test

Before submitting a PR: run npm test, run swarm gates ., and keep commits descriptive. TypeScript strict mode, ES2020 target.




License

ISC

Built by Bradley R. Kinnard.
