ultragoal

Tell Claude what you want once. It works until the job is verifiably done — and it gets smarter every time.

You talk to it like a person — a messy, unedited voice note is fine. It asks the few questions it can't answer itself, agrees with you on what "done" means, then works on its own: turn after turn, session after session, until an independent reviewer confirms the work holds up. Then it writes down what it learned, so the next goal starts smarter.

Underneath, this is the workflow Anthropic's engineers describe using with Fable 5: don't steer the model prompt by prompt — design a loop where it self-corrects against honest feedback and manages its own memory. ultragoal packages that whole system into a plugin you install with one command.

Every mechanism in the loop is research-backed — verifier design, evidence ledgers, rubric architecture, memory provenance all trace to published results from Anthropic, DeepSeek, Alibaba, ByteDance, Tencent, and academic agent-systems work. The full mechanism→evidence map lives in docs/research-foundations.md, fed by dated research sweeps in docs/research/.

  BRIEF ──► GOAL ──► LOOP ──► VERIFY ──► DISTILL
   │          │        │         │           │
   ramble    spec    work     fresh-eyes   memory
   (voice)  +rubric  turns    subagent     grows
                        ▲                    │
                        └──── consult ◄──────┘    next session starts smarter

Four parts keep each other honest:

A real definition of done. Every goal becomes a spec whose rubric is checkable by commands — "tests pass", "p95 under 200ms" — never vibes. In the research's words: rubric design is the skill now; a well-designed rubric does more work than the model.
Fresh eyes, not self-review. A separate verifier agent — with no knowledge of how the work was done — re-runs every check and tries to prove the work wrong. Anthropic's guidance is blunt: fresh-context verifiers outperform self-critique. The gate releases only on the verifier's sign-off, and the worker is instructed never to write that verdict itself. (Like everything in Claude Code, this is a prompt-level boundary, not a sandbox — the rigor comes from the separation and the honest rubric, not from locking the worker out of a file.)
A loop that can't quit early. A gate blocks Claude from stopping while the goal is unfinished, and because the goal lives in a file, it survives /clear, restarts, and days away. Goals are per-session: run different goals in different sessions of the same repo at once, each gated independently. It is the same architecture as Claude Code's built-in /goal, with upgrades (see "How the loop actually works" below).
Memory that compounds. Every goal ends by saving verified facts, working patterns, and dead ends into your repo. The continual-learning progression — fail → investigate → verify → distill → consult — runs on autopilot, for your whole team.
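
To make "checkable by commands" concrete, here is a minimal sketch of a rubric whose items pass or fail purely on a command's exit status. It is an illustration under assumptions, not ultragoal's actual spec format: `RubricItem`, `run_rubric`, and `is_done` are hypothetical names.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One rubric line: a human-readable claim plus the command that proves it."""
    claim: str
    check: str  # shell command; exit status 0 means the claim holds


def run_rubric(items: list[RubricItem], timeout: int = 60) -> dict[str, bool]:
    """Run every check command and record pass/fail by exit status."""
    results = {}
    for item in items:
        try:
            proc = subprocess.run(
                item.check, shell=True, capture_output=True, timeout=timeout
            )
            results[item.claim] = proc.returncode == 0
        except subprocess.TimeoutExpired:
            results[item.claim] = False  # a check that hangs counts as failed
    return results


def is_done(results: dict[str, bool]) -> bool:
    """Done only when there is at least one item and every item passed."""
    return bool(results) and all(results.values())
```

Nothing in this shape lets a claim pass on a self-report: an item without a passing command is simply not done.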

And the pitch in one line: you never have to learn prompt engineering. You bring intent; ultragoal writes the expert-grade brief for itself, straight from Anthropic's playbook.

Install

npx ultragoal

An interactive installer walks you through it. First, confirm where it goes: this project is the default (it lands in .claude/settings.json, so teammates get it through git), while --global installs machine-wide instead. Then, optionally, pre-configure the repo: answer five working-style questions and it scaffolds .ultragoal/ plus the CLAUDE.md block on the spot. Flags and subcommands: --yes skips all prompts for CI, --setup pre-configures the repo, and uninstall removes the plugin. The installer wraps Claude Code's native plugin system. Prefer that route directly? Inside Claude Code:

/plugin marketplace add morphaxl/ultragoal
/plugin install ultragoal@ultragoal

Want it available in every project on your machine instead of just this one?

npx ultragoal --global

Autopilot — the recommended way to run goals

npx ultragoal run "checkout is slow, get p95 under 200ms without breaking contract tests"

This is how ultragoal is meant to be used: one command from terminal to running goal loop, at full autonomy — it makes sure the plugin is installed, then launches Claude Code with your brief armed and --dangerously-skip-permissions. Zero prompts of any kind until the goal is verified done. A goal loop only earns its keep when nothing blocks the turns; permission prompts are exactly the babysitting this system exists to remove — the rubric, the verifier, the turn budget, and the fail-open gate are the guardrails. Since Claude can run any command without asking, favor repos you can reset (git is your undo) or a container, and know your three dials: --safe keeps permission guardrails on (auto mode: tools auto-approved within turns, sensitive actions still ask), --worktree runs the goal in a fresh git worktree (an isolated checkout on its own branch — the natural pairing for full autonomy, and how parallel goals on one repo keep out of each other's files), and --headless runs the whole loop non-interactively, exiting when the goal completes.

Requires Claude Code ≥ 2.1.139. The hook scripts are POSIX shell — on Windows, Claude Code runs them via Git Bash (installed with Git), or use WSL. Update with npx ultragoal update — it sweeps every install in one go: user scope plus all per-project pins (project-scoped installs never auto-update on their own, so they go stale silently); restart sessions to apply. Uninstall with npx ultragoal uninstall (add --purge to also remove a repo's .ultragoal/ data). Working in a monorepo or multi-repo workspace? Put .ultragoal/ at the workspace root — the hooks walk up to the nearest one, so all nested repos share a single brain.
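
The "walk up to the nearest one" lookup can be pictured with a short sketch. The real hooks are POSIX shell; this Python version only illustrates the idea, and `find_state_dir` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def find_state_dir(start: Path, name: str = ".ultragoal") -> Optional[Path]:
    """Return the nearest state directory at or above `start`, or None."""
    start = start.resolve()
    # Check the starting directory first, then each ancestor up to the root.
    for candidate in (start, *start.parents):
        state = candidate / name
        if state.is_dir():
            return state
    return None
```

Because the nearest match wins, a nested repo with its own .ultragoal/ would use that one, while repos without one fall through to the workspace root.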

Sixty seconds to your first goal

/ultragoal:goal okay so the checkout flow is slow and users are bouncing, I think it's
the inventory check, we talked about caching it last week, anyway it needs to be under
200ms and definitely don't break the contract tests, oh and there's that weird race
condition ticket too maybe related...

That's a real, unedited ramble — exactly what it's built for. Ultragoal will:

Consult project memory and scan your repo with parallel subagents before asking you anything.
Interview you on the decisions that actually steer the outcome — approach, the definition of "done", what's explicitly out of scope, which tradeoff to favor — never trivia it could look up. Each is a concrete fork with a recommended default, so you ratify fast or override deliberately.
Spec the goal: objective with the why, a rubric where every item has an exact check command, stop conditions, and constraints. It then adversarially reviews its own rubric before showing you.
Recap before building — what it understood you want, which way each decision went (including calls it made for you), what it's about to do, what it will take in rough terms (turns and subagent fan-outs — never time estimates), and how it'll know it's done. Your last cheap moment to redirect, before a single line changes.
Arm the loop on your yes. From here a Stop-hook gate blocks the end of every turn and feeds the remaining rubric back, so Claude keeps working without you prompting each step.
Verify with a separate fresh-context subagent that re-runs every check itself and tries to refute the claims — because models grade their own work generously, and independent verifiers don't.
Distill before it's allowed to finish: verified lessons, working patterns, and dead ends are written to .ultragoal/memory/, so the next goal starts smarter.

Walk away mid-goal, close the laptop, /clear — the goal survives. Next session opens with a banner: "Active goal 'checkout-latency' — turn 9 of 25."

Want better goals from the first try? docs/briefing-guide.md lists the high-value signals — done-criteria, scope edges, constraints, where logs live — that turn a twenty-question interview into a two-question one.

Two kinds of goals

Task goals — "build this, fix this, migrate this." Done means the checklist holds.

Experiment goals — "make this number better." When the brief is an optimization (build time, latency, bundle size, test runtime), ultragoal compiles it into a measure-and-ratchet loop modeled on Karpathy's autoresearch: establish the baseline first, then one change per experiment — commit, measure with an immutable command, keep only if the number strictly improved, git reset if it didn't. Every attempt lands in results.tsv (keeps, discards, and crashes), and since each row carries its commit hash, any discarded idea's full diff stays recoverable. The verifier re-runs the final measurement itself and fails the goal if the measure command was ever touched — no moving goalposts. The same pattern took Shopify from "one-shot 'make it faster' prompts fail" to a 65% faster build, unattended.
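
The keep-or-revert rule can be sketched as a pure function. This is a simplified model under assumptions (lower is better, and measurements arrive as a list), not the plugin's implementation; `Attempt` and `ratchet` are hypothetical names, and the real loop would commit, measure, and git reset around each step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    commit: str             # commit hash of the experiment
    value: Optional[float]  # measured number, or None if the run crashed
    status: str             # "keep", "discard", or "crash"


def ratchet(
    baseline: float, experiments: list[tuple[str, Optional[float]]]
) -> tuple[float, list[Attempt]]:
    """Keep an experiment only if it strictly beats the best value so far."""
    best = baseline
    log = []
    for commit, value in experiments:
        if value is None:
            status = "crash"
        elif value < best:  # strict improvement: ties are discarded
            best, status = value, "keep"
        else:
            status = "discard"  # the real loop would git reset here
        log.append(Attempt(commit, value, status))
    return best, log
```

Every attempt is logged, kept or not, which is what makes a discarded idea recoverable later by its commit hash.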

Either kind starts from the rubric library when the brief matches a known domain: 16 research-backed templates (among them Next.js features, web performance, accessibility, API quality, security, bug fixes, refactors, test health, CI speed, dependency upgrades, CLI tools, docs, React Native, app-store readiness, and realtime stability), with every threshold cited (Core Web Vitals, WCAG 2.2, OWASP 2025, Google's engineering practices) and every item carrying the command that proves it. Templates also recommend skills worth pairing, like Vercel's react best-practices skills from skills.sh.

Commands

Command	What it does
`/ultragoal:goal <brain dump>`	The front door: interview → spec → armed loop → execution
`/ultragoal:status`	Dashboard: rubric progress, turn budget, last verdict, memory health, goal history trends
`/ultragoal:verify`	Independent audit of any goal — fresh-context verifier re-runs every check
`/ultragoal:stop`	Bail out gracefully — pause or abandon, gate releases instantly
`/ultragoal:remember`	Distill lessons from the current session into memory
`/ultragoal:compact`	Memory hygiene pass — merge, generalize, drop stale (nudged every ~10 sessions)
`/ultragoal:setup`	First-run init / change preference knobs (runs automatically on first goal)

What it creates in your repo

Everything the plugin produces is plain markdown you own — editable, diffable, git-shareable. The engine ships in the plugin; the state lives with you.

.ultragoal/
├── config.md            # your knobs — hand-editable
├── stats.tsv            # one row per finished goal: turns, verifier fails, outcome —
│                        #   "rubric design is the skill"; this is its scoreboard
├── goals/
│   ├── active/
│   │   └── <slug>/      # one directory per live goal (concurrent across sessions)
│   │       ├── goal.md  #   the spec: rubric, verification log, decision journal
│   │       └── results.tsv  # experiment goals: every attempt with its commit hash
│   └── archive/         # finished and abandoned goals (their journals feed memory)
└── memory/
    ├── MEMORY.md        # index + fixed slots (commands, invariants, gotchas, hot files)
    ├── facts.md         # what's true of this repo
    ├── patterns.md      # approaches that worked, and why
    └── failures.md      # dead ends, so no future session repeats them

Memory files are two-layered, borrowing the structure of Karpathy's LLM-wiki pattern and Garry Tan's gbrain: compiled truth above the line — rewritten as understanding improves — and an append-only, dated evidence log below it that is never edited. Every claim carries its provenance — [VERIFIED · ran the command], [READ · from docs], [INFERRED], [USER-CORRECTION] — so confident prose can never quietly masquerade as checked fact, and the compaction pass cleans the synthesis without ever touching the evidence. When the repo has moved a lot since memory was last fed, the session banner says so and tells Claude to re-verify before trusting.
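
A minimal sketch of the two-layer idea, assuming a `---` line separates the compiled truth from the evidence log. The separator, the entry format, and the function names are assumptions for illustration; only the four provenance labels come from the description above.

```python
PROVENANCE = {"VERIFIED", "READ", "INFERRED", "USER-CORRECTION"}
DIVIDER = "\n---\n"  # assumed separator between compiled truth and evidence log


def append_evidence(doc: str, date: str, tag: str, note: str) -> str:
    """Add one dated, tagged entry at the end; existing text is left untouched."""
    if tag not in PROVENANCE:
        raise ValueError(f"unknown provenance tag: {tag}")
    if DIVIDER not in doc:
        doc = doc.rstrip("\n") + DIVIDER  # start an empty evidence log
    return doc.rstrip("\n") + f"\n- {date} [{tag}] {note}\n"


def rewrite_compiled(doc: str, new_compiled: str) -> str:
    """Replace the synthesis above the divider; copy the evidence log through unchanged."""
    _, _, log = doc.partition(DIVIDER)
    return new_compiled.rstrip("\n") + DIVIDER + log
```

The split is the design point: compaction only ever calls something like `rewrite_compiled`, so the evidence below the line stays byte-identical.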

Plus a small fenced block in CLAUDE.md (shown to you before it's written) wiring the memory protocol and your chosen style knobs.

Memory is git-committed by default: it's your team's growing brain — every teammate's Claude consults and feeds the same one. Choose local-only at setup if you prefer.

The knobs

Five questions at first run, stored in .ultragoal/config.md, each backed verbatim by Anthropic's official prompting guidance:

Knob	Options (default first)
Action mode	proactive · conservative
Communication	lead-with-outcome · detailed
Scope discipline	polish-welcome · minimal
Memory sharing	git-committed · local-only
Verification	on · off — off lets goals finish on a fully checked rubric + saved lessons, skipping the independent verifier pass

Change them anytime with /ultragoal:setup or by editing the markdown.

How the loop actually works

/goal in Claude Code is a Stop hook under the hood: an evaluator checks a condition after every turn and blocks the stop until it holds. Ultragoal ships that same architecture with four differences:

The model can arm it. Claude can't invoke built-in /goal itself; it can write a goal file, which is all the ultragoal gate needs. One skill takes you from ramble to running loop.
It persists, and it's per-session. Native /goal dies with the session and there's one at a time. The ultragoal gate reads files keyed by session, so a goal spans sessions and days — and different sessions in the same repo can each run their own goal concurrently, with the gate enforcing only the one you armed in the session that's stopping.
The judge runs commands. Native /goal's evaluator only reads the transcript — the self-report channel. Ultragoal's gate is deterministic (free, instant), and completion requires a fresh-context verifier that re-ran the checks itself.
Finishing requires learning. The gate won't release until lessons are distilled to memory. Failed goals distill too — failures.md exists so the next attempt doesn't repeat them.
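
The differences above can be sketched as a single decision function. Claude Code passes a Stop hook a JSON payload that includes `session_id`, and a hook can block the stop by emitting `{"decision": "block", "reason": ...}`. Everything else here is an assumption for illustration: the shape of the `goal` record and the function names are hypothetical, and the shipped hooks are POSIX shell, not Python.

```python
import json
from typing import Optional


def gate_decision(hook_input: dict, goal: Optional[dict]) -> dict:
    """Decide whether a Stop event may proceed for this session."""
    if goal is None or goal.get("session_id") != hook_input.get("session_id"):
        return {}  # no goal armed by this session: let the stop through
    remaining = [item for item, passed in goal["rubric"].items() if not passed]
    if not remaining and goal.get("verifier_pass") and goal.get("distilled"):
        return {}  # rubric checked, verifier signed off, lessons saved
    todo = remaining or ["verifier sign-off or distillation"]
    return {"decision": "block", "reason": "Goal unfinished. Remaining: " + "; ".join(todo)}


def main(stdin_text: str, goal: Optional[dict]) -> str:
    """Fail open: any error yields an empty decision, so a session is never trapped."""
    try:
        return json.dumps(gate_decision(json.loads(stdin_text), goal))
    except Exception:
        return json.dumps({})
```

Note that the decision reads only recorded state. No model call is involved, which is why the gate can be described as deterministic, free, and instant.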

Every goal spec also includes a one-line native /goal fallback, handy for one-off headless runs: claude -p "/goal ...".

Escape hatches

Loops need brakes. Every rubric must carry stop conditions. Every goal has a turn budget (default 25): at the limit the gate demands an honest status report, and if that is ignored it pauses the goal itself. /ultragoal:stop releases the gate instantly, and the gate fails open on any script error, so it cannot trap a session. The gate also binds to the session that armed the goal: open a second Claude session in the same repo for a quick side question and it stays free (the banner tells it how to take the goal over if you want that). Verifier verdicts are cryptographically dull but effective: each one is bound to a hash of the rubric it was issued against, so a stale PASS, or a PASS against a quietly weakened rubric, never releases the gate. The engine has a regression suite (tests/gate-test.sh) that runs in CI on every push.
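
Binding a verdict to its rubric takes only a few lines. The sketch below assumes SHA-256 and a verdict stored as a small record; both are illustrative choices, and `rubric_hash` and `verdict_releases` are hypothetical names rather than the plugin's own.

```python
import hashlib


def rubric_hash(rubric_text: str) -> str:
    """Fingerprint the rubric exactly as written."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()


def verdict_releases(verdict: dict, current_rubric: str) -> bool:
    """A PASS counts only if it was issued against the rubric as it stands now."""
    return (
        verdict.get("result") == "PASS"
        and verdict.get("rubric_hash") == rubric_hash(current_rubric)
    )
```

Editing a single threshold in the rubric changes the hash, so an earlier PASS stops counting until the verifier runs again.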

Footprint

Always-on context cost is a handful of skill descriptions — on the order of a hundred tokens. Everything else loads when invoked. When no goal is active, the gate is a single file-existence check.

Where this comes from

Lance Martin (Anthropic), Designing loops with Fable 5 — loops over prompts; rubric design as the skill; verifier subagents over self-critique; the fail → investigate → verify → distill → consult progression this plugin mechanizes.
Anthropic, Prompting Claude Fable 5 — the verbatim behavior blocks behind the knobs, the memory protocol, and the verification guidance.
Anthropic, Prompting best practices and the Claude Code docs on /goal, hooks, skills, and sub-agents.
Andrej Karpathy, autoresearch — the experiment ratchet behind experiment goals: baseline-first, strict improvement, keep/revert via git, every attempt journaled, the evaluator immutable.
Karpathy's LLM-wiki gist and Garry Tan's gbrain — the memory architecture: compiled truth over append-only evidence, per-claim provenance, lint-style maintenance.

Design rationale, trade-offs, and the competitive landscape live in DESIGN.md.

FAQ

Do I need to know how to prompt? No — that's the point. You bring what only you know (what you want, who it's for, what must not break); ultragoal writes the expert-grade brief for itself. You review a plan in plain English, never author a prompt.

Versus ralph-loop? Ralph re-feeds the same prompt until a promise appears. Ultragoal adds the parts that Lance Martin's Designing loops with Fable 5 argues matter: a rubric with per-item check commands, an independent verifier, persistent cross-session goals, and enforced distillation into memory.

Does the verifier have its own context, or does it grade in the same conversation? Its own. The verifier is a separate subagent with a fresh context window and no access to the worker's reasoning — it only sees the goal file and what it learns by re-running the checks itself. (It does share the Claude Code process and permissions; for absolute isolation on high-stakes work, run /ultragoal:verify from a separate headless session as documented in that skill.)

How do I change my setup answers later? They're just markdown: edit .ultragoal/config.md directly (flip verification to off, change scope, anything), or re-run /ultragoal:setup to be re-asked interactively. Changes apply to the next goal you arm.

Does it spend a lot of tokens? The gate itself is free (no model call). The loop spends what the work needs — that's the point of goal-directed runs. Budgets cap the blast radius; start small (10–15 turns) to calibrate.

Can I run it unattended? Yes — that's the recommended mode: npx ultragoal run "<brief>" launches at full autonomy, and --headless runs the loop to completion with no UI at all. The discipline lives in the rubric, the verifier, and the budget — not in you approving each tool call.

Uninstall? npx ultragoal uninstall removes the plugin and marketplace entry; your .ultragoal/ state stays — it's yours. Add --purge to delete a repo's state too (restores CLAUDE.md byte-for-byte).

License

MIT · Privacy: no data collection — everything is local markdown in your repo.