autoimprove
Autonomous codebase improvement loop for Claude Code
Inspired by karpathy/autoresearch, but for any codebase, not just ML training loops.
What is this?
Karpathy's autoresearch lets an AI agent run ML experiments overnight: modify train.py → measure val_bpb → keep if better, discard if worse → repeat. You wake up to a log of experiments and a better model.
autoimprove does the same thing for your codebase.
Give Claude Code your project, run /autoimprove:improve, and let it iterate autonomously. It proposes a targeted change, scores your codebase before and after using your own tooling (TypeScript, cargo clippy, pytest, golangci-lint, or whatever you already have), keeps the changes that improve the score, reverts the ones that don't, and logs everything. You wake up to a readable log of what worked, what didn't, and a cleaner codebase.
propose → measure BEFORE → implement → measure AFTER → keep ✅ or discard ❌ → log → repeat
A real autoimprove session: 5 iterations, 3 wins, score 50 → 59 (+9 pts) in under 6 minutes
Quick start
# 1. Add the marketplace and install the plugin
/plugin marketplace add benmarte/autoimprove
/plugin install autoimprove@autoimprove
# 2. Auto-detect your stack and see your codebase report
/autoimprove:setup
# 3. The audit shows what's wrong and offers to start fixing
# Or run the audit anytime for a fresh check
/autoimprove:audit
# 4. For unattended runs (e.g. overnight), use improve directly
/autoimprove:improve 20
# Or focus on a specific task
/autoimprove:improve 10 "Replace all any types with proper interfaces"
# 5. Review in the morning
cat .claude/autoimprove/log.md
git log --oneline # one commit per winning experiment
git show HEAD # inspect the latest win
That's it. No config is required upfront: /autoimprove:setup fingerprints your project, writes .claude/autoimprove/config.md, and immediately runs an audit showing your codebase's deficiencies ranked by efficiency.
Upgrading
If you already have the upgrade command
/autoimprove:upgrade
If you don't have the upgrade command (older installs)
The plugin system caches marketplace clones locally. If your install predates the upgrade command, you need to update the marketplace clone first:
# 1. Update the marketplace clone
cd ~/.claude/plugins/marketplaces/autoimprove && git pull origin main
# 2. Reinstall the plugin
/plugin update autoimprove@autoimprove
If /plugin update still shows "already at the latest version", uninstall and reinstall:
/plugin uninstall autoimprove@autoimprove
/plugin install autoimprove@autoimprove
After this, /autoimprove:upgrade will be available for all future updates.
Auto-update check
autoimprove checks for new releases once per day on session start. If an update is available, you'll see:
Update available: v1.2.0 → v1.3.0
Run /autoimprove:upgrade to update.
The check is lightweight (single GitHub API call, 3s timeout, cached for 24 hours) and never blocks startup.
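The once-per-day behaviour can be pictured as a cache-freshness test. The following is a minimal sketch, not the plugin's actual hook (which ships as a shell script); the function name `should_check_for_update` and the timestamp-based cache are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

CACHE_TTL = timedelta(hours=24)  # from the text: result is cached for 24 hours
REQUEST_TIMEOUT_SECONDS = 3      # from the text: the API call gives up after 3s


def should_check_for_update(last_checked: Optional[datetime], now: datetime) -> bool:
    """Return True when there is no cached check or the cached one has expired."""
    if last_checked is None:
        return True
    return now - last_checked >= CACHE_TTL
```

A hook built this way would skip the network call entirely on most session starts, which is consistent with the claim that the check never blocks startup.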
How it works
1. Setup (once per project)
/autoimprove:setup scans your project root to detect:
- Language and framework
- Package manager (npm, cargo, poetry, uv, etc.)
- Test runner (pytest, jest, go test, rspec, etc.)
- Type checker (tsc, mypy, pyright, etc.)
- Linter (eslint, ruff, golangci-lint, rubocop, etc.)
It writes a .claude/autoimprove/config.md file in your project root: a plain Markdown config that maps your specific tools to a 0–100 composite quality score. You can edit this file to customise the loop for your project.
2. Isolated experiments via git worktrees
Every experiment runs in a separate git worktree, with its own directory and its own branch, completely isolated from your main codebase:
your-project/                          ← main branch (never touched during experiments)
  .claude/autoimprove/worktrees/       ← gitignored, auto-created
    experiment-001/                    ← branch: autoimprove/experiment-001
    experiment-002/                    ← branch: autoimprove/experiment-002
    experiment-003/                    ← branch: autoimprove/experiment-003
- Winning experiments get squash-merged back to main as a clean commit
- Losing experiments have the worktree and branch deleted; nothing touches main
- Your working directory is read-only for the entire session
- All worktrees are cleaned up automatically at session end
No more git checkout -- . rollbacks. No risk of a broken experiment corrupting your codebase.
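The worktree lifecycle above maps onto a handful of standard git commands. The sketch below only builds the command lists, so it is safe to run anywhere; the helper names are hypothetical, and the exact commands the plugin issues are not shown in this README, so treat the sequence as an assumption.

```python
from typing import List, Tuple

WORKTREE_ROOT = ".claude/autoimprove/worktrees"  # directory shown in the layout above


def experiment_names(n: int) -> Tuple[str, str]:
    """Return (branch, worktree_path) for experiment number n."""
    name = f"experiment-{n:03d}"
    return f"autoimprove/{name}", f"{WORKTREE_ROOT}/{name}"


def create_commands(n: int) -> List[List[str]]:
    branch, path = experiment_names(n)
    # `git worktree add -b` creates the branch and checks it out in a new directory.
    return [["git", "worktree", "add", "-b", branch, path]]


def keep_commands(n: int, message: str) -> List[List[str]]:
    branch, path = experiment_names(n)
    return [
        ["git", "merge", "--squash", branch],   # stage the experiment's changes on main
        ["git", "commit", "-m", message],       # one commit per winning experiment
        ["git", "worktree", "remove", "--force", path],
        ["git", "branch", "-D", branch],        # -D: a squash merge leaves the branch "unmerged"
    ]


def discard_commands(n: int) -> List[List[str]]:
    branch, path = experiment_names(n)
    # A losing experiment never reaches main: drop the directory, then the branch.
    return [
        ["git", "worktree", "remove", "--force", path],
        ["git", "branch", "-D", branch],
    ]
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` from the repository root if you want to try the sequence by hand.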
3. The audit
Before diving into fixes, /autoimprove:audit scans your codebase and shows exactly what needs work:
━━━ Codebase Audit ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Current Score: 61/100
Type safety:  24/40  (16 pts to max)
Build:        20/20  (maxed)
Tests:        10/30  (20 pts to max)
Lint:          7/10  (3 pts to max)
━━━ Fastest Path to 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
#  Area         Gap    Issues       Est. iterations  Efficiency
1  Type safety  16pts  8 errors     3 iterations     5.3 pts/iter ← best
2  Lint         3pts   2 warnings   1 iteration      3.0 pts/iter
3  Tests        20pts  0/4 covered  7 iterations     2.9 pts/iter
Total: ~11 iterations to reach 100/100
⚡ Estimated token usage: ~250K tokens (rough estimate, actual usage varies)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The audit ranks areas by efficiency (points gained per iteration) so you fix the highest-impact issues first. It then offers to start fixing interactively, area by area, or you can run /autoimprove:improve directly.
Setup auto-runs the audit after generating your config, so first-time users see this report immediately.
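The efficiency ranking in the sample report is just the gap divided by the estimated iterations. A small sketch reproduces the three rows; the `Area` class and `rank_by_efficiency` helper are hypothetical names, and the iteration estimates are taken from the report as given, since the README does not say how the audit derives them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Area:
    name: str
    gap_pts: int          # points still missing in this area
    est_iterations: int   # estimate copied from the sample report

    @property
    def efficiency(self) -> float:
        """Points gained per iteration."""
        return self.gap_pts / self.est_iterations


def rank_by_efficiency(areas: List[Area]) -> List[Area]:
    """Most points per iteration first."""
    return sorted(areas, key=lambda a: a.efficiency, reverse=True)


areas = [Area("Tests", 20, 7), Area("Type safety", 16, 3), Area("Lint", 3, 1)]
ranked = rank_by_efficiency(areas)
for area in ranked:
    print(f"{area.name}: {area.efficiency:.1f} pts/iter")
print("Total iterations:", sum(a.est_iterations for a in areas))
```

Run as written, this lists Type safety (5.3), Lint (3.0), then Tests (2.9), with a total of 11 iterations, matching the report.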
4. The score
Every iteration, the loop measures your codebase on four axes:
| Metric | Weight | What it checks |
|---|---|---|
| Type / compile errors | 40 pts | tsc --noEmit, cargo check, go build, mypy, etc. |
| Build success | 20 pts | Does the project build without errors? |
| Test pass rate | 30 pts | (passing / total) × 30 |
| Lint errors | 10 pts | eslint, ruff, clippy, golangci-lint, etc. |
If a metric doesn't apply (no tests yet, no linter configured), its weight is redistributed across the others.
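To make the weighting concrete, here is a minimal sketch of a composite score with redistribution. It assumes the missing weight is spread proportionally over the remaining metrics, which the README does not specify; it also takes each metric as a fraction between 0 and 1, because only the test formula is given explicitly. `composite_score` is a hypothetical name.

```python
from typing import Dict, Optional

# Weights from the table above.
WEIGHTS = {"type": 40, "build": 20, "tests": 30, "lint": 10}


def composite_score(ratios: Dict[str, Optional[float]]) -> float:
    """Combine per-metric fractions (0..1) into a 0-100 score.

    A value of None marks a metric that does not apply; its weight is
    redistributed proportionally across the others (an assumption).
    """
    active = {m: w for m, w in WEIGHTS.items() if ratios.get(m) is not None}
    if not active:
        return 0.0
    scale = 100 / sum(active.values())
    return sum(ratios[m] * w * scale for m, w in active.items())
```

With fractions of 0.6, 1.0, 1/3 and 0.7 for type, build, tests and lint, the result is 61, the starting score in the sample audit. Marking tests as `None` rescales the other three weights from a total of 70 up to 100.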
5. The loop
Each iteration prints visible progress so you always know what's happening:
━━━ Iteration 1/5 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PROPOSE: Targeting error handling in src/api/client.ts
SNAPSHOT: Measuring BEFORE score...
IMPLEMENT: Adding try/catch to unhandled async calls
MEASURE: Measuring AFTER score...
DECIDE: 85 → 89 (+4 pts) → KEPT ✅
LOG: Recorded to .claude/autoimprove/log.md
Steps per iteration:
- Creates a fresh git worktree + branch (autoimprove/experiment-NNN)
- Proposes one bounded improvement with an explicit hypothesis, for example: "I will fix the three unhandled promise rejections in api/invoices.ts because I expect it to reduce TypeScript errors and improve the type score by ~8 points"
- Measures the score inside the worktree (BEFORE)
- Implements the change inside the worktree (surgical: 1–3 files at most)
- Measures again (AFTER)
- Keeps (squash-merges to main and deletes the worktree) if AFTER ≥ BEFORE
- Discards (deletes the worktree and branch, main untouched) if AFTER < BEFORE
- Logs the result to .claude/autoimprove/log.md
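The keep-or-discard rule in these steps can be sketched as a small simulation. It models each experiment as a score delta instead of running real tooling; `decide` and `simulate` are hypothetical names, and the only rule taken from the README is that a change is kept when AFTER is at least BEFORE.

```python
from typing import List, Tuple

LogRow = Tuple[int, int, int, str]  # (iteration, before, after, decision)


def decide(before: int, after: int) -> str:
    """Keep a change when the score did not drop (AFTER >= BEFORE)."""
    return "KEPT" if after >= before else "DISCARDED"


def simulate(start_score: int, deltas: List[int]) -> Tuple[int, List[LogRow]]:
    """Apply each candidate delta in turn; only kept changes move the baseline."""
    score = start_score
    log: List[LogRow] = []
    for i, delta in enumerate(deltas, start=1):
        before, after = score, score + delta
        decision = decide(before, after)
        if decision == "KEPT":
            score = after  # the squash-merge makes this the new baseline
        log.append((i, before, after, decision))
    return score, log


final, log = simulate(85, [4, -3, 0])
for row in log:
    print(row)
print("final score:", final)
```

Note that under this rule a change that leaves the score unchanged is still kept, so the baseline can never go down during a session.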
6. The log
After each iteration, .claude/autoimprove/log.md gets an entry like:
## Iteration 4 - 2026-03-11 02:14
**Hypothesis:** Replace 3 `any` types in convex/invoices.ts with proper TypeScript interfaces
**Branch:** autoimprove/experiment-004
**Files changed:** convex/invoices.ts
**Before:** 74/100 - type: 28, build: 20, tests: 18, lint: 8
**After:** 82/100 - type: 36, build: 20, tests: 18, lint: 8
**Decision:** KEPT ✅ (squash-merged to main, worktree deleted)
**Reason:** Eliminated 2 TS errors by typing the invoice mutation arguments properly
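Because each entry follows the same labelled layout, the log is easy to post-process. The sketch below pulls the totals and the decision out of one entry; it relies only on the `**Before:**`, `**After:**` and `**Decision:**` labels shown above, and `summarise_entry` is a hypothetical helper, not part of the plugin.

```python
import re
from typing import Dict, Optional, Union

SCORE_LINE = re.compile(r"\*\*(Before|After):\*\*\s*(\d+)/100")
DECISION_LINE = re.compile(r"\*\*Decision:\*\*\s*(KEPT|DISCARDED)")


def summarise_entry(entry: str) -> Dict[str, Optional[Union[int, str]]]:
    """Extract before/after totals, their difference, and the decision."""
    scores = {label.lower(): int(value) for label, value in SCORE_LINE.findall(entry)}
    before, after = scores.get("before"), scores.get("after")
    decision = DECISION_LINE.search(entry)
    return {
        "before": before,
        "after": after,
        "delta": after - before if before is not None and after is not None else None,
        "decision": decision.group(1) if decision else None,
    }


sample = """**Before:** 74/100 - type: 28, build: 20, tests: 18, lint: 8
**After:** 82/100 - type: 36, build: 20, tests: 18, lint: 8
**Decision:** KEPT (squash-merged to main, worktree deleted)"""
print(summarise_entry(sample))
```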
Commands
| Command | Description |
|---|---|
| /autoimprove:setup | Detect stack, generate config, and run initial audit |
| /autoimprove:audit | Scan codebase for deficiencies and get a prioritized fix plan |
| /autoimprove:improve [N] ["focus"] | Run N iterations of the loop (default: 5), optionally focused on a specific task |
| /autoimprove:continue [N] ["focus"] | Resume an interrupted session; inherits remaining iterations and focus from the log |
| /autoimprove:status | Show a summary of all runs from .claude/autoimprove/log.md |
| /autoimprove:upgrade | Check for and install the latest version |
Supported languages
| Language | Type check | Build | Tests | Lint |
|---|---|---|---|---|
| TypeScript / JavaScript | tsc --noEmit | npm/pnpm/yarn/bun build | jest / vitest / mocha | eslint |
| Next.js / Nuxt / Remix / Astro | tsc --noEmit | framework build cmd | jest / vitest | eslint |
| Python | mypy / pyright | n/a | pytest | ruff / flake8 / pylint |
| Go | go build ./... | go build | go test ./... | golangci-lint / go vet |
| Rust | cargo check | cargo build | cargo test | cargo clippy |
| Ruby | sorbet (if configured) | n/a | rspec / minitest | rubocop |
| Java / Kotlin | mvn compile / ./gradlew build | same | mvn test / ./gradlew test | checkstyle / ktlint |
| C# / .NET | dotnet build | dotnet build | dotnet test | dotnet format --verify-no-changes |
| PHP | phpstan | n/a | phpunit | phpcs |
| Swift | swift build | swift build | swift test | swiftlint |
| Any Makefile project | make check / make typecheck | make build | make test | make lint |
Don't see your stack? Edit .claude/autoimprove/config.md after setup to add your own commands.
Customising .claude/autoimprove/config.md
After running /autoimprove:setup, edit the generated .claude/autoimprove/config.md to tailor the loop to your project:
## Improvement Areas
- Check all Convex mutations have auth guards
- Replace fetch() calls with our internal apiClient wrapper
- Ensure every page component has a loading.tsx sibling
## Files to Never Modify
- convex/schema.ts
- src/generated/
- migrations/
- .env.local
You can also override any auto-detected command, change scoring weights, or add custom shell commands as additional metrics.
Focused improvements
You can focus the loop on a specific task directly from the command, with no config editing needed. Just pass a quoted string:
# Focus on type safety
/autoimprove:improve 10 "Replace all any types with proper TypeScript interfaces"
# Focus on a specific directory
/autoimprove:improve 5 "Fix all lint warnings in src/components/dashboard/"
# Focus on tests
/autoimprove:improve 10 "Add unit tests for every exported function in lib/billing/"
# Focus on a migration
/autoimprove:improve 20 "Replace all raw fetch() calls with the apiClient wrapper from lib/api-client.ts"
When a focus string is provided, every iteration targets that task. The loop breaks it into file-by-file sub-tasks and chips away one per iteration until the focus is fully addressed or iterations run out.
Without a focus string, the loop rotates through all areas listed in your .claude/autoimprove/config.md as usual.
Alternative: edit the config
For recurring focus areas, you can also edit the Improvement Areas section in .claude/autoimprove/config.md directly:
## Improvement Areas
- Replace every `any` type with a proper TypeScript interface or type alias
This is useful when you want the focus to persist across multiple sessions without re-typing it.
Tips for focused runs
- Be specific. "Fix type errors" is vague. "Replace any with proper types in convex/ mutations" gives the loop a clear target.
- One concern at a time works best. The loop makes surgical 1–3 file changes per iteration, so a narrow focus means every iteration chips away at the same problem.
- Match iteration count to scope. If you have ~20 files to fix, run /autoimprove:improve 20 "..." so each iteration can tackle one file.
- Use "Files to Never Modify" in the config to protect areas you don't want touched during a focused run.
Resuming interrupted sessions
If your session gets interrupted (Ctrl+C, context limit, crash), you can pick up where you left off:
# Resume with remaining iterations and same focus
/autoimprove:continue
# Resume but only run 3 more iterations
/autoimprove:continue 3
# Resume with a different focus
/autoimprove:continue "New focus area"
# Override both
/autoimprove:continue 5 "Fix error handling in api/"
The continue command reads .claude/autoimprove/log.md to find the interrupted session, inherits its settings, and picks up from the next iteration. Iteration numbering continues seamlessly (e.g., if you completed 4/10, it resumes at 5/10).
If the codebase has changed since the interrupted session (you made manual commits), autoimprove will warn you and re-measure the baseline.
Check /autoimprove:status to see if you have an interrupted session to resume.
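The resume arithmetic is simple enough to show directly. This sketch assumes that numbering always continues from the last completed iteration and that an explicit count replaces the inherited remainder; the README states the first behaviour (4/10 resumes at 5/10) but not how numbering is displayed when a count is overridden. `resume_plan` is a hypothetical name.

```python
from typing import Optional, Tuple


def resume_plan(completed: int, planned: int, override: Optional[int] = None) -> Tuple[int, int]:
    """Return (next_iteration_number, iterations_to_run) for a resumed session."""
    remaining = max(planned - completed, 0)
    # An explicit count (e.g. `/autoimprove:continue 3`) wins over the remainder.
    to_run = override if override is not None else remaining
    return completed + 1, to_run


print(resume_plan(4, 10))     # plain /autoimprove:continue
print(resume_plan(4, 10, 3))  # /autoimprove:continue 3
```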
What the loop improves
The loop rotates through these universal improvement areas (and adds language-specific ones based on your stack):
- Type safety: fix type errors, replace any / interface{} / untyped constructs
- Error handling: unhandled promises, bare catch {}, swallowed errors
- Dead code: unused imports, variables, unreachable branches
- Code duplication: extract repeated logic (3+ occurrences) into shared utilities
- Naming & readability: cryptic names, functions over ~50 lines
- Performance: N+1 query patterns, missing memoization, unnecessary allocations
- Security: hardcoded secrets, missing input validation, unguarded auth routes
- Tests: add a test for the most critical untested function, fix flaky tests
Safety
The loop is designed to be safe to run unattended:
| Rule | Detail |
|---|---|
| Never touches lock files | package-lock.json, Cargo.lock, go.sum, Gemfile.lock, etc. |
| Never touches generated files | Migrations, protobuf output, OpenAPI generated code |
| Never touches secrets | .env, .env.local, any secrets file |
| Never deploys or publishes | No git push, npm publish, cargo publish, etc. |
| Requires clean git state | Won't start if git status shows uncommitted changes |
| Experiments in isolated worktrees | Each experiment is on its own branch; main is never modified mid-session |
| Losers deleted, not rolled back | Failed experiments: worktree deleted, branch deleted, main untouched |
| Winners squash-merged | One clean commit per winning experiment, easy to review with git log |
| Pauses every 10 iterations | Cleans up worktrees, writes summary, waits for human review |
You always review and push: winning experiments are committed locally as squash-merges, but the loop never pushes on your behalf.
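A guard for the "never touches" rules could look like the sketch below. The file names come from the safety table and the directories from the sample "Files to Never Modify" list; the matching logic and the `is_protected` helper are assumptions, since the README does not describe how the plugin enforces these rules.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# File names from the safety table; ".env.*" covers variants such as .env.local.
PROTECTED_NAMES = ["package-lock.json", "Cargo.lock", "go.sum", "Gemfile.lock", ".env", ".env.*"]
# Directories from the sample "Files to Never Modify" list.
PROTECTED_DIRS = ["migrations", "src/generated"]


def is_protected(path: str) -> bool:
    """Return True if a repo-relative path must not be modified by the loop."""
    p = PurePosixPath(path)
    if any(fnmatchcase(p.name, pattern) for pattern in PROTECTED_NAMES):
        return True
    return any(p.as_posix().startswith(d + "/") for d in PROTECTED_DIRS)


for candidate in ["src/app.ts", ".env.local", "migrations/001_init.sql", "web/package-lock.json"]:
    print(candidate, is_protected(candidate))
```

Checking each proposed file against a list like this before implementing a change is one way to keep an unattended run away from lock files and secrets.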
Plugin structure
autoimprove/
├── .claude-plugin/
│   ├── plugin.json          # Plugin manifest
│   └── hooks/
│       └── hooks.json       # SessionStart hook registration
├── hooks/
│   └── sessionstart.sh      # update check on startup (once per day)
├── skills/
│   ├── audit/
│   │   └── SKILL.md         # Codebase deficiency scan, prioritized report, interactive fix loop
│   ├── detect-stack/
│   │   └── SKILL.md         # Fingerprints project, writes .claude/autoimprove/config.md
│   ├── worktree/
│   │   └── SKILL.md         # Creates/manages/cleans up git worktrees per experiment
│   ├── improve-loop/
│   │   └── SKILL.md         # Core loop: worktree → propose → implement → measure → merge/delete
│   ├── measure/
│   │   └── SKILL.md         # Internal scoring utility (used by audit and improve-loop)
│   └── rollback/
│       └── SKILL.md         # Emergency cleanup of all experiment worktrees
└── commands/
    ├── audit.md             # /autoimprove:audit
    ├── continue.md          # /autoimprove:continue [N] ["focus"]
    ├── setup.md             # /autoimprove:setup
    ├── improve.md           # /autoimprove:improve [N] ["focus"]
    ├── status.md            # /autoimprove:status
    └── upgrade.md           # /autoimprove:upgrade (check for updates)
Example run
Here's what a real overnight session looks like. This is from a Next.js + Convex project starting at a score of 61/100:
## Iteration 1 - 23:04
**Hypothesis:** Replace 4 implicit `any` types in `convex/invoices.ts` with proper interfaces
**Files changed:** convex/invoices.ts
**Before:** 61/100 - type: 24, build: 20, tests: 10, lint: 7
**After:** 69/100 - type: 32, build: 20, tests: 10, lint: 7
**Decision:** KEPT ✅
**Reason:** Removed 4 TS7006 implicit-any errors by typing mutation arguments
## Iteration 5 - 23:37
**Hypothesis:** Move ExpenseList to a server component, since it only reads data and has no interactivity
**Branch:** autoimprove/experiment-005
**Files changed:** components/ExpenseList.tsx
**Before:** 71/100 - type: 32, build: 20, tests: 10, lint: 9
**After:** 68/100 - type: 26, build: 20, tests: 10, lint: 12
**Decision:** DISCARDED ❌ (worktree deleted, main untouched)
**Reason:** Removing "use client" broke useQuery hook; must stay client component.
## Iteration 8 - 00:02
**Hypothesis:** Add unit tests for calculateTaxEstimate(), the most complex function, with zero coverage
**Files changed:** lib/tax.test.ts (new)
**Before:** 78/100 - type: 36, build: 20, tests: 10, lint: 10
**After:** 84/100 - type: 36, build: 20, tests: 16, lint: 10
**Decision:** KEPT ✅
**Reason:** 2 new tests passing, covers basic and edge-case tax bracket logic
━━━ Session Complete ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Score: 61 → 84 (+23 pts)
Iterations: 10 total - 9 kept ✅, 1 discarded ❌
Merged commits:
• abc1234 autoimprove(001): Replace 4 implicit any types
• def5678 autoimprove(002): Add error boundaries
• ...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
See autoimprove-log.example.md for the full 10-iteration session with summary table.
Contributing
PRs welcome! Especially:
- New language profiles in detect-stack/SKILL.md
- Better improvement area prompts for specific frameworks
- Example .claude/autoimprove/config.md files for common stacks
License
MIT