skill-optimizer

Benchmark and self-optimize SDK, CLI, and MCP guidance so every agent model can use your tool reliably.

skill-optimizer runs your SDK / CLI / MCP docs against multiple LLMs, measures whether they call the right actions with the right arguments, and iteratively rewrites your SKILL.md / docs until a floor score is met across every model.

Quickstart

git clone https://github.com/bucurdavid/skill-optimizer
cd skill-optimizer
npm install
export OPENROUTER_API_KEY=sk-or-...

# Scaffold config for your surface type (sdk | cli | mcp)
npx tsx src/cli.ts init cli

init cli creates a skill-optimizer/ directory with:

skill-optimizer.json — the main config (task generation enabled by default)
cli-commands.json — command manifest template (used as fallback if code-first discovery finds nothing)

init sdk creates skill-optimizer.json only. init mcp creates skill-optimizer.json + tools.json.

Open skill-optimizer/skill-optimizer.json and fill in these fields:

| Field | What it does | Set it to |
| --- | --- | --- |
| `target.repoPath` | Root of the project being benchmarked | Absolute or relative path to your repo |
| `target.discovery.sources` | Source files to scan for callable methods/commands/tools | e.g. `["../src/index.ts"]` or `["../src/server.ts"]` |
| `target.skill` | Docs file the optimizer will edit | Path to your `SKILL.md` or equivalent guidance doc |
| `benchmark.models` | Models to benchmark | Valid OpenRouter model IDs |

For CLI and MCP surfaces: if code-first discovery yields nothing, edit the companion manifest (cli-commands.json or tools.json) with your real commands/tools — the config already points to it as a fallback.

Tasks are generated automatically from your discovered surface — you don't need to write them manually.

Then run the benchmark:

npx tsx src/cli.ts run --config ./skill-optimizer/skill-optimizer.json

How it works

1. Discover the callable surface (SDK methods / CLI commands / MCP tools) via tree-sitter or a manifest.
2. Scope the surface with target.scope.include / target.scope.exclude globs.
3. Generate tasks: one prompt per in-scope action, coverage-guaranteed.
4. Benchmark: every configured model attempts every task; a static evaluator checks action calls and arguments.
5. Verdict: PASS/FAIL against two gates (per-model floor, weighted average).
6. Optimize: mutate SKILL.md / docs inside allowedPaths, re-benchmark, accept only if both gates hold, and roll back if not.
7. Recommendations: on FAIL, one critic call summarizes what to improve manually.

Configuration reference

All configuration lives in a single skill-optimizer.json file.

`target` fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `surface` | `"sdk" \| "cli" \| "mcp"` | required | Type of callable surface |
| `repoPath` | `string` | `.` | Path to the target repo |
| `skill` | `string \| { source: string; cache?: boolean }` | none | Path to SKILL.md |
| `discovery.mode` | `"auto" \| "manifest"` | `"auto"` | How to discover actions |
| `discovery.sources` | `string[]` | none | Source files for tree-sitter discovery |
| `discovery.language` | `"typescript" \| "python" \| "rust"` | none | Language for code-first discovery |
| `discovery.fallbackManifest` | `string` | none | Path to manifest JSON when code-first discovery is incomplete |
| `sdk.language` | `"typescript" \| "python" \| "rust"` | none | SDK language |
| `sdk.entrypoints` | `string[]` | none | SDK entry files |
| `cli.commands` | `string` | none | Path to CLI commands manifest JSON |
| `mcp.tools` | `string` | none | Path to MCP tools manifest JSON |

`benchmark` fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `format` | `"pi"` | `"pi"` | LLM transport format |
| `apiKeyEnv` | `string` | `OPENROUTER_API_KEY` | Env var name for the API key |
| `timeout` | `number` | `240000` | Timeout per model call, in milliseconds |
| `models` | `Array<{ id: string; name: string; tier: "flagship"\|"mid"\|"low"; weight?: number }>` | required | Models to benchmark |
| `taskGeneration.enabled` | `boolean` | `false` | Whether to generate tasks automatically (the config scaffolded by `init` sets this to `true`) |
| `taskGeneration.maxTasks` | `number` | `10` | Max tasks to generate (must be >= scope size) |
| `taskGeneration.seed` | `number` | `1` | RNG seed for reproducible generation |
| `output.dir` | `string` | `benchmark-results/` | Where reports are saved |
| `verdict.perModelFloor` | `number` | `0.6` | Minimum per-model pass fraction for a PASS verdict |
| `verdict.targetWeightedAverage` | `number` | `0.7` | Minimum weighted average across all models for a PASS verdict |
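
For readers who think in types, the `benchmark` fields above can be read as roughly the following TypeScript shape. This is only a sketch: the type names and the `resolveBenchmarkDefaults` helper are hypothetical and are not taken from the tool's source. Only the field names, types, and defaults come from the table, plus the documented model weight default of 1.0.

```typescript
// Hypothetical type names; fields and defaults mirror the `benchmark` table.
type Tier = "flagship" | "mid" | "low";

interface ModelEntry {
  id: string;
  name: string;
  tier: Tier;
  weight?: number; // documented default: 1.0
}

interface BenchmarkConfig {
  format?: "pi";
  apiKeyEnv?: string;
  timeout?: number; // milliseconds per model call
  models: ModelEntry[]; // the only required field
  taskGeneration?: { enabled?: boolean; maxTasks?: number; seed?: number };
  output?: { dir?: string };
  verdict?: { perModelFloor?: number; targetWeightedAverage?: number };
}

// Fill in the documented default for every omitted field.
function resolveBenchmarkDefaults(cfg: BenchmarkConfig) {
  return {
    format: cfg.format ?? "pi",
    apiKeyEnv: cfg.apiKeyEnv ?? "OPENROUTER_API_KEY",
    timeout: cfg.timeout ?? 240000,
    models: cfg.models.map((m) => ({ ...m, weight: m.weight ?? 1 })),
    taskGeneration: {
      enabled: cfg.taskGeneration?.enabled ?? false,
      maxTasks: cfg.taskGeneration?.maxTasks ?? 10,
      seed: cfg.taskGeneration?.seed ?? 1,
    },
    output: { dir: cfg.output?.dir ?? "benchmark-results/" },
    verdict: {
      perModelFloor: cfg.verdict?.perModelFloor ?? 0.6,
      targetWeightedAverage: cfg.verdict?.targetWeightedAverage ?? 0.7,
    },
  };
}
```

Passing a config that contains only `models` shows which values apply when nothing else is set.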

`optimize` fields (all optional)

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `string` | none | Model for mutation (e.g. `openrouter/anthropic/claude-sonnet-4-6`) |
| `apiKeyEnv` | `string` | none | Env var for the optimizer's API key |
| `thinkingLevel` | `"off"\|"minimal"\|"low"\|"medium"\|"high"\|"xhigh"` | `"medium"` | Reasoning depth for mutation calls |
| `allowedPaths` | `string[]` | none | Paths the optimizer may edit (safety boundary) |
| `validation` | `string[]` | none | Shell commands to run to validate each mutation |
| `requireCleanGit` | `boolean` | `true` | Require clean git state before starting |
| `maxIterations` | `number` | `5` | Maximum optimization iterations |
| `minImprovement` | `number` | `0.02` | Minimum weighted-average gain per accepted iteration |
| `reportContextMaxBytes` | `number` | `16000` | Byte budget for mutation context |

Annotated example config

{
  "name": "my-mcp-project",
  "target": {
    "surface": "mcp",
    "repoPath": ".",
    "skill": "./SKILL.md",
    "discovery": {
      "mode": "auto",
      "sources": ["./src/server.ts"]
    }
  },
  "benchmark": {
    "format": "pi",
    "apiKeyEnv": "OPENROUTER_API_KEY",
    "models": [
      { "id": "openrouter/anthropic/claude-sonnet-4-6", "name": "Claude Sonnet", "tier": "flagship", "weight": 2 },
      { "id": "openrouter/openai/gpt-4o-mini",          "name": "GPT-4o mini",   "tier": "mid",      "weight": 1 }
    ],
    "taskGeneration": {
      "enabled": true,
      "maxTasks": 20,
      "seed": 1
    },
    "output": { "dir": "./benchmark-results" },
    "verdict": {
      "perModelFloor": 0.6,
      "targetWeightedAverage": 0.7
    }
  },
  "optimize": {
    "model": "openrouter/anthropic/claude-sonnet-4-6",
    "apiKeyEnv": "OPENROUTER_API_KEY",
    "thinkingLevel": "medium",
    "allowedPaths": ["SKILL.md"],
    "requireCleanGit": true,
    "maxIterations": 5,
    "minImprovement": 0.02,
    "reportContextMaxBytes": 16000
  }
}

Interpreting the verdict

Every benchmark run produces one of two verdicts: PASS or FAIL.

Two gates must both be satisfied for a PASS:

benchmark.verdict.perModelFloor (default 0.6): every model must pass at least this fraction of tasks. A single model below the floor fails the run, regardless of the average.
benchmark.verdict.targetWeightedAverage (default 0.7): the weighted average score across all models must reach this threshold.

benchmark.models[].weight (default 1.0): heavier-weighted models count more toward the weighted average. Use higher weights for flagship models you care most about.
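
The two gates can be sketched in a few lines of TypeScript. This is an illustration, not the tool's evaluator: `computeVerdict`, `ModelScore`, and `Verdict` are hypothetical names, and the weighted average is assumed to be the usual sum of weight × score divided by the sum of weights.

```typescript
interface ModelScore {
  id: string;
  passRate: number; // fraction of tasks passed, between 0 and 1
  weight?: number;  // documented default: 1.0
}

interface Verdict {
  pass: boolean;
  weightedAverage: number;
  belowFloor: string[]; // ids of models under the per-model floor
}

function computeVerdict(
  scores: ModelScore[],
  perModelFloor = 0.6,
  targetWeightedAverage = 0.7,
): Verdict {
  const totalWeight = scores.reduce((sum, s) => sum + (s.weight ?? 1), 0);
  // Assumption: normalized weighted mean; an empty model list scores 0.
  const weightedAverage =
    totalWeight === 0
      ? 0
      : scores.reduce((sum, s) => sum + s.passRate * (s.weight ?? 1), 0) / totalWeight;
  const belowFloor = scores
    .filter((s) => s.passRate < perModelFloor)
    .map((s) => s.id);
  return {
    pass: belowFloor.length === 0 && weightedAverage >= targetWeightedAverage,
    weightedAverage,
    belowFloor,
  };
}
```

With a flagship model at 0.9 (weight 2) and a mid-tier model at 0.5 (weight 1), the weighted average is about 0.77 and clears the second gate, yet the run still fails because the mid-tier model sits below the 0.6 floor.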

The optimizer only accepts a mutation when:

the weighted average improves by at least minImprovement, AND
no model that was above the floor drops below it.
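
The acceptance rule above can be expressed as a small predicate. This is a sketch under stated assumptions: `shouldAcceptMutation` and `weightedAverage` are hypothetical helpers, "above the floor" is read as "at or above the floor", and a model missing from the post-mutation results is treated as scoring 0.

```typescript
type PassRates = Record<string, number>; // model id -> fraction of tasks passed

// Normalized weighted mean; models without an explicit weight count as 1.
function weightedAverage(rates: PassRates, weights: Record<string, number>): number {
  let total = 0;
  let weightSum = 0;
  for (const [id, rate] of Object.entries(rates)) {
    const w = weights[id] ?? 1;
    total += rate * w;
    weightSum += w;
  }
  return weightSum === 0 ? 0 : total / weightSum;
}

function shouldAcceptMutation(
  before: PassRates,
  after: PassRates,
  weights: Record<string, number> = {},
  perModelFloor = 0.6,
  minImprovement = 0.02,
): boolean {
  const gain = weightedAverage(after, weights) - weightedAverage(before, weights);
  // A regression is a model that met the floor before and misses it now.
  const regressed = Object.keys(before).some(
    (id) => before[id] >= perModelFloor && (after[id] ?? 0) < perModelFloor,
  );
  return gain >= minImprovement && !regressed;
}
```

Note that a large gain in the average is not enough on its own: a mutation that lifts one model while pushing another under the floor is rejected and rolled back.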

Exit codes: 0 = PASS, 1 = FAIL — usable directly in CI pipelines.

Scope & coverage

Control which actions are benchmarked with target.scope:

target.scope.include (default ["*"]): glob patterns for actions to include.
target.scope.exclude (default []): glob patterns for actions to exclude.

The * wildcard matches any sequence of characters including dots and slashes — it is not limited to a single path segment.
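
To make the wildcard rule concrete, here is a minimal matcher sketch. It is not the tool's implementation: `globToRegExp` and `isInScope` are hypothetical names, patterns are assumed to match the whole action name, and exclude is assumed to take precedence over include.

```typescript
// Translate a glob into an anchored RegExp where `*` matches any run of
// characters, including dots and slashes. Everything else is literal.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

// An action is in scope when it matches at least one include pattern
// and no exclude pattern (assumed precedence).
function isInScope(
  action: string,
  include: string[] = ["*"],
  exclude: string[] = [],
): boolean {
  const matchesAny = (patterns: string[]) =>
    patterns.some((p) => globToRegExp(p).test(action));
  return matchesAny(include) && !matchesAny(exclude);
}
```

Under these assumptions, `isInScope("Wallet.internal.sign", ["*"], ["*.internal*"])` is false, while a name such as `internalFoo` stays in scope because the pattern requires a literal dot before `internal`.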

Examples:

"Wallet.*" in include: matches every Wallet method
"*.internal*" in exclude: drops any action whose name contains ".internal"
"get_*" in include: matches only actions whose names start with get_

Task generation is coverage-guaranteed: every in-scope action gets at least one task. If the first generation pass misses any, a targeted retry runs (max 2 iterations). If coverage still fails, an error names the uncovered actions and suggests either fixing SKILL.md guidance or adding them to scope.exclude.
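
The coverage guarantee can be pictured as a generate, check, retry loop. The sketch below is illustrative only: `Task`, `findUncovered`, and `generateWithCoverage` are hypothetical names, and the `generate` callback is a synchronous stand-in for the LLM-backed generator, which in practice would be asynchronous.

```typescript
interface Task {
  prompt: string;
  action: string; // the in-scope action this task exercises
}

// Actions that no generated task targets yet.
function findUncovered(actions: string[], tasks: Task[]): string[] {
  const covered = new Set(tasks.map((t) => t.action));
  return actions.filter((a) => !covered.has(a));
}

function generateWithCoverage(
  actions: string[],
  generate: (targets: string[]) => Task[],
  maxRetries = 2,
): Task[] {
  const tasks = [...generate(actions)];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const missing = findUncovered(actions, tasks);
    if (missing.length === 0) return tasks;
    // Targeted retry: ask only for the actions still missing.
    tasks.push(...generate(missing));
  }
  const missing = findUncovered(actions, tasks);
  if (missing.length > 0) {
    throw new Error(
      `Uncovered actions: ${missing.join(", ")}. ` +
        "Fix SKILL.md guidance or add them to scope.exclude.",
    );
  }
  return tasks;
}
```

The retries are targeted so that a second pass spends calls only on the gaps rather than regenerating tasks that already exist.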

Cost notes

Rough LLM spend per run:

Baseline benchmark: N models × M tasks LLM calls.
Optimizer iteration: 1 mutation call + N models × M tasks re-benchmark per iteration.
Recommendations: 1 critic call, only on FAIL verdict.

No per-failure LLM calls — feedback is deterministic (structured failure details + patterns + passing/failing diffs).
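
The arithmetic above is easy to turn into a quick estimator. `estimateLlmCalls` is a hypothetical helper, not part of the tool, and it counts only the three items listed here; calls made during task generation are not itemized above and are left out.

```typescript
// Rough call count: baseline + optimizer iterations + optional critic call.
function estimateLlmCalls(
  models: number,
  tasks: number,
  iterations = 0,   // optimizer iterations actually run
  failed = false,   // true when the final verdict is FAIL
): number {
  const baseline = models * tasks;
  const optimizer = iterations * (1 + models * tasks); // 1 mutation + re-benchmark
  const critic = failed ? 1 : 0;
  return baseline + optimizer + critic;
}
```

For example, 2 models and 20 tasks with 5 optimizer iterations and a final FAIL comes to 40 + 5 × 41 + 1 = 246 calls, so the re-benchmark inside each iteration dominates the cost.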

Dependencies

The optimizer's coding agent is powered by @mariozechner/pi-coding-agent — a small OSS wrapper around OpenRouter that handles agent sessions and tool loops. Models are accessed through OpenRouter — you need one API key for everything.

Troubleshooting

Missing OPENROUTER_API_KEY: Set it in your shell before running:

export OPENROUTER_API_KEY=sk-or-...

Dirty git: The optimizer requires a clean git state in the target repo (requireCleanGit: true by default). Commit or stash uncommitted changes before running.

maxTasks < scope_size: benchmark.taskGeneration.maxTasks must be >= the number of in-scope actions. Run npx tsx src/cli.ts --dry-run --config ./skill-optimizer.json to see the count without making LLM calls.

Empty scope: target.scope.include matched nothing. Check your glob patterns — remember * matches everything including dots.

Legacy skill-benchmark.json: Rename it to skill-optimizer.json. The loader will tell you if it finds the old name.

Contributing

See CONTRIBUTING.md.