eval-layer
Health Warn
- No license — Repository has no license file
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Low visibility — Only 7 GitHub stars
Code Warn
- Code scan incomplete — No supported source files were scanned during light audit
Permissions Pass
- Permissions — No dangerous permissions requested
This tool is a Claude Code skill that adds a rubric-based evaluation framework to AI agent projects. It generates test suites, an LLM-as-a-judge prompt, and an eval harness to score agent performance across multiple frameworks.
Security Assessment
Overall risk is rated as Low. The light code scan found no supported source files to parse, so a deep automated code review could not be completed. However, the rule-based scan confirms no dangerous permissions are requested and no hardcoded secrets were detected. Based on the README, the tool operates locally to read your agent code, run eval cases, and generate an HTML dashboard. It acts as a prompt generator and evaluation wrapper rather than a background service, limiting its attack surface. No sensitive data access or unauthorized shell execution is implied by its documented behavior.
Quality Assessment
The project is under active development, with its most recent push happening today. However, there are significant trust and adoption concerns. It lacks a standard open-source license, meaning legal usage rights are undefined and it cannot safely be used in commercial environments. Additionally, it has very low community visibility with only 7 GitHub stars, indicating minimal peer review or real-world testing.
Verdict
Use with caution due to the lack of a software license and limited community testing, though the actual operational risk appears low.
A Claude Code skill that adds a rubric-based eval layer to any agent project. Framework-agnostic — generates rubric, test cases, judge prompt, and harness. Returns a weighted score plus a judge-leniency signal.
eval-layer
A Claude Code skill that adds a rubric-based evaluation layer to any existing agent project. Framework-agnostic — works with PydanticAI, LangGraph, CrewAI, Strands, OpenAI Agents SDK, the raw Anthropic SDK, or anything else that exposes run(prompt) -> result.
You bring the agent. The skill gives you:
- A scoring rubric (3-5 dimensions with concrete level descriptors, not "good" / "bad")
- A test suite of inputs with ≥3 reference-graded cases (for calibration)
- A judge prompt for LLM-as-a-judge with 2-3 calibration examples
- An eval harness that runs your agent on the cases, sends output to the judge, and aggregates (see the sketch after this list):
- per-dimension averages
- weighted overall score
- pass rate
- leniency vs. the human references (flags if the judge is systematically too strict or too lenient)
- A per-subject markdown report and, for multi-subject benchmarks, a self-contained HTML dashboard (leaderboard + radar + per-case heatmap + failure-category breakdown).
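To make the aggregation concrete, here is a minimal sketch of the step the generated harness performs: per-dimension averages, a weighted overall score, and a pass rate. The rubric weights, pass threshold, and the shape of `case_scores` are illustrative assumptions, not the generated harness's actual identifiers.

```python
# Illustrative aggregation logic, not the generated harness's exact code.
# Assumes each judged case yields per-dimension scores in [0, 1] and that
# there is at least one judged case.

RUBRIC_WEIGHTS = {"correctness": 0.4, "grounding": 0.3, "format": 0.3}  # hypothetical rubric
PASS_THRESHOLD = 0.7  # hypothetical pass bar for the weighted score

def aggregate(case_scores: list[dict[str, float]]) -> dict:
    """Aggregate judged cases into per-dimension averages, weighted score, pass rate."""
    dims = RUBRIC_WEIGHTS.keys()
    per_dimension = {
        d: sum(c[d] for c in case_scores) / len(case_scores) for d in dims
    }
    weighted_per_case = [
        sum(RUBRIC_WEIGHTS[d] * c[d] for d in dims) for c in case_scores
    ]
    return {
        "per_dimension": per_dimension,
        "weighted_score": sum(weighted_per_case) / len(weighted_per_case),
        "pass_rate": sum(s >= PASS_THRESHOLD for s in weighted_per_case) / len(weighted_per_case),
    }
```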
Installation
This is a Claude Code skill. To use it as your own user-level skill, clone it into your Claude skills directory:
git clone <repo-url> ~/.claude/skills/eval-layer
Then from any Claude Code session: /eval-layer <your agent path>.
Why this exists
"Vibes-based" evals fail silently. You ship a change, the agent feels better, but you have no numbers. This skill produces numbers you can trust — including a judge-calibration signal (leniency) so you know when the judge itself is drifting, not just the agent.
Two measurements, not one:
- Weighted score — how good the output is, per the rubric, on [0, 1]
- Leniency — mean(judge_score − human_reference_score) on [-1, +1]:
  - abs < 0.10 → well-calibrated
  - 0.10 – 0.25 → slight bias, monitor
  - abs > 0.25 → recalibrate before trusting scores
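A companion sketch of the leniency calculation, assuming judge and human reference scores are keyed by case ID on the same [0, 1] scale; the function names are illustrative and the thresholds simply restate the ranges above.

```python
def leniency(judge_scores: dict[str, float], reference_scores: dict[str, float]) -> float:
    """Mean(judge_score - human_reference_score) over the reference-graded cases."""
    common = judge_scores.keys() & reference_scores.keys()  # assumes at least one shared case ID
    return sum(judge_scores[c] - reference_scores[c] for c in common) / len(common)

def calibration_verdict(value: float) -> str:
    """Map a leniency value in [-1, +1] onto the thresholds listed above."""
    if abs(value) < 0.10:
        return "well-calibrated"
    if abs(value) <= 0.25:
        return "slight bias, monitor"
    return "recalibrate before trusting scores"
```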
The 7-field metadata contract
Every framework adapter returns the same shape. This is what makes cross-framework comparison honest:
{
"recommendation": <schema instance or None>, # agent's structured output
"latency_ms": int, # wall-clock including tool execution
"tool_calls": int, # count of tool invocations
"input_tokens": int | None, # summed across turns
"output_tokens": int | None,
"model_id": str,
"error": str | None,
}
If a framework doesn't expose a field (Strands 0.1.x doesn't expose tokens, for example), pass None — don't fabricate. The harness handles None defensively throughout.
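For illustration only, here is a generic adapter that maps whatever a framework's run(prompt) returns onto the 7-field contract. The result attributes probed with getattr are hypothetical, and anything the framework does not expose falls through to None.

```python
import time

def run_subject(agent, prompt: str) -> dict:
    """Hypothetical adapter: wrap any run(prompt) -> result agent in the 7-field contract."""
    start = time.monotonic()
    try:
        result = agent.run(prompt)  # framework-specific call
        return {
            "recommendation": getattr(result, "output", None),      # structured output, if any
            "latency_ms": int((time.monotonic() - start) * 1000),
            "tool_calls": getattr(result, "tool_calls", 0) or 0,
            "input_tokens": getattr(result, "input_tokens", None),  # None if not exposed
            "output_tokens": getattr(result, "output_tokens", None),
            "model_id": getattr(result, "model_id", "unknown"),
            "error": None,
        }
    except Exception as exc:  # never drop a result; record the failure instead
        return {
            "recommendation": None,
            "latency_ms": int((time.monotonic() - start) * 1000),
            "tool_calls": 0,
            "input_tokens": None,
            "output_tokens": None,
            "model_id": "unknown",
            "error": str(exc),
        }
```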
Usage
In Claude Code, invoke the skill against your project:
/eval-layer Add an eval layer to /path/to/your/agent

Claude reads the agent, proposes a rubric, and on your confirmation generates:

your-project/
  evals/
    eval_harness.py
    rubrics/main.yaml
    prompts/judge.md
    test_cases/seed.yaml
    reports/
      raw/<subject>.jsonl
      eval_<subject>_<ts>.md
      framework-comparison.html   # multi-subject only

Smoke-test one case:

python evals/eval_harness.py --framework <subject> --test-case easy-01 -v

Full run:

python evals/eval_harness.py --framework <subject>

Multi-subject sweep + dashboard:

python evals/eval_harness.py --framework all
python evals/make_html_report.py
Required harness flags
Every harness generated by this skill exposes these:
| Flag | Purpose |
|---|---|
| `--framework NAME` | which subject to run (or `all`) |
| `--test-case ID` | run only one case — essential for debugging adapters |
| `-v` / `--verbose` | print per-case metadata as it runs |
| `--trials N` | run each case N times (pass@k / variance) |
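A minimal sketch of how a generated harness could expose these flags with argparse; the flag names match the table, while the defaults and help text are assumptions.

```python
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Rubric-based eval harness")
    parser.add_argument("--framework", required=True,
                        help="subject to run, or 'all' for a multi-subject sweep")
    parser.add_argument("--test-case", default=None,
                        help="run only this case ID (useful when debugging adapters)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print per-case metadata as it runs")
    parser.add_argument("--trials", type=int, default=1,
                        help="run each case N times for pass@k / variance")
    return parser.parse_args()
```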
Repository layout
eval-layer/
├── SKILL.md # entry point — the process Claude follows
├── references/
│ ├── rubric-design.md # dimension catalog, scales, leniency thresholds, anti-patterns
│ ├── judge-prompts.md # judge prompt template + calibration techniques
│ ├── judge-robustness.md # defensive JSON parse, retry-once, never-drop-a-result
│ ├── framework-adapters.md # copy-paste recipes per framework (w/ the metadata contract)
│ ├── structured-output-troubleshooting.md # the three Bedrock Opus errors and the two-stage fix
│ ├── cross-subject-benchmarking.md # multi-subject flow (models / frameworks / prompts)
│ └── html-report-template.html # self-contained Chart.js dashboard template
└── README.md
SKILL.md is the top-level recipe (loaded when the skill is invoked). The references/ files are loaded on demand for the relevant step.
Bedrock gotcha
If the agent targets Claude on Bedrock, the default path for structured output is the two-stage pattern, not response_format / output_type. Single-stage structured output fails with three distinct errors on Bedrock Opus 4.x across LangGraph, Strands, and OpenAI Agents SDK:
- LangGraph — "This model does not support assistant message prefill"
- Strands — "No valid tool use or tool use input was found in the Bedrock response"
- OpenAI Agents + LiteLLM — "minItems values other than 0 or 1 are not supported"
The fix:
Stage 1: run the agent conversationally — tools enabled, no output_type.
Stage 2: feed the final text into a fresh structured-output call.
See references/structured-output-troubleshooting.md.
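As a schematic of the two-stage pattern (not a framework-specific recipe; those live in references/structured-output-troubleshooting.md), the stages can be expressed as a small higher-order helper where both callables are supplied by whatever framework you use.

```python
from typing import Any, Callable

def two_stage_structured_output(
    run_agent: Callable[[str], str],   # stage 1: conversational agent run (tools on, no output_type)
    extract: Callable[[str], Any],     # stage 2: fresh structured-output call on plain text
    prompt: str,
) -> Any:
    """Schematic two-stage pattern: run conversationally, then structure the final text."""
    final_text = run_agent(prompt)  # stage 1: avoids the single-stage Bedrock Opus failures above
    return extract(final_text)      # stage 2: sees only text, so no prefill or tool-use constraints
```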
Validation checklist
Before handing off an eval produced by this skill:
- 3–5 dimensions, weights sum to 1.0
- Concrete level descriptors (not "good"/"bad")
- 2–3 calibration examples in the judge prompt
- ≥3 test cases with `reference_scores` for leniency
- Harness uses the defensive `parse_judge_response` helper
- Harness outputs the 7-field metadata contract
- `--framework`, `--test-case`, `-v`, `--trials` flags present
- If Bedrock: two-stage structured output is the default
- If multi-subject: HTML dashboard renders with all subjects