eval-layer
Health Warn
- No license — Repository has no license file
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Low visibility — Only 7 GitHub stars
Code Warn
- Code scan incomplete — No supported source files were scanned during light audit
Permissions Pass
- Permissions — No dangerous permissions requested
This tool is a Claude Code skill that adds a rubric-based evaluation framework to AI agent projects. It generates test suites, an LLM-as-a-judge prompt, and an eval harness to score agent performance across multiple frameworks.
Security Assessment
Overall risk is rated as Low. The light code scan found no supported source files to parse, so a deep automated code review could not be completed. However, the rule-based scan confirms no dangerous permissions are requested and no hardcoded secrets were detected. Based on the README, the tool operates locally to read your agent code, run eval cases, and generate an HTML dashboard. It acts as a prompt generator and evaluation wrapper rather than a background service, limiting its attack surface. No sensitive data access or unauthorized shell execution is implied by its documented behavior.
Quality Assessment
The project is under active development, with its most recent push happening today. However, there are significant trust and adoption concerns. It lacks a standard open-source license, meaning legal usage rights are undefined and it cannot safely be used in commercial environments. Additionally, it has very low community visibility with only 7 GitHub stars, indicating minimal peer review or real-world testing.
Verdict
Use with caution due to the lack of a software license and limited community testing, though the actual operational risk appears low.
A Claude Code skill that adds a rubric-based eval layer to any agent project. Framework-agnostic — generates rubric, test cases, judge prompt, and harness. Returns a weighted score plus a judge-leniency signal.
eval-layer
A Claude Code skill that adds a rubric-based evaluation layer to any existing agent project. Framework-agnostic — works with PydanticAI, LangGraph, CrewAI, Strands, OpenAI Agents SDK, the raw Anthropic SDK, or anything else that exposes run(prompt) -> result.
You bring the agent. The skill gives you:
- A scoring rubric (3-5 dimensions with concrete level descriptors, not "good" / "bad")
- A test suite of inputs with ≥3 reference-graded cases (for calibration)
- A judge prompt for LLM-as-a-judge with 2-3 calibration examples
- An eval harness that runs your agent on the cases, sends output to the judge, and aggregates (see the sketch after this list):
- per-dimension averages
- weighted overall score
- pass rate
- leniency vs. the human references (flags if the judge is systematically too strict or too lenient)
- A per-subject markdown report and, for multi-subject benchmarks, a self-contained HTML dashboard (leaderboard + radar + per-case heatmap + failure-category breakdown).
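To make the aggregation concrete, here is a minimal sketch of the step the generated harness performs: per-dimension averages, a weighted overall score, and a pass rate. The rubric weights, pass threshold, and the shape of `case_scores` are illustrative assumptions, not the generated harness's actual identifiers.

```python
# Illustrative aggregation logic, not the generated harness's exact code.
# Assumes each judged case yields per-dimension scores in [0, 1] and that
# there is at least one judged case.

RUBRIC_WEIGHTS = {"correctness": 0.4, "grounding": 0.3, "format": 0.3}  # hypothetical rubric
PASS_THRESHOLD = 0.7  # hypothetical pass bar for the weighted score

def aggregate(case_scores: list[dict[str, float]]) -> dict:
    """Aggregate judged cases into per-dimension averages, weighted score, pass rate."""
    dims = RUBRIC_WEIGHTS.keys()
    per_dimension = {
        d: sum(c[d] for c in case_scores) / len(case_scores) for d in dims
    }
    weighted_per_case = [
        sum(RUBRIC_WEIGHTS[d] * c[d] for d in dims) for c in case_scores
    ]
    return {
        "per_dimension": per_dimension,
        "weighted_score": sum(weighted_per_case) / len(weighted_per_case),
        "pass_rate": sum(s >= PASS_THRESHOLD for s in weighted_per_case) / len(weighted_per_case),
    }
```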
Installation
This is a Claude Code skill. To use it as your own user-level skill, clone it into your Claude skills directory:
git clone <repo-url> ~/.claude/skills/eval-layer
Then from any Claude Code session: /eval-layer <your agent path>.
Why this exists
"Vibes-based" evals fail silently. You ship a change, the agent feels better, but you have no numbers. This skill produces numbers you can trust — including a judge-calibration signal (leniency) so you know when the judge itself is drifting, not just the agent.
Two measurements, not one:
- Weighted score — how good the output is, per the rubric, on [0, 1]
- Leniency — mean(judge_score − human_reference_score) on [-1, +1]:
  - abs < 0.10 → well-calibrated
  - 0.10 – 0.25 → slight bias, monitor
  - abs > 0.25 → recalibrate before trusting scores
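A companion sketch of the leniency calculation, assuming judge and human reference scores are keyed by case ID on the same [0, 1] scale; the function names are illustrative and the thresholds simply restate the ranges above.

```python
def leniency(judge_scores: dict[str, float], reference_scores: dict[str, float]) -> float:
    """Mean(judge_score - human_reference_score) over the reference-graded cases."""
    common = judge_scores.keys() & reference_scores.keys()  # assumes at least one shared case ID
    return sum(judge_scores[c] - reference_scores[c] for c in common) / len(common)

def calibration_verdict(value: float) -> str:
    """Map a leniency value in [-1, +1] onto the thresholds listed above."""
    if abs(value) < 0.10:
        return "well-calibrated"
    if abs(value) <= 0.25:
        return "slight bias, monitor"
    return "recalibrate before trusting scores"
```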
The 7-field metadata contract
Every framework adapter returns the same shape. This is what makes cross-framework comparison honest:
{
"recommendation": <schema instance or None>, # agent's structured output
"latency_ms": int, # wall-clock including tool execution
"tool_calls": int, # count of tool invocations
"input_tokens": int | None, # summed across turns
"output_tokens": int | None,
"model_id": str,
"error": str | None,
}
If a framework doesn't expose a field (Strands 0.1.x doesn't expose tokens, for example), pass None — don't fabricate. The harness handles None defensively throughout.
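For illustration only, here is a generic adapter that maps whatever a framework's run(prompt) returns onto the 7-field contract. The result attributes probed with getattr are hypothetical, and anything the framework does not expose falls through to None.

```python
import time

def run_subject(agent, prompt: str) -> dict:
    """Hypothetical adapter: wrap any run(prompt) -> result agent in the 7-field contract."""
    start = time.monotonic()
    try:
        result = agent.run(prompt)  # framework-specific call
        return {
            "recommendation": getattr(result, "output", None),      # structured output, if any
            "latency_ms": int((time.monotonic() - start) * 1000),
            "tool_calls": getattr(result, "tool_calls", 0) or 0,
            "input_tokens": getattr(result, "input_tokens", None),  # None if not exposed
            "output_tokens": getattr(result, "output_tokens", None),
            "model_id": getattr(result, "model_id", "unknown"),
            "error": None,
        }
    except Exception as exc:  # never drop a result; record the failure instead
        return {
            "recommendation": None,
            "latency_ms": int((time.monotonic() - start) * 1000),
            "tool_calls": 0,
            "input_tokens": None,
            "output_tokens": None,
            "model_id": "unknown",
            "error": str(exc),
        }
```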
Usage
In Claude Code, invoke the skill against your project:
/eval-layer Add an eval layer to /path/to/your/agent

Claude reads the agent, proposes a rubric, and on your confirmation generates:

your-project/
  evals/
    eval_harness.py
    rubrics/main.yaml
    prompts/judge.md
    test_cases/seed.yaml
    reports/
      raw/<subject>.jsonl
      eval_<subject>_<ts>.md
      framework-comparison.html   # multi-subject only

Smoke-test one case:

python evals/eval_harness.py --framework <subject> --test-case easy-01 -v

Full run:

python evals/eval_harness.py --framework <subject>

Multi-subject sweep + dashboard:

python evals/eval_harness.py --framework all
python evals/make_html_report.py
Required harness flags
Every harness generated by this skill exposes these:
| Flag | Purpose |
|---|---|
| `--framework NAME` | which subject to run (or `all`) |
| `--test-case ID` | run only one case — essential for debugging adapters |
| `-v` / `--verbose` | print per-case metadata as it runs |
| `--trials N` | run each case N times (pass@k / variance) |
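A minimal sketch of how a generated harness could expose these flags with argparse; the flag names match the table, while the defaults and help text are assumptions.

```python
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Rubric-based eval harness")
    parser.add_argument("--framework", required=True,
                        help="subject to run, or 'all' for a multi-subject sweep")
    parser.add_argument("--test-case", default=None,
                        help="run only this case ID (useful when debugging adapters)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print per-case metadata as it runs")
    parser.add_argument("--trials", type=int, default=1,
                        help="run each case N times for pass@k / variance")
    return parser.parse_args()
```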
Repository layout
eval-layer/
├── SKILL.md # entry point — the process Claude follows
├── references/
│ ├── rubric-design.md # dimension catalog, scales, leniency thresholds, anti-patterns
│ ├── judge-prompts.md # judge prompt template + calibration techniques
│ ├── judge-robustness.md # defensive JSON parse, retry-once, never-drop-a-result
│ ├── framework-adapters.md # copy-paste recipes per framework (w/ the metadata contract)
│ ├── structured-output-troubleshooting.md # the three Bedrock Opus errors and the two-stage fix
│ ├── cross-subject-benchmarking.md # multi-subject flow (models / frameworks / prompts)
│ └── html-report-template.html # self-contained Chart.js dashboard template
└── README.md
SKILL.md is the top-level recipe (loaded when the skill is invoked). The references/ files are loaded on demand for the relevant step.
Bedrock gotcha
If the agent targets Claude on Bedrock, the default path for structured output is the two-stage pattern, not response_format / output_type. Single-stage structured output fails with three distinct errors on Bedrock Opus 4.x across LangGraph, Strands, and OpenAI Agents SDK:
- LangGraph — "This model does not support assistant message prefill"
- Strands — "No valid tool use or tool use input was found in the Bedrock response"
- OpenAI Agents + LiteLLM — "minItems values other than 0 or 1 are not supported"
The fix:
Stage 1: run the agent conversationally — tools enabled, no output_type.
Stage 2: feed the final text into a fresh structured-output call.
See references/structured-output-troubleshooting.md.
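As a schematic of the two-stage pattern (not a framework-specific recipe; those live in references/structured-output-troubleshooting.md), the stages can be expressed as a small higher-order helper where both callables are supplied by whatever framework you use.

```python
from typing import Any, Callable

def two_stage_structured_output(
    run_agent: Callable[[str], str],   # stage 1: conversational agent run (tools on, no output_type)
    extract: Callable[[str], Any],     # stage 2: fresh structured-output call on plain text
    prompt: str,
) -> Any:
    """Schematic two-stage pattern: run conversationally, then structure the final text."""
    final_text = run_agent(prompt)  # stage 1: avoids the single-stage Bedrock Opus failures above
    return extract(final_text)      # stage 2: sees only text, so no prefill or tool-use constraints
```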
Validation checklist
Before handing off an eval produced by this skill:
- 3–5 dimensions, weights sum to 1.0
- Concrete level descriptors (not "good"/"bad")
- 2–3 calibration examples in the judge prompt
- ≥3 test cases with `reference_scores` for leniency
- Harness uses the defensive `parse_judge_response` helper
- Harness outputs the 7-field metadata contract
- `--framework`, `--test-case`, `-v`, `--trials` flags present
- If Bedrock: two-stage structured output is the default
- If multi-subject: HTML dashboard renders with all subjects