rashomon
Health: Warn
- No license — Repository has no license file
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Low visibility — Only 8 GitHub stars
Code: Fail
- rm -rf — Recursive force deletion command in skills/worktree-execution/scripts/worktree-cleanup.sh
Permissions: Pass
- Permissions — No dangerous permissions requested
This tool evaluates and compares the effectiveness of AI skills and prompts through blind A/B testing in isolated environments. It helps developers measure whether changes to prompts actually improve agent behavior.
Security Assessment
The tool operates by executing shell commands to set up and tear down isolated testing environments. While it does not request dangerous permissions, the static scan flagged a recursive force deletion command (`rm -rf`) inside a cleanup script. Such commands are common in build and teardown scripts, but a poorly guarded `rm -rf` can delete critical files if a path variable is ever empty or set incorrectly. The tool does not appear to access highly sensitive user data, contain hardcoded secrets, or make suspicious external network requests. Overall risk is rated Medium due to the destructive potential of the cleanup script.
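The failure mode the scanner describes can be avoided with a guard pattern. The snippet below is a hypothetical sketch, not the project's actual cleanup script: it validates that the deletion target is non-empty and under the expected temp-directory prefix before `rm -rf` ever runs, so an unset variable can never expand to `/`.

```shell
#!/bin/sh
# Hypothetical guard pattern; rashomon's real cleanup script may differ.
cleanup() {
  dir="${1:?cleanup: no directory given}"    # aborts if empty or unset
  case "$dir" in
    "${TMPDIR:-/tmp}"/worktree-rashomon-*)   # only the expected prefix
      rm -rf -- "$dir" ;;
    *)
      echo "refusing to delete unexpected path: $dir" >&2
      return 1 ;;
  esac
}

demo="${TMPDIR:-/tmp}/worktree-rashomon-demo"
mkdir -p "$demo"
cleanup "$demo" && echo "cleaned"
```

The `${1:?...}` expansion makes an empty argument a hard error, and the `case` whitelist restricts deletion to paths the script itself is expected to create.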
Quality Assessment
The project is very new and currently has low community visibility, evidenced by only 8 GitHub stars. However, it appears to be actively maintained, with the most recent code push happening today. There is a discrepancy in the repository's health: the README displays an MIT license badge, but the automated scan failed to find an actual license file, meaning legal usage rights are currently unclear.
Verdict
Use with caution: the concept is highly useful, but users should review the cleanup script carefully before running it to prevent accidental data loss.
Measure prompt and skill improvements with blind A/B comparison.
Know whether your skills actually improve agent behavior — not just look different.
Why rashomon?
Inspired by the Rashomon effect — the phenomenon in which the same event produces conflicting accounts depending on the observer's perspective.
rashomon makes those differences explicit and comparable.
- Built a skill but unsure if it actually changes agent behavior?
- Iterating on skills and prompts by gut feel instead of evidence?
- Want proof that your changes made things better, not just different?
rashomon evaluates skills and prompts through blind comparison — running tasks with and without your changes in isolated environments, then comparing real outputs without knowing which version produced which.
Who Is This For?
rashomon is designed for:
- Skill authors who want evidence-based validation
- Developers using Claude Code daily
- Teams iterating on complex prompts (coding, analysis, writing)
- Anyone who wants evidence, not vibes, when improving skills and prompts
Not ideal if:
- You want one-shot prompt rewriting without comparison
Quick Example
Skill Evaluation
/recipe-eval-skill create
Creates a skill through an interactive dialog, then evaluates its effectiveness:
- Collects domain knowledge, project-specific rules, and trigger phrases
- Generates optimized skill content (graded A/B/C)
- Runs a test task with and without the skill in isolated environments, using blind A/B comparison
What the evaluation report looks like:
Skill Quality: Grade A
- Project-specific rules clearly encoded, no critical issues
Trigger Check: pass (discovered + invoked)
Execution Effectiveness:
- Winner: with-skill
- Assessment: structural improvement
- Key difference: 3-stage catch ordering and retry constraints applied correctly (attributed to skill Rules 3 and 6)
Recommendation: ship
/recipe-eval-skill api-error-handling skill's scope needs adjustment
Updates an existing skill, then evaluates old vs new version side by side.
See a real-world example: I Built a Skill Reviewer. Then I Ran It on Itself.
Prompt Evaluation
/recipe-eval-prompt Write a function to sort an array
Analyzes prompt issues, generates an improved version, runs both in isolated environments, and shows what actually changed.
What You Get
1. Detected Issues
- BP-002 (Vague Instructions): Sort order, language, and error handling not specified
- BP-003 (Missing Output Format): No expected output structure defined
2. Improved Prompt
Write a TypeScript function that sorts a number array in ascending order.
- Return empty array for empty input
- Include JSDoc comments
- Output: function code with example usage
3. Comparison Report
| Aspect | Original | Improved |
|---|---|---|
| Type definitions | None | Included |
| Edge case handling | None | Included |
| Documentation | None | JSDoc added |
Result: Structural Improvement - The optimization made a meaningful difference.
Example: When rashomon finds no real improvement
/recipe-eval-prompt Summarize this article in 3 bullet points
Result: Variance - Prompt was already well-scoped; differences were stylistic only.
Installation
Requires Claude Code (this is a Claude Code plugin)
# 1. Start Claude Code
claude
# 2. Install the marketplace
/plugin marketplace add shinpr/rashomon
# 3. Install plugin
/plugin install rashomon@rashomon
# 4. Restart session (required)
# Exit and restart Claude Code
Usage
Skill Evaluation
/recipe-eval-skill create
Create a new skill and evaluate its effectiveness.
/recipe-eval-skill my-skill-name what to change
Update an existing skill and compare old vs new.
Prompt Evaluation
/recipe-eval-prompt Your prompt here
From a file:
/recipe-eval-prompt Generate code following this skill: ./prompts/my-skill.md
For complex tasks that need more time, just mention it in natural language:
/recipe-eval-prompt Refactor the entire authentication module. This might take a while.
How It Works
Skill Evaluation (/recipe-eval-skill)
├── skill-creator (generates/modifies skills)
├── skill-reviewer (grades quality A/B/C)
├── eval-executor ×2 (isolated worktrees)
└── skill-eval-reporter (blind A/B comparison)
Prompt Evaluation (/recipe-eval-prompt)
├── prompt-analyzer (analyzes and optimizes)
├── prompt-executor ×2 (isolated worktrees)
└── report-generator (compares results)
Technical Details
Isolated Execution
rashomon uses git worktrees to run both versions in completely separate environments. A worktree is a Git feature that creates independent working directories from the same repository—this ensures the two executions don't interfere with each other.
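For readers unfamiliar with worktrees, the isolation mechanism can be demonstrated standalone; the repository and branch names below are illustrative only, not the paths rashomon itself uses:

```shell
# Create a throwaway repository, then add two independent working
# directories from it -- each gets its own files and its own branch.
git init worktree-demo && cd worktree-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -m "initial commit"
git worktree add ../demo-with-skill -b with-skill
git worktree add ../demo-without-skill -b without-skill
git worktree list   # main tree plus the two isolated trees
```

Changes made in `demo-with-skill` are invisible to `demo-without-skill` until committed and merged, which is what keeps the two evaluation runs from interfering.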
Improvement Classification
Not all differences are improvements. rashomon classifies results into four categories:
| Classification | Meaning | Recommendation |
|---|---|---|
| Structural | Real improvement in accuracy, completeness, or quality | Use the new version |
| Context Addition | One version had more project-specific knowledge | Useful if the context is accurate |
| Expressive | Different wording, same substance | Either version is fine |
| Variance | Just normal LLM randomness | Original was already good |
Classification is based on:
- Whether detected issues were resolved
- Output completeness and constraint adherence
- Agreement between blind quality assessment and observable output differences
Both skill review and prompt analysis check against 8 common patterns:
| Priority | Issues |
|---|---|
| Critical | Negative instructions ("don't do X"), vague instructions, missing output format |
| High Impact | Unstructured prompts, missing context, complex tasks without breakdown |
| Enhancement | Biased examples, no permission for uncertainty |
P1: Critical (Must Fix)
| ID | Pattern | Problem | Fix |
|---|---|---|---|
| BP-001 | Negative Instructions | "Don't do X" often backfires—LLMs focus on what's mentioned | Reframe positively: "Don't include opinions" → "Include only factual information" |
| BP-002 | Vague Instructions | Missing specifics cause high output variance | Add explicit constraints: format, length, scope, tone |
| BP-003 | Missing Output Format | No format spec leads to inconsistent outputs | Define expected structure: JSON schema, section headers, etc. |
P2: High Impact (Should Fix)
| ID | Pattern | Problem | Fix |
|---|---|---|---|
| BP-004 | Unstructured Prompt | Wall of text obscures priorities | Apply 4-block pattern: Context / Task / Constraints / Output Format |
| BP-005 | Missing Context | No background leads to wrong assumptions | Add purpose, audience, relevant constraints |
| BP-006 | Complex Task | Undivided complex tasks have higher error rates | Break into steps with quality checkpoints |
P3: Enhancement (Could Fix)
| ID | Pattern | Problem | Fix |
|---|---|---|---|
| BP-007 | Biased Examples | Homogeneous examples cause overfitting | Diversify: include edge cases, different formats |
| BP-008 | No Uncertainty Permission | No "I don't know" option causes hallucination | Add: "If unsure, say so" |
Knowledge Base
rashomon learns from your project over time.
Location: .claude/.rashomon/prompt-knowledge.yaml
How it works:
- Automatically enabled when the file exists
- Stores project-specific patterns (not generic best practices)
- Referenced during analysis, updated after comparisons
- Max 20 entries, lowest-confidence ones removed first
Key principle: Old knowledge isn't automatically removed. Patterns that have worked for a long time are often the most valuable.
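The file's schema is not documented here, so the fragment below is a purely hypothetical illustration of what a project-specific entry might look like; the key names are invented for the example:

```yaml
# Hypothetical illustration only -- the actual schema of
# .claude/.rashomon/prompt-knowledge.yaml is not documented here.
patterns:
  - id: example-001
    description: "API handlers in this project return typed error objects"
    confidence: 0.8
```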
Troubleshooting
Leftover worktrees
If rashomon exits unexpectedly, temporary directories might remain:
# Worktrees are stored in system temp directory
# Clean up manually if needed:
rm -rf ${TMPDIR:-/tmp}/worktree-rashomon-*
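If `git worktree list` still shows the deleted directories afterwards, `git worktree prune` clears the stale records. A self-contained illustration (repository names are examples, not rashomon's own paths):

```shell
git init prune-demo && cd prune-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -m "initial commit"
git worktree add ../prune-demo-wt
rm -rf ../prune-demo-wt   # simulate a crash that left only stale metadata
git worktree prune        # drop git's record of the missing worktree
git worktree list         # only the main working tree remains
```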
Timeout issues
For complex prompts that need more time, mention it when invoking:
/recipe-eval-prompt Complex task here. This might take longer than usual.
"Not a git repository" error
rashomon requires a git repository. Initialize one with:
git init
Requirements
- Git 2.5+
- Python 3.9+
- Claude Code
- Must run inside a git repository
License
MIT