# cc-plugin-eval

A 4-stage evaluation framework for testing Claude Code plugin component triggering. Validates whether skills, agents, commands, hooks, and MCP servers correctly activate when expected.
## Why This Exists
Claude Code plugins contain multiple component types (skills, agents, commands) that trigger based on user prompts. Testing these triggers manually is time-consuming and error-prone. This framework automates the entire evaluation process:
- Discovers all components in your plugin
- Generates test scenarios (positive and negative cases)
- Executes scenarios against the Claude Agent SDK
- Evaluates whether the correct component triggered
## Features
| Feature | Description |
|---|---|
| 4-Stage Pipeline | Analysis → Generation → Execution → Evaluation |
| Multi-Component | Skills, agents, commands, hooks, and MCP servers |
| Programmatic Detection | 100% confidence detection by parsing tool captures |
| Semantic Testing | Synonym and paraphrase variations to test robustness |
| Resume Capability | Checkpoint after each stage, resume interrupted runs |
| Cost Estimation | Token and USD estimates before execution |
| Batch API Support | 50% cost savings on large runs via Anthropic Batches API |
| Multiple Formats | JSON, YAML, JUnit XML, TAP output |
## Quick Start

### Prerequisites
- Node.js >= 20.0.0
- An Anthropic API key
### Installation

```bash
# Clone the repository
git clone https://github.com/sjnims/cc-plugin-eval.git
cd cc-plugin-eval

# Install dependencies
npm install

# Build
npm run build

# Create a .env file with your API key
echo "ANTHROPIC_API_KEY=sk-ant-your-key-here" > .env
```
### Run Your First Evaluation

```bash
# See the cost estimate without running (recommended first)
npx cc-plugin-eval run -p ./path/to/your/plugin --dry-run

# Run the full evaluation
npx cc-plugin-eval run -p ./path/to/your/plugin
```
## How It Works

```mermaid
flowchart LR
    subgraph Input
        P[Plugin Directory]
    end
    subgraph Pipeline
        S1[**Stage 1: Analysis**<br/>Parse plugin structure,<br/>extract triggers]
        S2[**Stage 2: Generation**<br/>Create test scenarios<br/>positive & negative]
        S3[**Stage 3: Execution**<br/>Run scenarios via<br/>Agent SDK]
        S4[**Stage 4: Evaluation**<br/>Detect triggers,<br/>calculate metrics]
    end
    P --> S1 --> S2 --> S3 --> S4
    S1 --> O1[analysis.json]
    S2 --> O2[scenarios.json]
    S3 --> O3[transcripts/]
    S4 --> O4[evaluation.json]
```
### Stage Details
| Stage | Purpose | Method | Output |
|---|---|---|---|
| 1. Analysis | Parse plugin structure, extract trigger phrases | Deterministic parsing | analysis.json |
| 2. Generation | Create test scenarios | LLM for skills/agents, deterministic for commands | scenarios.json |
| 3. Execution | Run scenarios against Claude Agent SDK | Tool capture hooks | transcripts/ |
| 4. Evaluation | Detect triggers, calculate metrics | Programmatic first, LLM judge for quality | evaluation.json |
### Scenario Types

Each component generates multiple scenario types to thoroughly test triggering:

| Type | Description | Example |
|---|---|---|
| `direct` | Exact trigger phrase | "create a skill" |
| `paraphrased` | Same intent, different words | "add a new skill to my plugin" |
| `edge_case` | Unusual but valid | "skill plz" |
| `negative` | Should NOT trigger | "tell me about database skills" |
| `semantic` | Synonym variations | "generate a skill" vs "create a skill" |
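For orientation, here is a rough sketch of what one generated scenario might look like as a TypeScript value. The field names below are illustrative assumptions, not the published schema; consult the exported `TestScenario` type (see Programmatic Usage) for the real shape.

```typescript
// Hypothetical scenario shape, for illustration only.
// The actual fields live in the exported TestScenario type.
const scenario = {
  id: "skill-create-paraphrased-002", // assumed id convention, mirroring scenario_id in the sample output
  component: "create-skill",          // assumed: the component under test
  scenario_type: "paraphrased",       // one of the types in the table above
  prompt: "add a new skill to my plugin",
  should_trigger: true,               // a negative scenario would set this to false
};

console.log(scenario.prompt);
```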
## CLI Reference

### Full Pipeline

```bash
# Run the complete evaluation
cc-plugin-eval run -p ./plugin

# With options
cc-plugin-eval run -p ./plugin \
  --config custom-config.yaml \
  --verbose \
  --samples 3
```
### Individual Stages

```bash
# Stage 1: Analysis only
cc-plugin-eval analyze -p ./plugin

# Stages 1-2: Analysis + Generation
cc-plugin-eval generate -p ./plugin

# Stages 1-3: Analysis + Generation + Execution
cc-plugin-eval execute -p ./plugin
```
### Resume & Reporting

```bash
# Resume an interrupted run
cc-plugin-eval resume -r <run-id>

# List previous runs
cc-plugin-eval list -p ./plugin

# Generate a report from existing results
cc-plugin-eval report -r <run-id> --output junit-xml
```
### Common Options

| Option | Description |
|---|---|
| `-p, --plugin <path>` | Plugin directory path |
| `-c, --config <path>` | Config file (default: `config.yaml`) |
| `--dry-run` | Generate scenarios without execution |
| `--estimate` | Show cost estimate before execution |
| `--verbose` | Enable debug output |
| `--fast` | Only run previously failed scenarios |
| `--no-batch` | Force synchronous (non-batch) execution |
| `--rewind` | Undo file changes after each scenario |
| `--semantic` | Enable semantic variation testing |
| `--samples <n>` | Multi-sample judgment count |
| `--reps <n>` | Repetitions per scenario |
| `--output <format>` | Output format: `json`, `yaml`, `junit-xml`, `tap` |
## Programmatic Usage

In addition to the CLI, cc-plugin-eval exports a programmatic API for integration into build systems, test frameworks, and custom tooling.

### Installation

```bash
npm install cc-plugin-eval
```
### Basic Usage

```typescript
import {
  runAnalysis,
  runGeneration,
  runExecution,
  runEvaluation,
  loadConfigWithOverrides,
  consoleProgress,
} from "cc-plugin-eval";
import type {
  EvalConfig,
  AnalysisOutput,
  TestScenario,
} from "cc-plugin-eval/types";

// Load configuration
const config = loadConfigWithOverrides("config.yaml", {
  plugin: "./path/to/plugin",
});

// Stage 1: Analyze plugin structure
const analysis = await runAnalysis(config);

// Stage 2: Generate test scenarios
const generation = await runGeneration(analysis, config);

// Stage 3: Execute scenarios (captures tool interactions)
const execution = await runExecution(
  analysis,
  generation.scenarios,
  config,
  consoleProgress, // or provide custom progress callbacks
);

// Stage 4: Evaluate results
const evaluation = await runEvaluation(
  analysis.plugin_name,
  generation.scenarios,
  execution.results,
  config,
  consoleProgress,
);

console.log(`Accuracy: ${(evaluation.metrics.accuracy * 100).toFixed(1)}%`);
```
### Public API Exports

| Export | Description |
|---|---|
| `runAnalysis` | Stage 1: Parse plugin structure and extract triggers |
| `runGeneration` | Stage 2: Generate test scenarios for components |
| `runExecution` | Stage 3: Execute scenarios and capture tool interactions |
| `runEvaluation` | Stage 4: Evaluate results and calculate metrics |
| `loadConfigWithOverrides` | Load configuration with CLI-style overrides |
| `consoleProgress` | Default progress reporter (console output) |
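For CI pipelines, the four stage functions compose into a simple quality gate. A minimal sketch, assuming only the exports documented above and the `metrics.accuracy` field shown in the sample output (the threshold value is arbitrary):

```typescript
import {
  runAnalysis,
  runGeneration,
  runExecution,
  runEvaluation,
  loadConfigWithOverrides,
  consoleProgress,
} from "cc-plugin-eval";

const ACCURACY_THRESHOLD = 0.9; // arbitrary gate for this example

const config = loadConfigWithOverrides("config.yaml", { plugin: "./my-plugin" });

const analysis = await runAnalysis(config);
const generation = await runGeneration(analysis, config);
const execution = await runExecution(analysis, generation.scenarios, config, consoleProgress);
const evaluation = await runEvaluation(
  analysis.plugin_name,
  generation.scenarios,
  execution.results,
  config,
  consoleProgress,
);

// Fail the build when trigger accuracy drops below the gate.
if (evaluation.metrics.accuracy < ACCURACY_THRESHOLD) {
  console.error(`Accuracy ${evaluation.metrics.accuracy} is below ${ACCURACY_THRESHOLD}`);
  process.exit(1);
}
```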
### Types

Import types via the `cc-plugin-eval/types` subpath:

```typescript
import type {
  EvalConfig,
  AnalysisOutput,
  TestScenario,
  ExecutionResult,
  EvaluationResult,
  EvalMetrics,
} from "cc-plugin-eval/types";
```
## Configuration

Configuration is managed via `config.yaml`. Here's a quick reference:

### Scope (What to Test)

```yaml
scope:
  skills: true # Evaluate skill components
  agents: true # Evaluate agent components
  commands: true # Evaluate command components
  hooks: false # Evaluate hook components
  mcp_servers: false # Evaluate MCP server components
```
### Generation (Stage 2)

```yaml
generation:
  model: "claude-sonnet-4-5-20250929"
  scenarios_per_component: 5 # Test scenarios per component
  diversity: 0.7 # 0.0-1.0; higher = more unique scenarios
  semantic_variations: true # Generate synonym variations
```
### Execution (Stage 3)

```yaml
execution:
  model: "claude-sonnet-4-20250514"
  max_turns: 5 # Conversation turns per scenario
  timeout_ms: 60000 # Timeout per scenario (1 min)
  max_budget_usd: 10.0 # Stop if cost exceeds this
  disallowed_tools: # Safety: block file operations
    - Write
    - Edit
    - Bash
```
### Evaluation (Stage 4)

```yaml
evaluation:
  model: "claude-sonnet-4-5-20250929"
  detection_mode: "programmatic_first" # Or "llm_only"
  num_samples: 1 # Multi-sample judgment
```

See the full config.yaml for all options, including:

- `tuning`: Fine-tune timeouts, retry behavior, and token estimates
- `conflict_detection`: Detect when multiple components trigger for the same prompt
- `batch_threshold`: Use the Anthropic Batches API for cost savings (50% discount)
- `sanitization`: PII redaction with ReDoS-safe custom patterns
## Performance Optimization

### Session Batching (Default)

By default, scenarios testing the same component share a session, with `/clear` between them. This reduces subprocess overhead by ~80%:
| Mode | Overhead per Scenario | 100 Scenarios |
|---|---|---|
| Batched (default) | ~1-2s after first | ~2-3 minutes |
| Isolated | ~5-8s each | ~8-13 minutes |
The `/clear` command resets conversation history between scenarios while reusing the subprocess and loaded plugin, as sketched below.
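An illustrative sketch of the batching idea — this is not the project's actual implementation, and `runScenario` and `sendClear` are hypothetical helpers:

```typescript
// Group scenarios by the component they target, then reuse one session per
// group, issuing /clear between scenarios instead of respawning the subprocess.
type Scenario = { component: string; prompt: string };

async function runBatched(
  scenarios: Scenario[],
  runScenario: (prompt: string) => Promise<void>, // hypothetical: run one prompt in the live session
  sendClear: () => Promise<void>,                 // hypothetical: issue /clear to the session
): Promise<void> {
  const groups = new Map<string, Scenario[]>();
  for (const s of scenarios) {
    const group = groups.get(s.component) ?? [];
    group.push(s);
    groups.set(s.component, group);
  }

  for (const [, group] of groups) {
    for (let i = 0; i < group.length; i++) {
      if (i > 0) await sendClear(); // reset history, keep the subprocess
      await runScenario(group[i].prompt);
    }
  }
}
```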
### When to Use Isolated Mode

Switch to isolated mode when you need complete separation between scenarios:

- Testing plugins that modify filesystem state
- Debugging cross-contamination issues between scenarios
- When using `rewind_file_changes: true` (which automatically uses isolated mode)

To use isolated mode:

```yaml
execution:
  session_strategy: "isolated"
```

Or via the deprecated (but still supported) option:

```yaml
execution:
  session_isolation: true
```
## Output Structure

After a run, results are saved to:

```text
results/
└── {plugin-name}/
    └── {run-id}/
        ├── state.json               # Pipeline state (for resume)
        ├── analysis.json            # Stage 1: Parsed components
        ├── scenarios.json           # Stage 2: Generated test cases
        ├── execution-metadata.json  # Stage 3: Execution stats
        ├── evaluation.json          # Stage 4: Results & metrics
        └── transcripts/
            └── {scenario-id}.json   # Individual execution transcripts
```
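Because the layout is stable, downstream tooling can read results directly. A minimal sketch using Node's standard library (the plugin name and run id are placeholders):

```typescript
import { readFile } from "node:fs/promises";
import { join } from "node:path";

// Placeholder identifiers; substitute your own plugin name and run id.
const runDir = join("results", "my-plugin", "2024-01-01-abc123");

// evaluation.json holds the Stage 4 results and metrics.
const evaluation = JSON.parse(
  await readFile(join(runDir, "evaluation.json"), "utf8"),
);

// Fields shown in the sample output below.
console.log(evaluation.metrics.accuracy, evaluation.metrics.avg_quality);
```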
### Sample Evaluation Output

```json
{
  "results": [
    {
      "scenario_id": "skill-create-direct-001",
      "triggered": true,
      "confidence": 100,
      "quality_score": 9.2,
      "detection_source": "programmatic",
      "has_conflict": false
    }
  ],
  "metrics": {
    "total_scenarios": 25,
    "accuracy": 0.92,
    "trigger_rate": 0.88,
    "avg_quality": 8.7,
    "conflict_count": 1
  }
}
```
## Detection Strategy

Programmatic detection is primary, for maximum accuracy (a sketch of the parsing step appears at the end of this section):

- During execution, tool capture hooks record all tool invocations
- Tool captures are parsed to detect `Skill`, `Task`, and `SlashCommand` calls
- MCP tools are detected via the pattern `mcp__<server>__<tool>`
- Hooks are detected via `SDKHookResponseMessage` events
- Confidence is 100% for programmatic detection

The LLM judge is secondary, used for:

- Quality assessment (0-10 score)
- Edge cases where programmatic detection is ambiguous
- Multi-sample consensus when configured
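To make the programmatic step concrete, here is an illustrative sketch of parsing tool captures — with hypothetical types and field names, not the project's actual code:

```typescript
// Hypothetical capture record; the real capture format may differ.
type ToolCapture = { tool_name: string; input: Record<string, unknown> };

const COMPONENT_TOOLS = new Set(["Skill", "Task", "SlashCommand"]);
const MCP_TOOL_PATTERN = /^mcp__(.+?)__(.+)$/; // mcp__<server>__<tool>

function detectTrigger(captures: ToolCapture[]): { triggered: boolean; source?: string } {
  for (const capture of captures) {
    // Direct component invocations: Skill, Task, SlashCommand.
    if (COMPONENT_TOOLS.has(capture.tool_name)) {
      return { triggered: true, source: capture.tool_name };
    }
    // MCP tools follow the mcp__<server>__<tool> naming pattern.
    const mcp = MCP_TOOL_PATTERN.exec(capture.tool_name);
    if (mcp) {
      return { triggered: true, source: `mcp:${mcp[1]}/${mcp[2]}` };
    }
  }
  return { triggered: false };
}
```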
## Development

```bash
npm install       # Install dependencies
npm run build     # Build TypeScript
npm test          # Run tests
npm run lint      # Lint code
npm run typecheck # Type check
```
See CONTRIBUTING.md for detailed development setup, code style, testing requirements, and pull request guidelines.
## Roadmap

- Phase 1: Skills, agents, and commands evaluation
- Phase 2: Hooks evaluation (PR #58)
- Phase 3: MCP servers evaluation (PR #63)
- Phase 4: Cross-plugin conflict detection
- Phase 5: Marketplace evaluation
## Security Considerations

### Permission Bypass Mode

Default: `execution.permission_bypass: true` enables automated evaluation by automatically approving all tool invocations. This is required for unattended runs but has security implications:

- ✅ Required for CI/CD and automated evaluation
- ⚠️ Plugins can perform any action permitted by allowed tools
- 🔒 Use `disallowed_tools` to restrict dangerous operations (default: `[Write, Edit, Bash]`)
- 🔒 For untrusted plugins, set `permission_bypass: false` for manual review (disables automation)

**Security Note:** With permission bypass enabled, use strict `disallowed_tools` and run in sandboxed environments when evaluating untrusted plugins.
### PII Protection & Compliance

Default: `output.sanitization.enabled: false` for backwards compatibility. Enable sanitization for PII-sensitive environments:

```yaml
output:
  sanitize_transcripts: true # Redact saved files
  sanitize_logs: true # Redact console output
  sanitization:
    enabled: true
    custom_patterns: # Optional domain-specific patterns
      - pattern: "INTERNAL-\\w+"
        replacement: "[REDACTED_ID]"
```

**Built-in redaction:** API keys, JWT tokens, emails, phone numbers, SSNs, and credit card numbers.

**Enterprise use cases:** Enable sanitization when handling PII or complying with GDPR, HIPAA, SOC 2, or similar regulations.
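The custom-pattern mechanism amounts to regex substitution over output text. An illustrative sketch of the idea — this is not the project's sanitizer, and it omits the ReDoS validation the real one performs:

```typescript
// Apply custom redaction patterns like the config example above.
// Illustration only; the real sanitizer also validates patterns for ReDoS.
const customPatterns = [
  { pattern: "INTERNAL-\\w+", replacement: "[REDACTED_ID]" },
];

function sanitize(text: string): string {
  let result = text;
  for (const { pattern, replacement } of customPatterns) {
    result = result.replace(new RegExp(pattern, "g"), replacement);
  }
  return result;
}

console.log(sanitize("Ticket INTERNAL-8842 escalated."));
// -> "Ticket [REDACTED_ID] escalated."
```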
### Default Tool Restrictions

The default `disallowed_tools: [Write, Edit, Bash]` prevents file modifications and shell commands. Modify it with caution:

- Enable `Write`/`Edit` only if testing file-modifying plugins
- Enable `Bash` only if testing shell-executing plugins
- Use `rewind_file_changes: true` to restore files after each scenario
### Additional Safeguards

- **API keys:** Loaded from environment variables (`.env`), never stored in config
- **Budget limits:** Set `execution.max_budget_usd` to cap API spending
- **Timeout limits:** Set `execution.timeout_ms` to prevent runaway executions
- **Plugin loading:** Only local plugins are supported (`plugin.path`); no remote loading
- **ReDoS protection:** Custom sanitization patterns are validated against Regular Expression Denial of Service vulnerabilities
### Enterprise Deployments

For production and enterprise environments with compliance requirements, see the comprehensive security guide in SECURITY.md, including:
- Threat model and risk assessment
- Sandbox and isolation recommendations
- Compliance checklist (GDPR, HIPAA, SOC 2)
- Container isolation patterns
## Contributing
See CONTRIBUTING.md for development setup, code style, and pull request guidelines.
This project follows the Contributor Covenant code of conduct.
## License

## Author
Steve Nims (@sjnims)
## Acknowledgements
- Anthropic for Claude, the Anthropic SDK, and the Claude Agent SDK
- Bloom for architectural inspiration
- Boris Cherny for Claude Code
- Zod for runtime type validation
- Commander.js for CLI framework
- Vitest for testing
- Monster Energy for fuel
- deadmau5 for the beats