skill-conductor
Health Pass
- License — MIT
- Description — Repository has a description
- Active repo — Last pushed today
- Community trust — 73 GitHub stars
Code Pass
- Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions Pass
- Permissions — No dangerous permissions requested
This agent provides a structured, architecture-first lifecycle for designing, testing, evaluating, and packaging AI skills. It integrates an evaluation engine using multiple grading agents, blind A/B testing, and benchmarking to ensure skills meet strict quality thresholds.
Security Assessment
The automated code scan reviewed 12 files and found no dangerous patterns, no hardcoded secrets, and no excessive or dangerous permissions. Based on the provided documentation, the tool executes local Python scripts (such as smoke tests), runs evaluations in isolated contexts, and relies on automated pipelines that inherently involve local system and network interactions. These actions, however, appear strictly confined to standard development, testing, and packaging workflows rather than data exfiltration or other malicious behavior. Overall security risk is rated Low.
Quality Assessment
The project demonstrates strong maintenance and community confidence. It received a push update as recently as today, indicating highly active development. The repository is transparent, includes clear descriptions and comprehensive documentation, and has garnered 73 GitHub stars, reflecting positive community reception and trust. Furthermore, it is fully open-source under the standard and permissive MIT license, making it highly accessible for both personal and commercial use.
Verdict
Safe to use.
Architecture-first skill lifecycle for AI agents. 5 modes: CREATE → EVAL → EDIT → REVIEW → PACKAGE. Integrates Anthropic's eval engine (grader/comparator/analyzer agents, blind A/B, benchmarks) with architecture patterns, TDD baseline, and 5-axis scoring. Not just testing - full design-to-distribution.
Skill Conductor
Architecture-first skill lifecycle: design → build → test → evaluate → package.
Most skill tools jump straight to "write SKILL.md." Conductor makes you choose the architecture first - because rewriting a wrong pattern costs more than writing it right.
v3: SOP practices + smoke tests
New in v3:
- `references/sop-practices.md` — 80 years of Standard Operating Procedure wisdom applied to skill authoring: inline checklists at risk points, pre-flight checks, programmatic validation, exception-handling patterns. Use it for procedural skills (client intake, onboarding, reporting, escalation)
- `scripts/test_smoke.py` — fast safety net for skill-conductor's own scripts. Verifies that critical scripts execute on known-good skills, fail on known-bad ones, and produce the expected output shapes. Run: `uv run scripts/test_smoke.py`
- Updated eval agents (grader, comparator, analyzer) with refined rubrics
- Improved `package_skill.py`, `eval_skill.py`, and schema validation
- Updated `patterns.md` and `schemas.md` with tighter definitions
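A minimal sketch of that smoke-test idea, under stated assumptions: `make_skill` and `validate` are hypothetical stand-ins, not the actual `test_smoke.py`. The shape is what matters — the checker must pass on a known-good skill and fail on a known-bad one.

```python
import tempfile
from pathlib import Path

def make_skill(root: Path, body: str) -> Path:
    """Create a minimal skill directory containing a SKILL.md."""
    skill = root / "demo-skill"
    skill.mkdir()
    (skill / "SKILL.md").write_text(body)
    return skill

def validate(skill_dir: Path) -> bool:
    """Stand-in for a quick_validate-style check: SKILL.md must exist and
    start with YAML frontmatter that declares a description."""
    manifest = skill_dir / "SKILL.md"
    if not manifest.is_file():
        return False
    text = manifest.read_text()
    return text.startswith("---") and "description:" in text

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    good = make_skill(root, "---\nname: demo\ndescription: Example.\n---\nBody.\n")
    assert validate(good), "known-good skill should pass"
    bad = root / "broken-skill"
    bad.mkdir()  # deliberately missing SKILL.md
    assert not validate(bad), "known-bad skill should fail"
    print("smoke OK")
```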
v2: Anthropic's eval engine meets architecture-first design
Anthropic updated their skill-creator with serious eval infrastructure. We took the best of it:
From Anthropic's skill-creator (new):
- 3 specialized agents: grader (assertion checking + claim extraction), comparator (blind A/B testing), analyzer (post-hoc root cause analysis)
- Parallel eval execution with isolated contexts (no cross-contamination)
- Automated description optimization with train/test split (60/40)
- Benchmark tracking: pass rate, tokens, time with variance analysis
- HTML eval viewer with qualitative + quantitative tabs
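The 60/40 train/test split used for description optimization can be sketched like this (illustrative only — `split_evals` is an assumed name, not the pipeline's API): tune against the train set, score on the held-out test set so the description doesn't overfit the eval cases.

```python
import random

def split_evals(evals, train_frac=0.6, seed=42):
    """Shuffle eval cases, then split 60/40: optimize the description
    against the train set and measure it on the held-out test set."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(evals)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

cases = [f"case-{i}" for i in range(10)]
train, test = split_evals(cases)
print(len(train), len(test))  # 6 4
```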
What Conductor adds on top:
- Architecture before code. 5 patterns (Sequential, Iterative, Context-Aware, Domain Intelligence, Multi-MCP) with selection criteria. Pick wrong = rewrite everything later
- Degrees of freedom. Low (deterministic scripts) → Medium (pseudocode) → High (free text). Match freedom to risk tolerance
- TDD RED before writing. Verify the agent fails WITHOUT the skill first. If it already handles the task, you don't need a skill. Anthropic's creator runs baselines in parallel with skill runs; Conductor runs the baseline BEFORE you write anything
- 5-axis scoring with thresholds. Discovery, Clarity, Efficiency, Robustness, Completeness. Each 1-10. Score 45-50 = production. Below 25 = rewrite. Not "vibe check" - numbers
- Skill categorization. Capability uplift (teaching something new) vs Encoded preference (sequencing known abilities). Different skills need different testing strategies
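The scoring thresholds above translate directly into code. A sketch under stated assumptions: the README fixes only 45-50 = production and below 25 = rewrite, so the middle "iterate" band is my labeling, not Conductor's.

```python
AXES = ("discovery", "clarity", "efficiency", "robustness", "completeness")

def verdict(scores: dict) -> str:
    """Sum five 1-10 axis scores; 45-50 means production-ready,
    below 25 means rewrite, anything between needs more work."""
    assert set(scores) == set(AXES), "score every axis exactly once"
    assert all(1 <= v <= 10 for v in scores.values())
    total = sum(scores.values())
    if total >= 45:
        return "production"
    if total < 25:
        return "rewrite"
    return "iterate"  # assumed label for the middle band

print(verdict({axis: 9 for axis in AXES}))  # production
print(verdict({axis: 4 for axis in AXES}))  # rewrite
```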
Synthesized from
- Anthropic Skill Creator — eval infrastructure, grader/comparator/analyzer agents, benchmark pipeline
- The Complete Guide to Building Skills for Claude — 32 pages, 5 architecture patterns, success metrics
- Superpowers / writing-skills by Jesse Vincent — TDD approach, the "description trap" discovery
- Skills Best Practices by Minko Gechev — three-stage LLM validation, eval methodology
5 Modes
| Mode | What it does |
|---|---|
| CREATE | Architecture selection → TDD baseline → scaffold → write → verify → refactor |
| EVAL | 3-stage evaluation: Discovery (triggering) → Logic (execution) → Edge Cases (breaking) |
| EDIT | Problem → Signal → Fix table. Targeted improvements without breaking what works |
| REVIEW | Pass/fail checklist for third-party skills before you install them |
| PACKAGE | Validate structure + package as .skill for distribution |
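In PACKAGE mode the shape of the work is "validate, then archive". A hedged sketch, assuming a `.skill` file is an ordinary zip and that `SKILL.md` is the minimal required manifest — the function name and checks are illustrative, not the real `package_skill.py`:

```python
import zipfile
from pathlib import Path

REQUIRED = ("SKILL.md",)  # assumed minimal structural requirement

def package_skill(skill_dir: Path, out_dir: Path) -> Path:
    """Refuse to package a structurally invalid skill, then write the
    whole directory into a .skill archive (treated here as a plain zip)."""
    missing = [f for f in REQUIRED if not (skill_dir / f).is_file()]
    if missing:
        raise ValueError(f"invalid skill, missing: {missing}")
    archive = out_dir / f"{skill_dir.name}.skill"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(skill_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(skill_dir.parent))
    return archive
```

Validation failing loudly before any bytes are written is the point: a broken archive that installs is worse than no archive.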
Architecture patterns
Choose before writing a single line:
| Pattern | Use when |
|---|---|
| Sequential workflow | Clear step-by-step process |
| Iterative refinement | Output improves with cycles |
| Context-aware selection | Same goal, different tools by context |
| Domain intelligence | Specialized knowledge beyond tool access |
| Multi-MCP coordination | Workflow spans multiple services |
Eval infrastructure
```
            ┌─────────┐
            │  SKILL  │
            └────┬────┘
                 │
      ┌──────────┼──────────┐
      │          │          │
┌─────▼────┐ ┌───▼───┐ ┌────▼───┐
│  Grader  │ │  A/B  │ │Analyzer│
│assertions│ │ blind │ │  root  │
│ + claims │ │compare│ │ cause  │
└─────┬────┘ └───┬───┘ └────┬───┘
      │          │          │
      └──────────┼──────────┘
                 │
           ┌─────▼─────┐
           │ Benchmark │
           │ mean±std  │
           └───────────┘
```
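The Benchmark stage reduces repeated eval runs to mean ± std per metric. A minimal sketch of that aggregation (illustrative, not the actual `aggregate_benchmark.py`; metric names assumed):

```python
from statistics import mean, stdev

def aggregate(runs):
    """Collapse repeated eval runs into (mean, std) per metric so that
    pass rate, tokens, and time can be reported with variance."""
    metrics = runs[0].keys()
    return {
        m: (mean(r[m] for r in runs),
            stdev(r[m] for r in runs) if len(runs) > 1 else 0.0)
        for m in metrics
    }

runs = [
    {"pass_rate": 0.8, "tokens": 1200, "seconds": 14.0},
    {"pass_rate": 0.9, "tokens": 1100, "seconds": 12.0},
    {"pass_rate": 0.7, "tokens": 1300, "seconds": 16.0},
]
summary = aggregate(runs)
print(f"pass rate {summary['pass_rate'][0]:.2f} ± {summary['pass_rate'][1]:.2f}")
# → pass rate 0.80 ± 0.10
```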
Installation
```
skills/
└── skill-conductor/
    ├── SKILL.md
    ├── agents/
    │   ├── grader.md
    │   ├── comparator.md
    │   └── analyzer.md
    ├── eval-viewer/
    │   ├── generate_review.py
    │   └── viewer.html
    ├── references/
    │   ├── patterns.md
    │   ├── schemas.md
    │   └── sop-practices.md
    ├── assets/
    │   └── eval_review.html
    └── scripts/
        ├── init_skill.py
        ├── eval_skill.py
        ├── run_eval.py
        ├── run_loop.py
        ├── improve_description.py
        ├── aggregate_benchmark.py
        ├── generate_report.py
        ├── package_skill.py
        ├── quick_validate.py
        ├── test_smoke.py
        └── utils.py
```
- OpenClaw: drop into `~/.openclaw/workspace/skills/`
- Claude Code: drop into `.claude/skills/`
Auto-activates when the agent detects a skill-building task.
Key discovery
Never put process steps in the skill description. If your description says "exports assets, generates specs, creates tasks" - the model follows the description and skips the body. Tested experimentally.
```yaml
# ✅ Good
description: Analyze design files for developer handoff. Use when user uploads .fig files.

# ❌ Bad - model follows this and ignores SKILL.md body
description: Exports Figma assets, generates specs, creates Linear tasks, posts to Slack.
```
License
MIT