skill-conductor

agent
Security Audit
Passed
Health: Passed
  • License — MIT
  • Description — Repository has a description
  • Active repo — Last push today
  • Community trust — 73 GitHub stars
Code: Passed
  • Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions: Passed
  • Permissions — No dangerous permissions requested
Purpose

This agent provides a structured, architecture-first lifecycle for designing, testing, evaluating, and packaging AI skills. It integrates an evaluation engine using multiple grading agents, blind A/B testing, and benchmarking to ensure skills meet strict quality thresholds.

Security Assessment

The automated code scan reviewed 12 files and found no dangerous patterns, no hardcoded secrets, and no excessive or dangerous permissions. Based on the provided documentation, the tool executes local Python scripts (such as smoke tests), runs evaluations in isolated contexts, and drives automated pipelines that inherently involve local system and network interaction. These actions appear strictly confined to standard development, testing, and packaging workflows, with no indication of data exfiltration or other malicious behavior. Overall security risk is rated Low.

Quality Assessment

The project demonstrates strong maintenance and community confidence. It was last pushed to today, indicating highly active development. The repository is transparent, with a clear description and comprehensive documentation, and its 73 GitHub stars reflect positive community reception. It is open source under the permissive MIT license, making it suitable for both personal and commercial use.

Verdict

Safe to use.
SUMMARY

Architecture-first skill lifecycle for AI agents. 5 modes: CREATE → EVAL → EDIT → REVIEW → PACKAGE. Integrates Anthropic's eval engine (grader/comparator/analyzer agents, blind A/B, benchmarks) with architecture patterns, TDD baseline, and 5-axis scoring. Not just testing - full design-to-distribution.

README.md

Skill Conductor

Architecture-first skill lifecycle: design → build → test → evaluate → package.

Most skill tools jump straight to "write SKILL.md." Conductor makes you choose the architecture first - because rewriting a wrong pattern costs more than writing it right.

v3: SOP practices + smoke tests

New in v3:

  • references/sop-practices.md — 80 years of Standard Operating Procedure wisdom applied to skill authoring. Inline checklists at risk-points, pre-flight checks, programmatic validation, exception handling patterns. Use for procedural skills (client intake, onboarding, reporting, escalation)
  • scripts/test_smoke.py — fast safety net for skill-conductor scripts themselves. Verifies critical scripts execute on known-good skills, fail on known-bad, produce expected output shapes. Run: uv run scripts/test_smoke.py (sketched after this list)
  • Updated eval agents (grader, comparator, analyzer) with refined rubrics
  • Improved package_skill.py, eval_skill.py, and schema validation
  • Updated patterns.md and schemas.md with tighter definitions
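
The smoke-test idea in minimal sketch form, assuming hypothetical fixture skills and using quick_validate.py as the script under test; the actual checks in test_smoke.py are more thorough:

# Hypothetical smoke check in the spirit of test_smoke.py: run a validator
# against a known-good and a known-bad fixture skill and compare exit codes.
# The fixture paths are illustrative, not the repo's actual layout.
import subprocess
import sys

CHECKS = [
    ("fixtures/known_good_skill", True),    # valid skill: validator should pass
    ("fixtures/known_bad_skill", False),    # broken skill: validator should fail
]

def main() -> int:
    failures = 0
    for skill_dir, should_pass in CHECKS:
        result = subprocess.run(
            [sys.executable, "scripts/quick_validate.py", skill_dir],
            capture_output=True, text=True,
        )
        ok = (result.returncode == 0) == should_pass
        print(f"{'PASS' if ok else 'FAIL'}: {skill_dir} (exit {result.returncode})")
        failures += 0 if ok else 1
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())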

v2: Anthropic's eval engine meets architecture-first design

Anthropic updated their skill-creator with serious eval infrastructure. We took the best of it:

From Anthropic's skill-creator (new):

  • 3 specialized agents: grader (assertion checking + claim extraction), comparator (blind A/B testing), analyzer (post-hoc root cause analysis)
  • Parallel eval execution with isolated contexts (no cross-contamination)
  • Automated description optimization with a 60/40 train/test split (sketched after this list)
  • Benchmark tracking: pass rate, tokens, time with variance analysis
  • HTML eval viewer with qualitative + quantitative tabs
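
The train/test split in outline; the scenario format and the deterministic seed are assumptions, and the real pipeline lives in scripts/improve_description.py:

# Illustrative 60/40 split for description optimization.
import random

def split_scenarios(scenarios, train_frac=0.6, seed=42):
    # Shuffle deterministically so the held-out set never influences
    # which description candidate gets picked.
    shuffled = scenarios[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

scenarios = [f"scenario-{i}" for i in range(10)]   # placeholder eval cases
train, test = split_scenarios(scenarios)
# Optimize on `train` only; report the final pass rate on `test`.
print(len(train), "train /", len(test), "test")    # 6 train / 4 test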

What Conductor adds on top:

  • Architecture before code. 5 patterns (Sequential, Iterative, Context-Aware, Domain Intelligence, Multi-MCP) with selection criteria. Pick wrong = rewrite everything later
  • Degrees of freedom. Low (deterministic scripts) → Medium (pseudocode) → High (free text). Match freedom to risk tolerance
  • TDD RED before writing. Verify the agent fails WITHOUT the skill first. If it already handles the task - you don't need a skill. Creator runs baselines in parallel with skill runs. Conductor runs baseline BEFORE you write anything
  • 5-axis scoring with thresholds. Discovery, Clarity, Efficiency, Robustness, Completeness. Each 1-10. Score 45-50 = production. Below 25 = rewrite. Not "vibe check" - numbers (see the sketch after this list)
  • Skill categorization. Capability uplift (teaching something new) vs Encoded preference (sequencing known abilities). Different skills need different testing strategies
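
The 5-axis scoring in sketch form. The axes and the 45-50 / below-25 thresholds come from the bullet above; the "iterate" middle band is an assumption about how the gap between them is treated:

AXES = ("discovery", "clarity", "efficiency", "robustness", "completeness")

def verdict(scores: dict) -> str:
    assert set(scores) == set(AXES)
    assert all(1 <= s <= 10 for s in scores.values())
    total = sum(scores.values())               # 5 axes x 10 = max 50
    if total >= 45:
        return f"{total}/50: production-ready"
    if total < 25:
        return f"{total}/50: rewrite"
    return f"{total}/50: iterate"

print(verdict({"discovery": 9, "clarity": 10, "efficiency": 9,
               "robustness": 9, "completeness": 9}))   # 46/50: production-ready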

Synthesized from

  1. Anthropic Skill Creator — eval infrastructure, grader/comparator/analyzer agents, benchmark pipeline
  2. The Complete Guide to Building Skills for Claude — 32 pages, 5 architecture patterns, success metrics
  3. Superpowers / writing-skills by Jesse Vincent — TDD approach, the "description trap" discovery
  4. Skills Best Practices by Minko Gechev — three-stage LLM validation, eval methodology

5 Modes

Mode      What it does
CREATE    Architecture selection → TDD baseline → scaffold → write → verify → refactor
EVAL      3-stage evaluation: Discovery (triggering) → Logic (execution) → Edge Cases (breaking)
EDIT      Problem → Signal → Fix table. Targeted improvements without breaking what works
REVIEW    Pass/fail checklist for third-party skills before you install them
PACKAGE   Validate structure + package as .skill for distribution

Architecture patterns

Choose before writing a single line (a sketch of the selection criteria follows the table):

Pattern                    Use when
Sequential workflow        Clear step-by-step process
Iterative refinement       Output improves with cycles
Context-aware selection    Same goal, different tools by context
Domain intelligence        Specialized knowledge beyond tool access
Multi-MCP coordination     Workflow spans multiple services
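
A hypothetical helper that turns the table above into executable questions. It is not part of the shipped scripts, and the precedence order (most specific first) is an assumption:

QUESTIONS = [
    ("Workflow spans multiple services?", "Multi-MCP coordination"),
    ("Needs specialized knowledge beyond tool access?", "Domain intelligence"),
    ("Same goal, different tools by context?", "Context-aware selection"),
    ("Output improves with cycles?", "Iterative refinement"),
]

def choose_pattern(answers):
    # First "yes" wins; otherwise fall back to the simplest pattern.
    for (_, pattern), yes in zip(QUESTIONS, answers):
        if yes:
            return pattern
    return "Sequential workflow"

print(choose_pattern([False, False, True, False]))   # Context-aware selection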

Eval infrastructure

                     ┌─────────┐
                     │  SKILL  │
                     └────┬────┘
                          │
              ┌───────────┼──────────┐
              │           │          │
        ┌─────▼──────┐ ┌──▼────┐ ┌───▼────┐
        │   Grader   │ │  A/B  │ │Analyzer│
        │ assertions │ │ blind │ │  root  │
        │  + claims  │ │compare│ │ cause  │
        └────────────┘ └───────┘ └────────┘
              │           │          │
              └───────────┼──────────┘
                          │
                    ┌─────▼─────┐
                    │ Benchmark │
                    │ mean±std  │
                    └───────────┘
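
The Benchmark stage above reduces repeated eval runs to mean ± std. A sketch of that aggregation; the record fields (passed/tokens/seconds) and sample values are assumptions, and the real logic lives in scripts/aggregate_benchmark.py:

from statistics import mean, stdev

runs = [
    {"passed": True,  "tokens": 1480, "seconds": 21.3},
    {"passed": True,  "tokens": 1512, "seconds": 19.8},
    {"passed": False, "tokens": 1655, "seconds": 24.1},
]

pass_rate = sum(r["passed"] for r in runs) / len(runs)
print(f"pass rate: {pass_rate:.0%}")
for metric in ("tokens", "seconds"):
    values = [r[metric] for r in runs]
    print(f"{metric}: {mean(values):.1f} ± {stdev(values):.1f}")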

Installation

skills/
└── skill-conductor/
    ├── SKILL.md
    ├── agents/
    │   ├── grader.md
    │   ├── comparator.md
    │   └── analyzer.md
    ├── eval-viewer/
    │   ├── generate_review.py
    │   └── viewer.html
    ├── references/
    │   ├── patterns.md
    │   ├── schemas.md
    │   └── sop-practices.md
    ├── assets/
    │   └── eval_review.html
    └── scripts/
        ├── init_skill.py
        ├── eval_skill.py
        ├── run_eval.py
        ├── run_loop.py
        ├── improve_description.py
        ├── aggregate_benchmark.py
        ├── generate_report.py
        ├── package_skill.py
        ├── quick_validate.py
        ├── test_smoke.py
        └── utils.py

OpenClaw: drop into ~/.openclaw/workspace/skills/

Claude Code: drop into .claude/skills/

Auto-activates when the agent detects a skill-building task.

Key discovery

Never put process steps in the skill description. If your description says "exports assets, generates specs, creates tasks" - the model follows the description and skips the body. Tested experimentally.

# ✅ Good
description: Analyze design files for developer handoff. Use when user uploads .fig files.

# ❌ Bad - model follows this and ignores SKILL.md body
description: Exports Figma assets, generates specs, creates Linear tasks, posts to Slack.

License

MIT
