skills-bank

High-performance skill aggregation, classification & routing platform for AI agents.

Prerequisites

Rust 1.70+ (install via rustup)
Git (for repository cloning)
~2GB disk space (for aggregated skills cache)

📖 Overview

skills-bank aggregates skills (workflows, tasks, specialized agents) from 100+ distributed repositories and provides a unified routing system for AI agents to discover, load, and invoke them efficiently.

Core Design Principles

Source-of-Truth Loading: Agents load canonical SKILL.md files directly from source repositories, not from catalogs. This eliminates hallucination risks and optimizes token usage.
Hybrid Classification: A dual-stage pipeline combines fast keyword rules (Step A) with LLM-powered semantic classification (Step B) to route skills into 12 domain hubs and 40+ sub-hubs.
Smart Deduplication: Skills are deduplicated by name OR description — catching both exact collisions and cross-repo clones with different names but identical content.
Multi-Tool Support: Skills sync to major AI tools including GitHub Copilot, Claude-code, free-code (claude-code), Hermes, Cursor, Gemini, Antigravity, OpenCode, Codex, and Windsurf.
Token Efficiency: Agents load minimal metadata first, then fetch source files on demand instead of batch-loading entire catalogs.
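The two-key deduplication principle above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Skill` struct, the `normalize` helper, and the rule that empty descriptions never count as duplicates are all assumptions made for the example.

```rust
use std::collections::HashSet;

/// Hypothetical minimal skill record; the real type likely carries more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Skill {
    pub name: String,
    pub description: String,
}

/// Collapse whitespace and lowercase so trivial differences do not defeat matching.
/// (Assumed normalization; the README does not specify one.)
fn normalize(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ").to_lowercase()
}

/// Keep the first skill seen for each name and each description.
/// A skill is dropped when EITHER key has already been seen.
pub fn dedup_skills(skills: Vec<Skill>) -> Vec<Skill> {
    let mut seen_names: HashSet<String> = HashSet::new();
    let mut seen_descriptions: HashSet<String> = HashSet::new();
    let mut kept = Vec::new();

    for skill in skills {
        let name_key = normalize(&skill.name);
        let desc_key = normalize(&skill.description);
        // Assumption: an empty description is not evidence of a clone.
        let desc_dup = !desc_key.is_empty() && seen_descriptions.contains(&desc_key);
        if seen_names.contains(&name_key) || desc_dup {
            continue;
        }
        seen_names.insert(name_key);
        seen_descriptions.insert(desc_key);
        kept.push(skill);
    }
    kept
}
```

With this sketch, a skill named `pdf-extractor` whose description matches an earlier `pdf-tools` skill is dropped even though the names differ, which is the cross-repo clone case described above.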

🚀 Quick Start

1. Build the CLI

cd skills-bank/
cargo build --release
cargo run --release -- aggregate

2. Run the Full Pipeline

# Interactive setup (first run)
cargo run --release

# Or run all steps in sequence
cargo run --release -- run

Example Workflows

First-time setup:

cargo run --release -- setup
cargo run --release -- run

Validate before production sync:

cargo run --release -- doctor
cargo run --release -- release-gate
cargo run --release -- sync

The setup command launches an interactive wizard to configure:

Where skills should be synced (global, workspace, or both)
Which AI tools to sync to
Repository URLs to clone and aggregate
Excluded categories

🎮 Commands Reference

Core Pipeline Commands

| Command | Purpose | When to Use |
|---|---|---|
| `aggregate` | Collect, deduplicate, classify, and route skills from configured repositories to `skills-aggregated/` | First run or when repositories change |
| `sync` | Distribute aggregated skills to configured AI tool directories | After aggregation completes |
| `run` | Execute the full pipeline (aggregate → sync) in sequence | Daily updates or automated workflows |
| `setup` | Configure sync targets, repositories, and exclusions interactively | Initial setup only |
| `add-repo <URL>` | Add a new skill repository to the configuration | When onboarding new sources |
| `doctor` | Validate installation and report repository state | Troubleshooting or pre-cleanup inspection |
| `release-gate` | Validate aggregation output integrity | Before releases or production sync |
| `cleanup-legacy-duplicates` | Remove legacy repository folders from `src/` or `repos/` (only if matching `lib/` exists) | Migration from older versions |

📁 Project Structure

Source Code & Configuration

src/ — Rust source code: TUI, fetcher, aggregator, sync engine, classification logic
Cargo.toml — Rust manifest (dependencies, metadata, build targets)
.skills-bank-cli-config.json — User configuration file (generated by setup, contains sync targets and repository URLs)
.env-example — Environment variable template

Generated Outputs (After Aggregation)

skills-aggregated/ — Single source of truth containing:
- routing.csv — Skill-to-hub/sub-hub routing table
- subhub-index.json — Hub and sub-hub registry
- hub-manifests.csv — Master index of all skills
- .skill-lock.json — Aggregation metadata and timestamps
- Per-hub directories with skills-manifest.json files

Repository Cache

lib/ — Canonical cache for cloned skill repositories (populated by aggregate command)

Testing & Documentation

tests/ — Integration test suite for pipeline and TUI
archive/ — Legacy PowerShell scripts (original PoC phase)
package.json — Node.js manifest for npx distribution
readme.md — This file

📁 Repository Management

Cloning & Caching

Cache Location: lib/ (not src/) — This is the canonical directory for all cloned repositories.

Clone Strategy:

First clone: Shallow clone with git clone --depth 1 --single-branch --no-tags (faster, smaller disk footprint)
Subsequent runs: git pull in existing directories (avoid re-cloning)
Deduplication: Normalized remote URLs and repository names prevent duplicate clones
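As a rough illustration of the clone strategy above, the sketch below normalizes remote URLs into a comparable key and chooses between a shallow clone and a pull. The function names and the exact normalization rules are assumptions for this example; only the git flags come from the description above.

```rust
use std::path::Path;

/// Reduce a Git remote URL to a comparable key such as "github.com/owner/repo".
/// Assumption: these normalization rules are illustrative, not the project's own.
pub fn normalize_remote_url(url: &str) -> String {
    let mut s = url.trim().to_lowercase();
    // Strip a leading scheme if present.
    for prefix in ["https://", "http://", "ssh://", "git://"] {
        if let Some(rest) = s.strip_prefix(prefix) {
            s = rest.to_string();
            break;
        }
    }
    // Turn SCP-style "git@host:owner/repo" into "host/owner/repo".
    if let Some(rest) = s.strip_prefix("git@") {
        s = rest.replacen(':', "/", 1);
    }
    let s = s.trim_end_matches('/');
    let s = s.strip_suffix(".git").unwrap_or(s);
    s.to_string()
}

/// Build git arguments for one repository:
/// a shallow clone on first run, a pull when the checkout already exists.
pub fn git_args(url: &str, dest: &Path) -> Vec<String> {
    let dest_str = dest.display().to_string();
    if dest.join(".git").exists() {
        vec!["-C".to_string(), dest_str, "pull".to_string()]
    } else {
        vec![
            "clone".to_string(),
            "--depth".to_string(),
            "1".to_string(),
            "--single-branch".to_string(),
            "--no-tags".to_string(),
            url.to_string(),
            dest_str,
        ]
    }
}
```

The returned arguments could be passed to `std::process::Command::new("git").args(...)`. Two URLs that normalize to the same key would be treated as the same repository and cloned only once.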

Speed Optimization:

Parallel cloning via configurable PARALLEL_JOBS
Shallow clones reduce disk I/O by ~80% vs. full clones
Incremental updates via git pull
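One way to picture the PARALLEL_JOBS setting is the sketch below, which resolves the value into a worker count and fans work out over scoped threads. The fallback behavior for unset, invalid, or `auto` values and the helper names are assumptions for illustration; the actual fetcher may be structured differently.

```rust
use std::thread;

/// Resolve a PARALLEL_JOBS-style value into a worker count.
/// Assumption: "auto", unset, zero, or unparsable values fall back to the CPU count.
pub fn resolve_parallel_jobs(raw: Option<&str>) -> usize {
    let cpus = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    match raw.map(str::trim) {
        Some(v) if v.eq_ignore_ascii_case("auto") => cpus,
        Some(v) => v.parse::<usize>().ok().filter(|n| *n > 0).unwrap_or(cpus),
        None => cpus,
    }
}

/// Split `items` into at most `jobs` chunks, process each chunk on its own
/// thread, and return how many items `work` reported as successful.
pub fn run_in_parallel<T, F>(items: &[T], jobs: usize, work: F) -> usize
where
    T: Sync,
    F: Fn(&T) -> bool + Sync,
{
    if items.is_empty() {
        return 0;
    }
    let jobs = jobs.max(1);
    let chunk_size = (items.len() + jobs - 1) / jobs;
    thread::scope(|scope| {
        // Spawn all workers first so the chunks run concurrently.
        let handles: Vec<_> = items
            .chunks(chunk_size)
            .map(|chunk| {
                let work = &work;
                scope.spawn(move || chunk.iter().filter(|item| work(*item)).count())
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<usize>()
    })
}
```

A caller might read the setting with `std::env::var("PARALLEL_JOBS").ok()` and pass `value.as_deref()` to `resolve_parallel_jobs`, then hand a clone-or-pull closure to `run_in_parallel`.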

Legacy Repository Cleanup

If you have repositories in older locations (src/ or repos/), migrate them:

# Inspect current state
cargo run --release -- doctor

# Remove legacy folders (safe: only deletes if matching lib/ exists and Git remote matches)
cargo run --release -- cleanup-legacy-duplicates

⚠️ Warning: This is destructive. Always run doctor first to inspect repository state.

⚙️ Output Files & Configuration

Generated during aggregation into skills-aggregated/:

| File | Purpose |
|---|---|
| `routing.csv` | Skill-to-hub/sub-hub mappings (name, hub, sub-hub, src_path) |
| `subhub-index.json` | Complete hub and sub-hub registry |
| `hub-manifests.csv` | Master index of all skills across all hubs |
| `.skill-lock.json` | Aggregation metadata (timestamps, repo revisions, dedup stats) |
| `[hub]/[sub-hub]/skills-manifest.json` | Per-sub-hub skill metadata and LLM classification triggers |

These files are used by agents and the TUI for discovery and routing.
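To illustrate how an agent might use routing.csv to find a canonical SKILL.md, here is a minimal sketch. It assumes a header row, the column order listed above (name, hub, sub-hub, src_path), and no quoted or comma-containing fields; a real consumer would probably use a proper CSV parser. The `Route` struct and function names are hypothetical.

```rust
/// One row of routing.csv, using the column order listed above.
#[derive(Debug, Clone, PartialEq)]
pub struct Route {
    pub name: String,
    pub hub: String,
    pub sub_hub: String,
    pub src_path: String,
}

/// Parse routing.csv text into routes.
/// Assumptions: the first line is a header and no field contains commas or quotes.
/// Rows with fewer than four columns (including blank lines) are skipped.
pub fn parse_routing_csv(text: &str) -> Vec<Route> {
    text.lines()
        .skip(1)
        .filter_map(|line| {
            let mut cols = line.splitn(4, ',').map(str::trim);
            Some(Route {
                name: cols.next()?.to_string(),
                hub: cols.next()?.to_string(),
                sub_hub: cols.next()?.to_string(),
                src_path: cols.next()?.to_string(),
            })
        })
        .collect()
}

/// Look up the source path of the canonical SKILL.md for a skill name.
pub fn source_path<'a>(routes: &'a [Route], name: &str) -> Option<&'a str> {
    routes
        .iter()
        .find(|r| r.name == name)
        .map(|r| r.src_path.as_str())
}
```

The lookup returns only a path, so the agent reads the routing table first and opens the source file on demand.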

🌐 Environment Variables

Copy .env-example to .env to override defaults:

cp .env-example .env

See .env-example for all available options.

🎯 Tool Integration Targets

Sync skills to any of these destinations:

| Tool | Project | Global |
|---|---|---|
| Claude | `.claude/skills/` | `~/.claude/skills/` |
| free-code (claude-code) | `.free-code-config/skills/` | `~/.free-code-config/skills/` |
| Hermes | `.hermes/skills/` | `~/.hermes/skills/` |
| Code (Codex) | `.agents/skills/` | `~/.agents/skills/` |
| GitHub Copilot | `.github/skills/` | `~/.copilot/skills/` |
| Cursor | `.cursor/skills/` | `~/.cursor/skills/` |
| Gemini | `.gemini/skills/` | `~/.gemini/skills/` |
| Antigravity | `.agent/skills/` | `~/.gemini/antigravity/skills/` |
| OpenCode | `.opencode/skills/` | `~/.config/opencode/skills/` |
| Windsurf | `.windsurf/skills/` | `~/.codeium/windsurf/skills/` |
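The table above maps naturally onto a small lookup. The sketch below resolves a sync directory for a tool and scope; the directory names are taken from the table, while the short tool keys (such as `github-copilot`), the `Scope` enum, and the `target_dir` function are hypothetical names chosen for this example.

```rust
use std::path::{Path, PathBuf};

/// Where skills should be synced.
pub enum Scope {
    Project,
    Global,
}

/// (tool key, project-relative directory, home-relative directory).
/// Directories come from the table above; the tool keys are illustrative.
const TARGETS: &[(&str, &str, &str)] = &[
    ("claude", ".claude/skills", ".claude/skills"),
    ("free-code", ".free-code-config/skills", ".free-code-config/skills"),
    ("hermes", ".hermes/skills", ".hermes/skills"),
    ("codex", ".agents/skills", ".agents/skills"),
    ("github-copilot", ".github/skills", ".copilot/skills"),
    ("cursor", ".cursor/skills", ".cursor/skills"),
    ("gemini", ".gemini/skills", ".gemini/skills"),
    ("antigravity", ".agent/skills", ".gemini/antigravity/skills"),
    ("opencode", ".opencode/skills", ".config/opencode/skills"),
    ("windsurf", ".windsurf/skills", ".codeium/windsurf/skills"),
];

/// Resolve the sync directory for a tool, or None if the tool key is unknown.
pub fn target_dir(tool: &str, scope: Scope, project_root: &Path, home: &Path) -> Option<PathBuf> {
    let (_, project, global) = TARGETS
        .iter()
        .find(|(name, _, _)| name.eq_ignore_ascii_case(tool))?;
    Some(match scope {
        Scope::Project => project_root.join(project),
        Scope::Global => home.join(global),
    })
}
```

Passing the home directory explicitly keeps the function free of environment lookups, which makes it straightforward to test.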

🏗️ Classification Architecture

The aggregation pipeline processes 8000+ SKILL.md files through a multi-stage classification system:

 SKILL.md files (8000+)
        │
        ▼
 ┌──────────────┐
 │  YAML Parse   │  Extract name, description, triggers
 └──────┬───────┘
        │
        ▼
 ┌──────────────┐
 │  Keyword      │  Fast token-based routing to hub/sub-hub
 │  Rules        │  (fallback if LLM unavailable)
 └──────┬───────┘
        │
        ▼
 ┌──────────────┐
 │  Dedup        │  Name OR Description HashSet
 │  (two-key)    │  Catches cross-repo clones
 └──────┬───────┘
        │
        ▼
 ┌──────────────────────────────────┐
 │  Hybrid Exclusion + LLM Classify │
 │  Step A: Keyword pre-filter      │
 │  Step B: LLM semantic classify   │
 │         (can return "excluded")  │
 └──────┬───────────────────────────┘
        │
        ▼
 ┌──────────────┐
 │  Output       │  routing.csv, per-hub manifests,
 │  Artifacts    │  subhub-index.json
 └──────────────┘
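The first stage of the diagram (extracting metadata from SKILL.md) might look like the sketch below. It assumes the metadata sits in a front-matter block delimited by `---` lines and handles only flat `key: value` pairs; a real implementation would more likely use a full YAML parser. The function name is hypothetical.

```rust
use std::collections::HashMap;

/// Extract flat `key: value` pairs from a leading front-matter block.
/// Assumptions: the block is delimited by `---` lines, and values are single-line
/// scalars. Nested YAML, lists, and multi-line values are not handled here.
pub fn parse_front_matter(doc: &str) -> Option<HashMap<String, String>> {
    let mut lines = doc.lines();
    if lines.next()?.trim() != "---" {
        return None;
    }
    let mut fields = HashMap::new();
    for line in lines {
        if line.trim() == "---" {
            // Closing delimiter found: the block is complete.
            return Some(fields);
        }
        if let Some((key, value)) = line.split_once(':') {
            let value = value.trim().trim_matches('"');
            fields.insert(key.trim().to_string(), value.to_string());
        }
    }
    // The closing delimiter was never found.
    None
}
```

The resulting map gives the later stages the `name` and `description` keys they need for keyword rules and deduplication.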

🔍 Classification Improvements (v2.0+)

The keyword-based classification system includes three critical enhancements to eliminate false negatives and resolve sub-hub conflicts:

1. Repository Name Extraction (Substring Matching)

Problem: Repository names like mukul975-anthropic-cybersecurity-skills were not being matched because the system used exact token matching (e.g., only matching the token "security", not the full repo name).

Solution: Introduced the infer_hub_from_repo_name() function, which:

Extracts the repository directory name from the path (the segment right after lib/ or src/)
Uses substring matching to catch domain signals (e.g., "cybersecurity-skills" → matches "security")
Runs before other inference logic (highest priority)

Confidence Score: 98% (near-deterministic, reflects author intent)
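A minimal sketch of this idea follows. The function name `infer_hub_from_repo_name` and the 98 confidence value come from the description above, but the signature, the `REPO_SIGNALS` table, and the `repo_name_from_path` helper are assumptions for illustration; only the "security" mapping to `code-quality / security` is taken from this README.

```rust
/// (substring, hub, sub-hub). Only the "security" row is taken from this README;
/// a real table would hold many more signals.
const REPO_SIGNALS: &[(&str, &str, &str)] = &[("security", "code-quality", "security")];

/// Confidence assigned to path-based inference.
pub const PATH_INFERENCE_CONFIDENCE: u8 = 98;

/// Return the path segment immediately after the first `lib` or `src` segment.
pub fn repo_name_from_path(path: &str) -> Option<&str> {
    let mut segments = path
        .split(|c: char| c == '/' || c == '\\')
        .filter(|s| !s.is_empty());
    while let Some(segment) = segments.next() {
        if segment == "lib" || segment == "src" {
            return segments.next();
        }
    }
    None
}

/// Infer (hub, sub-hub, confidence) from the repository directory name
/// using substring matching rather than exact token matching.
pub fn infer_hub_from_repo_name(path: &str) -> Option<(&'static str, &'static str, u8)> {
    let repo = repo_name_from_path(path)?.to_lowercase();
    REPO_SIGNALS
        .iter()
        .find(|(needle, _, _)| repo.contains(*needle))
        .map(|(_, hub, sub_hub)| (*hub, *sub_hub, PATH_INFERENCE_CONFIDENCE))
}
```

Because the match is on substrings, the token "cybersecurity" inside the repository name still triggers the "security" signal, which exact token matching would miss.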

2. Sub-Hub Conflict Resolution

Problem: When a skill matched multiple sub-hubs (e.g., python AND security simultaneously), language hubs often won due to their anchor keywords, defeating domain-specialist classification.

Solution: Introduced a conflict resolution table (CONFLICT_RESOLUTION) that:

Defines precedence rules when multiple sub-hubs match: (losing_hub, losing_sub_hub, winning_hub, winning_sub_hub)
Ensures domain specialists always win over languages:
- security > python | javascript | typescript | rust | golang | java
- testing-qa > python | javascript | typescript | rust
- code-review > python | javascript
Applied in the resolve_conflict() function when multiple candidates score within 5 points of the top score
Fallback: hub priority ordering if no explicit rule applies
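The precedence rules above could be sketched as shown below. The names `CONFLICT_RESOLUTION` and `resolve_conflict` and the 5-point window come from the description; the `Candidate` struct, the `languages` hub name, the `HUB_PRIORITY` list, and the exact tie-breaking order are assumptions made so the example is self-contained.

```rust
/// A scored classification candidate (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub hub: &'static str,
    pub sub_hub: &'static str,
    pub score: u8,
}

/// (losing_hub, losing_sub_hub, winning_hub, winning_sub_hub).
/// The "languages" hub name is a placeholder; the README does not name it.
const CONFLICT_RESOLUTION: &[(&str, &str, &str, &str)] = &[
    ("languages", "python", "code-quality", "security"),
    ("languages", "rust", "code-quality", "security"),
    ("languages", "python", "code-quality", "testing-qa"),
];

/// Hypothetical hub priority for the fallback case (earlier entries win).
const HUB_PRIORITY: &[&str] = &["code-quality", "languages"];

/// Candidates within this many points of the top score are treated as tied.
const TIE_WINDOW: u8 = 5;

pub fn resolve_conflict(candidates: &[Candidate]) -> Option<Candidate> {
    let top = candidates.iter().map(|c| c.score).max()?;
    // Only candidates inside the tie window compete.
    let close: Vec<&Candidate> = candidates
        .iter()
        .filter(|c| top - c.score <= TIE_WINDOW)
        .collect();
    // Drop any contender that loses to another contender under an explicit rule.
    let survivors: Vec<&Candidate> = close
        .iter()
        .copied()
        .filter(|c| {
            !CONFLICT_RESOLUTION.iter().any(|(lh, ls, wh, ws)| {
                c.hub == *lh
                    && c.sub_hub == *ls
                    && close.iter().any(|w| w.hub == *wh && w.sub_hub == *ws)
            })
        })
        .collect();
    // If the rules eliminated everyone, fall back to all close contenders.
    let pool = if survivors.is_empty() { close } else { survivors };
    // Fallback ordering: hub priority first, then the higher score.
    pool.into_iter()
        .min_by_key(|c| {
            let rank = HUB_PRIORITY
                .iter()
                .position(|h| *h == c.hub)
                .unwrap_or(usize::MAX);
            (rank, std::cmp::Reverse(c.score))
        })
        .cloned()
}
```

In this sketch, a `python` candidate at 90 loses to a `security` candidate at 87 because both fall inside the tie window and an explicit rule applies. At 95 versus 80 the window excludes `security`, so `python` wins outright.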

3. Confidence Boost for Path-Based Inference

Problem: Repository name signals (inferred from path) were scored 95%, allowing lower-confidence LLM results (80%) to potentially override them.

Solution: Raised the confidence score for path-based inference from 95% to 98%.

Score 98 is now treated as near-deterministic (same tier as explicit canonicalize_assignment logic at 100)
Only scores ≥ 100 can override it
Prevents low-confidence LLM results from contradicting repository metadata
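The override rule can be expressed as a small merge function, sketched below. The 98, 100, and 80 values come from this README; the `Assignment` struct, the function names, and the behavior for results below 98 (the higher confidence wins) are assumptions for illustration.

```rust
/// A classification result with its confidence (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Assignment {
    pub hub: String,
    pub sub_hub: String,
    pub confidence: u8,
}

/// Path-based inference score, treated as near-deterministic.
pub const NEAR_DETERMINISTIC: u8 = 98;
/// Minimum score allowed to override a near-deterministic result.
pub const OVERRIDE_FLOOR: u8 = 100;
/// Rule-based results above this confidence skip LLM classification.
pub const LLM_THRESHOLD: u8 = 80;

/// True when a rule-based result is confident enough to skip the LLM step.
pub fn can_skip_llm(rule_result: &Assignment) -> bool {
    rule_result.confidence > LLM_THRESHOLD
}

/// Decide which of two assignments to keep.
pub fn merge(current: Assignment, candidate: Assignment) -> Assignment {
    if current.confidence >= NEAR_DETERMINISTIC {
        // Only an explicit, fully confident result may replace it.
        if candidate.confidence >= OVERRIDE_FLOOR {
            candidate
        } else {
            current
        }
    } else if candidate.confidence > current.confidence {
        // Assumption: below the near-deterministic tier, higher confidence wins.
        candidate
    } else {
        current
    }
}
```

Under these assumptions, an 80% LLM result can never displace a 98% path-based assignment, while a 100% explicit assignment still can.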

📊 Example Classification Flow

For a skill in lib/mukul975-anthropic-cybersecurity-skills/:

1. apply_rules() called
   ↓
2. canonicalize_assignment() → no match (0% confidence)
   ↓
3. infer_from_path() called
   ├─ infer_hub_from_repo_name() extracts "mukul975-anthropic-cybersecurity-skills"
   ├─ Finds substring match: "cybersecurity"
   └─ Returns ("code-quality", "security") with 98% confidence
   ↓
4. ✓ Final assignment: code-quality / security
   ✗ LLM classification skipped (98% > 80% threshold)

🔧 Troubleshooting

Issue: Skills not aggregating or taking too long

Check repository state:

cargo run --release -- doctor

This validates all repositories, checks Git remotes, and reports cache status.

Increase parallelism:

export PARALLEL_JOBS=16
cargo run --release -- aggregate

Issue: Sync failing with "junction or symlink" errors

Cause: Existing junctions in sync target directories.

Solution: The sync command automatically skips existing junctions. If conflicts persist:

# Inspect sync targets
# Windows (cmd)
dir "%USERPROFILE%\.claude\skills"
# macOS/Linux
ls ~/.claude/skills

# Remove conflicting junctions/symlinks manually
# Windows (cmd): plain rmdir removes a junction without touching its target
rmdir "%USERPROFILE%\.claude\skills\[hub-name]"
# macOS/Linux
rm -rf ~/.claude/skills/[hub-name]

# Retry sync
cargo run --release -- sync

Issue: "Release gate" validation fails

Check output integrity:

cargo run --release -- release-gate

This validates:

All SKILL.md files were processed
No orphaned or missing references in routing.csv
Deduplication stats match cache state

If failures are reported, re-run aggregation:

rm -rf skills-aggregated/
cargo run --release -- aggregate
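For intuition about the reference checks that release-gate performs, here is a minimal sketch of an orphan check. It is not the actual validation logic: the `GateReport` type and `check_routing` function are hypothetical, and the sketch treats every discovered SKILL.md without a route as unprocessed, ignoring skills that were legitimately deduplicated or excluded.

```rust
use std::collections::HashSet;

/// Result of comparing routing entries against files found on disk (hypothetical).
#[derive(Debug, PartialEq)]
pub struct GateReport {
    /// Routes whose src_path points at no discovered file.
    pub orphaned_routes: Vec<String>,
    /// Discovered files that no route refers to.
    pub unrouted_skills: Vec<String>,
}

impl GateReport {
    pub fn passed(&self) -> bool {
        self.orphaned_routes.is_empty() && self.unrouted_skills.is_empty()
    }
}

/// Compare the src_path column of routing.csv with the SKILL.md paths found in lib/.
pub fn check_routing(routed_paths: &[&str], discovered_paths: &[&str]) -> GateReport {
    let routed: HashSet<&str> = routed_paths.iter().copied().collect();
    let discovered: HashSet<&str> = discovered_paths.iter().copied().collect();

    let mut orphaned_routes: Vec<String> =
        routed.difference(&discovered).map(|p| p.to_string()).collect();
    let mut unrouted_skills: Vec<String> =
        discovered.difference(&routed).map(|p| p.to_string()).collect();
    // Sort for stable, readable output.
    orphaned_routes.sort();
    unrouted_skills.sort();

    GateReport { orphaned_routes, unrouted_skills }
}
```

A non-empty `orphaned_routes` list would correspond to the "orphaned or missing references" failure described above.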

📈 Performance Characteristics

| Operation | Time | Dependencies |
|---|---|---|
| First aggregate (120+ repos, 8000+ skills) | 10-20 min | Network speed, CPU count, LLM latency |
| Incremental aggregate (repos already cached) | 2-5 min | LLM classification speed (can skip with `--skip-llm`) |
| Sync to tools (10 tools, all hubs) | 30-60 sec | Disk I/O, junction creation speed |
| LLM classification (8000 skills) | 3-8 min | Batch size, LLM throughput |

Optimization Tips:

Use PARALLEL_JOBS=auto for optimal CPU utilization
Set LLM_BATCH_SIZE=100 for faster LLM processing (requires more GPU/API quota)
Run on an SSD for 2-3x faster repository cloning
Use shallow clones (default) to reduce disk bandwidth

Reporting Issues

When reporting bugs, include:

Output of cargo run --release -- doctor
Contents of .skills-bank-cli-config.json (redact sensitive URLs if needed)
Error message and stack trace (if any)
Steps to reproduce

Extending Classification

To add new domain keywords or refine sub-hub routing:

Edit src/classify.rs → CONFLICT_RESOLUTION table or keyword rules
Add test cases in tests/
Run cargo test and cargo run --release -- aggregate
Submit PR with classification examples

📄 License

MIT — See package.json for details.
