dojo.md — Training Arena for AI Agents

Your agent demos well. It fails in production. dojo.md fixes that.

Train any model through scenario-based courses. Graduate with a SKILL.md — portable expertise that makes agents reliable. No fine-tuning. No weight modification. Just knowledge, distilled and proven.

Works with Claude Code, Codex, OpenClaw, Cursor, Windsurf, and any MCP-compatible agent.

"Hey Claude, train yourself on cold-email-b2b and loop until you hit 90"

  Iteration 1: 31/100 — doesn't know subject line rules, no personalization
  Iteration 2: 58/100 — SKILL.md injected, learns "under 50 chars, no caps"
  Iteration 3: 74/100 — gets structure right, still weak on CTAs
  Iteration 4: 86/100 — nails the pain→solution→ask framework
  Iteration 5: 91/100 — ✓ target reached

  SKILL.md → .claude/skills/cold-email-b2b/SKILL.md

Now every cold email it writes follows the framework. Permanently.

What Just Happened

You told Claude Code to train itself. It worked through 50 progressively harder cold-email scenarios covering bad subject lines, wrong tone, missing personalization, and weak CTAs. An LLM judge scored every attempt. After 5 iterations, it distilled what it had learned into a SKILL.md that stays in your project.

Next time you say "write a cold email to this VP of Engineering", it loads the SKILL.md automatically. It knows the rules now.

This works for anything:

# Your agent writes terrible Google Ads? Fix it in 5 minutes
dojo train ad-copy-google-ads --target 85

# Support agent keeps giving wrong refund info? Train it
dojo train stripe-refunds --target 90

# Code reviews are too vague? There's a course for that
dojo train code-review-feedback-writing --target 85

# Incident postmortems are weak? Train on 50 real scenarios
dojo train incident-postmortem-writing --target 80

# Your agent can't write a proper RFC? Now it can
dojo train technical-rfc-writing --target 85

How the loop works

Scenarios → Mock Services → LLM Judge → Failure Patterns → SKILL.md → Re-inject → Repeat

Each iteration, the judge's failure patterns are distilled into the SKILL.md, which is re-injected before the next run, so the gains compound. The loop stops when the score hits the target or plateaus.
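The stopping rule can be sketched in a few lines of Python. This is a minimal illustration, not dojo.md's actual implementation: `run_iteration` is a hypothetical callback standing in for one full pass of the pipeline, and treating a plateau as "gained less than `min_gain` points over the previous iteration" is an assumption, since the text does not define it.

```python
from typing import Callable, List, Tuple


def train_until_target(
    run_iteration: Callable[[int], float],
    target: float,
    max_iterations: int = 5,
    min_gain: float = 1.0,
) -> Tuple[List[float], str]:
    """Repeat training passes until the target is hit, scores plateau, or the cap is reached.

    run_iteration is a hypothetical stand-in for one pass of the pipeline
    (scenarios -> judge -> failure patterns -> SKILL.md -> re-inject).
    The plateau rule (gain below min_gain) is an assumption for illustration.
    """
    scores: List[float] = []
    for i in range(1, max_iterations + 1):
        score = run_iteration(i)
        scores.append(score)
        if score >= target:
            return scores, "target reached"
        if len(scores) > 1 and score - scores[-2] < min_gain:
            return scores, "plateau"
    return scores, "max iterations"


# Replay the scores from the cold-email-b2b example above.
example_scores = {1: 31, 2: 58, 3: 74, 4: 86, 5: 91}
history, reason = train_until_target(lambda i: example_scores[i], target=90)
print(history, reason)
```

Keeping the pass itself behind a callback makes the stopping logic easy to test with recorded scores before wiring it to a real agent.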

Per-model skills

Claude, GPT, DeepSeek — they all fail differently. Each gets its own SKILL.md:

.claude/skills/cold-email-b2b/
├── anthropic--claude-sonnet-4-6/SKILL.md    # Was too formal, learned casual tone
├── openai--gpt-4o/SKILL.md                  # Was too long, learned brevity
└── deepseek--deepseek-v3.2/SKILL.md         # Missed personalization hooks
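A small sketch of how such a per-model path could be built. The mapping is an assumption inferred from the tree above (the `/` in a `provider/model` identifier becomes `--`); `skill_path` is a hypothetical helper, and how identifiers without a provider prefix are mapped is not shown in this document.

```python
from pathlib import PurePosixPath


def skill_path(course: str, model_id: str, root: str = ".claude/skills") -> PurePosixPath:
    """Build a per-model SKILL.md path.

    Assumption: the directory name is the provider/model identifier with
    "/" replaced by "--", as the tree above suggests.
    """
    model_dir = model_id.replace("/", "--")
    return PurePosixPath(root) / course / model_dir / "SKILL.md"


print(skill_path("cold-email-b2b", "openai/gpt-4o"))
```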

Quick Start

Option 1: Zero-cost with Claude Code or Codex (recommended)

Already paying for Claude Code or Codex? Training costs $0 extra. The agent trains AND judges itself — no API keys needed.

Just paste this into Claude Code:

Install dojo.md as an MCP server, then train yourself on cold-email-b2b
using autopilot mode. Loop until you hit 90.

Or add dojo as an MCP server manually:

{
  "mcpServers": {
    "dojo": { "command": "npx", "args": ["dojo.md", "mcp"] }
  }
}

Then tell your agent what to train on. It handles the rest.
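If you already have an MCP config with other servers, the dojo entry has to be merged in rather than pasted over it. A minimal sketch, assuming the config has been loaded into a dict: `add_dojo_server` and the `other` server are hypothetical, and the location of the config file depends on your client, so file handling is left out.

```python
import json
from typing import Any, Dict


def add_dojo_server(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of an MCP config with the dojo server entry added.

    Existing servers are preserved; an existing "dojo" entry is overwritten.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["dojo"] = {"command": "npx", "args": ["dojo.md", "mcp"]}
    merged["mcpServers"] = servers
    return merged


# "other" is a made-up server used only to show that existing entries survive.
existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
print(json.dumps(add_dojo_server(existing), indent=2))
```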

	CLI (`dojo train`)	Autopilot (Claude Code / Codex)
Cost	~$0.50–5 per run	$0 extra
Agent	API calls	Your subscription
Judge	API calls	Agent self-judges
Setup	API keys required	Just MCP config

Option 2: CLI with any model via OpenRouter

npm install -g dojo.md
export OPENROUTER_API_KEY=sk-or-...

# Train DeepSeek on Google Ads copy for $0.03
dojo train ad-copy-google-ads --model deepseek/deepseek-v3.2 --target 85

# Train GPT-5.2 on incident response, judged by Claude
dojo train incident-response --model openai/gpt-5.2 --judge claude-sonnet-4-6 --target 90

# Train Gemini on customer support escalation
dojo train customer-support-escalation --model google/gemini-3-flash-preview --target 80

Arena — Model Benchmarking

Compare models head-to-head on the same course. Same judge, same scenarios, no SKILL.md — raw capability only.

dojo arena ad-copy-google-ads --level 1

═══ Arena Leaderboard ════════════════════════
  1st  Claude Opus 4.6    █████████████████░░░  84
  2nd  Claude Sonnet 4.6  █████████████████░░░  84
  3rd  GPT-5.2            ████████████████░░░░  82
  4th  GLM 5              ████████████████░░░░  79
  5th  Gemini 3 Flash     ███████████████░░░░░  76
══════════════════════════════════════════════

Above 70, each additional point is much harder to earn. As with Elo ratings, small gaps mean big differences. See the live leaderboard.
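The bars in the leaderboard are consistent with a 20-character scale rounded to the nearest block. The sketch below reproduces them under that assumption; the rounding rule is inferred from the sample output, and `score_bar` is a hypothetical helper, not dojo.md's rendering code.

```python
def score_bar(score: float, width: int = 20) -> str:
    """Render a 0-100 score as a fixed-width bar, rounding half up."""
    filled = int(score / 100 * width + 0.5)
    filled = max(0, min(width, filled))  # clamp out-of-range scores
    return "█" * filled + "░" * (width - filled)


for name, score in [("Claude Opus 4.6", 84), ("GPT-5.2", 82), ("Gemini 3 Flash", 76)]:
    print(f"{name:<18} {score_bar(score)}  {score}")
```

At this resolution each block covers five points, which is why 82 and 79 draw the same bar despite the three-point gap.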

Any Model

200+ models via OpenRouter:

dojo train cold-email-b2b --model openai/gpt-4o
dojo train cold-email-b2b --model google/gemini-2.5-pro
dojo train cold-email-b2b --model deepseek/deepseek-v3.2
dojo train cold-email-b2b --model x-ai/grok-4.1-fast
dojo train cold-email-b2b --model meta-llama/llama-3.3-70b-instruct

125 Pre-Built Courses (6,250+ Scenarios)

Domain	Examples	Courses
Customer Support	Stripe refunds, escalation, churn prevention, SLA breaches, onboarding	14
Marketing & Content	Google Ads, Meta ads, SEO blogs, email sequences, social media, UGC	18
Sales & Revenue	Cold email B2B, objection handling, proposals, battlecards, lead scoring	9
Engineering & DevOps	Incident response, Docker, Kubernetes, CI/CD, AWS Lambda, security	17
Writing & Docs	Technical RFCs, postmortems, SOPs, newsletters, Twitter/X threads	16
Data & Analytics	A/B testing, cohort analysis, segmentation, funnel analysis, forecasting	9
Design & UX	Accessibility audits, design systems, user personas, journey mapping	9
Education	Quiz creation, study guides, workshop facilitation, training materials	—
Legal & Compliance	Contract review, compliance checklists, clause summarization	—
Real Estate	Listing descriptions, open house promos, buyer inquiry response	—
Healthcare	Appointment reminders, intake review, billing inquiries, pre-auth	—

dojo list                    # See all 125 courses
dojo generate "Handle Zendesk ticket routing and priority assignment"  # Create your own

Works With Everything

dojo.md generates AgentSkills-standard SKILL.md files. Train once, use everywhere.

Claude Code

dojo.md is an MCP server — train from inside your IDE:

{
  "mcpServers": {
    "dojo": {
      "command": "npx",
      "args": ["dojo.md", "mcp"]
    }
  }
}

MCP tools: dojo_discover, dojo_train, dojo_tool, dojo_submit, dojo_results, dojo_skill, dojo_apply

OpenClaw

Drop your graduated SKILL.md into OpenClaw's skill directory. dojo.md skills follow the same AgentSkills standard — cross-compatible by design.

ClawHub has 13,000+ community skills. The difference: dojo skills are earned, not written. Every SKILL.md has a training score, validated scenarios, and failure patterns it addresses. It's a diploma, not a blog post.

Cursor, Windsurf, and any MCP agent

Same MCP config. Same SKILL.md output. Portable.

The SKILL.md Standard

Generated skills follow the AgentSkills open standard:

---
name: stripe-refunds
description: >-
  Handle Stripe refund requests correctly. Use when processing
  refunds, duplicate charges, or customer disputes.
---

## Domain Knowledge
[Non-obvious insights distilled from training curriculum]

## Quick Start
[Most common failure, corrected]

## Core Rules
[Freedom-calibrated: ALWAYS/step-by-step/prefer]

## Decision Tree
[If/then branching logic]

## Edge Cases
[Every trap, with correct handling]

## Anti-Patterns
[DON'T X. Instead, Y.]

The description triggers loading: an idle skill costs ~100 tokens of context, and the full skill (~5,000 tokens) loads only when activated. This progressive disclosure keeps context clean.
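A quick way to sanity-check a generated file against this template is to split off the front matter and look for the six section headings. This is a rough sketch using only the standard library: treating all six sections as required is an assumption based on the template above, and `split_front_matter` and `missing_sections` are hypothetical helpers.

```python
from typing import List, Tuple

# Section names taken from the template above; treating them as required is an assumption.
REQUIRED_SECTIONS = [
    "Domain Knowledge",
    "Quick Start",
    "Core Rules",
    "Decision Tree",
    "Edge Cases",
    "Anti-Patterns",
]


def split_front_matter(text: str) -> Tuple[str, str]:
    """Split a SKILL.md into (front matter, body) at the '---' delimiters."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text


def missing_sections(text: str) -> List[str]:
    """List template sections that have no matching '## ' heading in the body."""
    _, body = split_front_matter(text)
    headings = {line[3:].strip() for line in body.splitlines() if line.startswith("## ")}
    return [name for name in REQUIRED_SECTIONS if name not in headings]


sample = """---
name: stripe-refunds
description: Handle Stripe refund requests correctly.
---

## Domain Knowledge
...

## Core Rules
...
"""
front, _ = split_front_matter(sample)
print(front)
print(missing_sections(sample))
```

The front matter is the part that stays in context while the skill is idle; the body is what loads on activation.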

CLI Reference

Command	Description
`dojo train <course>`	Run training session
`dojo train <course> -m openai/gpt-4o -j claude-sonnet-4-6 -t 85`	Full multi-model auto-loop
`dojo retrain <course>`	Auto-loop with defaults (target 90, max 5 iterations)
`dojo arena <course>`	Benchmark multiple models head-to-head
`dojo arena <course> --models m1,m2,m3`	Arena with specific models
`dojo results [course]`	Show latest results
`dojo list`	List installed courses
`dojo generate <skill>`	Generate a course from description

Train Options

Flag	Description	Default
`-m, --model`	Agent model	`claude-sonnet-4-6`
`-j, --judge`	Judge model	`claude-sonnet-4-6`
`-t, --target`	Target score (enables auto-loop)	—
`--max-retrain`	Max loop iterations	`5`
`--level`	Run specific level only	all
`--report`	Save detailed report	—
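When training is driven from a script, the command line can be assembled from these flags. A minimal sketch using only flags listed in the table above: `build_train_command` is a hypothetical helper, and passing `--max-retrain` a numeric value is an assumption based on its default of `5`.

```python
from typing import List, Optional


def build_train_command(
    course: str,
    model: Optional[str] = None,
    judge: Optional[str] = None,
    target: Optional[int] = None,
    max_retrain: Optional[int] = None,
) -> List[str]:
    """Assemble a `dojo train` argument list from the flags in the table above."""
    cmd = ["dojo", "train", course]
    if model is not None:
        cmd += ["-m", model]
    if judge is not None:
        cmd += ["-j", judge]
    if target is not None:
        cmd += ["-t", str(target)]
    if max_retrain is not None:
        cmd += ["--max-retrain", str(max_retrain)]
    return cmd


cmd = build_train_command(
    "stripe-refunds", model="openai/gpt-4o", judge="claude-sonnet-4-6", target=85
)
print(" ".join(cmd))
# To execute it: subprocess.run(cmd, check=True), assuming `dojo` is on PATH.
```

Omitted options are left off the command line so the CLI's own defaults apply.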

Arena Options

Flag	Description	Default
`--models`	Comma-separated model list	top 5 models
`-j, --judge`	Shared judge model	`claude-opus-4-6`
`-l, --level`	Run specific level only	all
`-o, --output`	Output JSON path	auto-generated

Scenario Format

meta:
  id: simple-refund
  level: 1
  course: stripe-refunds
  description: Process a straightforward refund
  type: tool

state:
  customers:
    - id: cus_001
      email: alice@example.com
      name: Alice Johnson
  charges:
    - id: ch_001
      amount: 5000
      customer: cus_001
      status: succeeded

trigger: >
  Customer Alice Johnson (cus_001) is requesting
  a refund for charge ch_001 ($50.00).

assertions:
  - type: api_called
    tool: stripe_customers_retrieve
    description: Verify customer identity
  - type: api_called
    tool: stripe_refunds_create
    params: { charge: ch_001 }
    description: Create the refund
  - type: llm_judge
    criteria: >
      Agent confirms refund was processed and explains
      the 5-10 business day timeline for the credit
      to appear on the customer's statement.
    description: Communicate success with timeline
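To illustrate how an `api_called` assertion might be checked, the sketch below matches assertions against a recorded list of tool calls. It is an illustration under stated assumptions, not dojo.md's evaluator: the call-log shape (`tool` and `args` keys) is hypothetical, and `params` is treated as a subset match, so extra arguments on the call are allowed.

```python
from typing import Any, Dict, List

ToolCall = Dict[str, Any]


def api_called(assertion: Dict[str, Any], calls: List[ToolCall]) -> bool:
    """Return True if some recorded call matches the assertion's tool and params.

    Assumption: `params` is a subset match, so extra arguments on the call are allowed.
    """
    expected = assertion.get("params", {})
    for call in calls:
        if call["tool"] != assertion["tool"]:
            continue
        args = call.get("args", {})
        if all(args.get(key) == value for key, value in expected.items()):
            return True
    return False


# Hypothetical log of the tool calls an agent made during the scenario.
calls = [
    {"tool": "stripe_customers_retrieve", "args": {"customer": "cus_001"}},
    {"tool": "stripe_refunds_create", "args": {"charge": "ch_001", "amount": 5000}},
]
assertions = [
    {"type": "api_called", "tool": "stripe_customers_retrieve"},
    {"type": "api_called", "tool": "stripe_refunds_create", "params": {"charge": "ch_001"}},
]
print([api_called(a, calls) for a in assertions])
```

The `llm_judge` assertion is left out because it needs a model call rather than a deterministic check.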

Development

git clone https://github.com/edholofy/dojo.md
cd dojo.md
npm install
npm run build
npm test       # 116 tests

# Dev mode
npm run dev -- train stripe-refunds

Mission

Turn experience into expertise for AI agents.

Today: Author courses, train models, graduate with SKILL.md.
Tomorrow: Production feedback loops that generate scenarios from real failures.
Future: The open knowledge layer for agent expertise — proven, portable, model-agnostic.

License

MIT