self-evolving-agent

SUMMARY

An OpenClaw skill that upgrades self-improving agents from reactive error logging to goal-driven capability evolution with curriculum, evaluation, transfer, and promotion.


self-evolving-agent

English | 简体中文


🧠 self-improving-agent only logs mistakes.

self-evolving-agent is an OpenClaw-first, phase-aware capability-evolution runtime. It classifies work into task_light, task_full, agenda_review, or promotion_review mode; retrieves only the most relevant prior records; writes evidence into canonical records; and regenerates human-facing ledgers plus manifest.json.

It preserves the best parts of self-improving-agent, but upgrades the paradigm from:

  • incident logging -> capability evolution
  • passive memory -> active learning agenda
  • correction archive -> curriculum + evaluation + promotion gate

✨ Why It Exists

Traditional self-improving agents often stop at:

  • "something failed"
  • "log the fix"
  • "write a rule"

That helps reduce repeated mistakes, but it does not answer the harder questions:

  • What can the agent reliably do today?
  • Which capability is actually weak?
  • What should it practice next?
  • Has it truly learned, or only recorded?
  • Can the strategy transfer to a different task?

self-evolving-agent is built to answer those questions explicitly.

📊 self-evolving-agent vs self-improving-agent

| Dimension | self-improving-agent | self-evolving-agent |
| --- | --- | --- |
| Primary mode | Reactive correction | Goal-driven capability evolution |
| Core unit | Incident, error, note | Capability, training unit, evaluation state |
| Memory model | Learnings and recurring issues | Learnings + capability map + learning agenda |
| Before-task behavior | Review past notes if relevant | Review notes, capability risks, and active training priorities |
| After-task behavior | Log errors and lessons | Diagnose weakest capability, update map, revise agenda, create training if needed |
| Recurrence handling | Detect recurring patterns | Convert recurrence into curriculum with pass criteria |
| Learning states | Mostly implicit | recorded -> understood -> practiced -> passed -> generalized -> promoted |
| Promotion rule | Promote useful rules | Promote only validated, transferable strategies |
| Transfer awareness | Limited | Explicit transfer check before promotion |
| What it optimizes for | Fewer repeated mistakes | More independence, stability, transfer, and unfamiliar-task competence |

🚀 What Makes This Different

  • 🧭 Learning agenda: keeps only 1-3 high-leverage capabilities active at a time
  • 🗺️ Capability map: tracks level, evidence, limits, failure modes, and upgrade conditions
  • 🧠 Phase-aware control plane: routes tasks into the smallest safe mode instead of assuming task_full every time
  • 🗂️ Canonical records: stores mutable state under records/ and generates human-readable ledgers from those records
  • 🔬 Diagnosis layer: turns incidents into capability-level root-cause analysis
  • 🏋️ Curriculum layer: generates drills, pass criteria, and transfer scenarios
  • ✅ Evaluation ladder: separates writing something down from actually learning it
  • 🔒 Promotion gate: prevents brittle one-off rules from polluting long-term behavior
  • 🤝 Memory retention: still preserves classic logging for errors, learnings, and feature requests

🧱 Architecture

flowchart TD
    A["Task Starts"] --> B["classify-task"]
    B --> C["Mode: task_light | task_full | agenda_review | promotion_review"]
    C --> D["retrieve-context"]
    D --> E["Execute with verification"]
    E --> F["record-incident"]
    F --> G["rebuild-index"]
    G --> H["Generated ledgers + manifest.json"]
    H --> I["review-agenda / evaluate when triggered"]

The runtime entrypoint is scripts/evolution_runtime.py. It treats assets/records/ and workspace records/ directories as the mutable source of truth and regenerates summaries plus index/manifest.json.
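Conceptually, rebuild-index treats the records/ tree as the source of truth and regenerates index/manifest.json from it. The sketch below shows that idea only; the file layout and manifest fields are assumptions, not the real schema used by evolution_runtime.py.

```python
# Hypothetical sketch of what rebuild-index conceptually does: scan the
# canonical records/ tree and regenerate index/manifest.json from it.
# Field names and layout are illustrative, not the real schema.
import json
from pathlib import Path

def rebuild_index(workspace: Path) -> dict:
    # Records are the mutable source of truth; the manifest is derived.
    records = sorted((workspace / "records").rglob("*.json"))
    manifest = {
        "record_count": len(records),
        "records": [str(p.relative_to(workspace)) for p in records],
    }
    index_dir = workspace / "index"
    index_dir.mkdir(parents=True, exist_ok=True)
    (index_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Because the manifest is always regenerated rather than hand-edited, ledgers and manifest.json can never drift from the underlying records.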

๐Ÿ” Phase-Aware Loop

For every meaningful cycle, the skill follows this control plane:

  1. Classify the task with scripts/evolution_runtime.py classify-task
  2. Choose the smallest safe mode
  3. Retrieve only that mode's records with retrieve-context
  4. Execute with a mode-appropriate verification plan
  5. Write reusable evidence through record-incident
  6. Regenerate records/ views and manifest.json through rebuild-index

Outside the task loop, it runs review-agenda and evaluate only when their triggers fire.
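Four of the six steps above map to runtime subcommands, and they always run in the same order. The sketch below plans that command sequence; the flag names mirror the Quick Start example, but the helper itself is an assumption, not part of the shipped scripts.

```python
# Illustrative planner for the CLI portion of the phase-aware loop.
# Subcommand names come from this README; the helper is hypothetical.
def plan_cycle(workspace: str, prompt: str) -> list[list[str]]:
    runtime = "scripts/evolution_runtime.py"

    def cmd(sub: str, *extra: str) -> list[str]:
        return ["python3", runtime, sub, "--workspace", workspace, *extra]

    # Steps 2 (choose mode) and 4 (execute) happen in the agent, not the CLI.
    return [
        cmd("classify-task", "--prompt", prompt),
        cmd("retrieve-context"),
        cmd("record-incident"),
        cmd("rebuild-index"),
    ]
```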

🧩 What It Keeps From self-improving-agent

  • Error logging
  • Learning capture
  • Feature request logging
  • Recurring pattern detection
  • Review of past learnings before major work
  • Promotion into durable workspace context
  • Hook-friendly operation

Those strengths remain, but only as the memory layer, not the whole system.

🔄 Migration From self-improving-agent

The most common conflict is not data loss. It is double activation.

If a user already has self-improving-agent, the safe migration path is:

  1. Install self-evolving-agent without deleting the old skill.
  2. Bootstrap .evolution/ and import the old .learnings/ directory.
  3. Keep the imported logs in .evolution/legacy-self-improving/ as read-only history.
  4. Disable the old self-improvement hook after verifying the import.
  5. Gradually normalize only the legacy items that become active evidence for diagnosis, agenda review, evaluation, or promotion.

This keeps prior experience intact without forcing a lossy one-shot conversion into the new schema.

Example:

~/.openclaw/skills/self-evo-agent/scripts/bootstrap-workspace.sh \
  ~/.openclaw/workspace/.evolution \
  --migrate-from ~/.openclaw/workspace/.learnings
openclaw hooks disable self-improvement
openclaw hooks enable self-evolving-agent

🎯 Best Fit

Use this skill when you want an agent that should:

  • improve across sessions
  • become safer on unfamiliar work
  • convert repeated failures into deliberate practice
  • distinguish recording from mastery
  • prove transfer before promotion

โš–๏ธ Modes

The task_full capability-evolution pipeline is intentionally not the default for every tiny mistake.

Use task_light when the task is familiar, low-consequence, and short-horizon. In that mode, retrieve only the top few relevant records, state one risk and one verification check, and avoid spawning agenda or promotion work.

Escalate into task_full when the task is mixed or unfamiliar, consequence matters, an active agenda item is involved, a failure pattern repeats, the user had to rescue the task, transfer failed, or the lesson may deserve training or evaluation.

Use agenda_review only for agenda triggers such as five meaningful cycles, structural gaps, failed transfer, or an upcoming unfamiliar project.

Use promotion_review only for transfer and promotion decisions.
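As a rough heuristic, the routing rules above can be condensed into a single decision function. This is a hypothetical sketch of the policy described in this section; the real classify-task command may weigh these signals differently.

```python
# Hypothetical mode-routing heuristic for the four modes described above.
# Signal names are assumptions; the real classifier may differ.
def choose_mode(familiar: bool, low_consequence: bool, short_horizon: bool,
                agenda_trigger: bool = False,
                promotion_decision: bool = False) -> str:
    if promotion_decision:          # transfer/promotion decisions only
        return "promotion_review"
    if agenda_trigger:              # e.g. five meaningful cycles, failed transfer
        return "agenda_review"
    if familiar and low_consequence and short_horizon:
        return "task_light"         # smallest safe mode
    return "task_full"              # mixed, unfamiliar, or consequential work
```

The ordering matters: review modes are checked first so that a pending promotion or agenda trigger is never silently downgraded to a lightweight task mode.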

๐Ÿ“ Repository Layout

self-evolving-agent/
├── SKILL.md
├── README.md
├── README.zh-CN.md
├── install.md
├── agents/
│   └── openai.yaml
├── benchmarks/
│   ├── suite.json
│   └── schemas/
│       └── judge-output.schema.json
├── system/
│   └── coordinator.md
├── modules/
│   ├── capability-map.md
│   ├── curriculum.md
│   ├── diagnose.md
│   ├── evaluator.md
│   ├── learning-agenda.md
│   ├── promotion.md
│   └── reflection.md
├── assets/
│   ├── records/
│   │   ├── agenda/
│   │   └── capabilities/
│   ├── CAPABILITIES.md
│   ├── ERRORS.md
│   ├── EVALUATIONS.md
│   ├── FEATURE_REQUESTS.md
│   ├── LEARNING_AGENDA.md
│   ├── LEARNINGS.md
│   └── TRAINING_UNITS.md
├── evals/
│   └── evals.json
├── demos/
│   ├── demo-1-diagnosis.md
│   ├── demo-2-training-loop.md
│   ├── demo-3-promotion-and-transfer.md
│   ├── demo-4-agenda-review.md
│   └── demo-5-pre-task-risk-diagnosis.md
├── hooks/
│   └── openclaw/
│       ├── HOOK.md
│       └── handler.ts
└── scripts/
    ├── activator.sh
    ├── bootstrap-workspace.sh
    ├── evolution_runtime.py
    ├── error-detector.sh
    ├── run-benchmark.py
    └── run-evals.py

⚡ Quick Start

  1. Install the skill into your OpenClaw skills directory.
  2. Bootstrap a persistent .evolution workspace.
  3. Classify work through the runtime and retrieve only the required records.
  4. Let the runtime regenerate ledgers and manifest.json after canonical record updates.
  5. Run the benchmark suite to see how the skill performs in model-in-the-loop conditions.

cp -r self-evolving-agent ~/.openclaw/skills/self-evo-agent
~/.openclaw/skills/self-evo-agent/scripts/bootstrap-workspace.sh ~/.openclaw/workspace/.evolution
python3 ~/.openclaw/skills/self-evo-agent/scripts/evolution_runtime.py classify-task \
  --workspace ~/.openclaw/workspace/.evolution \
  --prompt "I need to modify a production deployment workflow I have never touched before."
python3 ~/.openclaw/skills/self-evo-agent/scripts/run-evals.py ~/.openclaw/skills/self-evo-agent
python3 ~/.openclaw/skills/self-evo-agent/scripts/run-benchmark.py --skill-dir ~/.openclaw/skills/self-evo-agent

More setup details are in install.md.

📦 Installation Options

Option A: Install from ClawHub

Use this when you want the simplest registry-based install into your current OpenClaw workspace.

npm i -g clawhub
# or
pnpm add -g clawhub

clawhub install RangeKing/self-evo-agent

Then start a new OpenClaw session so the skill is loaded from your workspace skills/ folder.
The registry slug and local directory are self-evo-agent; the skill and hook name stay self-evolving-agent.
If you are migrating from self-improving-agent, import .learnings/ before you disable the old hook.

Option B: Let OpenClaw install it from GitHub

If you prefer to have your agent fetch the GitHub repository directly, you can tell OpenClaw something like:

Install the OpenClaw skill from https://github.com/RangeKing/self-evolving-agent into ~/.openclaw/skills/self-evo-agent, inspect the scripts before enabling hooks, and then bootstrap ~/.openclaw/workspace/.evolution.

This works well when you want the skill installed as a shared managed skill under ~/.openclaw/skills.

Option C: Manual Git clone

git clone https://github.com/RangeKing/self-evolving-agent.git ~/.openclaw/skills/self-evo-agent
~/.openclaw/skills/self-evo-agent/scripts/bootstrap-workspace.sh ~/.openclaw/workspace/.evolution

If you already have ~/.openclaw/workspace/.learnings, use:

~/.openclaw/skills/self-evo-agent/scripts/bootstrap-workspace.sh \
  ~/.openclaw/workspace/.evolution \
  --migrate-from ~/.openclaw/workspace/.learnings

Safety Note

ClawHub is a public registry and skills are effectively trusted local code. Review the repository or installed files before enabling hooks or running benchmark scripts.

๐Ÿค Project Health

🧪 Benchmarking

This repository includes two evaluation modes:

  • scripts/run-evals.py
    • Structural compliance checks for files, modules, and benchmark assets
  • scripts/run-benchmark.py
    • Real model-in-the-loop execution using codex exec
    • Captures candidate prompt, raw events, final output, judge output, and report

Example smoke run:

python3 scripts/run-benchmark.py \
  --skill-dir . \
  --candidate-model gpt-5.4-mini \
  --judge-model gpt-5.4-mini \
  --max-scenarios 1 \
  --timeout-seconds 90

🧭 Use Cases

  • Upgrading a self-correcting agent into a self-training agent
  • Running postmortems that produce training, not just notes
  • Building skill memory systems that do not confuse logging with mastery
  • Evaluating whether an agent can transfer strategies across task families
  • Designing agent curricula for research, coding, verification, or operations workflows

๐Ÿ›ฃ๏ธ Roadmap

  • Memory, diagnosis, curriculum, evaluator, reflection, promotion modules
  • Capability bootstrap map and proactive learning agenda
  • Model-in-the-loop benchmark harness
  • More benchmark scenarios for coding, research, and long-horizon execution
  • Optional benchmark trend summaries across repeated runs
  • Example workspace packs for different agent domains
