SkillLoop

SkillLoop is a standalone self-improvement harness for agent systems.

It sits beside an agent runtime, ingests completed agent traces, evaluates them, proposes durable memory and reusable skill updates, and exports fine-tuning-ready datasets. It is deliberately separated from Hermes or any other runtime so it can be reviewed, exported, versioned, or embedded without mutating an existing agent installation.

Why this exists

Most agents can execute tasks, but their learning loop is usually either missing or tightly coupled to one runtime. SkillLoop keeps that loop explicit:

ingest traces from an agent
normalize them into a stable schema
evaluate quality and learning signals
distill candidate memory and skill updates
require review before applying anything
export curated SFT/DPO data for model improvement
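
As a rough illustration of the normalize step, the sketch below turns generic JSONL lines into a versioned record. The NormalizedTrace fields, the normalize_generic function, and the "1" schema version are assumptions made for this example. The actual schema lives in skillloop/schema.py and docs/trace-schema.md and may differ.

```python
import json
from dataclasses import dataclass, field

SCHEMA_VERSION = "1"  # hypothetical version label, not SkillLoop's real value


@dataclass
class NormalizedTrace:
    """Illustrative normalized trace; field names are assumptions."""
    trace_id: str
    schema_version: str
    runtime: str
    adapter: str
    messages: list[dict] = field(default_factory=list)


def normalize_generic(lines: list[str], trace_id: str, runtime: str = "unknown") -> NormalizedTrace:
    """Turn JSONL lines (assumed: one message object per line) into a NormalizedTrace."""
    messages = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL input
        record = json.loads(line)
        # Keep only the fields this sketch's schema knows about.
        messages.append({
            "role": str(record.get("role", "unknown")),
            "content": str(record.get("content", "")),
        })
    return NormalizedTrace(trace_id, SCHEMA_VERSION, runtime, "generic", messages)
```

Stamping every record with a schema version and adapter name is what lets later stages tell old traces from new ones.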

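The MVP is local-first, stdlib-first, and review-first: it runs without cloud services, leans on the Python standard library, and applies nothing without human approval.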

What SkillLoop does

Normalizes generic JSONL, Hermes-style exports, and Hermes state.db sessions
Stores traces, evaluations, and proposals in local SQLite
Uses a versioned normalized trace schema with runtime/adapter metadata
Preserves raw trace inputs and records raw + normalized content hashes
Records span-ready tool-call metadata such as IDs, timings, exit codes, status, error type, and artifact references
Scores traces through a registered evaluator with versioned provenance and structured evidence
Detects durable user preferences, corrections, success signals, and reusable workflows
Creates deduplicated memory and skill proposals instead of silently mutating global state
Tracks proposal lifecycle from pending to approved to applied
Applies approved proposals only into the selected project directory
Exports SFT JSONL and DPO JSONL datasets with optional score gates, split files, manifests, provenance, and count/token stats
Replays traces through evaluator versions to benchmark score/evidence changes before training
Generates reviewed training config artifacts for Unsloth, TRL, and Axolotl without running training
Redacts common secret patterns during ingestion/export
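
To make the hashing and redaction items above concrete, here is a minimal sketch using only the standard library. The two regular expressions, the [REDACTED] placeholder, the choice of SHA-256, and the ingest function are all assumptions for illustration. SkillLoop's real patterns and hash scheme are not shown in this README.

```python
import hashlib
import json
import re

# Illustrative patterns only; a real deployment needs a broader, tested set.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                        # API-key-like tokens
    re.compile(r"(?i)(password|token|api_key)\s*[=:]\s*\S+"),  # key=value style secrets
]


def redact(text: str) -> str:
    """Replace every match of every pattern with a fixed placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def content_hash(payload: bytes) -> str:
    """Hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(payload).hexdigest()


def ingest(raw: bytes) -> dict:
    """Hash the raw input, redact it, then hash a canonical form of the result."""
    raw_hash = content_hash(raw)
    normalized = {"content": redact(raw.decode("utf-8", errors="replace"))}
    # Sorted keys and fixed separators make the serialized form, and so the hash, stable.
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {
        "raw_sha256": raw_hash,
        "normalized_sha256": content_hash(canonical),
        "normalized": normalized,
    }
```

Recording both hashes means a reviewer can later check that a stored raw file is unchanged and that a normalized record was derived from the same content.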

What SkillLoop does not do in v1

It does not replace an agent runtime
It does not fine-tune a model directly
It does not write into ~/.hermes/memories, ~/.hermes/skills, or global agent config
It does not require cloud services
It does not store credentials

Install for local development

git clone <repo-url>
cd skillloop
python -m pip install -e '.[dev]'

SkillLoop requires Python 3.11+.

Quickstart

Run the sample workflow from the repository root:

python -m skillloop.cli --path . init
python -m skillloop.cli --path . ingest generic examples/traces/simple_trace.jsonl
python -m skillloop.cli --path . traces list
python -m skillloop.cli --path . eval latest --evaluator rubric
python -m skillloop.cli --path . distill latest
python -m skillloop.cli --path . review list --verbose
python -m skillloop.cli --path . export sft --out data/sft.jsonl --min-score 70 --splits train=0.8,validation=0.1,test=0.1
python -m skillloop.cli --path . export dpo --out data/dpo.jsonl --min-score 70

The review list output shows proposal IDs. To test the approval/apply path, approve a listed proposal by full ID or unique prefix, then run apply:

python -m skillloop.cli --path . review approve <proposal-id-or-prefix>
python -m skillloop.cli --path . apply

You can also use the console script after installation:

skillloop --path . init
skillloop --path . ingest generic examples/traces/simple_trace.jsonl

CLI overview

skillloop --path <project-root> init
skillloop --path <project-root> setup --connect hermes [--start] [--auto-export]
skillloop --path <project-root> status [--json]
skillloop --path <project-root> ingest generic <jsonl-path>
skillloop --path <project-root> ingest hermes <json-path>
skillloop --path <project-root> ingest hermes-db --latest [--db-path ~/.hermes/state.db]
skillloop --path <project-root> ingest hermes-db --session-id <id> [--db-path ~/.hermes/state.db]
skillloop --path <project-root> traces list
skillloop --path <project-root> traces show <trace-id|latest>
skillloop --path <project-root> eval <trace-id|latest> [--evaluator rubric]
skillloop --path <project-root> distill <trace-id|latest>
skillloop --path <project-root> review list [--verbose]
skillloop --path <project-root> review approve <proposal-id-or-prefix>
skillloop --path <project-root> review reject <proposal-id-or-prefix>
skillloop --path <project-root> apply
skillloop --path <project-root> export sft --out <path> [--min-score N] [--splits train=0.8,validation=0.1,test=0.1] [--manifest-out manifest.json]
skillloop --path <project-root> export dpo --out <path> [--min-score N] [--splits train=0.8,validation=0.1,test=0.1] [--manifest-out manifest.json]
skillloop --path <project-root> benchmark [--baseline rubric_legacy] [--candidates rubric] [--out benchmark.json]
skillloop --path <project-root> training-config trl|unsloth|axolotl --dataset-manifest manifest.json --base-model <model> --output-dir <dir> --config-dir <dir>
skillloop --path <project-root> controller run
skillloop --path <project-root> controller history [--limit N]
skillloop --path <project-root> controller show <run-id-or-prefix>
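
The --splits value used by the export commands is a comma-separated list of name=fraction pairs. The sketch below shows one way such a value could be parsed and used to assign records to splits deterministically. It is not the SkillLoop implementation: parse_splits and assign_split are hypothetical names, and both the sum-to-1.0 check and the hash-based assignment are assumptions.

```python
import hashlib


def parse_splits(spec: str) -> dict[str, float]:
    """Parse 'train=0.8,validation=0.1,test=0.1' into an ordered name -> fraction map."""
    splits: dict[str, float] = {}
    for part in spec.split(","):
        name, _, value = part.partition("=")
        name = name.strip()
        if not name or not value:
            raise ValueError(f"malformed split entry: {part!r}")
        splits[name] = float(value)
    total = sum(splits.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"split fractions must sum to 1.0, got {total}")
    return splits


def assign_split(record_id: str, splits: dict[str, float]) -> str:
    """Map a record ID to a split; the same ID always lands in the same split."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, fraction in splits.items():
        cumulative += fraction
        if point < cumulative:
            return name
    return name  # guard against floating-point rounding at the upper edge
```

Hashing the record ID instead of shuffling keeps a trace in the same split across repeated exports, which avoids train/test leakage when a dataset is regenerated.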

Clean export boundary

By default, SkillLoop writes only under the selected project root. Dataset exports and manifests go to the paths given with --out and --manifest-out:

local state: .skillloop/skillloop.db
preserved raw trace inputs: .skillloop/raw_traces/*
approved memory exports: .skillloop/approved/memory/*.md
approved skill exports: .skillloop/approved/skill/*.md
training data exports: user-selected paths such as data/sft.jsonl
dataset manifests: default <out>.manifest.json or --manifest-out <path>

This is intentional. The first version is a clean export layer, not a global self-mutating runtime.
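
A boundary like this is commonly enforced by resolving each target path and refusing anything that escapes the root. The following is a minimal sketch of that check, assuming a hypothetical resolve_inside helper. It illustrates the idea and is not taken from SkillLoop's code.

```python
from pathlib import Path


def resolve_inside(project_root: Path, relative: str) -> Path:
    """Return the resolved target path, or raise if it falls outside project_root."""
    root = project_root.resolve()
    # resolve() collapses '..' segments and follows symlinks before the check.
    candidate = (root / relative).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"refusing to write outside project root: {candidate}")
    return candidate
```

For example, resolve_inside(root, ".skillloop/approved/memory/note.md") returns a path under root, while resolve_inside(root, "../outside.md") raises ValueError.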

Repository layout

skillloop/
  adapters/      Trace ingestion adapters
  apply/         Review-approved filesystem exports
  distill/       Memory and skill proposal generation
  dataset.py     Dataset split, manifest, provenance, and stats helpers
  eval/          Evaluator registry, deterministic rubric, and structured evidence helpers
  export/        SFT and DPO dataset exporters
  review/        Proposal review queue helpers
  cli.py         Command-line interface
  schema.py      Normalized trace/eval/proposal dataclasses
  store.py       SQLite persistence layer
  training_config.py  Unsloth/TRL/Axolotl config generation only
examples/
  traces/        Sample input traces
tests/           Pytest coverage for the MVP
docs/            Architecture and usage documentation

Safety model

SkillLoop is review-first:

Ingested traces are stored locally
Raw traces are preserved locally with hashes for provenance
Evaluations carry evaluator name, evaluator version, evidence, and trace schema version
Distillation creates proposals, not global mutations
Duplicate active proposals are skipped by content hash
Human approval is required before apply
Applied proposals are marked as applied and stamped with the time of application
Dataset exports include trace/evaluation/proposal provenance in record metadata and manifest summaries
Approved exports stay inside .skillloop/approved/
.env, .env.*, generated datasets, and local state are gitignored

See docs/safety.md for details.

Development checks

python -m pytest tests/ -q
python -m compileall skillloop tests -q
python -m pip wheel . --no-deps -w /tmp/skillloop-wheel-check

Expected MVP result: all tests pass and the sample workflow exports at least one SFT record.

Proof-of-work status

This repository is an initial proof-of-work for the SkillLoop architecture. It already demonstrates the core loop:

trace ingestion → evaluation → memory/skill proposals → human review → safe local apply → fine-tuning data export

The current proof-of-work also includes the first trustworthy-data layer needed before model training becomes meaningful:

schema-versioned traces with backward compatibility for old traces
runtime and adapter metadata on traces
span-ready tool-call schema
raw trace preservation and content hashes
evaluator provenance and structured evidence
evaluator registry for versioned scoring strategies
proposal deduplication and applied lifecycle tracking
dataset manifests, split exports, export metadata, provenance summaries, and deterministic token/count stats
replay benchmark reports that compare evaluator versions before training
Unsloth, TRL, and Axolotl config generation with explicit no-auto-training safety flags
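
To show how an evaluator registry and a replay benchmark fit together, here is a small sketch. The evaluator names rubric and rubric_legacy come from the CLI overview, but the version strings, the scoring rules, the trace fields (succeeded, tool_errors), and the report layout are invented for this example and do not reflect SkillLoop's real evaluators.

```python
from typing import Callable

Evaluator = Callable[[dict], dict]
REGISTRY: dict[str, tuple[str, Evaluator]] = {}


def register(name: str, version: str):
    """Decorator that records an evaluator under a name with a version string."""
    def decorator(fn: Evaluator) -> Evaluator:
        REGISTRY[name] = (version, fn)
        return fn
    return decorator


@register("rubric_legacy", "0.1")  # hypothetical version
def rubric_legacy(trace: dict) -> dict:
    score = 100 if trace.get("succeeded") else 0
    return {"score": score, "evidence": [f"succeeded={bool(trace.get('succeeded'))}"]}


@register("rubric", "0.2")  # hypothetical version
def rubric(trace: dict) -> dict:
    score, evidence = 0, []
    if trace.get("succeeded"):
        score += 70
        evidence.append("task succeeded")
        if trace.get("tool_errors", 0) == 0:
            score += 30
            evidence.append("no tool errors")
    return {"score": score, "evidence": evidence}


def replay(traces: list[dict], baseline: str, candidate: str) -> dict:
    """Score every trace with both evaluators and report the per-trace delta."""
    base_version, base_fn = REGISTRY[baseline]
    cand_version, cand_fn = REGISTRY[candidate]
    rows = []
    for trace in traces:
        base_score = base_fn(trace)["score"]
        cand_score = cand_fn(trace)["score"]
        rows.append({
            "trace_id": trace["id"],
            "baseline": base_score,
            "candidate": cand_score,
            "delta": cand_score - base_score,
        })
    return {
        "baseline": f"{baseline}@{base_version}",
        "candidate": f"{candidate}@{cand_version}",
        "rows": rows,
    }
```

Replaying stored traces through both versions shows which records would cross a --min-score gate differently before any dataset is regenerated.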

See:

docs/architecture.md for system design
docs/cli.md for commands
docs/safety.md for safety boundaries
docs/trace-schema.md for data format

License

Apache-2.0. See LICENSE.