autostar
Health Pass
- License — MIT
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Community trust — 28 GitHub stars
Code Pass
- Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions Pass
- Permissions — No dangerous permissions requested
This tool is an automated optimization agent that uses a structured loop to iteratively improve measurable goals (like code quality or documentation). It relies on verifiable metrics to run experiments, evaluate the outcomes, and converge on the best possible version of the target artifact within a given budget.
Security Assessment
Overall Risk: Low. The automated code scan analyzed 12 files and found no dangerous patterns, hardcoded secrets, or requests for overly broad permissions. Because it is an agentic tool designed to mutate and evaluate files, it inherently requires the ability to read and modify local project files, as well as execute basic local commands (like linters or tests). Users should still exercise standard caution regarding what files and local tools an autonomous agent is permitted to access and run.
Quality Assessment
The project appears to be in excellent health. It is freshly maintained, with repository activity logged as recently as today. It uses the highly permissive and standard MIT license. While still a relatively niche project with 28 GitHub stars, this indicates a small but present level of community validation.
Verdict
Safe to use, though users should monitor the autonomous agent's file modifications and tool executions during initial runs.
Autoresearch ALL THE THINGS. RLVR for the masses.
a* (autostar)
If you can measure it, you can improve it.
Soft RLVR for the masses. a* turns any measurable goal into a structured optimisation loop. You define what "good" looks like. a* runs experiments, scores the results, learns from every attempt, and converges on the best version within your budget.
No reward model to train. No environment to build. No GPU cluster to provision. Just a goal, an evaluator, and an agent that knows how to search.
Fastest path
If you just want to install the skill and try it once in Claude Code:
```shell
npx skills add chrisvoncsefalvay/autostar
```
Then invoke it in Claude Code:
```
/skill autostar
```
The skill handles onboarding, confirms the mission with you, and runs experiments within your approved budget.
What this is
Most artifacts live in a space where quality is real but hard to verify fully. Code can be type-checked but not beauty-checked. Prose can be spell-checked but not tone-checked.
Traditional RLVR (reinforcement learning from verifiable rewards) needs rewards you can compute with certainty: checkable math answers, unit tests, formal proofs. That covers a narrow slice of what people want to improve.
a* uses verifiable-ish rewards instead. It combines hard signals (type checkers, linters, test suites) with soft ones (LLM judges, human gates). Each track has its own verifier. The system runs enough steps per lap to build statistical confidence.
The result: an optimisation loop for anything you can split into measurable dimensions. Code quality, documentation, prompt engineering, writing style, API design, accessibility.
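The split into measurable dimensions can be pictured as a list of tracks, each paired with its own verifier. The following is a minimal sketch, not the skill's actual schema; the `Track` class and `overall_score` helper are illustrative names:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Track:
    """One measurable dimension of the artifact (illustrative schema)."""
    name: str
    verifier: Callable[[str], float]  # artifact text -> score in [0, 1]
    weight: float = 1.0

def overall_score(artifact: str, tracks: list[Track]) -> float:
    """Weighted mean of all track scores."""
    total = sum(t.weight for t in tracks)
    return sum(t.weight * t.verifier(artifact) for t in tracks) / total

# One hard signal (a deterministic rule) and one soft stand-in.
tracks = [
    Track("line_length", lambda a: 1.0 if all(len(l) <= 80 for l in a.splitlines()) else 0.0),
    Track("has_docstring", lambda a: 1.0 if '"""' in a else 0.0, weight=0.5),
]
print(overall_score('"""Docs."""\nx = 1\n', tracks))  # both tracks pass -> 1.0
```

In practice a soft track would call an LLM judge rather than a lambda, but the aggregation shape is the same.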
See it in action
Fixing this very documentation
How it works
a* runs in five phases:
1. Onboarding
An interactive dialogue — never skipped, never auto-inferred. The system breaks your goal into tracks (measurable dimensions), sets up verifiers, sets constraints, agrees a budget, and gets your approval before running.
2. Pre-run preparation
Baseline measurement, tool checks, disposition library query, and final mission confirmation.
3. Execution loop
The core cycle:
- **Step** — mutate the artifact, evaluate all tracks, ratchet (keep or revert)
- **Lap** — N steps with the same parameter family, ending in a statistical verdict
- **Round** — a set of laps, ending in a mandatory reflection
Each round ends with a structured reflection: Worth pursuing? Ask the user? Pivot? The system escalates when scores plateau, tracks diverge, or budget pace is at risk — always with specific questions, not vague updates.
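The step/lap/ratchet hierarchy can be sketched as a simple loop. This is a toy illustration under the assumption that all tracks collapse to a single score, not the skill's actual control flow:

```python
import random
import statistics

def run_lap(artifact, evaluate, mutate, n_steps=5):
    """One lap: n_steps mutations under the same parameter family.
    The ratchet keeps a candidate only if it scores strictly better."""
    best, best_score = artifact, evaluate(artifact)
    scores = []
    for _ in range(n_steps):
        candidate = mutate(best)      # step: mutate the artifact
        score = evaluate(candidate)   # evaluate all tracks (one number here)
        scores.append(score)
        if score > best_score:        # ratchet: keep or revert
            best, best_score = candidate, score
    # Lap verdict comes from the score distribution, not a single observation.
    verdict = "promising" if statistics.mean(scores) > evaluate(artifact) else "stalled"
    return best, best_score, verdict

# Toy goal: move an integer toward 42.
random.seed(0)
best, score, verdict = run_lap(0, lambda x: -abs(x - 42), lambda x: x + random.randint(-3, 3))
```

Rounds would then group several such laps and end with the reflection described above.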
A built-in visualiser renders live progress as an inline HTML dashboard.
4. Memory and learning
The canonical source of truth is one local persistent backend for the whole
system. Short-term run logs stay append-only inside a run, while long-term
memory is stored in that backend and exported as human-readable JSON/JSONL
mirrors.
Within a run:
- `step_log.jsonl` and `reflections.jsonl` are append-only sources of truth
- `hypothesis_stack.json`, `track_trajectories.json`, and `momentum.json` are derived snapshots
- if a derived file disagrees with a log, the log wins
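The "log wins" rule amounts to rebuilding derived snapshots from the append-only log whenever they disagree. A minimal sketch, assuming a `scores` field per step (the field names are illustrative, not the skill's real schema):

```python
import json

def rebuild_snapshot(step_log_lines):
    """Derive latest per-track scores from an append-only JSONL step log.
    If a cached snapshot ever disagrees with this result, the log is
    authoritative and the snapshot is regenerated."""
    latest = {}
    for line in step_log_lines:
        step = json.loads(line)
        for track, score in step.get("scores", {}).items():
            latest[track] = score  # later log entries win
    return latest

log = [
    '{"step": 1, "scores": {"readability": 0.4, "tests": 1.0}}',
    '{"step": 2, "scores": {"readability": 0.7}}',
]
print(rebuild_snapshot(log))  # {'readability': 0.7, 'tests': 1.0}
```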
Across runs:
- episodes and run summaries are persisted in the backend and mirrored to JSONL
- dispositions are versioned in the backend and mirrored to JSON
- Claude.ai fallback sessions use a text-first project memory pack exported from the backend
Memory access modes:
| Mode | Meaning |
|---|---|
| `direct_backend` | Runtime can use the canonical local backend directly |
| `connector_backed` | Runtime reaches memory through a connector |
| `project_pack` | Runtime uses a text-first project pack with manual sync |
| `none` | Short-term memory only |
Dispositions are the long-term knowledge base. They shape future actions based
on what worked and what didn't. Claude's built-in memory is useful for chat
continuity, but it is not the system of record for a* learning.
5. Post-run report
Baseline vs. final scores, trajectory charts, reflection log, what worked, what didn't, and budget accounting.
Verification taxonomy
Every track declares one of five verifier types:
| Type | Signal | Use when |
|---|---|---|
| Deterministic | Formula / regex / rule | Word count, format compliance, schema validation |
| External tool | CLI subprocess | pyright, pytest, eslint, lighthouse, vale, bandit |
| LLM judge | Structured LLM call with fixed scoring criteria | Readability, tone, documentation quality |
| Hybrid | Tool + LLM judge, aggregated | Factual accuracy (entity check gates quality score) |
| Human gate | Pause and ask the user | Brand approval, legal sign-off, aesthetics |
Verifiers are immutable during a run. Changing the evaluator mid-run is the main failure mode of autonomous optimisation — a* prevents it by design.
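A deterministic verifier from the taxonomy above can be as small as a closure frozen at run start. A sketch (the scoring curve is an assumption for illustration):

```python
import re

def make_word_count_verifier(target: int, tolerance: int):
    """Deterministic verifier: 1.0 inside the tolerance band, decaying
    linearly outside it. Created once at run start and never changed
    mid-run, which is what verifier immutability means in practice."""
    def verify(text: str) -> float:
        words = len(re.findall(r"\S+", text))
        miss = max(0, abs(words - target) - tolerance)
        return max(0.0, 1.0 - miss / target)
    return verify

verify = make_word_count_verifier(target=100, tolerance=10)
print(verify("word " * 105))  # within tolerance -> 1.0
```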
External judges
LLM judge tracks run in self mode (host agent judges inline) or external mode (separate model via subprocess). External mode keeps mutator and judge independent, and handles safety-filter conflicts: sensitive domains can use a judge that won't refuse to score legitimate content.
The contract: a* writes a JSON request, calls your command, and reads a JSON response (score + rationale) from stdout. Bring any model with a CLI wrapper.
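A toy judge honouring that contract might look like the following. The request/response field names (`artifact`, `criteria`, `score`, `rationale`) are assumptions for illustration; a real external judge would read the request with `json.load(sys.stdin)` and would call an actual model instead of the heuristic used here:

```python
import json

def judge(request: dict) -> dict:
    """Stand-in for a real model call: rewards shorter sentences."""
    sentences = [s for s in request["artifact"].split(".") if s.strip()]
    avg_len = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    score = max(0.0, min(1.0, 1.5 - avg_len / 20))
    return {"score": round(score, 2), "rationale": f"avg sentence length {avg_len:.1f} words"}

# The contract: JSON request in, JSON response (score + rationale) on stdout.
request = {"artifact": "Short sentences. Easy to read.", "criteria": "readability"}
print(json.dumps(judge(request)))
```

Any model with a CLI wrapper can sit behind this shape.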
Safety-filter resilience
When a safety filter refuses an LLM call, a* rephrases and retries up to twice. If retries fail, it offers the user specific options: switch to an external judge, adjust criteria, skip the track, or abort. All rejections are logged.
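The retry-then-escalate behaviour can be sketched as follows. The exception and function names are illustrative, not the skill's internals:

```python
class SafetyRefusal(Exception):
    """Raised when a safety filter declines to process a call."""

def call_with_retries(call, rephrase, prompt, max_retries=2):
    """Attempt the call; on refusal, rephrase and retry up to max_retries
    times. Every rejection is logged. The final failure is re-raised so
    the caller can offer the user specific options (external judge,
    adjust criteria, skip the track, abort)."""
    log = []
    for _ in range(max_retries + 1):
        try:
            return call(prompt), log
        except SafetyRefusal as refusal:
            log.append({"prompt": prompt, "refusal": str(refusal)})
            prompt = rephrase(prompt)
    raise SafetyRefusal(f"refused after {max_retries} retries; escalating to user")
```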
Progress format
a* writes structured, machine-readable output for external tools:
| File | Format | Updated |
|---|---|---|
| `step_log.jsonl` | JSONL | After every step |
| `reflections.jsonl` | JSONL | After every round |
| `tracks.json` | JSON | Once at run start |
| `mission.json` | JSON | Once at run start |
| `progress.json` | JSON | After every step |
`progress.json` is the single-file state snapshot. Poll it from dashboards, CLI tools, or downstream agents without parsing JSONL. Schemas for all formats are in `schemas/`.
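A downstream tool can poll the snapshot with a few lines of stdlib Python. The fields read here (`step`, `best_score`) are assumptions for illustration; the authoritative shapes live in `schemas/`:

```python
import json
import tempfile
from pathlib import Path

def summarise_progress(path) -> str:
    """One-line summary of the state snapshot (field names illustrative)."""
    p = json.loads(Path(path).read_text())
    return f"step {p['step']}: best={p['best_score']:.2f}"

# Simulate the snapshot a run would write after a step.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"step": 12, "best_score": 0.84}, f)

print(summarise_progress(f.name))  # step 12: best=0.84
```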
Installation
Prerequisites
- Claude Code (for the native install path below)
- Python 3+
- `pyyaml` + `jsonschema` (for validation/packaging scripts)
- `unzip` (if installing from a release archive)
Claude Code
```shell
# Clone the repo
git clone https://github.com/chrisvoncsefalvay/autostar.git

# Copy the skill into your Claude Code skills directory
cp -r autostar/autostar-skill ~/.claude/skills/autostar-skill
```
Or install via skill.sh:
```shell
npx skills add chrisvoncsefalvay/autostar
```
Or install directly from a release:
```shell
# Download the .skill file from the latest release
curl -sL https://github.com/chrisvoncsefalvay/autostar/releases/latest/download/autostar-skill.skill -o autostar.skill

# Unzip into your skills directory
unzip autostar.skill -d ~/.claude/skills/
```
Claude.ai custom Skills
Build the Claude.ai upload archive:
```shell
python autostar-skill/scripts/package_skill.py --target claude-ai dist/
```
Upload dist/autostar-claude-ai-skill.zip to Claude.ai as a custom Skill.
For long-term memory in Claude.ai:
- preferred: configure the remote memory connector and enable it per conversation
- fallback: add a project memory pack to project knowledge and manually sync updated pack files after runs
- if neither exists, a* reports short-term-only mode explicitly
Other agents
The .skill format is a ZIP archive with SKILL.md and supporting files. Other agent frameworks can use it through a runtime adapter — see autostar-skill/references/runtime-capabilities.md. Without that layer, treat compatibility as partial.
Bundled adapters: Claude Code, Codex, Gemini CLI, Pi (full support); Claude.ai (reduced support); chat-only (unsupported boundary). Profiles and a template for new adapters live in autostar-skill/runtime-profiles/.
Inspect or select profiles with:
```shell
python autostar-skill/scripts/runtime_profile.py list
python autostar-skill/scripts/runtime_profile.py show claude-code
python autostar-skill/scripts/runtime_profile.py check-mission claude-code --verifier external_tool --verifier llm_judge
python autostar-skill/scripts/runtime_profile.py resolve claude-ai --project-pack ./memory-pack
```
Validation
```shell
python -m pip install pyyaml jsonschema pytest
python autostar-skill/scripts/quick_validate.py autostar-skill/
```
Usage
Invoke the skill in Claude Code:
```
/skill autostar
```
Or trigger it with natural language:
- "Optimise this code until the type checker is happy and the docs are good"
- "Iterate on this API handler — improve readability, keep tests passing"
- "Find the best configuration for this pipeline"
a* guides you through onboarding, gets your approval, and runs within your budget.
Repository structure
```
autostar/
  autostar-skill/       # The distributable skill
    SKILL.md            # Main instruction set
    assets/             # Visualiser template
    scripts/            # Packaging, validation, runtime profile tools
    references/         # Onboarding, verification, budgeting, memory,
                        #   runtime adapters (Claude Code, Codex, Gemini,
                        #   Pi, Claude.ai, chat-only, template)
    runtime-profiles/   # Machine-readable capability profiles per runtime
    schemas/            # JSON Schemas for output formats
  .github/workflows/    # CI: validate on PR; CD: build .skill on tag
  LICENSE
  README.md
```
Design philosophy
a* is built on seven principles:
**No silent inference.** Every decision requires explicit user confirmation. Inferred values are defaults, never silent assumptions.

**Immutable evaluation.** Verifiers and scoring rules do not change during a run. The evaluator is the ground truth.

**Statistical confidence.** Laps run multiple steps to get distributions, not point estimates. Verdicts emerge from the data, not single observations.

**Bidirectional learning.** The system learns from success and failure equally. Failed attempts record their failure modes, not just "failed."

**Reflection without action.** Every round ends with a recorded reflection, even when nothing changes.

**User in the loop at key moments.** The system escalates on plateaus, diverging tracks, pace risk, and repeated failures — with specific questions, not vague updates.

**Memory as decision-maker.** Dispositions shape every major action. The system gets better at the same class of problem over time.
These are hard constraints enforced by the skill's structure, not suggestions.
Building and packaging
```shell
python -m pip install pyyaml jsonschema pytest
python autostar-skill/scripts/package_skill.py autostar-skill/ dist/
python autostar-skill/scripts/package_skill.py --target all dist/
```
Validates the skill and produces dist/autostar-skill.skill.
Contributing
Contributions are welcome. The skill's behaviour lives entirely in SKILL.md and its reference files — no runtime code to compile, no models to train.
Please open an issue before large changes.
Author
Chris von Csefalvay (@epichrisis)
Author of Post-Training: A Guide for Developers and AI Engineers (No Starch Press). Building tools that make post-training techniques accessible to practitioners — because the gap between "this works in a paper" and "this works in my project" is where most of the value lives.
License
MIT. See LICENSE.