rockie-nugget
Health Warn
- License — License: Apache-2.0
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Low visibility — Only 5 GitHub stars
Code Pass
- Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions Pass
- Permissions — No dangerous permissions requested
No AI report is available for this listing yet.
An open, thin, model-agnostic harness for long-horizon autonomous AI research. Built on Goose.
🪨 rockie-nugget
An open harness for long-horizon, autonomous AI research.
rockie-nugget is a thin, model-agnostic agent runtime built for one job: running an AI agent
that does real research — reads the literature, writes code, launches GPU experiments,
reads the metrics, and iterates — for hours, unattended. It runs on your laptop, on a
server, or as a per-tenant runtime on the Rockie platform.
Status: early, and built in the open. The architecture below is the design we're
building to, and parts of it are still landing. If you work on self-improving AI, agentic
research, or research-agent evaluation, this is exactly the moment to get involved — see
Collaborating.
Why another harness?
Because the harness shouldn't be the interesting part. The lesson of the last two years is
that elaborate agent scaffolding gets obsoleted by better models — so rockie-nugget keeps the
loop deliberately thin and puts the leverage where it lasts:
- Model-agnostic. Point it at any OpenAI-compatible endpoint — a frontier model, a
cheap open-weight model, or your own local server. Bring your own key. - MCP-native. Tools are Model Context Protocol
servers. The agent's whole capability set is a declarative, versioned tool contract —
seecontract/. - Skills are just markdown. The Rockie skills ecosystem is plain
.mdfiles — read
them, fork them, write your own. No DSL, no lock-in. - Built to be an evaluation and training environment. The same harness that serves a
task can measure how good a model is at research, and (privately, on our side) train one.
What the agent actually does
Each step, the agent picks one move from a small, fixed tool menu — theresearch-env-v1 action space:
| Verb | Tools |
|---|---|
| Gather | web_search, fetch_url, read_file, list_files |
| Write | write_file |
| Run | run_command (sandboxed) |
| Experiment | submit_job, get_job — provision a GPU run, read its results |
| Finish | finish |
That's enough to do the real thing: read a paper, implement an idea, launch a training run,
read the metrics, decide what to try next. GPU experiments are provider-invariant — the
agent asks for 8xH100, and the runtime decides where to provision it (you never see, and
never have to care, which cloud answered).
Quickstart
Bring your own key (BYOK). The installer needs a Linux x86_64 host with glibc ≥ 2.36
(Debian bookworm, Ubuntu 22.04+) andpython3— the runtime is a Linux binary. Nativecargo installis planned (#1953).
# install (assembles the runtime + tool surface from this checkout — config only)
git clone https://github.com/Rockielab/rockie-nugget.git
cd rockie-nugget
./install.sh
# add ~/.local/bin to PATH if it isn't already
export PATH="$HOME/.local/bin:$PATH"
# point it at any OpenAI-compatible model (a frontier model, an open-weight
# model, or your own server) — bring your own key
export OPENAI_BASE_URL=https://api.your-provider.com OPENAI_API_KEY=sk-...
# run a research task
nugget run "Reproduce the rank-enrichment result from the learned-representations repo
and propose one experiment that would push effective rank higher."
install.sh fetches the sha-verified Goose runtime to ~/.local/bin/goose, registers this
repo's research-env-v1 MCP server as the agent's tool surface in~/.config/goose/config.yaml, installs the Rockie overlay (below), and installs anugget launcher that maps your BYOK env to a Goose provider (OPENAI_* → openai,ANTHROPIC_API_KEY → anthropic). Re-running is idempotent. Run it headless for long-horizon
work.
The Rockie overlay
Bare Goose is a capable agent; the overlay (overlay/) is what makes it
reason like Rockie. It is config-only — the same files the local installer copies into~/.config/goose/ are the files the platform mounts, so local install ≡ platform runtime.
overlay/goosehints— the ethos. The hard rules, the
Plan → Research → Build → Audit → Run → Assess → Codify loop, the pre-experiment
checklist, and the Brainstorm → Research → Attack → Validate waterfall. Loaded every turn.
The installer merges it into~/.config/goose/.goosehintsunder a managed sentinel block,
so your own hints are never clobbered.overlay/recipes/— composable task templates:autoresearch.yaml(frozen-metric
loop),experiment.yaml(pre-experiment gate →submit_job→ poll → report), andclean.yaml(anti-slop pre-commit pass). Run one withgoose run --recipe <file>.overlay/memory/+ the builtin memory extension — durable cross-session memory.
The agent emits[LEARN]/[DEAD-END]blocks; aStophook
(overlay/hooks/capture.sh) appends them to plain-text memory the next
session recalls. No database — plain files, deliberately.
The overlay names no model identity or provider SKU: it is fully model-agnostic, so it works
identically whether you run BYOK locally or against the served model on the platform.
Evaluating a model's research ability
rockie-nugget ships an eval environment so you can measure how well any model does
research inside the harness — not on a leaderboard in the abstract, but on the actual loop
it would run. The adapter pattern is a 3-file contract (prepare / run / grade); seeeval/adapters/super/ for a worked example against the open
SUPER benchmark. Adapters target the open research-agent
benchmarks:
- RE-Bench (METR) — real ML R&D engineering (kernels, scaling laws)
- ScienceAgentBench — data-driven scientific tasks from real papers
- SUPER (allenai) — setting up and running research repos from the wild (adapter shipped)
- MLE-bench — Kaggle-style ML engineering
# run the SUPER adapter against a BYOK model (3-file prepare/run/grade contract)
eval/adapters/super/prepare.sh && eval/adapters/super/run.sh # one model, one harness, one number
A single
nugget eval <bench>front-end over these adapters is on the roadmap; today
the adapters run directly.
A note we take seriously: the harness moves benchmark scores by double digits on its
own, so rockie-nugget's eval always changes one variable at a time — fix the harness to
compare models, fix the model to compare harness changes. Scores from a moving harness are
noise.
How it fits with Rockie
rockie-nugget is open and runs anywhere. The Rockie platform adds
the parts that need a backend:
- Skills — the markdown research toolchain (open).
mcp-rockie— the MCP server for dispatching agents to a 24/7 runtime and
provisioning GPUs through Rockie.- The Rockie Model — a Rockie-hosted model we're tuning for long-horizon research. Use
it from rockie-nugget by pointing at our endpoint; usage meters from your Rockie credits.
You can run rockie-nugget entirely locally with your own model and never touch the platform —
that's the point. Rockie earns its keep when you provision compute or use the hosted model
through it, not by locking up the harness.
Architecture
your task
│
▼
┌─────────────────────────────────────────────┐
│ rockie-nugget loop (thin, model-agnostic) │
│ prompt → model → tool-call → result → … │
└───────────────┬─────────────────────────────┘
│ frozen, versioned tool contract (MCP)
▼
web · files · run_command · submit_job/get_job · finish
│
▼ (GPU work is provider-invariant — cloud hidden from the agent)
Rockie GPU provisioning / your own compute
The agent only ever sees a uniform tool surface. Everything that varies for non-research
reasons — which GPU cloud, connection details, credentials — is pushed below the tool
boundary, so a trajectory looks identical no matter where it ran. (That same property is
what lets the environment double as clean training data on our side.)
Repository layout
| Path | What's there |
|---|---|
overlay/ |
The Rockie overlay (config-only): the ethos goosehints, research recipes/, the memory scaffold + capture hooks/. This is what turns bare Goose into a Rockie harness. |
contract/ |
The versioned tool contract (research-env-v1) — JSON Schemas, manifest, and the versioning discipline. This is the agent's action space. |
mcp/research-env-mcp/ |
A small stdlib-Python stdio MCP server that exposes the contract's tools to any MCP-native agent (e.g. Goose). |
eval/adapters/ |
Benchmark adapters — the 3-file prepare/run/grade pattern for scoring a model in-harness. |
Collaborating
rockie-nugget is being built in the open, and the most interesting open problems in it are
research problems, not engineering ones: reward design for long-horizon research,
contamination-resistant evaluation, and what an agent's tool surface should even be. If
you're a researcher working on self-improving AI, agentic research, or agent evaluation —
we'd love to build this with you, and we mean in person. Open an issue, pick up a
good first issue, or reach the team at
rockielab.com.
See CONTRIBUTING.md for how to get started andCODE_OF_CONDUCT.md for community expectations.
License
Apache License 2.0. rockie-nugget builds on
Goose (© Block, Inc., Apache-2.0); seeNOTICE for attribution.
rockie-nugget is part of the Rockie / Pebble ML stack. 🪨 Another rock on the pile.
Reviews (0)
Sign in to leave a review.
Leave a reviewNo results found