rockie-nugget

mcp
Guvenlik Denetimi
Uyari
Health Uyari
  • License — License: Apache-2.0
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Low visibility — Only 5 GitHub stars
Code Gecti
  • Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions Gecti
  • Permissions — No dangerous permissions requested

Bu listing icin henuz AI raporu yok.

SUMMARY

An open, thin, model-agnostic harness for long-horizon autonomous AI research. Built on Goose.

README.md

🪨 rockie-nugget

An open harness for long-horizon, autonomous AI research.

rockie-nugget is a thin, model-agnostic agent runtime built for one job: running an AI agent
that does real research — reads the literature, writes code, launches GPU experiments,
reads the metrics, and iterates — for hours, unattended. It runs on your laptop, on a
server, or as a per-tenant runtime on the Rockie platform.

Status: early, and built in the open. The architecture below is the design we're
building to, and parts of it are still landing. If you work on self-improving AI, agentic
research, or research-agent evaluation, this is exactly the moment to get involved — see
Collaborating.


Why another harness?

Because the harness shouldn't be the interesting part. The lesson of the last two years is
that elaborate agent scaffolding gets obsoleted by better models — so rockie-nugget keeps the
loop deliberately thin and puts the leverage where it lasts:

  • Model-agnostic. Point it at any OpenAI-compatible endpoint — a frontier model, a
    cheap open-weight model, or your own local server. Bring your own key.
  • MCP-native. Tools are Model Context Protocol
    servers. The agent's whole capability set is a declarative, versioned tool contract —
    see contract/.
  • Skills are just markdown. The Rockie skills ecosystem is plain .md files — read
    them, fork them, write your own. No DSL, no lock-in.
  • Built to be an evaluation and training environment. The same harness that serves a
    task can measure how good a model is at research, and (privately, on our side) train one.

What the agent actually does

Each step, the agent picks one move from a small, fixed tool menu — the
research-env-v1 action space:

Verb Tools
Gather web_search, fetch_url, read_file, list_files
Write write_file
Run run_command (sandboxed)
Experiment submit_job, get_job — provision a GPU run, read its results
Finish finish

That's enough to do the real thing: read a paper, implement an idea, launch a training run,
read the metrics, decide what to try next. GPU experiments are provider-invariant — the
agent asks for 8xH100, and the runtime decides where to provision it (you never see, and
never have to care, which cloud answered).

Quickstart

Bring your own key (BYOK). The installer needs a Linux x86_64 host with glibc ≥ 2.36
(Debian bookworm, Ubuntu 22.04+) and python3 — the runtime is a Linux binary. Native
cargo install is planned (#1953).

# install (assembles the runtime + tool surface from this checkout — config only)
git clone https://github.com/Rockielab/rockie-nugget.git
cd rockie-nugget
./install.sh

# add ~/.local/bin to PATH if it isn't already
export PATH="$HOME/.local/bin:$PATH"

# point it at any OpenAI-compatible model (a frontier model, an open-weight
# model, or your own server) — bring your own key
export OPENAI_BASE_URL=https://api.your-provider.com   OPENAI_API_KEY=sk-...

# run a research task
nugget run "Reproduce the rank-enrichment result from the learned-representations repo
            and propose one experiment that would push effective rank higher."

install.sh fetches the sha-verified Goose runtime to ~/.local/bin/goose, registers this
repo's research-env-v1 MCP server as the agent's tool surface in
~/.config/goose/config.yaml, installs the Rockie overlay (below), and installs a
nugget launcher that maps your BYOK env to a Goose provider (OPENAI_* → openai,
ANTHROPIC_API_KEY → anthropic). Re-running is idempotent. Run it headless for long-horizon
work.

The Rockie overlay

Bare Goose is a capable agent; the overlay (overlay/) is what makes it
reason like Rockie. It is config-only — the same files the local installer copies into
~/.config/goose/ are the files the platform mounts, so local install ≡ platform runtime.

  • overlay/goosehints — the ethos. The hard rules, the
    Plan → Research → Build → Audit → Run → Assess → Codify loop, the pre-experiment
    checklist, and the Brainstorm → Research → Attack → Validate waterfall. Loaded every turn.
    The installer merges it into ~/.config/goose/.goosehints under a managed sentinel block,
    so your own hints are never clobbered.
  • overlay/recipes/ — composable task templates: autoresearch.yaml (frozen-metric
    loop), experiment.yaml (pre-experiment gate → submit_job → poll → report), and
    clean.yaml (anti-slop pre-commit pass). Run one with goose run --recipe <file>.
  • overlay/memory/ + the builtin memory extension — durable cross-session memory.
    The agent emits [LEARN] / [DEAD-END] blocks; a Stop hook
    (overlay/hooks/capture.sh) appends them to plain-text memory the next
    session recalls. No database — plain files, deliberately.

The overlay names no model identity or provider SKU: it is fully model-agnostic, so it works
identically whether you run BYOK locally or against the served model on the platform.

Evaluating a model's research ability

rockie-nugget ships an eval environment so you can measure how well any model does
research inside the harness — not on a leaderboard in the abstract, but on the actual loop
it would run. The adapter pattern is a 3-file contract (prepare / run / grade); see
eval/adapters/super/ for a worked example against the open
SUPER benchmark. Adapters target the open research-agent
benchmarks:

  • RE-Bench (METR) — real ML R&D engineering (kernels, scaling laws)
  • ScienceAgentBench — data-driven scientific tasks from real papers
  • SUPER (allenai) — setting up and running research repos from the wild (adapter shipped)
  • MLE-bench — Kaggle-style ML engineering
# run the SUPER adapter against a BYOK model (3-file prepare/run/grade contract)
eval/adapters/super/prepare.sh && eval/adapters/super/run.sh   # one model, one harness, one number

A single nugget eval <bench> front-end over these adapters is on the roadmap; today
the adapters run directly.

A note we take seriously: the harness moves benchmark scores by double digits on its
own
, so rockie-nugget's eval always changes one variable at a time — fix the harness to
compare models, fix the model to compare harness changes. Scores from a moving harness are
noise.

How it fits with Rockie

rockie-nugget is open and runs anywhere. The Rockie platform adds
the parts that need a backend:

  • Skills — the markdown research toolchain (open).
  • mcp-rockie — the MCP server for dispatching agents to a 24/7 runtime and
    provisioning GPUs through Rockie.
  • The Rockie Model — a Rockie-hosted model we're tuning for long-horizon research. Use
    it from rockie-nugget by pointing at our endpoint; usage meters from your Rockie credits.

You can run rockie-nugget entirely locally with your own model and never touch the platform —
that's the point. Rockie earns its keep when you provision compute or use the hosted model
through it, not by locking up the harness.

Architecture

your task
   │
   ▼
┌─────────────────────────────────────────────┐
│  rockie-nugget loop  (thin, model-agnostic)  │
│  prompt → model → tool-call → result → …     │
└───────────────┬─────────────────────────────┘
                │  frozen, versioned tool contract (MCP)
                ▼
   web · files · run_command · submit_job/get_job · finish
                │
                ▼   (GPU work is provider-invariant — cloud hidden from the agent)
        Rockie GPU provisioning  /  your own compute

The agent only ever sees a uniform tool surface. Everything that varies for non-research
reasons — which GPU cloud, connection details, credentials — is pushed below the tool
boundary, so a trajectory looks identical no matter where it ran. (That same property is
what lets the environment double as clean training data on our side.)

Repository layout

Path What's there
overlay/ The Rockie overlay (config-only): the ethos goosehints, research recipes/, the memory scaffold + capture hooks/. This is what turns bare Goose into a Rockie harness.
contract/ The versioned tool contract (research-env-v1) — JSON Schemas, manifest, and the versioning discipline. This is the agent's action space.
mcp/research-env-mcp/ A small stdlib-Python stdio MCP server that exposes the contract's tools to any MCP-native agent (e.g. Goose).
eval/adapters/ Benchmark adapters — the 3-file prepare/run/grade pattern for scoring a model in-harness.

Collaborating

rockie-nugget is being built in the open, and the most interesting open problems in it are
research problems, not engineering ones: reward design for long-horizon research,
contamination-resistant evaluation, and what an agent's tool surface should even be. If
you're a researcher working on self-improving AI, agentic research, or agent evaluation —
we'd love to build this with you, and we mean in person. Open an issue, pick up a
good first issue, or reach the team at
rockielab.com.

See CONTRIBUTING.md for how to get started and
CODE_OF_CONDUCT.md for community expectations.

License

Apache License 2.0. rockie-nugget builds on
Goose (© Block, Inc., Apache-2.0); see
NOTICE for attribution.


rockie-nugget is part of the Rockie / Pebble ML stack. 🪨 Another rock on the pile.

Yorumlar (0)

Sonuc bulunamadi