cultivar

agent
Security Audit
Passed
Health: Passed
  • License — License: MIT
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Community trust — 26 GitHub stars
Code: Passed
  • Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions: Passed
  • Permissions — No dangerous permissions requested


SUMMARY

Use cultivar to test your Agent Skills, run them in sandboxes, and compare them across different agents.

README.md

cultivar

A CLI tool from the Pinecone DevRel team that helps you write tests for skills, run them across agents, and iterate until they work.

Test how well skills work against tasks, across agents, locally and remotely. Customize sandboxes for how agents should start, and graders for how agents should work.

Use traces to iteratively refine skills and optimize them against tasks.

Benchmark against skills, docs, and baselines, and run evals in parallel for faster execution.

Two ways to run it: the cultivar CLI directly, or install the bundled skill and let your coding agent drive it:

npx skills add https://github.com/pinecone-io/cultivar --skill cultivar

The engine is the same either way, but when an agent drives it you can't follow the live run or the --remote sandbox dashboard as directly as you can in your own terminal. Keep the skill project-scoped (not -g); it's never auto-tested.

Prerequisites

  • Python 3.11+ and uv
  • An Anthropic API key (the grader runs locally; agents in the sandbox use Modal-injected keys)
  • A Modal account if you want --remote runs. This is the recommended experience, since it gives you parallelism and isolation.
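A quick pre-flight check for the prerequisites above might look like the following. This is only a sketch: `check_tool` is a hypothetical helper rather than part of cultivar, and it assumes the Modal CLI is installed as `modal` and that the key is either exported or stored in a `.env` file in the working directory.

```shell
# Report whether a command is on PATH; returns non-zero when it is missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

check_tool uv || true      # required for installation
check_tool modal || true   # only needed for --remote runs

# The key may be exported in the environment or kept in ./.env.
if [ -n "${ANTHROPIC_API_KEY:-}" ] || grep -q '^ANTHROPIC_API_KEY=' .env 2>/dev/null; then
  echo "ok: ANTHROPIC_API_KEY"
else
  echo "missing: ANTHROPIC_API_KEY"
fi
```

The check only confirms that the tools and the key are present; it does not validate the key or the Modal login.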

Install

uv tool install cultivar

# Or install from source:
uv tool install --from "git+https://github.com/pinecone-io/cultivar" cultivar

Modal setup (one-time)

--remote runs each eval in an isolated Modal sandbox — recommended for parallelism and clean auth state. Skip this section if you only need local runs.

# 1. Install Modal and authenticate
pip install modal
modal token new

# 2. Create the secret the sandbox reads at runtime
modal secret create eval-sandbox-secrets \
  ANTHROPIC_API_KEY=sk-ant-...
  # Add any keys your tasks need: GEMINI_API_KEY, COPILOT_GITHUB_TOKEN, etc.

# 3. Verify
modal secret list   # eval-sandbox-secrets should appear

The first --remote run builds the sandbox image (~3–5 min). Subsequent runs use the cached image (~5–10 s cold start).

Defaults you can override via env var:

| Env var | Default | What it controls |
| --- | --- | --- |
| CULTIVAR_MODAL_SECRET | eval-sandbox-secrets | Name of the Modal secret mounted into each sandbox |
| CULTIVAR_MODAL_APP | cultivar | Modal app name (useful for isolating runs across teams or projects) |
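As an illustration, a team that wants its own secret and app name could export both variables before running. The values `team-a-eval-secrets` and `cultivar-team-a` are hypothetical placeholders, and the secret is assumed to exist already in the Modal workspace.

```shell
# Hypothetical names; replace them with a secret and app that exist in your workspace.
export CULTIVAR_MODAL_SECRET="team-a-eval-secrets"
export CULTIVAR_MODAL_APP="cultivar-team-a"

# Show the effective settings, falling back to the documented defaults when unset.
echo "secret=${CULTIVAR_MODAL_SECRET:-eval-sandbox-secrets} app=${CULTIVAR_MODAL_APP:-cultivar}"

# Subsequent remote runs in this shell would then pick the overrides up, e.g.:
#   cultivar run --skill my-skill --runner claude --remote
```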

For workspace sharing, custom images, and debugging sandbox failures, see docs/sandbox.md.

Quickstart: testing your own skill

1. Set up your working directory

mkdir ~/my-evals && cd ~/my-evals
cat > .env <<'EOF'
ANTHROPIC_API_KEY=sk-ant-...
EOF

2. Scaffold a task file

cultivar init my-skill

This writes ./tasks/my-skill.yaml and ./.claude/skills/my-skill/SKILL.md.

Tip: to test skills without your interactive coding agent auto-loading them, keep them outside .claude/ — e.g. set CULTIVAR_SKILLS_DIR=skills (or pass --skills-dir skills) and init scaffolds into ./skills/my-skill/. See where skills live.

3. Edit the skill (.claude/skills/my-skill/SKILL.md)

The skill file is what the agent sees when you invoke /my-skill. Write it like a concise brief: what the skill does, when to use it, and the key commands or patterns it should follow. Keep it tight — a few focused sections outperform a wall of text. If you're not sure where to start, drop your existing docs or a rough draft into Claude and ask it to write a SKILL.md for you.

4. Edit the tasks (tasks/my-skill.yaml)

Each task has an intent (what you'd say to the agent) and a criteria block (what PASS looks like, in plain English). A good criteria block names 2–3 concrete things that must be true and at least one common failure mode. Agents are good at this too: share a few examples of passing and failing behavior and ask Claude to draft the criteria.

For the full YAML schema and field reference, see docs/task-yaml.md.

5. Run + grade

cultivar run --skill my-skill --runner claude --remote --grade

Smoke test (post-install, no clone)

After uv tool install, verify the install works end-to-end with the packaged smoke:

cultivar hello                      # local: agent + grader (needs ANTHROPIC_API_KEY)
cultivar hello --remote             # also exercises Modal + eval-sandbox-secrets
cultivar hello --no-grade           # just exercise the runner (no API key needed)

hello runs a tiny "write hello.py" task that ships inside the wheel — no repo clone, no tasks/ setup. It exits 0 on PASS and prints diagnostics on FAIL. Use this to learn how to use cultivar.
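Because hello exits 0 on PASS, it can gate a setup script. The wrapper below is a minimal sketch, not something cultivar ships: `smoke` is a hypothetical helper that runs whatever command it is given and reports the result from the exit status.

```shell
# Run a command and report PASS/FAIL based on its exit status.
smoke() {
  if "$@"; then
    echo "smoke: PASS"
  else
    status=$?
    echo "smoke: FAIL (exit $status)"
    return "$status"
  fi
}

# Usage (assumes cultivar is installed):
#   smoke cultivar hello --no-grade
```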

Running remotely + inspecting results

# Single task, single variant
cultivar run --skill my-skill --runner claude --task my-task -v with-skill --remote

# All tasks + every applicable variant (with-skill, without-skill, and with-docs
# for tasks that declare context_refs)
cultivar run --skill my-skill --runner claude --remote

# 3 runs per (task, variant) for reliability, 5 sandboxes at once
cultivar run --skill my-skill --runner claude --remote --repeat 3 --parallel 5

# Raise the per-call wall-clock budget (default 90s; sandbox gets +60s buffer)
cultivar run --skill my-skill --runner claude --remote --timeout 180

# All three runners in parallel
cultivar run --skill my-skill --runner claude --remote &
cultivar run --skill my-skill --runner copilot --remote &
cultivar run --skill my-skill --runner gemini --remote &
wait   # block until all three background runs finish

# Run + grade in one shot
cultivar run --skill my-skill --runner claude --remote --grade

# Name a run so you can tell it apart later
cultivar run --skill my-skill --runner claude --remote --title baseline
cultivar run --skill my-skill --runner claude --remote --title after-tweak
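When the three runners are launched in the background from a script, it helps to wait for each one and notice failures. The function below is a sketch under two assumptions: cultivar run accepts --runner in any position among its options, and it exits non-zero when a run fails (the README only states the exit-code behavior for hello). `run_all_runners` is a hypothetical helper name.

```shell
# Launch the given base command once per runner in the background,
# then wait for every job and return 1 if any of them failed.
run_all_runners() {
  pids=""
  for runner in claude copilot gemini; do
    "$@" --runner "$runner" &
    pids="$pids $!"
  done

  failed=0
  for pid in $pids; do
    wait "$pid" || failed=1
  done
  return "$failed"
}

# Usage (assumes cultivar is installed and Modal is configured):
#   run_all_runners cultivar run --skill my-skill --remote
```

Waiting on each PID separately keeps one failing runner from being masked by the others.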

What you get per run (results/<timestamp>[__title]/):

results/2026-04-22T11-31-47__baseline/
├── tasks.json                                 # task definitions used (for reproducibility)
├── notes.md                                   # --notes text, if any
├── grades.json                                # written by grader after `cultivar grade`
└── claude/                                    # one subdir per runner
    ├── my-task__with-skill.json               # structured result + stats (tokens, cost, timing, session_id)
    ├── my-task__with-skill.md                 # readable conversation trace
    ├── my-task__with-skill.jsonl              # raw event stream from the agent CLI
    ├── my-task__with-skill.stderr.log         # captured stderr (if any)
    ├── my-task__with-skill.setup.log          # setup/verify/teardown outputs (if those hooks ran)
    ├── my-task__with-skill.verify.log
    ├── my-task__with-skill.teardown.log
    └── my-task__with-skill.workdir/           # any files the agent wrote (code-gen tasks)
        └── hello.py

With --repeat N, files get a __1 / __2 / __N suffix. Without --title, the dir is just <timestamp>/.
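For scripts that read raw artifacts, the newest run directory can be found by sorting names, since the timestamp prefix is fixed-width. This is a sketch that assumes every directory under results/ follows the <timestamp>[__title] pattern shown above; `latest_run` is a hypothetical helper, and cultivar show latest remains the simpler route for interactive use.

```shell
# Print the newest run directory under a results root (default: ./results).
latest_run() {
  root="${1:-results}"
  # Names such as 2026-04-22T11-31-47__baseline sort chronologically as plain text.
  for dir in "$root"/*/; do
    [ -d "$dir" ] && printf '%s\n' "${dir%/}"
  done | LC_ALL=C sort | tail -n 1
}

# Usage:
#   ls "$(latest_run)/claude/"
```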

Inspecting what actually happened:

| What | Where to look |
| --- | --- |
| One run, all sections (conversation, stats, workdir, grader) | cultivar show latest -r claude -t <task> |
| Just the conversation transcript for one run | cultivar show latest -t <task> --conversation-only |
| Just the grader verdict + reasoning + suggestions | cultivar show latest -t <task> --grader |
| Just the workdir file listing | cultivar show latest -t <task> --workdir |
| Summary table across all runners + variants | cultivar report |
| Human-readable conversation file | *.md |
| Raw stream-json events (Claude) / JSON lines (Copilot, Gemini) | *.jsonl |
| Stats (duration, tokens, cost, session id, sandbox timing) | *.json under usage / total_cost_usd / sandbox_timing |
| Grader verdict + evidence + reasoning + suggestions | grades.json or cultivar report latest |
| Why setup/verify/teardown failed | *.setup.log / *.verify.log / *.teardown.log |
| What the agent actually wrote to disk | *.workdir/ |
| Resume a Claude session interactively to poke at it | claude --resume <session_id> (in the panel footer of report, or via show … --grader) |
| Live sandbox state / per-sandbox logs (remote only) | Modal dashboard → Sandboxes (each has stdout/stderr + resource graphs) |
| Phase-by-phase sandbox timing (create / setup / eval / teardown) | sandbox_timing field in *.json, also printed in cultivar report |

Quick debugging recipes:

# Read one run end-to-end (replaces jq/less incantations)
cultivar show latest -r claude -t my-task

# Just the grader's verdict + remediation suggestions on a failure
cultivar show latest -t my-task --grader

# Pipe-friendly conversation transcript (ASCII fallback when not a TTY)
cultivar show latest -t my-task --conversation-only > convo.txt

# Full summary table for the latest run (no regrading)
cultivar report

# Regrade after editing criteria or adding calibration examples
cultivar grade --report

# Drop down to raw artifacts when needed
jq . results/<run>/claude/my-task__with-skill.jsonl | less
ls results/<run>/claude/my-task__with-skill.workdir/

# Resume a Claude session interactively
claude --resume $(jq -r .session_id results/<run>/claude/my-task__with-skill.json)

Handing off to a coworker

Want to use cultivar with a team without making everyone set up their own Modal workspace?

Easiest path (assumes you have a Modal workspace set up):

  1. Invite them to the Modal workspace (Modal dashboard → Settings → Members). They inherit the eval-sandbox-secrets secret group, so they don't need to set up their own Anthropic/Pinecone/etc. keys for remote runs.
  2. On their machine:
    uv tool install cultivar
    modal token new                     # personal Modal token
    modal profile activate <workspace>  # if they belong to multiple workspaces
    echo "ANTHROPIC_API_KEY=sk-ant-..." > .env   # auto-loaded from cwd
    cultivar init my-skill            # scaffolds tasks/my-skill.yaml
    cultivar run --skill my-skill --runner claude --remote --grade
    
  3. Billing accrues to your Modal account regardless of who runs what — set an expected budget if needed.

The only key your coworkers personally need is ANTHROPIC_API_KEY (the grader runs locally). For local (non-remote) agent runs they also need whatever the relevant agent CLI requires (Claude OAuths via claude on first run; Copilot needs COPILOT_GITHUB_TOKEN with the "Copilot Requests" fine-grained PAT scope; Gemini needs GEMINI_API_KEY).

Supported Agents

We're always interested in adding more agents. If you have one that's not here, please let us know by opening an Issue!

| Runner | CLI | Headless flag | How without-skill is isolated | Per-runner doc |
| --- | --- | --- | --- | --- |
| Claude | claude | -p | --allowedTools trimmed; no "Use the /<skill>" prefix in the prompt | docs/runners/claude.md |
| Copilot | copilot | -p --autopilot --yolo | --no-custom-instructions --excluded-tools skill | docs/runners/copilot.md |
| Gemini (soon to be deprecated) | gemini | -p --approval-mode=yolo | temp-dir isolation (no flag) | docs/runners/gemini.md |

Each runner advertises three variants:

  • with-skill: skill loaded, agent invoked via /<skill-name>
  • without-skill: same agent, no skill loaded and no "Use the /<skill>" prefix in the prompt
  • with-docs: same as without-skill, but the task's context_refs files are prepended to the prompt as raw reference material. Only runs for tasks that declare context_refs.

Two deltas to read:

| Comparison | Question it answers |
| --- | --- |
| with-skill vs without-skill | Is the skill doing anything at all? |
| with-skill vs with-docs | Is my distilled skill better than just dumping the docs into the prompt? |

With --remote, each (task, variant, repeat) runs in its own Modal sandbox: three variants on one task means three sandboxes, run concurrently up to --parallel N (default 5). This gives an apples-to-apples baseline: the image is the same, and only the prompt and skill mounting differ. See docs/concepts.md for the full discussion and docs/task-yaml.md for how to add context_refs to a task.
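To estimate how many sandboxes a remote run will start, multiply tasks by variants by repeats. The numbers below are made up for illustration, and the count of 3 variants assumes every task declares context_refs (otherwise with-docs is skipped for that task). The "waves" figure is a rough lower bound, since the actual scheduling behavior is not described here.

```shell
tasks=4       # hypothetical number of tasks in tasks/my-skill.yaml
variants=3    # with-skill, without-skill, with-docs
repeats=3     # --repeat 3
parallel=5    # --parallel 5 (the default)

total=$((tasks * variants * repeats))
# Ceiling division: minimum number of batches if at most $parallel run at once.
waves=$(( (total + parallel - 1) / parallel ))

echo "$total sandboxes in total, at most $parallel at once, at least $waves waves"
# prints: 36 sandboxes in total, at most 5 at once, at least 8 waves
```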

Docs

For any subcommand: cultivar <cmd> --help.
