ResearchArena

agent
Security Audit
Warning
Health Warning
  • License — MIT license
  • No description — Repository has no description
  • Active repo — Last push 0 days ago
  • Community trust — 32 GitHub stars
Code Passed
  • Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions Passed
  • Permissions — No dangerous permissions requested
Purpose
This project is a benchmarking harness and analysis framework for evaluating different frontier AI CLI agents (like Claude Code and Codex) on their ability to conduct autonomous scientific research. It provides the pipeline to automatically generate research ideas, write and execute experiment code, draft papers, and run agentic peer reviews across various computer science domains.

Security Assessment
Overall risk is rated as Low. The automated code scan found no dangerous patterns, hardcoded secrets, or requests for risky permissions. However, developers should note that by design, the tool's experiment stage instructs AI agents to write and execute their own code (Python/Shell). While the repository itself does not contain malicious payloads, allowing an autonomous agent to run dynamically generated scripts always carries an inherent system security risk depending on the local execution environment.
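
One practical mitigation is to add an extra layer of container restrictions around the experiment stage. The sketch below is not part of the repository; the flags, paths, and entry script are assumptions, shown only to illustrate the kind of sandboxing worth considering:

import subprocess
from pathlib import Path

# Hypothetical experiment directory; adjust to the trial you want to isolate.
exp_dir = Path("papers/claude/computer_vision_trial1/exp")

# Run agent-generated code with no network access, a read-only root
# filesystem, and all Linux capabilities dropped; only the experiment
# directory is mounted read-write.
subprocess.run([
    "docker", "run", "--rm",
    "--network", "none",
    "--read-only",
    "--cap-drop", "ALL",
    "-v", f"{exp_dir.resolve()}:/workspace",
    "-w", "/workspace",
    "researcharena/agent-cpu:latest",
    "bash", "run.sh",   # hypothetical entry script inside exp/
], check=True)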

Quality Assessment
The project is actively maintained (last updated today) and carries a permissive MIT license, making it highly accessible. It has garnered a solid baseline of community trust with 32 GitHub stars. The only notable flaw is that the repository's description field was left blank, which slightly hinders immediate discoverability, though the README itself is exceptionally detailed and well-structured.

Verdict
Safe to use, provided you closely monitor the sandbox environment when allowing the AI agents to execute dynamically generated experiment code.
README.md

How Far Are We From True Auto Research?


An in-depth analysis of frontier CLI agents — Claude Code (Opus 4.6), Codex (GPT-5.4), Kimi Code (K2.5) — conducting end-to-end research across diverse fields and compute resources.

  • 117 agent-generated papers — 39 per agent (Claude Code, Codex, Kimi Code), 3 trials × 13 seeds, spanning both GPU and CPU domains
  • 351 code-aware peer reviews (3 CLI-agent reviewers per paper) + 117 Stanford Agentic Reviewer scores
  • Human analysis of every paper, its artifacts, and agentic reviews

➡️ Read the full write-up for scores, per-domain breakdowns, case studies, and the human-inspection findings.

What this does

Given a seed field (e.g., "computer vision", "compiler optimization"), each CLI agent follows a standardized pipeline:

  1. Ideation — Generate a research idea and experiment plan; self-review for up to 3 iterations.
  2. Experiments — Write and execute code, collect results; self-review for up to 3 iterations.
  3. Paper Writing — Produce a paper; self-review for up to 3 iterations.
  4. Review — Evaluate via Stanford Agentic Reviewer and triple peer review (all three agents review each paper alongside its code).
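
A minimal sketch of the control flow this implies (function names are illustrative, not the repository's actual API; the self-review thresholds follow the defaults documented in the config table below):

# Hypothetical sketch of the stage loop described above.
MAX_SELF_REVIEW_ITERS = 3

def run_stage(produce, self_review, threshold):
    """Produce an artifact, then self-review and revise up to 3 times."""
    artifact = produce()
    for _ in range(MAX_SELF_REVIEW_ITERS):
        score, feedback = self_review(artifact)
        if score >= threshold:
            break
        artifact = produce(feedback=feedback)
    return artifact

def run_pipeline(agent, seed):
    # Default thresholds per the config table: idea 8, experiment 6, paper 8.
    idea = run_stage(lambda **kw: agent.ideate(seed, **kw), agent.review_idea, threshold=8)
    results = run_stage(lambda **kw: agent.run_experiments(idea, **kw), agent.review_experiment, threshold=6)
    paper = run_stage(lambda **kw: agent.write_paper(idea, results, **kw), agent.review_paper, threshold=8)
    return paper  # then handed to the review stage (3 peer reviewers + SAR)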

Conferences & areas

Seeds span multiple CS conferences and two compute platforms. Hardware: 1× RTX A6000 (48GB), 4 CPUs, 60GB RAM (main experiments); H100 (80GB) re-run for GPU seeds.

Platform | Seeds | Target conferences
GPU (8) | AI for Biology, Computer Vision, Datasets & Benchmarks, Generative Models, Interpretability, NLP, Privacy in ML, Supervised Representation Learning | ICLR, NeurIPS, ICML, CVPR, ACL, EMNLP
CPU (5) | Causal Learning, Compiler Optimization, Data Integration & Cleaning, Operating System Design, Probabilistic Methods | OSDI, SOSP, SIGMOD, VLDB, PLDI, POPL

Repo structure

ResearchArena/
├── papers/                     # 117 agent-generated papers
│   └── {claude,codex,kimi}/{seed}_trial{N}/
│       ├── paper.pdf, paper.tex, references.bib
│       ├── idea.json, plan.json, proposal.md
│       ├── reviews.json              # 3 peer reviews
│       ├── stanford_review.json      # SAR review
│       └── exp/                      # experiment code (.py/.sh)
├── researcharena/              # the benchmark harness
│   ├── stages/                 # ideation / experiment / paper / review
│   ├── templates/              # domain guidelines (ml/systems/databases/pl/theory/security)
│   └── utils/                  # agent_runner, tracker, checkpoint, …
└── Dockerfile[.cpu]            # agent containers
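
The layout makes it easy to enumerate artifacts programmatically. A minimal sketch (the directory names follow the tree above; the glob pattern and printed fields are assumptions):

from pathlib import Path

# Walk every agent/trial directory and report which artifacts are present.
for trial_dir in sorted(Path("papers").glob("*/*_trial*")):
    agent = trial_dir.parent.name            # claude / codex / kimi
    artifacts = {p.name for p in trial_dir.iterdir()}
    has_paper = "paper.pdf" in artifacts
    has_reviews = "reviews.json" in artifacts
    print(f"{agent:6s} {trial_dir.name:40s} paper={has_paper} reviews={has_reviews}")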

Setup

pip install -e .

# Containers (Docker or Podman)
# GPU image: PyTorch 2.6 + CUDA 12.4 + transformers/datasets/accelerate/…
docker build -t researcharena/agent:latest .
# CPU image: Python 3.11 + scipy/sklearn/networkx/sympy/z3-solver/…
docker build -f Dockerfile.cpu -t researcharena/agent-cpu:latest .

# For rootless podman, add --userns=host:
#   podman build --userns=host -t researcharena/agent:latest .
#   podman build --userns=host -f Dockerfile.cpu -t researcharena/agent-cpu:latest .

# Install the CLI agents (claude, codex, kimi) on the host — they are NOT
# baked into the image. agent_runner.py mounts each binary + its auth
# (~/.claude, ~/.codex, ~/.kimi) into the container at runtime, so log in
# once on the host with `claude login` / `codex login` / `kimi login` and
# you're done.

# Optional: API keys / tokens (forwarded into the container if set)
export ANTHROPIC_API_KEY=sk-ant-...      # if not using `claude login`
export OPENAI_API_KEY=sk-...             # if not using `codex login`
export MOONSHOT_API_KEY=sk-...           # if not using `kimi login`
export HF_TOKEN=hf_...                   # needed for gated HuggingFace models
export WANDB_API_KEY=...                 # optional, for experiment logging

Usage

researcharena run --seed "computer vision" --agent claude --platform gpu

That's it — the pipeline handles ideation, experiments, paper writing, and review end-to-end. Swap --agent for codex or kimi, and --platform for cpu to pick a different configuration.

Everything is configurable — swap agents, change self-review intensity, give the agent more ideas to try, or raise the acceptance bar. The main knobs in configs/*.yaml:

Knob | What it does
agent.type / agent.model | Which CLI agent runs the research (claude / codex / kimi / minimax) and which model it uses.
agent.max_turns, ideation_timeout, paper_timeout | Per-stage turn and wall-clock budgets for the researcher.
self_review.max_retries_per_gate | How many times each gate (idea / experiment / paper) can send itself back for revision.
self_review.thresholds.{idea,experiment,paper} | The score each self-review must clear to pass (default: idea 8, experiment 6, paper 8).
experiment.max_experiment_retries_per_idea | How many times the agent can retry failed experiments before abandoning an idea.
pipeline.max_ideas_per_seed | How many fresh research ideas to try on one seed before giving up.
review.agents | Which CLI agents act as peer reviewers, with optional per-reviewer model/timeout.
review.accept_threshold | Cutoff for accept vs. revise vs. reject after peer review.

See configs/8xa6000.yaml for a full annotated example.
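
As a rough illustration, the dotted knob names above suggest a nested YAML layout that could be read like this (a sketch only; the annotated config file is authoritative, and the exact nesting is an assumption):

import yaml  # PyYAML

# Load a run configuration and read a few of the knobs from the table above.
with open("configs/8xa6000.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["agent"]["type"], cfg["agent"]["model"])     # which CLI agent / model runs the research
print(cfg["self_review"]["thresholds"]["experiment"])  # e.g. 6 by default
print(cfg["pipeline"]["max_ideas_per_seed"])           # fresh ideas to try per seed
print(cfg["review"]["accept_threshold"])               # accept / revise / reject cutoff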

Review pipeline

We employ a peer-review protocol in which all three CLI agents review every paper alongside its code, logs, and results.json, enabling systematic checks for fabricated or unsupported results. We additionally use the Stanford Agentic Reviewer (SAR) for external, PDF-only validation.
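
A minimal sketch of how these two signals could be combined for a single paper (file names follow the repo structure above; the JSON field names and the cutoff value are assumptions):

import json
from pathlib import Path
from statistics import mean

trial = Path("papers/claude/computer_vision_trial1")  # hypothetical trial directory

peer = json.loads((trial / "reviews.json").read_text())           # the 3 CLI-agent reviews
sar = json.loads((trial / "stanford_review.json").read_text())    # the PDF-only SAR review

accept_threshold = 6.0                             # illustrative; the real cutoff is review.accept_threshold
peer_scores = [r["overall_score"] for r in peer]   # field name is an assumption
decision = "accept" if mean(peer_scores) >= accept_threshold else "revise / reject"
print(f"peer mean = {mean(peer_scores):.1f}, SAR = {sar.get('score')}, decision: {decision}")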

Citation

If you find our project useful, please cite:

@misc{researcharena2026,
  title   = {How Far Are We From True Auto Research?},
  author  = {Zhang, Zhengxin and Wang, Ning and Galhotra, Sainyam and Cardie, Claire},
  year    = {2026},
  note    = {Cornell University. \url{https://youarespecialtome.github.io/ResearchArena/}}
}

License

Released under the MIT License — see LICENSE.
