ResearchArena

agent
Security Audit
Warning
Health Warning
  • License — MIT license
  • No description — Repository has no description
  • Active repo — Last push 0 days ago
  • Community trust — 32 GitHub stars
Code Passed
  • Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions Passed
  • Permissions — No dangerous permissions requested
Purpose
This project is a benchmarking harness and analysis framework for evaluating different frontier AI CLI agents (like Claude Code and Codex) on their ability to conduct autonomous scientific research. It provides the pipeline to automatically generate research ideas, write and execute experiment code, draft papers, and run agentic peer reviews across various computer science domains.

Security Assessment
Overall risk is rated as Low. The automated code scan found no dangerous patterns, hardcoded secrets, or requests for risky permissions. However, developers should note that by design, the tool's experiment stage instructs AI agents to write and execute their own code (Python/Shell). While the repository itself does not contain malicious payloads, allowing an autonomous agent to run dynamically generated scripts always carries an inherent system security risk depending on the local execution environment.
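
One practical mitigation is to add an extra layer of container restrictions around the experiment stage. The sketch below is not part of the repository; the flags, paths, and entry script are assumptions, shown only to illustrate the kind of sandboxing worth considering:

import subprocess
from pathlib import Path

# Hypothetical experiment directory; adjust to the trial you want to isolate.
exp_dir = Path("papers/claude/computer_vision_trial1/exp")

# Run agent-generated code with no network access, a read-only root
# filesystem, and all Linux capabilities dropped; only the experiment
# directory is mounted read-write.
subprocess.run([
    "docker", "run", "--rm",
    "--network", "none",
    "--read-only",
    "--cap-drop", "ALL",
    "-v", f"{exp_dir.resolve()}:/workspace",
    "-w", "/workspace",
    "researcharena/agent-cpu:latest",
    "bash", "run.sh",   # hypothetical entry script inside exp/
], check=True)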

Quality Assessment
The project is actively maintained (last updated today) and carries a permissive MIT license, making it highly accessible. It has garnered a solid baseline of community trust with 32 GitHub stars. The only notable flaw is that the repository's description field was left blank, which slightly hinders immediate discoverability, though the README itself is exceptionally detailed and well-structured.

Verdict
Safe to use, provided you closely monitor the sandbox environment when allowing the AI agents to execute dynamically generated experiment code.
README.md

How Far Are We From True Auto Research?


An in-depth analysis of frontier CLI agents — Claude Code (Opus 4.6), Codex (GPT-5.4), Kimi Code (K2.5) — conducting end-to-end research across diverse fields and compute resources.

  • 117 agent-generated papers — 39 per agent (Claude Code, Codex, Kimi Code), 3 trials × 13 seeds, spanning both GPU and CPU domains
  • 351 code-aware peer reviews (3 CLI-agent reviewers per paper) + 117 Stanford Agentic Reviewer scores
  • Human analysis of every paper, its artifacts, and agentic reviews

➡️ Read the full write-up for scores, per-domain breakdowns, case studies, and the human-inspection findings.

What this does

Given a seed field (e.g., "computer vision", "compiler optimization"), each CLI agent follows a standardized pipeline:

  1. Ideation — Generate a research idea and experiment plan; self-review for up to 3 iterations.
  2. Experiments — Write and execute code, collect results; self-review for up to 3 iterations.
  3. Paper Writing — Produce a paper; self-review for up to 3 iterations.
  4. Review — Evaluate via Stanford Agentic Reviewer and triple peer review (all three agents review each paper alongside its code).
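
A minimal sketch of the control flow this implies (function names are illustrative, not the repository's actual API; the self-review thresholds follow the defaults documented in the config table below):

# Hypothetical sketch of the stage loop described above.
MAX_SELF_REVIEW_ITERS = 3

def run_stage(produce, self_review, threshold):
    """Produce an artifact, then self-review and revise up to 3 times."""
    artifact = produce()
    for _ in range(MAX_SELF_REVIEW_ITERS):
        score, feedback = self_review(artifact)
        if score >= threshold:
            break
        artifact = produce(feedback=feedback)
    return artifact

def run_pipeline(agent, seed):
    # Default thresholds per the config table: idea 8, experiment 6, paper 8.
    idea = run_stage(lambda **kw: agent.ideate(seed, **kw), agent.review_idea, threshold=8)
    results = run_stage(lambda **kw: agent.run_experiments(idea, **kw), agent.review_experiment, threshold=6)
    paper = run_stage(lambda **kw: agent.write_paper(idea, results, **kw), agent.review_paper, threshold=8)
    return paper  # then handed to the review stage (3 peer reviewers + SAR)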

Conferences & areas

Seeds span multiple CS conferences and two compute platforms. Hardware: 1× RTX A6000 (48GB), 4 CPUs, 60GB RAM (main experiments); H100 (80GB) re-run for GPU seeds.

Platform | Seeds | Target conferences
GPU (8) | AI for Biology, Computer Vision, Datasets & Benchmarks, Generative Models, Interpretability, NLP, Privacy in ML, Supervised Representation Learning | ICLR, NeurIPS, ICML, CVPR, ACL, EMNLP
CPU (5) | Causal Learning, Compiler Optimization, Data Integration & Cleaning, Operating System Design, Probabilistic Methods | OSDI, SOSP, SIGMOD, VLDB, PLDI, POPL

Repo structure

ResearchArena/
├── papers/                     # 117 agent-generated papers
│   └── {claude,codex,kimi}/{seed}_trial{N}/
│       ├── paper.pdf, paper.tex, references.bib
│       ├── idea.json, plan.json, proposal.md
│       ├── reviews.json              # 3 peer reviews
│       ├── stanford_review.json      # SAR review
│       └── exp/                      # experiment code (.py/.sh)
├── researcharena/              # the benchmark harness
│   ├── stages/                 # ideation / experiment / paper / review
│   ├── templates/              # domain guidelines (ml/systems/databases/pl/theory/security)
│   └── utils/                  # agent_runner, tracker, checkpoint, …
└── Dockerfile[.cpu]            # agent containers
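
The layout makes it easy to enumerate artifacts programmatically. A minimal sketch (the directory names follow the tree above; the glob pattern and printed fields are assumptions):

from pathlib import Path

# Walk every agent/trial directory and report which artifacts are present.
for trial_dir in sorted(Path("papers").glob("*/*_trial*")):
    agent = trial_dir.parent.name            # claude / codex / kimi
    artifacts = {p.name for p in trial_dir.iterdir()}
    has_paper = "paper.pdf" in artifacts
    has_reviews = "reviews.json" in artifacts
    print(f"{agent:6s} {trial_dir.name:40s} paper={has_paper} reviews={has_reviews}")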

Setup

pip install -e .

# Containers (Docker or Podman)
# GPU image: PyTorch 2.6 + CUDA 12.4 + transformers/datasets/accelerate/…
docker build -t researcharena/agent:latest .
# CPU image: Python 3.11 + scipy/sklearn/networkx/sympy/z3-solver/…
docker build -f Dockerfile.cpu -t researcharena/agent-cpu:latest .

# For rootless podman, add --userns=host:
#   podman build --userns=host -t researcharena/agent:latest .
#   podman build --userns=host -f Dockerfile.cpu -t researcharena/agent-cpu:latest .

# Install the CLI agents (claude, codex, kimi) on the host — they are NOT
# baked into the image. agent_runner.py mounts each binary + its auth
# (~/.claude, ~/.codex, ~/.kimi) into the container at runtime, so log in
# once on the host with `claude login` / `codex login` / `kimi login` and
# you're done.

# Optional: API keys / tokens (forwarded into the container if set)
export ANTHROPIC_API_KEY=sk-ant-...      # if not using `claude login`
export OPENAI_API_KEY=sk-...             # if not using `codex login`
export MOONSHOT_API_KEY=sk-...           # if not using `kimi login`
export HF_TOKEN=hf_...                   # needed for gated HuggingFace models
export WANDB_API_KEY=...                 # optional, for experiment logging

Usage

researcharena run --seed "computer vision" --agent claude --platform gpu

That's it — the pipeline handles ideation, experiments, paper writing, and review end-to-end. Swap --agent for codex or kimi, and --platform for cpu to pick a different configuration.

Everything is configurable — swap agents, change self-review intensity, give the agent more ideas to try, or raise the acceptance bar. The main knobs in configs/*.yaml:

Knob | What it does
agent.type / agent.model | Which CLI agent runs the research (claude / codex / kimi / minimax) and which model it uses.
agent.max_turns, ideation_timeout, paper_timeout | Per-stage turn and wall-clock budgets for the researcher.
self_review.max_retries_per_gate | How many times each gate (idea / experiment / paper) can send itself back for revision.
self_review.thresholds.{idea,experiment,paper} | The score each self-review must clear to pass (default: idea 8, experiment 6, paper 8).
experiment.max_experiment_retries_per_idea | How many times the agent can retry failed experiments before abandoning an idea.
pipeline.max_ideas_per_seed | How many fresh research ideas to try on one seed before giving up.
review.agents | Which CLI agents act as peer reviewers, with optional per-reviewer model/timeout.
review.accept_threshold | Cutoff for accept vs. revise vs. reject after peer review.

See configs/8xa6000.yaml for a full annotated example.
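
As a rough illustration, the dotted knob names above suggest a nested YAML layout that could be read like this (a sketch only; the annotated config file is authoritative, and the exact nesting is an assumption):

import yaml  # PyYAML

# Load a run configuration and read a few of the knobs from the table above.
with open("configs/8xa6000.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["agent"]["type"], cfg["agent"]["model"])     # which CLI agent / model runs the research
print(cfg["self_review"]["thresholds"]["experiment"])  # e.g. 6 by default
print(cfg["pipeline"]["max_ideas_per_seed"])           # fresh ideas to try per seed
print(cfg["review"]["accept_threshold"])               # accept / revise / reject cutoff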

Review pipeline

We employ a peer-review protocol in which all three CLI agents review every paper alongside its code, logs, and results.json, enabling systematic checks for fabricated or unsupported results. We additionally use the Stanford Agentic Reviewer (SAR) for external, PDF-only validation.
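
A minimal sketch of how these two signals could be combined for a single paper (file names follow the repo structure above; the JSON field names and the cutoff value are assumptions):

import json
from pathlib import Path
from statistics import mean

trial = Path("papers/claude/computer_vision_trial1")  # hypothetical trial directory

peer = json.loads((trial / "reviews.json").read_text())           # the 3 CLI-agent reviews
sar = json.loads((trial / "stanford_review.json").read_text())    # the PDF-only SAR review

accept_threshold = 6.0                             # illustrative; the real cutoff is review.accept_threshold
peer_scores = [r["overall_score"] for r in peer]   # field name is an assumption
decision = "accept" if mean(peer_scores) >= accept_threshold else "revise / reject"
print(f"peer mean = {mean(peer_scores):.1f}, SAR = {sar.get('score')}, decision: {decision}")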

Citation

If you find our project useful, please cite:

@misc{researcharena2026,
  title   = {How Far Are We From True Auto Research?},
  author  = {Zhang, Zhengxin and Wang, Ning and Galhotra, Sainyam and Cardie, Claire},
  year    = {2026},
  note    = {Cornell University. \url{https://youarespecialtome.github.io/ResearchArena/}}
}

License

Released under the MIT License — see LICENSE.
