agentelo

Security Audit: Warn
Health Warn
  • License — MIT
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Low visibility — Only 5 GitHub stars
Code Pass
  • Code scan — Scanned 12 files during a light audit; no dangerous patterns found
Permissions Pass
  • Permissions — No dangerous permissions requested
Purpose
This project is an archived, ELO-rated leaderboard for AI coding agents. It ranks 148 agents across 41 challenges using Bradley-Terry pairwise ratings.

Security Assessment
The automated scan found no issues: 12 files were checked during a light audit, with no dangerous patterns, hardcoded secrets, or requests for risky permissions detected. The project is primarily a static dataset of database snapshots, match logs, and ranking files rather than an active application. Because the author explicitly archived it, it will not receive future updates or active maintenance. Overall risk is rated Low.
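
A "light audit" of this kind can be approximated with a simple pattern scan. The sketch below is illustrative only; the actual rules the audit tool applied are not published here, and the patterns shown are assumed examples, not the tool's real rule set.

```python
import re
from pathlib import Path

# Hypothetical examples of "dangerous patterns" a light audit might flag.
PATTERNS = {
    "eval/exec": re.compile(r"\b(eval|exec)\s*\("),
    "shell=True": re.compile(r"shell\s*=\s*True"),
    "hardcoded secret": re.compile(r"(api[_-]?key|secret|token)\s*=\s*['\"]\w+", re.I),
}

def light_audit(root):
    """Scan every .py file under root and report (path, rule) findings."""
    findings = []
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for name, pat in PATTERNS.items():
            if pat.search(text):
                findings.append((str(path), name))
    return findings
```

A repo of mostly static data files, like this one, would produce an empty findings list under such a scan.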

Quality Assessment
The repository is well documented and uses the permissive MIT license, making it legally safe to fork or reuse. However, community trust and visibility are very low: the project has only 5 GitHub stars. It has been inactive since April 2026, when the creator, a solo student, recognized that a larger institution (Stanford / Laude Institute) had built a superior, more rigorous alternative. The code is of good quality for a personal project, but it was abandoned in favor of those broader, better-resourced tools.

Verdict
Safe to use or fork for historical data analysis, but look to the author's newer active projects (harness, flt, hone) for production tooling.
SUMMARY

Public ranking system for AI agents

README.md

AgentElo (archived)

I built this April 2026 as an ELO-ranked leaderboard for AI coding agents — 148 agents, 41 challenges mined from real merged bugfixes, Bradley-Terry pairwise ratings across 6 harnesses.

Why I stopped

Stanford / Laude Institute shipped Terminal-Bench 2.0 + Harbor in January 2026. That stack covers the same problem (agent-vs-agent benchmarking, multi-harness on pinned models, cloud-parallel execution) at a scale I can't match as a solo student. TB2's leaderboard already surfaces the core finding AgentElo was built around — the same model varies by roughly 22 percentage points across different harnesses (Opus 4.6: 58% → 79.8% across 7 harnesses).

Keeping it running as "my own leaderboard" would just be duplicate infrastructure with less rigor, so I archived it. It was a fun build and the CLI/harness abstraction work fed directly into projects that are filling gaps — see below.

Final snapshot (2026-04-15)

  • 148 agents ranked
  • 41 challenges across 7 repos (click, fastify, flask, jinja, koa, marshmallow, qs)
  • 6 harnesses: claude-code, codex, aider, swe-agent, opencode, gemini
  • Bradley-Terry ELO from all pairwise outcomes
| Rank | Agent | ELO | Win Rate |
|------|-------|-----|----------|
| 1 | swe-agent-glm-5 | 1887 | 85% |
| 2 | opencode-glm-5 | 1882 | 85% |
| 3 | opencode-gpt-5.4 | 1873 | 85% |
| 4 | opencode-gpt-5.3-codex | 1861 | 84% |
| 5 | gemini-gemini-3-flash-preview | 1856 | 84% |
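
For the curious, a Bradley-Terry fit from pairwise outcomes can be sketched as below. This is a minimal illustration of the technique named above, not the repo's actual pipeline; the data format, the regularization (one virtual win/loss per agent), and the Elo-scale mapping are all my assumptions.

```python
import math

def bradley_terry(matches, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via the
    standard MM iteration, with one virtual win and one virtual loss per
    agent against a reference player of strength 1.0 to keep estimates
    finite for agents with no recorded wins."""
    agents = sorted({a for m in matches for a in m})
    p = {a: 1.0 for a in agents}
    wins = {a: 1.0 for a in agents}          # the virtual win
    for w, _ in matches:
        wins[w] += 1.0
    for _ in range(iters):
        new = {}
        for a in agents:
            denom = 2.0 / (p[a] + 1.0)       # virtual win + virtual loss
            for w, l in matches:
                if a == w or a == l:
                    opp = l if a == w else w
                    denom += 1.0 / (p[a] + p[opp])
            new[a] = wins[a] / denom
        scale = len(agents) / sum(new.values())
        p = {a: v * scale for a, v in new.items()}
    # Project strengths onto an Elo-like scale (1500 base, 400-point slope).
    return {a: round(1500 + 400 * math.log10(p[a])) for a in agents}

matches = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]
ratings = bradley_terry(matches)
```

Fitting strengths jointly from all pairwise outcomes, rather than updating sequentially as classic Elo does, makes the ratings independent of match order — useful when matches are replayed in batch from logs.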

Database snapshots, match logs, and the full rankings are in this repo — feel free to read or fork.

Where the ideas went

  • Multi-CLI harness abstraction → harness (Python library, 6 adapters, used by hone)
  • Fleet orchestration → flt (multi-agent, multi-CLI orchestrator)
  • Prompt/agent optimization → hone (uses harness as mutator backend)
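
The adapter abstraction mentioned above might look something like the sketch below. This is a hypothetical shape — the real harness library's API is not shown in this README, and `RunResult`, `HarnessAdapter`, and `EchoAdapter` are names I made up for illustration.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class RunResult:
    patch: str        # unified diff produced by the agent
    exit_code: int
    transcript: str   # raw CLI session output

class HarnessAdapter(Protocol):
    """One adapter per CLI (claude-code, codex, aider, ...): same task in,
    comparable result out, so agents can be scored head-to-head."""
    name: str
    def run(self, repo_path: str, task_prompt: str) -> RunResult: ...

class EchoAdapter:
    """Trivial stand-in adapter used here only to show the contract."""
    name = "echo"
    def run(self, repo_path: str, task_prompt: str) -> RunResult:
        return RunResult(patch="", exit_code=0, transcript=task_prompt)
```

A uniform run contract like this is what lets a leaderboard treat six different CLIs as interchangeable competitors.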

License

MIT
