best-of-Agent-Harnesses

Security Audit: Fail
Health: Warn
  • License: CC-BY-SA-4.0
  • Description: Repository has a description
  • Active repo: Last push 0 days ago
  • Low visibility: Only 9 GitHub stars
Code: Fail
  • rm -rf: Recursive force deletion command in .github/workflows/setup-best-of-list.yml
Permissions: Pass
  • Permissions: No dangerous permissions requested
Purpose
This project is a curated, ranked list of over 100 AI agent harnesses, orchestration frameworks, and engineering techniques. It serves as an informational directory rather than a functional software library or executable MCP server.

Security Assessment
Overall Risk: Low. The repository does not request dangerous permissions, access sensitive data, or contain hardcoded secrets. The only flagged issue is an `rm -rf` command located inside a GitHub Actions workflow file (`.github/workflows/setup-best-of-list.yml`). Because this command is strictly used for automated repository maintenance and updating the list, it does not pose a threat to developers who are simply reading or cloning the directory. No malicious execution or network exploits were found.

Quality Assessment
The repository is actively maintained, with the most recent push occurring today. It uses a standard Creative Commons license (CC-BY-SA-4.0), which is appropriate for curated documentation and lists. However, community visibility and trust are currently minimal. The project has only 9 GitHub stars, which suggests it is either very new or not yet widely adopted by the broader developer community.

Verdict
Safe to use.
SUMMARY

🏆 Ranked list of 100+ agent harnesses, orchestration frameworks, and harness engineering techniques. Scored and updated weekly.

README.md

Best of Agent Harnesses and Harness Techniques

🏆  Curated list of AI agent harnesses, orchestration frameworks, and harness techniques for reliable agentic systems.

What is an agent harness?

An agent harness is the runtime that closes the loop between a stateless model and the outside world. It manages perception, action, memory, and constraint enforcement, which makes it the de facto operating system of machine agency and, consequently, the layer where nearly all meaningful questions about AI autonomy, reliability, and control are actually resolved.

Every prior wave of automation was constrained by brittleness: you scripted exact behavior, and when the world deviated, the system broke. Foundation models inverted that problem: they are flexible but directionless, stateless, and disconnected from anything real. The agent harness exists to bridge that gap. It is the orchestration infrastructure that converts a model's per-turn reasoning into sustained, tool-using, error-recovering, goal-directed behavior across time. Architecturally, it plays the role the kernel played in operating systems or the controller played in industrial robotics, mediating between raw capability and a messy environment. The critical difference is that the "capability" it governs is general-purpose cognition, so the harness is simultaneously a scheduler, a permission system, a memory manager, and a policy enforcement layer, all of them under-specified and still evolving. The term itself barely exists in formal literature yet, which should concern anyone who cares about AI governance, because the harness is where abstract alignment goals either get operationalized into concrete constraints or quietly don't.
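To make the "scheduler, permission system, memory manager, policy enforcement" framing concrete, here is a minimal sketch of a harness loop. Everything in it (the Action and Harness types, the scripted stand-in for a model, the add tool) is a hypothetical illustration and is not taken from any project in this list; a real harness would call an actual model and handle far more cases.

```python
from dataclasses import dataclass, field
from typing import Callable


# Hypothetical types for illustration only.
@dataclass
class Action:
    tool: str      # name of the tool to call, or "finish"
    argument: str  # a single string argument, kept simple on purpose


@dataclass
class Harness:
    model: Callable[[list[str]], Action]    # stateless: sees only the transcript
    tools: dict[str, Callable[[str], str]]  # permission system: an allow-list
    max_steps: int = 5                      # scheduler: a hard step budget
    memory: list[str] = field(default_factory=list)  # memory manager

    def run(self, goal: str) -> str:
        self.memory.append(f"goal: {goal}")
        for _ in range(self.max_steps):
            action = self.model(self.memory)
            if action.tool == "finish":
                return action.argument
            if action.tool not in self.tools:
                # Policy enforcement: refuse, and record why so the model can adapt.
                self.memory.append(f"error: tool {action.tool!r} is not allowed")
                continue
            try:
                observation = self.tools[action.tool](action.argument)
            except Exception as exc:  # error recovery: feed the failure back
                observation = f"error: {exc}"
            self.memory.append(f"{action.tool}({action.argument}) -> {observation}")
        return "stopped: step budget exhausted"


def scripted_model(transcript: list[str]) -> Action:
    """Stand-in for an LLM call; it decides from the transcript alone."""
    if any("-> 4" in line for line in transcript):
        return Action("finish", "2 + 2 = 4")
    if any(line.startswith("error: tool") for line in transcript):
        return Action("add", "2 2")
    return Action("shell", "rm -rf /")  # first attempt uses a disallowed tool


def add(argument: str) -> str:
    return str(sum(int(part) for part in argument.split()))


harness = Harness(model=scripted_model, tools={"add": add})
result = harness.run("add two and two")
print(result)
```

The model function never holds state between calls; all continuity lives in the harness transcript, which is the point the paragraph above makes about stateless models.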

Why harnesses matter

Better models make harnesses more important, not less: more capabilities mean more failure modes, and production still needs retry logic, fallbacks, and validation. Harness quality, not just model quality, determines whether agents actually ship. This list ranks projects by relevance to harness concerns (environment, orchestration, lifecycle, guardrails) and by stars and activity.
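As a rough illustration of the retry logic, fallbacks, and validation mentioned above, the sketch below tries providers in order, retries each a fixed number of times, and accepts only output that passes a validator. The function names and the fake providers are assumptions made for this example; real code would wrap actual model clients and usually add backoff between attempts.

```python
import json
from typing import Callable


def call_with_guardrails(
    prompt: str,
    providers: list[tuple[str, Callable[[str], str]]],
    validate: Callable[[str], dict],
    retries_per_provider: int = 2,
) -> tuple[str, dict]:
    """Try each provider in order; retry on errors or invalid output."""
    failures = []
    for name, provider in providers:
        for attempt in range(1, retries_per_provider + 1):
            try:
                raw = provider(prompt)
                return name, validate(raw)  # validation failure also triggers a retry
            except Exception as exc:
                failures.append(f"{name} attempt {attempt}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(failures))


def validate_json_answer(raw: str) -> dict:
    """Accept only a JSON object that contains an 'answer' field."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict) or "answer" not in data:
        raise ValueError("missing 'answer' field")
    return data


# Hypothetical providers standing in for real model clients.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def make_secondary() -> Callable[[str], str]:
    calls = {"n": 0}

    def secondary(prompt: str) -> str:
        calls["n"] += 1
        if calls["n"] == 1:
            return "Sure! The answer is 4."  # not JSON, so validation rejects it
        return '{"answer": 4}'

    return secondary


used, parsed = call_with_guardrails(
    "What is 2 + 2?",
    [("primary", flaky_primary), ("secondary", make_secondary())],
    validate_json_answer,
)
print(used, parsed)
```

Treating a validation failure the same way as a transport error keeps the loop simple; a production harness might instead feed the validation message back into the next prompt.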

Explanation

  • Simplicity ↔ capability: Where each project sits on the axis from minimal/simple (lean API, format only, thin layer) to high capability (full platform, many features, kitchen-sink).
  • OSS: ✅ = standard open-source license (MIT/Apache/BSD/GPL/MPL/AGPL/CC0). ⚠️ = source-available or restricted (e.g. n8n Fair-code, Elastic-2.0, Polyform). ❓ = no license file or unclear terms.
  • 🥇🥈🥉  Combined project-quality score
  • ⭐️  Star count from GitHub
  • 🐣  New project (less than 6 months old)
  • 💤  Inactive project (6 months no activity)
  • 💀  Dead project (12 months no activity)
  • 📈📉  Project is trending up or down
  • 👨‍💻  Contributors count from GitHub
  • 🔀  Fork count from GitHub
  • 📋  Issue count from GitHub
  • ⏱️  Last update timestamp on package manager
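The age and activity badges in the legend above amount to simple date thresholds. The sketch below shows one way they could be computed; the list's actual generator is not shown in this document, so the function name and the approximation of "6 months" as 182 days and "12 months" as 365 days are assumptions.

```python
from datetime import date, timedelta

# Assumption: months are approximated in days for this sketch.
SIX_MONTHS = timedelta(days=182)
TWELVE_MONTHS = timedelta(days=365)


def status_badges(created: date, last_activity: date, today: date) -> list[str]:
    """Return the legend badges that apply to a project."""
    badges = []
    if today - created < SIX_MONTHS:
        badges.append("🐣")  # new project
    idle = today - last_activity
    if idle >= TWELVE_MONTHS:
        badges.append("💀")  # dead: 12 months without activity
    elif idle >= SIX_MONTHS:
        badges.append("💤")  # inactive: 6 months without activity
    return badges


today = date(2026, 1, 1)
fresh = status_badges(date(2025, 10, 1), date(2025, 12, 20), today)
sleepy = status_badges(date(2023, 1, 1), date(2025, 5, 1), today)
dead = status_badges(date(2023, 1, 1), date(2024, 6, 1), today)
print(fresh, sleepy, dead)
```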

Progressive disclosure harnesses

Formats, runtimes, and patterns that reveal context, tools, or instructions in layers—index first, details on demand—to control tokens and improve agent focus (the "map, not encyclopedia" principle).
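The "index first, details on demand" idea can be sketched in a few lines. The catalog, tool names, and keyword-overlap selection below are hypothetical stand-ins (projects in this section typically use semantic retrieval or let the model request tools itself); the point is only that the context carries a short index of every tool plus full schemas for the selected ones.

```python
import json

# Hypothetical tool catalog: a one-line summary plus a full schema per tool.
TOOL_CATALOG = {
    "search_issues": {
        "summary": "search issues in a repository",
        "schema": {"query": "string", "repo": "string", "state": "open|closed|all"},
    },
    "send_email": {
        "summary": "send an email message",
        "schema": {"to": "string", "subject": "string", "body": "string"},
    },
    "resize_image": {
        "summary": "resize an image file",
        "schema": {"path": "string", "width": "integer", "height": "integer"},
    },
}


def build_index(catalog: dict) -> str:
    """Layer 1: the map. One short line per tool, no schemas."""
    return "\n".join(f"{name}: {entry['summary']}" for name, entry in sorted(catalog.items()))


def pick_tools(catalog: dict, task: str, limit: int = 1) -> list[str]:
    """Stand-in for retrieval: rank tools by word overlap with the task."""
    task_words = set(task.lower().split())
    scored = []
    for name, entry in catalog.items():
        overlap = len(task_words & set(entry["summary"].lower().split()))
        scored.append((overlap, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for overlap, name in scored[:limit] if overlap > 0]


def load_details(catalog: dict, names: list[str]) -> str:
    """Layer 2: full schemas, only for the tools that were selected."""
    return json.dumps({name: catalog[name]["schema"] for name in names}, indent=2)


task = "find open issues about login in the repository"
selected = pick_tools(TOOL_CATALOG, task)
context = build_index(TOOL_CATALOG) + "\n" + load_details(TOOL_CATALOG, selected)
print(context)
```

With three tools the saving is negligible; the approach pays off when the catalog holds hundreds or thousands of schemas, which is the regime the projects below target.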

| # | Project | Description | OSS | Simplicity ↔ capability |
| --- | --- | --- | --- | --- |
| 1 | agents.md | Open format for repo-scoped agent briefings; v1.1 adds hierarchical scope and progressive disclosure so agents get a map of what exists, then load only what's relevant. | | Simple (format only) |
| 2 | awesome-cursorrules | Curated .cursorrules and skills that leverage Cursor's index-then-load model; the canonical collection for rules-as-progressive-disclosure in the IDE. | | Simple (content bundle) |
| 3 | MCP-Zero | Active tool discovery for autonomous agents: model requests tools by requirement; hierarchical semantic routing over 308 servers / 2,797 tools with ~98% token reduction (APIBank). | | Capability (3k tools, full routing) |
| 4 | langgraph-bigtool | Build LangGraph agents with large tool sets; retrieval and on-demand tool loading so agents scale beyond context without stuffing every schema upfront. | | Capability (large tool sets) |
| 5 | spring-ai-tool-search-tool | Dynamic tool discovery for Spring AI: model gets a search tool first, then pulls definitions for relevant tools; 34–64% token reduction across providers. | | Mid (search-then-load) |
| 6 | ToolGen | ICLR 2025: unified tool retrieval and calling via generation; 47k+ tools without context stuffing, with retrieval and invocation in one generative step. | | Capability (47k+ tools) |
| 7 | ToolRAG | Semantic tool retrieval for LLMs; serves only the tools the user query demands (MCP-compatible), unlimited tool sets with zero context penalty. | | Mid (query-driven retrieval) |

Coding agent products (IDEs, CLIs, full suites)

Turnkey coding agents you install and run: IDE extensions, terminal CLIs, Dockerized workspaces. Each entry notes which part is the harness (the agent loop, tool wiring, approval model) versus the UI shell (VS Code extension, TUI, browser client).

| # | Project | Description | OSS | Simplicity ↔ capability |
| --- | --- | --- | --- | --- |
| 1 | Cline | VS Code extension whose harness is a plan-then-act loop with per-step human approval and cost transparency; the VS Code integration is the UI shell. Open-source counterweight to Cursor. | | Mid (plan-then-act, approval gates) |
| 2 | Roo Code | VS Code/Cursor extension in the Cline lineage. The harness is the approval-gated agent with custom modes and a strong MCP story; the IDE is the UI. Popular community fork when you want that workflow without the upstream extension. | | Mid (IDE extension, MCP-first) |
| 3 | Codex | OpenAI's terminal coding agent. The harness is the sandboxed tool-call loop with multi-provider support; the CLI is the shell. Reference implementation for "official CLI that ships code." | | Mid (reference CLI, sandboxed) |
| 4 | Gemini CLI | Google's first-party terminal agent for Gemini. The harness is the plugin/MCP tool-call loop; the terminal is the shell. Google's parallel to Claude Code / Codex, not just an API. | | Mid (official CLI, plugins, MCP) |
| 5 | crush | Charm's terminal coding agent (successor to OpenCode). The harness is the tool-calling loop with session persistence; the Bubble Tea TUI is the shell. | ⚠️ FSL-1.1-MIT | Mid (terminal agent, TUI) |
| 6 | OpenHands | Dockerized software-engineering agent. The harness is the bash/editor/browser toolset with micro-agents and event-stream session bridging; Docker is the sandbox. Main OSS choice for teams self-hosting autonomous repo work. | ⚠️ (multi-license) | Capability (Docker runtime, multi-surface agent) |
| 7 | goose | Block's extensible Rust agent. The harness is the MCP/ACP extension model with recipes and provider choice; there's no fixed UI slot, so you bolt it into whatever shell you use. | | Mid (extensions, MCP/ACP) |
| 8 | claw-code-agent | Python reimplementation of the Claude Code agent architecture with zero external dependencies; interactive chat, streaming, plugin runtime, nested agent delegation, cost tracking, MCP transport. A portable harness without the Rust/TS toolchain. | | Capability (pure Python, plugin runtime) |
| 9 | coderClaw | Self-hosted multi-role coding system (Creator, Reviewer, Test, Refactor, etc.) with AST and semantic maps; IDE-agnostic, chat-channel triggers. | | Capability (multi-role, AST/semantic) |
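Several entries above describe their harness as a plan-then-act loop with per-step approval. A minimal sketch of that approval gate follows; the function and the policy callback are hypothetical and do not reflect how Cline or Roo Code implement it. In a real extension the approve callback would prompt a human rather than apply a rule.

```python
from typing import Callable


def run_plan(
    plan: list[str],
    execute: Callable[[str], str],
    approve: Callable[[str], bool],
) -> list[tuple[str, str]]:
    """Act on each planned step only after approval; stop at the first rejection."""
    log = []
    for step in plan:
        if not approve(step):
            log.append(("rejected", step))
            break  # later steps may depend on the rejected one, so do not continue
        log.append(("done", execute(step)))
    return log


# Hypothetical plan and policy for illustration.
plan = ["read src/app.py", "edit src/app.py", "run rm -rf build"]


def approve(step: str) -> bool:
    return not step.startswith("run rm")  # stand-in for a human decision


def execute(step: str) -> str:
    return f"ok: {step}"  # stand-in for the real tool call


log = run_plan(plan, execute, approve)
print(log)
```

Keeping the gate in the harness rather than in the UI is what lets the same loop run behind a VS Code panel, a terminal prompt, or an automated policy.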

Coding harness configs and SDKs

Skill packs, slash-command libraries, meta-prompting frameworks, and official SDKs that give you the harness (the agent loop, planning, memory, hooks) without bundling a specific IDE or CLI shell.

| # | Project | Description | OSS | Simplicity ↔ capability |
| --- | --- | --- | --- | --- |
| 1 | get-shit-done | Goal-backward planning and wave-based execution over fresh context windows; avoids context rot by design. Python/JS meta-prompting for Claude Code, OpenCode, Gemini CLI. | | Mid (meta-prompting, you own stack) |
| 2 | GStack | Garry Tan's Claude Code skill stack: 23 slash-command modes (CEO/eng/design review, QA, ship, browse, retro, …) that structure one assistant as a virtual engineering team. Daily driver while running YC. | | Capability (multi-role slash-command harness) |
| 3 | everything-claude-code | The breakout 2026 harness pack for Claude Code (approaching 160k stars): 28 specialized subagents, 119 reusable skills, 60 slash commands, 34 rules, 20+ automated hooks. Ships a full "AI engineering team" as config. | | Capability (subagents + skills + hooks) |
| 4 | superpowers | Performance-oriented harness pack for Claude Code, Codex, OpenCode, Cursor: skills, instincts, memory, security, research-first workflows. Treats harness engineering itself as the performance lever. | | Capability (multi-IDE skill stack) |
| 5 | pmstack | Claude Code config for AI product managers: CLAUDE.md plus skills for competitive analysis, PRD-from-signal, metric frameworks, stakeholder briefs, and agent eval design. "GStack for PMs." | | Simple (skills bundle, PM-focused) |
| 6 | Claude Agent SDK | Official Anthropic SDK (Python + TypeScript, demos, quickstarts): built-in tools, MCP, long-running coding agents with session bridging. | | Capability (full SDK, session bridging) |
| 7 | AutoHarness | Lightweight governance harness: wraps any LLM client in ~2 lines for automated harness engineering, with a 6–14 step pipeline, YAML constitution, risk-pattern matching, session persistence with cost tracking, and multi-agent profiles. | | Simple (2-line wrapper, YAML gov) |
| 8 | RepoMaster | Repo-scoped research harness: builds function-call and module-dependency graphs to explore only what's needed; large relative gains on MLE-bench and GitTaskBench with lower token use. | | Capability (graph-based exploration) |
| 9 | SWE-agent | LM-driven harness built for SWE-bench: edit state, command execution, and issue-focused loop; the reference agent stack next to the benchmark itself. | | Capability (SWE-bench pairing, stateful edits) |
| 10 | OpenHarness (HKUDS) | Open agent harness with a built-in personal agent ("Ohmo") that runs across Feishu, Slack, Telegram, and Discord; core tool-use, skills, memory, multi-agent coordination with auto-compaction for multi-day sessions. | | Capability (personal agent + multi-channel) |

Frameworks

General-purpose agent and LLM application frameworks (the app layer, not harnesses per se).

| # | Project | Description | OSS | Simplicity ↔ capability |
| --- | --- | --- | --- | --- |
| 1 | langgraph | State-machine graphs over LLM steps; checkpointing, human-in-the-loop, and durable execution so workflows survive restarts. | | Capability (graphs, checkpointing, durable exec) |
| 2 | langchain | Chains, tools, retrievers, and agents; the usual entry point for "add tools to an LLM" in Python/JS. | | Capability (kitchen-sink ecosystem) |
| 3 | llama-index | Data-centric: indexing, RAG, and query engines; agent abstractions sit on top of your data pipelines. | | Capability (RAG + agents) |
| 4 | semantic-kernel | Microsoft's plugin and planner layer for LLMs; C#, Python, Java; strong on enterprise auth and orchestration. | | Capability (enterprise, multi-language) |
| 5 | mastra | TypeScript-first; agents, tools, and workflows with a single runtime and minimal boilerplate. | ⚠️ Elastic-2.0 | Mid (TS-first, minimal boilerplate) |
| 6 | agno | Python agents with memory, knowledge bases, tools, and structured outputs; continues the PhiData-era product line under the Agno name: production apps, evals, and pipelines. | | Capability (memory, KB, observability) |
| 7 | letta | Python agent runtime with tool use and control flow; lean API; stateful agents with long-horizon memory. | | Simple (lean API) |
| 8 | langflow | Low-code UI to build and deploy LangChain/LangGraph flows; visual DAG editor and one-click run. | | Capability (low-code, visual) |
| 9 | rasa | Conversational AI stack (NLU, dialogue, actions); long-standing OSS choice for chat and voice bots. | | Capability (full stack) |
| 10 | botpress | Visual bot builder and runtime; multi-channel, open-source alternative to commercial bot platforms. | | Capability (visual builder, multi-channel) |
| 11 | Dify | One-stop LLM app platform: visual workflows, RAG pipeline, 50+ tools, model management; "ship from prototype to prod" in a single UI. | ⚠️ Fair-code | Capability (one-stop platform) |
| 12 | n8n | Fair-code workflow engine with 400+ nodes and native AI nodes; the self-hosted Zapier that actually does agents and LangChain. | ⚠️ Fair-code | Capability (400+ nodes, workflow engine) |
| 13 | AutoGPT | The original autonomous loop: goal in, agent iterates with tools and memory; Forge is the dev framework, Benchmark the eval harness. | ⚠️ Polyform-SU | Capability (autonomous loop, tools, memory) |
| 14 | AIlice | Fully autonomous general-purpose agent; one binary, Docker-ready, for when you want "set goal and walk away" without a framework. | | Capability (autonomous, one binary) |
| 15 | Bee Agent Framework | Python + TypeScript, LF AI–backed; MCP/ACP, workflows, Requirement Agent; the one that pushes "production multi-agent" without LangChain. | | Capability (production multi-agent) |
| 16 | agent-squad | AWS-originated orchestrator (now under 2FastLabs): intent classification, streaming, SupervisorAgent; "agent-as-tools" so one agent delegates to a squad. | | Capability (squad orchestration) |
| 17 | SuperAgentX | Lightweight multi-agent orchestrator with an AGI-angle; minimal surface, docs-first, for teams that want orchestration without the kitchen sink. | | Simple (minimal surface) |
| 18 | AgentVerse | Task-solving and simulation envs for multi-LLM agents; deploy many agents in custom environments without building infra from scratch. | | Capability (simulation envs, multi-agent) |
| 19 | R2R | RAG-first: hybrid search, knowledge graphs, multimodal; the framework for "production RAG" when you care more about retrieval than chat UI. | | Capability (production RAG) |
| 20 | LiteSwarm | Async-only, LiteLLM-backed Python; dynamic agent switching and type-safe context, for devs who want 100+ models without LangGraph's weight. | | Mid (100+ models, dynamic switching) |
| 21 | AgentStack | Scaffolds full agent projects; plugs in CrewAI, LangGraph, OpenAI Swarm, LlamaStack and wires AgentOps observability from day one. | | Capability (scaffold, multi-backend) |
| 22 | AgentSilex | ~300 lines of readable agent code on top of LiteLLM; the "I want to see the whole loop" option for learning or minimal production. | | Simple (~300 LOC) |
| 23 | Flowise | Drag-and-drop LangChain UI; deploy flows without code. The low-code sibling to Langflow, with a different component and hosting story. | ⚠️ Apache+CLA | Capability (low-code, drag-drop) |
| 24 | browser-use | Python layer over Playwright: natural-language goals become browser actions, giving a web-agent loop without hand-rolling MCP or a custom driver for every site. | | Mid (LLM + browser, Playwright) |

Multi-agent and orchestration

Harnesses and patterns for multi-agent coordination and handoffs.

| # | Project | Description | OSS | Simplicity ↔ capability |
| --- | --- | --- | --- | --- |
| 1 | openai-agents-python | Handoffs, guardrails, and multi-LLM routing; minimal surface so you own the loop. | | Simple (minimal surface) |
| 2 | crewAI | Role-based agents (roles, goals, backstories) in Crews; Flows add event-driven and hierarchical control for production. | | Capability (roles, Flows, production) |
| 3 | autogen | Conversable agents and group chats; code execution and human-in-the-loop; Microsoft origin, AG2 ecosystem. | ✅ CC-BY | Capability (group chat, code exec, AG2) |
| 4 | PraisonAI | Autonomous multi-agent teams with a single entry point; emphasis on minimal config. | | Mid (single entry, minimal config) |
| 5 | AgentRL | Multitask, multiturn RL for LLM agents; Ray-based scaling, rollout/actor workers, for teams that want to train agents, not just run them. | | Capability (RL, Ray, train agents) |

Plugins, MCPs, CLI tools

IDE plugins, concrete MCP servers, and CLI tools that give agents tools and context.

| # | Project | Description | OSS | Simplicity ↔ capability |
| --- | --- | --- | --- | --- |
| 1 | aider | Git-aware CLI pair programmer; edits in-repo, supports multiple models and MCP so agents see version control and tools. | | Mid (CLI, git-aware, MCP) |
| 2 | agentlog | Persistent decision memory for any project: remember, recall, reflect. Single-file Python CLI that stores decisions as JSONL and uses Claude or Gemini to retrieve and synthesize patterns: Karpathy's LLM Wiki concept as a CLI. | | Simple (one file, three commands) |
| 3 | claude-mem | Claude Code plugin that captures everything an agent does during a session, AI-compresses it (via claude-agent-sdk), and injects the relevant context into future sessions: session-to-session memory as a drop-in. | | Capability (session capture + compression) |
| 4 | Better-OpenCodeMCP | MCP server for OpenCode/Crush: async task execution, model bridging (e.g. Claude→Gemini), process pooling. | | Mid (MCP server, model bridging) |
| 5 | MCP Python SDK | Official SDK to build and consume MCP servers/clients in Python; stdio and SSE transports. | | Simple (SDK only) |
| 6 | MCP TypeScript SDK | Official MCP implementation for Node/TS; reference for the protocol. | | Simple (protocol reference) |
| 7 | continue | Open-source IDE extension (VS Code, JetBrains); in-editor completion and chat with local or API models. | | Capability (IDE extension, multi-editor) |
| 8 | MCP Inspector | GUI to test and debug MCP servers; inspect tools, resources, and prompts. | | Simple (debug GUI) |
| 9 | github-mcp-server | MCP server for GitHub: repos, issues, PRs, code search; so your agent can "use GitHub" without hand-rolled API glue. | | Mid (GitHub API surface) |
| 10 | Docker MCP Gateway | Docker's official MCP CLI plugin / gateway; container-aware MCP tooling from Docker (replaces deprecated docker/mcp-servers path). | | Mid (Docker-aware MCPs) |
| 11 | puppeteer-mcp-server | Browser automation via MCP: tabs, screenshots, forms, JS execution; the one that connects to existing Chrome for dev/debug. | | Mid (browser automation) |
| 12 | puppeteer-real-browser-mcp | Puppeteer MCP with real-browser and anti-detection; for agents that need to drive sites that block headless. | | Mid (real browser, anti-detect) |

Evaluation and benchmarking harnesses

Agentic eval systems, reasoning benchmarks, and open agent benchmarks.

| # | Project | Description | OSS | Simplicity ↔ capability |
| --- | --- | --- | --- | --- |
| 1 | ARC-AGI-2 | ARC Prize task set: grid-based abstraction/reasoning; public and private splits for generalization. | | Simple (task set) |
| 2 | arc-agi-benchmarking | Runner for ARC-AGI: multi-provider (OpenAI, Anthropic, Gemini, etc.), rate limits, retries, and scoring. | | Mid (runner, multi-provider) |
| 3 | AgencyBench | Long-horizon agent benchmark: 32 scenarios, 138 tasks, ~1M tokens and ~90 tool calls; Docker sandbox and rubric-based + LLM judges. | | Capability (32 scenarios, Docker, judges) |
| 4 | TRAIL | Trace reasoning and agentic issue localization; 148 long-context traces, 841 errors, 20+ error types; Hugging Face dataset. | | Mid (traces, Hugging Face) |
| 5 | AgentBench | ICLR'24 benchmark: agents across AlfWorld, DB, knowledge graphs, OS, webshop; Docker Compose, function-calling interface. | | Capability (multi-env, Docker Compose) |
| 6 | WebArena | Realistic web env (e.g. e-commerce, CMS, dev tools); 812 tasks; measures end-to-end web agent success. | | Capability (812 tasks, web env) |
| 7 | SWE-bench | LMs resolve real GitHub issues; Docker harness, instance IDs; standard for code-agent evals. | | Capability (real GitHub issues, standard) |
| 8 | SWE-Gym | Training and evaluation for SWE agents and verifiers (ICML 2025). | | Capability (training + eval, ICML) |
| 9 | swe-smith | Data generation for SWE agents; 50k+ instances across 128 repos; used for SWE-agent-LM training. | | Capability (50k+ instances, data gen) |
| 10 | SUPER | Agents that set up and run ML/NLP from GitHub repos; 45 expert problems, 152 masked tasks, 602 auto-generated tasks; Docker-based. | | Capability (ML/NLP repos, Docker) |
| 11 | VitaBench | ICLR'26: 66 tools, real-world apps (delivery, travel, retail); 100 cross-scenario + 300 single-scenario tasks; adopted by Qwen/Seed. | | Capability (66 tools, cross-scenario) |
| 12 | letta-evals | Eval harness for stateful Letta agents; configurable suites and grading (LLM or rule-based) so you can measure what you ship. | | Mid (Letta-specific harness) |
| 13 | gaia-agent | Modular runner for the GAIA benchmark (450 real-world assistant questions); multi-agent evaluation without the Inspect AI lock-in. | | Mid (GAIA runner, modular) |
| 14 | WebVoyager | End-to-end web agent with LMMs: screenshots + actions on real sites; benchmark on 15 sites, GPT-4V for automatic eval. | | Capability (LMMs, screenshots, 15 sites) |
| 15 | inspect_evals | UK AISI/Arcadia/Vector: GAIA and other evals in Inspect AI; level 1–3, sandboxed, tool-calling solvers. | | Mid (Inspect AI, UK gov) |
| 16 | inspect_ai | Inspect AI core: composable eval tasks, sandboxes, scorers, and multi-model runs; the framework behind inspect_evals, not just the task bundle. | | Capability (eval framework, AISI stack) |
| 17 | Agent Lightning | Microsoft's training-oriented harness: optimization loops for agent behavior, for when you need to improve policies over rollouts, not only score a fixed prompt. | | Capability (agent training, Microsoft stack) |

Research and task-specific harnesses

Deep research, document QA, and domain-specific agent loops.

| # | Project | Description | OSS | Simplicity ↔ capability |
| --- | --- | --- | --- | --- |
| 1 | openagents | Platform for autonomous agents and autopilot-style workflows; decentralized/Nostr-oriented. | | Capability (platform, decentralized) |
| 2 | multi-scale-agentic-rag-playbook | NVIDIA's playbook: RAG at different scales with LangGraph agents, abstract search, and query routing; a reference architecture, not a product. | | Mid (playbook, reference arch) |
| 3 | Agentic_RAG_System | Ollama + LangChain, FAISS/BM25/RRF retrieval and an agentic reasoning loop; one concrete stack for "RAG that corrects itself." | | Mid (Ollama + LangChain stack) |

Libraries and SDKs

Lightweight runtimes, tool loops, and provider-agnostic harness primitives.

| # | Project | Description | OSS | Simplicity ↔ capability |
| --- | --- | --- | --- | --- |
| 1 | pydantic-ai | Type-safe Python agents with Pydantic I/O; multi-provider, MCP, Logfire observability, and human-in-the-loop. | | Capability (type-safe, MCP, Logfire) |
| 2 | open-harness | TypeScript Agent class on Vercel AI SDK; streaming events, filesystem/bash tools, MCP, and subagent delegation. | | Capability (streaming, tools, subagents) |
| 3 | vercel/ai | React and Node SDK for streaming, tool calls, and agent-style UIs; provider-agnostic. | | Mid (React/Node SDK, provider-agnostic) |
| 4 | agent-harness | Thin Python shim to swap OpenAI vs Anthropic agent SDKs behind one interface. | | Simple (thin shim) |
| 5 | smolagents | Code-as-action agents: model outputs Python executed in sandbox (E2B, Modal, etc.); ~1k LOC core. | | Mid (code-as-action, ~1k LOC) |
| 6 | Community-curated agent lists | Broader directories: e.g. brandonhimpfen/awesome-ai-agents, axioma-ai-labs/awesome-ai-agent-frameworks, mb-mal/awesome-ai-agents-frameworks; they differ by scope and update cadence. | | Simple (curated lists) |
| 7 | agentic | TypeScript agent stdlib: works with any LLM and the TS AI SDK; few abstractions, so you own the loop and the UI. (Archived Feb 2026.) | | Simple (stdlib, you own loop) |
| 8 | strands-agents | Model-driven Python SDK; decorators for tools, native MCP, multi-agent; "minimal code" without sacrificing provider choice. | | Mid (decorators, MCP, minimal code) |
| 9 | LiteLLM | One interface to 100+ LLMs; routing, caching, budgets. Not an agent framework, but the pipe every agent framework uses. | | Simple (LLM pipe only) |
| 10 | litellm2 | LiteLLM plus structured Pydantic outputs, budget controls, and agent-style tool loops; OpenRouter-default option. | | Mid (LiteLLM + tool loops) |
| 11 | openai-agents-js | Official OpenAI Agents SDK for Node/TS: handoffs, guardrails, voice; the JS counterpart to openai-agents-python. | | Capability (handoffs, guardrails, voice) |
| 12 | agent-framework | LiteLLM-backed Python with dynamic tool registry, query routing, memory, and Streamlit UI; "full-stack agent app" in one repo. | | Capability (tool registry, routing, Streamlit) |
| 13 | agentic-ai | Agentic AI stdlib for TypeScript; any LLM, any TS AI SDK; another "thin layer so you own the rest" option. | | Simple (thin layer) |
| 14 | E2B | Firecracker sandboxes for executing agent-generated code; the hosted isolation layer many tool-calling demos use instead of running arbitrary LLM output on your laptop. | | Mid (sandbox API, code execution) |
| 15 | Daytona | Elastic dev environments for AI-generated code: workspaces, Git, previews; an infra harness between "the model wrote a patch" and "it ran in a real machine." | | Mid (dev env API, isolation) |


Contribution

Contributions are welcome. For contribution guidelines, see CONTRIBUTING.md and the Code of Conduct.

License

CC BY-SA 4.0
