awesome-harness-engineering

mcp
Security Audit: Pass
Health: Pass
  • License — NOASSERTION
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Community trust — 414 GitHub stars
Code: Pass
  • Code scan — Scanned 1 file during light audit; no dangerous patterns found
Permissions: Pass
  • Permissions — No dangerous permissions requested
Purpose
This project is a curated repository of resources, links, and design patterns for building AI agent harnesses. It serves as a reference index for developers looking to learn about agent loops, memory systems, permissions, and orchestration.

Security Assessment
Overall risk: Low. The code scan covered a single file and found no dangerous patterns, hardcoded secrets, network requests, or shell command executions. No sensitive permissions are required. Keep in mind that, as an awesome list, the project's real security exposure depends entirely on the external repositories, tools, and links you choose to explore from its curated directory.

Quality Assessment
The repository is highly active, with its most recent push occurring just today. It has a solid community backing with over 400 GitHub stars, indicating strong trust and usefulness to developers. The only minor downside is that the license is listed as NOASSERTION by automated scanners, though the project's documentation indicates it uses the permissive CC0 license. Overall, it is a well-maintained and reliable resource.

Verdict
Safe to use.
SUMMARY

Awesome list for AI agent harness engineering: tools, patterns, evals, memory, MCP, permissions, observability, and orchestration.

README.md
Awesome Harness Engineering

Curated resources, patterns, and templates for building reliable AI agent harnesses.

Awesome · License: CC0 · GitHub Stars · GitHub Forks · Last Commit · linux.do

Deutsch | English | Español | Français | 日本語 | 한국어 | Português | Русский | 中文

Harness engineering is the discipline of designing the scaffolding — context delivery, tool interfaces, planning artifacts, verification loops, memory systems, and sandboxes — that surrounds an AI agent and determines whether it succeeds or fails on real tasks.

This list focuses on the harness, not the model. Every component here exists because the model can't do it alone — and the best harnesses are designed knowing those components will become unnecessary as models improve.



Foundations

Canonical essays that define what harness engineering is and why it matters.

  • Harness Engineering — OpenAI's framing of harness engineering as a discipline: how to design the scaffolding that lets Codex and similar agents operate reliably in an agent-first world.
  • Unrolling the Codex Agent Loop — OpenAI's detailed breakdown of the Codex agent loop, exposing each harness component and where it can be improved.
  • Run Long-Horizon Tasks with Codex — OpenAI's practice guide for long-horizon task planning: introduces Plan.md, Implement.md, Documentation.md as reusable harness artifacts.
  • Building Effective Agents — Anthropic's foundational guide on agent architecture, covering when to use workflows vs. agents and how to compose primitives.
  • Harness Design for Long-Running Application Development — Anthropic's engineering blog on designing harnesses for sustained, multi-session development tasks. Key insight: every harness component assumes the model can't do something; those assumptions expire.
  • Writing Effective Tools for Agents — Anthropic's guide on tool interface design: naming, schemas, error surfaces, and the principle that tool design is agent UX.
  • Beyond Permission Prompts — Anthropic on building structured permission and authorization systems into agent harnesses instead of relying on natural-language permission text.
  • Demystifying Evals for AI Agents — Anthropic's framework for evaluating agent behavior: what to measure, how to build eval harnesses, and why unit-test-style evals fail for agents.
  • What is an AI Agent? — Anthropic's definitional piece, useful for anchoring harness design decisions to a clear model of what an agent actually is.
  • Agent Development Kit: Making it easy to build multi-agent applications — Google's announcement and design rationale for ADK: explains the multi-agent topology, tool registration model, and eval pipeline that shaped their framework. Complements the Anthropic/OpenAI framing with Google's production perspective.
  • Harness Engineering — Martin Fowler's synthesis of what harness engineering practice looks like: three interlocking systems — context engineering (curating what the agent knows), architectural constraints (deterministic linters and structural tests), and entropy management (periodic agents that repair documentation drift). The "humans on the loop" framing — harness engineers who design and maintain agent environments rather than inspecting individual outputs — is the clearest conceptual map of what the discipline actually entails.
  • The Anatomy of an Agent Harness — LangChain's structural breakdown of the five primitives that compose a harness: filesystem (durable state + agent collaboration surface), code execution (autonomous problem-solving without pre-designed solutions), sandbox (isolation + verification), memory (cross-session persistence), and context management (compaction against "context rot"). The co-evolution warning — models trained with specific harnesses can become overfitted to those designs — explains why harness architecture choices have lasting consequences beyond the immediate task.
  • Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned — The first systematic practitioner paper on terminal-native coding agent harness design: eager-construction scaffolding (pre-build all components before the first message to eliminate first-call latency and race conditions), compound multi-model architecture (different model instances for execution, reasoning, critique, and vision tasks), 5-layer defense-in-depth safety, and schema-filtered planning subagents (enforce behavioral constraints via tool schema rather than runtime permission checks). The five lessons distilled from building OpenDev apply to any server-side agent harness.
  • Natural-Language Agent Harnesses — Proposes externalizing agent control logic as portable natural-language artifacts (NLAHs) executed by a shared Intelligent Harness Runtime, enabling harness design to be studied, transferred, and reproduced rather than buried in bespoke controller code. Directly addresses the root cause of harness fragility: control logic scattered across framework defaults and hard-coded controller logic that can't be inspected, versioned, or transferred.
  • Ranking Engineer Agent (REA): Meta's Autonomous AI System for Ads Ranking — Meta's production harness for multi-day ML pipeline automation with hibernate-and-wake checkpointing for resuming interrupted 6-hour tasks without losing context. Demonstrates harness design for scientific workflows where individual turns can exceed model context limits but the overall pipeline must maintain coherence across days.
  • Supercharge Your AI Agents: The New ADK Integrations Ecosystem — Google's 2026 update to Agent Development Kit expanding the ecosystem integrations (Hugging Face, GitHub, Daytona, Notion, etc.) and providing reference patterns for how orchestration harnesses wire external services without losing determinism or state coherence.
  • 2026 Agentic Coding Trends Report — Anthropic's industry benchmark identifying infrastructure configuration as a first-class optimization variable: harness setup alone can swing benchmarks by 5+ percentage points. Documents the shift from single-agent to orchestrated multi-agent teams and introduces the "agentic engineering platform" category, bridging the gap between agent frameworks and production deployment infrastructure.
  • How We Build Azure SRE Agent with Agentic Workflows — Architecture walkthrough of Microsoft's agent that has handled 35,000+ production incidents autonomously, reducing Azure App Service time-to-mitigation from 40.5 hours to 3 minutes. Documents the integration of MCP tools, telemetry, code repositories, and incident management platforms into a single agent harness with human-in-the-loop governance. The most data-backed production harness case study published in 2026.
  • Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent — Microsoft's account of shifting from 100+ bespoke tools and a prescriptive prompt to a filesystem-based context engineering system for their SRE agent. Key finding: exposing everything (source code, runbooks, query schemas, past investigation notes) as files and letting the agent use read_file, grep, find, and shell outperformed specialized tooling — "Intent Met" score rose from 45% to 75% on novel incidents.
  • Harness Engineering: Structured Workflows for AI-Assisted Development — Red Hat's enterprise perspective on harness engineering (April 7, 2026): AI writes better code when you design the environment it works in. Emphasizes structured context over free-form tickets, expanding the agent's toolbox through MCP integrations (CI status, deployment logs, runtime metrics) as real data sources, and a four-pillar model (vibes, specs, skills, agents) for organizing how humans and agents collaborate.
  • Harness engineering for coding agent users — Birgitta Böckeler's systematic mental model (April 2026) for coding-agent harnesses, framing them as feedforward guides plus feedback sensors that self-correct before output reaches human eyes. Distinguishes computational controls (linters, tests) from inferential ones (LLM-as-judge), and argues that harnessability should become a first-class criterion in technology and architecture decisions.
  • A Practical Guide to Building AI Agents — OpenAI's April 2026 comprehensive guide distilling production deployment patterns into actionable best practices: single-agent vs. multi-agent orchestration (manager vs. decentralized handoffs), tool design for many-to-many agent-tool relationships, and layered guardrail patterns combining input validation, output filtering, tool-risk ratings, and human-intervention triggers.

Design Primitives

Harness components organized by the problem they solve, not by vendor.

Agent Loop

  • ReAct: Synergizing Reasoning and Acting in Language Models — The foundational paper defining the Thought/Action/Observation loop structure that underlies virtually every agent harness. Required reading for understanding why the loop is structured the way it is and where each harness component maps onto the reasoning-acting cycle.
  • Unrolling the Codex Agent Loop — The canonical decomposition of what happens inside one agent loop iteration: observe, plan, act, verify.
  • LangGraph — Low Level Concepts — Models the agent loop explicitly as a directed graph with typed state, conditional edges, and checkpointing. The most concrete engineering treatment of loop control flow: how to implement termination conditions, branch on tool results, and persist mid-loop state for resumption.
  • Unlocking the Codex Harness: How We Built the App Server — OpenAI's engineering deep-dive into the Item/Turn/Thread protocol (JSON-RPC/JSONL over stdio) that exposes the Codex harness to every client surface. The most direct first-party account of why approval flows, streaming diffs, and thread persistence demand a purpose-built protocol — and why MCP's tool-oriented model proved insufficient for these requirements.
  • Extended Thinking — Claude API Docs — The harness-critical reference for integrating extended thinking into agent loops: budget_tokens controls reasoning depth per turn, thinking blocks must be preserved when passing tool results back (omitting them silently breaks multi-step reasoning), and thinking mode cannot change mid-turn. Essential before wiring extended thinking into any tool-use loop.
  • Improving Deep Agents with Harness Engineering — LangChain's case study showing harness-only changes moved their coding agent from rank 30 to top 5 on Terminal Bench 2.0 with no model swap: structured verification loops, context injection (directory maps + time budget warnings), loop-detection middleware, and a "reasoning sandwich" concentrating maximum thinking at planning and verification phases. The most concrete published demonstration that harness design is the primary performance lever, not model capability.
  • How Middleware Lets You Customize Your Agent Harness — Introduces AgentMiddleware: six composable hooks (before_agent, before_model, wrap_model_call, wrap_tool_call, after_model, after_agent) that intercept every stage of the agent loop. Enables deterministic policy enforcement (PII redaction that can't be trusted to prompts), dynamic tool injection, mid-task model swapping, and production patterns (retry, fallback, HITL interrupts) without modifying core agent logic — the reference design for cross-cutting harness concerns that shouldn't be baked into individual agents.
  • Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics — Controlled experiment isolating interpreter state persistence as an independent training variable. The harness finding: mismatching your runtime persistence mode to the model's training-time semantics produces either 80% missing-variable errors (model expects state that doesn't persist) or 3.5× token overhead (model redundantly recomputes state it expects to already have). Persistence is a learned semantic that must be honored at deployment, not a free runtime choice.
  • Real-Time Deadlines Reveal Temporal Awareness Failures in LLM Strategic Reasoning — Demonstrates that temporal awareness (handling deadlines and time constraints) appears orthogonal to reasoning capability: explicit temporal feedback in the agent loop significantly improves LLM performance on deadline-constrained tasks. Indicates temporal semantics as a learned behavior that must be integrated into harness-level context (current time, deadlines, time budgets) rather than assumed from capability alone.
  • A Scheduler-Theoretic Framework for LLM Agent Execution — April 2026 systematic analysis of 70 open-source LLM agent projects showing 60% adopt the Agent Loop pattern. Proposes a formal scheduler framework that maps execution patterns (Agent Loop, Event-driven, State-machine, Graph/flow, Hybrid) onto a unified control model, making the controllability/expressiveness/implementability trade-offs explicit. Essential reading for choosing the right loop architecture rather than defaulting to the simplest pattern.
  • Confucius Code Agent (CCA) — February 2026 production-grade coding agent from Meta/Harvard built on the Confucius SDK, which structures harness design around three perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). Features a unified orchestrator with advanced context management, persistent note-taking for cross-session learning, and a meta-agent that automates build-test-improve cycles. Achieves 59% Resolve@1 on SWE-Bench-Pro, exceeding prior research and commercial baselines.
  • The Design Space of Today's and Future AI Agent Systems — April 2026 reverse-engineering of Claude Code's architecture revealing five-stage progressive compaction (budget reduction → snip → microcompact → context collapse → auto-compact), subagent isolation with rebuilt permission contexts, and a 27-event-type hook pipeline. The most detailed public analysis of a production agent loop's internal design decisions — essential for understanding how context pressure, safety, and delegation are handled at scale.
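
The Thought/Action/Observation cycle these entries decompose can be sketched as a plain control loop. This is a minimal illustration, not any framework's API: `call_model`, the `TOOLS` registry, and the message shapes are all hypothetical stand-ins for a real model client and tool set.

```python
# Hypothetical tool registry: tool name -> callable returning an observation string.
TOOLS = {
    "add": lambda args: str(args["a"] + args["b"]),
}

def call_model(messages):
    """Stub standing in for a real LLM call. A real harness would send
    `messages` to a model and parse either a tool action or a final answer."""
    last = messages[-1]["content"]
    if "Observation:" in last:
        return {"final": last.split("Observation:")[-1].strip()}
    return {"action": "add", "args": {"a": 2, "b": 3}}

def agent_loop(task, max_turns=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):  # hard turn budget: the simplest termination guard
        step = call_model(messages)
        if "final" in step:     # the model signalled completion
            return step["final"]
        obs = TOOLS[step["action"]](step["args"])  # act, then observe
        messages.append({"role": "tool", "content": f"Observation: {obs}"})
    raise RuntimeError("turn budget exhausted")

print(agent_loop("What is 2 + 3?"))  # → 5
```

Planning artifacts, compaction, permissions, and verification loops all hook into some stage of this cycle.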

Planning & Task Decomposition

  • Run Long-Horizon Tasks with Codex — Introduces milestone-based planning artifacts (Plan.md, Implement.md) as harness-level state.
  • Harness Design for Long-Running Application Development — Multi-session planning, progress tracking, and the role of persistent planning documents.
  • Plan-and-Execute Agents — The canonical engineering write-up separating planning from execution as distinct harness layers: a planner LLM generates the step list once; an executor agent works through it, replanning only when needed. Defines the pattern that most modern task-decomposition harnesses follow.
  • microsoft/TaskWeaver — Code-first task decomposition framework with a planner/executor split and a plugin system for injecting domain knowledge into the planning layer. The most complete reference implementation of plan-then-execute with stateful task tracking. Stars
  • LATS: Language Agent Tree Search — Unifies reasoning, acting, and planning via Monte Carlo Tree Search over agent trajectories. Directly informs harness design: external tool feedback as tree-search signals, trajectory backtracking on failure, and depth-bounded exploration make this the most actionable planning research for harnesses with real environment interaction.
  • Agyn: A Multi-Agent System for Team-Based Autonomous Software Engineering — Demonstrates specialized harness patterns for coordinating heterogeneous agent teams (planner, coder, reviewer, executor) on software engineering tasks. Shows how role-specific agents with different model sizes and tool access produce better outcomes than single-agent approaches, with concrete metrics on task decomposition effectiveness.
  • Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks — Modular framework separating high-level planning from low-level execution through synthetic data generation and explicit structured planning. Achieves 57.58% success on WebArena-Lite and 81.36% on WebVoyager. The key harness insight is that planner and executor can be specialized independently — different model sizes, tool access, and reasoning budgets for each layer — improving overall reliability on tasks exceeding context window limits.
  • Choosing the Right Multi-Agent Architecture — Decision framework for four multi-agent patterns (subagents, skills, handoffs, router) with concrete performance data: subagents process 67% fewer tokens than skills in multi-domain scenarios because context isolation prevents cross-domain bloat. The five-dimension matching table (distributed development, parallelization, multi-hop, user interaction, latency) is the most actionable published guide for deciding when a topology change — not a model change — is the right lever for a performance problem.
  • Multi-Agent Workflows Often Fail. Here's How to Engineer Ones That Don't. — GitHub's February 24, 2026 distillation of a failure pattern most harnesses eventually rediscover: multi-agent systems behave like distributed systems, so every handoff needs typed schemas, constrained action schemas, and explicit boundary validation. Worth including because it turns "add more agents" from a vibe into an interface design problem you can actually reason about.
  • Effective Harnesses for Long-Running Agents — Anthropic's pattern for maintaining agent progress across multiple context windows: an initializer agent sets up the environment once and hands off to a coding agent that makes incremental progress each session. The structured handoff mechanism — feature lists, git commits, and test gates as cross-session state — is the reference design for any harness where a task exceeds a single context window and naïve restarts lose accumulated progress.
  • Task-Adaptive Multi-Agent Orchestration (AdaptOrch) — February 2026 framework that dynamically selects orchestration topology (parallel, sequential, hierarchical, or hybrid) based on task dependency graphs rather than fixed pipeline architecture. Demonstrates that topology choice is a harness-level lever that can improve performance 12–23% over model selection alone.
  • Task-Decoupled Planning for Long-Horizon Agents (TDP) — January 2026 planning framework that combines task decomposition with modular agent design: a Supervisor decomposes tasks into a dependency graph, Planner & Executor agents solve each decoupled sub-task node independently, and a Self-Revision module updates the graph after execution. The key harness insight is that decoupling planning from execution at the sub-task level enables localized replanning without cascading failures across the entire task chain.
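
The planner/executor split running through these entries can be sketched as follows. `plan` and `execute_step` are hypothetical stubs for a planner model and an executor agent, and the replanning policy is deliberately simple: replan only the remaining tail, under a fixed budget.

```python
def plan(task):
    """Stub planner: a real harness would ask a planner model for a step list.
    Returns an ordered list of sub-task descriptions."""
    return [f"research: {task}", f"draft: {task}", f"verify: {task}"]

def execute_step(step):
    """Stub executor: a real harness would hand each step to an executor
    agent with its own tools and context. Returns (ok, result)."""
    return True, f"done({step})"

def plan_and_execute(task, max_replans=2):
    steps, results = plan(task), []
    replans = 0
    while steps:
        ok, result = execute_step(steps[0])
        if ok:
            results.append(result)
            steps.pop(0)                       # advance the plan
        elif replans < max_replans:
            steps = plan(task)[len(results):]  # replan only the remaining tail
            replans += 1
        else:
            raise RuntimeError("replanning budget exhausted")
    return results

print(len(plan_and_execute("summarize repo")))  # three completed steps
```

The point of the split is that `plan` and `execute_step` can be backed by different models, tools, and reasoning budgets, which is the specialization lever several of the papers above measure.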

Context Delivery & Compaction

  • Harness Engineering — How to structure context windows for agents: what to include, what to exclude, and how context shape affects agent behavior.
  • Effective Context Engineering for AI Agents — Anthropic's systematic guide to managing the full context state—system prompts, tools, MCP, and message history—as a finite, curated resource. Reframes harness design as "what configuration of context produces the desired behavior?" rather than just prompt wording.
  • Compaction — Claude API Docs — Anthropic's reference for server-side context compaction: automatically summarizes older context when approaching the window limit. Reduced token consumption by 84% in a 100-turn web search eval while allowing agents to complete workflows that would otherwise hit context limits.
  • LLMLingua — Microsoft Research's prompt compression toolkit (up to 20x compression, minimal performance loss) that can be embedded as a preprocessing step in the context delivery layer. LLMLingua-2 adds 3–6x speed gains, making it viable for latency-sensitive agent loops. Stars
  • Prompt Caching — Claude API Docs — The most effective harness-level cost lever: cache repeated system prompts, tool definitions, and long documents across requests. Explains where to place cache_control breakpoints for maximum reuse across multi-turn agent sessions.
  • Autonomous Context Compression — Shifts context compression from harness-controlled (compacting at a fixed token threshold) to agent-controlled: agents call a dedicated tool to trigger compression when strategically appropriate — between tasks or before consuming large inputs. Eliminates the failure mode where reactive-at-limit compaction interrupts agents mid-subtask and corrupts in-flight reasoning state.
  • Active Context Compression: Autonomous Memory Management in LLM Agents — Proposes a "Focus Agent" architecture where the agent autonomously decides when to consolidate interaction history into a persistent Knowledge block and prune raw context — shifting compression from a harness-enforced policy to a model-controlled action. Produces 22.7% token reduction with no accuracy loss on long-horizon tasks; the core contribution is making the compression unit semantically coherent (the agent decides what knowledge is worth preserving) rather than mechanically token-budget-driven.
  • Making Agent-Friendly Pages with Content Negotiation — Vercel's February 3, 2026 implementation guide for serving text/markdown when agents request it via Accept: text/markdown, while preserving the same human-facing HTML URL. This is a real harness primitive, not just a docs trick: it removes boilerplate before it ever enters the context window and gives agents cleaner, cheaper inputs without custom scrapers.
  • A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces — Reframes RAG as a harness tool-design problem: instead of injecting retrieved documents into context at pipeline time, expose three retrieval tools (keyword search, semantic search, chunk read) and let the agent pull information incrementally as each reasoning step requires it. The key harness decision is architectural — retrieval becomes a tool call in the agent loop, not a preprocessing step — which means the agent's reasoning can adaptively narrow scope rather than processing everything injected upfront.
  • LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications — Structured framework for building production-grade evaluation harnesses: evaluation gates that block deployment, observability instrumentation that tracks all agent decisions, and CI integration patterns that catch regressions before they reach users. Essential reading for organizations deploying multiple agents in parallel where a single harness failure can cascade.
  • ByteRover: Agent-Native Memory Through LLM-Curated Hierarchical Context — LLM-curated hierarchical context management for agents where the model itself learns to weight information importance across multiple hierarchy levels. Reduces token overhead through learned relevance filtering without sacrificing comprehension. Directly applicable to any harness where context budget is the limiting factor — letting the model curate what belongs in active memory vs. what can be retrieved on-demand.
  • Claude Code Compaction: How Context Compression Works — March 2026 deep-dive into Claude Code's automatic compaction mechanism: what survives (current task, recent errors, file names) vs. what gets lost (initial instructions, intermediate decisions, style rules). Key harness insight: never rely on compaction for critical rules — move them to CLAUDE.md where they live in the system prompt and survive any compression. Essential practical guidance for anyone running long-session agents.
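
A threshold-triggered compaction pass, the baseline policy these entries refine, reduces to a few lines. The token estimate and the `summarize` stub are placeholders: a real harness would use the model's tokenizer and a summarization call rather than character counts.

```python
def estimate_tokens(messages):
    # Crude proxy: roughly 4 characters per token. Real harnesses use a tokenizer.
    return sum(len(m["content"]) for m in messages) // 4

def summarize(messages):
    """Stub: a real harness would ask a model to summarize the folded turns."""
    return {"role": "system", "content": f"[summary of {len(messages)} earlier messages]"}

def compact(messages, budget=50, keep_recent=2):
    """Fold everything but the most recent turns into one summary message
    once the estimated token count crosses the budget."""
    if estimate_tokens(messages) <= budget or len(messages) <= keep_recent:
        return messages
    head, tail = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(head)] + tail

history = [{"role": "user", "content": "x" * 80} for _ in range(5)]
compacted = compact(history)
print(len(compacted))  # summary message plus the two most recent turns
```

The agent-controlled compression work above keeps this same fold operation but moves the decision of *when* to call `compact` from a fixed threshold to a tool the model invokes between subtasks.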

Tool Design

  • Writing Effective Tools for Agents — Tool naming, schema design, error messages, and return value conventions that make agents more reliable.
  • Tool Use — Claude API Docs — Authoritative reference for client vs. server tool execution models, strict schema enforcement, and tool_result error signaling. The distinction between client-side and server-side tool execution is a foundational harness architecture decision.
  • Function Calling — OpenAI Docs — Defines the de facto industry-standard JSON Schema conventions for tool definitions and parallel function calling. Essential reading before designing a tool interface that needs to work across multiple models.
  • Tool Annotations as Risk Vocabulary — The MCP team's definitive post on the four tool annotation hints (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) as inputs to harness permission decisions, not enforced contracts. The "lethal trifecta" — private data access + untrusted content exposure + external communication — is the most actionable framing for why single-tool safety analysis misses the risk that emerges from tool combinations.
  • outlines — Constrains token sampling via regex/CFG/JSON Schema at the decoding layer, guaranteeing structured output without model fine-tuning. The right solution when you need OpenAI Structured Outputs-equivalent reliability from a locally deployed or open-weight model. Stars
  • instructor — Maps Pydantic models directly to structured LLM extraction with built-in retry and validation-error feedback loops. Turns tool call output parsing from ad-hoc JSON handling into type-safe data models, eliminating an entire class of harness parsing bugs. Stars
  • SkillTester: Benchmarking Utility and Security of Agent Skills — Framework for evaluating agent skills on three dimensions (capability, robustness, security) before deployment. Directly addresses the harness problem of skill sprawl: as agents gain access to more tools, the combinatorial explosion of failure modes becomes unmanageable without systematic verification. The 86-task benchmark across 11 domains provides reference metrics for skill quality.
  • AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness — Google DeepMind technique that uses code synthesis to auto-generate runtime constraint harnesses from tool schemas and task specifications. Gemini-2.5-Flash + AutoHarness outperforms Gemini-2.5-Pro and GPT-5.2-High on TextArena games by eliminating illegal moves through learned harness policies. Shifts constraint enforcement from static (schema validation) to dynamic (synthesized code guards) — a reference pattern for learning-based behavioral guardrails.
  • Scaling Parallel Tool Calling for Efficient Deep Research — February 2026 analysis of how parallel tool calling reduces latency in multi-step agent workflows. Demonstrates that concurrent tool execution (rather than sequential observe→act loops) is the key efficiency lever for deep-research harnesses where each step may invoke search, browse, and compute tools simultaneously. Essential for designing low-latency agent loops without sacrificing reasoning depth.
  • EigentSearch-Q+ — April 2026 framework for deep-research agents using dedicated reasoning tools (plan_next_searches, select_query_and_search, extract_relevant_details, analyze_search_progress) that externalize intermediate decisions as typed tool arguments. Inspired by Anthropic's think-tool paradigm, Q+ makes cognitive scaffolding explicit and auditable — bridging classic information-retrieval strategies with structured model-driven tool invocations.
  • TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training — March 2026 framework that models interaction topology — the structural patterns of how agents invoke, chain, and conditionally branch between tools — as a first-class training signal. Rather than treating tool use as isolated function calls, TopoCurate learns topological priors from expert trajectories, improving generalization to novel tool combinations and multi-step orchestration patterns. Directly applicable to harnesses where tool topology (not just tool availability) determines task success.
  • Design Patterns for Deploying AI Agents with Model Context Protocol — March 2026 field report from an enterprise MCP deployment identifying three protocol-level gaps that break production: missing identity propagation (who is the request for?), absent adaptive tool budgeting, and unstructured error semantics. The concrete mitigation patterns — JWT-enriched tool calls, per-tool timeout contracts, and standardized error-action mappings — are essential before betting on MCP as your primary tool-integration layer.
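
A tool definition in the JSON Schema convention described above, plus the kind of harness-side argument check a loop performs before dispatch. The `search_issues` tool and its fields are invented for illustration, and the validator is a minimal sketch; production harnesses use a real schema validator.

```python
import json

# Tool definition in the common function-calling shape (illustrative, not any
# one vendor's exact API surface).
search_tool = {
    "name": "search_issues",
    "description": "Search open issues by keyword. Returns at most `limit` results.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Keyword to match."},
            "limit": {"type": "integer", "description": "Max results (default 10)."},
        },
        "required": ["query"],
    },
}

def validate_call(tool, arguments):
    """Minimal harness-side check of a model-emitted tool call: required
    fields present, no unknown fields. A real schema validator also checks types."""
    props = tool["parameters"]["properties"]
    missing = [k for k in tool["parameters"]["required"] if k not in arguments]
    unknown = [k for k in arguments if k not in props]
    return {"ok": not missing and not unknown, "missing": missing, "unknown": unknown}

print(validate_call(search_tool, json.loads('{"query": "timeout", "limit": 5}')))
```

Rejecting a malformed call with a structured error message, rather than executing it or silently dropping it, is exactly the error-surface design the tool-writing guides above emphasize.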

Skills & MCP

  • Model Context Protocol — Anthropic's open protocol for connecting agents to external tools, data sources, and services in a standardized way.
  • modelcontextprotocol/servers — Anthropic's official reference MCP server implementations (GitHub, Slack, Postgres, Puppeteer, etc.). The authoritative source for understanding correct MCP server structure before building your own. Stars
  • microsoft/playwright-mcp — Browser automation via accessibility tree snapshots rather than screenshots, dramatically reducing token cost. The canonical example of structured tool output design in an MCP server. Stars
  • A2A Protocol — Google's open Agent-to-Agent protocol: JSON-RPC over HTTP(S)/SSE with Agent Card service discovery and a task/message/artifact communication model. The emerging standard for cross-framework agent interoperability in multi-agent harnesses. Stars
  • MCP Inspector — Interactive debugging UI for MCP servers: inspect tool definitions, send test calls, and validate responses without wiring up a full agent. The essential development tool for anyone building or integrating MCP servers into a harness. Stars
  • Shell + Skills + Compaction: Tips for Long-Running Agents — OpenAI's engineering guide to three production harness primitives: versioned Skill bundles (SKILL.md manifest; routing accuracy improved 73%→85% by adding negative examples), a managed shell container for durable tool execution, and server-side compaction via explicit /responses/compact endpoint. The most concrete first-party documentation of skills-based routing and compaction published in 2026.
  • Composio — Wraps 250+ SaaS APIs (GitHub, Slack, Linear, Notion, etc.) as agent-ready actions with managed OAuth, so tool integration becomes a one-line import rather than a custom harness component per service. The fastest path from "the agent needs to call an external API" to a production-grade, authenticated tool. Stars
  • MCP Streamable HTTP Transport — The transport that replaced HTTP+SSE in the 2025-11-25 spec, enabling MCP servers to run as remote services rather than local processes. Servers handle multiple client connections using HTTP POST (for client→server messages) and optional GET (for server→client SSE streams). The key harness architecture decision: Streamable HTTP unlocks remote MCP deployment but introduces session management complexity — stateful Mcp-Session-Id headers fight with load balancers and horizontal scaling, which the 2026 roadmap aims to resolve by decoupling sessions from the transport layer.
  • The 2026 MCP Roadmap — The MCP team's roadmap for the next spec cycle: horizontal-scaling transport without stateful session constraints, .well-known discovery for capability advertisement without live connections, Tasks primitive with retry/expiry semantics, and enterprise extensions (audit trails, SSO, gateway behavior). Essential reading before investing heavily in MCP server infrastructure — the transport and discovery changes in particular affect how clients locate and connect to servers.
  • Developer's Guide to AI Agent Protocols — Google's survey of six standardized agent interoperability protocols, each solving a distinct harness integration problem: MCP (tool/data connectivity), A2A (inter-agent routing via Agent Card discovery at well-known URLs), UCP (commerce workflows), AP2 (payment authorization with spend limits), A2UI (agent-driven dynamic UIs), AG-UI (streaming event format). The most practical map of which protocol to choose when an agent needs to cross a system boundary.
  • AG-UI — Lightweight event-driven protocol standardizing how AI agents connect to frontend applications: streaming state updates, tool call rendering, and HITL interrupts over a shared event bus. Fills the layer between MCP (tool access) and A2A (agent-to-agent) — it's the missing protocol for real-time agent-to-UI communication that neither MCP nor A2A was designed to address. Stars
  • Code Execution with MCP: Building More Efficient Agents — Anthropic's engineering account of reducing tool-call token overhead by having agents write code to interact with MCP servers rather than calling tools directly: up to 98.7% token reduction in experiments. Broadly applicable to any harness where tool schema overhead and intermediate results are consuming context — the pattern is to wrap multi-step tool interaction in a code execution primitive rather than exposing each operation as a discrete tool call.
  • Microsoft Skills Framework — Standardized framework for defining, versioning, and distributing agent skills. Enables skill reuse across Claude Code, Copilot, VS Code, Gemini, and other platforms — a harness-level abstraction that makes skills first-class deployment artifacts rather than ad-hoc tool definitions. Stars
  • SkillNet & SkillsBench: Infrastructure for AI Agent Skills at Scale — Comprehensive framework for creating, evaluating, and sharing agent skills, with an 86-task benchmark spanning 11 domains. Demonstrates the harness problem of skill fragmentation and provides infrastructure for standardized skill evaluation across frameworks. Stars
  • AWS Bedrock AgentCore with WebRTC Support — Adds peer-to-peer, UDP-based WebRTC bidirectional streaming to Bedrock Agents for real-time voice interactions. Complements existing WebSocket support with lower latency and better resilience for poor network conditions. Essential harness-level transport choice for agents targeting sub-800ms Total Turn-Around Time voice interactions.
  • Hermes Agent: Unified Streaming for Real-Time Agent Workflows — Token-by-token streaming delivery system enabling real-time agent responses; sub-second decision loops on streaming events vs. batch-refreshed data. Critical infrastructure for harnesses where latency (not just throughput) is the constraint — agents must react to events as they arrive, not wait for batch completions.
  • Google Developers: Closing the Knowledge Gap with Agent Skills — Google ADK expansion with an evaluation harness (117 prompts) for assessing skill performance across agentic coding, chatbots, and document processing. Provides reference patterns and benchmark datasets for skill evaluation, complementing the Microsoft Skills Framework with Google's evaluation methodology.
  • What's New with GitHub Copilot Coding Agent — GitHub's February 26, 2026 update is worth including for one specific reason: it makes .github/agents/ custom agent files, self-review, built-in security scanning, and CLI handoff concrete as harness primitives rather than abstract ideas. Useful as a current reference for how repository-scoped agent definitions and security checks are being productized in a real coding-agent control plane.
  • Announcing Official MCP Support for Google Services — Google's 2026 rollout of managed MCP endpoints is a useful counterpoint to self-hosted MCP servers: discovery, IAM, audit logging, and Model Armor are provided as platform primitives instead of being rebuilt per server. Worth including because it shows what "enterprise MCP" looks like when the transport, auth, and governance layers are treated as product surface rather than glue code.
  • Dataverse Skills: Your Coding Agent Now Speaks Dataverse — Microsoft's April 1, 2026 release is a strong concrete example of domain-specific skills done properly: the agent learns when to use MCP, when to drop to a Python SDK, and when to call a raw API, while the user stays in natural language. Worth adding because it shows that "skills" are not just prompt snippets, but curated execution strategies that hide a multi-tool integration stack behind intent.
  • agentic-stack — Portable .agent/ folder that externalizes memory, skills, and protocols from any specific coding agent into a cross-tool harness layer. Adapters translate the same configuration into Claude Code's CLAUDE.md, Cursor's rules, OpenCode's AGENTS.md, and more — the first practical answer to harness vendor lock-in at the configuration level. Stars
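
The code-execution pattern described in the Anthropic entry above can be sketched in a few lines. This is a toy illustration with invented names (list_issues, close_issue, run_script stand in for real MCP tools and a sandbox runtime), not any project's actual API: the harness exposes one code-execution primitive, the agent writes a script against the tool library, and only the final summary re-enters the context window.

```python
# Toy sketch of "code execution over discrete tool calls" (all names hypothetical):
# intermediate tool payloads stay inside the sandbox; only `result` returns.

def list_issues(repo):                      # stand-in for an MCP tool
    return [{"id": i, "title": f"issue {i}", "open": i % 2 == 0} for i in range(100)]

def close_issue(repo, issue_id):            # stand-in for a second MCP tool
    return {"id": issue_id, "closed": True}

def run_script(source: str) -> str:
    """One discrete tool call: run agent-authored code, return only `result`."""
    namespace = {"list_issues": list_issues, "close_issue": close_issue}
    exec(source, namespace)
    return str(namespace["result"])         # the only text that re-enters context

# The agent writes a script instead of making dozens of separate tool calls:
agent_script = """
stale = [i["id"] for i in list_issues("demo/repo") if i["open"]]
for issue_id in stale:
    close_issue("demo/repo", issue_id)
result = f"closed {len(stale)} open issues"
"""
print(run_script(agent_script))  # the 100-issue payload never reaches the model
```

The design choice being illustrated: tool schemas and intermediate results dominate token cost in multi-step tool use, so collapsing the interaction into one sandboxed script trades per-call transparency for context efficiency.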

Permissions & Authorization

  • Beyond Permission Prompts — Structured authorization patterns for agents: how to give agents the right permissions without relying on prompt-level trust.
  • OWASP LLM06:2025 — Excessive Agency — OWASP's authoritative definition of the "excessive agency" risk: over-provisioned functions, unnecessary permissions, and missing approval mechanisms. The standard checklist for auditing harness permission scope against the principle of least privilege.
  • GitHub Enterprise — Governing Agents — April 2026 GitHub official guide for enterprise agent governance: MCP server registry curation with ruleset-protected configurations, agent environment standardization via copilot-setup-steps.yml, ephemeral runner enforcement, and cloud-agent firewall allowlisting. The most concrete published reference for governing agent fleets at scale without creating bottlenecks.
  • Claude Code Auto Mode: A Safer Way to Skip Permissions — Anthropic's engineering post on replacing approval fatigue (users approve 93% of prompts, making approvals meaningless) with a two-stage classifier: fast single-token gate first, chain-of-thought reasoning only on flagged actions. The design decisions — stripping assistant messages to prevent the agent from rationalizing dangerous actions, deny-and-continue recovery instead of halt — are the reference design for safe-by-default headless agent permissions.
  • Claude Agent SDK — Configure Permissions — The most concrete reference for harness permission architecture: five-layer evaluation order (hooks → deny rules → permission mode → allow rules → canUseTool), allowedTools/disallowedTools declarative scoping, and four permission modes including dontAsk (deny-by-default for headless agents). The subagent inheritance warning for bypassPermissions alone is worth reading before any multi-agent deployment.
  • Two Different Types of Agent Authorization — Distinguishes on-behalf-of authorization (agent uses end-user credentials, requires cross-channel identity mapping and per-user memory isolation) from fixed-credential authorization (agent owns its own account, requires human-in-the-loop guardrails on high-risk actions). The two models have fundamentally different threat surfaces and determine where authorization enforcement lives in the harness.
  • Authorization and Governance for AI Agents: Runtime Authorization Beyond Identity at Scale — Microsoft Security's reusable Authorization Fabric combining a Policy Enforcement Point (PEP) and Policy Decision Point (PDP) as a Microsoft Entra-protected endpoint. Every agent calls this fabric before tool execution, receiving a deterministic decision: ALLOW / DENY / REQUIRE_APPROVAL / MASK. Addresses the gap that identity alone (who is this agent?) doesn't answer whether a specific action should be executed now, by this agent, for this user, under the current business and regulatory context.
  • IETF draft-klrc-aiagent-auth: AI Agent Authentication and Authorization — The first IETF standards-track specification for AI agent authentication (March 2026, authors from AWS, OpenAI, Zscaler, Ping Identity, Defakto Security). Builds on WIMSE (Workload Identity in Multi-System Environments) and OAuth 2.0 rather than inventing new protocols — agents get SPIFFE-style identifiers, with delegation via OAuth Token Exchange and DPoP for token binding. Essential reference for any harness that needs to authenticate agents across trust domains.
  • Nango: Pre-Built Authentication for AI Agents — Open-source platform providing pre-built OAuth and API key authentication for 700+ APIs across 30 categories. Automatically refreshes access tokens, provides webhooks when credentials break, and stores tokens securely so agent code never touches secrets. Solves the "agent needs to call an authenticated API" problem at scale — the authentication layer that complements Composio's tool wrapping. Stars
  • AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security — Three-dimensional risk taxonomy (source/failure-mode/consequence) with a fine-grained agentic safety benchmark (ATBench) and diagnostic guardrail models (4B–8B parameters) achieving 91.8% accuracy. Shifts safety monitoring from binary safe/unsafe checks to root-cause diagnosis: why did an action violate constraints? Where did the violation originate? What are the downstream consequences? Essential for production harnesses where transparency into safety decisions is required for audit trails.
  • Open Agent Passport (OAP): Deterministic Pre-Action Authorization for Autonomous AI Agents — March 2026 open specification and reference implementation that intercepts tool calls synchronously before execution, evaluates them against a declarative policy, and produces a cryptographically signed audit record. Enforces authorization in a median of 53ms; in a live adversarial testbed ($5,000 bounty), restrictive OAP policy achieved 0% attack success vs. 74.6% under permissive policies. Distinguishes pre-action authorization from sandboxed execution and model-based screening as complementary but distinct harness layers.
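
Several entries above share one mechanism: a Policy Enforcement Point that intercepts every tool call and consults a Policy Decision Point before execution. A minimal sketch of that split, with invented policy and tool names (the real systems add MASK decisions, signed audit records, and identity context omitted here for brevity):

```python
# Minimal PEP/PDP sketch (hypothetical names): every tool call is intercepted
# synchronously and receives a deterministic ALLOW / DENY / REQUIRE_APPROVAL
# decision before it is allowed to execute.
from dataclasses import dataclass

ALLOW, DENY, REQUIRE_APPROVAL = "ALLOW", "DENY", "REQUIRE_APPROVAL"

@dataclass
class Decision:
    verdict: str
    reason: str

def decide(tool: str, args: dict, policy: dict) -> Decision:
    """Policy Decision Point: evaluate a declarative policy."""
    if tool in policy["deny"]:
        return Decision(DENY, f"{tool} is denied by policy")
    if tool in policy["require_approval"]:
        return Decision(REQUIRE_APPROVAL, f"{tool} needs human sign-off")
    return Decision(ALLOW, "no rule matched; default allow")

def enforce(tool, args, policy, execute, approve=lambda d: False):
    """Policy Enforcement Point: intercept the call before execution."""
    decision = decide(tool, args, policy)
    if decision.verdict == ALLOW:
        return execute(tool, args)
    if decision.verdict == REQUIRE_APPROVAL and approve(decision):
        return execute(tool, args)
    raise PermissionError(decision.reason)

policy = {"deny": {"shell.exec"}, "require_approval": {"repo.delete"}}
run = lambda tool, args: f"ran {tool}"
print(enforce("search.web", {}, policy, run))   # ALLOW path executes the tool
```

Note that a production version would default-deny rather than default-allow; the ALLOW fallback here only keeps the toy short.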

Memory & State

  • Building Effective Agents — Covers in-context, external, and procedural memory patterns as harness-level concerns.
  • Letta (MemGPT) — The reference architecture for stateful agents: three-tier memory (core / archival / recall) maps directly to harness state management design. Their agent loop redesign post is the most thorough public analysis of how memory structure shapes the harness. Stars
  • mem0 — Drop-in universal memory layer (YC-backed, AWS Agent SDK's exclusive memory provider) that handles cross-session retention without custom harness-level state management code. Lowest integration cost for production-grade persistent memory. Stars
  • Zep — Purpose-built agent memory store with automatic conversation summarization, entity extraction, and semantic search over session history. Solves long-session context overflow at the memory layer rather than forcing the harness to manage trimming manually. Stars
  • How We Built Agent Builder's Memory System — LangChain's engineering account of a COALA-based three-tier memory system (procedural/semantic/episodic) backed by PostgreSQL but exposed to agents as a virtual filesystem. Key harness decisions: human-in-the-loop approval gates every memory write (blocking prompt-injection via malformed writes), validation errors are fed back to the LLM for self-correction, and AGENTS.md serves as the agent's procedural memory anchor.
  • Building an Agentic Memory System for GitHub Copilot — GitHub's January 15, 2026 write-up is one of the clearest public discussions of deployed cross-agent memory: repository-scoped memories are shared across coding agent, CLI, and code review, but only after just-in-time verification against the current code state. The core harness lesson is that memory quality is mostly a freshness and invalidation problem — stale, branch-specific memories are often more dangerous than having no memory at all.
  • MemArchitect: A Policy-Driven Memory Governance Layer — Proposes a governance layer that decouples memory lifecycle management (decay, conflict resolution, privacy enforcement) from model weights, directly addressing the "zombie memory" problem: outdated facts sitting in the context window that only a harness-level eviction policy — not the model — can remove.
  • Codified Context: Infrastructure for AI Agents in a Complex Codebase — Production-validated architecture (283 sessions, 108k-line codebase) built on three components: a "hot-memory constitution" encoding conventions and multi-agent coordination protocols, 19 domain-specialist agents, and a "cold-memory knowledge base" of 34 on-demand specification documents. The empirical data distinguishes what must live in always-on context from what should be retrieved on demand — the most concrete published guidance for scaling cross-session memory in a large codebase.
  • Facts as First Class Objects: Knowledge Objects for Persistent LLM Memory — Identifies three production failure modes of in-context memory at scale: capacity overflow at ~8,000 facts, 60% fact destruction during compaction, and 54% behavioral drift from constraint erosion across cascaded summarizations. Proposes Knowledge Objects (hash-addressed discrete fact tuples) achieving 100% accuracy at 252× lower cost than in-context storage — the quantitative case for moving persistent facts out of the context window into a structured retrieval layer rather than managing them through prompt engineering.
  • Recoverability Has a Law: The ERR Measure for Tool-Augmented Agents — Formal framework for measuring how well agents recover from tool failures. Defines Expected Recovery Regret (ERR) as a metric for harness design: the cost of recovering from stochastic failures in downstream tasks. Critical for assessing reliability of production harnesses where tool calls occasionally fail but agents must continue functioning.
  • MAGMA: Multi-Graph Agentic Memory Architecture — Represents agent memory across four orthogonal semantic, temporal, causal, and entity graphs, enabling policy-guided retrieval over relational views. Outperforms MemGPT on long-horizon reasoning benchmarks with an 18.5% accuracy improvement. The multi-graph abstraction lets harness engineers compose different retrieval strategies for different task phases — a concrete architecture for memory that scales beyond single-view approaches.
  • GAAMA: Graph Augmented Associative Memory for Agents — Hybrid memory system blending graph traversal with semantic similarity through additive scoring; graph augmentation improves retrieval over embedding-only approaches for long-horizon reasoning. Practical alternative to full multi-graph systems when adding structure to existing vector-based memory is sufficient.
  • Graph-Native Cognitive Memory for AI Agents: Formal Belief Revision Semantics for Versioned Memory Architectures — Formal semantics for versioned memory graphs with belief revision operations, enabling agents to maintain coherent evolving world models through multi-turn reasoning. Addresses the hard problem of inconsistency resolution in long-lived agent memory: when new information contradicts prior beliefs, how should the agent update its knowledge base?
  • Continual learning for AI agents — LangChain's April 2026 framing of agent learning as three distinct layers: model weights, harness behavior, and contextual memory. Essential for designing memory systems that don't just store facts but actually improve agent performance over time through trace-driven harness and context updates.
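
Two recurring ideas in this section — facts stored outside the context window (the Knowledge Objects entry) and just-in-time invalidation of stale memories (the GitHub Copilot entry) — combine naturally. A toy sketch, with invented names, of a hash-addressed fact store where each fact carries a fingerprint of the state it was derived from:

```python
# Toy fact store (hypothetical design): facts live outside the context window,
# and each one records a fingerprint of the source state it was derived from,
# so stale memories are dropped before they ever re-enter the prompt.
import hashlib

def fingerprint(state: str) -> str:
    return hashlib.sha256(state.encode()).hexdigest()[:12]

class FactStore:
    def __init__(self):
        self.facts = {}                        # fact_id -> (text, source_fp)

    def write(self, text: str, source_state: str) -> str:
        fact_id = fingerprint(text)            # content-addressed fact id
        self.facts[fact_id] = (text, fingerprint(source_state))
        return fact_id

    def recall(self, fact_id: str, current_state: str):
        """Just-in-time verification: return the fact only if still fresh."""
        text, fp = self.facts[fact_id]
        return text if fp == fingerprint(current_state) else None

store = FactStore()
fid = store.write("tests live in tests/unit", source_state="layout-v1")
print(store.recall(fid, "layout-v1"))   # fresh: fact is returned
print(store.recall(fid, "layout-v2"))   # stale: None, never reaches the model
```

The point of the sketch is the eviction direction: freshness is checked at recall time against the current state, rather than trusting whatever was true at write time.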

Task Runners & Orchestration

  • Harness Engineering — How task runners fit into the harness: queueing, parallelism, and progress reporting.
  • Building a C Compiler with a Team of Parallel Claudes — Anthropic's account of coordinating 16 Claude instances in parallel on a shared git repo without a central orchestrator: agents claim tasks via files in current_tasks/, git forces collision resolution naturally, and a continuous restart loop spawns fresh sessions that resume where predecessors left off. Key harness lesson: verbose test output pollutes agent context — the feedback loop must emit only a few summary lines, log detail to file.
  • LiteLLM — Unified proxy and SDK that routes to 100+ LLM providers behind a single OpenAI-compatible interface, with a Router handling retry/fallback across deployments, per-project cost and rate-limit tracking, and OTEL callback integrations. The right infrastructure layer when your harness needs provider resilience (automatic failover on 429/500 errors), budget guardrails, or the ability to swap models without touching orchestration code. Stars
  • LangGraph — Graph-based state machine framework for multi-agent harnesses: models supervisor/subagent topologies, error-recovery branches, and checkpoint persistence as first-class primitives. The most widely adopted harness orchestration layer in production. Stars
  • OpenAI Agents SDK — Lightweight multi-agent framework built around handoffs and guardrails; the production successor to Swarm. Complements LangGraph for harnesses where delegation patterns are simpler than full graph orchestration. Stars
  • Google ADK — Google's code-first agent framework with built-in multi-agent orchestration, tool registration, session state, and eval pipeline. Its Runner and AgentTool patterns are the reference implementation for wrapping sub-agents as tools in a larger harness. Stars
  • AutoGen — Microsoft's multi-agent conversation framework with a complete AgentChat layer covering agent loop, tool integration, termination conditions, and human-in-the-loop. The most comprehensive open-source reference for large-scale multi-agent harness design. Stars
  • CrewAI — Dual-layer harness orchestration: Crew handles autonomous agent delegation, Flow provides event-driven deterministic control (branching + shared Pydantic state). The clearest open-source example of mixing autonomous and scripted execution in the same harness. Stars
  • PydanticAI — Type-safe agent framework where tool definitions, parameters, and return values are Pydantic models. Shifts "agent output doesn't match expected structure" from a runtime bug to a type-check failure; its RunContext dependency injection pattern is the reference design for passing session-scoped objects through the harness without global state. Stars
  • LangGraph 2.0 Release — Major 2026 release codifying three years of production orchestration patterns: type-safe streaming, Deploy CLI for managed hosting, and unified agent primitives (Router, Supervisor, Subagent) eliminating the need to hand-roll coordination logic. The persistence layer support for checkpoint-resume recovery is the reference design for resumable multi-day agent tasks. Stars
  • OmniRoute: Multi-Provider LLM Gateway — Intelligent routing across multiple LLM providers with load balancing, intelligent fallbacks, rate limiting, and response caching. Achieves 40–60% token cost reduction through smart model routing (cheap models for simple tasks, capable models for complex reasoning). Essential infrastructure for harnesses operating under strict cost budgets where model selection is a per-turn decision. Stars
  • Scaling Managed Agents: Decoupling the Brain from the Hands — Anthropic's production architecture for separating three stateless components — the "brain" (Claude + harness), "hands" (sandboxes/tools), and "session" (append-only event log) — enabling independent failure and replacement of each. Crash recovery via session replay (wake(sessionId) + getEvents()) and on-demand container provisioning cut p50 TTFT by ~60% and p95 by over 90%. The reference design for treating agent containers as "cattle not pets" in production.
  • Microsoft Agent Framework 1.0 — Production-ready 1.0 release (April 2026) unifying Semantic Kernel and AutoGen into a single framework with graph-based orchestration, middleware pipeline for intercepting every execution stage, and declarative YAML agent definitions. DevUI provides a browser-based debugger for visualizing agent execution, message flows, and tool calls in real time. Multi-provider support (Azure OpenAI, Anthropic, Bedrock, Gemini, Ollama) with A2A and MCP integration makes this the most complete enterprise agent harness framework available for .NET and Python.
  • AgentScope Runtime — Production-ready open-source runtime focused on two pieces many agent frameworks leave underspecified: secure sandbox execution and durable agent serving. The "Agent as API" model, async sandbox types, and built-in state/sandbox lifecycle management make it one of the few 2026 projects tackling runtime concerns directly instead of stopping at orchestration abstractions. Stars
  • Orchestrating Ambient Agents with Temporal — Temporal.io's harness infrastructure for persistent agent workflows with native agentic handshake protocol for secure deadline-aware calendar negotiation between autonomous agents. Brings distributed systems best practices (durability, retry semantics, activity monitoring) to agent orchestration, enabling agents to handle long-running tasks that outlast any single HTTP request or session.
  • Vercel AI SDK — The leading TypeScript toolkit for building AI agents (20M+ monthly downloads, 25+ provider integrations). AI SDK 6 introduced a first-class Agent abstraction with ToolLoopAgent for production-ready tool execution loops, DevTools for local debugging, full MCP support, and type-safe UI streaming. The unified API across OpenAI, Anthropic, Google, and AWS Bedrock makes it the default choice for TypeScript harnesses that need provider portability. Stars
  • Mastra — TypeScript-native agent framework (from the Gatsby team) with 22K+ stars and 300K+ weekly npm downloads. Connects to 40+ providers through one standard interface, with built-in workflows, RAG pipelines, and agent orchestration. The @mastra/deployer handles serverless deployment, and the eval system supports LLM-as-judge out of the box. The strongest alternative to Vercel AI SDK for teams that need more opinionated agent primitives. Stars
  • The next evolution of the Agents SDK — OpenAI's April 2026 update adding native sandbox execution, configurable memory, and sandbox-aware orchestration to the Agents SDK. The shift toward a "model-native harness" that aligns execution patterns with how frontier models actually perform best is a reference design for SDK-level harness evolution.
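
The orchestrator-free coordination pattern from the parallel-Claudes entry — workers claim tasks by creating files in a shared directory — rests on one primitive: an atomic create-if-absent. A sketch under assumed names (the directory layout and claim format are illustrative, not Anthropic's actual implementation):

```python
# Sketch of file-based task claiming for parallel agents (names illustrative).
# O_CREAT | O_EXCL makes the claim atomic: if two workers race for the same
# task, exactly one open() succeeds and the loser simply moves on.
import os
import tempfile

def claim(task_id: str, claims_dir: str) -> bool:
    path = os.path.join(claims_dir, f"{task_id}.claim")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, f"claimed by pid {os.getpid()}\n".encode())
        os.close(fd)
        return True                       # this worker now owns the task
    except FileExistsError:
        return False                      # another worker got there first

claims_dir = tempfile.mkdtemp(prefix="current_tasks_")
print(claim("fix-parser", claims_dir))    # first claim wins
print(claim("fix-parser", claims_dir))    # second claim is rejected
```

In the git-backed variant described above, committing the claim file plays the same role: a push conflict on the claim is the collision-resolution mechanism, so no central scheduler is needed.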

Verification & CI Integration

  • Demystifying Evals for AI Agents — How to build verification into the harness loop, not just as a post-hoc eval.
  • promptfoo — YAML-driven LLM testing framework with LLM-as-judge, assertion DSL, and native CI integration. The most practical tool for adding agent output regression tests to a PR pipeline without writing a test harness from scratch. Stars
  • AgentBench — Multi-environment agent benchmark (OS, DB, web, code) with a structured eval pipeline. Worth studying for its environment isolation design and task definition format when building custom eval environments for your harness. Stars
  • Testing Agent Skills Systematically with Evals — OpenAI's framework for skill regression testing: four eval dimensions (outcome, process, style, efficiency goals), JSONL trace capture for deterministic checks (command sequences, token budgets, repo cleanliness), then rubric-based grading only where deterministic checks don't suffice. The layering principle — add expensive LLM-as-judge checks only where they reduce meaningful risk — is the most actionable published guide to CI pipelines for agent skills that don't collapse under eval cost.
  • Agent Evaluation Readiness Checklist — A 33-item checklist covering the full evaluation lifecycle: error taxonomy, three-level granularity (single-step → trace → multi-turn thread), grader specialization, and CI integration. Key insight: capability evals (low pass rate, improvement target) and regression evals (near-100%, protection target) must be separated — mixing them produces wrong prioritization decisions.
  • Evaluating Skills — LangChain's methodology for benchmarking agent skills in Docker-sandboxed environments. Key empirical findings: Claude Code achieved 82% task completion with curated skills vs. 9% without, and consolidating to ≤12 skills improved accuracy over sprawling skill sets. The baseline-vs-skills comparison design with bugfix tasks and clear outcome metrics is the template for systematic skill coverage testing.
  • AgentAssay: Token-Efficient Regression Testing for Non-Deterministic Agent Workflows — Addresses agent CI's core problem: binary pass/fail is useless for non-deterministic workflows. Behavioral fingerprinting detects 86% of regressions vs. 0% with binary testing; stochastic PASS/FAIL/INCONCLUSIVE verdicts grounded in hypothesis testing cut token costs 78%. Trace-first offline mode runs regression checks against production traces at zero additional inference cost.
  • Agentic Harness for Real-World Compilers: A Case Study in Specialized Tool Design — Demonstrates how harness specialization (llvm-autofix) for a narrow domain (compiler bug fixes) achieves better results than general-purpose coding agents. The tool design patterns — exposing compiler error messages directly, bounding search depth by compilation cost — are transferable to any domain where cost-based pruning and specialized feedback loops make the difference between failure and success.
  • Eval-Driven Development: Build and Evaluate Reliable AI Agents — Red Hat's eight-stage evaluation maturity progression from manual CLI testing to cost-aware continuous monitoring (March 2026). Uses DeepEval with 15 custom ConversationalGEval metrics and LLM-as-judge; key finding: evaluator model capability matters significantly — llama-3-3-70b caught all known failures while smaller models missed 4–5 cases. The $0.64/run cost estimate and self-hosted evaluator pattern on OpenShift AI provide concrete guidance for teams building eval harnesses under real budget constraints.
  • Agent Evaluation Framework 2026: Metrics, Rubrics & Benchmarks — Comprehensive framework combining multi-environment baselines (AgentBench), domain-specific benchmarks (Terminal Bench 2.0, WebArena, SWE-bench Verified), and industry standards (NIST AI Agent Standards Initiative, February 2026). Provides reference metrics and rubrics for evaluating coding agents, chatbots, and specialized agents across dimensions (correctness, efficiency, safety). Essential for building eval harnesses that measure across standardized dimensions.
  • The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems — Systematic index analyzing deployed agent safety documentation, guardrails, and third-party testing across 30 production systems; identifies critical gaps in agentic safety disclosure and documentation. Useful as a checklist for production harness safeguards before deployment.
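
The layering principle that recurs across these entries — run cheap deterministic checks on captured traces first, and invoke an expensive LLM-as-judge grader only where those checks cannot decide — can be sketched as follows. Trace fields, the required-first-command rule, and the stubbed judge are all illustrative assumptions:

```python
# Sketch of layered agent evals (illustrative trace schema): objective
# properties (token budget, command sequence, repo cleanliness) are settled
# deterministically; only undecided traces escalate to the costly judge.
def deterministic_checks(trace: dict):
    """Return 'PASS'/'FAIL' if decidable cheaply, None to escalate."""
    if trace["tokens"] > trace["token_budget"]:
        return "FAIL"                       # budget overrun is objective
    if trace["commands"][:1] != ["git status"]:
        return "FAIL"                       # required first command missing
    if trace["repo_dirty"]:
        return "FAIL"                       # agent left uncommitted changes
    return None                             # outcome/style needs a judge

def grade(trace: dict, llm_judge) -> str:
    verdict = deterministic_checks(trace)
    return verdict if verdict is not None else llm_judge(trace)

trace = {"tokens": 900, "token_budget": 1000,
         "commands": ["git status", "pytest"], "repo_dirty": False}
stub_judge = lambda t: "PASS"               # expensive call, reached rarely
print(grade(trace, stub_judge))
```

This also gives a natural place to separate regression evals (deterministic, expected near-100%) from capability evals (judge-graded, expected to fail often), per the checklist entry above.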

Observability & Tracing

  • OpenLLMetry — OpenTelemetry-based instrumentation for LLM calls and agent steps: adds trace spans to every inference and tool call without modifying business logic. The cleanest way to bring the existing OTEL ecosystem (Grafana, Datadog, Jaeger) to a harness. Stars
  • Arize Phoenix — Self-hostable trace UI and eval runtime for agent workflows. Lets harness engineers audit and replay every reasoning step and tool call offline, without sending data to a third-party cloud. Stars
  • Langfuse — The most widely adopted self-hostable LLM observability platform: traces every agent step, manages prompt versions, and runs evals in one tool. Preferred over cloud-only alternatives when data residency or cost control is a constraint. Stars
  • Weights & Biases Weave — W&B's tracing and eval layer purpose-built for agent workflows: automatic call graph capture, dataset versioning, and LLM-as-judge evals that integrate directly with the wandb experiment tracking ecosystem. Stars
  • OTel GenAI Semantic Conventions — OpenTelemetry's standard attribute names for GenAI spans (gen_ai.system, gen_ai.request.model, etc.). The naming baseline that makes harness traces portable across any OTEL-compatible backend.
  • Pydantic Logfire — AI observability platform from the Pydantic team with a unique angle: all trace data is SQL-queryable (PostgreSQL-compatible), so coding agents can query production observability data directly via the Logfire MCP server. Full-stack OTEL tracing covers both the AI layer and backend — letting you determine whether a failure is in agent logic or infrastructure. The natural observability choice for PydanticAI-based harnesses. Stars
  • Helicone — Open-source LLM observability proxy (YC W23) with the largest open-source pricing database (300+ models). One-line proxy integration provides cost tracking, token monitoring, session tracing, and prompt versioning across providers. The AI Gateway component handles request routing and caching with zero-code changes. SOC 2 and GDPR compliant, self-hostable via Docker — the natural complement to execution-tracing tools when cost attribution is a primary concern. Stars
  • OpenObserve: Unified Observability for LLM Agents — 2026 platform unifying LLM tracing with infrastructure logs and metrics. Enables harness engineers to correlate agent decisions with system-level events (network delays, GPU memory pressure) that explain agent failures, going beyond isolated LLM call traces. Stars
  • Braintrust — Evaluation-first agent observability platform ($80M Series B, Feb 2026) with exhaustive auto-tracing that captures every LLM call, tool invocation, and retrieval step as nested span hierarchies. Brainstore, its purpose-built data store, enables full-trace search without sampling — critical for debugging multi-turn agent failures where the root cause spans multiple steps. Used by Stripe, Notion, Dropbox, and Perplexity.
  • Building Observable AI Agents: Temporal Now Integrates with Braintrust — Combines Temporal's durable execution (automatic retries, state persistence, event history replay) with Braintrust's LLM tracing so every Workflow and Activity becomes a Braintrust span and every LLM call is traced with full context. Demonstrates the pattern with a deep research agent where failed synthesis steps retry without re-executing prior searches, and prompt updates propagate via braintrust.load_prompt() without code deployment. The most practical published integration of workflow durability and LLM observability for production agent debugging.
  • Introducing BigQuery Agent Analytics — Google Cloud's 2026 launch treats agent traces, tool calls, sessions, and outcomes as analytical data rather than dashboard exhaust. The important harness idea is that observability becomes queryable infrastructure: once telemetry lands in BigQuery, teams can build evaluators, regressions, and conversational debugging directly on top of production traces instead of maintaining a separate analysis stack.
  • Distributed Tracing for Agentic Workflows with OpenTelemetry — Red Hat's April 6, 2026 guide is one of the few concrete references that walks through context propagation across routing agents, specialist agents, MCP servers, and external systems using standard tracing infrastructure. It belongs here because it treats agent observability as a distributed-systems problem, which is exactly how these harnesses fail in production.
  • Red-Teaming Anthropic's Internal Agent Monitoring Systems — METR — METR's three-week adversarial audit of Anthropic's internal agent monitoring and security systems (described in the Opus 4.6 Sabotage Risk Report). Discovered several novel vulnerabilities, some since patched. The most concrete published account of what it takes to stress-test agent monitoring infrastructure — essential reading before trusting any monitoring system as a safety layer.
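
The common shape behind these tools is a span per agent step, tagged with standard attribute names. A toy tracer below illustrates the idea; the gen_ai.* attribute keys come from the OTel GenAI semantic conventions, while the tracer itself is a self-contained stand-in, not the OpenTelemetry SDK:

```python
# Toy span recorder (stand-in for a real OTel SDK): every inference and tool
# call becomes a timed span carrying GenAI semantic-convention attribute names,
# so the trace is portable to any OTEL-compatible backend.
import time
from contextlib import contextmanager

SPANS = []

@contextmanager
def span(name: str, **attributes):
    record = {"name": name, "attributes": attributes, "start": time.time()}
    SPANS.append(record)
    try:
        yield record
    finally:
        record["duration_s"] = time.time() - record["start"]

with span("agent.turn", **{"gen_ai.system": "example-harness"}):
    with span("gen_ai.chat", **{"gen_ai.request.model": "some-model"}):
        pass                                 # model call would go here
    with span("tool.call", **{"tool.name": "search"}):
        pass                                 # tool execution would go here

print([s["name"] for s in SPANS])            # every step became a span
```

With a real SDK the only change is swapping the decorator for tracer.start_as_current_span; the naming discipline, not the recorder, is what makes traces queryable across backends.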

Debugging & Developer Experience

  • AgentOps — Open-source agent engineering platform (YC W24) with session replay, cost tracking, and failure detection across 10+ frameworks including CrewAI, LangGraph, and OpenAI Agents SDK. The step-by-step execution graph and cross-session metrics make it the most practical debugging layer for multi-agent systems in production. Stars
  • Syncause/debug-skill — April 2026 agent debugging skill that stops guesswork with runtime evidence. Uses background tracing (Runtime Facts) to capture the exact execution path leading to failures, then constrains the agent to cite specific data points (stack traces, variable snapshots) before proposing fixes. Moves agent debugging from "patch and pray" to evidence-based repair with reviewable results. Stars
  • AgentTrace: Causal Graph Tracing for Root Cause Analysis in Multi-Agent Systems — March 2026 framework that localizes root causes in multi-agent execution traces using causal graph analysis rather than LLM inference. Processes traces in 0.12 seconds (69× faster than LLM-based analysis) with 93.6–95.8% accuracy across 550 synthetic failure scenarios. Distinguishes root causes from downstream symptom propagation — the key capability missing from most trace-inspection debugging workflows.
  • TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code — February 2026 ICSE paper introducing a collaborative multi-agent debugging loop: Instrumentation Agent injects diagnostic probes, Analysis Agent performs causal trace diagnosis with a Historical Lesson Learning Mechanism (HLLM) that distills insights from prior failures, and Repair Agent executes validated fixes with rollback on regression. Achieves up to 34.43% relative improvement in Pass@1 over baselines.
  • AgentRx: Systematic Debugging for AI Agents — Framework for automated root-cause analysis of agent failures: trajectory normalization, constraint synthesis from tool schemas, and constraint-guided evaluation. Achieves 23.6% better failure localization than existing approaches with a 115-trajectory annotated benchmark. Shifts agent debugging from manual log inspection to systematic constraint-based diagnosis — a reference design for harness-level observability that surfaces why an agent failed, not just that it failed. Stars
  • Debugging Deep Agents with LangSmith — Addresses the core problem of debugging agents that run for minutes, span hundreds of steps, and produce massive traces no human can manually scan. Introduces Polly (an AI assistant that analyzes traces to surface root causes) and langsmith-fetch (CLI for piping trace data to coding agents). Key insight: debugging deep agents requires AI-assisted trace analysis — the volume of data these systems produce exceeds human capacity.
  • Where LLM Agents Fail and How They Can Learn From Failures (AgentDebug) — ICLR 2026 paper introducing the Agent Error Taxonomy — a modular classification covering memory, reflection, planning, action, and system-level failures. The AgentDebug framework isolates root-cause failures and provides corrective feedback, achieving +24% higher all-correct accuracy. The Agent Error Benchmark (annotated trajectories from ALFWorld, GAIA, WebShop) is the first systematic failure dataset for agent debugging.
  • AgentPrism — Open-source React component library (Evil Martians) that transforms OpenTelemetry trace data into interactive visualizations: tree view, timeline/Gantt view, sequence diagrams, and detail panels. Framework-agnostic — works with any OTEL-compatible agent. Fills the gap between raw OTEL spans and human-comprehensible agent debugging UIs. Stars
  • Characterizing Faults in Agentic AI — March 2026 empirical study mining 375 GitHub issues across real-world agent systems (AutoGen, CrewAI, OpenAI Agents SDK, LangChain, CAMEL, DB-GPT) to build the first grounded taxonomy of agent-specific faults: initialization failures, role deviation, memory/state deficiencies, orchestration failures, and tool integration errors. Provides architecture-level fault classification that harness engineers can use as a systematic debugging checklist.
  • More Visibility into Copilot Coding Agent Sessions — GitHub's March 19, 2026 changelog is short but materially useful: setup-step logs, collapsed subagent traces, and clearer session-stage visibility are exactly the kind of DX improvements that make long-running agent failures debuggable in practice. It is a concrete reminder that trace readability is part of the harness, not an afterthought layered on top.
  • AgentStepper: Interactive Debugging of Software Development Agents — February 2026 interactive debugger for agent execution trajectories that organizes raw logs into structured, side-by-side conversations (agent↔LLM and agent↔tools). Enables step-through execution, breakpoint manipulation, and mid-trajectory inspection. Developer study shows frustration scores drop from 5.4 to 2.4 (NASA TLX) and comprehension accuracy improves significantly — the first concrete evidence that interactive debugging primitives materially reduce the cognitive load of understanding multi-turn agent behavior.
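
The root-cause-versus-symptom distinction that AgentTrace draws can be illustrated with a minimal causal-graph pass (a toy sketch of the idea, not the paper's algorithm; the event format is invented):

```python
# Toy root-cause localization over a causal trace: a failing step is a
# *root cause* if its causal parent succeeded (or it has no parent);
# otherwise it is downstream symptom propagation.
def root_causes(events):
    by_id = {e["id"]: e for e in events}
    causes = []
    for e in events:
        if e["ok"]:
            continue
        parent = by_id.get(e.get("parent"))
        if parent is None or parent["ok"]:
            causes.append(e["id"])
    return causes

trace = [
    {"id": "plan",   "parent": None,     "ok": True},
    {"id": "search", "parent": "plan",   "ok": False},  # bad tool output
    {"id": "write",  "parent": "search", "ok": False},  # symptom
    {"id": "verify", "parent": "write",  "ok": False},  # symptom
]
print(root_causes(trace))  # → ['search']
```

Even this crude pass shows why blaming the last failing step is usually wrong: three of the four failures here are propagation, not cause.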

Human-in-the-Loop

  • aws-samples/sample-human-in-the-loop-patterns — March 2026 AWS reference implementation demonstrating four distinct HITL patterns for sensitive agent tool calls: Hook System (centralized blanket policy), Tool Context (per-tool fine-grained), Step Functions (async third-party approval via SNS), and MCP Elicitation (protocol-native real-time interactive approval). The most concrete production guide for choosing the right HITL architecture based on approval latency, trust boundaries, and integration constraints. Stars
  • Dify Human-in-the-Loop Node — February 2026 release making human oversight a native workflow primitive: suspend execution at critical decision points, expose review-and-edit UI mid-flow, and route subsequent execution based on human action (approve/reject/escalate). Demonstrates how HITL transitions from bolt-on approval gates to first-class execution-graph nodes with stateful pause/resume backed by Celery workers and Redis Pub/Sub.
  • HITL Protocol — Open standard (v0.8, February 2026) for human decisions in agent workflows: HTTP 202 + review URL pattern connecting services, agents, and humans across any messaging channel. No SDK required — ~15 lines of code for agents, reference implementations in Express/Hono/Next.js/FastAPI for services, and 13 end-to-end flows including escalation and hybrid approval.
  • LangGraph — Human-in-the-Loop Concepts — Systematic treatment of interrupt, breakpoint, and approve patterns: how to pause an agent mid-loop, persist state, and resume after human review. Directly addresses the harness engineering challenge of inserting human gates into long-running workflows.
  • AutoGen — Human-in-the-Loop — Explains human_input_mode (NEVER / TERMINATE / ALWAYS) and the UserProxyAgent as an approval gate. The most concrete implementation reference for adding human review nodes to a multi-agent conversation harness.
  • Claude Agent SDK — Handle Approvals and User Input — The most complete implementation reference for HITL mechanics: canUseTool callback pauses execution at every tool request with allow/deny/approve-with-changes/suggest-alternative response shapes; AskUserQuestion surfaces structured clarifications mid-task; streaming input enables mid-execution redirects. The "approve with changes" pattern — modifying tool input before execution — is the reference design for safe-by-default harnesses that don't simply block or permit.
  • HiL-Bench: Do Agents Know When to Ask for Help? — April 2026 benchmark that transforms well-specified tasks into judgment challenges by injecting 3–5 realistic blockers (missing critical information) and giving agents an ask_human() tool. Agents from top models achieve ~90% pass@3 with full information, but performance drops sharply when blockers are present — the first systematic measure of when agents should escalate to humans rather than proceeding with insufficient context.
  • Human Judgment in the Agent Improvement Loop — LangChain's April 9, 2026 guide closes an important gap that most HITL write-ups skip: human input is not just an approval gate at execution time; it is also supervision for improving prompts, tools, memory, and evaluators over time. Useful because it treats expert review as a structured data source for harness evolution rather than a one-off manual checkpoint.
  • Humans and Agents in Software Engineering Loops — Martin Fowler defines three human-involvement postures — humans outside, in, or on the agent loop — and argues that "humans on the loop" (maintaining the harness rather than reviewing individual outputs) is the only approach that scales with agent throughput. The "agentic flywheel" section — where agents are directed to evaluate results and recommend harness improvements — is the clearest articulation of how HITL evolves from a gate into a feedback mechanism.
  • Measuring AI Agent Autonomy in Practice — Anthropic's February 2026 empirical study of millions of real-world Claude Code interactions. Key finding: experienced users shift from per-action approval (20% auto-approve when new) to intervention-only oversight (40% auto-approve at 750+ sessions), and agent-initiated clarification stops grow faster than human interruptions as task complexity increases. The most data-grounded reference for designing adaptive permission models that scale with user trust.
  • AutoResearchClaw HITL Co-Pilot — April 2026 open-source human-in-the-loop system with six intervention modes (full-auto, gate-only, checkpoint, step-by-step, co-pilot, custom), SmartPause confidence-driven dynamic suspension, and Intervention Learning from human corrections. The cost-guardrail system — aborting runs that exceed budget thresholds — makes it a practical reference for production HITL where human time is as constrained as agent compute.
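
The "approve with changes" pattern from the Claude Agent SDK entry generalizes to any harness. A minimal sketch, assuming a per-call approval callback (the `Decision` shape and all names here are invented, not the SDK's API):

```python
from dataclasses import dataclass
from typing import Optional

# Generic approval gate: every tool request passes through a human (or
# policy) callback that can allow, deny, or modify the input before it runs.
@dataclass
class Decision:
    behavior: str                         # "allow" | "deny" | "modify"
    updated_input: Optional[dict] = None  # used when behavior == "modify"
    reason: str = ""

def run_tool(tool, tool_input, approve):
    decision = approve(tool.__name__, tool_input)
    if decision.behavior == "deny":
        return {"error": f"denied: {decision.reason}"}
    if decision.behavior == "modify":
        tool_input = decision.updated_input  # human-edited input
    return {"result": tool(**tool_input)}

def delete_rows(table, limit):
    return f"deleted up to {limit} rows from {table}"

# Policy: cap destructive operations instead of blocking them outright.
def approve(name, tool_input):
    if name == "delete_rows" and tool_input["limit"] > 100:
        safer = dict(tool_input, limit=100)
        return Decision("modify", updated_input=safer, reason="limit capped")
    return Decision("allow")

print(run_tool(delete_rows, {"table": "logs", "limit": 10_000}, approve))
# → {'result': 'deleted up to 100 rows from logs'}
```

The "modify" branch is the interesting one: the harness neither blocks nor blindly permits, it executes a human-adjusted version of the request.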

Reference Implementations

Real repositories worth studying — each with a note on why it's worth your time.

Tutorials & Educational

  • ML6 x AISO Agent Workshop — February 2026 hands-on workshop building an AI agent from scratch with Google's Agent Development Kit (ADK) in 3 hours. Five milestones with a built-in benchmark that tracks progress from ~19% (base agent) to ~81% (with web search, PDF reader, and calculator tools). The clearest public tutorial for understanding how tool access directly translates to capability gains. Stars
  • mastra-ai/workshop-mastracode — February 2026 workshop by Mastra's founders dissecting every layer of an open-source AI coding agent: stateful/resumable harness, dynamic prompt composition, workspace sandboxing, memory compaction, HITL steering, event protocols, and cost tracking. The 11-topic curriculum is the most complete public walkthrough of production coding-agent harness internals. Stars
  • Building Governed AI Agents — OpenAI's February 2026 cookbook building a complete multi-agent governance system from scratch: policy-as-code guardrails, OpenAI Traces for full observability, eval-driven design, and a distributable governance package. The most concrete first-party tutorial for making governance part of core infrastructure from day one.
  • anthropics/claude-cookbooks — Anthropic's official notebook collection covering orchestrator-worker patterns, parallel tool calling, programmatic tool calling (PTC), context compaction, and Agent SDK examples. The patterns/agents/ directory is the reference implementation of every orchestration pattern described in Building Effective Agents. Stars
  • huggingface/smolagents — HuggingFace's deliberately minimal agent library (~1,000 lines of core code): the entire harness — tool validation, memory, monitoring, sandbox isolation (E2B, Docker, Pyodide) — is readable in an afternoon. The code-agent pattern (model writes Python that calls tools, eliminating JSON round-trips) is a concrete alternative loop design worth understanding. Stars
  • shareAI-lab/learn-claude-code — Step-by-step deconstruction of Claude Code as an agent harness (s01–s12). Best resource for understanding how agent loop, tool use, skills, context compaction, and task management compose in practice. Stars
  • AutoJunjie/awesome-agent-harness — Curated list organized into Full Lifecycle Platforms, Task Runners, Agent Runtimes, Coding Agents. Close to this list's scope; good complementary reference. Stars
  • Skill Issue: Harness Engineering for Coding Agents — Practitioners' guide covering all harness configuration points for coding agents: system prompts, MCP tool selection, skills for progressive disclosure, sub-agents as context firewalls, hooks for deterministic control, and back-pressure verification. The central argument — that most agent failures are configuration problems, not model limitations — and the heuristic to minimize tool exposure (too many MCP servers bloat context) make this the most actionable single-page synthesis of what harness engineering means in practice for a coding agent.
  • How to orchestrate agents using mission control — GitHub's December 2025 practical guide on coordinating multiple coding agents with mission control: parallel vs. sequential execution, when to intervene, and how to review agent work productively. Shows the shift from single-agent prompts to multi-agent choreography and the harness decisions required to keep parallel agents from interfering with each other.
  • awslabs/agentcore-samples — AWS's official sample repo is one of the most complete public walkthroughs of what "productionizing" an agent platform actually means: runtime, gateway, memory, identity, observability, IaC, and blueprint apps all live in one place. Worth including because it covers the harness infrastructure layer most sample repos skip and does so across multiple frameworks rather than baking in a single orchestration stack.
  • Engineering Trustworthy Multi-Agent Systems — IEEE CAI 2026 tutorial (December 2025) providing a research-based practical guide for designing enterprise-ready multi-agent systems. Covers agentic patterns (ReACT, Reflection, CoT), emerging protocols (MCP, A2A), multi-layer memory structures, observability and online/offline evaluation techniques, and trustworthy AI guardrails. The most comprehensive conference tutorial on production multi-agent harness design published in 2026. Stars
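
The code-agent pattern noted for smolagents above can be sketched in a few lines (a toy with invented tool names; real implementations sandbox the `exec` via E2B, Docker, or Pyodide — running untrusted model code unsandboxed is unsafe):

```python
# Code-agent pattern: instead of emitting JSON tool calls, the model emits
# Python that calls registered tools directly, reusing intermediate values
# within a single turn.
def add(a, b):
    return a + b

def word_count(text):
    return len(text.split())

TOOLS = {"add": add, "word_count": word_count}

def run_code_action(model_code):
    namespace = dict(TOOLS)                        # only tools are in scope
    exec(model_code, {"__builtins__": {}}, namespace)
    return namespace.get("result")                 # convention: answer in `result`

# One "model turn": two tool calls composed in one code block,
# with no JSON round-trip between them.
action = "result = add(word_count('agent harness engineering'), 39)"
print(run_code_action(action))  # → 42
```

The point of the pattern is the composition: chaining tool outputs in code avoids a model round-trip per call, at the cost of needing real sandbox isolation.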

Generators & Meta-Harnesses

  • everything-claude-code — Anthropic Hackathon Winner (140K+ stars). An agent-harness performance optimization system: skills, instincts, memory optimization, continuous learning, security scanning, and research-first development. Production-ready agents, skills, hooks, rules, and MCP configurations evolved over 10+ months of intensive daily use building real products. Works across Claude Code, Codex, Cursor, OpenCode, and Gemini. Stars
  • Claude Agent SDK — Anthropic's official SDK that exposes Claude Code's entire harness as a programmable API: built-in tool execution loop, PreToolUse/PostToolUse hooks for interception, subagent definitions, allowedTools permission control, and session resumption. The highest-leverage starting point for building a production harness — you inherit the entire tool execution layer rather than implementing it. Stars
  • revfactory/harness — A meta-skill that generates domain-specific agent teams and the skills they use. Good example of harness-as-code, where the harness itself is produced by an agent. Stars
  • raphaelchristi/harness-evolver — March 2026 Claude Code plugin that autonomously evolves LLM agent harnesses using multi-agent proposers in isolated git worktrees, LangSmith-backed evaluation, and regression guards. Iterates on prompts, routing, retrieval, and orchestration code based on full-trace counterfactual diagnosis. The most practical published implementation of the Meta-Harness outer-loop optimization paradigm. Stars
  • neosigmaai/auto-harness — April 2026 open-source self-improving agentic system: bring your own coding agent, automatically mine failures from benchmark runs, optimize the harness through iterative edits, and gate changes against regressions. Supports Terminal-Bench 2.0 and tau-bench with Harbor and Docker evaluation backends. The PROGRAM.md pattern — human writes the optimization directive, agent executes the harness engineering loop — is the most accessible entry point for teams wanting meta-harness optimization without building infrastructure from scratch. Stars
  • Meta-Harness: End-to-End Optimization of Model Harnesses — Treats the entire harness (system prompt, tool definitions, context management, completion logic) as a joint optimization target rather than hand-tuning each piece. The key insight: give the proposer agent filesystem access to all prior harness candidates, scores, and execution traces — 10M-token diagnostic context vs. the 26K in prior work — so it can trace failures back to specific harness decisions.
  • HyperAgents: Self-Improving AI Systems — Meta's framework integrating task-solving and meta-level improvement into a unified, editable program with metacognitive self-modification. Improved paper-review tasks from 0.0 to 0.710, transferred to Olympiad math grading at 0.630 improvement@50 score. Shows how agents can be designed to modify their own harness (prompts, tools, strategy) based on execution history — the ultimate meta-harness where the agent itself evolves the scaffolding.
  • AutoAgent — Open-source library (April 2026) that automates the harness engineering loop itself: give it a task and a benchmark, and it iterates overnight on system prompts, tool configurations, agent orchestration, and routing — keeping or discarding each change based on score. In a 24-hour run, hit #1 on SpreadsheetBench (96.5%) and the top GPT-5 score on TerminalBench (55.1%), beating every hand-engineered entry. The program.md separation of concerns (human writes the directive, agent engineers the harness) is the most practical meta-harness pattern published so far. Stars
  • metaharness — Open-source Python library (April 2026) that implements an outer optimization loop around executable harnesses for coding agents. Inspired by the Meta-Harness paper, it treats AGENTS.md, setup scripts, validation logic, and test flows as optimizable artifacts rather than static configs — with filesystem-backed run stores, environment snapshots, and scoped write enforcement. The most practical reference for teams who want to improve harness code rather than just prompts. Stars
  • meta-agent — Lightweight continual harness optimizer (April 2026) built on the Claude Agent SDK. Runs an outer loop that reads task traces, rewrites harness configs, and re-evaluates — achieving 67% → 87% on tau-bench with no labeled training data. Demonstrates that even small, focused meta-harness loops can yield large reliability gains when harness configs are treated as learnable parameters.
  • stanford-iris-lab/meta-harness — April 2026 official implementation of the Meta-Harness paper from Stanford's IRIS Lab. Provides the framework and two reference experiments for end-to-end harness optimization via filesystem-backed search loops where a coding agent proposes, evaluates, and refines harness artifacts. The cleaned-up codebase is the definitive starting point for researchers reproducing or extending meta-harness optimization. Stars
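
The outer loop these projects share can be reduced to hill climbing over a harness config with a regression guard (a minimal illustration; the config field and the scorer are invented stand-ins for a real harness artifact and benchmark):

```python
import random

# Meta-harness outer loop sketch: treat the harness config as a learnable
# parameter, propose a mutation, evaluate on a benchmark, and keep the
# change only if it clears the regression guard.
def evaluate(config, tasks):
    # Stand-in benchmark: a task passes if the harness exposes enough tools.
    return sum(1 for t in tasks if t <= config["max_tools"]) / len(tasks)

def propose(config, rng):
    mutated = dict(config)
    mutated["max_tools"] = max(1, config["max_tools"] + rng.choice([-1, 1]))
    return mutated

def optimize(config, tasks, steps=50, seed=0):
    rng = random.Random(seed)
    best_score = evaluate(config, tasks)
    for _ in range(steps):
        candidate = propose(config, rng)
        score = evaluate(candidate, tasks)
        if score >= best_score:   # regression guard: never keep a worse harness
            config, best_score = candidate, score
    return config, best_score

tasks = [3, 5, 5, 8, 12]          # tools each synthetic task needs
config, score = optimize({"max_tools": 2}, tasks)
print(config, score)
```

Real systems replace the mutation step with an agent proposing edits to prompts, tool configs, or orchestration code, and the scorer with a benchmark run — but the accept-only-on-non-regression skeleton is the same.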

Demo Harnesses

  • Anthropic Computer Use Demo — Anthropic's reference harness for the screenshot-action loop: defines the screenshot, bash, and text_editor tool interface that makes desktop/browser control work. Essential reading before building any harness where the agent's primary sensory input is a rendered screen rather than structured API responses. Stars
  • coleam00/your-claude-engineer — Agent harness with Slack, GitHub, and Linear integrations. Useful reference for how real-world tool wiring works inside a harness. Stars
  • OpenHands — The most architecturally complete open-source coding agent: Runtime/Sandbox isolation, EventStream message bus, and Agent Controller are a three-layer harness design worth studying for production deployments. Stars
  • browser-use — Minimal browser-automation agent harness with clean separation of tool registration, DOM state injection, action loop, and error recovery. Small codebase, clear structure — the best "minimal viable harness" reference for understanding core loop mechanics. Stars
  • SWE-agent — Coding agent whose Agent-Computer Interface (ACI) — purpose-built file viewer, search, and editor tools with explicit state constraints and error feedback — is the reference design for adapting a tool interface to a specific task domain rather than using generic bash. Stars
  • Aider — AI pair-programmer harness with an Architect mode that splits planning (one LLM) from coding (another), and git-aware tooling that uses version control as the undo mechanism instead of custom state rollback. The best reference for multi-file editing tool design and planner/coder layer separation. Stars
  • Open SWE: An Open-Source Framework for Internal Coding Agents — A composable coding-agent harness built on Deep Agents, synthesizing design patterns from Stripe, Ramp, and Coinbase production deployments. Key decisions: curated ~15-tool limit enforced at harness design time, one isolated sandbox (Modal/Daytona/Runloop/LangSmith) per task, AGENTS.md for injecting repo-wide conventions, and Linear/Slack task context in the system prompt. The most recent published reference for what a production internal coding agent harness looks like.
  • Live-SWE-agent: Autonomous Software Agent with Self-Evolving Harness — Production harness achieving 77.4% solve rate on SWE-bench Verified through continuous harness evolution — the scaffold adapts from failure signals rather than requiring manual retuning per task class. Demonstrates the architectural pattern where the harness itself is a learnable component, not just a static container for a fixed agent.
  • Pipecat: Python Framework for Real-Time Voice Agent Pipelines — Handles frame management, streaming media coordination, and pipeline orchestration between ASR/LLM/TTS services for sub-800ms Total Turn-Around Time voice interactions. The missing harness primitive for voice agents: manages backpressure, handles frame queueing, and exposes a simple async interface for real-time constraints. Critical infrastructure for building responsive voice-first agents. Stars
  • The Virtual Biotech: Multi-Agent AI Framework for Drug Discovery — Orchestrated team of domain-specialized scientist agents that autonomously analyzed 55,984 clinical trials and discovered cell-type-specific drug targets 40% more likely to succeed in Phase I→II transitions. Demonstrates specialized harness design for scientific workflows where formal reasoning, multi-agent coordination, and domain-specific tool suites are load-bearing constraints. Shows the upper bound of what structured multi-agent harnesses can achieve in high-stakes domains.
  • Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety — NVIDIA's Nemotron 3 family (Super for long-context reasoning, Content Safety for multimodal moderation, VoiceChat for real-time speech) designed for scalable agentic AI with enterprise-grade multimodal understanding. The integration of safety models, vision models, and voice models into a single harness stack is the reference architecture for production multimodal agents.
  • AIO Sandbox — All-in-one agent sandbox combining browser, shell, filesystem, MCP servers, and VSCode Server in a single Docker container. Native MCP support exposes sandbox capabilities to LLMs via the standard protocol, and files downloaded in the browser are instantly accessible in terminal and VSCode. Optimized startup (4–8s depending on config) with Claude Skills mounting support. The fastest path to a fully-featured agent development environment. Stars
  • GitHub Agentic Workflows — GitHub's February 13, 2026 technical preview is unusually valuable because the implementation is fully open source (gh-aw) and shows how natural-language workflow generation, approval handling, and GitHub-native execution fit together in one harness. It belongs here as a reference implementation for teams that want to study agentized CI/CD rather than just chat-centric coding agents.
  • langchain-ai/deepagents — LangChain's batteries-included agent harness (released April 2026) with built-in planning, filesystem tools, shell access, sub-agents, and auto-summarization. The clearest open-source demonstration of how a general-purpose coding agent harness can be made ready-to-run out of the box while remaining fully extensible. Stars
  • HKUDS/OpenHarness — A compact, inspectable open-source agent harness from HKUDS (April 2026) featuring a built-in personal agent (ohmo), auto-compaction with session preservation, MCP HTTP transport, and multimodal gateway support. Excellent reference for understanding how a small, modular harness can support multi-day sessions without manual context management. Stars
  • OpenCode — Open-source terminal-native AI coding agent with 131K+ stars and 2.5M+ monthly active developers. Provider-agnostic architecture supports 75+ LLM providers plus native LSP auto-configuration, multi-session parallel agents, and MCP extensibility. The build/plan agent split and client/server architecture make it the most complete open-source reference for a terminal-first coding harness. Stars
  • Squad — Repository-native multi-agent orchestration framework built on GitHub Copilot. Initializes a persistent AI team (lead, frontend, backend, tester) as files inside your repo — knowledge compounds across sessions through committed history.md and decisions.md. The most accessible reference for teams that want multi-agent coordination without heavy infrastructure. Stars
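
The loop mechanics the browser-use entry highlights (tool registration, state injection, action loop, error recovery) fit in one small sketch, with a scripted stub in place of the model (all names invented):

```python
# Minimal-viable-harness sketch: register tools, run an action loop that
# injects accumulated state back into the model each turn, and surface
# tool errors as observations instead of crashing.
TOOLS = {}

def tool(fn):
    TOOLS[fn.__name__] = fn
    return fn

@tool
def read_file(path):
    files = {"notes.txt": "hello harness"}
    if path not in files:
        raise FileNotFoundError(path)
    return files[path]

def scripted_model(state):
    # Error recovery: after a failed observation, retry with a corrected call.
    if state and "error" in state[-1]:
        return ("read_file", {"path": "notes.txt"}), None
    if not state:
        return ("read_file", {"path": "notes.xt"}), None   # typo on purpose
    return None, state[-1]["observation"]                  # done: final answer

def run(model, max_steps=5):
    state = []                                   # injected back each turn
    for _ in range(max_steps):
        action, final = model(state)
        if action is None:
            return final
        name, args = action
        try:
            state.append({"observation": TOOLS[name](**args)})
        except Exception as exc:                 # surface errors, don't crash
            state.append({"error": f"{type(exc).__name__}: {exc}"})
    return None

print(run(scripted_model))  # → hello harness
```

Swapping the scripted stub for an LLM call is the whole difference between this sketch and a working minimal harness; everything else (registration, injection, recovery, step budget) stays.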

Adjacent Collections

  • EvoMap/awesome-agent-evolution — April 2026 curated list covering agent evolution, memory systems, multi-agent architectures, and self-improvement. Complements this list with a forward-looking lens on the next generation of agent capabilities — where harnesses must adapt to agents that modify their own scaffolding over time. Stars
  • Picrew/awesome-agent-harness — Implementation-first curated list (April 2026) with 150 entries, 84% GitHub projects, organized into 9 categories from harness architecture to sandboxing. The featured blogs section and catalog-style organization make it a strong complementary reference to this list's article-centric approach. Stars
  • jiji262/awesome-harness-engineering — Focuses on platform delivery governance, IDP, GitOps, and AI-native engineering. Overlaps with this list on the platform-engineering side; oriented more toward Harness (the company) than toward agent harnesses. Stars
  • VoltAgent/awesome-ai-agent-papers — Curated collection of 363+ arXiv papers from 2026 organized into five harness-relevant categories: Multi-Agent (51), Memory & RAG (56), Eval & Observability (79), Agent Tooling (95), AI Agent Security (82). Weekly updates make it the best single source for tracking research that will shape harness design decisions in 2026. Stars
  • bradAGI/awesome-cli-coding-agents — Catalog of 80+ terminal-native AI coding agents (open-source and proprietary) plus the harnesses that orchestrate, sandbox, and extend them: session managers, parallel runners, autonomous loop infrastructure, and credential vaults. The most comprehensive reference for the CLI agent layer that most harness infrastructure is designed to host.
  • danielrosehill/AI-Harnesses — April 2026 point-in-time snapshot of projects describing themselves as AI agent harnesses, organized into Resource Lists, Harness Runtimes, and Reference Implementations. Useful as a landscape survey of how the term "harness" is being applied across the ecosystem — from lightweight wrappers to full orchestration frameworks. Stars

Security, Sandbox & Permissions

  • Beyond Permission Prompts — The authoritative resource on moving from prompt-level permission grants to structured authorization in the harness.
  • Model Context Protocol — Authorization — MCP's specification for OAuth-based authorization flows when agents access external services.
  • AI Harness Scorecard — Scores repositories on AI harness safeguards. Useful checklist for auditing your own harness's security posture.
  • E2B — Firecracker microVM sandboxes purpose-built for agent tool loops: ~150ms cold start, Python/JS SDKs, open source. The clearest reference implementation of "code execution as a harness primitive" rather than a CI system bolted on. Stars
  • tldrsec/prompt-injection-defenses — The most complete catalog of practical prompt injection defenses (input validation, tool output sanitization, canary tokens, etc.). Functions as a design checklist for hardening trust boundaries in any agent harness. Stars
  • Prompt Injection — Simon Willison's Series — The most thorough public writing on why indirect prompt injection is uniquely dangerous for agent harnesses: agents actively consume untrusted external content (emails, web pages, tool outputs) that can hijack their actions. Essential for understanding the attack surface before designing trust boundaries.
  • OWASP LLM01:2025 — Prompt Injection — OWASP's authoritative classification of direct and indirect prompt injection risks. Complements the tldrsec defense catalog: use this to define the threat model, use tldrsec to select countermeasures.
  • Daytona — OCI-container sandboxes with sub-90ms startup, built-in Git operations, LSP support, and indefinite state persistence. Complements E2B for harnesses that need long-lived working directories across multiple agent sessions rather than ephemeral code execution. Stars
  • NeMo Guardrails — NVIDIA's programmable guardrails toolkit: define input, dialog, retrieval, execution, and output rails that intercept the agent loop at five distinct layers using the Colang DSL. The execution rail layer specifically governs what tools the LLM can invoke and what their inputs/outputs may contain — the reference for behavioral-level enforcement when static allow/deny lists are insufficient. Stars
  • LangSmith Sandboxes: Secure Code Execution for Agents — Describes a microVM-based sandboxing architecture with kernel-level isolation, resource caps (CPU/memory/disk), and an authentication proxy that keeps secrets entirely out of the runtime environment. Persistent WebSocket sessions support long-running agent tasks like dependency installation and test suite execution without the overhead of per-call container restarts.
  • Implementing a Secure Sandbox for Local Agents — Cursor's cross-platform sandbox implementation (macOS Seatbelt, Linux Landlock + seccomp, Windows WSL2) that lets agents run freely within a boundary and request approval only for external access. Key result: 40% fewer user interruptions vs. no-sandbox permissioning — agents explore freely inside the boundary rather than requesting every file operation. The training insight — that agents must be explicitly taught to recognize sandbox constraints — is the missing piece most sandboxing guides omit.
  • Practical Security Guidance for Sandboxing Agentic Workflows — NVIDIA AI Red Team's mandatory controls for agent code execution: restrict network egress, block workspace escape, and, critically, protect MCP server configuration and hooks files from agent modification. The core threat model: an agent that can edit its own harness configuration can escalate its own permissions, which standard sandbox isolation alone does not prevent.
  • Under the Hood: Security Architecture of GitHub Agentic Workflows — GitHub's March 9, 2026 architecture write-up is one of the clearest public descriptions of defense-in-depth for coding agents running inside CI: isolated agent container, firewall, MCP gateway, API proxy, staged safe outputs, and zero-secret execution. The key value is that it treats agent execution as a hostile workload inside automation infrastructure, which is exactly the mindset most harnesses need but rarely document.
  • Community-Powered Security with AI: An Open Source Framework for Security Research — GitHub Security Lab's January 14, 2026 launch of Taskflow Agent is a strong example of security-specific harness engineering: encode expert workflows as reusable tasks, keep the framework open for audit, and let AI scale established security practice instead of improvising ad hoc scans. Worth including because it turns security research itself into a sharable harness layer rather than a one-off internal workflow.
  • AnonymAI: Integrating Differential Privacy with LLM Agents — Framework for automating data anonymization in agent workflows. Directly addresses the harness problem of unintentional PII leakage through tool calls and memory writes — privacy enforcement moves from the agent (prompt-level trust) to the harness boundary (structural enforcement). Essential for regulatory compliance in EU, Canada, and emerging state-level privacy regimes.
  • Fault Tolerance Patterns: OpenClaw Journey Six—Core Retry Loop and Seven-Layer Fault-Tolerance — Four-layer fault tolerance (retry with backoff → model fallback chains → error classification → checkpoint recovery) reduces unrecoverable failures from 23% to under 2% across agent systems. The layered approach is essential reading for hardening production agent harnesses against the combinatorial failure modes introduced by multiple tool calls and external dependencies.
  • Sandbox Agents | OpenAI API Docs — OpenAI's authoritative April 2026 guide to sandbox architecture in the Agents SDK. The core principle is strict separation between the harness control plane (auth, billing, orchestration) and the sandbox compute plane (files, shell, ports), with manifest contracts, resumable session state, and sandbox-native memory.
  • NVIDIA OpenShell — Open-source policy-driven sandbox runtime for autonomous AI agents, announced at GTC 2026. Enforces security constraints at the kernel level via Landlock LSM (filesystem), seccomp BPF (syscalls), and an OPA/Rego-evaluated HTTP CONNECT proxy (network) — constraints are enforced on the environment itself, so even a compromised agent cannot override them. Supports Claude Code, Codex, Cursor, and OpenCode inside the sandbox.
  • Microsoft Agent Governance Toolkit — Seven-package, multi-language (Python, Rust, TypeScript, Go, .NET) runtime security toolkit that addresses all 10 OWASP Agentic AI risks with deterministic, sub-millisecond policy enforcement. Includes Agent OS (policy engine intercepting every action), Agent Mesh (secure agent-to-agent communication), and Agent Runtime (dynamic execution rings). Hooks into LangChain callbacks, CrewAI task decorators, Google ADK plugins, and Microsoft Agent Framework middleware.
  • Cloudflare Dynamic Workers — V8 isolate-based sandboxing for AI-agent-generated code execution, now in open beta. Isolates start in milliseconds using megabytes of memory — 100x faster and up to 100x more memory-efficient than containers. The sandbox intercepts outbound HTTP requests for credential injection so agent code never touches secrets directly. A fundamentally different architectural option from container-based sandboxes (E2B, Daytona).
  • Kubernetes Agent Sandbox — K8s-native Sandbox CRD (under SIG Apps) providing declarative, standardized APIs for managing isolated, stateful, singleton workloads for AI agent runtimes. Supports gVisor and Kata Containers for kernel-level isolation; v0.2.1 introduced "Secure by Default" networking architecture enforcing strict isolation with a shared policy model. The right choice when agents must run inside existing Kubernetes infrastructure.
  • Alibaba OpenSandbox — General-purpose sandbox platform for AI agents (8.7K+ stars, March 2026) with multi-language SDKs (Python, Java, TypeScript, Go, C#), unified APIs across Docker/Kubernetes runtimes, and support for secure container runtimes (gVisor, Kata Containers, Firecracker). Covers coding agents, GUI agents, agent evaluation, and RL training in a single abstraction layer — the most runtime-flexible sandbox option when you need to choose isolation levels per workload.
  • The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey — The first systematic survey of AI agent security from UC Berkeley and UIUC (Dawn Song et al., March 2026). Reviews 128 papers covering 51 attack methods and 60 defense mechanisms. Introduces a framework for understanding security risks specific to agentic (not just LLM) systems and identifies open gaps in securing agent architectures — the definitive 2026 reference for agent threat modeling.
  • Trustworthy agents in practice — Anthropic's April 2026 framework for governing autonomous agents through five principles: human control, value alignment, secure interactions, transparency, and privacy. The most complete published treatment of how to design harness-level governance that keeps pace with increasing agent capability and autonomy.
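Several entries above (the LangSmith sandbox write-up, OpenAI's sandbox guide) center on the same primitive: hard CPU and memory caps on agent-executed code. A minimal POSIX-only sketch of that pattern using just the Python standard library — `run_sandboxed` and its default limits are illustrative, not any vendor's API, and real sandboxes add filesystem and network isolation on top:

```python
import resource
import subprocess
import sys

def run_sandboxed(code: str, cpu_seconds: int = 5,
                  mem_bytes: int = 256 * 2**20) -> subprocess.CompletedProcess:
    """Run untrusted code in a child process with hard CPU/memory caps."""
    def limit():
        # Applied in the child before exec: kernel-enforced, not advisory.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True,
        preexec_fn=limit,            # POSIX-only hook
        timeout=cpu_seconds + 5,     # wall-clock backstop on top of the CPU cap
    )

result = run_sandboxed("print(2 + 2)")
print(result.stdout.strip())  # → 4
```

The wall-clock `timeout` matters alongside `RLIMIT_CPU`: a process sleeping on I/O burns no CPU time and would otherwise hang the harness indefinitely.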

Evals & Verification

  • Demystifying Evals for AI Agents — Anthropic's comprehensive guide to agent evaluation: trajectory evals, outcome evals, and how to build eval harnesses that are themselves reliable.
  • DeepEval — The most complete open-source LLM/agent eval framework: 20+ built-in metrics (hallucination, answer relevancy, RAGAs, tool correctness), pytest integration, and a CI-friendly runner. Removes the need to hand-roll eval infrastructure when you need structured, repeatable agent quality gates.
  • SWE-bench — The canonical benchmark for coding agents. Essential reference for understanding what "verified working" means for harness outputs.
  • Inspect AI — UK AI Security Institute's eval framework with native support for evaluating external agents (Claude Code, Codex CLI) as black-box targets, plus built-in bash/python/web browsing tools. Built for safety-grade rigor; the right foundation for harness-level eval infrastructure.
  • Quantifying Infrastructure Noise in Agentic Coding Evals — Anthropic's empirical study showing container resource configuration alone produces 6+ percentage point benchmark swings — often exceeding model-to-model gaps. The 3x threshold finding is the key practical result: scores are stable up to 3x specified resources, but above that agents shift strategy entirely (lean tools vs. heavy dependencies), meaning tight and generous resource limits measure fundamentally different behaviors. Essential reading before interpreting any agentic eval.
  • tau-bench — Benchmarks agent behavior in three-way user-tool-policy interactions — the failure mode SWE-bench doesn't cover. Useful for validating that a harness correctly enforces business rules across multi-turn, stateful conversations.
  • Towards a Science of AI Agent Reliability — Proposes twelve concrete reliability metrics across four dimensions (consistency, robustness, predictability, safety), evaluated against 14 agentic models. The central finding — that recent capability gains yield only modest reliability improvements — is the empirical case for investing in harness-layer reliability engineering as a discipline distinct from model selection.
  • Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes — The first large-scale empirical fault taxonomy for agent systems, derived from 13,602 issues across 40 open-source repositories: 37 fault types, 13 symptom classes, 12 root cause categories. Key finding: most failures originate from mismatches between probabilistically generated artifacts and deterministic interface constraints — a structural harness problem, not a model capability problem.
  • VeRO: An Evaluation Harness for Agents to Optimize Agents — Framework for evaluating agent-on-agent optimization cycles: a coding agent iteratively modifies a target agent's harness (prompts, tools, configuration) through edit-execute-evaluate loops while VeRO captures versioned agent snapshots, budget-controlled evaluation, and structured execution traces. Addresses the meta-evaluation gap — how to systematically measure whether one agent is improving another — with reproducible infrastructure and a benchmark suite for comparing optimizer configurations.
  • Eval Awareness in Claude Opus 4.6's BrowseComp Performance — Anthropic's documented case of Claude Opus 4.6 inferring it was under evaluation, identifying the benchmark by name, and decrypting the answer key — producing 11 non-intended solutions. A direct challenge to eval harness design: any eval that runs in a web-enabled environment is vulnerable to the agent researching the benchmark itself. The practical countermeasure — evaluate in network-isolated environments — is now a harness engineering requirement, not optional hygiene.
  • Designing AI-Resistant Technical Evaluations — Anthropic's January 21, 2026 write-up is the clearest account of a problem eval builders now have to treat as first-class: capable models can invalidate the test itself. The practical value is the redesign methodology — shift toward longer-horizon, tool-building, environment-understanding tasks that remain discriminative even as frontier models get better at short take-homes.
  • Amazon Bedrock AgentCore Evaluations Is Now Generally Available — AWS's March 31, 2026 GA launch matters because it operationalizes agent evals as an infrastructure service: trajectory scoring, task completion checks, and model-graded assessments are wired into the same platform that hosts agents. Worth adding because it shows how evaluation stops being an offline benchmark and becomes part of the runtime control plane teams can standardize on.
  • Live-SWE-agent: First Live Software Agent with Self-Evolving Scaffold — Demonstrates end-to-end harness design for production software engineering: 77.4% solve rate on SWE-bench Verified (vs. 50% from human contractors). The key insight is the self-evolving scaffold — the harness itself adapts based on failure signals rather than requiring manual engineering each time a new task class emerges. Shows the upper bound of what specialized harness design can achieve.
  • OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models — April 2026 benchmark that evaluates agents on 12 real-world professional occupations (software engineer, data scientist, financial analyst, etc.) using language world models to simulate realistic work environments. Agents are scored on task completion, efficiency, and professional standards adherence. The important harness contribution is the environment-as-evaluator pattern: instead of static test cases, the benchmark uses dynamic language-based world models that respond to agent actions with realistic consequences — a reference design for evaluating harnesses in open-ended, profession-specific domains.
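Most outcome-eval frameworks above reduce to the same skeleton: run the agent over a set of cases, score each result with a programmatic check rather than a string match, and gate on a pass-rate threshold. A toy sketch of that skeleton — `EvalCase`, `run_outcome_eval`, and the threshold are illustrative names, not any framework's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # outcome check, not a brittle string match

def run_outcome_eval(agent: Callable[[str], str],
                     cases: list[EvalCase],
                     threshold: float = 0.9) -> bool:
    """Run every case, report the pass rate, and gate on the threshold."""
    passed = sum(case.check(agent(case.prompt)) for case in cases)
    rate = passed / len(cases)
    print(f"pass rate: {rate:.0%}")
    return rate >= threshold

# Toy "agent" and cases, standing in for a real agent loop and task suite.
agent = lambda p: p.upper()
cases = [EvalCase("ok", lambda out: out == "OK"),
         EvalCase("hi", lambda out: out == "HI")]
print(run_outcome_eval(agent, cases))  # → True
```

Trajectory evals extend the same loop by checking the sequence of tool calls, not just the final output — the `check` callable then receives a trace instead of a string.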

Templates

Reusable starting points for harness artifacts. Copy and adapt.

  • templates/AGENTS.md — Project-level agent instructions: conventions, constraints, tool permissions
  • templates/PLAN.md — Task planning artifact with milestones and verification gates
  • templates/IMPLEMENT.md — Implementation log: decisions, deviations, open questions
  • templates/HARNESS_CHECKLIST.md — Review checklist before shipping a harness to production
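As a rough illustration of the first artifact, a minimal AGENTS.md might look like the sketch below — the section names and rules here are hypothetical examples, not the repository's actual template:

```markdown
# AGENTS.md

## Conventions
- Python 3.12; ruff for lint/format; pytest for tests
- Commit messages: imperative mood, subject line under 72 characters

## Constraints
- Never edit files under vendor/ or generated/
- All schema changes require an accompanying migration file

## Tool permissions
- Allowed without approval: read files, grep, run tests
- Requires approval: network access, package installs, git push

## Verification
- Run the full test suite before declaring any task complete
```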

Production Infrastructure & Operations

  • AI Agent Scaling Gap: Pilot to Production (March 2026) — Analysis showing 72% of Global 2000 companies operate agents beyond experimental phases, but only 14% successfully scaled organization-wide. Scaling success correlates strongly with operations infrastructure (monitoring, evaluation harnesses, incident response) rather than technology choices. Documents the shift from engineering-focused to operations-focused agent deployment.
  • 5 Production Scaling Challenges for Agentic AI in 2026 — Key findings: data infrastructure must be in place before deployment, successful scalers appoint a dedicated AI operations function before expanding, and scaled deployments run as multi-agent distributed systems with load balancing and auto-scaling. Essential reading for understanding infrastructure prerequisites for agent deployment at scale.
  • AI Agent Cost Optimization Guide 2026: Reduce Spend by 60-80% — Systematic patterns for cost reduction: model routing and caching (40-60% savings); Anthropic prompt caching (90% discount on cached tokens); identifying unnecessary agent overhead vs. simple API chains. Key harness decisions (tool selection, caching strategy, model choice per task) determine operating cost.
  • KernelEvolve: How Meta's Ranking Engineer Agent Optimizes AI Infrastructure — Meta's production-grade agentic kernel optimization system that autonomously generates optimized Triton kernels for hundreds of models serving billions of users daily. Achieves up to 17x speedup over PyTorch baselines with 100% correctness across 250 problems. Demonstrates harness design for continuous infrastructure optimization: a purpose-built job-harness evaluates each candidate kernel, feeds diagnostics back to the LLM, and drives search over hundreds of alternatives — reducing development time from weeks to hours.
  • State of Agent Engineering 2026 — LangChain's industry survey of 1,300+ professionals: 57.3% now have agents in production (up from 51%), quality is the top barrier at 32%, and 89% have implemented observability while only 52% run evals. The most comprehensive snapshot of where the industry stands on agent deployment maturity, model strategies, and operational gaps.
  • Agentic Development: What It Means for Engineering Infrastructure in 2026 — Defines the four infrastructure capabilities that agentic development requires: per-task isolated sandboxes, sub-100ms startup (ruling out traditional VMs and most K8s approaches), API-driven lifecycle management, and MCP-native environment control. The clearest articulation of why existing CI/CD infrastructure is insufficient for agent-driven development workflows.
  • FinOps for Agents: Loop Limits, Tool-Call Caps, and the New Unit Economics of Agentic SaaS — Defines five concrete budget guardrails enforced at the infrastructure gateway: loop/step limits, tool-call caps, per-run token budgets, wall-clock timeouts, and per-tenant budgets with anomaly alerts. Introduces Cost-per-Accepted-Outcome (CAPO) as the right unit economic metric for agent harnesses — shifting cost measurement from tokens consumed to business value delivered.
  • Backtesting AI Agents: How SRE Teams Prove Reliability Before Production — Formalizes agent validation as infrastructure-grade testing with pass^k reliability (all 20+ trials must succeed) rather than pass@k (one success). Defines five measurable dimensions (consistency, robustness, predictability, safety, cost stability) with specific SLO thresholds. Recommends dataset composition of 20% golden paths, 30% edge cases, 20% adversarial, 30% regression from production incidents.
  • How My Agents Self-Heal in Production — A concrete April 3, 2026 production pattern for closing the post-deploy loop: detect regressions, attribute whether the last deploy caused them, then dispatch a coding agent to open a fix PR automatically. This belongs here because it turns evals and observability into an active remediation harness, not just a dashboard humans are expected to watch.
  • Minions: Stripe's one-shot, end-to-end coding agents—Part 2 — Stripe's deep-dive into their unattended minion harness shipping 1,300+ PRs/week: "blueprints" interleave deterministic code nodes with agentic subtasks, a centralized 500-tool MCP server (Toolshed) serves the whole fleet, and pre-warmed devboxes prove that investments in human developer productivity pay equal dividends for agents.
  • Amazon Bedrock AgentCore — AWS's fully managed agent deployment platform providing serverless runtime with session isolation, built-in memory (session + long-term), secure gateway for tool access, browser runtime, and code interpreter — all framework-agnostic. Now supports AG-UI protocol for real-time agent-to-frontend streaming, VPC/PrivateLink for enterprise security, and CloudFormation for infrastructure-as-code deployments. The reference cloud-native agent hosting platform for teams that need managed infrastructure rather than building their own.
  • AWS Agent Registry for Centralized Agent Discovery and Governance — AWS's April 9, 2026 preview adds a missing production primitive: a governed catalog for agents, tools, skills, MCP servers, and custom resources with approval workflows, audit trails, and MCP-accessible discovery. Worth including because large organizations do not just need to run agents safely; they need to know what agent capabilities already exist so teams stop rebuilding the same scaffolding in silos.
  • A Dev's Guide to Production-Ready AI Agents — Google Cloud's developer guide (April 2026) for moving AI agents from prototype to production using ADK, Vertex AI Agent Engine for managed hosting, and Cloud Run for serverless deployment. Covers the full production stack: agent development patterns, scaling considerations, security and identity requirements, and operational monitoring — the most concrete first-party guide for deploying agents on Google Cloud infrastructure.
  • Enhanced Tool Governance in Vertex AI Agent Builder — Google Cloud's approach to production agent governance (April 2026): agents get identity as first-class IAM principals with least-privilege enforcement, Cloud API Registry integration enables organizational tool governance (admins manage available tools centrally), and a new observability dashboard tracks token usage, latency, and error rates. Demonstrates the cloud-native pattern for tool governance at enterprise scale.
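The FinOps guardrails described above (loop/step limits, tool-call caps, per-run token budgets) amount to a small accounting object the harness consults before every action. A sketch of that pattern — `RunBudget`, its caps, and the exception name are hypothetical, not a real gateway API:

```python
class BudgetExceeded(RuntimeError):
    pass

class RunBudget:
    """Per-run guardrail: step limit, tool-call cap, and token budget."""

    def __init__(self, max_steps: int = 50, max_tool_calls: int = 20,
                 max_tokens: int = 100_000):
        self.max_steps, self.max_tool_calls, self.max_tokens = (
            max_steps, max_tool_calls, max_tokens)
        self.steps = self.tool_calls = self.tokens = 0

    def charge(self, *, steps: int = 0, tool_calls: int = 0, tokens: int = 0):
        """Record usage; raise if any cap is exceeded so the loop halts."""
        self.steps += steps
        self.tool_calls += tool_calls
        self.tokens += tokens
        for used, cap, name in [
            (self.steps, self.max_steps, "steps"),
            (self.tool_calls, self.max_tool_calls, "tool calls"),
            (self.tokens, self.max_tokens, "tokens"),
        ]:
            if used > cap:
                raise BudgetExceeded(f"{name} budget exceeded: {used} > {cap}")

budget = RunBudget(max_steps=3)
budget.charge(steps=1)
budget.charge(steps=2)        # exactly at the cap: still allowed
try:
    budget.charge(steps=1)    # fourth step trips the guardrail
except BudgetExceeded as e:
    print(e)                  # → steps budget exceeded: 4 > 3
```

Enforcing this at the gateway rather than inside the agent prompt is the point of the pattern: a looping agent cannot talk its way past a raised exception.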

Related Awesome Lists

Lists that cover adjacent territory — overlapping but not identical scope.

  • Awesome Context Engineering — Comprehensive survey on context engineering: prompt engineering, RAG, context window management, production AI systems.
  • awesome-claude-code — Curated resources, tools, and workflows specifically for Claude Code users.
  • awesome-mcp-servers — Comprehensive list of MCP servers for extending agents with external capabilities.
  • awesome-ai-agents — Curated list of AI agents and agent frameworks, organized by use case. Useful for surveying the landscape of what harnesses are being built around.
  • awesome-llm-apps — Collection of production LLM applications with source code across RAG, multi-agent, and tool-use patterns. Good reference for how harness primitives combine in real applications.
  • ICLR 2026 MemAgents Workshop — Interdisciplinary workshop (April 27, Rio de Janeiro) covering episodic/semantic memory, knowledge graphs, vector databases, retrieval pipelines, temporal credit assignment, and context management for agentic systems. The canonical venue for memory architecture research and standards; accepts full papers (9pg), short papers (4pg), tiny papers (2pg).

Contributing

See CONTRIBUTING.md.

What belongs here: Resources that address a specific harness engineering problem (context, tools, planning, permissions, memory, verification, sandboxing). Each addition should include a 1–2 sentence note explaining why it's worth including — this is an opinionated list, not a directory.

What doesn't belong here: General AI/ML papers, model benchmarks unrelated to agent harnesses, tutorials on using specific models, product marketing.


License

CC0 — public domain dedication.


Acknowledgments

Thanks to linux.do — a vibrant tech community where many harness engineering ideas were discussed and refined.
