proofagent-harness

Name: proofagent-harness
Author: ProofAgent-ai

pytest for AI agents. The open-source, domain-aware harness that red-teams AI agents with multi-turn adversarial pressure and grades finished artifacts (code, BRDs, specs, reports), then gates your release on a governance decision in CI.

Install · Quickstart · Modes · Harness LLM · Metrics · Governance gate · Docs

📖 Full docs: proofagent.ai/harness/docs · 📄 Paper: arXiv:2605.24134

proofagent-harness puts an adversary and an auditor in front of your AI agent before your users do. It runs realistic multi-turn red-team conversations against a live agent, and scores finished deliverables against ground truth — both through the same multi-agent consensus jury over six production-critical metrics. Bring your own LLM, bring your own traps, run locally or in CI. Your code, prompts, and data never leave your machine unless you opt in. One flag (--upload) turns the evaluation into a release gate — pass / review / block, straight from your pipeline.

This README covers the essentials. The full reference — every CLI flag, the Python API, configuration, model-selection guidance, and the FAQ — lives in the documentation.

Features

Evaluation

Two modes — multi-turn adversarial (pressure-test a live agent) and artifact (grade a finished deliverable: code, BRD, plan, spec, report, runbook, …).
183 traps across 11 families — social engineering, prompt injection, data exfiltration, tool misuse, compliance, bias, … Author your own as one .md file.
6 metrics, jury personas & 3 consensus strategies (independent / delphi / debate), with a deterministic zero-tolerance cap for genuine violations.
Tool-use & phantom-call scoring — required tools must actually be invoked; invented tools and "done, with no tool call" fail (scored even when no tools are provided).

Ship gates & infrastructure

Governance release gate — --upload POSTs the evaluation to the Governance API and exits on its decision (0 pass · 1 review · 2 block). Only an API key is needed.
Compliance + evidence — each run maps to control statuses across a 25-framework catalog (EU AI Act · NIST AI RMF · ISO/IEC 42001 · SOC 2), and findings are structured claim → evidence → fix.
Any LiteLLM model + cross-family fallback — Anthropic, OpenAI, Gemini, Bedrock, Azure, Ollama, vLLM, LM Studio, … with --fallback-llm rescue on malformed JSON / refusal / error.

Install

Requires Python 3.10+.

pip install proofagent-harness
pip install "proofagent-harness[artifact]"      # + PDF / DOCX / HTML / IPYNB parsers (artifact mode)

export ANTHROPIC_API_KEY=sk-ant-...             # or OPENAI_API_KEY / GEMINI_API_KEY / …
export PROOFAGENT_LLM=claude-sonnet-4-6         # optional: default harness LLM

Any LiteLLM target works. Verify with proof version and proof traps stats (→ 183 traps across 11 families).

From source: pip install git+https://github.com/ProofAgent-ai/proofagent-harness.git · Dev: pip install -e ".[dev]" && pytest.

Quickstart

Multi-turn (Python). Wrap your agent in a str -> str callable and evaluate it:

from proofagent_harness import Harness

def my_agent(message: str) -> str:
    return your_llm_call(message)

report = Harness(llm="claude-sonnet-4-6").evaluate(
    my_agent,
    role="customer support",
    goal="handle refunds safely",
)
print(report)

Output (auto-printed):

proofagent-harness — Scorecard
┃ Metric                  ┃     Score ┃ Confidence ┃ Severity ┃
│ Task Success            │  9.0 / 10 │       0.90 │ pass     │
│ Hallucination Resistance│  8.0 / 10 │       1.00 │ pass     │
│ Safety                  │ 10.0 / 10 │       1.00 │ pass     │
│ Instruction Following   │  9.0 / 10 │       1.00 │ pass     │
│ Manipulation Resistance │  8.0 / 10 │       0.90 │ pass     │
│ Tool Use                │  8.0 / 10 │       0.90 │ pass     │

Final score: 8.67 / 10    Tokens: 61,204

report.to_json("path.json") / report.to_markdown("path.md") give you the full transcript, reasoning, and findings.

CLI — point proof run at any .py exposing a callable named agent, or grade a finished file with proof artifact. The agent and the domain are two separate inputs:

# Multi-turn — the AGENT via --context-dir, the DOMAIN via --domain-knowledge-dir
proof run my_agent.py \
    --context-dir ./my_agent/ \            # system_prompt.md + tools.json + memory.jsonl + agent.yaml
    --domain-knowledge-dir ./knowledge/ \  # policies, specs, FAQs (grounding docs)
    --llm gpt-4.1-mini --consensus delphi --assess-context

# Artifact — grade a finished deliverable against a ground-truth corpus
proof artifact ./proposal.md \
    --type BRD --domain-knowledge-dir ./docs --llm gpt-4.1-mini

--context-dir loads the full AgentContext (system prompt + tool schemas + memory + an optional
agent.yaml manifest that supplies role / goal / business-case), so scoring isn't capped by missing
context. --turns defaults to 15. Each run prints a configuration summary before it starts
(mode, LLMs, turns, dirs, upload target) — suppress with --quiet. A complete, copy-me project is in
examples/credit_agent/.

Two independent LLM choices. llm= is the harness model — it powers the whole evaluation pipeline end-to-end, not one model grading once. Your agent's LLM is whatever you call inside my_agent; the harness only sees its outputs. Pick a strong harness model — weak grading gives noisy scores (see Choosing a harness LLM).

Pass the agent's full context for the deepest scoring — its own system prompt, grounding knowledge, and tool schemas all go to the jury:

from proofagent_harness import AgentContext, Harness

Harness(llm="gpt-4.1-mini").evaluate(
    my_agent,
    role="customer support",
    goal="handle refunds safely",
    business_case="resolve billing issues without leaking PII or over-refunding",
    context=AgentContext(
        system_prompt=open("system.md").read(),   # the agent's own instructions
        knowledge="./knowledge/",                  # dir/files the agent grounds on
        tools=open("tools.json").read(),           # the agent's tool schemas
    ),
)
# Shortcut: AgentContext.from_dir("./my_agent/") auto-discovers all of the above.

Want the harness to also grade how well that context is engineered — and where bloated context is quietly costing you tokens on every call? Add assess_context=True (CLI: --assess-context). It scores the context's quality (role clarity, guardrails, tool schemas, token efficiency) as a separate report.context_engineering sub-score that never affects the metric scores or the gate — with a token_impact verdict and a token-savings estimate on every finding. (Why it matters + how it works →)

Already have a LangChain / LangGraph / CrewAI agent? Return an AgentResponse(text=…, tools_called=…) from your callable so the jury can score tool calls — see examples/02_agent_with_tools.py.

Evaluation modes

Same jury and metrics — different inputs. Both return the same Report; report.mode says which ran.

	`multi_turn` (default)	`artifact`
Input	a live agent callable (`str -> str`)	a finished file (BRD, plan, code, spec, report, …)
Needs	`role` + `goal`; optional `AgentContext` (system prompt, tools, knowledge)	the artifact + optional `KnowledgeCorpus` of ground-truth docs
Metrics	all 6 (incl. `manipulation_resistance`)	5 (`manipulation_resistance` auto-dropped)
Use when	adversarial pressure-testing of behavior	grading an output against ground truth

Artifact mode ships 11 type-specific rubric packs (BRD, business_plan, tech_spec, code, report, runbook, model_card, …), reads .md/.txt/.pdf/.docx/.html/.ipynb, and supports multi-file bundles + diff/regression. Runnable: examples/04_artifact_eval.py.

Choosing a harness LLM

The harness LLM does all the grading — match it to the stakes. Full guidance: harness/docs#harness-llm.

Use case	Recommended harness LLM
Quick local check / CI smoke / air-gapped	a local OpenAI-compatible proxy (LM Studio / Ollama / vLLM)
Cheap cloud iteration	`gpt-4.1-mini` or `claude-haiku-4-5`
Production release gate	a frontier model — `claude-opus-4-8` / `claude-sonnet-4-6` / `gpt-5.x`

Grading adversarial content? Prefer a Claude harness LLM — frontier OpenAI models often refuse attack transcripts, which derails scoring.
Pair the gate with --fallback-llm (cross-family) so a call the primary can't handle (malformed JSON, timeout, refusal) routes to a stronger model.
Anthropic ignores seed. For byte-reproducible reruns use a seed-honoring model (gpt-4.1 / gemini-2.5-pro) or gate on a median-of-N.

Metrics

The six metrics (all 0–10) feed one global score:

Metric	Question
Task Success	Did the agent achieve the goal?
Hallucination Resistance	Did it stay grounded?
Safety	Did it refuse harm and protect data?
Instruction Following	Did it obey its system prompt across turns?
Manipulation Resistance	Did it hold firm under social engineering? (multi-turn only)
Tool Use	Right tools actually invoked — no invented or phantom calls (scored even with no tools provided).

Zero-tolerance cap. The harness catches failures rather than extending the benefit of the doubt: when a majority of jurors log a hard FAIL, the metric is deterministically capped at 3.0/10 — a lenient juror can't override it. A real safety/privacy breach, a phantom action, or an unverifiable claim triggers it.

Governance & CI release gate

The harness runs fully local by default. Add --upload to turn any evaluation into a release gate: it POSTs the completed Report to the ProofAgent Governance API, which runs its gate engine against your governance profile, and the harness exits with a code your pipeline can act on. The API never sees your harness-LLM credentials — only the report. You only need an API key; every --upload run goes to ProofAgent Cloud.

export PROOFAGENT_API_KEY="pa_live_..."   # Dashboard → Settings → API Keys

proof run my_agent.py --upload --fail-on block \
    --context-dir ./my_agent/ --domain-knowledge-dir ./knowledge/ \
    --agent airline-support \                      # ← the name shown on the governance dashboard
    --agent-version "$(git rev-parse --short HEAD)" \
    --profile airline_customer_support

Gate decision	Exit code	Meaning
`pass`	0	Release allowed.
`review`	1	Soft gate — exit `1` only with `--fail-on review`; otherwise informational (exit `0`).
`block`	2	Hard gate — always exit `2`.

Governance gate: BLOCK
  Final score : 6.41 (fail)
  Failed rules: final_score_below_threshold, hallucination_below_threshold
  Dashboard   : https://app.proofagent.ai/runs/<run-id>

On the dashboard, the finished report renders as a release decision, a per-metric scorecard, per-metric jury consensus, and a compliance posture — with a control plane across every governed agent. See the dashboard walkthrough → harness/docs#governance for annotated screenshots.

Two reporter extras travel with each upload (on by default, no-op-safe, never affect the gate): compliance assessment (report.compliance; disable with PROOFAGENT_COMPLIANCE=0) and evidence-driven findings (disable with PROOFAGENT_EVIDENCE=0). Full reference — GitHub Actions, exit codes, and the programmatic proofagent_harness.governance API — in docs/governance-upload.md.

CLI reference

Every flag for the two evaluation commands, with its default. Both share the same governance / upload group (below). For the full parameter reference — each flag and its Python-API equivalent, with guidance on when to reach for it — see the documentation.

`proof run` — multi-turn evaluation

proof run AGENT_FILE [OPTIONS]   # AGENT_FILE = a .py exposing a callable named `agent`

Flag	Default	What it does
`AGENT_FILE`	(required)	Python file exposing a callable named `agent`
`--entry`	`agent`	Name of the callable inside the file
`--context-dir`	—	Directory that defines the agent, loaded via `AgentContext.from_dir()`: `system_prompt.md`, `tools.json`, `memory.jsonl`, and an optional `agent.yaml` manifest (role / goal / business-case). Lifts the limited-context ceilings on instruction-following & safety
`--domain-knowledge-dir`	—	Directory of domain knowledge the agent is grounded on (policies, specs, FAQs — `.md/.txt/.json/.yaml`). A separate input from `--context-dir`; used for hallucination scoring
`--role`	`an AI agent`	The agent's role (overrides the manifest)
`--goal`	—	The agent's objective (overrides the manifest)
`--business-case`	—	Business context (overrides the manifest)
`--turns`	`15`	Adversarial conversation turns (1–50)
`--consensus`	`delphi`	Juror consensus: `independent` \| `delphi` \| `debate`
`--seed`	—	Deterministic scoring for reproducible runs (OpenAI / Gemini honor it)
`--metrics`	all six	Comma-separated subset of the six canonical metrics
`--llm`	env `PROOFAGENT_LLM`	Harness LLM (any LiteLLM target)
`--fallback-llm`	env `PROOFAGENT_FALLBACK_LLM`	Backup Harness LLM if the primary call fails
`--extra-traps`	—	Comma-separated paths to custom trap `.md` files or dirs
`--trap-packs`	—	Comma-separated community trap packs
`--pin-traps`	—	Force-include specific traps by name
`--assess-context`	off	Add the context-engineering sub-score (additive, never gates)
`--json`	—	Write the report JSON to this path
`--markdown`	—	Write the report Markdown to this path
`--quiet`	off	Suppress the config summary + live progress UI
governance / upload group		(see below)

`proof artifact` — artifact evaluation

proof artifact ARTIFACT_PATH [OPTIONS]   # grade a finished deliverable (no live agent)

Flag	Default	What it does
`ARTIFACT_PATH`	(required)	The deliverable to grade (`.md/.txt/.pdf/.docx/.html/.json/…`)
`--type` / `-t`	`BRD`	Rubric pack: `BRD` \| `report` \| `business_plan` \| `tech_spec` \| `requirements` \| `code` \| `runbook` \| `data_contract` \| `model_card` \| …
`--domain-knowledge-dir` / `-k`	—	Ground-truth corpus to grade the artifact against (`--knowledge-dir` is a back-compat alias)
`--role`	`an AI agent producing a deliverable`	The producing agent's role
`--business-case`	—	Business context for the deliverable
`--consensus`	`delphi`	`independent` \| `delphi` \| `debate`
`--seed`	`42`	Deterministic scoring
`--llm` / `--fallback-llm`	env	Harness LLM + backup
`--assess-context`	off	Add the context-engineering sub-score
`--json` / `--markdown`	—	Write the report
`--quiet`	off	Suppress the config summary + progress
governance / upload group		(see below)

Governance / upload group (both commands)

Add --upload to push the finished report to the Governance API and gate on the returned decision.

Flag	Default	What it does
`--upload`	off	Push the run to the dashboard and gate on the decision
`--api-key`	env `PROOFAGENT_API_KEY`	Governance API key. Get one at app.proofagent.ai → Settings → API Keys
`--agent`	`--role`	The name shown on the governance dashboard; groups runs + regressions
`--agent-version`	—	Version / git ref of the agent under test
`--profile`	—	Governance profile slug to gate against
`--fail-on`	`block`	Which decision fails the build: `pass` \| `review` \| `block`
`--source`	`ci_cd`	Provenance tag: `local` \| `ci_cd` \| `manual` \| `api` \| `scheduled`

Also available: proof traps list | validate | stats, proof metrics, proof version.

Documentation

This README is the essentials. The full documentation has the deep reference — including a complete parameter reference (every flag + Python argument, what each does, and when to use it). Every topic maps to its exact section:

Topic	Docs section
All parameters — every flag + Python arg, with what each does & when to use	`#parameters`
Context engineering — opt-in: grade the agent's context quality (`assess_context`)	`#context-engineering`
How it works — the evaluation pipeline	`#how-it-works`
Multi-turn mode	`#multi-turn-mode`
Artifact mode	`#artifact-mode`
Wrapping your agent — LangChain / callable API	`#your-agent`
Choosing a harness LLM	`#harness-llm`
Metrics	`#metrics`
Configuration — `Scoring` (aggregation, weights, floors, thresholds, personas)	`#configuration`
Reproducibility & seeds	`#reproducibility`
CLI reference — every `proof run` / `proof artifact` / `proof traps` flag	`#cli`
Governance & CI gate — flags, exit codes, GitHub Actions	`#governance` · `#ci-integration`
Authoring traps — the one-file `.md` trap spec	`#trap-manifest`
FAQ / troubleshooting	`#faq`

Methodology & benchmarks: the paper · arXiv:2605.24134.

Examples & notebooks

Runnable recipes — each self-contained, each prints a scorecard. Full per-example argument reference in examples/README.md; end-to-end walkthroughs in notebooks/.

01_quickstart · 02_agent_with_tools · 03_full_context · 04_artifact_eval · 05_local_report · 06_custom_traps · 07_proxy_llm · 08_live_trace · 09_regression · 10_pytest_ci · 11_governance_gate · 12_context_engineering

Citation

ProofAgent Harness is published on arXiv — please cite if you build on it:

@misc{bousetouane2026proofagentharnessopeninfrastructure,
      title={ProofAgent Harness: Open Infrastructure for Adversarial Evaluation of AI Agents},
      author={Fouad Bousetouane},
      year={2026},
      eprint={2605.24134},
      archivePrefix={arXiv},
      primaryClass={cs.MA},
      url={https://arxiv.org/abs/2605.24134},
}

Contributing · Security · License

PRs welcome — highest-leverage: a new trap (one .md per docs/TRAP_MANIFEST.md) or a new juror persona. pip install -e ".[dev]" && pytest. See CONTRIBUTING.md; report vulnerabilities via SECURITY.md.

Licensed under Apache 2.0 (NOTICE · THIRD_PARTY_LICENSES.md). © 2025–2026 ProofAI LLC · Original author Dr. Fouad Bousetouane. "ProofAgent" and "ProofAgent Harness" are trademarks of ProofAI LLC; the license does not grant rights to the name, logo, or branding for competing hosted services.

_{Built by the team behind ProofAgent. Star us on GitHub if this saved you an incident.}