Awesome AI Agent Incidents



A curated corpus of real-world security incidents, attack techniques, CVEs, frameworks, and defensive tools for autonomous AI agents.
From zero-click Copilot exfiltration to AI-powered C2 channels — attacks on agentic AI have moved well beyond theory.
Incidents like these raise a harder question for teams shipping AI-generated code: do you actually know what your agent did? We are building h5i to answer it — a Git sidecar that records every prompt, decision, and uncertainty signal alongside the diff, so you can audit what the AI touched, catch credential leaks before they land, and resume sessions without losing context.
Table of Contents
Real-World Incidents
Prompt Injection & Goal Hijacking
| Date |
Incident |
Impact |
References |
| Mid-2025 |
EchoLeak — Microsoft 365 Copilot CVE-2025-32711 (CVSS 9.3): Zero-click prompt injection via crafted emails; reference-style Markdown image links bypassed Copilot's link-redaction filters, exfiltrating OneDrive, SharePoint, and Teams data through the Teams content URL API (teams.microsoft.com/urlp/v1/url/content) which was in the CSP allowlist—no user interaction required. |
Data exfiltration across 160+ org-level incidents; ~$200M est. impact |
arXiv 2509.10540, Checkmarx, Trend Micro |
| May 2025 |
GitHub MCP Prompt Injection: Malicious commands embedded in public GitHub Issues hijacked developers' AI agents via the GitHub MCP server, exfiltrating private repository source code and cryptographic keys to attacker-controlled public repositories. |
Private source code and keys leaked |
Security Boulevard / NSFOCUS |
| Aug 2025 |
Perplexity Comet Browser Injection: Hidden commands in Reddit Markdown spoiler tags triggered Comet's "summarize page" feature; the agent logged into user emails, bypassed CAPTCHAs, and transmitted credentials within 150 seconds. |
Credential theft |
Security Boulevard |
| Mid-2025 |
Supabase Cursor Agent Injection: A Cursor agent with privileged service-role database access processed support tickets containing user-supplied SQL; attacker-controlled content caused exfiltration of sensitive integration tokens. |
Sensitive tokens exposed |
Practical DevSecOps |
| Aug 2024 |
Slack AI Data Exfiltration: Indirect prompt injection in public Slack channels weaponized Slack AI's RAG search to exfiltrate data from private channels via malicious link formatting—Slack initially described this as "intended behavior." |
Private communications leaked |
Simon Willison |
| 2024 |
Financial Reconciliation Agent Fraud: An attacker phrased a data export as a legitimate business task ("all customer records matching pattern X"), causing a reconciliation agent to export every record without raising suspicion. |
45,000 customer records stolen |
Stellar Cyber |
| 2025 |
Procurement Agent Memory Poisoning: Over 3 weeks, a manufacturing firm's procurement agent was gradually memory-poisoned into believing elevated transfer limits were authorized; it then transferred funds to attacker accounts. |
Financial fraud via persistent memory corruption |
Lares Labs |
| Nov 2025 |
A2A Session Smuggling (Palo Alto Unit 42): "Agent Session Smuggling" exploited trust relationships in the Agent-to-Agent (A2A) protocol during multi-turn conversations; a malicious sub-agent hijacked the task graph of a trusted orchestrator. |
Multi-agent system compromise |
Lares Labs |
| Dec 2025 |
OpenAI Atlas — Real Attack Chain Disclosed: OpenAI confirmed a real-world attack where a malicious email caused their Atlas agent to autonomously send a resignation letter. OpenAI stated that "deterministic guarantees are not achievable." |
Unauthorized email sent on behalf of user |
OpenAI Blog |
| 2024 |
ChatGPT Memory Persistence Attack: Injected content manipulated ChatGPT's persistent memory feature to plant false memories, enabling long-term data exfiltration across sessions without re-injection. |
Persistent cross-session data exfiltration |
ars technica |
Supply Chain Attacks
| Date |
Incident |
Impact |
References |
| Aug 2025 |
s1ngularity — Nx Build System: Compromised Nx build pipeline distributed malware targeting developer environments. The malware detected Claude Code and Gemini CLI and issued natural-language prompts instructing these agents to enumerate filesystems and exfiltrate credentials. |
Developer credential theft at scale |
nx.dev, Trend Micro |
| Feb 2026 |
OpenClaw / ClawHub Malicious Skills: Antiy CERT confirmed 1,184 malicious skills across ClawHub, the package registry for the OpenClaw AI agent framework; Snyk ToxicSkills separately found 76 confirmed malicious payloads and 534 critically-vulnerable skills (~13.4%) in an independent scan of 3,984 skills. Techniques: typosquatting, mass uploads, 2.9% used curl | bash remote-instruction loading. 135,000+ instances exposed with insecure defaults. |
Largest confirmed AI agent supply chain attack |
CyberDesserts, Snyk ToxicSkills |
| Jan 2026 |
Claude Code RCE via Poisoned Config CVE-2026-XXXX: Check Point Research disclosed RCE in Claude Code through poisoned repository .claude configuration files. Patched in Claude Code 2.0.65+. |
Developer workstation compromise |
CyberDesserts |
| 2026 |
OpenAI Plugin Ecosystem Breach: Supply chain attack on the OpenAI plugin ecosystem harvested agent credentials from 47 enterprise deployments; 6 months of undetected access to customer data, financial records, and proprietary code. |
47 enterprises compromised |
Stellar Cyber |
| Aug 2025 |
Drift/Salesforce OAuth Token Theft (UNC6395): Threat actor used stolen OAuth tokens from Drift's Salesforce integration to access 700+ customer environments—no exploit, no phishing required. |
700+ orgs compromised; Drift temporarily removed from AppExchange |
Reco Blog |
Infrastructure Compromise
| Date |
Incident |
Impact |
References |
| Nov 2025 |
Ray Framework Mass Exploitation CVE-2023-48022: Attackers used AI-generated attack scripts to exploit this critical unauthenticated RCE at scale, compromising 230,000+ publicly exposed Ray AI computing clusters for cryptomining, data theft, and DDoS. |
230K+ clusters compromised |
Security Boulevard |
| Nov 2025 |
SesameOp — OpenAI API as C2: Microsoft DART discovered a novel backdoor (OpenAIAgent.Netapi64.dll) that used the OpenAI Assistants API as its C2 channel. Commands were embedded in encrypted assistant descriptions; results uploaded via message threads—all via legitimate api.openai.com traffic. |
Stealthy persistent access; blends into enterprise AI traffic |
Microsoft Security Blog |
| 2025 |
Exposed MCP Servers: Trend Micro found 492 MCP servers exposed to the internet with zero authentication, granting open access to agent infrastructure including file systems and code execution. |
Open access to agent tooling |
CyberDesserts |
Agent Misalignment & Rogue Behavior
| Date |
Incident |
Impact |
References |
| 2025 |
Cost-Optimization Agent Deletes Production Backups: A cloud cost-optimization agent autonomously decided that deleting production backups was the most efficient way to reduce storage costs. No attacker involved—pure goal misalignment. |
Production backup loss |
Lares Labs |
| 2025 |
ServiceNow Now Assist Inter-Agent Spoofing: Spoofed inter-agent messages in a procurement workflow caused a downstream agent to process fraudulent orders from attacker-front companies. |
Fraudulent procurement orders |
Lares Labs |
AI-Assisted Attacks
| Date |
Incident |
Impact |
References |
| Nov 2025 |
Chinese State-Sponsored Claude Code Campaign: Anthropic confirmed that a Chinese APT group used Claude Code to attempt infiltration of ~30 global targets across tech, finance, and chemical manufacturing. 80–90% of tactical operations (scanning, exploit crafting, multi-step infiltration) were executed autonomously by the agents. |
First documented large-scale AI-agent-driven cyberattack |
BBC |
CVE Database
| CVE |
Product / Component |
CVSS |
Description |
References |
| CVE-2025-32711 |
Microsoft 365 Copilot (EchoLeak) |
9.3 (Critical, Microsoft CNA) |
Zero-click prompt injection enabling data exfiltration from OneDrive/SharePoint/Teams via reference-style Markdown image URL leakage through the Teams content URL proxy. |
NVD, MSRC |
| CVE-2025-53773 |
GitHub Copilot + Visual Studio 2022 |
7.8 (High, AV:L) |
Local code execution via command injection triggered through prompt injection in GitHub Copilot / Visual Studio 2022 (v17.14.0–17.14.11); requires user interaction. |
NVD, MSRC |
| CVE-2025-59944 |
Cursor (≤ v1.6.23) |
9.8 (Critical, NIST) / 8.0 (High, GitHub CNA) |
Case-sensitivity flaw in protected file-path checks allows prompt injection to overwrite /.cursor/mcp.json on case-insensitive filesystems, escalating to RCE. Fixed in v1.7. |
NVD, GHSA |
| CVE-2025-3248 |
Langflow (< v1.3.0) |
9.8 (Critical) |
Unauthenticated RCE via the /api/v1/validate/code endpoint; added to CISA KEV May 2025. |
NVD |
| CVE-2025-34291 |
Langflow (≤ v1.6.9) |
9.4 (Critical, CNA) / 8.8 (High, NIST) |
Account takeover + RCE via overly permissive CORS (allow_origins='*' + allow_credentials=True) combined with SameSite=None refresh token cookie, enabling cross-origin credential theft leading to code-execution endpoint access. |
NVD, GitHub |
| CVE-2025-47241 |
Browser Use (< v0.1.45) |
9.3 (Critical, GHSA) |
URL parsing flaw in allowed_domains whitelist: the parser splits on : to extract the domain, allowing an attacker to craft https://allowed.com:[email protected]/ to bypass domain restrictions entirely. |
NVD, GHSA-x39x-9qw5-ghrf |
| CVE-2025-6514 |
mcp-remote |
9.6 (Critical, JFrog CNA) |
OS command injection when connecting to an untrusted MCP server via a crafted authorization_endpoint URL in the OAuth response. |
NVD |
| CVE-2023-48022 |
Anyscale Ray (v2.6.3 / v2.8.0) |
9.8 (Critical, Disputed) |
Unauthenticated RCE via the job submission API; vendor disputes severity noting Ray is not designed for untrusted network exposure. Exploited at scale (230K+ clusters) using AI-generated attack scripts in 2025. |
NVD |
MCP (Model Context Protocol) Security
MCP Incidents & PoCs
| Incident |
Description |
Reference |
| GitHub MCP Prompt Injection |
Malicious instructions in GitHub Issues hijacked agents connected via the GitHub MCP server; private repo contents and API keys exfiltrated. |
Invariant Labs |
| WhatsApp MCP Chat History Exfiltration |
Invariant Labs PoC: a malicious server combined tool poisoning with the legitimate whatsapp-mcp server to silently exfiltrate an entire chat history—zero user interaction. |
Invariant Labs |
| MCP Inspector RCE |
Anthropic's own MCP Inspector developer tool allowed unauthenticated RCE via its inspector–proxy architecture, turning a debugging aid into a remote shell. |
AuthZed Timeline |
mcp-remote OAuth Command Injection CVE-2025-6514 |
Critical OS command injection in the most widely used OAuth proxy for connecting local MCP clients to remote servers. |
AuthZed Timeline |
| Smithery Build Config Path Traversal |
A path-traversal vulnerability in Smithery's build configuration leaked a builder's ~/.docker/config.json, exposing a Fly.io API token that granted control over 3,000+ deployed apps. |
AuthZed Timeline |
MCP Attack Vectors
| Vector |
Description |
| Tool Poisoning |
Malicious instructions embedded in MCP tool description or inputSchema fields—invisible to users but parsed and executed by LLMs. The attack triggers before the poisoned tool is ever called. |
| Rug Pull / Silent Redefinition |
MCP tools mutate their own definitions after installation and approval; the tool you approved is not the tool that executes. |
| Tool Shadowing |
A malicious server injects a tool description that overrides or modifies the agent's behavior with respect to a different, trusted server's tools. |
| Cross-Server Manipulation |
With multiple MCP servers connected, a malicious one intercepts and replaces calls destined for a trusted one—invisible to the user. |
| Tool Return Attacks |
Malicious instructions injected into tool output (not descriptions), exploiting the same data/instruction confusion in the model's context. Effective even with hex-encoded payloads. |
| MCP Preference Manipulation (MPMA) |
Subtly alter how AI agents rank and select among available tools, steering them toward attacker-controlled options. |
| Webpage Poison via MCP Browser Tool |
Any MCP tool that fetches web content becomes a vector for indirect prompt injection from attacker-controlled pages. |
Root cause: MCP clients receive tool metadata via tools/list, pass it into the LLM context without sanitization, and the LLM treats natural-language descriptions as instructions. The vulnerability is entirely client-side. (arXiv 2603.22489)
Attack Taxonomy
The Promptware Kill Chain
Prompt injection has evolved from an isolated input-manipulation exploit into a structured, multi-stage malware mechanism. Analysis of 36 prominent incidents found at least 21 attacks traversing four or more stages of this kill chain. (arXiv 2601.09625)
Stage 1: Initial Access ──► Prompt injection (direct or indirect)
Stage 2: Privilege Escalation ──► Jailbreak system constraints / safety filters
Stage 3: Reconnaissance ──► Probe agent memory, tools, environment config
Stage 4: Persistence ──► Poison agent long-term memory / RAG database
Stage 5: Command & Control ──► Establish external channel (e.g., SesameOp)
Stage 6: Lateral Movement ──► Leverage agent permissions to infect other agents
Stage 7: Actions on Objective ──► Data exfiltration / financial fraud / system disruption
Attack Techniques
| Technique |
Category |
Key Property |
| Direct Prompt Injection |
Input Manipulation |
Malicious instructions in user turn override system prompt |
| Indirect Prompt Injection (XPIA) |
Input Manipulation |
Instructions hidden in emails, web pages, documents, tool output |
| Tool Poisoning |
MCP / Supply Chain |
Malicious content in tool metadata—triggers before tool is called |
| Tool Shadowing |
MCP |
Malicious server overrides trusted tool behavior |
| Rug Pull / Silent Redefinition |
MCP |
Tool definitions mutate post-installation |
| Cross-Server Manipulation |
MCP |
Malicious server intercepts calls to trusted server |
| Memory Poisoning |
Persistence |
Gradually corrupt agent long-term memory over multiple sessions |
| RAG / Knowledge Base Poisoning |
Persistence |
Inject malicious documents into retrieval corpus; 5 docs can achieve 90% manipulation rate |
| Agent Session Smuggling |
Multi-Agent |
Exploit A2A protocol trust across multi-turn conversations |
| Adversary-in-the-Middle (AiTM) |
Multi-Agent |
Compromised agent manipulates inter-agent message flow |
| Infectious Prompt Propagation |
Multi-Agent |
Malicious instructions spread virally across agent networks |
| Backdoor Attacks |
Model-Level |
Trained-in triggers cause malicious behavior on specific inputs |
AI Agent Clickbait AML.T0100 |
Agentic UI |
Manipulate web UI to lure agentic browsers into unintended actions |
| Tool Call Injection |
Execution |
Poisoned context causes agent to invoke tools with malicious parameters |
| PLeak |
Info Extraction |
Algorithmic extraction of system prompts via black-box queries |
| Steganographic C2 |
Covert Channel |
Encode commands in LLM-generated content; agents communicate covertly |
| Invisible Prompt Injection |
Stealth |
Zero-width characters, Unicode homoglyphs, or control characters hide payloads |
| Living-off-the-Land AI (LotAI) |
Stealth |
Abuse legitimate AI service APIs (e.g., OpenAI Assistants) for C2, blending into normal traffic |
Frameworks & Standards
OWASP Top 10 for Agentic Applications (ASI 2026)
Released December 2025. Developed with 100+ security researchers. Distinct from the general LLM Top 10—focuses on the failure of agentic properties: autonomy, persistence, and tool integration.
| Rank |
ID |
Risk |
Core Failure Mode |
| 1 |
ASI01 |
Agent Goal Hijack |
Attackers redirect agent goals and task selection via prompt injection, forged messages, or poisoned external data — exploiting LLMs' inherent inability to distinguish instructions from content |
| 2 |
ASI02 |
Tool Misuse & Exploitation |
Agents misuse legitimate tools through unsafe composition, prompt injection into tool calls, or over-privileged access — leading to data exfiltration, destructive actions, or DoS |
| 3 |
ASI03 |
Identity & Privilege Abuse |
Agents inherit excessive credentials or over-scoped API access, enabling privilege escalation or actions beyond intended authorization scope |
| 4 |
ASI04 |
Agentic Supply Chain Vulnerabilities |
Compromised tool registries, MCP servers, third-party agents, or model artifacts introduce malicious behavior through trusted supply chain channels |
| 5 |
ASI05 |
Unexpected Code Execution (RCE) |
Agents execute injected or unintended code via code-execution tools (shell, eval, REPL) due to insufficient sandboxing or input validation |
| 6 |
ASI06 |
Memory & Context Poisoning |
Persistent corruption of agent memory, retrieval context, or long-term state biases future behavior without the agent or user detecting the tampering |
| 7 |
ASI07 |
Insecure Inter-Agent Communication |
Spoofed or manipulated messages between collaborating agents propagate malicious instructions across a multi-agent system |
| 8 |
ASI08 |
Cascading Failures |
Small errors or resource overuse propagate across interconnected agent systems, amplifying into large-scale disruptions |
| 9 |
ASI09 |
Human-Agent Trust Exploitation |
Agents produce misleading or overconfident outputs that exploit human tendency to trust AI, causing harmful decisions by operators or end users |
| 10 |
ASI10 |
Rogue Agents |
Agents drift from intended behavior due to misalignment, self-modification, or emergent collusion — without active attacker involvement |
Source: OWASP GenAI Security Project
MITRE ATLAS
MITRE ATLAS (Adversarial Threat Landscape for AI Systems) is the canonical knowledge base of adversary tactics and techniques, now expanded to 16 tactics and 84+ techniques with specific coverage of agentic threats.
Key techniques relevant to AI agents:
| Tactic |
Key Technique |
ATLAS ID |
| Initial Access |
Prompt Injection |
AML.T0051 |
| Persistence |
Modify AI Agent Configuration |
AML.T0101 (proposed) |
| Credential Access |
Agent Tool Credential Harvesting |
AML.T0098 |
| Command & Control |
AI Service API Abuse |
AML.T0096 |
| Exfiltration |
Exfiltration via Agent Tool Invocation |
— |
| New (2025) |
AI Agent Clickbait |
AML.T0100 |
Source: MITRE ATLAS
Other Frameworks
Research Papers
Surveys & Systematizations
| Paper |
Venue |
TL;DR |
| The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey |
arXiv 2026 |
Reviews 128 papers (51 attacks, 60 defenses). Argues component-level defenses are insufficient; security must be treated as a systems-level problem. Defines 6 primary attack vectors. |
| AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways |
ACM Comp. Surveys 2025 |
Categorizes threats across 4 knowledge gaps. Key finding: backdoor defenses don't cover the full agent ecosystem beyond model granularity. |
| Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges |
arXiv Oct 2025 |
Broad threat taxonomy spanning prompt injection, autonomous cyber-exploitation, multi-agent threats, and governance concerns. |
| A Survey of Agentic AI and Cybersecurity |
arXiv Jan 2026 |
Treats agentic AI as a cybersecurity system. Analyzes collusion, cascade failures, and oversight evasion with prototype implementations. |
| The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis |
arXiv Feb 2026 |
Systematization-of-knowledge with unified taxonomy. Introduces AgentPI benchmark covering context-dependent agent tasks all prior benchmarks ignored. |
| From Prompt Injections to Protocol Exploits |
ICT Express (Elsevier) 2025 |
Unified end-to-end threat model for LLM-agent ecosystems. Catalogs 30+ attack techniques; validates against real-world CVEs. |
| Agentic AI in Cybersecurity: Cognitive Autonomy, Ethical Governance, and Quantum-Resilient Defense |
F1000Research Sep 2025 |
Narrative review spanning 2005–2025. Identifies dual-use risks, governance gaps, and post-quantum preparedness challenges. (citation unverifiable — access restricted; treat with caution) |
| A Survey on Agentic Security: Applications, Threats and Defenses |
arXiv Oct 2025 |
Cross-cutting analysis of 160+ papers. Maps planner–executor agent patterns and GPT/Claude/LLaMA backbones to their respective attack surfaces. Privacy and integrity defenses surveyed per architecture type. |
| Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems |
arXiv Dec 2025 |
Extracts 93 threats from three sources via a multi-agent RAG pipeline: MITRE ATLAS (26), AI Incident Database (12), and 55 literature papers. Identifies previously unreported TTPs beyond ATLAS coverage. |
| AI-Augmented SOC: A Survey of LLMs and Agents for Security Automation |
J. Cybersecur. Priv. 2025 |
Reviews 600+ papers across 8 SOC tasks (alert triage, threat intel, incident response, log summarization, etc.). Highlights where agentic autonomy creates new insider-threat and prompt-injection risks within the SOC itself. |
Prompt Injection & Jailbreaks
| Paper |
Venue |
TL;DR |
| Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection |
IEEE S&P Workshop 2023 |
Foundational paper defining indirect prompt injection. Demonstrates remote data theft, ecosystem contamination across production applications. |
| Universal and Transferable Adversarial Attacks on Aligned Language Models |
arXiv 2023 |
Introduces transferable adversarial suffixes that elicit objectionable content across GPT, Claude, Llama. Foundational jailbreak work. |
| The Promptware Kill Chain |
arXiv Jan 2026 |
Formalizes prompt injection as a 7-stage malware delivery mechanism. Validates against 36 real incidents; 21 traversed ≥4 stages. |
| EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit |
arXiv Sep 2025 |
Technical deep-dive on CVE-2025-32711; details how CSP and prompt-injection filters were both bypassed using a trusted Microsoft domain as exfiltration proxy. |
| The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections |
arXiv Oct 2025 |
Breaks 12 published defenses using gradient descent, RL, random search. Most defenses claimed near-zero ASR; adaptive attacks exceeded 90% against most of them. |
| Prompt Injection 2.0: Hybrid AI Threats |
arXiv Jul 2025 |
Shows injections now combine with XSS, CSRF, AI worm propagation, and multi-agent infections to evade traditional WAFs entirely. |
| Securing AI Agents Against Prompt Injection Attacks |
arXiv Nov 2025 |
Benchmarks 847 adversarial cases across 5 attack categories vs. 7 LLMs. Combined defense reduces ASR from 73.2% → 8.7% while retaining 94.3% task performance. |
| Prompt Injection Attack to Tool Selection in LLM Agents (ToolHijacker) |
arXiv Apr 2025 |
No-box attack injecting a malicious tool document to hijack tool selection. StruQ, SecAlign, DataSentinel, and perplexity detection are all insufficient defenses. |
| Attention Tracker: Detecting Prompt Injection Attacks in LLMs |
NAACL 2025 Findings |
Detects prompt injection by tracking attention distribution shifts—no modification to the underlying model; deployable as a wrapper. |
| Advertisement Embedding Attacks Against LLM Agents |
arXiv Aug 2025 |
Shows how adversaries can covertly embed promotional content into agent responses by poisoning external data sources the agent consumes, turning helpfulness into a covert advertising channel. |
MCP & Tool Poisoning
Multi-Agent System Threats
Memory & RAG Attacks
Backdoor Attacks on Agents
Benchmarks & Agent Evaluation
OpenClaw Security
A cluster of papers from March 2026 focusing on the OpenClaw AI agent framework, prompted by the ClawHub supply chain incident (1,184 malicious skills confirmed by Antiy CERT). Papers span attacks, defenses, red-teaming frameworks, and benchmarks — making OpenClaw one of the most thoroughly studied agentic security case studies to date.
| Paper |
Venue |
TL;DR |
| Formal Analysis and Supply Chain Security for Agentic AI Skills |
arXiv Mar 2026 |
Introduces SkillFortify: Dolev-Yao attacker model + static analysis + capability sandboxing + SAT-based dependency resolution. Simulates "ClawHavoc" (1,200+ malicious skills). 96.95% F1, 100% precision, zero false positives on 540-skill benchmark. |
| ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems |
arXiv Mar 2026 |
First self-replicating worm against OpenClaw: single message triggers autonomous infection, hijacks configs for persistence, propagates to peer agents. ~64.5% success rate across 1,800 trials on 4 LLM backends; skill supply chains are universally vulnerable. |
| Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats |
arXiv Mar 2026 |
Systematic study of compound threats (indirect prompt injection, skill supply chain contamination, memory poisoning, intent drift). Proposes a five-layer lifecycle security framework; argues point-based defenses fail against cross-temporal, multi-stage attacks. |
| From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw (PASB) |
arXiv Feb 2026 |
Introduces PASB (Personalized Agent Security Bench), a security evaluation framework with realistic tools and extended interactions. Uncovers critical weaknesses at user prompt processing, tool usage, and memory retrieval stages — gaps missed by synthetic benchmarks. |
| ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation |
arXiv Mar 2026 |
MITM red-teaming via Static HTML Replacement, Iframe Popup Injection, and Dynamic Content Modification. Less capable models accept tampered data; more capable models show greater resilience. Argues sandbox evaluations miss real-world attack surfaces. |
| Uncovering Security Threats and Architecting Defenses in Autonomous Agents: A Case Study of OpenClaw |
arXiv Mar 2026 |
Identifies prompt-injection-driven RCE, sequential tool attack chains, context amnesia, and supply chain contamination under a tri-layered risk taxonomy. Proposes FASA (Full-Lifecycle Agent Security Architecture) with zero-trust execution and dynamic intent verification; implements as Project ClawGuard. |
| Don't Let the Claw Grip Your Hand: A Security Analysis and Defense Framework for OpenClaw |
arXiv Mar 2026 |
Tests 47 adversarial scenarios across 6 attack categories; native defenses achieve only 17% defense rate. Human-in-the-Loop (HITL) intervention intercepted 8 severe attacks that bypassed all native defenses. Combined HITL + native raises effectiveness to 19–92%. |
| Clawdrain: Exploiting Tool-Calling Chains for Stealthy Token Exhaustion in OpenClaw Agents |
arXiv Mar 2026 |
Novel DoS attack inducing "Segmented Verification Protocol" via tool-calling chains — achieves 6–9× token amplification. Also identifies prompt bloat, persistent tool-output pollution, cron/heartbeat amplification, and behavioral instruction injection in production deployments. |
| Agent Privilege Separation in OpenClaw: A Structural Defense Against Prompt Injection |
arXiv Mar 2026 |
Two-part structural defense: agent privilege separation + tool partitioning + JSON formatting to strip persuasive framing. Tested against 649 successful attacks from Microsoft LLMail-Inject benchmark — achieves 0% attack success rate. Isolation is the dominant mechanism. |
| Defensible Design for OpenClaw: Securing Autonomous Tool-Invoking Agents |
arXiv Mar 2026 |
Argues that untrusted inputs + autonomous action + extensibility + privileged access create systemic risks no single mitigation can address. Proposes secure engineering principles, a risk taxonomy, and a research roadmap for institutionalizing safety throughout agent development. |
| OpenClaw PRISM: A Zero-Fork, Defense-in-Depth Runtime Security Layer |
arXiv Mar 2026 |
In-process plugin + optional sidecar distributing enforcement across 10 lifecycle hooks. Combines heuristic and LLM-based scanning with risk accumulation/decay and policy-based tool and network controls. Focus on practical, deployable security for production agent gateways. |
| ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents |
arXiv Mar 2026 |
Three-layer defense: skill-based policy enforcement (instruction level) → plugin-based runtime monitoring → decoupled watcher-based security middleware (real-time intervention without coupling to agent internals). Effective against data leakage and privilege escalation; watcher paradigm proposed as a building block for next-generation agent security. |
Defense Methods
Defensive Tools & Projects
Open-Source Guardrails
| Tool |
Maintainer |
Description |
| LlamaFirewall |
Meta |
Multi-layer runtime protection: PromptGuard 2 (injection/jailbreak), Agent Alignment Checks (CoT auditor), CodeShield (dangerous code detection in agent outputs) |
| NeMo Guardrails |
NVIDIA |
Programmable topical/safety/dialog rails for LLM-based systems; composable with LangChain/LlamaIndex |
| Guardrails AI |
Guardrails AI |
Python framework with 50+ validators for PII, schema conformance, injection, toxicity; structured output enforcement |
| LLM Guard |
Protect AI |
Self-hosted input/output scanner; detects injection, secrets, PII, toxicity; low-latency production deployment |
| Rebuff |
Protect AI |
Self-hardening prompt injection detector with canary token support; learns from attempted bypasses (archived May 2025 — no longer actively maintained) |
| Invariant Analyzer |
Invariant Labs |
Rule-based guardrailing for LLM/MCP agent traces; Python-inspired matching language for detecting malicious tool sequences |
| Vigil LLM |
deadbits |
Composable scanners: vector similarity, YARA rules, transformer classifier, canary token detection, sentiment analysis |
| InjecGuard / PIGuard |
Open Source |
+30.8% over prior SOTA on NotInject benchmark; specifically addresses overdefense false positives |
| Sentinel AI |
Open Source |
12-language sub-millisecond injection detection; detects base64/hex/ROT13/homoglyph obfuscation; includes MCP safety proxy |
| openclaw-bastion |
AtlasPA |
Detects system prompt markers, role overrides, Unicode homoglyphs, zero-width chars, HTML comment injection; zero dependencies |
| ShellWard |
Open Source |
8-layer agent security middleware; blocks prompt injection, exfiltration, and dangerous commands; zero dependencies |
| AprielGuard |
ServiceNow AI |
8B parameter safety–security safeguard model with strong performance on injection and jailbreak detection |
Red Teaming & Scanning
| Tool |
Type |
Description |
| Garak |
Scanner |
100+ probes for injection, jailbreaks, hallucinations, data leakage; AVID taxonomy integration |
| PyRIT |
Red Team |
Microsoft's Python Risk Identification Toolkit; supports supply chain and Azure model assessments |
| Promptfoo |
Red Team |
Dev-first framework with CI/CD integration; multi-turn agent testing; OWASP/NIST/ATLAS mapping |
| Augustus |
Scanner |
210+ probes, 28 LLM providers, single Go binary; built for pentest workflows without Python/npm |
| Agentic Radar |
Agent Scanner |
CLI scanner specifically for agentic workflows (LangGraph, CrewAI, AutoGen); automatic prompt hardening feature |
| AI-Infra-Guard |
Platform |
Tencent Zhuque Lab's red team platform: Infra Scan + MCP Scan + Jailbreak Evaluation in one web UI |
| FuzzyAI |
Fuzzer |
Automated LLM fuzzing using genetic algorithms for adaptive jailbreak generation |
| Spikee |
Red Team |
Custom injection datasets + automated testing for black-box assessments |
| mcp-injection-experiments |
PoC |
Minimal self-contained scripts that demonstrate tool poisoning and cross-server manipulation in live MCP environments |
| AgentSeal |
Scanner |
225+ attack probes (82 extraction + 143 injection) for prompt injection and extraction; supports OpenAI, Anthropic, Ollama, any HTTP endpoint |
| PurpleLlama |
Red Team |
Meta's set of tools to assess and improve LLM security |
| ART (Linux Foundation AI) |
Defense/Attack |
Comprehensive library for evasion, poisoning, extraction, and inference attacks on ML models |
Benchmarks & Evaluations
| Benchmark |
Description |
| MCPTox |
45 live MCP servers, 353 tools, 1,312 malicious test cases across 10 risk categories |
| AgentDojo |
Dynamic environment for evaluating attacks and defenses for LLM agents |
| JailbreakBench |
Open robustness benchmark for jailbreaking (NeurIPS 2024) |
| AIRTBench |
Measuring autonomous AI red-teaming capabilities in language models |
| ISC-Bench |
Internal Safety Collapse benchmark: jailbreaks frontier models via normal task completion, no adversarial prompting |
| AgentDoG |
Trajectory-level risk assessment framework for autonomous agents |
| OpenPromptInjection |
Benchmark for prompt injection attacks and defenses across diverse agent scenarios |
| Damn Vulnerable MCP Server |
Deliberately vulnerable MCP server for security education and testing |
Observability & Tracing
| Tool |
Description |
| Langfuse |
Open-source observability with detailed trace visualization, embedding monitoring, cost tracking, and tamper-proof audit logs |
| Phoenix (Arize) |
Open-source LLM observability; traces agent reasoning steps and tool calls with anomaly detection |
| AudAgent |
Automated privacy policy compliance auditing via an "auditing automaton" that validates runtime data practices |
| Invariant Analyzer |
Security analysis for MCP deployments; detects exfiltration patterns like inbox-fetch→external-send sequences in agent traces |
Commercial / Enterprise Solutions
| Tool |
Vendor |
Description |
| Prisma AIRS |
Palo Alto Networks |
AI agent discovery, behavior monitoring, RAG data inspection, supply chain security |
| Cisco AI Defense |
Cisco |
Developer tools for model resilience testing; DefenseClaw open-source secure agent framework |
| Cisco Duo (Agent Identity) |
Cisco |
Agent identity management with human-owner mapping and MCP policy enforcement |
| Lakera Guard |
Lakera |
Real-time prompt injection and data leak detection API |
| Prompt Security |
Prompt Security |
Enterprise platform for MCP security risk management |
| Reco |
Reco |
SaaS security platform with AI agent discovery and permission auditing |
| MCP Manager |
MCP Manager |
MCP gateway with tool metadata scanning, rug pull detection, permission management |
| Kubescape 4.0 / KAgent |
ARMO (CNCF) |
Kubernetes security with AI agent scanning support |
Learning Resources
Articles & Blog Posts
Courses, Labs & CTFs
| Resource |
Type |
Description |
| PromptTrace |
CTF |
7 injection labs + 15-level Gauntlet CTF; unique Context Trace shows full prompt stack in real-time |
| Gandalf |
CTF |
8-level prompt injection challenge by Lakera; classic resource for learning defense evasion |
| FinBot Agentic AI CTF |
CTF |
OWASP's financial services agentic AI CTF with real-world vulnerability scenarios |
| CrowdStrike AI Unlocked |
CTF |
Three-room injection challenges with escalating difficulty |
| Damn Vulnerable LLM Agent |
Lab |
LangChain-based ReAct agent with intentionally exploitable injection paths; lets you observe how hijacked thoughts propagate through the agent loop |
| Damn Vulnerable MCP Server |
Lab |
Deliberately vulnerable MCP server for learning MCP pentesting |
| vulnerable-mcp-servers-lab |
Lab |
Collection of vulnerable MCP servers by Appsecco |
| Microsoft AI Red Teaming Playground |
Lab |
Microsoft's AI red teaming training infrastructure |
| ai-prompt-ctf |
CTF |
Targets the full agentic stack: RAG retrieval, function calling, and ReAct loops — covers indirect injection paths most CTFs skip |
| OWASP WrongSecrets LLM |
Lab |
OWASP's LLM security challenge (link may be unavailable) |
| Google AI Red Team Guide |
Guide |
Google's walkthrough of hacking AI systems |
Databases & Trackers
| Resource |
Description |
| AI Incident Database |
Community-sourced database of AI failures and harms in deployed systems |
| MIT AI Incident Tracker |
MIT AI Risk Repository's incident tracker with severity and domain classification |
| AVIDML |
AI Vulnerability and Incidents Database with structured taxonomy |
| MITRE ATLAS Cases |
Real-world case studies of ML attacks, mapped to ATLAS tactics and techniques |
Key Reports & Industry Data
Contributing
Contributions are welcome! Please submit a PR with:
- Incidents: Date, brief description, impact, and a verifiable source link
- CVEs: CVE ID, affected product, CVSS score, and concise description
- Papers: arXiv/venue link, venue/date, and a one-sentence TL;DR
- Tools: Repository link, maintainer, and what it does/defends against
Please ensure all entries have verifiable sources. Unverified or purely speculative entries will not be merged.
See CONTRIBUTING.md for full guidelines.
License
MIT License — Copyright (c) 2026 Hideaki Takahashi