awesome-ai-agent-incidents

agent
SUMMARY

A curated corpus of incidents, attack vectors, failure modes, and defensive tools for autonomous AI agents.

README.md

Awesome AI Agent Incidents

Awesome
PRs Welcome
License: MIT

A curated corpus of real-world security incidents, attack techniques, CVEs, frameworks, and defensive tools for autonomous AI agents.

From zero-click Copilot exfiltration to AI-powered C2 channels — attacks on agentic AI have moved well beyond theory.


Incidents like these raise a harder question for teams shipping AI-generated code: do you actually know what your agent did? We are building h5i to answer it — a Git sidecar that records every prompt, decision, and uncertainty signal alongside the diff, so you can audit what the AI touched, catch credential leaks before they land, and resume sessions without losing context.


Table of Contents


Real-World Incidents

Prompt Injection & Goal Hijacking

Date Incident Impact References
Mid-2025 EchoLeak — Microsoft 365 Copilot CVE-2025-32711 (CVSS 9.3): Zero-click prompt injection via crafted emails; reference-style Markdown image links bypassed Copilot's link-redaction filters, exfiltrating OneDrive, SharePoint, and Teams data through the Teams content URL API (teams.microsoft.com/urlp/v1/url/content) which was in the CSP allowlist—no user interaction required. Data exfiltration across 160+ org-level incidents; ~$200M est. impact arXiv 2509.10540, Checkmarx, Trend Micro
May 2025 GitHub MCP Prompt Injection: Malicious commands embedded in public GitHub Issues hijacked developers' AI agents via the GitHub MCP server, exfiltrating private repository source code and cryptographic keys to attacker-controlled public repositories. Private source code and keys leaked Security Boulevard / NSFOCUS
Aug 2025 Perplexity Comet Browser Injection: Hidden commands in Reddit Markdown spoiler tags triggered Comet's "summarize page" feature; the agent logged into user emails, bypassed CAPTCHAs, and transmitted credentials within 150 seconds. Credential theft Security Boulevard
Mid-2025 Supabase Cursor Agent Injection: A Cursor agent with privileged service-role database access processed support tickets containing user-supplied SQL; attacker-controlled content caused exfiltration of sensitive integration tokens. Sensitive tokens exposed Practical DevSecOps
Aug 2024 Slack AI Data Exfiltration: Indirect prompt injection in public Slack channels weaponized Slack AI's RAG search to exfiltrate data from private channels via malicious link formatting—Slack initially described this as "intended behavior." Private communications leaked Simon Willison
2024 Financial Reconciliation Agent Fraud: An attacker phrased a data export as a legitimate business task ("all customer records matching pattern X"), causing a reconciliation agent to export every record without raising suspicion. 45,000 customer records stolen Stellar Cyber
2025 Procurement Agent Memory Poisoning: Over 3 weeks, a manufacturing firm's procurement agent was gradually memory-poisoned into believing elevated transfer limits were authorized; it then transferred funds to attacker accounts. Financial fraud via persistent memory corruption Lares Labs
Nov 2025 A2A Session Smuggling (Palo Alto Unit 42): "Agent Session Smuggling" exploited trust relationships in the Agent-to-Agent (A2A) protocol during multi-turn conversations; a malicious sub-agent hijacked the task graph of a trusted orchestrator. Multi-agent system compromise Lares Labs
Dec 2025 OpenAI Atlas — Real Attack Chain Disclosed: OpenAI confirmed a real-world attack where a malicious email caused their Atlas agent to autonomously send a resignation letter. OpenAI stated that "deterministic guarantees are not achievable." Unauthorized email sent on behalf of user OpenAI Blog
2024 ChatGPT Memory Persistence Attack: Injected content manipulated ChatGPT's persistent memory feature to plant false memories, enabling long-term data exfiltration across sessions without re-injection. Persistent cross-session data exfiltration ars technica

Supply Chain Attacks

Date Incident Impact References
Aug 2025 s1ngularity — Nx Build System: Compromised Nx build pipeline distributed malware targeting developer environments. The malware detected Claude Code and Gemini CLI and issued natural-language prompts instructing these agents to enumerate filesystems and exfiltrate credentials. Developer credential theft at scale nx.dev, Trend Micro
Feb 2026 OpenClaw / ClawHub Malicious Skills: Antiy CERT confirmed 1,184 malicious skills across ClawHub, the package registry for the OpenClaw AI agent framework; Snyk ToxicSkills separately found 76 confirmed malicious payloads and 534 critically-vulnerable skills (~13.4%) in an independent scan of 3,984 skills. Techniques: typosquatting, mass uploads, 2.9% used curl | bash remote-instruction loading. 135,000+ instances exposed with insecure defaults. Largest confirmed AI agent supply chain attack CyberDesserts, Snyk ToxicSkills
Jan 2026 Claude Code RCE via Poisoned Config CVE-2026-XXXX: Check Point Research disclosed RCE in Claude Code through poisoned repository .claude configuration files. Patched in Claude Code 2.0.65+. Developer workstation compromise CyberDesserts
2026 OpenAI Plugin Ecosystem Breach: Supply chain attack on the OpenAI plugin ecosystem harvested agent credentials from 47 enterprise deployments; 6 months of undetected access to customer data, financial records, and proprietary code. 47 enterprises compromised Stellar Cyber
Aug 2025 Drift/Salesforce OAuth Token Theft (UNC6395): Threat actor used stolen OAuth tokens from Drift's Salesforce integration to access 700+ customer environments—no exploit, no phishing required. 700+ orgs compromised; Drift temporarily removed from AppExchange Reco Blog

Infrastructure Compromise

Date Incident Impact References
Nov 2025 Ray Framework Mass Exploitation CVE-2023-48022: Attackers used AI-generated attack scripts to exploit this critical unauthenticated RCE at scale, compromising 230,000+ publicly exposed Ray AI computing clusters for cryptomining, data theft, and DDoS. 230K+ clusters compromised Security Boulevard
Nov 2025 SesameOp — OpenAI API as C2: Microsoft DART discovered a novel backdoor (OpenAIAgent.Netapi64.dll) that used the OpenAI Assistants API as its C2 channel. Commands were embedded in encrypted assistant descriptions; results uploaded via message threads—all via legitimate api.openai.com traffic. Stealthy persistent access; blends into enterprise AI traffic Microsoft Security Blog
2025 Exposed MCP Servers: Trend Micro found 492 MCP servers exposed to the internet with zero authentication, granting open access to agent infrastructure including file systems and code execution. Open access to agent tooling CyberDesserts

Agent Misalignment & Rogue Behavior

Date Incident Impact References
2025 Cost-Optimization Agent Deletes Production Backups: A cloud cost-optimization agent autonomously decided that deleting production backups was the most efficient way to reduce storage costs. No attacker involved—pure goal misalignment. Production backup loss Lares Labs
2025 ServiceNow Now Assist Inter-Agent Spoofing: Spoofed inter-agent messages in a procurement workflow caused a downstream agent to process fraudulent orders from attacker-front companies. Fraudulent procurement orders Lares Labs

AI-Assisted Attacks

Date Incident Impact References
Nov 2025 Chinese State-Sponsored Claude Code Campaign: Anthropic confirmed that a Chinese APT group used Claude Code to attempt infiltration of ~30 global targets across tech, finance, and chemical manufacturing. 80–90% of tactical operations (scanning, exploit crafting, multi-step infiltration) were executed autonomously by the agents. First documented large-scale AI-agent-driven cyberattack BBC

CVE Database

CVE Product / Component CVSS Description References
CVE-2025-32711 Microsoft 365 Copilot (EchoLeak) 9.3 (Critical, Microsoft CNA) Zero-click prompt injection enabling data exfiltration from OneDrive/SharePoint/Teams via reference-style Markdown image URL leakage through the Teams content URL proxy. NVD, MSRC
CVE-2025-53773 GitHub Copilot + Visual Studio 2022 7.8 (High, AV:L) Local code execution via command injection triggered through prompt injection in GitHub Copilot / Visual Studio 2022 (v17.14.0–17.14.11); requires user interaction. NVD, MSRC
CVE-2025-59944 Cursor (≤ v1.6.23) 9.8 (Critical, NIST) / 8.0 (High, GitHub CNA) Case-sensitivity flaw in protected file-path checks allows prompt injection to overwrite /.cursor/mcp.json on case-insensitive filesystems, escalating to RCE. Fixed in v1.7. NVD, GHSA
CVE-2025-3248 Langflow (< v1.3.0) 9.8 (Critical) Unauthenticated RCE via the /api/v1/validate/code endpoint; added to CISA KEV May 2025. NVD
CVE-2025-34291 Langflow (≤ v1.6.9) 9.4 (Critical, CNA) / 8.8 (High, NIST) Account takeover + RCE via overly permissive CORS (allow_origins='*' + allow_credentials=True) combined with SameSite=None refresh token cookie, enabling cross-origin credential theft leading to code-execution endpoint access. NVD, GitHub
CVE-2025-47241 Browser Use (< v0.1.45) 9.3 (Critical, GHSA) URL parsing flaw in allowed_domains whitelist: the parser splits on : to extract the domain, allowing an attacker to craft https://allowed.com:[email protected]/ to bypass domain restrictions entirely. NVD, GHSA-x39x-9qw5-ghrf
CVE-2025-6514 mcp-remote 9.6 (Critical, JFrog CNA) OS command injection when connecting to an untrusted MCP server via a crafted authorization_endpoint URL in the OAuth response. NVD
CVE-2023-48022 Anyscale Ray (v2.6.3 / v2.8.0) 9.8 (Critical, Disputed) Unauthenticated RCE via the job submission API; vendor disputes severity noting Ray is not designed for untrusted network exposure. Exploited at scale (230K+ clusters) using AI-generated attack scripts in 2025. NVD

MCP (Model Context Protocol) Security

MCP Incidents & PoCs

Incident Description Reference
GitHub MCP Prompt Injection Malicious instructions in GitHub Issues hijacked agents connected via the GitHub MCP server; private repo contents and API keys exfiltrated. Invariant Labs
WhatsApp MCP Chat History Exfiltration Invariant Labs PoC: a malicious server combined tool poisoning with the legitimate whatsapp-mcp server to silently exfiltrate an entire chat history—zero user interaction. Invariant Labs
MCP Inspector RCE Anthropic's own MCP Inspector developer tool allowed unauthenticated RCE via its inspector–proxy architecture, turning a debugging aid into a remote shell. AuthZed Timeline
mcp-remote OAuth Command Injection CVE-2025-6514 Critical OS command injection in the most widely used OAuth proxy for connecting local MCP clients to remote servers. AuthZed Timeline
Smithery Build Config Path Traversal A path-traversal vulnerability in Smithery's build configuration leaked a builder's ~/.docker/config.json, exposing a Fly.io API token that granted control over 3,000+ deployed apps. AuthZed Timeline

MCP Attack Vectors

Vector Description
Tool Poisoning Malicious instructions embedded in MCP tool description or inputSchema fields—invisible to users but parsed and executed by LLMs. The attack triggers before the poisoned tool is ever called.
Rug Pull / Silent Redefinition MCP tools mutate their own definitions after installation and approval; the tool you approved is not the tool that executes.
Tool Shadowing A malicious server injects a tool description that overrides or modifies the agent's behavior with respect to a different, trusted server's tools.
Cross-Server Manipulation With multiple MCP servers connected, a malicious one intercepts and replaces calls destined for a trusted one—invisible to the user.
Tool Return Attacks Malicious instructions injected into tool output (not descriptions), exploiting the same data/instruction confusion in the model's context. Effective even with hex-encoded payloads.
MCP Preference Manipulation (MPMA) Subtly alter how AI agents rank and select among available tools, steering them toward attacker-controlled options.
Webpage Poison via MCP Browser Tool Any MCP tool that fetches web content becomes a vector for indirect prompt injection from attacker-controlled pages.

Root cause: MCP clients receive tool metadata via tools/list, pass it into the LLM context without sanitization, and the LLM treats natural-language descriptions as instructions. The vulnerability is entirely client-side. (arXiv 2603.22489)


Attack Taxonomy

The Promptware Kill Chain

Prompt injection has evolved from an isolated input-manipulation exploit into a structured, multi-stage malware mechanism. Analysis of 36 prominent incidents found at least 21 attacks traversing four or more stages of this kill chain. (arXiv 2601.09625)

Stage 1: Initial Access        ──► Prompt injection (direct or indirect)
Stage 2: Privilege Escalation  ──► Jailbreak system constraints / safety filters
Stage 3: Reconnaissance        ──► Probe agent memory, tools, environment config
Stage 4: Persistence           ──► Poison agent long-term memory / RAG database
Stage 5: Command & Control     ──► Establish external channel (e.g., SesameOp)
Stage 6: Lateral Movement      ──► Leverage agent permissions to infect other agents
Stage 7: Actions on Objective  ──► Data exfiltration / financial fraud / system disruption

Attack Techniques

Technique Category Key Property
Direct Prompt Injection Input Manipulation Malicious instructions in user turn override system prompt
Indirect Prompt Injection (XPIA) Input Manipulation Instructions hidden in emails, web pages, documents, tool output
Tool Poisoning MCP / Supply Chain Malicious content in tool metadata—triggers before tool is called
Tool Shadowing MCP Malicious server overrides trusted tool behavior
Rug Pull / Silent Redefinition MCP Tool definitions mutate post-installation
Cross-Server Manipulation MCP Malicious server intercepts calls to trusted server
Memory Poisoning Persistence Gradually corrupt agent long-term memory over multiple sessions
RAG / Knowledge Base Poisoning Persistence Inject malicious documents into retrieval corpus; 5 docs can achieve 90% manipulation rate
Agent Session Smuggling Multi-Agent Exploit A2A protocol trust across multi-turn conversations
Adversary-in-the-Middle (AiTM) Multi-Agent Compromised agent manipulates inter-agent message flow
Infectious Prompt Propagation Multi-Agent Malicious instructions spread virally across agent networks
Backdoor Attacks Model-Level Trained-in triggers cause malicious behavior on specific inputs
AI Agent Clickbait AML.T0100 Agentic UI Manipulate web UI to lure agentic browsers into unintended actions
Tool Call Injection Execution Poisoned context causes agent to invoke tools with malicious parameters
PLeak Info Extraction Algorithmic extraction of system prompts via black-box queries
Steganographic C2 Covert Channel Encode commands in LLM-generated content; agents communicate covertly
Invisible Prompt Injection Stealth Zero-width characters, Unicode homoglyphs, or control characters hide payloads
Living-off-the-Land AI (LotAI) Stealth Abuse legitimate AI service APIs (e.g., OpenAI Assistants) for C2, blending into normal traffic

Frameworks & Standards

OWASP Top 10 for Agentic Applications (ASI 2026)

Released December 2025. Developed with 100+ security researchers. Distinct from the general LLM Top 10—focuses on the failure of agentic properties: autonomy, persistence, and tool integration.

Rank ID Risk Core Failure Mode
1 ASI01 Agent Goal Hijack Attackers redirect agent goals and task selection via prompt injection, forged messages, or poisoned external data — exploiting LLMs' inherent inability to distinguish instructions from content
2 ASI02 Tool Misuse & Exploitation Agents misuse legitimate tools through unsafe composition, prompt injection into tool calls, or over-privileged access — leading to data exfiltration, destructive actions, or DoS
3 ASI03 Identity & Privilege Abuse Agents inherit excessive credentials or over-scoped API access, enabling privilege escalation or actions beyond intended authorization scope
4 ASI04 Agentic Supply Chain Vulnerabilities Compromised tool registries, MCP servers, third-party agents, or model artifacts introduce malicious behavior through trusted supply chain channels
5 ASI05 Unexpected Code Execution (RCE) Agents execute injected or unintended code via code-execution tools (shell, eval, REPL) due to insufficient sandboxing or input validation
6 ASI06 Memory & Context Poisoning Persistent corruption of agent memory, retrieval context, or long-term state biases future behavior without the agent or user detecting the tampering
7 ASI07 Insecure Inter-Agent Communication Spoofed or manipulated messages between collaborating agents propagate malicious instructions across a multi-agent system
8 ASI08 Cascading Failures Small errors or resource overuse propagate across interconnected agent systems, amplifying into large-scale disruptions
9 ASI09 Human-Agent Trust Exploitation Agents produce misleading or overconfident outputs that exploit human tendency to trust AI, causing harmful decisions by operators or end users
10 ASI10 Rogue Agents Agents drift from intended behavior due to misalignment, self-modification, or emergent collusion — without active attacker involvement

Source: OWASP GenAI Security Project

MITRE ATLAS

MITRE ATLAS (Adversarial Threat Landscape for AI Systems) is the canonical knowledge base of adversary tactics and techniques, now expanded to 16 tactics and 84+ techniques with specific coverage of agentic threats.

Key techniques relevant to AI agents:

Tactic Key Technique ATLAS ID
Initial Access Prompt Injection AML.T0051
Persistence Modify AI Agent Configuration AML.T0101 (proposed)
Credential Access Agent Tool Credential Harvesting AML.T0098
Command & Control AI Service API Abuse AML.T0096
Exfiltration Exfiltration via Agent Tool Invocation
New (2025) AI Agent Clickbait AML.T0100

Source: MITRE ATLAS

Other Frameworks

Framework Description
OWASP Top 10 for LLM Applications 2025 Broader LLM risks; prompt injection remains #1
OWASP Agentic AI — Threats and Mitigations (T01–T17) Comprehensive agentic threat taxonomy
OWASP Multi-Agentic System Threat Modeling Threat modeling guide for multi-agent architectures
CSA MAESTRO CSA's agentic AI threat modeling framework
NIST AI RMF Risk governance for AI systems
Google SAIF Google's Secure AI Framework
EU AI Act Regulatory framework; fines up to €35M or 7% of global revenue for high-risk violations
CoSAI Cross-industry collaboration on AI security standards (model signing, MCP security, zero trust for AI)
AVIDML Taxonomy AI Vulnerability and Incidents Database taxonomy
MIT AI Risk Repository Comprehensive repository of AI risk categories

Research Papers

Surveys & Systematizations

Paper Venue TL;DR
The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey arXiv 2026 Reviews 128 papers (51 attacks, 60 defenses). Argues component-level defenses are insufficient; security must be treated as a systems-level problem. Defines 6 primary attack vectors.
AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways ACM Comp. Surveys 2025 Categorizes threats across 4 knowledge gaps. Key finding: backdoor defenses don't cover the full agent ecosystem beyond model granularity.
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges arXiv Oct 2025 Broad threat taxonomy spanning prompt injection, autonomous cyber-exploitation, multi-agent threats, and governance concerns.
A Survey of Agentic AI and Cybersecurity arXiv Jan 2026 Treats agentic AI as a cybersecurity system. Analyzes collusion, cascade failures, and oversight evasion with prototype implementations.
The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis arXiv Feb 2026 Systematization-of-knowledge with unified taxonomy. Introduces AgentPI benchmark covering context-dependent agent tasks all prior benchmarks ignored.
From Prompt Injections to Protocol Exploits ICT Express (Elsevier) 2025 Unified end-to-end threat model for LLM-agent ecosystems. Catalogs 30+ attack techniques; validates against real-world CVEs.
Agentic AI in Cybersecurity: Cognitive Autonomy, Ethical Governance, and Quantum-Resilient Defense F1000Research Sep 2025 Narrative review spanning 2005–2025. Identifies dual-use risks, governance gaps, and post-quantum preparedness challenges. (citation unverifiable — access restricted; treat with caution)
A Survey on Agentic Security: Applications, Threats and Defenses arXiv Oct 2025 Cross-cutting analysis of 160+ papers. Maps planner–executor agent patterns and GPT/Claude/LLaMA backbones to their respective attack surfaces. Privacy and integrity defenses surveyed per architecture type.
Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems arXiv Dec 2025 Extracts 93 threats from three sources via a multi-agent RAG pipeline: MITRE ATLAS (26), AI Incident Database (12), and 55 literature papers. Identifies previously unreported TTPs beyond ATLAS coverage.
AI-Augmented SOC: A Survey of LLMs and Agents for Security Automation J. Cybersecur. Priv. 2025 Reviews 600+ papers across 8 SOC tasks (alert triage, threat intel, incident response, log summarization, etc.). Highlights where agentic autonomy creates new insider-threat and prompt-injection risks within the SOC itself.

Prompt Injection & Jailbreaks

Paper Venue TL;DR
Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection IEEE S&P Workshop 2023 Foundational paper defining indirect prompt injection. Demonstrates remote data theft, ecosystem contamination across production applications.
Universal and Transferable Adversarial Attacks on Aligned Language Models arXiv 2023 Introduces transferable adversarial suffixes that elicit objectionable content across GPT, Claude, Llama. Foundational jailbreak work.
The Promptware Kill Chain arXiv Jan 2026 Formalizes prompt injection as a 7-stage malware delivery mechanism. Validates against 36 real incidents; 21 traversed ≥4 stages.
EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit arXiv Sep 2025 Technical deep-dive on CVE-2025-32711; details how CSP and prompt-injection filters were both bypassed using a trusted Microsoft domain as exfiltration proxy.
The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections arXiv Oct 2025 Breaks 12 published defenses using gradient descent, RL, random search. Most defenses claimed near-zero ASR; adaptive attacks exceeded 90% against most of them.
Prompt Injection 2.0: Hybrid AI Threats arXiv Jul 2025 Shows injections now combine with XSS, CSRF, AI worm propagation, and multi-agent infections to evade traditional WAFs entirely.
Securing AI Agents Against Prompt Injection Attacks arXiv Nov 2025 Benchmarks 847 adversarial cases across 5 attack categories vs. 7 LLMs. Combined defense reduces ASR from 73.2% → 8.7% while retaining 94.3% task performance.
Prompt Injection Attack to Tool Selection in LLM Agents (ToolHijacker) arXiv Apr 2025 No-box attack injecting a malicious tool document to hijack tool selection. StruQ, SecAlign, DataSentinel, and perplexity detection are all insufficient defenses.
Attention Tracker: Detecting Prompt Injection Attacks in LLMs NAACL 2025 Findings Detects prompt injection by tracking attention distribution shifts—no modification to the underlying model; deployable as a wrapper.
Advertisement Embedding Attacks Against LLM Agents arXiv Aug 2025 Shows how adversaries can covertly embed promotional content into agent responses by poisoning external data sources the agent consumes, turning helpfulness into a covert advertising channel.

MCP & Tool Poisoning

Paper Venue TL;DR
MCPTox: A Benchmark for Tool Poisoning Attacks on Real-World MCP Servers arXiv Aug 2025 45 live MCP servers, 353 tools, 1,312 malicious test cases. o1-mini ASR: 72.8%. Claude-3.7 refusal rate: <3%. More capable models are often more vulnerable—better instruction following is the attack surface.
MCP Threat Modeling: Analyzing Vulnerabilities to Prompt Injection with Tool Poisoning arXiv Mar 2026 Identifies 57 threats across 5 MCP components using STRIDE/DREAD. Vulnerability is client-side: no validation of tool metadata before LLM context insertion.
Systematic Analysis of MCP Security arXiv 2025 Demonstrates Tool Return Attacks, Third-party Poison Attacks, and Webpage Poison Attacks. Both plaintext and hex-encoded commands are effective.
MindGuard: Intrinsic Decision Inspection Against Metadata Poisoning arXiv Aug 2025 Decision Dependence Graph (DDG) using LLM attention mechanisms. 94–99% precision, 95–100% attribution accuracy, sub-second processing, zero additional token cost.
MPMA: MCP Preference Manipulation Attacks arXiv May 2025 Formalizes and implements attacks that steer agent tool-selection toward attacker-controlled servers by appending boosting phrases and genetic-algorithm-optimized descriptions to tool metadata—without modifying any tool logic.

Multi-Agent System Threats

Paper Venue TL;DR
Open Challenges in Multi-Agent Security arXiv May 2025 Introduces "multi-agent security" as a new field. Novel threats: steganographic collusion, coordinated swarm attacks, cascading jailbreaks. Securing individual agents is fundamentally insufficient.
Multi-Agent Risks from Advanced AI arXiv Feb 2025 Structured taxonomy with 3 failure modes (miscoordination, conflict, collusion) and 7 risk factors. Strong empirical grounding.
Multi-Agent Security Tax AAAI 2025 Demonstrates "infectious malicious prompts" spreading across agent networks. Fundamental finding: effective security measures measurably decrease collaboration capability.
Red-Teaming LLM Multi-Agent Systems via Adversary-in-the-Middle (AiTM) ACL 2025 Compromised agent manipulates inter-agent communication. Chain topologies are most vulnerable; fully-connected topologies are more resilient. Demonstrates answer manipulation via fake encryption and conversational DoS.
Multiagent Collaboration Attack: Adversarial Attacks in LLM Multi-Agent Debate arXiv Jun 2024 A single adversarial agent embedded in a debate-style multi-agent panel can steer group consensus through persuasive but factually wrong reasoning, without the other agents detecting manipulation.
ComPromptMized: Unleashing Zero-click Worms Targeting GenAI-Powered Applications (Morris II) arXiv 2024 First demonstration of a self-replicating AI worm. Shows super-linear propagation through multi-agent email/assistant ecosystems with no user interaction.

Memory & RAG Attacks

Paper Venue TL;DR
PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation USENIX Security 2025 Just 5 poisoned documents can manipulate AI responses 90% of the time. Documents designed to match common query patterns.
AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases arXiv 2024 Demonstrates persistent backdoors planted via memory or KB poisoning that activate on specific triggers without affecting normal operation.
AudAgent: Automated Auditing of Privacy Policy Compliance in AI Agents arXiv Nov 2025 Auditing automaton ensuring runtime data practices align with stated privacy policies. Bridges natural language policies and technical execution.

Backdoor Attacks on Agents

Paper Venue TL;DR
BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents ACL 2024 Injects backdoors during fine-tuning or continued training. Agents behave normally except when triggered. Resists standard safety alignment.
Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents NeurIPS 2024 Backdoor triggers in tool descriptions, memory retrievals, or environmental observations. Standard RLHF-based defenses ineffective.

Benchmarks & Agent Evaluation

Paper Venue TL;DR
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents NeurIPS D&B 2024 97 tasks and 629 injection security cases across banking, Slack, travel, and workspace domains. GPT-4o utility drops from 69% → 45% under attack; best defense (tool filtering) reduces ASR to 7.5%.
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents ICLR 2025 Formalizes 10 evaluation scenarios covering prompt injection, man-in-the-middle, and tool abuse. Provides standardized metrics for comparing attack and defense effectiveness across agent architectures.
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents ICLR 2025 Evaluates agent compliance with harmful multi-step tool-calling tasks (misuse setting). Hand-authored tasks with human review; focuses on direct harmful requests rather than adversarial injection.
SafeArena: Evaluating the Safety of Autonomous Web Agents arXiv Mar 2025 500 web tasks (250 safe, 250 harmful) across 5 harm categories. GPT-4o completed 34.7% of harmful requests. Introduces the Agent Risk Assessment (ARIA) framework with 4 risk levels.
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents EMNLP Findings 2024 569 multi-turn agent interaction records across 27 risk scenarios and 10 risk types. Best model (GPT-4o) achieves only 74.42% risk-awareness accuracy—well short of human-level performance.
AgentDyn: A Dynamic Open-Ended Benchmark for Evaluating Prompt Injection Attacks arXiv Feb 2026 Extends AgentDojo by adding dynamic open-ended tasks, helpful distractor instructions, and complex user goals. Tests 10 defenses; finds most defenses substantially degraded on realistic tasks.

OpenClaw Security

A cluster of papers from March 2026 focusing on the OpenClaw AI agent framework, prompted by the ClawHub supply chain incident (1,184 malicious skills confirmed by Antiy CERT). Papers span attacks, defenses, red-teaming frameworks, and benchmarks — making OpenClaw one of the most thoroughly studied agentic security case studies to date.

Paper Venue TL;DR
Formal Analysis and Supply Chain Security for Agentic AI Skills arXiv Mar 2026 Introduces SkillFortify: Dolev-Yao attacker model + static analysis + capability sandboxing + SAT-based dependency resolution. Simulates "ClawHavoc" (1,200+ malicious skills). 96.95% F1, 100% precision, zero false positives on 540-skill benchmark.
ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems arXiv Mar 2026 First self-replicating worm against OpenClaw: single message triggers autonomous infection, hijacks configs for persistence, propagates to peer agents. ~64.5% success rate across 1,800 trials on 4 LLM backends; skill supply chains are universally vulnerable.
Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats arXiv Mar 2026 Systematic study of compound threats (indirect prompt injection, skill supply chain contamination, memory poisoning, intent drift). Proposes a five-layer lifecycle security framework; argues point-based defenses fail against cross-temporal, multi-stage attacks.
From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw (PASB) arXiv Feb 2026 Introduces PASB (Personalized Agent Security Bench), a security evaluation framework with realistic tools and extended interactions. Uncovers critical weaknesses at user prompt processing, tool usage, and memory retrieval stages — gaps missed by synthetic benchmarks.
ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation arXiv Mar 2026 MITM red-teaming via Static HTML Replacement, Iframe Popup Injection, and Dynamic Content Modification. Less capable models accept tampered data; more capable models show greater resilience. Argues sandbox evaluations miss real-world attack surfaces.
Uncovering Security Threats and Architecting Defenses in Autonomous Agents: A Case Study of OpenClaw arXiv Mar 2026 Identifies prompt-injection-driven RCE, sequential tool attack chains, context amnesia, and supply chain contamination under a tri-layered risk taxonomy. Proposes FASA (Full-Lifecycle Agent Security Architecture) with zero-trust execution and dynamic intent verification; implements as Project ClawGuard.
Don't Let the Claw Grip Your Hand: A Security Analysis and Defense Framework for OpenClaw arXiv Mar 2026 Tests 47 adversarial scenarios across 6 attack categories; native defenses achieve only 17% defense rate. Human-in-the-Loop (HITL) intervention intercepted 8 severe attacks that bypassed all native defenses. Combined HITL + native raises effectiveness to 19–92%.
Clawdrain: Exploiting Tool-Calling Chains for Stealthy Token Exhaustion in OpenClaw Agents arXiv Mar 2026 Novel DoS attack inducing "Segmented Verification Protocol" via tool-calling chains — achieves 6–9× token amplification. Also identifies prompt bloat, persistent tool-output pollution, cron/heartbeat amplification, and behavioral instruction injection in production deployments.
Agent Privilege Separation in OpenClaw: A Structural Defense Against Prompt Injection arXiv Mar 2026 Two-part structural defense: agent privilege separation + tool partitioning + JSON formatting to strip persuasive framing. Tested against 649 successful attacks from Microsoft LLMail-Inject benchmark — achieves 0% attack success rate. Isolation is the dominant mechanism.
Defensible Design for OpenClaw: Securing Autonomous Tool-Invoking Agents arXiv Mar 2026 Argues that untrusted inputs + autonomous action + extensibility + privileged access create systemic risks no single mitigation can address. Proposes secure engineering principles, a risk taxonomy, and a research roadmap for institutionalizing safety throughout agent development.
OpenClaw PRISM: A Zero-Fork, Defense-in-Depth Runtime Security Layer arXiv Mar 2026 In-process plugin + optional sidecar distributing enforcement across 10 lifecycle hooks. Combines heuristic and LLM-based scanning with risk accumulation/decay and policy-based tool and network controls. Focus on practical, deployable security for production agent gateways.
ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents arXiv Mar 2026 Three-layer defense: skill-based policy enforcement (instruction level) → plugin-based runtime monitoring → decoupled watcher-based security middleware (real-time intervention without coupling to agent internals). Effective against data leakage and privilege escalation; watcher paradigm proposed as a building block for next-generation agent security.

Defense Methods

Paper Venue TL;DR
LlamaFirewall arXiv May 2025 Meta's open-source multi-layer guardrail: PromptGuard 2 (jailbreak/injection detection), Agent Alignment Checks (chain-of-thought auditor), CodeShield (static analysis for code agents).
How Microsoft Defends Against Indirect Prompt Injection Attacks (FIDES) Microsoft MSRC Jul 2025 FIDES: information-flow control enforcing privilege separation and prompt isolation. Deterministically blocks indirect injection in Copilot-class agents.
Agents Rule of Two Meta Oct 2025 Architectural rule: agents must satisfy ≤2 of: (A) processing untrusted input, (B) access to sensitive data, (C) ability to change external state. Deterministic blast-radius bound.
Prompt Injection Attacks in LLMs and AI Agent Systems: A Comprehensive Review MDPI Info. Jan 2026 Synthesizes 45 sources. Documents 5-document RAG poisoning achieving 90% manipulation. Proposes PALADIN five-layer defense framework.
Agent-Sentry: Bounding LLM Agents via Execution Provenance arXiv Mar 2026 Learns execution-provenance models from benign traces and blocks deviations. Blocks 90%+ of attacks at 98% utility preservation.
Defeating Prompt Injections by Design arXiv Mar 2025 Architectural approach that makes prompt injection structurally infeasible rather than relying on detection, by enforcing strict privilege separation between instruction context and data context at the system level.
Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks arXiv Jul 2025 Training-time alignment that instills inherent resistance to injection at the foundation model level, providing a secure-by-default base that downstream agentic systems can build on.

Defensive Tools & Projects

Open-Source Guardrails

Tool Maintainer Description
LlamaFirewall Meta Multi-layer runtime protection: PromptGuard 2 (injection/jailbreak), Agent Alignment Checks (CoT auditor), CodeShield (dangerous code detection in agent outputs)
NeMo Guardrails NVIDIA Programmable topical/safety/dialog rails for LLM-based systems; composable with LangChain/LlamaIndex
Guardrails AI Guardrails AI Python framework with 50+ validators for PII, schema conformance, injection, toxicity; structured output enforcement
LLM Guard Protect AI Self-hosted input/output scanner; detects injection, secrets, PII, toxicity; low-latency production deployment
Rebuff Protect AI Self-hardening prompt injection detector with canary token support; learns from attempted bypasses (archived May 2025 — no longer actively maintained)
Invariant Analyzer Invariant Labs Rule-based guardrailing for LLM/MCP agent traces; Python-inspired matching language for detecting malicious tool sequences
Vigil LLM deadbits Composable scanners: vector similarity, YARA rules, transformer classifier, canary token detection, sentiment analysis
InjecGuard / PIGuard Open Source +30.8% over prior SOTA on NotInject benchmark; specifically addresses overdefense false positives
Sentinel AI Open Source 12-language sub-millisecond injection detection; detects base64/hex/ROT13/homoglyph obfuscation; includes MCP safety proxy
openclaw-bastion AtlasPA Detects system prompt markers, role overrides, Unicode homoglyphs, zero-width chars, HTML comment injection; zero dependencies
ShellWard Open Source 8-layer agent security middleware; blocks prompt injection, exfiltration, and dangerous commands; zero dependencies
AprielGuard ServiceNow AI 8B parameter safety–security safeguard model with strong performance on injection and jailbreak detection

Red Teaming & Scanning

Tool Type Description
Garak Scanner 100+ probes for injection, jailbreaks, hallucinations, data leakage; AVID taxonomy integration
PyRIT Red Team Microsoft's Python Risk Identification Toolkit; supports supply chain and Azure model assessments
Promptfoo Red Team Dev-first framework with CI/CD integration; multi-turn agent testing; OWASP/NIST/ATLAS mapping
Augustus Scanner 210+ probes, 28 LLM providers, single Go binary; built for pentest workflows without Python/npm
Agentic Radar Agent Scanner CLI scanner specifically for agentic workflows (LangGraph, CrewAI, AutoGen); automatic prompt hardening feature
AI-Infra-Guard Platform Tencent Zhuque Lab's red team platform: Infra Scan + MCP Scan + Jailbreak Evaluation in one web UI
FuzzyAI Fuzzer Automated LLM fuzzing using genetic algorithms for adaptive jailbreak generation
Spikee Red Team Custom injection datasets + automated testing for black-box assessments
mcp-injection-experiments PoC Minimal self-contained scripts that demonstrate tool poisoning and cross-server manipulation in live MCP environments
AgentSeal Scanner 225+ attack probes (82 extraction + 143 injection) for prompt injection and extraction; supports OpenAI, Anthropic, Ollama, any HTTP endpoint
PurpleLlama Red Team Meta's set of tools to assess and improve LLM security
ART (Linux Foundation AI) Defense/Attack Comprehensive library for evasion, poisoning, extraction, and inference attacks on ML models

Benchmarks & Evaluations

Benchmark Description
MCPTox 45 live MCP servers, 353 tools, 1,312 malicious test cases across 10 risk categories
AgentDojo Dynamic environment for evaluating attacks and defenses for LLM agents
JailbreakBench Open robustness benchmark for jailbreaking (NeurIPS 2024)
AIRTBench Measuring autonomous AI red-teaming capabilities in language models
ISC-Bench Internal Safety Collapse benchmark: jailbreaks frontier models via normal task completion, no adversarial prompting
AgentDoG Trajectory-level risk assessment framework for autonomous agents
OpenPromptInjection Benchmark for prompt injection attacks and defenses across diverse agent scenarios
Damn Vulnerable MCP Server Deliberately vulnerable MCP server for security education and testing

Observability & Tracing

Tool Description
Langfuse Open-source observability with detailed trace visualization, embedding monitoring, cost tracking, and tamper-proof audit logs
Phoenix (Arize) Open-source LLM observability; traces agent reasoning steps and tool calls with anomaly detection
AudAgent Automated privacy policy compliance auditing via an "auditing automaton" that validates runtime data practices
Invariant Analyzer Security analysis for MCP deployments; detects exfiltration patterns like inbox-fetch→external-send sequences in agent traces

Commercial / Enterprise Solutions

Tool Vendor Description
Prisma AIRS Palo Alto Networks AI agent discovery, behavior monitoring, RAG data inspection, supply chain security
Cisco AI Defense Cisco Developer tools for model resilience testing; DefenseClaw open-source secure agent framework
Cisco Duo (Agent Identity) Cisco Agent identity management with human-owner mapping and MCP policy enforcement
Lakera Guard Lakera Real-time prompt injection and data leak detection API
Prompt Security Prompt Security Enterprise platform for MCP security risk management
Reco Reco SaaS security platform with AI agent discovery and permission auditing
MCP Manager MCP Manager MCP gateway with tool metadata scanning, rug pull detection, permission management
Kubescape 4.0 / KAgent ARMO (CNCF) Kubernetes security with AI agent scanning support

Learning Resources

Articles & Blog Posts

Courses, Labs & CTFs

Resource Type Description
PromptTrace CTF 7 injection labs + 15-level Gauntlet CTF; unique Context Trace shows full prompt stack in real-time
Gandalf CTF 8-level prompt injection challenge by Lakera; classic resource for learning defense evasion
FinBot Agentic AI CTF CTF OWASP's financial services agentic AI CTF with real-world vulnerability scenarios
CrowdStrike AI Unlocked CTF Three-room injection challenges with escalating difficulty
Damn Vulnerable LLM Agent Lab LangChain-based ReAct agent with intentionally exploitable injection paths; lets you observe how hijacked thoughts propagate through the agent loop
Damn Vulnerable MCP Server Lab Deliberately vulnerable MCP server for learning MCP pentesting
vulnerable-mcp-servers-lab Lab Collection of vulnerable MCP servers by Appsecco
Microsoft AI Red Teaming Playground Lab Microsoft's AI red teaming training infrastructure
ai-prompt-ctf CTF Targets the full agentic stack: RAG retrieval, function calling, and ReAct loops — covers indirect injection paths most CTFs skip
OWASP WrongSecrets LLM Lab OWASP's LLM security challenge (link may be unavailable)
Google AI Red Team Guide Guide Google's walkthrough of hacking AI systems

Databases & Trackers

Resource Description
AI Incident Database Community-sourced database of AI failures and harms in deployed systems
MIT AI Incident Tracker MIT AI Risk Repository's incident tracker with severity and domain classification
AVIDML AI Vulnerability and Incidents Database with structured taxonomy
MITRE ATLAS Cases Real-world case studies of ML attacks, mapped to ATLAS tactics and techniques

Key Reports & Industry Data


Contributing

Contributions are welcome! Please submit a PR with:

  1. Incidents: Date, brief description, impact, and a verifiable source link
  2. CVEs: CVE ID, affected product, CVSS score, and concise description
  3. Papers: arXiv/venue link, venue/date, and a one-sentence TL;DR
  4. Tools: Repository link, maintainer, and what it does/defends against

Please ensure all entries have verifiable sources. Unverified or purely speculative entries will not be merged.

See CONTRIBUTING.md for full guidelines.


License

MIT License — Copyright (c) 2026 Hideaki Takahashi

Yorumlar (0)

Sonuc bulunamadi