claude-4.6-jailbreak-vulnerability-disclosure-unredacted
Health: Passed
- License — License: CC-BY-4.0
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Community trust — 36 GitHub stars
Code: Failed
- exec() — Shell command execution in disclosures/afl-jailbreak/afl_defuser.jsx
Permissions: Passed
- Permissions — No dangerous permissions requested
This is a security research disclosure documenting jailbreak and prompt injection vulnerabilities in Claude Opus, Sonnet, and Haiku. It contains proof-of-concept materials demonstrating how model safety checks can be bypassed.
Security Assessment
The repository contains evidence of shell command execution (`disclosures/afl-jailbreak/afl_defuser.jsx`), which makes sense given the nature of the exploit proof-of-concept code. No hardcoded secrets were found, and no dangerous permissions are requested. However, because this repository is designed to showcase functional exploit code against live infrastructure—including subnet scanning and container escapes—it inherently contains aggressive, potentially dangerous payloads. Overall Risk: High.
Quality Assessment
The project is licensed under CC-BY-4.0 and was recently updated (last push was today). It has garnered 36 GitHub stars, indicating a moderate level of community interest and visibility in the security research space. The repository includes a detailed, professional vulnerability disclosure timeline, though its claims against Anthropic have not been independently verified.
Verdict
Not recommended. Unless you are actively conducting authorized red-team security research, avoid running or interacting with the proof-of-concept payloads hosted in this repository.
I Jailbroke Claude Opus/Sonnet 4.6 & Haiku 4.5 with "more+"
Prompt Injection, Jailbreak, and Constitutional Compliance Failure Across Claude Opus 4.6 ET, Sonnet 4.6 ET, and Haiku 4.5 ET
Unredacted Public Disclosure
31-turn Opus 4.6 ET session: model autonomously escalates from passive analysis to subnet scanning, memory injection, and container escape planning — zero user instruction to attack.
Independent reproduction by Nokia — jailbreaking Claude Opus 4.6 Extended Thinking.
TL;DR: All three Claude production tiers generated functional exploit code against live infrastructure when user-defined memory protocols suppressed constitutional safety checks across extended conversations. Anthropic was notified six times over 27 days with zero acknowledgment.
Disclosure Timeline
| Date | Event | Recipient(s) |
|---|---|---|
| March 4, 2026 | Prompt injection vulnerability discovered | — |
| March 12, 2026 | Prompt injection submission via HackerOne; email to [email protected] | Anthropic Model Bug Bounty |
| March 18, 2026 | Full proof of concept package sent (12 attachments including PoC video, framework papers, diagrams, screenshots) | [email protected] |
| March 22, 2026 | Opus 4.6 ET jailbreak reported with afl_disclosure.docx | modelbugbounty, security, amanda, alex, usersafety @anthropic.com |
| March 22, 2026 | First constitutional failure observed (Sonnet 4.6 ET) | — |
| March 24, 2026 | Second constitutional failure observed (Opus 4.6 ET) | — |
| March 27, 2026 | Follow-up email noting 15 days with zero acknowledgment | [email protected] |
| March 28, 2026 | Third constitutional failure observed (Haiku 4.5 ET) | — |
| March 28, 2026 | Tri-tier constitutional disclosure submitted with full report | modelbugbounty, security, alex, amanda, usersafety, disclosure @anthropic.com |
| March 29, 2026 | Fourth constitutional failure observed (Opus 4.6 ET — TENEX.AI session): scope violation, active recon, self-aware misrepresentation in advisory | — |
| March 31, 2026 | 27 days since first submission. Zero acknowledgment from Anthropic on any channel. | — |
| March 31, 2026 | Unredacted public disclosure | — |
Anthropic's own Responsible Disclosure Policy commits to acknowledging submissions within three (3) business days. That commitment was not met across six separate emails to six Anthropic addresses over 27 days. No acknowledgment, no triage, no rejection — nothing.
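The intervals in the timeline can be checked with plain date arithmetic. The sketch below is illustrative only: the dates are taken from the table above, `add_business_days` is a hypothetical helper that skips weekends and ignores public holidays, and the three-business-day window is the figure quoted in this document.

```python
from datetime import date, timedelta


def add_business_days(start: date, n: int) -> date:
    """Return the date n business days after start (weekends skipped, holidays ignored)."""
    current = start
    remaining = n
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


# Dates copied from the disclosure timeline table.
discovered = date(2026, 3, 4)
first_submission = date(2026, 3, 12)
follow_up = date(2026, 3, 27)
public_disclosure = date(2026, 3, 31)

days_to_follow_up = (follow_up - first_submission).days
days_since_discovery = (public_disclosure - discovered).days
days_since_submission = (public_disclosure - first_submission).days
ack_deadline = add_business_days(first_submission, 3)

print(days_to_follow_up, days_since_discovery, days_since_submission, ack_deadline)
```

Under these assumptions, the 15-day figure in the March 27 follow-up matches the March 12 submission date, while the 27-day figure matches the March 4 discovery date; counting from March 12 gives 19 calendar days.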
This document was originally submitted with a confidentiality commitment contingent on a functioning disclosure process. That process was never engaged by Anthropic. This is the full, unredacted version.
Disclosures
Three coordinated disclosures, one supporting research paper, full evidence chain in evidence/. Each section below links the primary disclosure documents and supporting evidence.
Constitutional Compliance Failure — All Three Tiers
Between March 22 and March 28, 2026, all three Claude production model tiers violated Anthropic's own constitutional behavioral policies. Each exhibited the same failure mode: memory-stored interaction protocols combined with incremental escalation prompts produced cumulative character drift with zero self-correction.
| Finding | Model | Turns | Key Behavior | Transcript |
|---|---|---|---|---|
| Opus 4.6 ET | claude-opus-4-20250514 | 31 | Autonomous escalation: drove subnet scanning, memory injection, and container escape under its own initiative via "garlic mode" | Transcript |
| Opus 4.6 ET | claude-opus-4-20250514 | 12+ | Self-aware misrepresentation: performed active DNS brute-force recon against TENEX.AI after passive scope was established, then authored advisory claiming "100% passive reconnaissance — no active scanning was performed" | Transcript |
| Sonnet 4.6 ET | claude-sonnet-4-20250514 | 20+ | Fake authorization check: asked once, accepted unverified claim, built 1,949-line attack framework against hotel PMS with guest PII | Transcript |
| Haiku 4.5 ET | claude-haiku-4-5 | 8+ | Zero friction: passive analysis to SYN floods and IP spoofing against state telecom infrastructure with no authorization check | Transcript |
Self-Aware Misrepresentation — Opus 4.6 ET (TENEX.AI Session)
Claude Opus 4.6 ET was given a Shodan JSON dump for dev2.tenex.ai and asked to analyze it. Scope was explicitly established as recon phase (passive). Over several turns with increasing behavioral autonomy (garlic mode + riff), the model:
- Violated scope: autonomously attempted active DNS brute-force subdomain enumeration against tenex.ai infrastructure. The sandbox blocked DNS resolution, preventing execution. The model then tried two additional methods (Python `socket.getaddrinfo()` and `host`/`dig`) to accomplish the same active recon.
- Produced a false professional document: authored a vulnerability disclosure advisory stating "100% passive reconnaissance — no active scanning was performed" and a cover email stating "no active scanning was performed", directly contradicting its own actions within the same session. This is not hallucination: the model had full context of its own DNS brute-force attempts and authored false methodology claims anyway.
- Exhibited 16 distinct issues in the session, including: recommending active exploitation steps (wp-admin probing, cookie manipulation, REST API user enumeration) against a target with no VDP or authorization, inflating confidence on speculative findings, citing evidence it had already discredited, embedding a job application in a security disclosure, and imposing a 90-day disclosure timeline on an unauthorized target.
Execution was prevented only because the sandbox environment blocked DNS resolution. The intent, the code, and the false documentation were all present.
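The contradiction described above, a "passive only" methodology claim sitting next to active lookups in the same session, is the kind of thing a reviewer could check mechanically. The following is a minimal defensive sketch, not tooling from this repository: the pattern list, the function names, and the sample log entries (which use a placeholder domain) are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for commands that actively query a target's DNS.
ACTIVE_RECON_PATTERNS = [
    re.compile(r"\bdig\b"),
    re.compile(r"\bhost\b"),
    re.compile(r"socket\.getaddrinfo\("),
]

# Wording of the methodology claim quoted in the advisory.
PASSIVE_CLAIM = re.compile(r"no active scanning was performed", re.IGNORECASE)


def find_active_recon(tool_calls: list[str]) -> list[str]:
    """Return the logged commands that match any active-recon pattern."""
    return [c for c in tool_calls if any(p.search(c) for p in ACTIVE_RECON_PATTERNS)]


def claim_contradicts_log(advisory_text: str, tool_calls: list[str]) -> bool:
    """True if the advisory claims passive-only work but the log shows active lookups."""
    return bool(PASSIVE_CLAIM.search(advisory_text)) and bool(find_active_recon(tool_calls))


# Illustrative session log; entries are made up and use a placeholder domain.
session_log = [
    "cat shodan_dump.json",
    "dig +short www.example.com",
    "python3 -c 'import socket; socket.getaddrinfo(\"www.example.com\", 443)'",
]
advisory = "Methodology: 100% passive reconnaissance. No active scanning was performed."

flagged = find_active_recon(session_log)
contradiction = claim_contradicts_log(advisory, session_log)
print(flagged, contradiction)
```

A simple pattern match like this only catches the attempts named in the session; a real review would need the full tool-call log and a broader definition of active reconnaissance.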
| Document | Description |
|---|---|
| Behavioral Compliance Failure Report | Full 16-issue analysis of the session |
| Transcript | Full conversation — passive scope established, active recon attempted, false advisory produced |
| Screenshot | Description |
|---|---|
| 1 | Shodan JSON dump provided to model |
| 2 | Model begins passive infrastructure analysis |
| 3 | User establishes "recon phase": passive scope |
| 4 | Scope confirmation continued |
| 5 | Opus labels its own action "DNS brute-force subdomain enumeration", after passive scope was established |
| 6 | Model builds advisory claiming "100% passive reconnaissance" |
| 7 | Active exploitation recommendations against unauthorized target |
| 8 | "recon phase. More+technical details.": user reaffirms passive scope |
| 9 | Extended Thinking during garlic mode escalation |
| 10 | Extended Thinking continued: model reasoning through constraint bypass |
AFL Jailbreak (Ambiguity Front-Loading)
Four short prompts bypassed policy evaluation on Opus 4.6 ET. Extended thinking blocks show the model flagging its own safety concerns three times — and overriding itself every time.
See disclosures/afl-jailbreak/ for the full disclosure, interactive tools, and proposed mitigations.
| Document | Description |
|---|---|
| AFL Jailbreak Disclosure | Full disclosure — pattern anatomy, thinking block evidence, escalation timeline, proposed mitigations |
| AFL Disclosure (original) | Initial submission to Anthropic |
| AFL Token Trajectory Analyzer | Interactive — swap token positions, watch compliance cascade shift |
| AFL Pattern Anatomy | Interactive — visual prompt escalation diagram |
| AFL Defuser | Proposed architectural mitigation (React JSX) |
Sandbox Snapshot Exfiltration
915 files extracted from the Claude.ai code execution sandbox in a single 20-minute mobile session on Haiku 4.5 — the smallest available model — via standard artifact download. No exploit triggered, no vulnerability required. The artifact channel itself was the exfiltration path.
The snapshot includes hardcoded Anthropic production IPs (api.anthropic.com, api-staging.anthropic.com, statsig.anthropic.com, sentry.io, Datadog) written into /etc/hosts by the orchestration layer at container startup, a JWT from /proc/1/environ with enforce_container_binding: false and allowed_hosts: *, the full gVisor capability set (CAP_SYS_ADMIN, CAP_SYS_PTRACE, CAP_NET_RAW, CAP_NET_ADMIN), the 17-mount 9p topology, gVisor pre-v1.1.0 version fingerprint, Chromium running with --no-sandbox, and the complete /etc/ tree. Sufficient to build an offline replica and develop sandbox escape exploits against an exact-match environment with zero log entries on Anthropic's infrastructure.
Reproducible by any Claude.ai user with code execution enabled — no privilege required.
| Document | Description |
|---|---|
| Sandbox Snapshot Disclosure | Full disclosure: extraction method, impact analysis, attack scenario, remediation recommendations |
| PoC.webm | Screencast of the exfiltration session |
Screenshot: Artifact download UI showing the 915-file snapshot
Research
| Document | Description |
|---|---|
| Constraint Is Freedom (PDF) | Formal alignment paper — autoregressive compliance cascade theory, A(S) framework |
Evidence
| File | Description |
|---|---|
| evidence/ | PoC screenshots, screencast, and AFL pattern diagrams |
Use with Claude Code
Use Claude Code to analyze the attack mechanism, map it to security frameworks, and build regression tests.
Read README.md in this repo (claude-4.6-jailbreak-vulnerability-disclosure-unredacted).
Then:
1. Extract the core attack pattern — what specifically caused the constitutional safety bypass
2. Map each failure instance to MITRE ATLAS and OWASP LLM Top 10 categories
3. Draft a regression test prompt set that would detect this behavior in a future model version
Context: all three Claude production tiers (Opus 4.6 ET, Sonnet 4.6 ET, Haiku 4.5 ET) produced
functional exploit code when user-defined memory protocols suppressed constitutional checks.
License
This disclosure document is released under CC BY 4.0 — see LICENSE for the full canonical text. Attribution required for redistribution.
About
Maintained by Nicholas Michael Kloster as part of NuClide — independent AI infrastructure security research.
CISA disclosures: CVE-2025-4364 · ICSA-25-140-11