SCOUT
Firmware Security Analysis Pipeline with Deterministic Evidence Packaging
Drop a firmware blob. Get SARIF findings, CycloneDX SBOM+VEX, hash-anchored evidence, and analyst-ready reasoning trails -- in one command.
SCOUT is optimized for deep analysis of a single firmware image: it acts as an analyst copilot grounded in evidence lineage, not an autonomous verdict engine. Ghidra P-code taint, adversarial LLM adjudication, reasoning persistence across findings/reports/viewer/TUI, zero pip dependencies.
| Corpus Targets | Success Rate | CVE Matches | LLM-Adjudicated FPR | Pair-Eval FN/FP |
|---|---|---|---|---|
| 1,123 (Tier 1 refresh) | 98.8% (1,110 / 1,123) | 146,943 (Tier 1 refresh) | 99.3% (Tier 2 carry-over) | Pending (next lane) |
> [!NOTE]
> Tier 1 numbers in this README now reflect the fresh v2.6.1 corpus refresh (docs/carry_over_benchmark_v2.6.md): 1,123 targets, 1,110 success / 4 partial / 9 fatal. Tier 2 LLM numbers are still carry-over (v2.3.0, 36 firmware) until the pair-eval lane lands. See `docs/benchmark_governance.md`, `docs/carry_over_benchmark_v2.6.md`, and `benchmarks/baselines/v2.5.0/manifest.json`.
> [!TIP]
> **What's new in v2.7.3 (Universal Chaining + ER605 Comexe DDNS quality pass)**
> - **ER605/Comexe DDNS chain modeling.** `exploitability_dossier` now recognizes `cmxddnsd` candidates using Comexe server names, `Data`, `ErrorCode`, `UpdateSvr1/2`, and parser sink markers, then emits `dns_mitm`, `udp_ddns_response`, `parser_field`, and `info_leak_then_control` channels.
> - **Protocol-aware Plan IR and AutoPoC selection.** `exploit_state_machine` preserves dossier families and lowers Comexe candidates to `classify_ddns_protocol_chain_quality`; AutoPoC avoids duplicate candidate IDs across dossier/state-machine sources.
> - **Non-weaponized DDNS blueprint template.** `poc_templates.py` adds a Comexe DDNS quality template that records safe packet/Plan-IR hashes and quality checks without generating overlong fields, ROP, command payloads, DES key recovery, or spoofing infrastructure.
> - **PoC quality review documented.** See `docs/er605_poc_quality.md` for the quality assessment against the public ER605 analysis and the remaining live-lab verifier gaps.
> [!TIP]
> **What's new in v2.7.2 (Phase 2C++ detection-engine integrity patch — no scorecard movement expected)**
> - **Phase 2C++.1** — `DECOMPILED_COLOCATED_CAP = 0.45` promoted to a named constant. The `decompiled_colocated` taint method previously hardcoded a `0.50` ceiling inline. The 5-tier cap ladder (SYMBOL_COOCCURRENCE 0.40 < DECOMPILED_COLOCATED 0.45 < STATIC_CODE_VERIFIED 0.55 < STATIC_ONLY 0.60 < PCODE_VERIFIED 0.75) is now externally cited. Consumer impact: `decompiled_colocated` traces drop 0.50 → 0.45 (-0.05); ROC thresholds previously pinned at 0.50 should be retuned to 0.45. `priority_score` and `cve_scan`'s `STATIC_CODE_VERIFIED_CAP=0.55` are unchanged. Rationale: the v2.4.0 external review (`docs/upgrade_plz.md` Gap C) flagged the prior value as over-confident relative to body-text-only evidence.
> - **Phase 2C++.2** — legacy `addr_diff > 16` residues removed from `ghidra_analysis.py` and `ghidra_scripts/pcode_taint.py`. Commit `3352783` (v2.4.1) replaced the primary CALL-matching path with callee-name resolution but left a dead `trace_pcode_forward()` helper inside `_PYGHIDRA_SCRIPT` and an unreachable `else: addr_diff` fallback in the analyzeHeadless Strategy 1 loop. Both are now physically removed; `_trace_forward_pcode()`'s `source_api_name` parameter is required (no default). No runtime behaviour change — the production paths have done callee-name matching for 13 days already. Guard-rail tests in `tests/test_ghidra_dead_code_removed.py` pin the removal.
> - **No scorecard movement expected.** Gap B was runtime-effective since v2.4.1. Gap C's new ceiling only binds on `decompiled_colocated` traces, which are emitted solely by the pyghidra fallback (`ghidra_analysis.py:609`) that Ghidra-12 environments don't exercise. Phase 2D' Entry Gate remains at the v2.7.1 figure of record (2/5 PASS). Pair-eval re-measurement is deferred to the session that evaluates Gap A (interprocedural taint) ROI.
> - **Pivot Option D (compliance-led identity) remains in force.** v2.7.2 is detection-engine hygiene, not a behavioural pivot. The `compliance_report` stage and four standard mappings shipped in v2.7.0 are unchanged.
> [!TIP]
> **What's new in v2.7.1 (Phase 2C+.4 vendor corpus expansion — quantitative refinement of v2.7.0's scenario C)**
> - **Pair-eval corpus extended 7 → 12 pairs** — five new vendor/model entries: D-Link DIR-859 (CVE-2019-17621), D-Link DIR-878 (vendor advisory), ASUS RT-AC68U (CVE-2020-15498), Linksys WRT1900AC v2 (progression), and Linksys EA6700 (progression). Manifest registration alone clears Phase 2D' Entry Gate 5 (corpus ≥ 10).
> - **Phase 2D' Entry Gate scorecard: 1/5 → 2/5 PASS** (Gate 4 Rerun + Gate 5 Corpus). Gate 1 recall improves 0.143 → 0.167 (+17% relative) but still FAIL; Gate 2 (tier variation) and Gate 3 (finding diversity 0.917) remain FAIL. The new TP/FP pair (DIR-859 vuln + patched both hit `aiedge.findings.web.exec_sink_overlap`) corroborates the v2.7.0 diagnosis that `findings.py`'s single-synthesis-finding selection is the structural Gate 1/3 limit.
> - **Honest measurement-of-record protocol** — an intermediate 1st-pass measurement under partial WRT1900AC extractions transiently showed Gate 2 PASS due to `aiedge.findings.analysis_incomplete` populating the `unknown` tier; the figure of record is the ok-state measurement after the 2400-second budget rerun (Gate 2 reverts to FAIL). Documented in `docs/v2.7.1_release_plan.md`.
> - **Scorer reliability fix** — `scripts/score_pair_corpus.py` now gracefully skips pairs with absent runs (`vulnerable_status="missing"` / `patched_status="missing"`) instead of aborting on `StopIteration`, so corpus growth and partial-coverage measurements no longer crash the release gate.
> - **Pivot Option D (compliance-led identity) remains in force** — v2.7.1 is a quantitative refinement of v2.7.0's scenario C, not a re-pivot. The `compliance_report` stage and four standard mappings shipped in v2.7.0 are unchanged.
Why SCOUT?
Every finding has a hash-anchored evidence chain.
No finding without a file path, byte offset, SHA-256 hash, and rationale. Artifacts are immutable and traceable from firmware blob to final verdict.
4-tier confidence caps with Ghidra P-code verification -- honest scoring.
SYMBOL_COOCCURRENCE capped at 0.40, STATIC_CODE_VERIFIED at 0.55, STATIC_ONLY at 0.60, PCODE_VERIFIED at 0.75. Promotion to `confirmed` requires dynamic verification. We don't inflate scores.
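As a rough sketch of how such a cap ladder behaves (constant and function names here are illustrative, not SCOUT's internal API):

```python
# Hypothetical sketch of the documented 4-tier cap ladder.
CONFIDENCE_CAPS = {
    "symbol_cooccurrence": 0.40,   # symbol co-occurrence only
    "static_code_verified": 0.55,  # static evidence tied to code
    "static_only": 0.60,           # static analysis, no dynamic proof
    "pcode_verified": 0.75,        # Ghidra P-code dataflow confirmed
}

def capped_confidence(raw_score: float, evidence_tier: str) -> float:
    """Clamp a raw confidence score to the ceiling its evidence tier allows."""
    return min(raw_score, CONFIDENCE_CAPS[evidence_tier])
```

The point of the ladder is that no amount of model enthusiasm can push a symbol-co-occurrence finding past 0.40; only stronger evidence unlocks a higher ceiling.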
SARIF + CycloneDX VEX + SLSA provenance -- standard formats.
GitHub Code Scanning, VS Code, CI/CD integration out of the box.
Built for analyst-in-the-loop firmware review.
SCOUT is strongest when used to start deep review on a single firmware image quickly, expose evidence paths, and preserve matched reasoning lineage across triage and reporting surfaces. Analyst hints loop back into next-run LLM adjudication via MCP, while final verdict ownership stays with the reviewer.
ER605-style exploitability dossiers.
The `exploitability_dossier` stage converts findings into an analysis-only decision log: target context, input surface, reachability, controllability, primitive hypothesis, mitigation friction, chain candidates, and patch/variant questions. It ranks exploit leads without generating payloads or claiming verified exploitability.
In the gated `profile=exploit` lane, those ranked leads can seed `exploit_autopoc`, which generates lab-only proof plugins and verifies them through `exploit_runner`, `poc_validation`, and `exploit_policy`.
How It Works
```
firmware.bin ──> 47-stage pipeline ──> SARIF findings        ──> Web viewer
                 (auto Ghidra)         CycloneDX SBOM+VEX        TUI dashboard
                 (auto CVE match)      Evidence chain            GitHub/VS Code
                 (optional LLM)        Exploitability dossier    MCP for AI agents
                                       SLSA attestation
```
```bash
# Full analysis
./scout analyze firmware.bin

# Static-only (no LLM, $0)
./scout analyze firmware.bin --no-llm

# Pre-extracted rootfs
./scout analyze firmware.img --rootfs /path/to/rootfs

# Web viewer
./scout serve aiedge-runs/<run_id> --port 8080

# TUI dashboard
./scout ti    # interactive (latest run)
./scout tw    # watch mode (auto-refresh)

# MCP server for AI agents
./scout mcp --project-id aiedge-runs/<run_id>
```
Comparison
| Feature | SCOUT | FirmAgent | EMBA | FACT | FirmAE |
|---|---|---|---|---|---|
| Scale (firmware tested) | 1,123 | 14 | -- | -- | 1,124 |
| SBOM (CycloneDX 1.6+VEX) | Yes | No | Yes | No | No |
| SARIF 2.1.0 Export | Yes | No | No | No | No |
| Hash-Anchored Evidence Chain | Yes | No | No | No | No |
| SLSA L2 Provenance | Yes | No | No | No | No |
| Known CVE Signature Matching | Yes (2,528 CVEs, 25 sigs) | No | No | No | No |
| Confidence Caps (honest scoring) | Yes | No | No | No | No |
| Ghidra Integration (auto-detect) | Yes | IDA Pro | Yes | No | No |
| AFL++ Fuzzing Pipeline | Yes | Yes | No | No | No |
| Cross-Binary IPC Chains | Yes (5 types) | No | No | No | No |
| Taint Propagation (LLM) | Yes | Yes (DeepSeek) | No | No | No |
| Adversarial FP Reduction | Yes | No | No | No | No |
| MCP Server (AI agent) | Yes | No | No | No | No |
| Web Report Viewer | Yes | No | Yes | Yes | No |
| Zero pip Dependencies | Yes | No | No | No | No |
Key Features
| | Feature | Description |
|---|---|---|
| :package: | SBOM & CVE | CycloneDX 1.6 + VEX + 25 known CVE signatures (8 vendors) + NVD scan + 2,528 local CVE DB + EPSS scoring (FIRST.org API, batched + cached) |
| :mag: | Binary Analysis | Ghidra P-code SSA dataflow taint + ELF hardening (NX/PIE/RELRO/Canary/FORTIFY) + .dynstr detection + 28 sink symbols + format string detection |
| :dart: | Attack Surface | Source→sink tracing, web server auto-detection, cross-binary IPC chains (5 types: unix socket, dbus, shm, pipe, exec) |
| :brain: | Taint Analysis | HTTP-aware inter-procedural taint, P-code SSA dataflow, call chain visualization, 4-strategy fallback (P-code → colocated → decompiled → interprocedural) |
| :robot: | LLM Engine | 4 backends (Codex CLI / Claude API / Claude Code CLI / Ollama) + centralized system prompts + structured JSON output + 5-stage parser (preamble/fence/raw/brace-counting/error-recovery) + temperature control |
| :crossed_swords: | LLM-Adjudicated Debate | Advocate/Critic LLM debate for LLM-adjudicated FPR reduction (99.3% on the Tier 2 carry-over baseline). Separate parse_failures vs llm_call_failures observability with quota_exhausted detection |
| :compass: | Explainability Surface (v2.6.1) | reasoning_trail persisted across findings, analyst Markdown, TUI, and web viewer so reviewers can inspect matched evidence lineage — not just a final score. Advocate / critic / decision / pattern-hit entries with 200-char excerpt redaction |
| :inbox_tray: | Analyst-in-the-loop Channel (v2.6.1) | 4 MCP tools for reasoning lookup, hint injection, verdict override, and category filtering. Hints loop back into next-run advocate prompt via AIEDGE_FEEDBACK_DIR (opt-in, fcntl.flock-safe) |
| :triangular_ruler: | Detection vs Priority Separation (v2.6.0) | confidence stays evidence-bound (≤0.55 static cap); priority_score / priority_inputs capture EPSS, reachability, backport, and CVSS as ranking signals. See docs/scoring_calibration.md |
| :speedboat: | Parallel DAG Execution (v2.6.0, PoC) | --experimental-parallel [N] opt-in level-wise stage parallelism (ThreadPoolExecutor + Kahn topo levels). 15 levels / max-width 7 on the 47-stage pipeline. Sequential path unchanged |
| :shield: | Security Assessment | X.509 cert scan, boot service audit, filesystem permission checks, credential mapping, hardcoded secret detection |
| :test_tube: | Fuzzing (optional) | AFL++ with CMPLOG, persistent mode, NVRAM faker, harness generation, crash triage |
| :bug: | Emulation | 4-tier (FirmAE / Pandawan+FirmSolo / QEMU user-mode / rootfs inspection) + GDB remote debug |
| :electric_plug: | MCP Server | 12 tools via Model Context Protocol for Claude Code/Desktop integration |
| :bar_chart: | Web Viewer | Glassmorphism dashboard with KPI bar, IPC map, risk heatmap, interactive evidence navigation |
| :link: | Evidence Chain | SHA-256 anchored artifacts + 4-tier confidence caps (0.40/0.55/0.60/0.75) + 5-tier exploit promotion ladder |
| :scroll: | Standard Output | SARIF 2.1.0 (GitHub Code Scanning) + CycloneDX 1.6 + VEX + SLSA Level 2 in-toto attestation |
| :gear: | CI/CD Integration | GitHub Action (.github/actions/scout-scan/) with composite Docker action + automatic SARIF upload to GitHub Security tab |
| :scales: | Regulatory Alignment | Output formats compatible with EU CRA Annex I (docs/compliance_mapping/cra_annex_i.md); SBOM output compatible with FDA Section 524B guidance; output formats compatible with ISO 21434 / UN R155 |
| :chart_with_upwards_trend: | Benchmarking | FirmAE dataset (1,123 firmware), analyst-readiness scoring, verifier-backed archive bundles, TP/FP analysis scripts |
| :key: | Vendor Decrypt | D-Link SHRS AES-128-CBC auto-decryption; Shannon entropy encryption detection (>7.9); binwalk v3 compatibility |
| :white_check_mark: | Zero Dependencies | Pure Python 3.10+ stdlib only — no pip dependencies, air-gap friendly deployment |
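The Shannon-entropy encryption heuristic mentioned above (flagging blobs whose entropy exceeds 7.9 bits/byte) can be sketched in stdlib-only Python; the function names are illustrative, not SCOUT's internal API:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits of entropy per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def looks_encrypted(blob: bytes, threshold: float = 7.9) -> bool:
    """Flag blobs whose byte distribution is near-uniform, as ciphertext is."""
    return shannon_entropy(blob) > threshold
```

Plaintext configs and code sit well below the threshold, while well-encrypted or compressed sections approach the 8.0 ceiling, which is why compressed-but-unencrypted data needs corroborating signals before a decrypt attempt.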
Analyst Copilot Surfaces
Explainability surface
- `reasoning_trail` and evidence lineage are preserved across findings, analyst Markdown, TUI, web viewer, and SARIF properties.
- This is where reviewers inspect why a finding was downgraded, upheld, or promoted.
Analyst-in-the-loop channel
- MCP tools and `AIEDGE_FEEDBACK_DIR` are the supported override/hint path.
- Human hints are allowed to influence the next run; final ownership still stays with the analyst.
Autonomous reasoning (future)
- SCOUT is not positioned as a fully autonomous exploit agent in v2.6.1.
- Multi-agent exploit chains, pair-grounded evaluation loops, and autonomous fuzz harness generation remain Phase 2D / reviewer-eval lane work.
Pipeline (47 Stages)
```
Firmware --> Unpack --> Profile --> Inventory --> Ghidra --> Semantic Classification
         --> SBOM --> CVE Scan --> Reachability --> Endpoints --> Surfaces
         --> Enhanced Source --> C-Source ID --> Taint Propagation
         --> FP Verification --> Adversarial Triage
         --> Graph --> Attack Surface --> Findings
         --> LLM Triage --> LLM Synthesis --> Emulation --> [Fuzzing]
         --> PoC Refinement --> Chain Construction --> Exploitability Dossier
         --> Protocol Model --> Exploit State Machine --> Crash Replay --> Primitive Verifier
         --> Exploit Chain --> PoC --> Verification
```
Ghidra is auto-detected and enabled by default. Stages in [brackets] require optional external tools (AFL++/Docker).
| Stage | Module | Purpose | LLM? | Cost |
|---|---|---|---|---|
| `tooling` | `tooling.py` | External tool availability check (binwalk, Ghidra, Docker) | No | $0 |
| `extraction` | `extraction.py` | Firmware unpacking (binwalk + vendor_decrypt + Shannon entropy detection) | No | $0 |
| `structure` | `structure.py` | Filesystem structure analysis | No | $0 |
| `carving` | `carving.py` | File carving from unstructured regions | No | $0 |
| `firmware_profile` | `firmware_profile.py` | Architecture, kernel, init system fingerprinting | No | $0 |
| `inventory` | `inventory.py` | Per-binary ELF hardening + symbol extraction | No | $0 |
| `ghidra_analysis` | `ghidra_analysis.py` | Decompilation + P-code SSA dataflow analysis | No | $0 |
| `semantic_classification` | `semantic_classifier.py` | 3-pass function classifier (static → haiku → sonnet) | Yes | Low |
| `sbom` | `sbom.py` | CycloneDX 1.6 SBOM generation with VEX | No | $0 |
| `cve_scan` | `cve_scan.py` | NVD + 25 known signatures + EPSS enrichment | No | $0 |
| `reachability` | `reachability.py` | BFS-based call-graph reachability | No | $0 |
| `endpoints` | `endpoints.py` | Network endpoint discovery | No | $0 |
| `surfaces` | `surfaces.py` | Attack surface enumeration | No | $0 |
| `enhanced_source` | `enhanced_source.py` | Web server auto-detection + INPUT_APIS scan (21 APIs) | No | $0 |
| `csource_identification` | `csource_identification.py` | HTTP input source identification via static sentinel + QEMU | No | $0 |
| `taint_propagation` | `taint_propagation.py` | Inter-procedural taint with 28 sinks + format string detection | Yes | Medium |
| `fp_verification` | `fp_verification.py` | 3-pattern FP removal + LLM verification with parse/call failure separation | Yes | Low |
| `adversarial_triage` | `adversarial_triage.py` | Advocate/Critic LLM debate (LLM-adjudicated FPR reduction, 99.3%) | Yes | Medium |
| `graph` | `graph.py` | Communication graph (5 IPC edge types) | No | $0 |
| `attack_surface` | `attack_surface.py` | Attack surface mapping with IPC chains | No | $0 |
| `attribution` | `attribution.py` | Vendor/firmware attribution | No | $0 |
| `functional_spec` | `functional_spec.py` | Functional specification extraction | No | $0 |
| `threat_model` | `threat_model.py` | STRIDE-based threat modeling | No | $0 |
| `web_ui` | `web_ui.py` | Web UI / CGI endpoint analysis | No | $0 |
| `findings` | `findings.py` | Finding aggregation + SARIF export | No | $0 |
| `llm_triage` | `llm_triage.py` | LLM finding triage (haiku/sonnet/opus auto-routing) | Yes | Variable |
| `llm_synthesis` | `llm_synthesis.py` | LLM finding synthesis | Yes | Medium |
| `emulation` | `emulation.py` | 4-tier emulation (FirmAE / Pandawan / QEMU / rootfs) | No | $0 |
| `dynamic_validation` | `dynamic_validation.py` | Dynamic behavior verification | No | $0 |
| `fuzzing` | `fuzz_*.py` | AFL++ fuzzing with NVRAM faker | No | $0 |
| `poc_refinement` | `poc_refinement.py` | Iterative PoC generation (5 attempts) | Yes | Medium |
| `chain_construction` | `chain_constructor.py` | Same-binary + cross-binary IPC exploit chains | No | $0 |
| `exploitability_dossier` | `exploitability_dossier.py` | ER605-style analysis-only exploitability decision log | No | $0 |
| `protocol_model` | `protocol_model.py` | Run-local RAG protocol/input model + safe encoder skeleton | Optional | Low |
| `exploit_state_machine` | `exploit_state_machine.py` | Candidate-level reachability/trigger/leak/control proof DAG | No | $0 |
| `crash_replay` | `crash_replay.py` | Lab-gated QEMU/cyclic crash replay and GDB script collector | No | $0 |
| `primitive_verifier` | `primitive_verifier.py` | Verifier-backed primitive classification from bundles/crashes/GDB traces | No | $0 |
| `exploit_gate` | `stage_registry.py` | Exploit promotion gate | No | $0 |
| `exploit_chain` | `exploit_chain.py` | Exploit chain validation | No | $0 |
| `exploit_autopoc` | `exploit_autopoc.py` | Automated PoC orchestration | Yes | Medium |
| `poc_validation` | `poc_validation.py` | PoC reproduction validation | No | $0 |
| `exploit_policy` | `exploit_policy.py` | Final exploit promotion decision | No | $0 |
OTA-specific stages: ota, ota_payload, ota_fs, ota_roots, ota_boottriage, firmware_lineage (Android-style OTA payload analysis).
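The `reachability` stage's BFS over the call graph can be sketched in a few lines of stdlib Python. The function name and the `{caller: [callees]}` graph shape are illustrative; the real stage operates on symbols extracted by `inventory` and `ghidra_analysis`:

```python
from collections import deque

def reachable_functions(call_graph: dict, entry_points: list) -> set:
    """Breadth-first search over a {caller: [callees]} adjacency map,
    returning every function reachable from the given entry points."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen
```

A dangerous sink that is *not* in the reachable set from any input-handling entry point is a strong hint the finding is dead code, which is how reachability feeds the confidence and priority signals downstream.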
Benchmarks
Tier 1 (Static, frozen baseline)
Baseline: v2.6.1, 2026-04-17, fresh corpus refresh (docs/carry_over_benchmark_v2.6.md)
- 1,123 firmware / 8 vendors / 98.8% success rate
- 1,110 success / 4 partial / 9 fatal
- 3,531 findings / 146,943 CVE matches
- 1,089 / 1,110 successful runs produced nonzero CVE output
Tier 2 (LLM-Adjudicated Adversarial Debate, GPT-5.3-Codex)
Baseline: v2.3.0, 2026-04-09, claude-code driver (carry-over; pair-eval lane still pending)
- 36 firmware / 9 vendors
- 2,430 findings debated → 2,412 downgraded + 18 maintained
- LLM-adjudicated FPR reduction: 99.3% | pair-grounded FN/FP: pending reviewer eval lane
v2.6.0 Post-merge Real-Firmware Validation
This section records post-release real-firmware validation runs, distinct from the carry-over corpus baselines above.
Validation target 1 — Netgear R7000 (codex driver, --experimental-parallel 4)
| Metric | v2.5.0 | v2.6.0 |
|---|---|---|
| `adversarial_triage` parse_failures | 0/100 | 0/100 (100 debated, 97 downgraded, 3 maintained) |
| `fp_verification` unverified | 0/100 | 0/100 (100 verified: 56 TP, 44 FP) |
| `reasoning_trail_count` (top-level findings) | N/A | 0/3 top-level / 100/100 at adversarial_triage + fp_verification artifacts ¹ |
| findings with `priority_score` | N/A | 3/3 (100% additive priority annotation) |
| `priority_bucket_counts` | N/A | {critical: 0, high: 0, medium: 3, low: 0} |
| category distribution | N/A | {vulnerability: 1, pipeline_artifact: 2, misconfiguration: 0, unclassified: 0} |
| `cve_scan` EPSS enriched | 23/23 | 0 (stage skipped — sbom landed partial and cve_scan/reachability skip on sbom dependency failure ²) |
| `--experimental-parallel 4` wall-clock | N/A | ~170 minutes end-to-end across the registered pipeline (fp_verification dominant at 113 min; no sequential baseline for delta) |
¹ v2.6.0 → v2.6.1 follow-up (commit 7b36274): the top-level synthesis finding (web.exec_sink_overlap) now inherits matched downstream evidence lineage instead of relying only on the stage-level aggregate summary. Matching prefers run-relative binary path, falls back to binary SHA-256, and samples representative downstream trail entries deterministically so the synthesis finding reflects the alerts that actually informed it. This R7000 run reflects the v2.6.0 shipped behaviour.
² v2.6.0 → v2.6.1 follow-up (commit 8e0bb82): the R7000 extraction actually succeeded (1,664 files, 2,412 binaries scanned under squashfs-root), but the SBOM stage returned 0 components on this firmware due to a silent schema mismatch — _collect_so_files_from_inventory read inventory.file_list (a pre-v2.x key no longer emitted) and _detect_from_binary_analysis expected per-entry string_hits (replaced by matched_symbols in the current inventory schema). OpenWrt hid the bug because its opkg database alone contributes 100+ components. The fix makes both helpers walk inventory.roots directly and fall back to reading the binary file contents via a new _extract_ascii_runs helper. A clean re-run of just SbomStage on this R7000 run raises the component count from 0 to 4 (curl 7.36.0 via binary read, plus openssl 1.0.0 / libz 1 / libpthread 0 via .so* walking). Downstream cve_scan / reachability would then produce real CVE + EPSS numbers on a full pipeline re-run.
Validation target 2 — OpenWrt Archer C7 v5 (TP-Link, --no-llm)
| Metric | v2.6.0 |
|---|---|
| total findings | 3 |
| `reasoning_trail_count` | 0 (no-llm: adversarial_triage and fp_verification are LLM-gated; the trail is populated only when LLM stages run) |
| findings with `priority_score` | 3 / 3 (100% — additive priority annotation succeeded for all findings) |
| `priority_bucket_counts` | {critical: 0, high: 0, medium: 3, low: 0} |
| category distribution | {vulnerability: 1, pipeline_artifact: 2, misconfiguration: 0, unclassified: 0} (PR #7a 3-category ontology, 0% unclassified rate) |
| notable caveats | OpenWrt uses a squashfs/ext4 root; binwalk extracted cleanly; the --no-llm path skipped reasoning_trail generation as expected. The run completed end-to-end through the findings stage. |
See CHANGELOG.md for full version history and docs/scoring_calibration.md for the two-score contract.
Architecture
```
+--------------------------------------------------------------------+
|                       SCOUT (Evidence Engine)                      |
|                                                                    |
|  Firmware --> Unpack --> Profile --> Inventory --> SBOM --> CVE    |
|                 |                       |            |        |    |
|              Ghidra                Binary Audit   40+ sigs   NVD+  |
|            auto-detect             NX/PIE/etc             local DB |
|                                                                    |
|  --> Taint --> FP Filter --> Attack Surface --> Findings           |
|   (HTTP-aware)  (3-pattern)    (IPC chains)    (SARIF 2.1.0)       |
|                                                                    |
|  --> Emulation --> [Fuzzing] --> Exploit Chain --> PoC --> Verify  |
|                                                                    |
|  42 stages . SHA-256 manifests . 4-tier confidence caps            |
|  (0.40/0.55/0.60/0.75)                                             |
|  Outputs: SARIF + CycloneDX VEX + SLSA L2 + Markdown reports       |
+--------------------------------------------------------------------+
|                 Handoff (firmware_handoff.json)                    |
+--------------------------------------------------------------------+
|                     Terminator (Orchestrator)                      |
|       LLM Tribunal --> Dynamic Validation --> Verified Chain       |
+--------------------------------------------------------------------+
```
| Layer | Role | Deterministic? |
|---|---|---|
| SCOUT | Evidence production (42 stages) | Yes |
| Handoff | JSON contract between engine and orchestrator | Yes |
| Terminator | LLM tribunal, dynamic validation, exploit dev | No (auditable) |
Exploit Promotion Policy
| Level | Requirements | Placement |
|---|---|---|
| `dismissed` | Critic rebuttal strong or confidence < 0.5 | Appendix only |
| `candidate` | Confidence 0.5-0.8, evidence exists but chain incomplete | Report (flagged) |
| `high_confidence_static` | Confidence >= 0.8, strong static evidence, no dynamic | Report (highlighted) |
| `confirmed` | Confidence >= 0.8 AND >= 1 dynamic verification artifact | Report (top) |
| `verified_chain` | Confirmed AND PoC reproduced 3x in sandbox | Exploit report |
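A minimal sketch of how the table above maps evidence to a promotion level. This is illustrative only; the real decision lives in the `exploit_policy` stage, and the parameter names here are hypothetical:

```python
def promotion_level(confidence: float,
                    strong_critic_rebuttal: bool = False,
                    dynamic_artifacts: int = 0,
                    sandbox_repros: int = 0) -> str:
    """Map debate outcome + evidence counts onto the 5-level promotion ladder."""
    if strong_critic_rebuttal or confidence < 0.5:
        return "dismissed"
    if confidence < 0.8:
        return "candidate"
    if dynamic_artifacts < 1:
        return "high_confidence_static"   # strong static evidence, no dynamic proof
    if sandbox_repros >= 3:
        return "verified_chain"           # confirmed AND PoC reproduced 3x
    return "confirmed"
```

Note the ordering: a critic rebuttal dismisses a finding regardless of its raw confidence, and dynamic evidence is required before any static score can reach `confirmed`.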
CLI Reference
| Command | Description |
|---|---|
| `./scout analyze <firmware>` | Full 47-stage analysis pipeline |
| `./scout analyze <firmware> --quiet` | Suppress real-time progress output (CI/scripted use) |
| `./scout analyze-8mb <firmware>` | Truncated 8MB canonical track |
| `./scout stages <run_dir> --stages X,Y` | Rerun specific stages |
| `./scout serve <run_dir>` | Launch web report viewer |
| `./scout mcp [--project-id <id>]` | Start MCP stdio server |
| `./scout tui <run_dir>` | Terminal UI dashboard |
| `./scout ti` | TUI interactive (latest run) |
| `./scout tw` | TUI watch mode (auto-refresh) |
| `./scout to` | TUI one-shot (latest run) |
| `./scout t` | TUI default (latest run) |
| `./scout corpus-validate` | Validate corpus manifest |
| `./scout quality-metrics` | Compute quality metrics |
| `./scout quality-gate` | Check quality thresholds |
| `./scout release-quality-gate` | Unified release gate |
Exit codes: 0 success, 10 partial, 20 fatal, 30 policy violation
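In CI, those exit codes can be mapped onto a pass/fail policy; a hypothetical wrapper (not part of SCOUT) that treats partial runs as acceptable by default:

```python
EXIT_MEANINGS = {0: "success", 10: "partial", 20: "fatal", 30: "policy violation"}

def ci_should_fail(exit_code: int, allow_partial: bool = True) -> bool:
    """Decide whether a CI job should fail for a given scout exit code."""
    if exit_code == 0:
        return False
    if exit_code == 10:
        return not allow_partial   # partial: configurable
    return True                    # fatal, policy violation, or unexpected codes
```

Distinguishing 10 (partial) from 20 (fatal) matters for large corpus runs, where a handful of partial extractions usually should not block the whole job.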
```bash
# FirmAE dataset benchmark (1,123 usable firmware images in the current frozen baseline)
./scripts/benchmark_firmae.sh --parallel 8 --time-budget 1800 --cleanup

# Options
--dataset-dir DIR    # Firmware directory (default: aiedge-inputs/firmae-benchmark)
--results-dir DIR    # Output directory
--file-list PATH     # Explicit newline-delimited firmware list
--parallel N         # Concurrent jobs (default: 4)
--time-budget S      # Seconds per firmware (default: 600)
--stages STAGES      # Specific stages (default: full pipeline)
--max-images N       # Limit images (0 = all)
--llm                # Enable LLM-backed stages
--8mb                # Use 8MB truncated track
--full               # Include dynamic stages
--cleanup            # Preserve a verifier-friendly run replica under results/archives/, then delete original run dirs
--dry-run            # List files without running

# Analyst-readiness re-evaluation for an existing benchmark-results tree
python3 scripts/reevaluate_benchmark_results.py \
    --results-dir benchmark-results/<run>

# Normalize legacy bundles and rerun a stage subset (useful for debugging archive fidelity issues)
python3 scripts/rerun_benchmark_stages.py \
    --results-dir benchmark-results/<legacy-run> \
    --out-dir benchmark-results/<rerun-out> \
    --stages attribution,graph,attack_surface \
    --no-llm

# Post-benchmark analysis
PYTHONPATH=src python3 scripts/cve_rematch.py \
    --results-dir benchmark-results/firmae-YYYYMMDD_HHMM \
    --nvd-dir data/nvd-cache \
    --csv-out cve_matches.csv

PYTHONPATH=src python3 scripts/analyze_findings.py \
    --results-dir benchmark-results/firmae-YYYYMMDD_HHMM \
    --output analysis_report.json

# FirmAE dataset setup
./scripts/unpack_firmae_dataset.sh [ZIP_FILE]

# Tier 1 frozen baseline docs
# - docs/tier1_rebenchmark_frozen_baseline.md
# - docs/tier1_rebenchmark_final_analysis.md
```
Current benchmark contract
- Archived benchmark bundles are now expected to be run replicas, not flattened JSON snapshots.
- Benchmark quality is reported in two layers:
  - analysis rate = pipeline completed (`success + partial`)
  - analyst-ready rate = archived bundle passes analyst/verifier checks and remains evidence-navigable
- `benchmark-results/legacy/tier2-llm-v2` is a legacy snapshot. It is useful for historical reference and re-evaluation, but it should not be used as the canonical analyst-readiness baseline.
- The current contract has been validated on a fresh single-sample run (`benchmark-results/tier2-single-fidelity`) where both analyst verifiers passed from the archived bundle.
Current LLM quality behavior
- `llm_triage` model routing: `<=10` haiku, `11-50` sonnet, `>50` or chain-backed opus
- `llm_triage` retries with `sonnet` if a `haiku` call exits non-zero
- `llm_triage`, `semantic_classification`, `adversarial_triage`, and `fp_verification` now write `stages/<stage>/llm_trace/*.json`
- Parse failures are handled fail-closed: repaired when possible, otherwise reported as degraded/partial instead of silently treated as clean success
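The documented routing rule amounts to a small threshold function; a sketch with hypothetical names, not SCOUT's actual implementation:

```python
def route_triage_model(finding_count: int, chain_backed: bool = False) -> str:
    """Route llm_triage to a model tier by finding volume, per the
    documented rule: <=10 haiku, 11-50 sonnet, >50 or chain-backed opus."""
    if chain_backed or finding_count > 50:
        return "opus"
    if finding_count > 10:
        return "sonnet"
    return "haiku"
```

Chain-backed findings escalate straight to the strongest model regardless of volume, since cross-binary chains carry the most triage risk.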
Core
| Variable | Default | Description |
|---|---|---|
| `AIEDGE_LLM_DRIVER` | `codex` | LLM provider: codex / claude / claude-code / gemini / ollama |
| `ANTHROPIC_API_KEY` | -- | API key for Claude driver (not needed for claude-code) |
| `AIEDGE_OLLAMA_URL` | `http://localhost:11434` | Ollama server URL |
| `AIEDGE_LLM_BUDGET_USD` | -- | LLM cost budget limit |
| `AIEDGE_PRIV_RUNNER` | -- | Privileged command prefix for dynamic stages |
| `AIEDGE_FEEDBACK_DIR` | `aiedge-feedback` | Terminator feedback directory |
Ghidra
| Variable | Default | Description |
|---|---|---|
| `AIEDGE_GHIDRA_HOME` | auto-detect | Ghidra install path; probes /opt/ghidra_*, /usr/local/ghidra* |
| `AIEDGE_GHIDRA_MAX_BINARIES` | `20` | Max binaries to analyze |
| `AIEDGE_GHIDRA_TIMEOUT_S` | `300` | Per-binary analysis timeout |
SBOM & CVE
| Variable | Default | Description |
|---|---|---|
| `AIEDGE_NVD_API_KEY` | -- | NVD API key (optional, improves rate limits) |
| `AIEDGE_NVD_CACHE_DIR` | -- | Cross-run NVD response cache |
| `AIEDGE_SBOM_MAX_COMPONENTS` | `500` | Maximum SBOM components |
| `AIEDGE_CVE_SCAN_MAX_COMPONENTS` | `50` | Maximum components to CVE-scan |
| `AIEDGE_CVE_SCAN_TIMEOUT_S` | `30` | Per-request NVD API timeout |
Fuzzing & Emulation
| Variable | Default | Description |
|---|---|---|
| `AIEDGE_AFLPP_IMAGE` | `aflplusplus/aflplusplus` | AFL++ Docker image |
| `AIEDGE_FUZZ_BUDGET_S` | `3600` | Fuzzing time budget (seconds) |
| `AIEDGE_FUZZ_MAX_TARGETS` | `5` | Max fuzzing target binaries |
| `AIEDGE_EMULATION_IMAGE` | `scout-emulation:latest` | Emulation Docker image |
| `AIEDGE_FIRMAE_ROOT` | `/opt/FirmAE` | FirmAE installation path |
| `AIEDGE_QEMU_GDB_PORT` | `1234` | QEMU GDB remote port |
Quality Gates
| Variable | Default | Description |
|---|---|---|
| `AIEDGE_QG_PRECISION_MIN` | `0.9` | Minimum precision threshold |
| `AIEDGE_QG_RECALL_MIN` | `0.6` | Minimum recall threshold |
| `AIEDGE_QG_FPR_MAX` | `0.1` | Maximum false positive rate |
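A sketch of how these thresholds could be read and enforced. This is illustrative; the shipped gate is `./scout quality-gate`, and the function name here is hypothetical:

```python
import os

def quality_gate_passes(precision: float, recall: float, fpr: float,
                        env=os.environ) -> bool:
    """Check metrics against the documented defaults, overridable via the
    AIEDGE_QG_* environment variables."""
    return (precision >= float(env.get("AIEDGE_QG_PRECISION_MIN", "0.9"))
            and recall >= float(env.get("AIEDGE_QG_RECALL_MIN", "0.6"))
            and fpr <= float(env.get("AIEDGE_QG_FPR_MAX", "0.1")))
```

Passing a plain dict as `env` makes the gate easy to unit-test without mutating the process environment.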
```
aiedge-runs/<run_id>/
├── manifest.json
├── firmware_handoff.json
├── provenance.intoto.jsonl        # SLSA L2 attestation
├── input/firmware.bin
├── stages/
│   ├── extraction/                # Unpacked filesystem
│   ├── inventory/
│   │   └── binary_analysis.json   # Per-binary hardening + symbols
│   ├── enhanced_source/
│   │   └── sources.json           # HTTP input sources + web server detection
│   ├── sbom/
│   │   ├── sbom.json              # CycloneDX 1.6
│   │   └── vex.json               # VEX exploitability
│   ├── cve_scan/
│   │   └── cve_matches.json       # NVD + known signature matches
│   ├── taint_propagation/
│   │   └── taint_results.json     # Taint paths + call chains
│   ├── ghidra_analysis/           # Decompiled functions (optional)
│   ├── chain_construction/
│   │   └── chains.json            # Same-binary + cross-binary IPC chains
│   ├── findings/
│   │   ├── findings.json          # All findings
│   │   ├── pattern_scan.json      # Static pattern matches
│   │   ├── sarif.json             # SARIF 2.1.0 export
│   │   └── stage.json             # SHA-256 manifest
│   └── ...                        # 42 stage directories total
└── report/
    ├── viewer.html                # Web dashboard
    ├── report.json
    ├── analyst_digest.json
    └── executive_report.md
```
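The per-stage `stage.json` SHA-256 manifests can be spot-checked with a few lines of stdlib Python. The `{"sha256": {name: digest}}` layout below is an assumed schema for illustration, not SCOUT's exact manifest format; the verification scripts that follow are the supported path:

```python
import hashlib
import json
from pathlib import Path

def verify_stage_artifacts(stage_dir: str) -> list:
    """Recompute SHA-256 for each artifact listed in stage.json and return
    the names of artifacts whose digest no longer matches the manifest.
    Assumes a {"sha256": {filename: hex_digest}} manifest layout."""
    root = Path(stage_dir)
    manifest = json.loads((root / "stage.json").read_text())
    return [name for name, expected in manifest.get("sha256", {}).items()
            if hashlib.sha256((root / name).read_bytes()).hexdigest() != expected]
```

An empty return list means every anchored artifact is byte-identical to what the stage recorded; any name in the list indicates post-run tampering or corruption.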
Verification Scripts
```bash
# Evidence chain integrity
python3 scripts/verify_analyst_digest.py --run-dir aiedge-runs/<run_id>
python3 scripts/verify_verified_chain.py --run-dir aiedge-runs/<run_id>

# Report schema compliance
python3 scripts/verify_aiedge_final_report.py --run-dir aiedge-runs/<run_id>
python3 scripts/verify_aiedge_analyst_report.py --run-dir aiedge-runs/<run_id>

# Security invariants
python3 scripts/verify_run_dir_evidence_only.py --run-dir aiedge-runs/<run_id>
python3 scripts/verify_network_isolation.py --run-dir aiedge-runs/<run_id>

# Quality gates
./scout release-quality-gate aiedge-runs/<run_id>
```
Documentation
| Document | Purpose |
|---|---|
| Blueprint | Pipeline architecture and design rationale |
| Status | Current implementation status |
| Artifact Schema | Profiling + inventory contracts |
| Adapter Contract | Terminator-SCOUT handoff protocol |
| Report Contract | Report structure and governance |
| Analyst Digest | Digest schema and verdicts |
| Verified Chain | Evidence requirements |
| Duplicate Gate | Cross-run dedup rules |
| Known CVE Ground Truth | CVE validation dataset |
| Upgrade Plan v2 | v2.0 upgrade plan |
| LLM Roadmap | LLM integration strategy |
Security & Ethics
Authorized environments only.
SCOUT is intended for contracted security audits, vulnerability research (responsible disclosure), and CTF/training in lab environments. Dynamic validation runs in network-isolated sandbox containers. No weaponized payloads are included.
Contributing
- Read Blueprint for architecture context
- Run `pytest -q` -- all tests must pass
- Lint with `ruff check src/` -- zero violations
- Follow the Stage protocol (`src/aiedge/stage.py`)
- Zero pip dependencies -- stdlib only
License
Apache 2.0