SCOUT
Firmware Security Analysis Pipeline with Deterministic Evidence Packaging
Drop a firmware blob. Get SARIF findings, CycloneDX SBOM+VEX, hash-anchored evidence, and analyst-ready reasoning trails -- in one command.
SCOUT is optimized for deep analysis of a single firmware image: it acts as an analyst copilot grounded in evidence lineage, not an autonomous verdict engine. Ghidra P-code taint, adversarial LLM adjudication, reasoning persistence across findings/reports/viewer/TUI, zero pip dependencies.
| Corpus Targets | Success Rate | CVE Matches | LLM-Adjudicated FPR | Pair-Eval FN/FP |
|---|---|---|---|---|
| 1,123 (Tier 1 refresh) | 98.8% (1,110 / 1,123) | 146,943 (Tier 1 refresh) | 99.3% (Tier 2 carry-over) | Pending (next lane) |
> [!NOTE]
> Tier 1 numbers in this README now reflect the fresh v2.6.1 corpus refresh (docs/carry_over_benchmark_v2.6.md): 1,123 targets, 1,110 success / 4 partial / 9 fatal. Tier 2 LLM numbers are still carry-over (v2.3.0, 36 firmware) until the pair-eval lane lands. See `docs/benchmark_governance.md`, `docs/carry_over_benchmark_v2.6.md`, and `benchmarks/baselines/v2.5.0/manifest.json`.
> [!TIP]
> **What's new in v2.7.3 (Universal Chaining + ER605 Comexe DDNS quality pass)**
> - **ER605/Comexe DDNS chain modeling.** `exploitability_dossier` now recognizes `cmxddnsd` candidates using Comexe server names, `Data`, `ErrorCode`, `UpdateSvr1/2`, and parser sink markers, then emits `dns_mitm`, `udp_ddns_response`, `parser_field`, and `info_leak_then_control` channels.
> - **Protocol-aware Plan IR and AutoPoC selection.** `exploit_state_machine` preserves dossier families and lowers Comexe candidates to `classify_ddns_protocol_chain_quality`; AutoPoC avoids duplicate candidate IDs across dossier/state-machine sources.
> - **Non-weaponized DDNS blueprint template.** `poc_templates.py` adds a Comexe DDNS quality template that records safe packet/Plan-IR hashes and quality checks without generating overlong fields, ROP, command payloads, DES key recovery, or spoofing infrastructure.
> - **PoC quality review documented.** See `docs/er605_poc_quality.md` for the quality assessment against the public ER605 analysis and the remaining live-lab verifier gaps.
> [!TIP]
> **What's new in v2.7.2 (Phase 2C++ detection-engine integrity patch — no scorecard movement expected)**
> - **Phase 2C++.1** — `DECOMPILED_COLOCATED_CAP = 0.45` promoted to a named constant. The `decompiled_colocated` taint method previously hardcoded a `0.50` ceiling inline. The 5-tier cap ladder (SYMBOL_COOCCURRENCE 0.40 < DECOMPILED_COLOCATED 0.45 < STATIC_CODE_VERIFIED 0.55 < STATIC_ONLY 0.60 < PCODE_VERIFIED 0.75) is now externally cited. Consumer impact: `decompiled_colocated` traces drop 0.50 → 0.45 (-0.05); ROC thresholds previously pinned at 0.50 should be retuned to 0.45. `priority_score` and `cve_scan`'s `STATIC_CODE_VERIFIED_CAP=0.55` are unchanged. Rationale: the v2.4.0 external review (`docs/upgrade_plz.md` Gap C) flagged the prior value as over-confident relative to body-text-only evidence.
> - **Phase 2C++.2** — legacy `addr_diff > 16` residues removed from `ghidra_analysis.py` and `ghidra_scripts/pcode_taint.py`. Commit `3352783` (v2.4.1) replaced the primary CALL-matching path with callee-name resolution but left a dead `trace_pcode_forward()` helper inside `_PYGHIDRA_SCRIPT` and an unreachable `else: addr_diff` fallback in the analyzeHeadless Strategy 1 loop. Both are now physically removed; `_trace_forward_pcode()`'s `source_api_name` parameter is required (no default). No runtime behaviour change — the production paths have done callee-name matching for 13 days already. Guard-rail tests in `tests/test_ghidra_dead_code_removed.py` pin the removal.
> - **No scorecard movement expected.** Gap B was runtime-effective since v2.4.1. Gap C's new ceiling only binds on `decompiled_colocated` traces, which are emitted solely by the pyghidra fallback (`ghidra_analysis.py:609`) that Ghidra-12 environments don't exercise. Phase 2D' Entry Gate remains at the v2.7.1 figure of record (2/5 PASS). Pair-eval re-measurement is deferred to the session that evaluates Gap A (interprocedural taint) ROI.
> - **Pivot Option D (compliance-led identity) remains in force.** v2.7.2 is detection-engine hygiene, not a behavioural pivot. The `compliance_report` stage and four standard mappings shipped in v2.7.0 are unchanged.
> [!TIP]
> **What's new in v2.7.1 (Phase 2C+.4 vendor corpus expansion — quantitative refinement of v2.7.0's scenario C)**
> - **Pair-eval corpus extended 7 → 12 pairs** — five new vendor/model entries: D-Link DIR-859 (CVE-2019-17621), D-Link DIR-878 (vendor advisory), ASUS RT-AC68U (CVE-2020-15498), Linksys WRT1900AC v2 (progression), and Linksys EA6700 (progression). Manifest registration alone clears Phase 2D' Entry Gate 5 (corpus ≥ 10).
> - **Phase 2D' Entry Gate scorecard: 1/5 → 2/5 PASS** (Gate 4 Rerun + Gate 5 Corpus). Gate 1 recall improves 0.143 → 0.167 (+17% relative) but still FAIL; Gate 2 (tier variation) and Gate 3 (finding diversity 0.917) remain FAIL. The new TP/FP pair (DIR-859 vuln + patched both hit `aiedge.findings.web.exec_sink_overlap`) corroborates the v2.7.0 diagnosis that `findings.py`'s single-synthesis-finding selection is the structural Gate 1/3 limit.
> - **Honest measurement-of-record protocol** — an intermediate 1st-pass measurement under partial WRT1900AC extractions transiently showed Gate 2 PASS due to `aiedge.findings.analysis_incomplete` populating the `unknown` tier; the figure of record is the ok-state measurement after the 2400-second budget rerun (Gate 2 reverts to FAIL). Documented in `docs/v2.7.1_release_plan.md`.
> - **Scorer reliability fix** — `scripts/score_pair_corpus.py` now gracefully skips pairs with absent runs (`vulnerable_status="missing"` / `patched_status="missing"`) instead of aborting on `StopIteration`, so corpus growth and partial-coverage measurements no longer crash the release gate.
> - **Pivot Option D (compliance-led identity) remains in force** — v2.7.1 is a quantitative refinement of v2.7.0's scenario C, not a re-pivot. The `compliance_report` stage and four standard mappings shipped in v2.7.0 are unchanged.
Why SCOUT?
Every finding has a hash-anchored evidence chain.
No finding without a file path, byte offset, SHA-256 hash, and rationale. Artifacts are immutable and traceable from firmware blob to final verdict.
4-tier confidence caps with Ghidra P-code verification -- honest scoring.
SYMBOL_COOCCURRENCE capped at 0.40, STATIC_CODE_VERIFIED at 0.55, STATIC_ONLY at 0.60, PCODE_VERIFIED at 0.75. Promotion to `confirmed` requires dynamic verification. We don't inflate scores.
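As a rough sketch of how such a cap ladder behaves (constant and function names here are illustrative, not SCOUT's internal API):

```python
# Hypothetical sketch of the documented 4-tier cap ladder.
CONFIDENCE_CAPS = {
    "symbol_cooccurrence": 0.40,   # symbol co-occurrence only
    "static_code_verified": 0.55,  # static evidence tied to code
    "static_only": 0.60,           # static analysis, no dynamic proof
    "pcode_verified": 0.75,        # Ghidra P-code dataflow confirmed
}

def capped_confidence(raw_score: float, evidence_tier: str) -> float:
    """Clamp a raw confidence score to the ceiling its evidence tier allows."""
    return min(raw_score, CONFIDENCE_CAPS[evidence_tier])
```

The point of the ladder is that no amount of model enthusiasm can push a symbol-co-occurrence finding past 0.40; only stronger evidence unlocks a higher ceiling.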
SARIF + CycloneDX VEX + SLSA provenance -- standard formats.
GitHub Code Scanning, VS Code, CI/CD integration out of the box.
Built for analyst-in-the-loop firmware review.
SCOUT is strongest when used to start deep review on a single firmware image quickly, expose evidence paths, and preserve matched reasoning lineage across triage and reporting surfaces. Analyst hints loop back into next-run LLM adjudication via MCP, while final verdict ownership stays with the reviewer.
ER605-style exploitability dossiers.
The `exploitability_dossier` stage converts findings into an analysis-only decision log: target context, input surface, reachability, controllability, primitive hypothesis, mitigation friction, chain candidates, and patch/variant questions. It ranks exploit leads without generating payloads or claiming verified exploitability.
In the gated `profile=exploit` lane, those ranked leads can seed `exploit_autopoc`, which generates lab-only proof plugins and verifies them through `exploit_runner`, `poc_validation`, and `exploit_policy`.
How It Works
```
firmware.bin ──> 47-stage pipeline ──> SARIF findings        ──> Web viewer
                 (auto Ghidra)         CycloneDX SBOM+VEX        TUI dashboard
                 (auto CVE match)      Evidence chain            GitHub/VS Code
                 (optional LLM)        Exploitability dossier    MCP for AI agents
                                       SLSA attestation
```
```bash
# Full analysis
./scout analyze firmware.bin

# Static-only (no LLM, $0)
./scout analyze firmware.bin --no-llm

# Pre-extracted rootfs
./scout analyze firmware.img --rootfs /path/to/rootfs

# Web viewer
./scout serve aiedge-runs/<run_id> --port 8080

# TUI dashboard
./scout ti    # interactive (latest run)
./scout tw    # watch mode (auto-refresh)

# MCP server for AI agents
./scout mcp --project-id aiedge-runs/<run_id>
```
Comparison
| Feature | SCOUT | FirmAgent | EMBA | FACT | FirmAE |
|---|---|---|---|---|---|
| Scale (firmware tested) | 1,123 | 14 | -- | -- | 1,124 |
| SBOM (CycloneDX 1.6+VEX) | Yes | No | Yes | No | No |
| SARIF 2.1.0 Export | Yes | No | No | No | No |
| Hash-Anchored Evidence Chain | Yes | No | No | No | No |
| SLSA L2 Provenance | Yes | No | No | No | No |
| Known CVE Signature Matching | Yes (2,528 CVEs, 25 sigs) | No | No | No | No |
| Confidence Caps (honest scoring) | Yes | No | No | No | No |
| Ghidra Integration (auto-detect) | Yes | IDA Pro | Yes | No | No |
| AFL++ Fuzzing Pipeline | Yes | Yes | No | No | No |
| Cross-Binary IPC Chains | Yes (5 types) | No | No | No | No |
| Taint Propagation (LLM) | Yes | Yes (DeepSeek) | No | No | No |
| Adversarial FP Reduction | Yes | No | No | No | No |
| MCP Server (AI agent) | Yes | No | No | No | No |
| Web Report Viewer | Yes | No | Yes | Yes | No |
| Zero pip Dependencies | Yes | No | No | No | No |
Key Features
| | Feature | Description |
|---|---|---|
| :package: | SBOM & CVE | CycloneDX 1.6 + VEX + 25 known CVE signatures (8 vendors) + NVD scan + 2,528 local CVE DB + EPSS scoring (FIRST.org API, batched + cached) |
| :mag: | Binary Analysis | Ghidra P-code SSA dataflow taint + ELF hardening (NX/PIE/RELRO/Canary/FORTIFY) + .dynstr detection + 28 sink symbols + format string detection |
| :dart: | Attack Surface | Source→sink tracing, web server auto-detection, cross-binary IPC chains (5 types: unix socket, dbus, shm, pipe, exec) |
| :brain: | Taint Analysis | HTTP-aware inter-procedural taint, P-code SSA dataflow, call chain visualization, 4-strategy fallback (P-code → colocated → decompiled → interprocedural) |
| :robot: | LLM Engine | 4 backends (Codex CLI / Claude API / Claude Code CLI / Ollama) + centralized system prompts + structured JSON output + 5-stage parser (preamble/fence/raw/brace-counting/error-recovery) + temperature control |
| :crossed_swords: | LLM-Adjudicated Debate | Advocate/Critic LLM debate for LLM-adjudicated FPR reduction (99.3% on the Tier 2 carry-over baseline). Separate parse_failures vs llm_call_failures observability with quota_exhausted detection |
| :compass: | Explainability Surface (v2.6.1) | reasoning_trail persisted across findings, analyst Markdown, TUI, and web viewer so reviewers can inspect matched evidence lineage — not just a final score. Advocate / critic / decision / pattern-hit entries with 200-char excerpt redaction |
| :inbox_tray: | Analyst-in-the-loop Channel (v2.6.1) | 4 MCP tools for reasoning lookup, hint injection, verdict override, and category filtering. Hints loop back into next-run advocate prompt via AIEDGE_FEEDBACK_DIR (opt-in, fcntl.flock-safe) |
| :triangular_ruler: | Detection vs Priority Separation (v2.6.0) | confidence stays evidence-bound (≤0.55 static cap); priority_score / priority_inputs capture EPSS, reachability, backport, and CVSS as ranking signals. See docs/scoring_calibration.md |
| :speedboat: | Parallel DAG Execution (v2.6.0, PoC) | --experimental-parallel [N] opt-in level-wise stage parallelism (ThreadPoolExecutor + Kahn topo levels). 15 levels / max-width 7 on the 47-stage pipeline. Sequential path unchanged |
| :shield: | Security Assessment | X.509 cert scan, boot service audit, filesystem permission checks, credential mapping, hardcoded secret detection |
| :test_tube: | Fuzzing (optional) | AFL++ with CMPLOG, persistent mode, NVRAM faker, harness generation, crash triage |
| :bug: | Emulation | 4-tier (FirmAE / Pandawan+FirmSolo / QEMU user-mode / rootfs inspection) + GDB remote debug |
| :electric_plug: | MCP Server | 12 tools via Model Context Protocol for Claude Code/Desktop integration |
| :bar_chart: | Web Viewer | Glassmorphism dashboard with KPI bar, IPC map, risk heatmap, interactive evidence navigation |
| :link: | Evidence Chain | SHA-256 anchored artifacts + 4-tier confidence caps (0.40/0.55/0.60/0.75) + 5-tier exploit promotion ladder |
| :scroll: | Standard Output | SARIF 2.1.0 (GitHub Code Scanning) + CycloneDX 1.6 + VEX + SLSA Level 2 in-toto attestation |
| :gear: | CI/CD Integration | GitHub Action (.github/actions/scout-scan/) with composite Docker action + automatic SARIF upload to GitHub Security tab |
| :scales: | Regulatory Alignment | Output formats compatible with EU CRA Annex I (docs/compliance_mapping/cra_annex_i.md); SBOM output compatible with FDA Section 524B guidance; output formats compatible with ISO 21434 / UN R155 |
| :chart_with_upwards_trend: | Benchmarking | FirmAE dataset (1,123 firmware), analyst-readiness scoring, verifier-backed archive bundles, TP/FP analysis scripts |
| :key: | Vendor Decrypt | D-Link SHRS AES-128-CBC auto-decryption; Shannon entropy encryption detection (>7.9); binwalk v3 compatibility |
| :white_check_mark: | Zero Dependencies | Pure Python 3.10+ stdlib only — no pip dependencies, air-gap friendly deployment |
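The Shannon-entropy encryption heuristic mentioned above (flagging blobs whose entropy exceeds 7.9 bits/byte) can be sketched in stdlib-only Python; the function names are illustrative, not SCOUT's internal API:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits of entropy per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def looks_encrypted(blob: bytes, threshold: float = 7.9) -> bool:
    """Flag blobs whose byte distribution is near-uniform, as ciphertext is."""
    return shannon_entropy(blob) > threshold
```

Plaintext configs and code sit well below the threshold, while well-encrypted or compressed sections approach the 8.0 ceiling, which is why compressed-but-unencrypted data needs corroborating signals before a decrypt attempt.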
Analyst Copilot Surfaces
Explainability surface
- `reasoning_trail` and evidence lineage are preserved across findings, analyst Markdown, TUI, web viewer, and SARIF properties.
- This is where reviewers inspect why a finding was downgraded, upheld, or promoted.
Analyst-in-the-loop channel
- MCP tools and `AIEDGE_FEEDBACK_DIR` are the supported override/hint path.
- Human hints are allowed to influence the next run; final ownership still stays with the analyst.
Autonomous reasoning (future)
- SCOUT is not positioned as a fully autonomous exploit agent in v2.6.1.
- Multi-agent exploit chains, pair-grounded evaluation loops, and autonomous fuzz harness generation remain Phase 2D / reviewer-eval lane work.
Pipeline (47 Stages)
```
Firmware --> Unpack --> Profile --> Inventory --> Ghidra --> Semantic Classification
         --> SBOM --> CVE Scan --> Reachability --> Endpoints --> Surfaces
         --> Enhanced Source --> C-Source ID --> Taint Propagation
         --> FP Verification --> Adversarial Triage
         --> Graph --> Attack Surface --> Findings
         --> LLM Triage --> LLM Synthesis --> Emulation --> [Fuzzing]
         --> PoC Refinement --> Chain Construction --> Exploitability Dossier
         --> Protocol Model --> Exploit State Machine --> Crash Replay --> Primitive Verifier
         --> Exploit Chain --> PoC --> Verification
```
Ghidra is auto-detected and enabled by default. Stages in [brackets] require optional external tools (AFL++/Docker).
| Stage | Module | Purpose | LLM? | Cost |
|---|---|---|---|---|
| `tooling` | `tooling.py` | External tool availability check (binwalk, Ghidra, Docker) | No | $0 |
| `extraction` | `extraction.py` | Firmware unpacking (binwalk + vendor_decrypt + Shannon entropy detection) | No | $0 |
| `structure` | `structure.py` | Filesystem structure analysis | No | $0 |
| `carving` | `carving.py` | File carving from unstructured regions | No | $0 |
| `firmware_profile` | `firmware_profile.py` | Architecture, kernel, init system fingerprinting | No | $0 |
| `inventory` | `inventory.py` | Per-binary ELF hardening + symbol extraction | No | $0 |
| `ghidra_analysis` | `ghidra_analysis.py` | Decompilation + P-code SSA dataflow analysis | No | $0 |
| `semantic_classification` | `semantic_classifier.py` | 3-pass function classifier (static → haiku → sonnet) | Yes | Low |
| `sbom` | `sbom.py` | CycloneDX 1.6 SBOM generation with VEX | No | $0 |
| `cve_scan` | `cve_scan.py` | NVD + 25 known signatures + EPSS enrichment | No | $0 |
| `reachability` | `reachability.py` | BFS-based call-graph reachability | No | $0 |
| `endpoints` | `endpoints.py` | Network endpoint discovery | No | $0 |
| `surfaces` | `surfaces.py` | Attack surface enumeration | No | $0 |
| `enhanced_source` | `enhanced_source.py` | Web server auto-detection + INPUT_APIS scan (21 APIs) | No | $0 |
| `csource_identification` | `csource_identification.py` | HTTP input source identification via static sentinel + QEMU | No | $0 |
| `taint_propagation` | `taint_propagation.py` | Inter-procedural taint with 28 sinks + format string detection | Yes | Medium |
| `fp_verification` | `fp_verification.py` | 3-pattern FP removal + LLM verification with parse/call failure separation | Yes | Low |
| `adversarial_triage` | `adversarial_triage.py` | Advocate/Critic LLM debate (LLM-adjudicated FPR reduction, 99.3%) | Yes | Medium |
| `graph` | `graph.py` | Communication graph (5 IPC edge types) | No | $0 |
| `attack_surface` | `attack_surface.py` | Attack surface mapping with IPC chains | No | $0 |
| `attribution` | `attribution.py` | Vendor/firmware attribution | No | $0 |
| `functional_spec` | `functional_spec.py` | Functional specification extraction | No | $0 |
| `threat_model` | `threat_model.py` | STRIDE-based threat modeling | No | $0 |
| `web_ui` | `web_ui.py` | Web UI / CGI endpoint analysis | No | $0 |
| `findings` | `findings.py` | Finding aggregation + SARIF export | No | $0 |
| `llm_triage` | `llm_triage.py` | LLM finding triage (haiku/sonnet/opus auto-routing) | Yes | Variable |
| `llm_synthesis` | `llm_synthesis.py` | LLM finding synthesis | Yes | Medium |
| `emulation` | `emulation.py` | 4-tier emulation (FirmAE / Pandawan / QEMU / rootfs) | No | $0 |
| `dynamic_validation` | `dynamic_validation.py` | Dynamic behavior verification | No | $0 |
| `fuzzing` | `fuzz_*.py` | AFL++ fuzzing with NVRAM faker | No | $0 |
| `poc_refinement` | `poc_refinement.py` | Iterative PoC generation (5 attempts) | Yes | Medium |
| `chain_construction` | `chain_constructor.py` | Same-binary + cross-binary IPC exploit chains | No | $0 |
| `exploitability_dossier` | `exploitability_dossier.py` | ER605-style analysis-only exploitability decision log | No | $0 |
| `protocol_model` | `protocol_model.py` | Run-local RAG protocol/input model + safe encoder skeleton | Optional | Low |
| `exploit_state_machine` | `exploit_state_machine.py` | Candidate-level reachability/trigger/leak/control proof DAG | No | $0 |
| `crash_replay` | `crash_replay.py` | Lab-gated QEMU/cyclic crash replay and GDB script collector | No | $0 |
| `primitive_verifier` | `primitive_verifier.py` | Verifier-backed primitive classification from bundles/crashes/GDB traces | No | $0 |
| `exploit_gate` | `stage_registry.py` | Exploit promotion gate | No | $0 |
| `exploit_chain` | `exploit_chain.py` | Exploit chain validation | No | $0 |
| `exploit_autopoc` | `exploit_autopoc.py` | Automated PoC orchestration | Yes | Medium |
| `poc_validation` | `poc_validation.py` | PoC reproduction validation | No | $0 |
| `exploit_policy` | `exploit_policy.py` | Final exploit promotion decision | No | $0 |
OTA-specific stages: ota, ota_payload, ota_fs, ota_roots, ota_boottriage, firmware_lineage (Android-style OTA payload analysis).
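The `reachability` stage's BFS over the call graph can be sketched in a few lines of stdlib Python. The function name and the `{caller: [callees]}` graph shape are illustrative; the real stage operates on symbols extracted by `inventory` and `ghidra_analysis`:

```python
from collections import deque

def reachable_functions(call_graph: dict, entry_points: list) -> set:
    """Breadth-first search over a {caller: [callees]} adjacency map,
    returning every function reachable from the given entry points."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen
```

A dangerous sink that is *not* in the reachable set from any input-handling entry point is a strong hint the finding is dead code, which is how reachability feeds the confidence and priority signals downstream.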
Benchmarks
Tier 1 (Static, frozen baseline)
Baseline: v2.6.1, 2026-04-17, fresh corpus refresh (docs/carry_over_benchmark_v2.6.md)
- 1,123 firmware / 8 vendors / 98.8% success rate
- 1,110 success / 4 partial / 9 fatal
- 3,531 findings / 146,943 CVE matches
- 1,089 / 1,110 successful runs produced nonzero CVE output
Tier 2 (LLM-Adjudicated Adversarial Debate, GPT-5.3-Codex)
Baseline: v2.3.0, 2026-04-09, claude-code driver (carry-over; pair-eval lane still pending)
- 36 firmware / 9 vendors
- 2,430 findings debated → 2,412 downgraded + 18 maintained
- LLM-adjudicated FPR reduction: 99.3% | pair-grounded FN/FP: pending reviewer eval lane
v2.6.0 Post-merge Real-Firmware Validation
This section records post-release real-firmware validation runs, distinct from the carry-over corpus baselines above.
Validation target 1 — Netgear R7000 (codex driver, --experimental-parallel 4)
| Metric | v2.5.0 | v2.6.0 |
|---|---|---|
| `adversarial_triage` parse_failures | 0/100 | 0/100 (100 debated, 97 downgraded, 3 maintained) |
| `fp_verification` unverified | 0/100 | 0/100 (100 verified: 56 TP, 44 FP) |
| `reasoning_trail_count` (top-level findings) | N/A | 0/3 top-level / 100/100 at adversarial_triage + fp_verification artifacts ¹ |
| findings with `priority_score` | N/A | 3/3 (100% additive priority annotation) |
| `priority_bucket_counts` | N/A | {critical: 0, high: 0, medium: 3, low: 0} |
| category distribution | N/A | {vulnerability: 1, pipeline_artifact: 2, misconfiguration: 0, unclassified: 0} |
| `cve_scan` EPSS enriched | 23/23 | 0 (stage skipped — sbom landed partial and cve_scan/reachability skip on sbom dependency failure ²) |
| `--experimental-parallel 4` wall-clock | N/A | ~170 minutes end-to-end across the registered pipeline (fp_verification dominant at 113 min; no sequential baseline for delta) |
¹ v2.6.0 → v2.6.1 follow-up (commit 7b36274): the top-level synthesis finding (web.exec_sink_overlap) now inherits matched downstream evidence lineage instead of relying only on the stage-level aggregate summary. Matching prefers run-relative binary path, falls back to binary SHA-256, and samples representative downstream trail entries deterministically so the synthesis finding reflects the alerts that actually informed it. This R7000 run reflects the v2.6.0 shipped behaviour.
² v2.6.0 → v2.6.1 follow-up (commit 8e0bb82): the R7000 extraction actually succeeded (1,664 files, 2,412 binaries scanned under squashfs-root), but the SBOM stage returned 0 components on this firmware due to a silent schema mismatch — _collect_so_files_from_inventory read inventory.file_list (a pre-v2.x key no longer emitted) and _detect_from_binary_analysis expected per-entry string_hits (replaced by matched_symbols in the current inventory schema). OpenWrt hid the bug because its opkg database alone contributes 100+ components. The fix makes both helpers walk inventory.roots directly and fall back to reading the binary file contents via a new _extract_ascii_runs helper. A clean re-run of just SbomStage on this R7000 run raises the component count from 0 to 4 (curl 7.36.0 via binary read, plus openssl 1.0.0 / libz 1 / libpthread 0 via .so* walking). Downstream cve_scan / reachability would then produce real CVE + EPSS numbers on a full pipeline re-run.
Validation target 2 — OpenWrt Archer C7 v5 (TP-Link, --no-llm)
| Metric | v2.6.0 |
|---|---|
| total findings | 3 |
| `reasoning_trail_count` | 0 (no-llm: adversarial_triage and fp_verification are LLM-gated; the trail is populated only when LLM stages run) |
| findings with `priority_score` | 3 / 3 (100% — additive priority annotation succeeded for all findings) |
| `priority_bucket_counts` | {critical: 0, high: 0, medium: 3, low: 0} |
| category distribution | {vulnerability: 1, pipeline_artifact: 2, misconfiguration: 0, unclassified: 0} (PR #7a 3-category ontology, 0% unclassified rate) |
| notable caveats | OpenWrt uses a squashfs/ext4 root; binwalk extracted cleanly; the --no-llm path skipped reasoning_trail generation as expected. The run completed end-to-end through the findings stage. |
See CHANGELOG.md for full version history and docs/scoring_calibration.md for the two-score contract.
Architecture
```
+--------------------------------------------------------------------+
|                       SCOUT (Evidence Engine)                      |
|                                                                    |
|  Firmware --> Unpack --> Profile --> Inventory --> SBOM --> CVE    |
|                 |                       |            |        |    |
|              Ghidra                Binary Audit   40+ sigs   NVD+  |
|            auto-detect             NX/PIE/etc             local DB |
|                                                                    |
|  --> Taint --> FP Filter --> Attack Surface --> Findings           |
|   (HTTP-aware)  (3-pattern)    (IPC chains)    (SARIF 2.1.0)       |
|                                                                    |
|  --> Emulation --> [Fuzzing] --> Exploit Chain --> PoC --> Verify  |
|                                                                    |
|  42 stages . SHA-256 manifests . 4-tier confidence caps            |
|  (0.40/0.55/0.60/0.75)                                             |
|  Outputs: SARIF + CycloneDX VEX + SLSA L2 + Markdown reports       |
+--------------------------------------------------------------------+
|                 Handoff (firmware_handoff.json)                    |
+--------------------------------------------------------------------+
|                     Terminator (Orchestrator)                      |
|       LLM Tribunal --> Dynamic Validation --> Verified Chain       |
+--------------------------------------------------------------------+
```
| Layer | Role | Deterministic? |
|---|---|---|
| SCOUT | Evidence production (42 stages) | Yes |
| Handoff | JSON contract between engine and orchestrator | Yes |
| Terminator | LLM tribunal, dynamic validation, exploit dev | No (auditable) |
Exploit Promotion Policy
| Level | Requirements | Placement |
|---|---|---|
| `dismissed` | Critic rebuttal strong or confidence < 0.5 | Appendix only |
| `candidate` | Confidence 0.5-0.8, evidence exists but chain incomplete | Report (flagged) |
| `high_confidence_static` | Confidence >= 0.8, strong static evidence, no dynamic | Report (highlighted) |
| `confirmed` | Confidence >= 0.8 AND >= 1 dynamic verification artifact | Report (top) |
| `verified_chain` | Confirmed AND PoC reproduced 3x in sandbox | Exploit report |
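A minimal sketch of how the table above maps evidence to a promotion level. This is illustrative only; the real decision lives in the `exploit_policy` stage, and the parameter names here are hypothetical:

```python
def promotion_level(confidence: float,
                    strong_critic_rebuttal: bool = False,
                    dynamic_artifacts: int = 0,
                    sandbox_repros: int = 0) -> str:
    """Map debate outcome + evidence counts onto the 5-level promotion ladder."""
    if strong_critic_rebuttal or confidence < 0.5:
        return "dismissed"
    if confidence < 0.8:
        return "candidate"
    if dynamic_artifacts < 1:
        return "high_confidence_static"   # strong static evidence, no dynamic proof
    if sandbox_repros >= 3:
        return "verified_chain"           # confirmed AND PoC reproduced 3x
    return "confirmed"
```

Note the ordering: a critic rebuttal dismisses a finding regardless of its raw confidence, and dynamic evidence is required before any static score can reach `confirmed`.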
CLI Reference
| Command | Description |
|---|---|
| `./scout analyze <firmware>` | Full 47-stage analysis pipeline |
| `./scout analyze <firmware> --quiet` | Suppress real-time progress output (CI/scripted use) |
| `./scout analyze-8mb <firmware>` | Truncated 8MB canonical track |
| `./scout stages <run_dir> --stages X,Y` | Rerun specific stages |
| `./scout serve <run_dir>` | Launch web report viewer |
| `./scout mcp [--project-id <id>]` | Start MCP stdio server |
| `./scout tui <run_dir>` | Terminal UI dashboard |
| `./scout ti` | TUI interactive (latest run) |
| `./scout tw` | TUI watch mode (auto-refresh) |
| `./scout to` | TUI one-shot (latest run) |
| `./scout t` | TUI default (latest run) |
| `./scout corpus-validate` | Validate corpus manifest |
| `./scout quality-metrics` | Compute quality metrics |
| `./scout quality-gate` | Check quality thresholds |
| `./scout release-quality-gate` | Unified release gate |
Exit codes: 0 success, 10 partial, 20 fatal, 30 policy violation
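In CI, those exit codes can be mapped onto a pass/fail policy; a hypothetical wrapper (not part of SCOUT) that treats partial runs as acceptable by default:

```python
EXIT_MEANINGS = {0: "success", 10: "partial", 20: "fatal", 30: "policy violation"}

def ci_should_fail(exit_code: int, allow_partial: bool = True) -> bool:
    """Decide whether a CI job should fail for a given scout exit code."""
    if exit_code == 0:
        return False
    if exit_code == 10:
        return not allow_partial   # partial: configurable
    return True                    # fatal, policy violation, or unexpected codes
```

Distinguishing 10 (partial) from 20 (fatal) matters for large corpus runs, where a handful of partial extractions usually should not block the whole job.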
```bash
# FirmAE dataset benchmark (1,123 usable firmware images in the current frozen baseline)
./scripts/benchmark_firmae.sh --parallel 8 --time-budget 1800 --cleanup

# Options
--dataset-dir DIR    # Firmware directory (default: aiedge-inputs/firmae-benchmark)
--results-dir DIR    # Output directory
--file-list PATH     # Explicit newline-delimited firmware list
--parallel N         # Concurrent jobs (default: 4)
--time-budget S      # Seconds per firmware (default: 600)
--stages STAGES      # Specific stages (default: full pipeline)
--max-images N       # Limit images (0 = all)
--llm                # Enable LLM-backed stages
--8mb                # Use 8MB truncated track
--full               # Include dynamic stages
--cleanup            # Preserve a verifier-friendly run replica under results/archives/, then delete original run dirs
--dry-run            # List files without running

# Analyst-readiness re-evaluation for an existing benchmark-results tree
python3 scripts/reevaluate_benchmark_results.py \
    --results-dir benchmark-results/<run>

# Normalize legacy bundles and rerun a stage subset (useful for debugging archive fidelity issues)
python3 scripts/rerun_benchmark_stages.py \
    --results-dir benchmark-results/<legacy-run> \
    --out-dir benchmark-results/<rerun-out> \
    --stages attribution,graph,attack_surface \
    --no-llm

# Post-benchmark analysis
PYTHONPATH=src python3 scripts/cve_rematch.py \
    --results-dir benchmark-results/firmae-YYYYMMDD_HHMM \
    --nvd-dir data/nvd-cache \
    --csv-out cve_matches.csv

PYTHONPATH=src python3 scripts/analyze_findings.py \
    --results-dir benchmark-results/firmae-YYYYMMDD_HHMM \
    --output analysis_report.json

# FirmAE dataset setup
./scripts/unpack_firmae_dataset.sh [ZIP_FILE]

# Tier 1 frozen baseline docs
# - docs/tier1_rebenchmark_frozen_baseline.md
# - docs/tier1_rebenchmark_final_analysis.md
```
Current benchmark contract
- Archived benchmark bundles are now expected to be run replicas, not flattened JSON snapshots.
- Benchmark quality is reported in two layers:
  - analysis rate = pipeline completed (`success + partial`)
  - analyst-ready rate = archived bundle passes analyst/verifier checks and remains evidence-navigable
- `benchmark-results/legacy/tier2-llm-v2` is a legacy snapshot. It is useful for historical reference and re-evaluation, but it should not be used as the canonical analyst-readiness baseline.
- The current contract has been validated on a fresh single-sample run (`benchmark-results/tier2-single-fidelity`) where both analyst verifiers passed from the archived bundle.
Current LLM quality behavior
- `llm_triage` model routing: `<=10` haiku, `11-50` sonnet, `>50` or chain-backed opus
- `llm_triage` retries with `sonnet` if a `haiku` call exits non-zero
- `llm_triage`, `semantic_classification`, `adversarial_triage`, and `fp_verification` now write `stages/<stage>/llm_trace/*.json`
- Parse failures are handled fail-closed: repaired when possible, otherwise reported as degraded/partial instead of silently treated as clean success
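The documented routing rule amounts to a small threshold function; a sketch with hypothetical names, not SCOUT's actual implementation:

```python
def route_triage_model(finding_count: int, chain_backed: bool = False) -> str:
    """Route llm_triage to a model tier by finding volume, per the
    documented rule: <=10 haiku, 11-50 sonnet, >50 or chain-backed opus."""
    if chain_backed or finding_count > 50:
        return "opus"
    if finding_count > 10:
        return "sonnet"
    return "haiku"
```

Chain-backed findings escalate straight to the strongest model regardless of volume, since cross-binary chains carry the most triage risk.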
Core
| Variable | Default | Description |
|---|---|---|
| `AIEDGE_LLM_DRIVER` | `codex` | LLM provider: codex / claude / claude-code / gemini / ollama |
| `ANTHROPIC_API_KEY` | -- | API key for Claude driver (not needed for claude-code) |
| `AIEDGE_OLLAMA_URL` | `http://localhost:11434` | Ollama server URL |
| `AIEDGE_LLM_BUDGET_USD` | -- | LLM cost budget limit |
| `AIEDGE_PRIV_RUNNER` | -- | Privileged command prefix for dynamic stages |
| `AIEDGE_FEEDBACK_DIR` | `aiedge-feedback` | Terminator feedback directory |
Ghidra
| Variable | Default | Description |
|---|---|---|
| `AIEDGE_GHIDRA_HOME` | auto-detect | Ghidra install path; probes /opt/ghidra_*, /usr/local/ghidra* |
| `AIEDGE_GHIDRA_MAX_BINARIES` | `20` | Max binaries to analyze |
| `AIEDGE_GHIDRA_TIMEOUT_S` | `300` | Per-binary analysis timeout |
SBOM & CVE
| Variable | Default | Description |
|---|---|---|
| `AIEDGE_NVD_API_KEY` | -- | NVD API key (optional, improves rate limits) |
| `AIEDGE_NVD_CACHE_DIR` | -- | Cross-run NVD response cache |
| `AIEDGE_SBOM_MAX_COMPONENTS` | `500` | Maximum SBOM components |
| `AIEDGE_CVE_SCAN_MAX_COMPONENTS` | `50` | Maximum components to CVE-scan |
| `AIEDGE_CVE_SCAN_TIMEOUT_S` | `30` | Per-request NVD API timeout |
Fuzzing & Emulation
| Variable | Default | Description |
|---|---|---|
| `AIEDGE_AFLPP_IMAGE` | `aflplusplus/aflplusplus` | AFL++ Docker image |
| `AIEDGE_FUZZ_BUDGET_S` | `3600` | Fuzzing time budget (seconds) |
| `AIEDGE_FUZZ_MAX_TARGETS` | `5` | Max fuzzing target binaries |
| `AIEDGE_EMULATION_IMAGE` | `scout-emulation:latest` | Emulation Docker image |
| `AIEDGE_FIRMAE_ROOT` | `/opt/FirmAE` | FirmAE installation path |
| `AIEDGE_QEMU_GDB_PORT` | `1234` | QEMU GDB remote port |
Quality Gates
| Variable | Default | Description |
|---|---|---|
| `AIEDGE_QG_PRECISION_MIN` | `0.9` | Minimum precision threshold |
| `AIEDGE_QG_RECALL_MIN` | `0.6` | Minimum recall threshold |
| `AIEDGE_QG_FPR_MAX` | `0.1` | Maximum false positive rate |
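A sketch of how these thresholds could be read and enforced. This is illustrative; the shipped gate is `./scout quality-gate`, and the function name here is hypothetical:

```python
import os

def quality_gate_passes(precision: float, recall: float, fpr: float,
                        env=os.environ) -> bool:
    """Check metrics against the documented defaults, overridable via the
    AIEDGE_QG_* environment variables."""
    return (precision >= float(env.get("AIEDGE_QG_PRECISION_MIN", "0.9"))
            and recall >= float(env.get("AIEDGE_QG_RECALL_MIN", "0.6"))
            and fpr <= float(env.get("AIEDGE_QG_FPR_MAX", "0.1")))
```

Passing a plain dict as `env` makes the gate easy to unit-test without mutating the process environment.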
```
aiedge-runs/<run_id>/
├── manifest.json
├── firmware_handoff.json
├── provenance.intoto.jsonl        # SLSA L2 attestation
├── input/firmware.bin
├── stages/
│   ├── extraction/                # Unpacked filesystem
│   ├── inventory/
│   │   └── binary_analysis.json   # Per-binary hardening + symbols
│   ├── enhanced_source/
│   │   └── sources.json           # HTTP input sources + web server detection
│   ├── sbom/
│   │   ├── sbom.json              # CycloneDX 1.6
│   │   └── vex.json               # VEX exploitability
│   ├── cve_scan/
│   │   └── cve_matches.json       # NVD + known signature matches
│   ├── taint_propagation/
│   │   └── taint_results.json     # Taint paths + call chains
│   ├── ghidra_analysis/           # Decompiled functions (optional)
│   ├── chain_construction/
│   │   └── chains.json            # Same-binary + cross-binary IPC chains
│   ├── findings/
│   │   ├── findings.json          # All findings
│   │   ├── pattern_scan.json      # Static pattern matches
│   │   ├── sarif.json             # SARIF 2.1.0 export
│   │   └── stage.json             # SHA-256 manifest
│   └── ...                        # 42 stage directories total
└── report/
    ├── viewer.html                # Web dashboard
    ├── report.json
    ├── analyst_digest.json
    └── executive_report.md
```
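The per-stage `stage.json` SHA-256 manifests can be spot-checked with a few lines of stdlib Python. The `{"sha256": {name: digest}}` layout below is an assumed schema for illustration, not SCOUT's exact manifest format; the verification scripts that follow are the supported path:

```python
import hashlib
import json
from pathlib import Path

def verify_stage_artifacts(stage_dir: str) -> list:
    """Recompute SHA-256 for each artifact listed in stage.json and return
    the names of artifacts whose digest no longer matches the manifest.
    Assumes a {"sha256": {filename: hex_digest}} manifest layout."""
    root = Path(stage_dir)
    manifest = json.loads((root / "stage.json").read_text())
    return [name for name, expected in manifest.get("sha256", {}).items()
            if hashlib.sha256((root / name).read_bytes()).hexdigest() != expected]
```

An empty return list means every anchored artifact is byte-identical to what the stage recorded; any name in the list indicates post-run tampering or corruption.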
Verification Scripts
```bash
# Evidence chain integrity
python3 scripts/verify_analyst_digest.py --run-dir aiedge-runs/<run_id>
python3 scripts/verify_verified_chain.py --run-dir aiedge-runs/<run_id>

# Report schema compliance
python3 scripts/verify_aiedge_final_report.py --run-dir aiedge-runs/<run_id>
python3 scripts/verify_aiedge_analyst_report.py --run-dir aiedge-runs/<run_id>

# Security invariants
python3 scripts/verify_run_dir_evidence_only.py --run-dir aiedge-runs/<run_id>
python3 scripts/verify_network_isolation.py --run-dir aiedge-runs/<run_id>

# Quality gates
./scout release-quality-gate aiedge-runs/<run_id>
```
Documentation
| Document | Purpose |
|---|---|
| Blueprint | Pipeline architecture and design rationale |
| Status | Current implementation status |
| Artifact Schema | Profiling + inventory contracts |
| Adapter Contract | Terminator-SCOUT handoff protocol |
| Report Contract | Report structure and governance |
| Analyst Digest | Digest schema and verdicts |
| Verified Chain | Evidence requirements |
| Duplicate Gate | Cross-run dedup rules |
| Known CVE Ground Truth | CVE validation dataset |
| Upgrade Plan v2 | v2.0 upgrade plan |
| LLM Roadmap | LLM integration strategy |
Security & Ethics
Authorized environments only.
SCOUT is intended for contracted security audits, vulnerability research (responsible disclosure), and CTF/training in lab environments. Dynamic validation runs in network-isolated sandbox containers. No weaponized payloads are included.
Contributing
- Read Blueprint for architecture context
- Run `pytest -q` -- all tests must pass
- Lint with `ruff check src/` -- zero violations
- Follow the Stage protocol (`src/aiedge/stage.py`)
- Zero pip dependencies -- stdlib only
License
Apache 2.0