FlightBox — record, replay, and diff every LLM call your agent makes

Quick Start · Record · Replay · Diff · 中文

flightbox record and replay

Black-box flight recorder for AI agents — record every LLM call your agent makes, replay sessions deterministically, and export a redacted evidence report when something breaks.

FlightBox is local-first. Recordings live in SQLite. No hosted dashboard is required.

Why

An agent failed and nobody can reproduce it. The final answer is in a log, but the interesting evidence is scattered across LLM requests, tool calls, model responses, timing, tokens, and local notes.

FlightBox gives you a deterministic debugging trail:

record OpenAI / Anthropic / LiteLLM calls
replay the same responses later
diff two runs
export JSONL or pytest replay tests
generate a redacted Markdown / HTML report for PRs, CI notes, and teammates

Quick Start

pip install flightbox

Record

import flightbox
from openai import OpenAI

client = OpenAI()

with flightbox.record("debug-session") as rec:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    print(response.choices[0].message.content)

print(f"Recorded as run: {rec.run_id}")

Replay

import flightbox

with flightbox.replay("abc123def4"):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    print(response.choices[0].message.content)

Inspect

flightbox list
flightbox show <run-id>
flightbox stats <run-id>
flightbox timeline <run-id>
flightbox diff <run-a> <run-b>
flightbox diff <run-a> <run-b> --ignore-field request

Export

# JSONL eval rows
flightbox export <run-id> -f jsonl -o eval_dataset.jsonl

# Raw payloads are opt-in; the default JSONL export redacts common secrets.
flightbox export <run-id> -f jsonl --raw -o private_fixture.jsonl

# pytest replay skeleton
flightbox export <run-id> -f pytest -o test_replay.py

# redacted evidence report
flightbox report <run-id> -f md -o evidence.md
flightbox report <run-id> -f html -o evidence.html
flightbox report <run-id> \
  --note "reproduced after retry-path patch" \
  --verify "pytest tests/test_agent.py -q" \
  --env repo=agent-demo \
  -o evidence.md

# compact redacted call timeline
flightbox timeline <run-id> -o timeline.md

# audit raw recordings before sharing evidence
flightbox audit <run-id>
flightbox audit <run-id> -f json -o audit.json
flightbox audit <run-id> --policy .flightboxignore

The report redacts common API keys, bearer tokens, GitHub tokens, and authorization headers before writing the file. It also records lightweight evidence metadata: notes, verification commands, Python version, platform, and optional KEY=VALUE environment facts.
The timeline is a shorter PR-friendly view: one row per recorded call, with provider, model, latency, token totals, error state, and redacted request / response previews.
The audit command scans the raw recording for common secret patterns and reports only the event, top-level field, JSON path, pattern, and redacted preview. For noisy but safe fields, add a .flightboxignore policy:

# Ignore a whole top-level recording field.
field:token_usage

# Ignore one JSON path. `*` matches list entries.
path:request.messages.*.content

# Disable a pattern by name.
pattern:github-token

LiteLLM

pip install "flightbox[litellm]"

import flightbox
import litellm

with flightbox.record("router-debug") as rec:
    litellm.completion(
        model="openrouter/openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "ping"}],
    )

with flightbox.replay(rec.run_id):
    litellm.completion(
        model="openrouter/openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "ping"}],
    )

CLI Reference

flightbox list                    # List recorded runs
flightbox show <run-id>           # Show run details and events
flightbox stats <run-id>          # Summarize latency, tokens, and errors
flightbox timeline <run-id>       # Render a compact redacted call timeline
flightbox audit <run-id>          # Check raw payloads for common secret patterns
flightbox audit <run-id> --policy .flightboxignore
flightbox diff <run-a> <run-b>    # Compare two runs
flightbox diff <run-a> <run-b> --ignore-field request  # Hide expected volatile fields
flightbox export <run-id>         # Export as redacted JSONL or pytest
flightbox export <run-id> --raw   # Keep raw JSONL payloads for private fixtures
flightbox report <run-id>         # Export a redacted evidence report
flightbox report <run-id> --note "..." --verify "pytest -q" --env os=windows
flightbox delete <run-id>         # Delete a recording

Supported SDKs

OpenAI Python SDK (openai>=1.0) — sync and async
Anthropic Python SDK (anthropic>=0.20)
LiteLLM (litellm>=1.0) — completion and acompletion
SDKs and frameworks that call through those clients

Storage

Recordings are stored in .flightbox/recordings.db by default. You can pass a custom database path with --db in the CLI or by constructing RecordStore yourself.

Roadmap

Record / replay / diff / report are solid. The next steps are about covering more of what agents actually call and turning recordings into a real regression gate:

Wider SDK coverage — Google GenAI, Cohere, and raw HTTP LLM clients, so a recording doesn't depend on which SDK an agent happens to use.
A baseline assertion for CI — flightbox assert <run> --against baseline.jsonl that fails the build when an agent's call sequence drifts from a recorded baseline, so behavior change is caught in review, not in production.
Cost and latency trends — roll the per-call token/latency data already captured into a small cross-run summary, so a regression in spend is as visible as one in output.
A local transcript viewer — a single-file HTML view of a run's call chain, for when a Markdown timeline isn't enough to see where two runs diverged.

It stays local-first throughout — no recording ever has to leave your machine.

Related Projects

FlightBox is one of my LLM-reliability tools. A few others in the same vein:

CoreCoder — want to understand how a coding agent really works? Read the whole ~1k-line engine end to end, not a black box.
RepoWiki — dropped into an unfamiliar codebase? It gives you a guided wiki and a where-to-start reading path, a self-hostable DeepWiki alternative.
TokenTracker — no more surprise LLM bills: track every token and dollar with one line of code.
PromptDiff — did that prompt change help or hurt? Semantic diffing with an LLM judge, gated in CI.

License

MIT

Why