decibench

Name: decibench
Author: unforkopensource-org

pip install git+https://github.com/unforkopensource-org/decibench.git

The open testing standard for voice AI agents.
Deterministic + semantic + RAG-augmented evaluation.
Local-first. Zero telemetry. One CLI.

The Problem

You built a voice AI agent. It works in your demo. Then in production:

It hallucinates a refund policy that doesn't exist
It takes 4 seconds to respond and the caller hangs up
It crumbles when someone interrupts mid-sentence
It leaks a customer's SSN in the transcript log

You find out from your customers, not your test suite — because you don't have one.

Decibench fixes this. It's pytest for voice agents.

How It Works

┌─────────────────────────────────────────────────────────┐
│                    decibench run                        │
│                                                         │
│  ┌──────────┐    TTS     ┌──────────────┐   Evaluate    │
│  │ Scenario │ ────────▶  │  Your Agent  │ ───────────▶  │
│  │  (YAML)  │   Audio    │  (any target)│   10 metrics  │
│  └──────────┘  ◀──────── └──────────────┘               │
│                    STT                                   │
│                                                         │
│  Score: 87/100 │ Latency p95: 1.2s │ WER: 3.1%         │
│  ✓ compliance  │ ✓ hallucination   │ ✗ interruption     │
└─────────────────────────────────────────────────────────┘

You write scenarios in YAML (or auto-generate them from your docs)
Decibench synthesizes caller audio, calls your agent, transcribes the response
10 evaluators score every call across latency, accuracy, compliance, and more
Results go to a local SQLite DB → Rich CLI report, HTML, JUnit, or a full Vue dashboard

Quick Start

# Install from source
pip install git+https://github.com/unforkopensource-org/decibench.git

# Or clone and install locally
git clone https://github.com/unforkopensource-org/decibench.git
cd decibench
pip install -e .

# Verify the install
decibench doctor

# Run the built-in demo (zero config, zero API keys)
decibench run target=demo suite=quick

# Open the dashboard
decibench serve

Test your own agent

# WebSocket endpoint (generic — works with any agent)
decibench run target=ws://localhost:8080/ws suite=quick

# Native Vapi / Retell / ElevenLabs
decibench run target=vapi://agent_abc123 suite=standard

# Twilio Media Streams mock (no call credits needed)
decibench run target=twilio://localhost:5050/media suite=realestate

# Spawn a local process and pipe PCM through stdin/stdout
decibench run target='exec:"python my_agent.py"' suite=quick

Three Testing Modes

Mode	What it does	Cost	Speed
`deterministic`	Exact string matching, regex, keyword checks	Free	~ms
`semantic`	LLM-as-Judge scores accuracy, compliance, hallucination	~$0.01/call	~2s
`semantic+rag`	Upload your docs → auto-generate adversarial test suites	~$0.03/call	~5s

# Free deterministic checks only
decibench run target=demo suite=quick --mode deterministic

# Full semantic evaluation with GPT-4o / Claude / Gemini / Ollama
decibench run target=ws://... suite=standard --mode semantic

# Generate tests from your own knowledge base
decibench rag ingest ./docs/training-manual.pdf
decibench rag synthesize --suite-name my-tests --count 20
decibench run target=ws://... suite=my-tests --mode semantic+rag

10 Built-In Evaluators

Every call is scored across all applicable metrics automatically:

Evaluator	Metric	What it catches
Latency	`p50` `p90` `p95` `ttfb`	Slow responses that cause hangups
WER / CER	Word/character error rate	Garbled or inaccurate speech
Hallucination	LLM-graded factual accuracy	Agent invents information
Task Completion	Did the agent achieve the goal?	Broken conversation flows
Compliance	Mandatory disclosures, disclaimers	Regulatory violations
Interruption	Barge-in handling	Agent crashes on user interrupts
Silence	Dead air detection	Agent goes silent mid-call
MOS	Mean Opinion Score (DNSMOS)	Audio quality degradation
STOI	Short-Time Objective Intelligibility	Unintelligible speech
Composite Score	Weighted aggregate of all metrics	Single pass/fail number

Connectors

Decibench talks to your agent, not the other way around. No SDK to install in your agent code.

Connector	Target URI	Status
Demo	`demo://`	✅ Shipped
WebSocket	`ws://host:port/path`	✅ Shipped
HTTP	`http://host/endpoint`	✅ Shipped
Process	`exec:"command"`	✅ Shipped
ElevenLabs	`elevenlabs://agent_id`	✅ Shipped
Twilio Mock	`twilio://host/path`	✅ Shipped
Retell	`retell://agent_id`	🧪 Experimental
Vapi	`vapi://agent_id`	🧪 Experimental
LiveKit	—	📋 Planned
Bland	—	📋 Planned

LLM Judge Providers

Semantic evaluation works with any OpenAI-compatible API:

# decibench.toml
[judge]
provider = "openai"     # or "anthropic", "gemini", "ollama"
model    = "gpt-4o"     # or "claude-sonnet-4-20250514", "gemini-2.5-flash", "llama3"

# Self-hosted? Point at any OpenAI-compatible endpoint
[judge]
provider = "openai"
model    = "mistral-7b"
base_url = "http://localhost:11434/v1"  # Ollama, vLLM, LM Studio, etc.

MCP Server

Decibench ships a Model Context Protocol server so AI coding agents (Cursor, Windsurf, Claude Code) can run and analyze your voice tests directly:

pip install decibench[mcp]
decibench-mcp

Tools exposed: run_test, list_runs, analyze_failures, generate_scenario, manage_suites, and more.

CLI Reference

decibench run           Run a test suite against a target
decibench compare       Side-by-side comparison of two targets
decibench serve         Launch the Vue dashboard + REST API
decibench import        Import production call logs (Vapi, Retell, JSONL)
decibench evaluate-calls Score imported calls against evaluators
decibench replay        Re-evaluate a previous run with different settings
decibench rag ingest    Ingest documents into the RAG corpus
decibench rag synthesize Auto-generate test scenarios from your docs
decibench scenario      List / inspect / generate scenarios
decibench runs          List previous test runs
decibench scoring       View scoring weights and policies
decibench doctor        Verify installation and dependencies
decibench auth          Manage API key storage (keyring-backed)
decibench bridge        Launch the headless browser sidecar

Architecture

decibench/
├── cli/                 # Click CLI — thin wrappers, no business logic
├── connectors/          # Protocol adapters (WS, HTTP, Twilio, ElevenLabs, …)
├── evaluators/          # 10 metric evaluators (latency, WER, hallucination, …)
├── providers/           # Pluggable TTS, STT, and LLM Judge backends
├── reporters/           # Output: Rich terminal, HTML, JSON, JUnit, Markdown
├── rag/                 # Document ingestion, embedding, retrieval, synthesis
├── mcp/                 # Model Context Protocol server (stdio + SSE)
├── store/               # SQLite with migrations, privacy redaction engine
├── bridge/              # Protocol for headless browser sidecar (WebRTC targets)
├── scenarios/           # Built-in test suites (quick, standard, acoustic, adversarial, realestate)
└── api/                 # FastAPI REST server + embedded Vue dashboard

Privacy & Security

Decibench is built for teams that handle sensitive call data:

Zero telemetry — no data leaves your machine, ever
PII redaction engine — phone numbers, SSNs, emails, and credit cards are scrubbed from transcripts before they hit the local SQLite database
API keys in keyring — secrets are stored in your OS keychain, not in config files
Local-only storage — SQLite database stays on your machine unless you explicitly export

Configuration

# decibench.toml

[target]
uri = "ws://localhost:8080/ws"

[tts]
provider = "edge"        # Free Microsoft Edge TTS (default)

[stt]
provider = "faster_whisper"
model    = "base"

[judge]
provider = "openai"
model    = "gpt-4o"

[scoring]
latency_weight       = 0.25
accuracy_weight      = 0.25
compliance_weight    = 0.20
task_completion_weight = 0.20
audio_quality_weight = 0.10

Installation

# Install from GitHub (recommended)
pip install git+https://github.com/unforkopensource-org/decibench.git

# With semantic evaluation + MCP server
pip install "decibench[mcp] @ git+https://github.com/unforkopensource-org/decibench.git"

# With RAG-augmented testing
pip install "decibench[rag] @ git+https://github.com/unforkopensource-org/decibench.git"

# Everything
pip install "decibench[all] @ git+https://github.com/unforkopensource-org/decibench.git"

# Or clone for local development
git clone https://github.com/unforkopensource-org/decibench.git
cd decibench
pip install -e ".[dev]"

Requirements: Python 3.11+ · macOS / Linux / WSL

Note: PyPI publishing is coming soon. For now, install directly from GitHub.

Contributing

See CONTRIBUTING.md for guidelines. Run the test suite:

pip install -e ".[dev]"
python -m pytest --timeout=60
python -m ruff check src tests
python -m ruff format --check src tests

License

Apache 2.0 — see LICENSE.

_{Built by Unfork Open Source}