decibench
Health Uyari
- License — License: NOASSERTION
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Low visibility — Only 8 GitHub stars
Code Uyari
- fs module — File system access in bridge_sidecar/src/adapters/retell.ts
- network request — Outbound network request in bridge_sidecar/src/adapters/retell.ts
- fs module — File system access in bridge_sidecar/src/adapters/vapi.ts
Permissions Gecti
- Permissions — No dangerous permissions requested
Bu listing icin henuz AI raporu yok.
The open testing standard for voice AI agents. Deterministic + semantic + RAG augmented evaluation. Local first. Zero telemetry.
decibench
pip install git+https://github.com/unforkopensource-org/decibench.git
The open testing standard for voice AI agents.
Deterministic + semantic + RAG-augmented evaluation.
Local-first. Zero telemetry. One CLI.
The Problem
You built a voice AI agent. It works in your demo. Then in production:
- It hallucinates a refund policy that doesn't exist
- It takes 4 seconds to respond and the caller hangs up
- It crumbles when someone interrupts mid-sentence
- It leaks a customer's SSN in the transcript log
You find out from your customers, not your test suite — because you don't have one.
Decibench fixes this. It's pytest for voice agents.
How It Works
┌─────────────────────────────────────────────────────────┐
│ decibench run │
│ │
│ ┌──────────┐ TTS ┌──────────────┐ Evaluate │
│ │ Scenario │ ────────▶ │ Your Agent │ ───────────▶ │
│ │ (YAML) │ Audio │ (any target)│ 10 metrics │
│ └──────────┘ ◀──────── └──────────────┘ │
│ STT │
│ │
│ Score: 87/100 │ Latency p95: 1.2s │ WER: 3.1% │
│ ✓ compliance │ ✓ hallucination │ ✗ interruption │
└─────────────────────────────────────────────────────────┘
- You write scenarios in YAML (or auto-generate them from your docs)
- Decibench synthesizes caller audio, calls your agent, transcribes the response
- 10 evaluators score every call across latency, accuracy, compliance, and more
- Results go to a local SQLite DB → Rich CLI report, HTML, JUnit, or a full Vue dashboard
Quick Start
# Install from source
pip install git+https://github.com/unforkopensource-org/decibench.git
# Or clone and install locally
git clone https://github.com/unforkopensource-org/decibench.git
cd decibench
pip install -e .
# Verify the install
decibench doctor
# Run the built-in demo (zero config, zero API keys)
decibench run target=demo suite=quick
# Open the dashboard
decibench serve
Test your own agent
# WebSocket endpoint (generic — works with any agent)
decibench run target=ws://localhost:8080/ws suite=quick
# Native Vapi / Retell / ElevenLabs
decibench run target=vapi://agent_abc123 suite=standard
# Twilio Media Streams mock (no call credits needed)
decibench run target=twilio://localhost:5050/media suite=realestate
# Spawn a local process and pipe PCM through stdin/stdout
decibench run target='exec:"python my_agent.py"' suite=quick
Three Testing Modes
| Mode | What it does | Cost | Speed |
|---|---|---|---|
deterministic |
Exact string matching, regex, keyword checks | Free | ~ms |
semantic |
LLM-as-Judge scores accuracy, compliance, hallucination | ~$0.01/call | ~2s |
semantic+rag |
Upload your docs → auto-generate adversarial test suites | ~$0.03/call | ~5s |
# Free deterministic checks only
decibench run target=demo suite=quick --mode deterministic
# Full semantic evaluation with GPT-4o / Claude / Gemini / Ollama
decibench run target=ws://... suite=standard --mode semantic
# Generate tests from your own knowledge base
decibench rag ingest ./docs/training-manual.pdf
decibench rag synthesize --suite-name my-tests --count 20
decibench run target=ws://... suite=my-tests --mode semantic+rag
10 Built-In Evaluators
Every call is scored across all applicable metrics automatically:
| Evaluator | Metric | What it catches |
|---|---|---|
| Latency | p50 p90 p95 ttfb |
Slow responses that cause hangups |
| WER / CER | Word/character error rate | Garbled or inaccurate speech |
| Hallucination | LLM-graded factual accuracy | Agent invents information |
| Task Completion | Did the agent achieve the goal? | Broken conversation flows |
| Compliance | Mandatory disclosures, disclaimers | Regulatory violations |
| Interruption | Barge-in handling | Agent crashes on user interrupts |
| Silence | Dead air detection | Agent goes silent mid-call |
| MOS | Mean Opinion Score (DNSMOS) | Audio quality degradation |
| STOI | Short-Time Objective Intelligibility | Unintelligible speech |
| Composite Score | Weighted aggregate of all metrics | Single pass/fail number |
Connectors
Decibench talks to your agent, not the other way around. No SDK to install in your agent code.
| Connector | Target URI | Status |
|---|---|---|
| Demo | demo:// |
✅ Shipped |
| WebSocket | ws://host:port/path |
✅ Shipped |
| HTTP | http://host/endpoint |
✅ Shipped |
| Process | exec:"command" |
✅ Shipped |
| ElevenLabs | elevenlabs://agent_id |
✅ Shipped |
| Twilio Mock | twilio://host/path |
✅ Shipped |
| Retell | retell://agent_id |
🧪 Experimental |
| Vapi | vapi://agent_id |
🧪 Experimental |
| LiveKit | — | 📋 Planned |
| Bland | — | 📋 Planned |
LLM Judge Providers
Semantic evaluation works with any OpenAI-compatible API:
# decibench.toml
[judge]
provider = "openai" # or "anthropic", "gemini", "ollama"
model = "gpt-4o" # or "claude-sonnet-4-20250514", "gemini-2.5-flash", "llama3"
# Self-hosted? Point at any OpenAI-compatible endpoint
[judge]
provider = "openai"
model = "mistral-7b"
base_url = "http://localhost:11434/v1" # Ollama, vLLM, LM Studio, etc.
MCP Server
Decibench ships a Model Context Protocol server so AI coding agents (Cursor, Windsurf, Claude Code) can run and analyze your voice tests directly:
pip install decibench[mcp]
decibench-mcp
Tools exposed: run_test, list_runs, analyze_failures, generate_scenario, manage_suites, and more.
CLI Reference
decibench run Run a test suite against a target
decibench compare Side-by-side comparison of two targets
decibench serve Launch the Vue dashboard + REST API
decibench import Import production call logs (Vapi, Retell, JSONL)
decibench evaluate-calls Score imported calls against evaluators
decibench replay Re-evaluate a previous run with different settings
decibench rag ingest Ingest documents into the RAG corpus
decibench rag synthesize Auto-generate test scenarios from your docs
decibench scenario List / inspect / generate scenarios
decibench runs List previous test runs
decibench scoring View scoring weights and policies
decibench doctor Verify installation and dependencies
decibench auth Manage API key storage (keyring-backed)
decibench bridge Launch the headless browser sidecar
Architecture
decibench/
├── cli/ # Click CLI — thin wrappers, no business logic
├── connectors/ # Protocol adapters (WS, HTTP, Twilio, ElevenLabs, …)
├── evaluators/ # 10 metric evaluators (latency, WER, hallucination, …)
├── providers/ # Pluggable TTS, STT, and LLM Judge backends
├── reporters/ # Output: Rich terminal, HTML, JSON, JUnit, Markdown
├── rag/ # Document ingestion, embedding, retrieval, synthesis
├── mcp/ # Model Context Protocol server (stdio + SSE)
├── store/ # SQLite with migrations, privacy redaction engine
├── bridge/ # Protocol for headless browser sidecar (WebRTC targets)
├── scenarios/ # Built-in test suites (quick, standard, acoustic, adversarial, realestate)
└── api/ # FastAPI REST server + embedded Vue dashboard
Privacy & Security
Decibench is built for teams that handle sensitive call data:
- Zero telemetry — no data leaves your machine, ever
- PII redaction engine — phone numbers, SSNs, emails, and credit cards are scrubbed from transcripts before they hit the local SQLite database
- API keys in keyring — secrets are stored in your OS keychain, not in config files
- Local-only storage — SQLite database stays on your machine unless you explicitly export
Configuration
# decibench.toml
[target]
uri = "ws://localhost:8080/ws"
[tts]
provider = "edge" # Free Microsoft Edge TTS (default)
[stt]
provider = "faster_whisper"
model = "base"
[judge]
provider = "openai"
model = "gpt-4o"
[scoring]
latency_weight = 0.25
accuracy_weight = 0.25
compliance_weight = 0.20
task_completion_weight = 0.20
audio_quality_weight = 0.10
Installation
# Install from GitHub (recommended)
pip install git+https://github.com/unforkopensource-org/decibench.git
# With semantic evaluation + MCP server
pip install "decibench[mcp] @ git+https://github.com/unforkopensource-org/decibench.git"
# With RAG-augmented testing
pip install "decibench[rag] @ git+https://github.com/unforkopensource-org/decibench.git"
# Everything
pip install "decibench[all] @ git+https://github.com/unforkopensource-org/decibench.git"
# Or clone for local development
git clone https://github.com/unforkopensource-org/decibench.git
cd decibench
pip install -e ".[dev]"
Requirements: Python 3.11+ · macOS / Linux / WSL
Note: PyPI publishing is coming soon. For now, install directly from GitHub.
Contributing
See CONTRIBUTING.md for guidelines. Run the test suite:
pip install -e ".[dev]"
python -m pytest --timeout=60
python -m ruff check src tests
python -m ruff format --check src tests
License
Apache 2.0 — see LICENSE.
Built by Unfork Open Source
Yorumlar (0)
Yorum birakmak icin giris yap.
Yorum birakSonuc bulunamadi