pitlane
Race a baseline vs a skill or MCP on real tasks. Hard checks show if it got better, faster, or cheaper. Numbers, not vibes.
pitlane 🏁
A feedback loop for people building AI skills and MCP servers.
You're building a skill, an MCP server, or a custom prompt strategy that's supposed to make an AI coding assistant better at a specific job. But how do you know it actually works? How do you know your latest commit made things better and not worse?
Pitlane gives you the answer. Define the tasks your skill should help with, set up a baseline (assistant without your skill) and a challenger (assistant with your skill), and race them. The results tell you with numbers, not vibes, whether your work is paying off.
The idea
In motorsport, the pit lane is where engineers tune the car between laps. Swap a part, adjust the setup, check the telemetry, see if the next lap is faster.
Building skills and MCP servers works the same way:
- Tune: change your skill, update your MCP server, tweak your prompts
- Race: run the assistant with and without your changes against real coding tasks
- Check the telemetry: did pass rates go up? Did quality improve? Did it get faster or cheaper?
- Repeat: go back to the pit, make another adjustment, race again
Pitlane is the telemetry system. You build the skill, pitlane tells you if it's working.
Demo
https://github.com/user-attachments/assets/c1aa78d8-9e26-43e0-9945-027fd0da6fe5
Key features
- YAML-based benchmark definitions (easy to version and diff)
- Deterministic assertions (file checks, command execution, custom scripts)
- Similarity metrics (ROUGE, BLEU, BERTScore, cosine similarity)
- Metrics tracking (time, tokens, cost, file changes)
- JUnit XML output (junit.xml) for native CI test reporting
- Interactive HTML reports with side-by-side agent comparison
- Parallel execution and repeated runs with statistics
- Graceful interrupt handling (Ctrl+C generates partial reports)
- TDD workflow support (red-green-refactor)
Table of Contents
- Quick Start
- Supported Assistants
- Usage
- Writing Benchmarks
- TDD Workflow
- Editor Integration
- Security Considerations
- Contributing
- License
Quick start
Installation
You'll need uv, a fast Python package installer.
Install on macOS/Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh
Install on Windows:
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
Note: Windows should work but has not been extensively tested. If you encounter issues or can help validate Windows support, contributions are welcome!
Install pitlane:
uv tool install pitlane --from git+https://github.com/pitlane-ai/pitlane.git
Or run without installing:
uvx --from git+https://github.com/pitlane-ai/pitlane.git pitlane run examples/simple-codegen-eval.yaml
Prerequisites for running evaluations
Before running your first evaluation, make sure an AI coding assistant is installed and authenticated on your machine. Choose one from the supported assistants and install its CLI tool by following its official documentation. Most assistants prompt you to log in on first use.
Running your first example
The example below uses OpenCode because it's free and requires no API key, but you can use any supported assistant by editing the YAML file.
Initialize a project with example benchmarks:
pitlane init --with-examples
Run the evaluation:
pitlane run examples/simple-codegen-eval.yaml
Results appear in runs/ with an HTML report showing pass rates and metrics.
Want to use a different assistant? Edit examples/simple-codegen-eval.yaml and uncomment your preferred assistant configuration. See Supported Assistants for options.
Need help designing benchmarks? Install the pitlane skill for AI-guided assistance:
npx skills add pitlane-ai/pitlane
Your AI assistant can then help you create effective eval benchmarks. See Writing Benchmarks for details.
Supported assistants
| Assistant | Type | Status |
|---|---|---|
| Claude Code | claude-code | ✅ Tested |
| Mistral Vibe | mistral-vibe | ✅ Tested |
| OpenCode | opencode | ✅ Tested |
| Bob | bob | ✅ Tested |
For the latest status and additional assistants in development, see PR #32.
Want to add support for another assistant? Contributions are welcome! See the Contributing Guide for instructions on implementing new assistants.
Usage
Basic Evaluation
Run all tasks against all configured assistants:
pitlane run examples/simple-codegen-eval.yaml
Filtering
Run specific tasks or assistants:
Run a single task:
pitlane run examples/simple-codegen-eval.yaml --task hello-world-python
Run specific assistants (comma-separated):
pitlane run examples/simple-codegen-eval.yaml --only-assistants claude-baseline
Skip assistants:
pitlane run examples/simple-codegen-eval.yaml --skip-assistants claude-baseline
Combine filters:
pitlane run examples/simple-codegen-eval.yaml --task hello-world-python --only-assistants claude-baseline
Parallel execution
Speed up multi-task benchmarks:
pitlane run examples/simple-codegen-eval.yaml --parallel 4
Repeated runs
Run tasks multiple times to measure consistency and get aggregated statistics:
pitlane run examples/simple-codegen-eval.yaml --repeat 5
This runs each task 5 times and reports avg/min/max/stddev for all metrics in the HTML report.
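To make the aggregated statistics concrete, here is a minimal sketch of how avg/min/max/stddev could be computed for one metric across repeated runs. The summarize function and the sample durations are hypothetical, and the use of sample (rather than population) standard deviation is an assumption, not something pitlane documents.

```python
import statistics

def summarize(values):
    """Return avg/min/max/stddev for one metric across repeated runs."""
    return {
        "avg": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        # Sample standard deviation (assumption); it needs at least two runs.
        "stddev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

# Hypothetical wall-clock seconds from five repeats of one task.
durations = [10.0, 12.0, 11.0, 13.0, 14.0]
summary = summarize(durations)
print(summary)
```

A large stddev relative to avg suggests the assistant behaves inconsistently on that task, which is a reason to raise --repeat before comparing a baseline against a challenger.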
Debug output
Every run creates debug.log with detailed execution information. To stream output to the terminal in real time:
pitlane run examples/simple-codegen-eval.yaml --verbose
All assertions include detailed logging to help diagnose failures.
Interrupt handling
Press Ctrl+C to stop a run. You'll get a partial HTML report with results from completed tasks.
Open report in browser
By default, report.html opens in your browser after each run. To disable this:
pitlane run examples/simple-codegen-eval.yaml --no-open
The same flag works when regenerating a report:
pitlane report runs/2024-01-01_12-00-00 --no-open
Other commands
Initialize new benchmark project:
pitlane init
Initialize with example benchmarks:
pitlane init --with-examples
Generate JSON Schema for YAML validation:
pitlane schema generate
Install VS Code YAML validation (safe, with preview):
pitlane schema install
Regenerate HTML report from existing junit.xml:
pitlane report runs/2024-01-01_12-00-00
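Because junit.xml is plain XML, you can also post-process it yourself. The sketch below computes a pass rate per test suite with Python's standard library. The sample document, including its suite and test case names, is hypothetical. The exact layout pitlane emits is assumed to follow the common testsuite/testcase/failure convention.

```python
import xml.etree.ElementTree as ET

# Hypothetical junit.xml content; real suite and case names will differ.
SAMPLE = """<testsuites>
  <testsuite name="claude-baseline" tests="2" failures="1">
    <testcase classname="hello-world-python" name="file_exists" time="0.01"/>
    <testcase classname="hello-world-python" name="command_succeeds" time="0.20">
      <failure message="exit code 1"/>
    </testcase>
  </testsuite>
</testsuites>"""

def pass_rates(xml_text):
    """Map each testsuite name to the fraction of testcases without a failure or error."""
    root = ET.fromstring(xml_text)
    rates = {}
    for suite in root.iter("testsuite"):
        cases = suite.findall("testcase")
        passed = sum(
            1 for case in cases
            if case.find("failure") is None and case.find("error") is None
        )
        rates[suite.get("name")] = passed / len(cases) if cases else 0.0
    return rates

rates = pass_rates(SAMPLE)
print(rates)
```

To inspect a real run, read the junit.xml from the run directory and pass its text to pass_rates.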
Writing benchmarks
Benchmarks are YAML files with two sections: assistants and tasks.
Need help designing effective benchmarks? Install the pitlane skill for AI-guided assistance:
npx skills add pitlane-ai/pitlane
Your AI assistant can help you design eval benchmarks that actually measure whether your skills or MCP servers improve performance.
Minimal example
assistants:
  claude-baseline:
    type: claude-code
    args:
      model: haiku

tasks:
  - name: hello-world-python
    prompt: "Create a Python script called hello.py that prints 'Hello, World!'"
    workdir: ./fixtures/empty
    timeout: 120
    assertions:
      - file_exists: "hello.py"
      - command_succeeds: "python3 hello.py"
      - file_contains: { path: "hello.py", pattern: "Hello, World!" }
Assistants
Each assistant defines how to run a model:
assistants:
  # Baseline configuration
  claude-baseline:
    type: claude-code
    args:
      model: haiku

  # With skills/MCP
  claude-with-skill:
    type: claude-code
    args:
      model: haiku
    skills:
      # Remote: GitHub reference (installed via npx skills add)
      - source: org/repo
        skill: my-skill-name
      # Local: directory path (for development/testing)
      - source: ./path/to/my-skill
Tasks
Each task specifies:
- name: Unique identifier
- prompt: Instructions for the assistant
- workdir: Fixture directory (copied for each run)
- timeout: Maximum seconds
- assertions: Checks to verify success
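Copying the fixture for each run means one run's output never leaks into the next, and the fixture itself stays untouched. The sketch below illustrates that isolation. The fresh_workdir helper and the temporary directory layout are hypothetical, and where pitlane actually places its copies is not specified here.

```python
import shutil
import tempfile
from pathlib import Path

def fresh_workdir(fixture_dir):
    """Copy a fixture directory into a new temporary directory and return the copy's path."""
    target = Path(tempfile.mkdtemp(prefix="pitlane-run-")) / Path(fixture_dir).name
    shutil.copytree(fixture_dir, target)
    return target

# Throwaway fixture so the sketch runs anywhere.
fixture = Path(tempfile.mkdtemp()) / "empty"
fixture.mkdir()
(fixture / "README.txt").write_text("seed file\n")

run_a = fresh_workdir(fixture)
run_b = fresh_workdir(fixture)

# Simulate an assistant writing a file during run A only.
(run_a / "hello.py").write_text("print('Hello, World!')\n")
```

After this, hello.py exists only in run A's copy, so assertions for run B start from the same clean state as the fixture.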
Assertions
Deterministic (preferred)
assertions:
  # File exists
  - file_exists: "main.py"

  # Command succeeds (exit code 0)
  - command_succeeds: "python main.py"

  # Command fails (non-zero exit)
  - command_fails: "python main.py --invalid"

  # File contains pattern (regex)
  - file_contains:
      path: "main.py"
      pattern: "def main\\(\\):"

  # Custom script validation
  - custom_script:
      script: "./validate.sh"
      interpreter: "bash"
      timeout: 30
      expected_exit_code: 0
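As a rough mental model, the checks above amount to a few small functions evaluated inside the task's workdir. This is an illustrative sketch, not pitlane's implementation: the function names mirror the YAML keys, but their signatures and the default timeout are assumptions.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path

def file_exists(workdir, path):
    """True when the path is a regular file inside workdir."""
    return (Path(workdir) / path).is_file()

def command_succeeds(workdir, command, timeout=60):
    """True when the shell command exits with code 0 inside workdir."""
    result = subprocess.run(command, shell=True, cwd=workdir,
                            capture_output=True, timeout=timeout)
    return result.returncode == 0

def file_contains(workdir, path, pattern):
    """True when the regex pattern matches anywhere in the file."""
    text = (Path(workdir) / path).read_text()
    return re.search(pattern, text) is not None

# Throwaway workdir with a file an assistant might have produced.
workdir = Path(tempfile.mkdtemp())
(workdir / "main.py").write_text("def main():\n    print('ok')\n\nmain()\n")

results = [
    file_exists(workdir, "main.py"),
    command_succeeds(workdir, f'"{sys.executable}" main.py'),
    file_contains(workdir, "main.py", r"def main\(\):"),
]
print(results)
```

Each check returns a plain boolean with no judgment involved, which is what makes these assertions deterministic and repeatable across runs.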
Custom script assertions
When you need more complex validation logic than simple commands provide, use custom_script to run a dedicated test script. This is useful for multi-step validation, complex parsing, or reusable test logic.
Simple form (expects exit code 0):
- custom_script: "scripts/validate_output.sh"
- custom_script: "python scripts/validate.py"
- custom_script: "node scripts/check.js"
Advanced form with options:
- custom_script:
    script: "python scripts/validate_output.py"
    args: ["--strict", "--format=json"]
    timeout: 30
    expected_exit_code: 0
Options:
- script: Shell command to execute (e.g., python script.py, node script.js, ./script.sh)
- args: List of arguments to pass to the script (optional)
- timeout: Maximum seconds to wait for completion (default: 60)
- expected_exit_code: Exit code that indicates success (default: 0)
The script field is executed as a shell command in the workdir, so you can use any interpreter:
- Python: python validate.py or python3 validate.py
- Node.js: node check.js
- Executable scripts: ./validate.sh (must have shebang and be executable)
- Any command: Works like command_succeeds but with more control over timeout and exit codes
Your script receives the workdir as its working directory, so it can access generated files directly. The assertion passes if the script exits with the expected code.
Example validation script (scripts/validate_tf.sh):
#!/bin/bash
# Check if Terraform config is valid and contains required resources
terraform validate || exit 1
grep -q "aws_s3_bucket" main.tf || exit 2
exit 0
Use it in your eval:
- custom_script: "scripts/validate_tf.sh"
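The same exit-code contract works for a Python validation script. The sketch below is a hypothetical scripts/validate.py for the hello-world task. The validate function, its specific exit codes, and the demo directory are illustrative choices, not part of pitlane.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def validate(workdir):
    """Return 0 when hello.py exists and prints the expected greeting, else a non-zero code."""
    script = Path(workdir) / "hello.py"
    if not script.is_file():
        return 1
    result = subprocess.run([sys.executable, str(script)], cwd=workdir,
                            capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        return 2
    if "Hello, World!" not in result.stdout:
        return 3
    return 0

# Demo against a throwaway directory standing in for the task workdir.
demo = Path(tempfile.mkdtemp())
missing_code = validate(demo)  # hello.py does not exist yet
(demo / "hello.py").write_text("print('Hello, World!')\n")
ok_code = validate(demo)
print(missing_code, ok_code)

# As a standalone scripts/validate.py, the file would instead end with:
#     sys.exit(validate(Path.cwd()))
```

Distinct non-zero codes make failures easier to tell apart in debug.log, while the assertion itself only compares the exit code against expected_exit_code.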
Similarity assertions
Similarity metrics
When exact matching isn't practical, use similarity metrics:
assertions:
  # ROUGE: topic coverage (good for docs)
  - rouge:
      actual: "README.md"
      expected: "./refs/golden.md"
      metric: "rougeL"
      min_score: 0.35

  # BLEU: phrase matching (good for docs, not code)
  - bleu:
      actual: "README.md"
      expected: "./refs/golden.md"
      min_score: 0.2

  # BERTScore: semantic similarity (good for docs/code)
  - bertscore:
      actual: "README.md"
      expected: "./refs/golden.md"
      min_score: 0.75

  # Cosine similarity: overall meaning (good for code/configs)
  - cosine_similarity:
      actual: "variables.tf"
      expected: "./refs/expected-vars.tf"
      min_score: 0.7
Choosing metrics:
| Metric | Question | Speed | Best For |
|---|---|---|---|
| rouge | Same topics? | Fast | Documentation coverage |
| bleu | Same phrases? | Fast | Documentation phrasing |
| bertscore | Same meaning? | Slow | Semantic preservation |
| cosine_similarity | Same subject? | Slow | Code/config similarity |
Use deterministic assertions first. Add similarity metrics when you need fuzzy matching.
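To build intuition for what a min_score threshold means, here is a simplified ROUGE-L F1 computed from the longest common subsequence of whitespace tokens. It is a sketch for illustration only: real ROUGE libraries apply their own tokenization and sometimes stemming, so the scores pitlane reports may differ. The function names and sample strings are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(actual, expected):
    """Simplified ROUGE-L F1 on lowercase whitespace tokens."""
    cand, ref = actual.lower().split(), expected.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

expected = "install the tool then run the example benchmark"
actual = "install the tool and run the benchmark"
score = rouge_l_f1(actual, expected)
passes = score >= 0.35  # plays the role of min_score in the YAML above
print(round(score, 3), passes)
```

Because the score rewards shared word order rather than exact equality, thresholds are best calibrated by running the metric on a few known-good and known-bad outputs first.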
Weighted grading
Make some assertions count more:
assertions:
  - file_exists: "main.tf"
  - command_succeeds: "terraform validate"
    weight: 3.0 # 3x more important
  - rouge:
      actual: "README.md"
      expected: "./refs/golden.md"
      metric: "rougeL"
      min_score: 0.3
    weight: 2.0
Results include both assertion_pass_rate (binary) and weighted_score (continuous).
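The difference between the two numbers is easiest to see with a small calculation. The formula below is an assumption for illustration (a weight-normalized average of per-assertion scores, with a default weight of 1.0), not pitlane's documented scoring. The grade function and the sample outcomes are hypothetical.

```python
def grade(results):
    """results: list of (passed, score, weight) tuples, with score in [0, 1]."""
    total_weight = sum(weight for _, _, weight in results)
    # Binary view: what fraction of assertions passed, ignoring weights.
    pass_rate = sum(1 for passed, _, _ in results if passed) / len(results)
    # Continuous view (assumed formula): weight-normalized average score.
    weighted = sum(score * weight for _, score, weight in results) / total_weight
    return pass_rate, weighted

# Hypothetical outcomes for the three assertions above.
results = [
    (True, 1.0, 1.0),   # file_exists passed; default weight assumed to be 1.0
    (False, 0.0, 3.0),  # terraform validate failed
    (True, 0.5, 2.0),   # rouge scored 0.5, above min_score 0.3
]
pass_rate, weighted_score = grade(results)
print(round(pass_rate, 3), round(weighted_score, 3))
```

In this example two of three assertions pass, yet the weighted score is much lower because the failing check carries the most weight.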
The examples/ directory contains working benchmarks you can use as starting points:
- simple-codegen-eval.yaml: Minimal example with deterministic assertions
- similarity-codegen-eval.yaml: Demonstrates all similarity metrics
- terraform-module-eval.yaml: Real-world Terraform evaluation
- weighted-grading-eval.yaml: Weighted assertions and continuous scoring
TDD workflow
Treat benchmarks like tests:
- Red: Add or tighten assertions
- Green: Update skills/MCP until assertions pass
- Refactor: Clean up without breaking tests
This lets you iterate on what "good" means without guessing.
Editor integration
VS Code / Cursor / Bob
Enable YAML validation:
pitlane schema install
This adds JSON Schema validation to .vscode/settings.json with preview and backup.
Manual setup:
{
  "yaml.schemas": {
    "./schemas/pitlane.schema.json": [
      "eval.yaml",
      "examples/*.yaml",
      "**/*eval*.y*ml"
    ]
  },
  "yaml.validate": true
}
Other editors
Generate schema and docs:
pitlane schema generate
This outputs:
- schemas/pitlane.schema.json
- docs/schema.md
Security considerations
Execution is not sandboxed. Pitlane runs assistants directly on your system using their native CLIs. While this provides full functionality and realistic testing conditions, it means assistants have the same file system and network access as any other process you run.
Recommended precautions:
- Run evaluations in a Docker container or virtual machine when testing untrusted code or prompts
- Review benchmark tasks and assertions before running them
- Use dedicated test environments rather than production systems
- Be cautious with benchmarks that involve sensitive data or credentials
The native CLI approach is intentional: it ensures pitlane tests assistants in real-world conditions. But as with any development tool that executes code, reasonable precautions are advisable.
Contributing
See CONTRIBUTING.md for development setup, testing guidelines, and how to submit changes.
License
Apache 2.0