pitlane 🏁

A feedback loop for people building AI skills and MCP servers.

You're building a skill, an MCP server, or a custom prompt strategy that's supposed to make an AI coding assistant better at a specific job. But how do you know it actually works? How do you know your latest commit made things better and not worse?

Pitlane gives you the answer. Define the tasks your skill should help with, set up a baseline (assistant without your skill) and a challenger (assistant with your skill), and race them. The results tell you with numbers, not vibes, whether your work is paying off.

The idea

In motorsport, the pit lane is where engineers tune the car between laps. Swap a part, adjust the setup, check the telemetry, see if the next lap is faster.

Building skills and MCP servers works the same way:

Tune: change your skill, update your MCP server, tweak your prompts
Race: run the assistant with and without your changes against real coding tasks
Check the telemetry: did pass rates go up? Did quality improve? Did it get faster or cheaper?
Repeat: go back to the pit, make another adjustment, race again

Pitlane is the telemetry system. You build the skill, pitlane tells you if it's working.

Demo

https://github.com/user-attachments/assets/c1aa78d8-9e26-43e0-9945-027fd0da6fe5

Key features

YAML-based benchmark definitions (easy to version and diff)
Deterministic assertions (file checks, command execution, custom scripts)
Similarity metrics (ROUGE, BLEU, BERTScore, cosine similarity)
Metrics tracking (time, tokens, cost, file changes)
JUnit XML output (junit.xml) for native CI test reporting
Interactive HTML reports with side-by-side agent comparison
Parallel execution and repeated runs with statistics
Graceful interrupt handling (Ctrl+C generates partial reports)
TDD workflow support (red-green-refactor)

Quick Start
Supported Assistants
Usage
Writing Benchmarks
TDD Workflow
Editor Integration
Security Considerations
Contributing
License

Quick start

Installation

You'll need uv, a fast Python package installer.

Install on macOS/Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

Install on Windows:

powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

Note: Windows should work but has not been extensively tested. If you encounter issues or can help validate Windows support, contributions are welcome!

Install pitlane:

uv tool install pitlane --from git+https://github.com/pitlane-ai/pitlane.git

Or run without installing:

uvx --from git+https://github.com/pitlane-ai/pitlane.git pitlane run examples/simple-codegen-eval.yaml

Prerequisites for running evaluations

Before running your first evaluation, make sure an AI coding assistant is installed and authenticated on your machine. Choose one of the supported assistants and install its CLI tool following the official documentation. Most assistants will prompt you to log in on first use.

Running your first example

The example below uses OpenCode because it's free and requires no API key, but you can use any supported assistant by editing the YAML file.

Initialize a project with example benchmarks:

pitlane init --with-examples

Run the evaluation:

pitlane run examples/simple-codegen-eval.yaml

Results appear in runs/ with an HTML report showing pass rates and metrics.

Want to use a different assistant? Edit examples/simple-codegen-eval.yaml and uncomment your preferred assistant configuration. See Supported Assistants for options.

Need help designing benchmarks? Install the pitlane skill for AI-guided assistance:

npx skills add pitlane-ai/pitlane

Your AI assistant can then help you create effective eval benchmarks. See Writing Benchmarks for details.

Supported assistants

Assistant	Type	Status
Claude Code	`claude-code`	✅ Tested
Mistral Vibe	`mistral-vibe`	✅ Tested
OpenCode	`opencode`	✅ Tested
Bob	`bob`	✅ Tested

For the latest status and additional assistants in development, see PR #32.

Want to add support for another assistant? Contributions are welcome! See the Contributing Guide for instructions on implementing new assistants.

Usage

Basic Evaluation

Run all tasks against all configured assistants:

pitlane run examples/simple-codegen-eval.yaml

Filtering

Run specific tasks or assistants:

Run a single task:

pitlane run examples/simple-codegen-eval.yaml --task hello-world-python

Run specific assistants (comma-separated):

pitlane run examples/simple-codegen-eval.yaml --only-assistants claude-baseline

Skip assistants:

pitlane run examples/simple-codegen-eval.yaml --skip-assistants claude-baseline

Combine filters:

pitlane run examples/simple-codegen-eval.yaml --task hello-world-python --only-assistants claude-baseline

Parallel execution

Speed up multi-task benchmarks:

pitlane run examples/simple-codegen-eval.yaml --parallel 4

Repeated runs

Run tasks multiple times to measure consistency and get aggregated statistics:

pitlane run examples/simple-codegen-eval.yaml --repeat 5

This runs each task 5 times and reports avg/min/max/stddev for all metrics in the HTML report.
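
As a rough illustration of what those aggregates mean, the sketch below summarizes one metric across repeated runs using Python's standard library. The function name `summarize`, the sample durations, and the choice of sample (rather than population) standard deviation are assumptions for illustration; the source does not say which variant pitlane reports.

```python
import statistics

def summarize(values):
    """Return avg/min/max/stddev for one metric across repeated runs."""
    if not values:
        raise ValueError("need at least one run")
    return {
        "avg": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        # Sample standard deviation (assumption); undefined for a single run, so use 0.0.
        "stddev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

# Hypothetical wall-clock times (seconds) from five repeats of one task.
durations = [12.0, 14.0, 13.0, 15.0, 11.0]
print(summarize(durations))
```

A large stddev relative to the average suggests that a single run is not a reliable basis for comparing baseline and challenger.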

Debug output

Every run creates debug.log with detailed execution information. To stream that output to the terminal in real time as well:

pitlane run examples/simple-codegen-eval.yaml --verbose

All assertions include detailed logging to help diagnose failures.

Interrupt handling

Press Ctrl+C to stop a run. You'll get a partial HTML report with results from completed tasks.

Open report in browser

By default, report.html opens in your browser after each run. To disable this:

pitlane run examples/simple-codegen-eval.yaml --no-open

The same flag works when regenerating a report:

pitlane report runs/2024-01-01_12-00-00 --no-open

Other commands

Initialize new benchmark project:

pitlane init

Initialize with example benchmarks:

pitlane init --with-examples

Generate JSON Schema for YAML validation:

pitlane schema generate

Install VS Code YAML validation (safe, with preview):

pitlane schema install

Regenerate HTML report from existing junit.xml:

pitlane report runs/2024-01-01_12-00-00
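
Because junit.xml is a plain XML file, it can also be inspected from your own scripts, for example to gate a CI job on a pass rate. The sketch below assumes the conventional JUnit layout (`testcase` elements with optional `failure`, `error`, or `skipped` children). The sample document and the test case names in it are made up; the exact structure pitlane writes is not described here.

```python
import xml.etree.ElementTree as ET

def junit_pass_rate(xml_text):
    """Return (passed, total) counted from <testcase> elements."""
    root = ET.fromstring(xml_text)
    passed = total = 0
    # iter() walks the whole tree, so a <testsuites> or a bare <testsuite> root both work.
    for case in root.iter("testcase"):
        total += 1
        # A testcase counts as passed when it has no failure, error, or skipped child.
        if not any(case.find(tag) is not None for tag in ("failure", "error", "skipped")):
            passed += 1
    return passed, total

# Hypothetical sample; real files would be read from a directory under runs/.
SAMPLE = """<testsuites>
  <testsuite name="claude-baseline">
    <testcase name="file_exists: hello.py"/>
    <testcase name="command_succeeds: python3 hello.py">
      <failure message="exit code 1"/>
    </testcase>
  </testsuite>
</testsuites>"""

passed, total = junit_pass_rate(SAMPLE)
print(f"{passed}/{total} passed")
```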

Writing benchmarks

Benchmarks are YAML files with two sections: assistants and tasks.

Need help designing effective benchmarks? Install the pitlane skill for AI-guided assistance:

npx skills add pitlane-ai/pitlane

Your AI assistant can help you design eval benchmarks that actually measure whether your skills or MCP servers improve performance.

Minimal example

assistants:
  claude-baseline:
    type: claude-code
    args:
      model: haiku

tasks:
  - name: hello-world-python
    prompt: "Create a Python script called hello.py that prints 'Hello, World!'"
    workdir: ./fixtures/empty
    timeout: 120
    assertions:
      - file_exists: "hello.py"
      - command_succeeds: "python3 hello.py"
      - file_contains: { path: "hello.py", pattern: "Hello, World!" }

Assistants

Each assistant defines how to run a model:

assistants:
  # Baseline configuration
  claude-baseline:
    type: claude-code
    args:
      model: haiku

  # With skills/MCP
  claude-with-skill:
    type: claude-code
    args:
      model: haiku
    skills:
      # Remote: GitHub reference (installed via npx skills add)
      - source: org/repo
        skill: my-skill-name
      # Local: directory path (for development/testing)
      - source: ./path/to/my-skill

Tasks

Each task specifies:

name: Unique identifier
prompt: Instructions for the assistant
workdir: Fixture directory (copied for each run)
timeout: Maximum seconds
assertions: Checks to verify success
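
The note that workdir is copied for each run is what keeps runs isolated from one another. The sketch below shows the general idea with the standard library. The helper name `prepare_run_dir` and the temporary-directory layout are hypothetical, and pitlane's actual copy mechanism may differ.

```python
import shutil
import tempfile
from pathlib import Path

def prepare_run_dir(fixture_dir):
    """Copy a fixture directory into a fresh temporary directory and return the copy's path."""
    run_root = Path(tempfile.mkdtemp(prefix="pitlane-run-"))
    target = run_root / "workdir"
    shutil.copytree(fixture_dir, target)
    return target

# Demonstration with a throwaway fixture containing one file.
fixture = Path(tempfile.mkdtemp()) / "fixture"
fixture.mkdir()
(fixture / "README.md").write_text("starting point\n")

run_a = prepare_run_dir(fixture)
run_b = prepare_run_dir(fixture)
(run_a / "hello.py").write_text("print('Hello, World!')\n")

# Changes made in one run do not leak into the fixture or into another run.
print((fixture / "hello.py").exists(), (run_b / "hello.py").exists())
```

Each assistant starts from the same files, so differences in results come from the assistant and not from leftovers of an earlier run.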

Assertions

Deterministic (preferred)

assertions:
  # File exists
  - file_exists: "main.py"

  # Command succeeds (exit code 0)
  - command_succeeds: "python main.py"

  # Command fails (non-zero exit)
  - command_fails: "python main.py --invalid"

  # File contains pattern (regex)
  - file_contains:
      path: "main.py"
      pattern: "def main\\(\\):"

  # Custom script validation
  - custom_script:
      script: "./validate.sh"
      interpreter: "bash"
      timeout: 30
      expected_exit_code: 0
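
The doubled backslashes in the file_contains pattern are easy to get wrong. In a double-quoted YAML string, "def main\\(\\):" is unescaped to the regex def main\(\):, which matches literal parentheses. The sketch below shows why the escaping matters. It assumes the pattern is applied with search semantics (a match anywhere in the file), which the source does not state, and the helper `file_contains` is only an illustration.

```python
import re

def file_contains(text, pattern):
    """Return True if the regex pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

source = "def main():\n    print('ok')\n"

# Escaped parentheses match the literal characters "(" and ")".
print(file_contains(source, r"def main\(\):"))

# Unescaped, "()" is an empty regex group, so this looks for "def main:" instead.
print(file_contains(source, r"def main():"))
```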

Custom script assertions

When you need more complex validation logic than simple commands provide, use custom_script to run a dedicated test script. This is useful for multi-step validation, complex parsing, or reusable test logic.

Simple form (expects exit code 0):

- custom_script: "scripts/validate_output.sh"
- custom_script: "python scripts/validate.py"
- custom_script: "node scripts/check.js"

Advanced form with options:

- custom_script:
    script: "python scripts/validate_output.py"
    args: ["--strict", "--format=json"]
    timeout: 30
    expected_exit_code: 0

Options:

script — Shell command to execute (e.g., python script.py, node script.js, ./script.sh)
args — List of arguments to pass to the script (optional)
timeout — Maximum seconds to wait for completion (default: 60)
expected_exit_code — Exit code that indicates success (default: 0)

The script field is executed as a shell command in the workdir, so you can use any interpreter:

Python: python validate.py or python3 validate.py
Node.js: node check.js
Executable scripts: ./validate.sh (must have a shebang line and the executable bit set)
Any command: Works like command_succeeds but with more control over timeout and exit codes

Your script receives the workdir as its working directory, so it can access generated files directly. The assertion passes if the script exits with the expected code.

Example validation script (scripts/validate_tf.sh):

#!/bin/bash
# Check if Terraform config is valid and contains required resources
terraform validate || exit 1
grep -q "aws_s3_bucket" main.tf || exit 2
exit 0

Use it in your eval:

- custom_script: "scripts/validate_tf.sh"
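
The same pattern works in Python, which can be easier to maintain once the checks grow. The sketch below is a hypothetical scripts/validate.py for the hello-world task: it returns a distinct exit code per failed check, mirroring the shell script above. The file name, the specific checks, and the exit code meanings are assumptions for illustration. A real script would end with `raise SystemExit(validate("."))` so the assertion sees the exit code; here the function is exercised against a temporary directory instead.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def validate(workdir="."):
    """Return 0 on success, or a distinct non-zero code per failed check."""
    script = Path(workdir) / "hello.py"
    if not script.is_file():
        return 1  # file missing
    result = subprocess.run(
        [sys.executable, script.name],
        cwd=workdir, capture_output=True, text=True, timeout=20,
    )
    if result.returncode != 0:
        return 2  # script crashed
    if result.stdout.strip() != "Hello, World!":
        return 3  # wrong output
    return 0

# Demonstration: an empty directory fails, a correct hello.py passes.
with tempfile.TemporaryDirectory() as tmp:
    print(validate(tmp))
    (Path(tmp) / "hello.py").write_text("print('Hello, World!')\n")
    print(validate(tmp))
```

Distinct exit codes make it quicker to tell from debug.log which check failed.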

Similarity assertions

When exact matching isn't practical, use similarity metrics:

assertions:
  # ROUGE: topic coverage (good for docs)
  - rouge:
      actual: "README.md"
      expected: "./refs/golden.md"
      metric: "rougeL"
      min_score: 0.35

  # BLEU: phrase matching (good for docs, not code)
  - bleu:
      actual: "README.md"
      expected: "./refs/golden.md"
      min_score: 0.2

  # BERTScore: semantic similarity (good for docs/code)
  - bertscore:
      actual: "README.md"
      expected: "./refs/golden.md"
      min_score: 0.75

  # Cosine similarity: overall meaning (good for code/configs)
  - cosine_similarity:
      actual: "variables.tf"
      expected: "./refs/expected-vars.tf"
      min_score: 0.7

Choosing metrics:

Metric	Question	Speed	Best For
`rouge`	Same topics?	Fast	Documentation coverage
`bleu`	Same phrases?	Fast	Documentation phrasing
`bertscore`	Same meaning?	Slow	Semantic preservation
`cosine_similarity`	Same subject?	Slow	Code/config similarity

Use deterministic assertions first. Add similarity metrics when you need fuzzy matching.
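
To build intuition for what a min_score threshold on rougeL means, here is a minimal pure-Python sketch of ROUGE-L as an F1 score over the longest common subsequence of tokens. It is a simplification that assumes lowercase whitespace tokenization and no stemming. Pitlane presumably relies on an established implementation, so scores from this sketch may not match its output exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(actual, expected):
    """ROUGE-L F1 over lowercase whitespace tokens (a simplification)."""
    a, e = actual.lower().split(), expected.lower().split()
    if not a or not e:
        return 0.0
    lcs = lcs_length(a, e)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(e)
    return 2 * precision * recall / (precision + recall)

# Hypothetical texts: same content, different ordering.
expected = "install the tool then run the benchmark"
actual = "run the benchmark after you install the tool"
score = rouge_l_f1(actual, expected)
print(round(score, 3), score >= 0.35)
```

Because the subsequence must preserve word order, reordered but otherwise equivalent text scores well below 1.0, which is one reason thresholds such as 0.35 are set fairly low.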

Weighted grading

Make some assertions count more:

assertions:
  - file_exists: "main.tf"
  - command_succeeds: "terraform validate"
    weight: 3.0  # 3x more important
  - rouge:
      actual: "README.md"
      expected: "./refs/golden.md"
      metric: "rougeL"
      min_score: 0.3
    weight: 2.0

Results include both assertion_pass_rate (binary) and weighted_score (continuous).
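
One plausible way to read the two numbers is sketched below: the pass rate counts assertions that passed, while the weighted score is a weight-normalized average of per-assertion scores. The formula, the function name `grade`, and the sample values are assumptions for illustration. The source does not spell out pitlane's exact calculation.

```python
def grade(results):
    """results: list of (passed, score, weight) with score in [0.0, 1.0]."""
    total_weight = sum(weight for _, _, weight in results)
    if not results or total_weight == 0:
        return {"assertion_pass_rate": 0.0, "weighted_score": 0.0}
    passed_count = sum(1 for passed, _, _ in results if passed)
    weighted_sum = sum(score * weight for _, score, weight in results)
    return {
        "assertion_pass_rate": passed_count / len(results),
        "weighted_score": weighted_sum / total_weight,
    }

# Hypothetical outcome for the three assertions above.
results = [
    (True, 1.0, 1.0),    # file_exists passed, default weight assumed to be 1.0
    (False, 0.0, 3.0),   # terraform validate failed, weight 3.0
    (True, 0.42, 2.0),   # rouge score 0.42 cleared min_score 0.3, weight 2.0
]
print(grade(results))
```

Under these assumptions, two of three assertions pass, but the weighted score is much lower because the failed assertion carries the most weight.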

The examples/ directory contains working benchmarks you can use as starting points:

simple-codegen-eval.yaml — Minimal example with deterministic assertions
similarity-codegen-eval.yaml — Demonstrates all similarity metrics
terraform-module-eval.yaml — Real-world Terraform evaluation
weighted-grading-eval.yaml — Weighted assertions and continuous scoring

TDD workflow

Treat benchmarks like tests:

Red: Add or tighten assertions
Green: Update skills/MCP until assertions pass
Refactor: Clean up without breaking tests

This lets you iterate on what "good" means without guessing.

Editor integration

VS Code / Cursor / Bob

Enable YAML validation:

pitlane schema install

This adds JSON Schema validation to .vscode/settings.json. The command shows a preview of the change and keeps a backup.

Manual setup:

{
  "yaml.schemas": {
    "./schemas/pitlane.schema.json": [
      "eval.yaml",
      "examples/*.yaml",
      "**/*eval*.y*ml"
    ]
  },
  "yaml.validate": true
}

Other editors

Generate schema and docs:

pitlane schema generate

This outputs:

schemas/pitlane.schema.json
docs/schema.md

Security considerations

Execution is not sandboxed. Pitlane runs assistants directly on your system using their native CLIs. While this provides full functionality and realistic testing conditions, it means assistants have the same file system and network access as any other process you run.

Recommended precautions:

Run evaluations in a Docker container or virtual machine when testing untrusted code or prompts
Review benchmark tasks and assertions before running them
Use dedicated test environments rather than production systems
Be cautious with benchmarks that involve sensitive data or credentials

The native CLI approach is intentional—it ensures pitlane tests assistants in real-world conditions. But like any development tool that executes code, reasonable precautions are advisable.

Contributing

See CONTRIBUTING.md for development setup, testing guidelines, and how to submit changes.

License

Apache 2.0
