NiceEval

A progressive, full-featured, lightweight AI agent eval tool with excellent DX

NiceEval is a general-purpose agent eval tool inspired by eve. Its DX is designed so that anyone can get it installed and configured in about 10 minutes. It is also versatile: it can eval plugins, Hooks, and Skills written for coding agents such as Claude Code and Codex, and it can directly eval your own AI agent framework (whether it is built on AI SDK, LangGraph, Pi, or any other interface, integration is easy).

After an eval completes, NiceEval generates a readable report and lets you inspect the details of the agent's behavior, which makes debugging and optimization easier.

Why NiceEval when DeepEval, LangFuse, and BrainTrust already exist?

NiceEval is an AI-native eval tool. Tools built around datasets of golden Input vs. Expected Output pairs don't fit real agent evaluation well. NiceEval is built to evaluate agents at a finer grain: multi-turn conversations, multi-agent setups, tool calls, skill loading, and more.

NiceEval also coexists with LangFuse and BrainTrust: you can keep using them for tracing, and uploading eval results to both is in progress.

Architecture

NiceEval supports two integration modes, depending on whether the system under test needs an isolated sandbox filesystem.

Mode 1: Sandbox (Docker, E2B) — run coding agents like Codex and Claude Code that need a sandbox

   evals/*.eval.ts
        │
        ▼
   ┌─────────────────────┐
   │     NiceEval        │
   └─────────────────────┘
        │
        │ Agent adapter (official)
        ▼
   ┌──────────────────────────────┐
   │        Docker Sandbox         │
   │   ┌────────────────────────┐  │
   │   │ Codex / Claude Code /  │  │
   │   │ apps needing isolation │  │
   │   └────────────────────────┘  │
   └──────────────────────────────┘

Mode 2: Direct — connect straight to your own AI Agent

   evals/*.eval.ts
        │
        ▼
   ┌─────────────────────┐
   │     NiceEval        │
   └─────────────────────┘
        │
        │ Agent adapter (official, or your own implementation)
        ▼
   ┌──────────────────────────────┐
   │       your own Web Agent      │
   │   (HTTP / AI SDK·LangGraph·   │
   │    Pi and other frameworks —  │
   │         no Docker needed)     │
   └──────────────────────────────┘

NiceEval core owns discovery, scheduling, scoring, reporting, and artifacts.
Agent adapters are the open boundary: you decide how to call the system under test.
Coding agents that need filesystem isolation run inside the Docker Sandbox; your own Web Agent can connect directly, without Docker.
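
To make the "open boundary" idea concrete, here is a minimal sketch of what a hand-written adapter for a directly-connected web agent might look like. Everything in it is an assumption for illustration: the AgentAdapter and AgentReply shapes, the /chat endpoint, and the injectable transport are hypothetical and are not NiceEval's actual adapter interface, so check the official adapters for the real contract.

```typescript
// Hypothetical shapes for illustration only; NiceEval's real adapter interface may differ.
interface AgentReply {
  text: string;
  toolCalls: { name: string; input: Record<string, unknown> }[];
}

interface AgentAdapter {
  send(message: string): Promise<AgentReply>;
}

// The transport is injectable so the adapter can be exercised without a live server.
type Transport = (url: string, body: unknown) => Promise<unknown>;

// Default transport: POST JSON using the global fetch (Node 18+ or browsers).
const httpTransport: Transport = async (url, body) => {
  const res = await fetch(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`agent returned HTTP ${res.status}`);
  return res.json();
};

function webAgent(opts: { baseUrl: string; transport?: Transport }): AgentAdapter {
  const transport = opts.transport ?? httpTransport;
  return {
    async send(message: string): Promise<AgentReply> {
      // "/chat" is an assumed endpoint on the system under test.
      const raw = (await transport(`${opts.baseUrl}/chat`, { message })) as Partial<AgentReply>;
      // Normalize missing fields so downstream checks always see the same shape.
      return { text: raw.text ?? "", toolCalls: raw.toolCalls ?? [] };
    },
  };
}
```

The point of the sketch is the division of labor: the adapter only translates between the eval runner and the system under test, so swapping HTTP for an in-process call to an AI SDK or LangGraph agent would change the transport and nothing else.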

Example

Running an eval takes two files: the eval itself (what to check) and an experiment (which agent to run it against). The CLI won't run a bare eval id — the experiment in niceeval exp <experiment> <eval prefix> is what picks the system under test. Here's a real eval against a directly-connected web agent (full project in examples/zh/ai-sdk/), checking that the agent calls a tool for live weather questions and answers from the tool result instead of making it up:

// evals/eval-tool-call.eval.ts
import { defineEval } from "niceeval";

export default defineEval({
  description: "Verify the agent calls the weather tool and answers from its result",

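  async test(t) {
    // Turn 1: ask a live weather question.
    const turn = await t.send("What's the weather in Beijing today?");
    t.succeeded();

    // The agent should call get_weather for Beijing and mention weather in its reply.
    await t.group("calls get_weather with the right city", () => {
      t.calledTool("get_weather", { input: { city: "Beijing" } });
      t.messageIncludes(/°C|sunny|cloudy|rain/);
    });

    // Turn 2: follow-up question about a different city.
    const second = await t.send("What about Shanghai tomorrow?");
    second.messageIncludes("Shanghai");

    // LLM-judged check: the reply must be grounded in the tool result (score of at least 0.7).
    t.judge.autoevals
      .closedQA("Does the reply use the tool's weather data instead of making up a temperature?")
      .atLeast(0.7);
  },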
});

// experiments/local.ts
import { defineExperiment } from "niceeval";
import { webAgent } from "./adapter"; // your agent adapter, pointed at the system under test

export default defineExperiment({
  agent: webAgent({ baseUrl: "http://127.0.0.1:5188" }),
});

npx niceeval exp local eval-tool-call  # run only eval-tool-call under the local experiment
npx niceeval view
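
Since the second CLI argument is an eval prefix rather than an exact id, it can help to picture how a prefix might select eval files. The sketch below is only an illustration under stated assumptions: it assumes an eval's id is its file name without the .eval.ts suffix, and the helper names and the extra file names are hypothetical. NiceEval's real discovery and matching rules may differ.

```typescript
// Hypothetical helpers: NiceEval's real discovery and id rules may differ.
// Assumes an eval's id is its file name without the ".eval.ts" suffix.
function evalIdFromPath(path: string): string {
  const file = path.split("/").pop() ?? path;
  return file.replace(/\.eval\.ts$/, "");
}

// Keep only the eval files whose id starts with the given prefix.
function selectEvals(paths: string[], prefix: string): string[] {
  return paths.filter((p) => evalIdFromPath(p).startsWith(prefix));
}

// Illustrative file names; only the first one appears in this README.
const discovered = [
  "evals/eval-tool-call.eval.ts",
  "evals/eval-tool-call-retry.eval.ts",
  "evals/multi-turn.eval.ts",
];

console.log(selectEvals(discovered, "eval-tool-call"));
```

Under this assumed matching rule, a prefix such as eval-tool-call would also pick up any sibling eval whose id begins with the same string, so a longer prefix is the way to narrow a run.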

For coding agents that need an isolated workspace (Codex, Claude Code plugins/skills), see examples/zh/coding-agent-skill/: evals there use t.sandbox.uploadDirectory() to seed the workspace, t.fileChanged() / t.file() to check what changed, and t.sandbox.runCommand() to run tests.

Quick Start

READ https://raw.githubusercontent.com/CorrectRoadH/niceeval/refs/heads/main/INIT.md and install niceeval for this repo.

Start from the scenario that matches what you need to evaluate:

Roadmap

Official Adapters

Agent Software
- Claude Code
- Codex
- Bub
- OpenClaw
- Hermes Agent
- Alma
- ...
Agent Frameworks
- AI SDK
- LangGraph
- Claude SDK
- Codex SDK
- vm0
- Cursor Agent SDK

Documentation

Quickstart

Acknowledgements

This project was inspired by the projects below, or was built with AI that learned from their code:
eve
agent eval
ponytail

Thanks to the following communities