NiceEval


SUMMARY

Build an eval for your agent in 10 minutes


A progressive, full-featured, lightweight eval tool for AI agents, with excellent DX


中文 | Deutsch | Español | français | 日本語 | 한국어 | Português | Русский

NiceEval is a general-purpose agent eval tool inspired by eve. Its DX is designed so that anyone can get it installed and configured in about 10 minutes. It is also versatile: it can eval plugins, Hooks, and Skills written for coding agents such as Claude Code and Codex, and it can directly eval your own AI agent, whether that agent is built on AI SDK, LangGraph, Pi, or any other interface.

After an eval completes, NiceEval generates a readable report and lets you inspect the details of the agent's behavior, which helps with debugging and optimization.

Why NiceEval when DeepEval, LangFuse, and BrainTrust already exist

NiceEval is an AI-native eval tool. Tools built around datasets of golden Input vs. Expected Output pairs fit real agent evaluation poorly, because an agent's behavior is more than its final answer. NiceEval is built to evaluate agents at a finer grain: multi-turn conversations, multi-agent setups, tool calls, skill loading, and more.

It also coexists with LangFuse and BrainTrust: use them for tracing, or upload eval results to both (in progress).
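To make the contrast with golden-output comparison concrete, here is a minimal, self-contained sketch. The `Turn` and `ToolCall` types and both helper functions are hypothetical and written only for this illustration. They are not NiceEval's API. The sketch assumes a transcript where each turn records the tool calls the agent made.

```typescript
// Hypothetical types for illustration only; NOT NiceEval's real API.
type ToolCall = { name: string; input: Record<string, unknown> };
type Turn = { user: string; reply: string; toolCalls: ToolCall[] };

// Golden-style check: looks only at the final text of one turn.
function goldenMatch(turn: Turn, expected: string): boolean {
  return turn.reply.trim() === expected.trim();
}

// Behavior check: did this turn call the named tool with at least these inputs?
function calledTool(
  turn: Turn,
  name: string,
  input: Record<string, unknown> = {},
): boolean {
  return turn.toolCalls.some(
    (call) =>
      call.name === name &&
      Object.entries(input).every(([key, value]) => call.input[key] === value),
  );
}

// A two-turn transcript: the first reply is grounded in a tool call,
// the second sounds plausible but was produced without calling any tool.
const transcript: Turn[] = [
  {
    user: "What's the weather in Beijing today?",
    reply: "It is 21°C and sunny in Beijing.",
    toolCalls: [{ name: "get_weather", input: { city: "Beijing" } }],
  },
  {
    user: "What about Shanghai tomorrow?",
    reply: "Shanghai will be about 24°C with light rain.",
    toolCalls: [],
  },
];

const firstTurnGrounded = calledTool(transcript[0], "get_weather", { city: "Beijing" });
const secondTurnGrounded = calledTool(transcript[1], "get_weather");
const firstTurnMatchesGolden = goldenMatch(transcript[0], "Sunny, 21°C");
```

In this sketch the exact-match check rejects the first turn merely because the wording differs from the golden string, while the behavior check accepts it and flags the second turn, where no tool was called.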

Architecture

NiceEval supports two integration modes, depending on whether the system under test needs an isolated sandbox filesystem.

Mode 1: Sandbox (Docker, E2B) — run coding agents like Codex and Claude Code that need a sandbox

   evals/*.eval.ts
        │
        ▼
   ┌─────────────────────┐
   │     NiceEval        │
   └─────────────────────┘
        │
        │ Agent adapter (official)
        ▼
   ┌──────────────────────────────┐
   │        Docker Sandbox         │
   │   ┌────────────────────────┐  │
   │   │ Codex / Claude Code /  │  │
   │   │ apps needing isolation │  │
   │   └────────────────────────┘  │
   └──────────────────────────────┘

Mode 2: Direct — connect straight to your own AI Agent

   evals/*.eval.ts
        │
        ▼
   ┌─────────────────────┐
   │     NiceEval        │
   └─────────────────────┘
        │
        │ Agent adapter (official, or your own implementation)
        ▼
   ┌──────────────────────────────┐
   │       your own Web Agent      │
   │   (HTTP / AI SDK·LangGraph·   │
   │    Pi and other frameworks —  │
   │         no Docker needed)     │
   └──────────────────────────────┘
  • NiceEval core owns discovery, scheduling, scoring, reporting, and artifacts.
  • Agent adapters are the open boundary: you decide how to call the system under test.
  • Coding agents that need filesystem isolation run inside the Docker Sandbox; your own Web Agent can connect directly, without Docker.
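As a rough illustration of the adapter boundary in Direct mode, the sketch below wraps an HTTP agent behind a small interface. Everything in it is an assumption made for the example: the `AgentAdapter` and `AgentReply` shapes, the `httpAgent` name, the `/chat` path, and the JSON body are hypothetical, and NiceEval's real adapter interface may look different. The transport is passed in, so the sketch does not depend on any particular HTTP client.

```typescript
// Hypothetical shapes for illustration; NiceEval's real adapter interface may differ.
type AgentReply = { text: string; toolCalls: { name: string; input: unknown }[] };

interface AgentAdapter {
  send(message: string): Promise<AgentReply>;
}

type HttpInit = { method: string; headers: Record<string, string>; body: string };
type HttpResponse = { ok: boolean; status: number; json(): Promise<unknown> };
type Transport = (url: string, init: HttpInit) => Promise<HttpResponse>;

// Build the request for one user message.
// The /chat path and the { message } body are assumptions about the agent under test.
function buildChatRequest(baseUrl: string, message: string): { url: string; init: HttpInit } {
  return {
    url: baseUrl.replace(/\/+$/, "") + "/chat",
    init: {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ message }),
    },
  };
}

// Normalize whatever the agent returns into the shape the eval side expects.
function parseReply(payload: unknown): AgentReply {
  const data = (payload ?? {}) as { text?: unknown; toolCalls?: unknown };
  return {
    text: typeof data.text === "string" ? data.text : "",
    toolCalls: Array.isArray(data.toolCalls)
      ? (data.toolCalls as AgentReply["toolCalls"])
      : [],
  };
}

// The adapter itself: the only place that knows how to reach the system under test.
function httpAgent(baseUrl: string, transport: Transport): AgentAdapter {
  return {
    async send(message: string): Promise<AgentReply> {
      const { url, init } = buildChatRequest(baseUrl, message);
      const response = await transport(url, init);
      if (!response.ok) {
        throw new Error(`agent returned HTTP ${response.status}`);
      }
      return parseReply(await response.json());
    },
  };
}
```

In Node 18 or later, a transport could be as small as `(url, init) => fetch(url, init)`. Keeping request building and reply parsing as pure functions makes the adapter easy to unit test without a running agent.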

Example

Running an eval takes two files: the eval itself (what to check) and an experiment (which agent to run it against). The CLI won't run a bare eval id; the experiment in niceeval exp <experiment> <eval prefix> is what picks the system under test. Here's a real eval against a directly connected web agent (full project in examples/zh/ai-sdk/). It checks that the agent calls a tool for live weather questions and answers from the tool result instead of making it up:

// evals/eval-tool-call.eval.ts
import { defineEval } from "niceeval";

export default defineEval({
  description: "Verify the agent calls the weather tool and answers from its result",

  async test(t) {
    const turn = await t.send("What's the weather in Beijing today?");
    t.succeeded();

    await t.group("calls get_weather with the right city", () => {
      t.calledTool("get_weather", { input: { city: "Beijing" } });
      t.messageIncludes(/°C|sunny|cloudy|rain/);
    });

    const second = await t.send("What about Shanghai tomorrow?");
    second.messageIncludes("Shanghai");

    t.judge.autoevals
      .closedQA("Does the reply use the tool's weather data instead of making up a temperature?")
      .atLeast(0.7);
  },
});

// experiments/local.ts
import { defineExperiment } from "niceeval";
import { webAgent } from "./adapter"; // your agent adapter, pointed at the system under test

export default defineExperiment({
  agent: webAgent({ baseUrl: "http://127.0.0.1:5188" }),
});

npx niceeval exp local eval-tool-call  # run only eval-tool-call under the local experiment
npx niceeval view

For coding agents that need an isolated workspace (Codex, Claude Code plugins/skills), see examples/zh/coding-agent-skill/: evals there use t.sandbox.uploadDirectory() to seed the workspace, t.fileChanged() / t.file() to check what changed, and t.sandbox.runCommand() to run tests.

Quick Start

READ https://raw.githubusercontent.com/CorrectRoadH/niceeval/refs/heads/main/INIT.md and install niceeval for this repo.

Start from the scenario that matches what you need to evaluate:

Roadmap

Official Adapters

  • Agent Software

    • Claude Code
    • Codex
    • Bub
    • OpenClaw
    • Hermes Agent
    • Alma
    • ...
  • Agent Frameworks

    • AI SDK
    • LangGraph
    • Claude SDK
    • Codex SDK
    • vm0
    • Cursor Agent SDK

Documentation

Acknowledgements

This project was inspired by the projects below, and in some cases AI studied their code while building it:
eve
agent eval
ponytail

Thanks to the following communities
