NiceEval
Health: Warning
- No license — Repository has no license file
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Low visibility — Only 6 GitHub stars
Code: Warning
- network request — Outbound network request in examples/zh/ai-sdk/adapter/adapter.ts
Permissions: Passed
- Permissions — No dangerous permissions requested
Build an eval for your agent in 10 minutes
NiceEval
A progressive, full-featured, lightweight AI agent eval tool with excellent DX
中文 | Deutsch | Español | français | 日本語 | 한국어 | Português | Русский
NiceEval is a general-purpose agent eval tool inspired by eve. Its DX is designed so that anyone can get started and configured in about 10 minutes. It is also versatile: it can eval plugins, Hooks, and Skills written for coding agents such as Claude Code and Codex, and it can directly eval your own AI agent framework. Whether that framework is built on AI SDK, LangGraph, Pi, or any other interface, it is easy to integrate.
After an eval completes, NiceEval generates readable reports and lets you inspect the details of the agent's behavior, which is convenient for debugging and optimization.
Why NiceEval when DeepEval, LangFuse, and BrainTrust already exist?
NiceEval is an AI-native eval tool. Tools built around datasets of golden Input vs. Expected Output pairs don't fit real agent evaluation well. NiceEval is built to evaluate agents at a finer grain: multi-turn conversations, multi-agent setups, tool calls, skill loading, and more.
It also coexists with LangFuse and BrainTrust: use them for tracing, or upload eval results to both (in progress).
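To illustrate the Input vs. Expected Output mismatch described above, here is a minimal TypeScript sketch. The `Turn` and `ToolCall` shapes and both helper functions are hypothetical and are not part of NiceEval's API; only the `get_weather` tool name is borrowed from the example later in this README. It compares a golden-string check with a check on what the agent actually did.

```typescript
// Hypothetical record of one agent turn; NOT NiceEval's real types.
interface ToolCall {
  name: string;
  input: Record<string, unknown>;
}
interface Turn {
  reply: string;
  toolCalls: ToolCall[];
}

// Dataset-style check: compare the final text with a single golden answer.
function matchesGolden(turn: Turn, expected: string): boolean {
  return turn.reply.trim() === expected.trim();
}

// Behavior-level check: was the tool called with the right city, and does
// the reply contain a value the tool actually returned?
function usedWeatherTool(turn: Turn, city: string, toolResult: string): boolean {
  const called = turn.toolCalls.some(
    (c) => c.name === "get_weather" && c.input["city"] === city,
  );
  return called && turn.reply.includes(toolResult);
}

// A correct, grounded answer that is worded differently from the golden string.
const grounded: Turn = {
  reply: "It's 22°C and sunny in Beijing right now.",
  toolCalls: [{ name: "get_weather", input: { city: "Beijing" } }],
};
// A made-up answer that happens to match the golden string exactly.
const invented: Turn = { reply: "Beijing: 22°C, sunny", toolCalls: [] };

const golden = "Beijing: 22°C, sunny";
console.log(matchesGolden(grounded, golden), usedWeatherTool(grounded, "Beijing", "22°C")); // false true
console.log(matchesGolden(invented, golden), usedWeatherTool(invented, "Beijing", "22°C")); // true false
```

The golden comparison rejects the grounded answer and accepts the invented one, while the behavior-level check gets both right. That gap is why assertions on tool calls and multi-turn behavior matter for agents.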
Architecture
NiceEval supports two integration modes, depending on whether the system under test needs an isolated sandbox filesystem.
Mode 1: Sandbox (Docker, E2B) — run coding agents like Codex and Claude Code that need a sandbox
evals/*.eval.ts
        │
        ▼
┌─────────────────────┐
│      NiceEval       │
└─────────────────────┘
        │
        │  Agent adapter (official)
        ▼
┌──────────────────────────────┐
│        Docker Sandbox        │
│  ┌────────────────────────┐  │
│  │ Codex / Claude Code /  │  │
│  │ apps needing isolation │  │
│  └────────────────────────┘  │
└──────────────────────────────┘
Mode 2: Direct — connect straight to your own AI Agent
evals/*.eval.ts
        │
        ▼
┌─────────────────────┐
│      NiceEval       │
└─────────────────────┘
        │
        │  Agent adapter (official, or your own implementation)
        ▼
┌──────────────────────────────┐
│      your own Web Agent      │
│  (HTTP / AI SDK·LangGraph·   │
│   Pi and other frameworks,   │
│   no Docker needed)          │
└──────────────────────────────┘
- NiceEval core owns discovery, scheduling, scoring, reporting, and artifacts.
- Agent adapters are the open boundary: you decide how to call the system under test.
- Coding agents that need filesystem isolation run inside the Docker Sandbox; your own Web Agent can connect directly, without Docker.
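As a rough illustration of the adapter boundary described in the bullets above, here is a minimal TypeScript sketch of a Direct-mode adapter. Everything in it is an assumption made for illustration: the `AgentAdapter`, `AgentReply`, and `FetchLike` types, the `httpAgent` and `parseReply` helpers, the `/chat` route, and the JSON body shape are hypothetical and are not NiceEval's real adapter API. See the project's examples for the actual interface.

```typescript
// Hypothetical shapes for illustration only; NiceEval's real adapter types may differ.
interface ToolCall {
  name: string;
  input: Record<string, unknown>;
}
interface AgentReply {
  message: string;
  toolCalls: ToolCall[];
}
interface AgentAdapter {
  send(input: string): Promise<AgentReply>;
}

// Minimal fetch-like signature so the adapter can be exercised without a network.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Normalize whatever the agent returns into the reply shape an eval would score.
function parseReply(raw: unknown): AgentReply {
  const body = (raw ?? {}) as { message?: unknown; toolCalls?: unknown };
  return {
    message: typeof body.message === "string" ? body.message : "",
    toolCalls: Array.isArray(body.toolCalls) ? (body.toolCalls as ToolCall[]) : [],
  };
}

// Direct mode: talk to the running web agent over HTTP, with no Docker involved.
function httpAgent(baseUrl: string, fetchImpl: FetchLike): AgentAdapter {
  return {
    async send(input) {
      // "/chat" and the { input } body are assumed; use your agent's real route.
      const res = await fetchImpl(`${baseUrl}/chat`, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify({ input }),
      });
      if (!res.ok) throw new Error(`agent returned HTTP ${res.status}`);
      return parseReply(await res.json());
    },
  };
}
```

In this sketch the eval core would only ever see `AgentAdapter.send`, so swapping an HTTP agent for an in-process one means writing another small factory, not changing the evals. Injecting the fetch function is a design choice that keeps the adapter testable with a stub.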
Example
Running an eval takes two files: the eval itself (what to check) and an experiment (which agent to run it against). The CLI won't run a bare eval id — the experiment in niceeval exp <experiment> <eval prefix> is what picks the system under test. Here's a real eval against a directly-connected web agent (full project in examples/zh/ai-sdk/), checking that the agent calls a tool for live weather questions and answers from the tool result instead of making it up:
// evals/eval-tool-call.eval.ts
import { defineEval } from "niceeval";

export default defineEval({
  description: "Verify the agent calls the weather tool and answers from its result",
  async test(t) {
    const turn = await t.send("What's the weather in Beijing today?");
    t.succeeded();
    await t.group("calls get_weather with the right city", () => {
      t.calledTool("get_weather", { input: { city: "Beijing" } });
      t.messageIncludes(/°C|sunny|cloudy|rain/);
    });
    const second = await t.send("What about Shanghai tomorrow?");
    second.messageIncludes("Shanghai");
    t.judge.autoevals
      .closedQA("Does the reply use the tool's weather data instead of making up a temperature?")
      .atLeast(0.7);
  },
});
// experiments/local.ts
import { defineExperiment } from "niceeval";
import { webAgent } from "./adapter"; // your agent adapter, pointed at the system under test

export default defineExperiment({
  agent: webAgent({ baseUrl: "http://127.0.0.1:5188" }),
});
npx niceeval exp local eval-tool-call  # run only eval-tool-call under the local experiment
npx niceeval view
For coding agents that need an isolated workspace (Codex, Claude Code plugins/skills), see examples/zh/coding-agent-skill/: evals there use t.sandbox.uploadDirectory() to seed the workspace, t.fileChanged() / t.file() to check what changed, and t.sandbox.runCommand() to run tests.
Quick Start
READ https://raw.githubusercontent.com/CorrectRoadH/niceeval/refs/heads/main/INIT.md and install niceeval for this repo.
Start from the scenario that matches what you need to evaluate:
Roadmap
Official Adapters
Agent Software
- Claude Code
- Codex
- Bub
- OpenClaw
- Hermes Agent
- Alma
- ...
Agent Frameworks
- AI SDK
- LangGraph
- Claude SDK
- Codex SDK
- vm0
- Cursor Agent SDK
Documentation
Acknowledgements
This project was inspired by the projects below, or had AI learn from their code:
eve
agent eval
ponytail
Thanks to the following communities