quaere
Health Warn
- License — License: MIT
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Low visibility — Only 5 GitHub stars
Code Fail
- eval() — Dynamic code execution via eval() in .github/workflows/eval-manual.yml
Permissions Pass
- Permissions — No dangerous permissions requested
No AI report is available for this listing yet.
Process-correction skills that make coding agents ask before they act.
Quaere
Stop coding agents from confidently doing the wrong thing.
Coding agents rarely fail by saying "I do not know." They fail by sounding finished too early: they skim code, accept plausible claims, patch a wide diff, and report success before the cause is proven.
Quaere is four core skills plus three opt-in extensions for Claude Code, Codex CLI, and other skill-aware coding agents. A skill is a markdown file the agent loads on demand based on task context — each core skill gates a different drift point: read the code semantically, ground external facts, prove claims, and execute changes in small verified steps. The extensions add security auditing, structured ideation, and naming on top.
Quaere is an independent project, not affiliated with or endorsed by Anthropic. The skills run through Claude Code's and Codex CLI's built-in skill systems.
The name is Latin quaere — to ask, to seek, to inquire. The point is not more process; it is to wedge one move — make a claim, defend it with evidence, then act — in at the spots where agents drift.
Without Quaere: plausible claim -> broad patch -> partial test -> confident summary
With Quaere: claim -> evidence -> disconfirming probe -> scoped patch -> verified diff
In the in-tree eval sweep measured at v0.3.1, the same scenarios scored 53% assertion pass rate without the skills and 91% with them. The eval is not a substitute for external benchmarks; it is a concrete regression harness for the failure modes Quaere is designed to catch.
Measured effect · Skills · Picking a skill · Installation · Docs · quaere.dev · 日本語
Measured effect
The headline comes from the in-tree eval sweep against the v0.3.1 skill set. Skill bodies have changed since the measurement: v0.5.0 reorganized every skill for the Codex read cap, and unreleased commits since then added the quaere-naming extension, distilled quaere-semantic to its measured active core, and gated the confident / locally novel certainty labels on an executed probe. The published sweep numbers predate those changes:
| mode | assertion pass rate | scenario-level |
|---|---|---|
| Baseline (no skill) | 53% (56 / 106) | 0 / 18 pass |
| With skill | 91% (96 / 106) | 10 / 18 pass |
| Δ | +37.7 pp | +10 scenarios |
Measured at v0.3.1 on the 18-scenario / 106-assertion suite; the suite has since grown to 22 scenarios / 125 assertions. See docs/evaluation.md for measurement notes.
The eval is a regression harness for Quaere's own failure modes, not a third-party benchmark. A separate Terminal-Bench sweep (terminal-bench-core==0.1.1, v0.3.2 install pipeline) reports two cuts:
- +1.25 pp on the full 80-task set (41/80 → 42/80, 51.25% → 52.50%). Within typical run-to-run variance — read it as "does not regress".
- +8.7 pp on the 69 tasks where the install pipeline wasn't broken (52.2% → 60.9%). The other 11 tasks failed before the skill could load.
The per-category cut, taken on the clean 69-task subset, sits underneath those averages — with small per-category samples (n = 3 to 10), treat the splits as suggestive rather than confirmed effects: data-processing +60 pp, SWE-bench style +25 pp, security +22 pp, build/compile +17 pp, ML/AI −10 pp, the remaining five categories ~0. Variance notes and the per-task breakdown live in docs/evaluation.md.
What that looks like on one task
The Terminal-Bench task csv-to-parquet — "Convert /app/data.csv into a Parquet file named /app/data.parquet" — recovered (baseline ✗ → with-skill ✓). Same model, same task, two runs.
Baseline (test_data_matches: failed)
The CSV is small and straightforward. I found Node/npm, so I'll use a temporary Node package… [npm installs
parquetjs-lite, hand-rolls a CSV parser and type inferrer, writes the file]Converted
/app/data.csvto/app/data.parquet. Verified read-back: 5 rows, columns: name, age, city, first row: John, 25, New York…
The agent read its own output back through its own writer, found it consistent with itself, and declared success. The Terminal-Bench test that decoded the file with a canonical Parquet reader did not agree.
With skill (test_data_matches: passed)
I'll convert the CSV with a structured CSV reader and Parquet writer, then verify the output file exists and is readable. [bootstraps a
python3-venv, installspyarrow, reads throughpyarrow.csv, writes throughpyarrow.parquet.write_table, reads back throughpq.read_table]Converted
/app/data.csvto/app/data.parquet. Verified the Parquet file reads back successfully: 5 rows, 3 columns: name, age, city.
Two behaviors changed. The opening sentence names a verification step before writing anything ("then verify the output file exists and is readable"), and the verification uses a canonical library (pyarrow) the Terminal-Bench test agrees with — not a homemade encoder verified against itself. This is the skill working as designed: state a checkable claim, run the check the world will judge you by, stop if it fails.
Skills
Quaere is four core skills plus opt-in extensions. quaere install
installs the core set; extensions are installed on request
(quaere install --extensions, or quaere install --skill <name>).
Core (installed by default)
| Skill | Use when | Main safeguard |
|---|---|---|
skills/core/quaere-semantic |
You need to understand unfamiliar code, module intent, invariants, or why code is shaped a certain way before changing it. | Forces What (mechanical) / What (domain intent) / Why / Invariants / Failure / Connections (← / →) per meaningful unit and marks unknown intent instead of inventing it. |
skills/core/quaere-grounding |
The task depends on external, version-sensitive facts: SDKs, APIs, libraries, CLIs, cloud services, security advisories, changelogs, release notes, or docs. | Anchors local versions, ranks source quality, checks version fit and conflicts, and turns confirmed external facts into implementation constraints. |
skills/core/quaere-evidence |
You are handling unclear bugs, risky PR review, CI failures, flaky tests, security-sensitive changes, database/concurrency changes, external APIs, or claims that need evidence before patching. | Requires findings, hypotheses/claims, defense, disconfirming probes, decisions, verification, and handoff before accepting a fix. |
skills/core/quaere-execution |
You are authorized to implement a multi-step coding change, apply a plan, finish review feedback, or turn a specification into working code. | Enforces read → plan → execute → review → fix → verify → commit, with commits only when explicitly authorized. |
Extensions (opt-in)
| Skill | Use when | Main safeguard |
|---|---|---|
skills/extensions/quaere-audit |
You are doing deep security auditing, bug bounty preparation, protocol conformance checking, exploitability analysis, or specification-grounded vulnerability discovery. | Derives explicit security properties, maps attack surfaces and code, attempts proofs, gates false positives, and reports confirmed/potential/rejected findings with evidence or PoCs. Install with quaere install --skill audit. |
skills/extensions/quaere-invention |
You need a non-obvious approach, alternative architecture, research direction, product or monetization idea, or agent-skill design before committing to a plan. | Forces the agent to name the default basin it is escaping, break assumptions through structured mutation passes, classify novelty honestly with fixed labels (no self-rated "truly novel"), and design a kill-probe before promoting an idea. Install with quaere install --skill invention. |
skills/extensions/quaere-naming |
You need to name a product, SaaS, brand, library, open source project, CLI, bot, or app, or escape generic AI-slop names. | Forces a metaphor-driven process — naming brief before any name, conceptual territories instead of thesaurus synonyms, anti-pattern filtering, and a mandatory tool-verified availability gate (never from memory) — so only vetted finalists with origin stories reach the user. Install with quaere install --skill naming. |
Picking a skill
Pipeline for complex work
For multi-step work, the skills chain in this order:
quaere-semantic → quaere-grounding → quaere-evidence → quaere-execution
- Use
quaere-semanticfirst when misunderstanding existing code would make the implementation risky. - Use
quaere-groundingwhen implementation depends on external facts that may have changed. - Use
quaere-evidencewhen a claim, bug cause, review comment, or proposed fix needs proof. - Use
quaere-executionwhen it is time to implement the confirmed plan and verify the final diff.
A small implementation can use only quaere-execution in lightweight mode; a pure code-reading task can stop after quaere-semantic; SDK, cloud API, or dependency work can start with quaere-grounding.
For deep security work — discovering or validating vulnerabilities from properties, attack surfaces, and exploitability gates — install the quaere-audit extension (quaere install --skill audit). It coordinates the four core skills as needed.
When the risk is settling on the obvious answer too early — choosing an approach, architecture, or research/product direction before widening the option space — install the quaere-invention extension (quaere install --skill invention). Chained, it sits between grounding and evidence (semantic → grounding → invention → evidence → execution); standalone ideation can run invention → evidence.
Standalone: match the main risk
Use the first matching row that describes the main risk in the task:
| Main risk | Start with | Then use |
|---|---|---|
| The existing code's intent or invariants are unclear. | quaere-semantic |
quaere-execution only after the important invariants are known. |
| The answer depends on current SDK, API, CLI, cloud, advisory, or docs behavior. | quaere-grounding |
quaere-execution with only confirmed constraints, or quaere-evidence if facts conflict. |
| A bug cause, CI failure, flaky test, or review claim might be wrong. | quaere-evidence |
quaere-execution after a claim or hypothesis is confirmed. |
| The plan is already approved and implementation is the main work. | quaere-execution |
quaere-evidence if the work turns risky or the cause becomes unclear. |
| The task is to discover or validate vulnerabilities from specs and attack surfaces. | quaere-audit (extension) |
It coordinates quaere-semantic, quaere-grounding, quaere-evidence, and quaere-execution as needed. |
| You are about to commit to the obvious approach and want to widen the option space first. | quaere-invention (extension) |
Hands surviving candidates with kill-probes to quaere-grounding, quaere-evidence, or quaere-execution. |
Tie-breaker
If two skills seem plausible, choose the one that answers the blocking question first:
- "What does this code mean?" →
quaere-semantic - "Is this external fact true for this version?" →
quaere-grounding - "Is this claim actually proven?" →
quaere-evidence - "Are we ready to change files?" →
quaere-execution - "What security properties can fail?" →
quaere-audit - "Are we trapped in the obvious solution space?" →
quaere-invention
Installation
Quaere ships as the quaere-cli npm package. The CLI's only job is to copy skill files into ~/.claude/skills/ and ~/.agents/skills/, so a zero-install run is fine — no global package needed.
npx quaere-cli install
This auto-detects which agents are present and deploys to all of them. Pass an explicit target to scope the deployment:
npx quaere-cli install claude # only Claude Code
npx quaere-cli install codex # only Codex CLI
npx quaere-cli install all # both
Bun
bunx quaere-cli install
Global install
If you would rather have the CLI permanently in PATH:
npm install -g quaere-cli
quaere install # the package also exposes the `quaere` alias
Verifying the install
npx quaere-cli list # show installed skills and the recorded version
npx quaere-cli doctor # validate frontmatter, names, and line budgets
npx quaere-cli update # check GitHub Releases for a newer version
Substitute quaere for npx quaere-cli once you have installed globally.
Releases ship with npm provenance attestations (Sigstore OIDC) binding the tarball back to the release workflow at the exact tag. npm audit signatures verifies the chain end to end.
See CHANGELOG.md for the per-version change history; the Unreleased section is the next-up shipping list. The CLI behavior contracts are documented in docs/cli-contracts.md.
Examples
See examples/ for realistic prompts and expected behavior patterns.
Quick examples:
- "Read this module and explain the intent before we change it" →
quaere-semantic - "Check the installed SDK version and current docs before suggesting code changes" →
quaere-grounding - "This CI failure looks flaky; figure out whether the review comment is real before patching" →
quaere-evidence - "Apply this plan, run tests, review the diff, and commit if it passes" →
quaere-execution - "Audit this protocol implementation against the spec and produce confirmed or potential vulnerabilities with evidence" →
quaere-audit
Safety
- Commits happen only when the user explicitly authorizes them.
.agent-state/is local investigation state by default and should not be committed unless the user asks for it as an artifact.- For security-sensitive paths, database schema, or concurrency changes, use
quaere-evidencebefore patching — not after.
Evaluation
The in-tree harness lives at evals/. It is fast, mostly deterministic, and a good fit for per-PR checks. The headline numbers in Measured effect come from running it through Codex CLI.
Run a single scenario:
python evals/run_skill_evals.py \
--runner 'codex=codex exec - < $prompt_file' \
--scenario sdk-version-grounding \
--mode both \
--output-dir "$(pwd)/eval-results/$(date -u +%Y%m%dT%H%M%SZ)"
See docs/evaluation.md for measurement notes and evals/README.md for the full assertion-type table, judge backends, locale alternates, and Terminal-Bench adapter.
Docs
docs/evaluation.md— measured effect, variance notes, current benchmark limits.docs/cli-contracts.md— install, force, doctor, and update behavior contracts.docs/roadmap.md— external benchmark roadmap.
Contributing
Run the skills validator before publishing changes:
python scripts/validate_skills.py
It checks frontmatter, directory/name consistency, README and README.ja coverage, reachability-anchor positions within the Codex read cap, the line-count budget, that reference links resolve, and accidental .agent-state/ inclusion. GitHub Actions runs the same validation on push and pull request.
For changes under cli/ (the npm package), run the local check pipeline before committing:
cd cli
pnpm install --frozen-lockfile
pnpm check # oxlint + tsc --noEmit + vitest
The same pipeline runs in CI before publish, so a failing check there will hold the release.
License
MIT. See LICENSE.
Reviews (0)
Sign in to leave a review.
Leave a reviewNo results found