simplicio-cli

agent
Security Audit
Warn
Health Warn
  • License — License: NOASSERTION
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Low visibility — Only 5 GitHub stars
Code Pass
  • Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions Pass
  • Permissions — No dangerous permissions requested

SUMMARY

Your tasks with 99% accuracy using any LLM (Claude, DeepSeek, Codex, Gemini, Hermes, OpenClaw, Cursor).

README.md

simplicio-cli

simplicio-cli pipeline hero: one-line task to verified code change

"hide the Delete button for non-admins" → diff + test + applied + verified.
Works with OpenRouter, OpenAI, Anthropic, GLM, DeepSeek, Ollama — one env var.

pip install simplicio-cli

Why it works — the numbers

Same model. Same task. Only the prompt changes. Measured, reproducible, and scored deterministically.
Fourteen models tested across three runs — five sub-4B tiny models, six
frontier 2026 models, and three mid-tier 7B–12B open models. Every one gained
at least +14 points when wrapped in simplicio's 6-layer contract.

Tiny models — sub-4B, run on 2026-05-26 (50 runs/side, 260 checks)

| Model | Without simplicio | With simplicio | Gain |
|---|---|---|---|
| Gemma 3 4B (google/gemma-3-4b-it) | 38% | 96% | +58 pts |
| Llama 3.2 3B (meta-llama/llama-3.2-3b-instruct) | 28% | 73% | +45 pts |
| Gemma 3n e4B (google/gemma-3n-e4b-it) | 44% | 88% | +44 pts |
| Phi-4 mini (microsoft/phi-4-mini-instruct) | 36% | 73% | +37 pts |
| Llama 3.2 1B (meta-llama/llama-3.2-1b-instruct) | 26% | 40% | +14 pts |
| Tiny avg (5 models · 10 cases · 260 checks) | 35% | 74% | +39 pts (+112%) |

Not hosted on OpenRouter (requested but skipped): Gemma 3 270M, Gemma 3 1B,
Gemma 2 2B, Qwen3 0.6B, Qwen3 1.7B, Qwen2.5 0.5B, Qwen2.5 1.5B, Qwen 3B,
Nemotron Nano 4B (OpenRouter's smallest Nemotron is 9B). The sub-4B models
above were used as substitutes. simplicio still gains +14 to +58 points across
these tiny models, including +14 on the 1B-param Llama 3.2.

Frontier 2026 models — run on 2026-05-26 (60 runs/side, 312 checks)

| Model | Without simplicio | With simplicio | Gain |
|---|---|---|---|
| GPT-5.5 (openai/gpt-5.5) | 38% | 100% | +62 pts |
| Kimi K2.6 (moonshotai/kimi-k2.6) | 40% | 100% | +60 pts |
| Gemini 3.5 Flash (google/gemini-3.5-flash) | 42% | 100% | +58 pts |
| Qwen 3.7 Max (qwen/qwen3.7-max) | 44% | 100% | +56 pts |
| Claude Opus 4.7 (anthropic/claude-opus-4.7) | 42% | 98% | +56 pts |
| DeepSeek V4 Pro (deepseek/deepseek-v4-pro) | 44% | 96% | +52 pts |
| Frontier avg (6 models · 10 cases · 312 checks) | 41% | 99% | +58 pts (+136%) |

Mid-tier 7B–12B open models — earlier run (v0.2.2, 30 runs/side, 156 checks)

| Model | Without simplicio | With simplicio | Gain |
|---|---|---|---|
| Gemma 3 12B (google/gemma-3-12b-it) | 34% | 92% | +58 pts |
| Llama 3.1 8B (meta-llama/llama-3.1-8b-instruct) | 36% | 90% | +54 pts |
| Qwen 2.5 7B (qwen/qwen-2.5-7b-instruct) | 34% | 88% | +54 pts |
| Mid-tier avg (3 models · 10 cases · 156 checks) | 35% | 90% | +55 pts (+156%) |

Across all 14 models and three runs, the average gain is +51 points.
Smallest: +14 pts (Llama 3.2 1B; the contract still moves a 1B-param model).
Largest: +62 pts (GPT-5.5). The contract helps tiny sub-4B models, frontier
reasoning models, and mid-tier 7B–12B models alike. Per the frontier table,
four of the six frontier models hit a 100% pass-rate.
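
The headline figures can be re-derived from the per-model gains in the three tables. A minimal sketch, with the gains copied by hand from those tables; the variable names are illustrative and not part of simplicio, and the mean is unweighted across models:

```python
from statistics import mean

# Per-model gains in percentage points, copied from the three tables above.
gains = {
    "tiny": [58, 45, 44, 37, 14],
    "frontier": [62, 60, 58, 56, 56, 52],
    "mid_tier": [58, 54, 54],
}

# Flatten the tiers into one list of 14 per-model gains.
all_gains = [g for tier in gains.values() for g in tier]

overall_mean = mean(all_gains)  # unweighted mean over all models
print(f"models: {len(all_gains)}")
print(f"mean gain: {overall_mean:.1f} pts")
print(f"smallest: {min(all_gains)} pts, largest: {max(all_gains)} pts")
```

The mean comes out at roughly 50.6 points, which rounds to the +51 quoted above.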

Output-quality signals (rate across all 60 frontier runs)

| Signal | Raw prompt | With simplicio |
|---|---|---|
| DIFF block present | 36% | 98% |
| Target file mentioned | 1% | 100% |
| TEST block present | 88% | 98% |

Cost — tokens & wall-clock (measured, not estimated)

Same provider, same models, same cases. Token counts pulled from the API
usage field; latency from time.perf_counter() around each call.

| Side | Tokens / run | Wall-clock / run | Total tokens (60 runs) | Total time |
|---|---|---|---|---|
| Raw prompt | 1,967 | 46.1s | 118,040 | 46m 07s |
| With simplicio | 3,168 | 57.6s | 190,119 | 57m 33s |
| Δ | +61% | +24% | +72,079 | +11m 26s |
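
The measurement described above (tokens from the API usage field, latency from time.perf_counter()) could look roughly like the sketch below. It is not the benchmark's actual code: `timed_call` and `fake_call` are hypothetical names, and the response is assumed to be a dict with an OpenAI-style `usage.total_tokens` field.

```python
import time
from typing import Callable


def timed_call(call: Callable[[], dict]) -> dict:
    """Run one model call; return wall-clock seconds and total tokens."""
    start = time.perf_counter()
    response = call()
    elapsed = time.perf_counter() - start
    # Assumption: the response carries an OpenAI-style "usage" object.
    usage = response.get("usage", {})
    return {"seconds": elapsed, "total_tokens": usage.get("total_tokens", 0)}


def fake_call() -> dict:
    # Stand-in for a real API request so the sketch runs offline.
    return {
        "usage": {
            "prompt_tokens": 1200,
            "completion_tokens": 800,
            "total_tokens": 2000,
        }
    }


sample = timed_call(fake_call)
print(sample["total_tokens"], f"{sample['seconds']:.4f}s")
```

Summing `total_tokens` and `seconds` over all runs on each side would give the "Total" columns in the table.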

simplicio wraps the objective in a 6-layer contract. That means more input
tokens up front and longer completions, because the model produces the full
DIFF + TEST + EVIDENCE the contract demands instead of a one-line guess. The
bill goes up, but so do the pass-rate (41% → 99%) and the DIFF-block rate
(36% → 98%): useful tokens, not chat.

Six frontier models (GPT-5.5, Kimi K2.6, Gemini 3.5 Flash, Qwen 3.7 Max,
Claude Opus 4.7, DeepSeek V4 Pro) gained +52 to +62 points when wrapped
in simplicio's 6-layer contract, without changing the model and without
fine-tuning. Per the frontier table, four of the six landed at a 100%
pass-rate with simplicio.

Full report: bench/results.md · bench/results.pdf · raw outputs under .simplicio/bench_runs/.


How it works

mapper        WHERE   project structure + latest state
precedent     HOW-1   the real snippet in THIS repo that already does it
skill-router  HOW-2   the ONE mapper skill that matches (ranked, not all)
simplicio     BUILD   stacks the 6 layers into one prompt (cache-friendly)
test          JUDGE   contract written as testable states
verify        PROOF   ran it — did it actually pass? loop-fix up to 3x

The idea in one line: don't ask the model to guess — hand it the path.
Each layer terminates one decision the model would otherwise hallucinate.
Relevant > complete — inject the right context, never all of it.
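
To illustrate the "relevant > complete" idea, the sketch below assembles a prompt from labelled sections in a fixed order and skips any that are empty. The layer labels and the `build_prompt` helper are hypothetical; the real six layers are defined by the project itself (prompt.py and templates/simplicio_prompt.md) and are not reproduced here.

```python
# Hypothetical layer labels, loosely following the WHERE / HOW roles above.
LAYER_ORDER = ["WHERE", "HOW-1", "HOW-2", "OBJECTIVE", "CRITERIA", "CONSTRAINTS"]


def build_prompt(layers: dict) -> str:
    """Join the non-empty layers in a fixed order under simple headings."""
    sections = []
    for name in LAYER_ORDER:
        body = layers.get(name, "").strip()
        if body:  # leave out layers with nothing relevant to say
            sections.append(f"## {name}\n{body}")
    return "\n\n".join(sections)


prompt = build_prompt({
    "WHERE": "src/app/screen/screen.component.html",
    "HOW-1": "an existing permission check found elsewhere in the repo",
    "OBJECTIVE": "hide Delete button for non-admins",
    "CRITERIA": "- no admin perm: button absent from DOM",
})
print(prompt)
```

Keeping the order fixed means the leading sections stay identical between tasks in the same project, which is what prefix-based prompt caching generally relies on.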


Install

pip install simplicio-cli           # from PyPI
# or
pip install -e .                    # from this repo

Configure — any LLM, nothing hardcoded

| Provider | SIMPLICIO_MODEL | SIMPLICIO_BASE_URL |
|---|---|---|
| OpenRouter | anthropic/claude-opus-4 | https://openrouter.ai/api/v1 |
| GLM (z.ai) | glm-4.6 | https://api.z.ai/api/paas/v4 |
| DeepSeek | deepseek-chat | https://api.deepseek.com |
| OpenAI | gpt-4.1 | https://api.openai.com/v1 |
| Local (Ollama) | llama3 | http://localhost:11434/v1 |
| Anthropic native | claude-opus-4-7 | (leave unset) |

If SIMPLICIO_BASE_URL is unset and the key is ANTHROPIC_API_KEY, it uses the
native Anthropic SDK. Otherwise it uses an OpenAI-compatible client pointed at
your base_url — so any OpenAI-like provider works without code changes.
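
The selection rule in the paragraph above can be sketched as a small function. This is an illustration of the described behaviour, not the contents of providers.py; `pick_backend` and its return labels are hypothetical names.

```python
from typing import Mapping


def pick_backend(env: Mapping[str, str]) -> str:
    """Choose a client type from environment variables, per the rule above."""
    base_url = env.get("SIMPLICIO_BASE_URL", "").strip()
    # Native Anthropic SDK only when no base URL is set and an Anthropic key exists.
    if not base_url and env.get("ANTHROPIC_API_KEY"):
        return "anthropic-native"
    # Everything else goes through an OpenAI-compatible client.
    return "openai-compatible"


print(pick_backend({"ANTHROPIC_API_KEY": "placeholder-key"}))
print(pick_backend({
    "SIMPLICIO_BASE_URL": "https://openrouter.ai/api/v1",
    "ANTHROPIC_API_KEY": "placeholder-key",
}))
```

Passing `os.environ` instead of a literal dict would apply the same rule to the current shell.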

simplicio smoke      # prints provider config + one test call

Use

# index once (caches embeddings; re-run after big changes)
simplicio index --stack angular

# run a task
simplicio task "hide Delete button for non-admins" \
  --stack angular \
  --target src/app/screen/screen.component.html \
  --criteria "- no admin perm: button absent from DOM
- with admin perm: button present" \
  --constraints "- don't touch save flow
- build passes"

Each task: precedent (from cache) → skill match → 6 layers → LLM generates
(diff + test + Playwright) → apply → run SIMPLICIO_TEST_CMD. If the tests
pass, the task is done. Otherwise the error is sent back to the model for a
fix and the run retries (up to 3x).
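
The generate → test → fix loop can be sketched as follows. This is a simplified illustration, not the code in pipeline.py: `run_with_fixes` and the two stub callables are hypothetical, and "up to 3x" is read here as one initial attempt plus at most three fix attempts.

```python
from typing import Callable, Optional, Tuple


def run_with_fixes(
    generate: Callable[[Optional[str]], str],
    apply_and_test: Callable[[str], Tuple[bool, str]],
    max_fixes: int = 3,
) -> Tuple[bool, int]:
    """Return (passed, attempts). `generate` receives the last error or None."""
    error: Optional[str] = None
    for attempt in range(1, max_fixes + 2):  # 1 initial try + max_fixes retries
        patch = generate(error)
        passed, output = apply_and_test(patch)
        if passed:
            return True, attempt
        error = output  # feed the failure back into the next generation
    return False, max_fixes + 1


def fake_generate(error: Optional[str]) -> str:
    # Stub model: the first answer is wrong, the answer after feedback is right.
    return "patch-v1" if error is None else "patch-v2"


def fake_apply_and_test(patch: str) -> Tuple[bool, str]:
    # Stub test runner: only the second patch passes.
    if patch == "patch-v2":
        return True, ""
    return False, "Expected button to be absent from DOM"


result = run_with_fixes(fake_generate, fake_apply_and_test)
print(result)  # (True, 2)
```

In a real run, `apply_and_test` would apply the diff and invoke whatever SIMPLICIO_TEST_CMD points at.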


Cache — why it doesn't re-map every time

Embeddings are keyed by content hash, stored in .simplicio/. Unchanged
code block → vector reused. Change one file → only that block re-embeds.

| Run | Blocks embedded | Time |
|---|---|---|
| 1st (cold cache) | 3 | ~baseline |
| 2nd (no change) | 0 | ~instant |
| after editing 1 file | 1 | partial |
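
A content-hash cache of this kind can be sketched in a few lines. The example keeps vectors in memory rather than under .simplicio/, uses SHA-256 as an assumed hash, and substitutes a fake embedding function. `EmbeddingCache`, `fake_embed`, and `index` are hypothetical names, not the API of cache.py.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Reuse a vector whenever a block's content hash is already known."""

    def __init__(self, embed: Callable[[str], List[float]]) -> None:
        self._embed = embed
        self._vectors: Dict[str, List[float]] = {}
        self.misses = 0  # number of blocks that actually had to be embedded

    def get(self, block: str) -> List[float]:
        key = hashlib.sha256(block.encode("utf-8")).hexdigest()
        if key not in self._vectors:
            self._vectors[key] = self._embed(block)
            self.misses += 1
        return self._vectors[key]


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real embedding model.
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
blocks = ["<button>Delete</button>", "save() {}", "ngOnInit() {}"]


def index(items: List[str]) -> int:
    """Embed every block and return how many were newly embedded."""
    before = cache.misses
    for item in items:
        cache.get(item)
    return cache.misses - before


counts = [index(blocks)]                     # cold cache
counts.append(index(blocks))                 # nothing changed
blocks[0] = '<button *ngIf="isAdmin">Delete</button>'
counts.append(index(blocks))                 # one block edited
print(counts)  # [3, 0, 1]
```

The three counts mirror the "Blocks embedded" column in the table above.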

Benchmark — reproduce in 30 seconds

OPENROUTER_API_KEY=… \
  BENCH_MODELS="deepseek/deepseek-v4-pro,qwen/qwen3.7-max,moonshotai/kimi-k2.6,openai/gpt-5.5,anthropic/claude-opus-4.7,google/gemini-3.5-flash" \
  python3 bench/run_offline.py

No project required, stdlib only, deterministic regex scoring — no LLM judges
the LLM. Each case runs twice on the same model: raw one-line objective vs
simplicio's 6-layer contract. Outputs scored on target-file mention, DIFF
block, TEST block, contract-state words. Full numbers in bench/results.md.
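
A regex scorer along these lines might look like the sketch below. The patterns are assumptions chosen for illustration (a line starting with DIFF or TEST counts as a block); the actual rules live in bench/run_offline.py and may differ. `score_output` is a hypothetical name.

```python
import re
from typing import Dict, List


def score_output(text: str, target: str, state_words: List[str]) -> Dict[str, bool]:
    """Score one model output with fixed regex checks; no LLM is involved."""
    return {
        "target_file": re.search(re.escape(target), text) is not None,
        # Assumed convention: a block starts with its name at the start of a line.
        "diff_block": re.search(r"^DIFF\b", text, re.MULTILINE) is not None,
        "test_block": re.search(r"^TEST\b", text, re.MULTILINE) is not None,
        "state_words": all(
            re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE)
            for word in state_words
        ),
    }


target = "src/app/screen/screen.component.html"
words = ["absent", "admin"]

raw_output = "You could hide the button with an ngIf."
contract_output = "\n".join([
    "DIFF",
    "--- a/src/app/screen/screen.component.html",
    "+++ b/src/app/screen/screen.component.html",
    "TEST",
    "it('button is absent without admin perm', () => {})",
])

raw_score = sum(score_output(raw_output, target, words).values())
contract_score = sum(score_output(contract_output, target, words).values())
print(raw_score, contract_score)  # 0 4
```

Because every check is a fixed pattern, the same output always gets the same score, which is what makes the comparison repeatable.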

Full harness (your real project, your real tests)

simplicio bench --cases bench/cases.json --stack angular

Runs each case both ways (raw objective and simplicio contract), runs your real test command (e.g. ng test --watch=false) on each output, and writes the true pass-rate to bench/results.md.


Plug points (stubs marked in code)

| File | Replace with |
|---|---|
| prompt.py::_mapper | your real llm-project-mapper |
| pipeline.py::_aplicar_e_testar | extract diff → git apply → parse test result |
| skill_router.py | point SIMPLICIO_SKILLS_DIR at your mapper's skills |

Layout

simplicio/
  cli.py          # index | task | bench | smoke
  cache.py        # content-hash embedding cache
  precedent.py    # grep + semantic rank (uses cache)
  skill_router.py # picks the ONE matching skill
  prompt.py       # stacks the 6 layers
  providers.py    # any OpenAI-compatible endpoint + Anthropic native
  pipeline.py     # generate → test → fix loop
  bench.py        # with-vs-without harness
  templates/simplicio_prompt.md
bench/
  run_offline.py  # stdlib-only multi-model benchmark
  cases.json      # your benchmark tasks
  cases_offline.json
  results.md      # filled by `simplicio bench` / `run_offline.py`
  charts/         # SVG: overall, delta, by_case, by_stack

License

MIT
