llmtrim
Cut ~66% off your LLM bill. Drop-in proxy compresses input, output, and cache: any provider, answers unchanged, no extra model calls. Single Rust binary.
Cut ~66% off your LLM bill: input, output, and cache, with zero extra model calls.
Numbers • Before / after • Get started • Works with • Compared to • Benchmark • Security • Issues
A drop-in HTTPS proxy that compresses every LLM request and reply. Works with any provider, with no model in the loop. Quality holds, A/B-checked live on every benchmark case.
A request bleeds tokens in three places. llmtrim fixes all three:
- Input: system prompt, tool schemas, history. Resent every turn.
- Output: the model's reply. The expensive half.
- Cache: the invariant prefix. Re-billed in full when busted.
Every cut passes the token gate: a check that re-counts the result with the provider's real tokenizer and reverts any stage that doesn't save.
The guarantee: no net token win → auto-revert. Upstream rejects the request → the original is replayed verbatim. Worst case is zero savings - never a bigger bill, never a broken call.
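The gate described above can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not llmtrim's own code: `Stage`, `count_tokens`, `run_with_gate`, and the two toy stages are hypothetical names, and a whitespace word count stands in for the provider's real tokenizer. The replay-on-rejection path is not modelled.

```rust
/// A stage is any text-to-text rewrite (hypothetical signature for this sketch).
type Stage = fn(&str) -> String;

/// Stand-in token counter: whitespace-separated words.
/// Assumption: the real gate re-counts with the provider's tokenizer instead.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Apply each stage in order and keep its output only when it strictly
/// lowers the token count; otherwise fall back to the text before that stage.
fn run_with_gate(original: &str, stages: &[Stage]) -> String {
    let mut current = original.to_string();
    for stage in stages {
        let candidate = stage(&current);
        if count_tokens(&candidate) < count_tokens(&current) {
            current = candidate; // net win: keep the cut
        }
        // no saving: the stage is reverted by simply not adopting its output
    }
    current
}

/// Toy stage that helps: removes the filler word "please".
fn drop_filler(text: &str) -> String {
    text.split_whitespace()
        .filter(|w| *w != "please")
        .collect::<Vec<_>>()
        .join(" ")
}

/// Toy stage that hurts: it makes the text longer, so the gate rejects it.
fn add_banner(text: &str) -> String {
    format!("NOTE compressed by sketch: {}", text)
}
```

With both toy stages in the list, only `drop_filler` survives the gate, so the worst case for any stage is the unchanged input rather than a longer request.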
The numbers
Measured live, not estimated. Every one of the 112 A/B cases is sent twice (original and compressed), then answered, scored, and billed at real rates.
- Quality holds. Answers scored 78.9% original vs 82.2% compressed. The +3.3pp delta sits within the per-corpus confidence intervals (±5–15pp at these sample sizes), so read it as no degradation, not a bonus - per-corpus CIs in bench/README.md.
- The token cuts travel; the price tag varies. The cuts are model-independent: −31% input, −74% output. The cost saving depends on the model's output:input price ratio: −66% on the benchmark model (qwen/qwen3-next-80b-a3b-instruct, ≈12:1 ratio), projected −57–59% at GPT-4o / Claude Sonnet rates, less on reasoning models whose hidden thinking tokens can't be cut from the prompt side.
- Your prompt cache survives. On live Claude Code traffic, llmtrim cuts 68% of compressible input without ever touching the cached prefix. Your ~90% prompt-cache discount stays intact; llmtrim status shows yours.
| | original | compressed | saved |
|---|---|---|---|
| input tokens | 71,031 | 49,062 | −31% |
| output tokens | 25,843 | 6,628 | −74% |
| total tokens | 96,874 | 55,690 | −43% |
| round-trip cost | $0.0365 | $0.0126 | −66% |
| answer quality | 78.9% | 82.2% | Δ within CI (no measured degradation) |
Methodology + per-corpus frontier →
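To see why the same token cuts give different cost savings at different price ratios, the arithmetic can be written out. This is a sketch: `cost_saving` is a hypothetical helper, prices are normalised so that one input token costs 1 unit, and the 4:1 and 5:1 ratios in the usage note are assumed list-price ratios for GPT-4o and Claude Sonnet, not figures from this README.

```rust
/// Fractional cost saving for a given output:input price ratio.
/// One input token costs 1 unit; one output token costs `ratio` units.
fn cost_saving(
    in_before: f64,
    out_before: f64,
    in_after: f64,
    out_after: f64,
    ratio: f64,
) -> f64 {
    let before = in_before + ratio * out_before;
    let after = in_after + ratio * out_after;
    1.0 - after / before
}
```

Plugging in the token counts from the table (71,031 / 25,843 before, 49,062 / 6,628 after) gives a saving of about 0.66 at a 12:1 ratio, about 0.59 at 5:1, and about 0.57 at 4:1. At a ratio of 0, where only input is billed, it falls to the 31% input cut.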
What compression looks like
Each stage fires only where it pays, and only if the token gate nets a win:
Stages run in savings order: tool-output > retrieve > cache > output > json-sample > serialization > skeleton > dedup > micro-text. Nothing under a cache_control marker is ever rewritten.
| Stage | Lever | What it does | When it runs |
|---|---|---|---|
| T tool-output | toolout | lossless template fold first (consecutive runs and interleaved parallel-build lines), then window logs · diffs · grep · repetitive dumps to the signal: errors, changes, matches | auto · tool results |
| A cache discipline | cache | mark + stabilize the invariant prefix (sort tools/schema · OpenAI prompt_cache_key) so it stays cached across calls | auto · tools |
| B lexical retrieval | retrieve | BM25+ ranking with RM3 feedback (TextRank when query-less) · TextTiling cuts prose at topic shifts · budgeted submodular selection keeps the relevant non-redundant chunks; question protected | auto · long context |
| C skeletonization | skeleton | tree-sitter keeps the bodies of the query-relevant functions, drops the rest to signatures - 14 languages | auto · code |
| D serialize + hygiene | serialization | minify JSON, encode record arrays to TOON (a compact table encoding for JSON arrays) or CSV, Unicode-normalize | always · lossless |
| D+ json sample | json_crush | down-sample huge record arrays: keep first/last + outliers (errors, rare values) + a query-biased diverse sample | auto · big JSON |
| E dedup | dedup | collapse duplicate + near-duplicate lines (prose only; data untouched) | always · exact |
| F output control | output | terse instruction · Chain-of-Draft · token budget · native JSON schema | auto |
| G tool layer | tool | static tool selection + description trimming (schemas resent each call) | auto · tools |
| H multimodal | multimodal | downscale images to the provider's resolution cap | auto · images |
Default auto switches each stage on only where it pays. safe runs the lossless stages only. Full config →
Get started (60 seconds)
Is this safe? Everything runs locally - nothing is ever sent to us. llmtrim sees your LLM traffic only; every other connection passes through untouched.
setup changes three things (a certificate in ~/.llmtrim/, a proxy block in your shell profile, a background service) and llmtrim uninstall removes all three. Anything that can't be compressed safely is sent through unmodified. Full threat model: SECURITY.md.
# 1 - Install (any OS, prebuilt binary, no Rust needed)
npm install -g @llmtrim/cli && llmtrim setup
# 2 - Open a new shell. Your tools now route through llmtrim.
# 3 - Watch the bill shrink
llmtrim status --watch
No Node? Same result with the installers: curl -fsSL https://raw.githubusercontent.com/fkiene/llmtrim/main/install.sh | sh (Linux/macOS) or irm https://raw.githubusercontent.com/fkiene/llmtrim/main/install.ps1 | iex (Windows PowerShell).
Prefer your own package manager? brew install fkiene/tap/llmtrim, cargo binstall llmtrim, scoop install llmtrim (after scoop bucket add llmtrim https://github.com/fkiene/scoop-bucket), or docker run ghcr.io/fkiene/llmtrim - same binary everywhere. Prebuilt for x64 and ARM64; WSL uses the Linux line. Full options in INSTALL.md.
How it works
llmtrim is a local HTTPS proxy: it decrypts, compresses, and re-encrypts your LLM traffic on your own machine - the same technique as mitmproxy, scoped to LLM APIs. setup creates a private certificate that lets llmtrim read this one kind of traffic: it is technically restricted to LLM domains and cannot read your bank, email, or anything else. It also wires HTTPS_PROXY/NODE_EXTRA_CA_CERTS into your shell profile and starts the daemon at login. No IDE settings are touched.
Don't take the README's word for the "LLM domains only" part - check it yourself:
llmtrim ca # prints the certificate path, then:
openssl x509 -in ~/.llmtrim/ca.pem -noout -text | grep -A3 "Name Constraints"
# the domains listed there are the only ones it can ever intercept
without llmtrim:                      with llmtrim:
tool ──request──▶ LLM API             tool ──request──▶ llmtrim ──compressed──▶ LLM API
  ▲                  │                  ▲                  │ (gate · stream)       │
  └──── response ────┘                  └──── response ────┴── pass-through ───────┘
     full bill                             −66% bill, answer unchanged
There's no API key to manage - it forwards your tool's own auth. The CA is name-constrained to LLM domains, and only a metadata-only counts ledger touches disk (Security →).
llmtrim status # health + savings: ● running · ✓ port ✓ env ✓ ca · $ saved · by-model
llmtrim doctor # something off? end-to-end diagnosis; each failing check names its fix
llmtrim uninstall # exact inverse of setup: daemon, profile block, CA, binary - all reversed
If the daemon ever stops, your tools fail fast with a connection error rather than silently bypassing compression. llmtrim doctor names the problem; llmtrim start fixes it.
llmtrim start # start the interceptor in the background (setup does this)
llmtrim serve # or foreground (Ctrl-C to stop)
llmtrim stop # stop the daemon
llmtrim update # update to the latest release + restart the daemon (channel-aware)
llmtrim autostart # run at login (--off to disable)
llmtrim ca # print the CA path + how to trust it system-wide (for GUI apps)
llmtrim status --daily # time-series report (--weekly/--monthly); --json/--csv to export
status doubles as a health check. It verifies the whole chain (daemon → port → env → CA → traffic) and exits 0 healthy / 1 stopped / 2 degraded. status -q prints just healthy|degraded|stopped for scripts; the JSON export carries the same under daemon.health.
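A supervising program can branch on those exit codes. The sketch below assumes only what the paragraph above states (0 healthy, 1 stopped, 2 degraded, and the `status -q` form); `Health`, `health_from_exit_code`, and `check_llmtrim` are hypothetical names, and any other exit code is treated as unknown.

```rust
use std::process::Command;

#[derive(Debug, PartialEq)]
enum Health {
    Healthy,
    Stopped,
    Degraded,
}

/// Map the documented exit codes to a health value; anything else is unknown.
fn health_from_exit_code(code: i32) -> Option<Health> {
    match code {
        0 => Some(Health::Healthy),
        1 => Some(Health::Stopped),
        2 => Some(Health::Degraded),
        _ => None,
    }
}

/// Run `llmtrim status -q` and interpret its exit code.
/// Returns an I/O error if the binary cannot be started at all, and
/// `Ok(None)` if the process was killed by a signal or used an unknown code.
fn check_llmtrim() -> std::io::Result<Option<Health>> {
    let status = Command::new("llmtrim").args(&["status", "-q"]).status()?;
    Ok(status.code().and_then(health_from_exit_code))
}
```

Keeping the mapping in a pure function separates the policy (what each code means) from the process call, so the mapping can be unit-tested without a running daemon.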
Default auto routes each request to its shape's preset, with breakers that keep it safe on live traffic:
- cache skips a client managing its own cache_control (no 400s)
- retrieve protects directive blocks
- tool_select never drops an already-invoked tool
On agent traffic, tool-description trimming is the big lever - clients resend long tool schemas on every call.
Works with
Any tool that honors HTTPS_PROXY and an env-provided CA - which is every CLI agent and most Node apps:
| Tool | Works | Notes |
|---|---|---|
| Claude Code | ✓ | −68% of compressible input on live traffic, cache discount intact |
| Codex CLI | ✓ | |
| Gemini CLI | ✓ | |
| Cursor / VS Code extensions | ✓ | Node-based: picks up NODE_EXTRA_CA_CERTS |
| Aider, OpenCode, any HTTPS_PROXY-aware CLI | ✓ | |
| Your own app / SDK | ✓ | or call the library / one-shot CLI directly |
| GitHub Copilot | ✗ | certificate pinning - can't be intercepted |
Providers come from the llm_providers registry (OpenAI, Anthropic, Google, DeepSeek, Mistral, xAI, Moonshot, Zhipu, Qwen, MiniMax, Cerebras, OpenRouter, …) and update with the crate. Every non-LLM connection is blind-tunneled untouched.
One-shot & library
Use the same compression without the proxy - from the CLI or as a Rust library:
echo '{"model":"gpt-4o","messages":[...]}' | llmtrim compress --provider openai > out.json
echo '{"model":"gpt-4o","messages":[...]}' | llmtrim send --provider openai # compress + call + print
use llmtrim::{compress, compress_with_config};
use llmtrim::config::DenseConfig;
use llmtrim::ir::ProviderKind;
let result = compress(request_json, Some(ProviderKind::OpenAi))?; // env/file config, auto-detect with None
println!("{} -> {} input tokens", result.input_tokens_before, result.input_tokens_after);
let result = compress_with_config(request_json, Some(ProviderKind::OpenAi), &DenseConfig::default())?;
Compared to
Three neighbors solve parts of the same problem - good company to be in. RTK pioneered CLI-output filtering, caveman the terse-output skill, and Headroom is the closest peer on the input side. Each compresses one layer; llmtrim does the whole round-trip.
| | llmtrim | Headroom | RTK | caveman |
|---|---|---|---|---|
| Whole round-trip (input · output · cache) | ✓ | input only | CLI only | output only |
| Can't increase your bill (auto-revert gate) | β | β | β | β |
| Live A/B: savings and answer quality | β | offline evals | β | tokens only |
| Install: one static binary | β | Python + GB models | β | β |
| Overhead it adds / request | <10 ms | 52 ms median* | <10 ms | n/a |
| Prompt overhead it injects | 19 tokens | n/a | n/a | 949 tokens (always-on skill) |
| Deterministic: same request β same result | β | β | β | β |
* Headroom's own production telemetry (161 ms mean, 4.2 s P99) - sources in the feature comparison below.
They stack. llmtrim removes another 35% from Claude Code's resent tool schemas on top of RTK. On agentic tool output it saves 93–98%, with the bill measured both ways.
vs Caveman

Measured on caveman's own 10 benchmark prompts, same model (gpt-oss-20b), real API token counts (bench/results-caveman):
| | llmtrim | caveman |
|---|---|---|
| Output reduction | −69% | −80% (deeper) |
| Instruction cost | 19 tokens (prompts/output_terse.txt) | 949 tokens (always-on skill, o200k) |
| Net tokens saved / request | ~728 | ~698 |
| Quality on 9 prompts | 1 truncation (2048 cap) | 1 hard fail (empty completion) + 1 thinned answer |
| Beyond output | input + cache + tool schemas | output only |
The caveman persona cuts deeper, but its skill costs 50× the instruction tokens and carried the only hard failure. Net per request the two land within a few percent of each other - llmtrim gets there without the persona risk, quality-gated, and also compresses the input and cache sides caveman doesn't touch. Both arms reproducible: bench/scripts/caveman_ab.py.
The trade is pure-Rust simplicity + cache-correctness vs ML reach:
| | llmtrim | Headroom |
|---|---|---|
| Runtime | single 47 MB static binary, 0 deps | Python + numpy / onnxruntime / transformers / magika / fastembed (100s of MB to GB) |
| Latency it adds | <10 ms per request, measured here: 0.5 ms at 5 KB, 7 ms at 49k tokens. ~110 ms one-time startup. The smaller prompt often makes the call faster overall | 52 ms median / 161 ms mean, P99 4.2 s - self-reported production telemetry* |
| Models | none (deterministic) | ONNX detection (magika) + learned text compressor (Kompress) + embeddings |
| Tool output | log / diff / grep + repetitive fallback, adaptive/aggressive auto-split | SmartCrusher / log / diff / search (ML-assisted) |
| Cache discipline | never rewrites the cache_control prefix + tool/schema sort + OpenAI prompt_cache_key | live-zone byte-range surgery + cache stabilization |
| Output side | terse / Chain-of-Draft / token-budget shaping | input-side only |
Where Headroom leads (honest): ML content detection, semantic relevance, a learned text compressor, cross-agent memory, an MCP server, more providers (Bedrock / Vertex). Savings are in the same league (llmtrim 93–98%; Headroom ~92%).
* llmtrim's latencies are measured here (cargo bench --bench latency). Headroom's numbers are self-reported on its benchmarks page.
Benchmark
The benchmark measures two things per request, both live:
- tokens saved: real tokenizer, at compress time
- quality retained: A/B delta between the answer on the original vs the compressed request
A preset only counts if quality holds at its saving, so the (saved, retained) frontier is the benchmark, not the saving alone. It also shows where compression pays (output-heavy generation, chat, reasoning) and where it can't (cache workloads, short extractive RAG). Full per-corpus frontier + CIs in bench/README.md.
Scoring uses ground truth where possible: numeric-exact (math), pass@1 running the unit tests (code), token-F1 (QA), tool-call match (agents), LLM judge (open-ended).
python3 bench/scripts/download.py 40 # pull + hash real corpora (gsm8k, humaneval, dolly, hotpotqa, glaive, ultrachat, cnn)
bash bench/scripts/run_all.sh # live A/B (needs OPENROUTER_API_KEY; builds --features live)
python3 bench/scripts/chart.py # regenerate the chart + table
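Of the scoring methods listed above, token-F1 is the easiest to show in isolation. The sketch below follows the common SQuAD-style definition (lowercased whitespace tokens, multiset overlap, harmonic mean of precision and recall). It is an assumption that the benchmark's own scorer works this way; `token_f1` is a hypothetical name and real scorers often also strip punctuation and articles.

```rust
use std::collections::HashMap;

/// Token-level F1 between a predicted answer and a reference answer.
/// Tokens are lowercased, whitespace-separated words; overlap is counted
/// as a multiset intersection, so repeated words only match as often as
/// they appear in the reference.
fn token_f1(prediction: &str, reference: &str) -> f64 {
    let pred: Vec<String> = prediction
        .split_whitespace()
        .map(|t| t.to_lowercase())
        .collect();
    let gold: Vec<String> = reference
        .split_whitespace()
        .map(|t| t.to_lowercase())
        .collect();

    // Two empty answers agree; one empty answer cannot match anything.
    if pred.is_empty() || gold.is_empty() {
        return if pred.is_empty() && gold.is_empty() { 1.0 } else { 0.0 };
    }

    // Count reference tokens, then consume them as the prediction matches.
    let mut gold_counts: HashMap<&str, usize> = HashMap::new();
    for t in &gold {
        *gold_counts.entry(t.as_str()).or_insert(0) += 1;
    }
    let mut overlap = 0usize;
    for t in &pred {
        if let Some(count) = gold_counts.get_mut(t.as_str()) {
            if *count > 0 {
                *count -= 1;
                overlap += 1;
            }
        }
    }
    if overlap == 0 {
        return 0.0;
    }

    let precision = overlap as f64 / pred.len() as f64;
    let recall = overlap as f64 / gold.len() as f64;
    2.0 * precision * recall / (precision + recall)
}
```

A terse answer that contains the reference scores well on recall, while a padded answer is penalised on precision, which is why this metric suits an A/B between an original and a compressed request.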
Configuration
Zero config needed. Default auto routes every request by its shape: tools → agent, code → code, long-context + question → rag, else → aggressive. To force a preset, set one line - preset = "<name>" in the config TOML ($LLMTRIM_CONFIG or $XDG_CONFIG_HOME/llmtrim/config.toml) or LLMTRIM_PRESET=<name>.
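The routing rule above reads naturally as a first-match chain. The sketch below is illustrative only: `RequestShape` and its fields are hypothetical, how llmtrim actually detects tools, code, or a question is not described here, and the precedence is assumed to follow the order in which the sentence lists the cases.

```rust
#[derive(Debug, PartialEq)]
enum Preset {
    Agent,
    Code,
    Rag,
    Aggressive,
}

/// Hypothetical summary of what a request contains.
struct RequestShape {
    has_tools: bool,
    has_code: bool,
    long_context: bool,
    has_question: bool,
}

/// First matching rule wins, in the order the text lists them.
fn route(shape: &RequestShape) -> Preset {
    if shape.has_tools {
        Preset::Agent
    } else if shape.has_code {
        Preset::Code
    } else if shape.long_context && shape.has_question {
        Preset::Rag
    } else {
        Preset::Aggressive
    }
}
```

Under this reading, a long document without a question does not qualify for the rag preset and falls through to aggressive.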
| preset | for |
|---|---|
| auto (default) | routes each request to the proven preset for its shape - right for almost everyone |
| safe | lossless only - byte-faithful round-trip (lossy stages off) |
Known workload? Name a preset: reasoning (math / step-by-step) or cache (a fixed prefix reused across calls). Naming one yourself rarely beats auto. Power users can hand-tune raw flags (preset wins over flags).
| field | default | meaning |
|---|---|---|
| hygiene | true | Stage D minify (+ base64 strip if enabled) |
| normalize_unicode | false | NFKC fold + strip invisible/format waste (lossy; in aggressive) |
| serialize | true | Stage D TOON encoding |
| serialize_nested | true | also encode arrays nested in content JSON |
| serialize_min_rows | 2 | min array rows before encoding |
| serialize_csv | false | encode flat arrays as both TOON and CSV, keep the smaller |
| serialize_flatten | false (on in agent/aggressive) | flatten nested-uniform records to dotted columns (meta.region) |
| serialize_buckets | false (on in agent/aggressive) | partition heterogeneous record arrays into per-shape TOON tables |
| json_crush / json_crush_max_rows | false / 50 (on in agent/aggressive) | sample record arrays longer than the cap (keep first/last + outliers + a query-biased sample); lossy |
| strip_base64 | false (on in auto) | elide base64/data-URI blobs (≥200 chars) to a [elided] marker; lossy but measured +0.0pp (bench/data/base64.jsonl) |
| numeric_sig_figs | (none) | round floats to N significant figures (lossy) |
| output_control | false | Stage F terse instruction + cap |
| output_level | "terse" | terse (clean) or draft (Chain-of-Draft) |
| output_max_tokens | (none) | impose a hard cap when the request has none |
| output_token_budget | (none) | inject a soft "answer within N tokens" budget |
| output_compact_code | false | instruct minified-code output (model-gated) |
| retrieve | false | Stage B lexical retrieval (lossy) |
| retrieve_keep_ratio | 0.5 | fraction of the segment's tokens kept (the selection budget) |
| retrieve_reorder | false | head+tail U-shape (lost-in-the-middle; lossless) |
| retrieve_mmr | false | MMR diversity-aware selection |
| retrieve_sentence | false | training-free DSLR sentence pruning (answer + boundary protected) |
| cache / cache_max_breakpoints | false / 4 | Stage A cache_control breakpoints (lossless) |
| dedup | true | collapse exact-duplicate lines (lossless) |
| dedup_near | false | also collapse near-duplicate lines (SimHash) |
| ngram / ngram_max_entries | false / 32 | reversible n-gram abbreviation (lossless) |
| tool_select / tool_trim_desc | false | Stage G keep relevant tools / trim descriptions |
| toolout | false (on in agent/aggressive) | Stage T tool-output compression (log / diff / grep + repetitive fallback); positional elision |
| toolout_mode | "auto" | Stage T split: adaptive · aggressive · auto (per-segment by noise density) |
| toolout_max_lines / toolout_min_lines | 40 / 20 | keep-budget ceiling / skip segments shorter than this |
| toolout_template | true | lossless template fold before windowing: consecutive runs (Drain) + interleaved lines (LSH grouping) |
| skeletonize / minify_code | false | Stage C drop bodies / strip indentation (lossless) |
| skeleton_keep_full_top_k | 5 | bodies kept for the top-k functions overlapping the conversation (Hierarchical Context Pruning) |
| skeleton_drop_unmatched / skeleton_drop_min_body_lines | false / 8 | also drop zero-overlap functions ≥ N lines entirely (on in aggressive) |
| multimodal / image_detail | false | Stage H downscale to the provider's cap |
| tool_minify_schema | false (on in agent/aggressive) | minify tool JSON-Schemas in place (drop title/$schema/examples, dedup boilerplate descriptions): stays valid JSON Schema |
| quality_gate | true | after the token gate, revert a lossy cut whose query-relevant coverage drops below the calibrated threshold ("saved tokens by deleting the answer") |
| memo | true | proxy-only memo: a conversation prefix seen last turn reuses its compressed bytes verbatim, so the provider's prefix cache stays warm on agent loops (in-memory only) |
Env: LLMTRIM_PRESET (preset by name), LLMTRIM_CONFIG (config-file path), LLMTRIM_DB_PATH (ledger location).
Security
llmtrim sits between your tool and the provider - its trust model is the product. Full threat model in SECURITY.md:
- Local CA, name-constrained. Generated on your machine (~/.llmtrim/ca.pem, key 0600), X.509-constrained to LLM API domains. Even a stolen key can't mint a cert for any other host. Trusted per-tool via NODE_EXTRA_CA_CERTS; every non-LLM connection blind-tunnels untouched.
- No keys, no prompts on disk. Forwards your tool's own auth. Prompt/response text stays in memory - never logged, never persisted.
- Binds 127.0.0.1 only. No client auth; never expose it on a public interface.
- Metadata-only ledger (~/.local/share/llmtrim/tracking.db) - provider, model, token counts, never content. Cap 100k events; retention_days = N to age-prune; uninstall --purge wipes it.
Report vulnerabilities privately via a security advisory, not a public issue.
Known limits
These are the current limits, surfaced by the same A/B that proves the savings:
- Anthropic / Gemini counts are approximate. No public exact tokenizer, so an o200k BPE proxy is used and flagged (is_exact() == false, surfaced in status). OpenAI is exact (tiktoken).
- Output savings aren't measured live. The proxy compresses input; an output saving needs the A/B counterfactual, which only offline bench has. status "saved" is input-side.
- Default is quality-gated, not lossless. Lossy stages run where the eval shows quality holds; the token gate ensures fewer tokens, not quality. Want a byte-faithful round-trip? Use safe.
Acknowledgments
Every lever is a deterministic implementation of published research - the ideas are theirs, the engineering and the token gate are ours.
Papers + crates behind each stage

Retrieval & context (Stage B)
- BM25: Robertson & Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond (2009) · bm25
- BM25+: Lv & Zhai, Lower-Bounding Term Frequency Normalization (CIKM 2011) - δ floor so an occurrence always beats absence
- RM3: Lavrenko & Croft, Relevance-Based Language Models (SIGIR 2001) - pseudo-relevance feedback for sparse queries
- TextTiling: Hearst, TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages (CL 1997) - prose chunk boundaries at lexical-cohesion valleys
- TextRank: Mihalcea & Tarau, TextRank: Bringing Order into Texts (EMNLP 2004)
- MMR: Carbonell & Goldstein, The Use of MMR, Diversity-Based Reranking… (SIGIR 1998)
- Submodular selection: Lin & Bilmes, A Class of Submodular Functions for Document Summarization (ACL 2011) + cost-ratio knapsack greedy, arXiv:2008.05391 - token-budgeted chunk/row selection (CELF lazy greedy)
- Diverse sampling: Chen et al., Fast Greedy MAP Inference for DPP (NeurIPS 2018) - the json-sample diversity fill
- Lost in the Middle: Liu et al. (2023), arXiv:2307.03172 - head+tail reordering
- DSLR: Hwang et al. (2024), arXiv:2407.03627 - sentence-level pruning
Code (Stages C, F)
- RepoCoder: Zhang et al. (2023), arXiv:2303.12570 - AST skeletons beat raw source for non-focus code
- Hierarchical Context Pruning: Zhang et al. (2024), arXiv:2406.18294 - keep full bodies only for the completion-relevant functions (our ranking is lexical, not embeddings)
- The Hidden Cost of Readability: Pan et al. (2025), arXiv:2508.13666 - code minification
- Reducing Token Usage … via Minification: Hrubec & Cito (2026), arXiv:2606.01326 - per-transformation token accounting
Tool output (Stage T)
- Drain: He et al., Drain: An Online Log Parsing Approach with Fixed Depth Tree (ICWS 2017) - the consecutive template fold
- Brain: Yu et al., Brain: Log Parsing with Bidirectional Parallel Tree (IEEE TSC 2023) - positional-voting template extraction
- LogLSHD: Huang et al. (2025), arXiv:2504.02172 - MinHash-LSH grouping of interleaved same-template lines (ours is deterministic: first-N voting, alphanumeric tokens kept)
Dedup & abbreviation (Stages E, E+)
- SimHash: Charikar, Similarity Estimation Techniques from Rounding Algorithms (STOC 2002) · gaoya
- CompactPrompt: Choi et al. (2025), arXiv:2510.18043 - n-gram abbreviation
- Maximal repeats: Becher et al., Efficient Repeat Finding via Suffix Arrays (arXiv:1304.0528) + Re-Pair, Larsson & Moffat (DCC 1999) - the dictionary miner: all maximal repeated phrases, selected by real token gain
Output control (Stage F)
- Chain-of-Draft: Xu et al. (2025), arXiv:2502.18600 - terse reasoning steps
- TALE: Han et al. (2024), arXiv:2412.18547 - soft "answer within N tokens" budget
Serialization (Stage D)
- TOON (Token-Oriented Object Notation) - Johann Schopplich · toon-format
Built on the Rust ecosystem: tiktoken-rs, toon-format, bm25, gaoya, tree-sitter, pest, image, unicode-normalization, whatlang, hudsucker, rusqlite.
Try it on one real session
Install, open a new shell, and leave llmtrim status --watch running while you work. If the dollars column doesn't move, llmtrim uninstall reverses everything. Found a request it mangled? Set LLMTRIM_CAPTURE_DIR and open an issue with the before/after capture - a repro is a fix. And if it saved you money, a star helps others find it.
License
AGPL-3.0-only: use, modify, and self-host freely. Running llmtrim locally to compress your own traffic triggers no obligations - the AGPL only applies if you offer a modified llmtrim as a network service to others, in which case you must release your source under the same license. Contributions via DCO sign-off.