openclaw-claude-code-mlx-server
Health Warn
- No license — Repository has no license file
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Low visibility — Only 6 GitHub stars
Code Fail
- rm -rf — Recursive force deletion command in .claude/settings.json
- eval() — Dynamic code execution via eval() in scripts/inspect_mlx_cache.py
Permissions Pass
- Permissions — No dangerous permissions requested
This tool is a local inference server for Apple Silicon that runs MLX-framework language models behind an OpenAI-compatible API. It uses a dual-pipeline architecture to maximize KV cache hit rates, significantly reducing latency for repeated prompts.
Security Assessment
The overall risk is Medium. The tool operates locally and does not request dangerous system permissions or contain hardcoded secrets. However, the rule-based scan flagged two notable code issues. First, it uses dynamic code execution via `eval()` in a cache inspection script, which can be a security vulnerability if exploited. Second, a recursive force deletion command (`rm -rf`) was found inside a configuration file (`.claude/settings.json`). While the tool makes local network requests to bridge client requests to the model, developers should carefully review the flagged files before executing the setup.
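Why `eval()` gets flagged: if any part of the evaluated string can be influenced by untrusted input, it executes arbitrary code. A common mitigation, shown here as a hypothetical sketch rather than code from this repository, is `ast.literal_eval`, which parses Python literals without executing anything:

```python
import ast

def parse_cache_metadata(raw: str):
    """Parse a repr()-style metadata string without executing code.

    ast.literal_eval only accepts Python literals (dicts, lists, strings,
    numbers), so a payload like "__import__('os').system(...)" raises
    ValueError instead of running.
    """
    return ast.literal_eval(raw)

print(parse_cache_metadata("{'tokens': 128, 'model': 'glm4'}"))
```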
Quality Assessment
The project is active and recently maintained, with the last push occurring today. Despite its highly detailed and well-documented README, community trust and visibility are currently very low, evidenced by only 6 GitHub stars. Additionally, the repository completely lacks a license file, meaning the legal terms of use, modification, and distribution are undefined. This poses a potential compliance issue for enterprise or open-source adoption.
Verdict
Use with caution—the code contains risky patterns like `eval()` and `rm -rf`, and the lack of a license makes it legally ambiguous for broader projects.
Lightweight local bridge for running an mlx-lm model behind an OpenAI-compatible API and exposing it through a LiteLLM proxy.
MLX LLM Server
A local, KV-cache-aware LLM inference server for Apple Silicon built on the MLX framework. Exposes an OpenAI-compatible API through a LiteLLM proxy. The core value proposition is maximum KV cache hit rate to minimize prefill latency — on a 27B model, cold 10k-token prefill costs ~42 s while a warm hit is effectively free.
Overview
- Dual-pipeline cache architecture: the canonical pipeline drives the cache key, the original pipeline drives the model. They never recombine after `_canonicalize_messages()`, so cache stability never compromises model input.
- Two cache layers: a global block-hash cache (token-level) and a message-aware stable-prefix cache (structure-level). Secondary lookup recovers hits after mid-history tool-result injections.
- Apple Silicon native: MLX + `mlx-lm` for text models, `mlx-vlm` for vision-language models (auto-detected from `config.json`).
- OpenAI-compatible streaming: real token-by-token SSE with proper `reasoning_content` and `content` deltas so clients (PI, OpenClaw, Claude Code, LiteLLM) render thinking blocks and answers as they arrive.
- Multi-agent safe: canonical-form write guards on the session turn store prevent sub-agent branches from clobbering the orchestrator's turn record; per-session lineage tracking supports branch-return cache reuse.
- Request watchdog: a wall-clock deadline per request releases `model_lock` on stalls, evicts partial KV state, and terminates the response with `finish_reason="length"`.
- Correctness observability: counters (`vlm_retreat`, `hybrid_trim_miss`, `exact_key_rejected_by_model_lcp`) surface cache-correctness regressions as they happen.
Quick Start
```bash
# One-command bootstrap (creates .venv, installs deps, execs server)
python3.12 install_and_run.py

# Manual setup
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && python start-llm.py

# Smoke test
curl -s http://127.0.0.1:4000/v1/models
```
`install_and_run.py` has a preflight check that refuses to run if the venv has been contaminated by a non-3.12 Python (e.g. a stray `python3.14 -m pip` run against the existing venv). Python 3.12 is strictly required.
Request path: Client -> LiteLLM Proxy (port 4000) -> MLX Server (port 8080) -> MLX Model.
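For reference, a minimal Python client sketch against the proxy (assumes the default port; the model name is a placeholder, since the real id is whatever `PROXY_MODEL_ID` exposes):

```python
import json
import urllib.request

def chat_request(prompt: str, base_url: str = "http://127.0.0.1:4000/v1"):
    """Build an OpenAI-style chat completion request for the local proxy."""
    body = {
        "model": "local",  # placeholder; use the PROXY_MODEL_ID your proxy exposes
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("hello")
# urllib.request.urlopen(req) sends it once the server is running
print(req.full_url)
```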
Core Mandates
- The model must never be blinded. All semantic content (tool results, `<think>` blocks, file contents, images) reaches the model 100% intact. A cache hit that hides content is worse than a miss.
- Separate cache key from model input. Canonical tokens drive cache lookup; original tokens drive the model. The two are computed at `_canonicalize_messages()` and never recombined.
- Structural normalisation, not string substitution. Volatile fields are normalised at message-struct level before `apply_chat_template`. Post-render scrubs are atomic, line-scoped only — no `re.DOTALL` spanning section boundaries.
- RoPE correctness. KV state reuse is position-aware. Physical KV offsets (`_kv_cache_offset(cache)`) slice the original `model_tokens` for the remaining prefill; canonical token counts are never used for model-side slicing.
- No gut-feel patches. Every change to caching or normalisation cites a specific log-confirmed failure mode. See `.context/LESSONS.md` before touching cache logic.
Architecture
The dual pipeline
```
Incoming Request
      |
      v
[1] _heal_messages()
      |   Restores stripped <think> blocks from HEALING_STORE (SHA-256 keyed)
      v
[2] _canonicalize_messages(messages)
      |-- Returns: (original, canonical) — deep copies; original is never modified
      |-- Canonical mutations: message_id -> __STABLE_MSG_ID__,
      |     Inbound Context block -> __STABLE_INBOUND_CONTEXT_SECTION__,
      |     sub-agent "Stats: runtime ..." -> __STABLE_RUNTIME__
      |                                  |
      v                                  v
[3] original_messages              [4] canonical_messages
    (Model Input Pipeline)             (Cache Key Pipeline)
      |                                  |
      v                                  v
apply_chat_template()              apply_chat_template()
      |                                  |
      v                                  v
model_tokens                       cache_prompt_raw
      |                                  |
      |                            [5] _scrub_cache_key()
      |                                Atomic, line-scoped scrubs:
      |                                - "Current time is ..." timestamps
      |                                - cch=<hex>; telemetry headers
      |                                - -anthropic-billing-header: ...
      |                                - <system-reminder>...</system-reminder>
      |                                  |
      |                            [5a] Image-identity markers (G1)
      |                                For VLM requests: inject a 2-token
      |                                SHA-256-derived marker after each
      |                                <|image_pad|> run so identical
      |                                message structure with different
      |                                images yields different cache keys
      |                                  |
      v                                  v
Prefill & Generate <--- KV Match --- prompt_tokens (canonical cache key)
(from model_tokens)                  Lookup via LRUPromptCache +
                                     SessionIndex + SESSION_TURN_STORE
```
Canonicalisation layers
| Layer | Function | Scope | What it normalises |
|---|---|---|---|
| Message-struct | `_canonicalize_messages()` | Before rendering | `message_id`, Inbound Context block, sub-agent `Stats: runtime` |
| Inbound Context | `_canonicalize_inbound_context_block()` | Inside `_canonicalize_messages` | Produces identical canonical output whether OpenClaw includes or omits the block |
| Post-render | `_scrub_cache_key()` | After `apply_chat_template` | Timestamps, `cch=` headers, billing headers, `<system-reminder>` tags |
| Image identity | `_inject_image_markers()` | Post-tokenisation, canonical only | 2-token SHA-256 marker per image — forces cache-key divergence on image swap |
Cache system
Two independent layers work together to maximise hits.
Layer 1 — Global block-hash cache (LRUPromptCache)
Token-based, no assumptions about session identity or message structure.
- Chain hashes: prompt tokens are split into 16-token blocks; each block's hash is SHA-256-chained with the prior block's hash.
- Longest-candidate selection: among entries sharing the same early block hashes, always pick the deepest matching entry. Without this, a short entry shadows a long one and collapses hit rate.
- Model-space LCP verification: candidate discovery is canonical-token-based, but final acceptance requires `entry.model_tokens` to be a prefix of the request's `model_tokens`. This guards against canonical-prefix collisions across different VLM image expansions.
- Hybrid-cache awareness: Qwen3.6 and other hybrid VLMs expose `ArraysCache` + `KVCache` layers that `can_trim_prompt_cache` reports as non-trimmable. Candidates that would require trim are rejected rather than silently reused at the wrong KV depth.
- Cost-aware eviction: score = `age / (sqrt(length) * log(frequency))`. Protects long conversations and frequently-used prefixes; evicts short, old branches.
- Prefix culling: when a longer entry is inserted, shorter strict prefixes are removed (the longest one is kept as a safety net). Requires a model-token-space prefix match, not just canonical.
- Metal memory management: `mx.metal.clear_cache()` runs on eviction to return Metal buffers to the OS pool and prevent monotonic GPU memory growth.
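The chain-hash scheme can be sketched in a few lines (a simplified illustration; the server's real hashing and partial-block handling may differ):

```python
import hashlib

BLOCK_SIZE = 16  # matches the documented PROMPT_CACHE_BLOCK_SIZE default

def block_chain_hashes(tokens: list[int]) -> list[bytes]:
    """Chain-hash token blocks: each block's digest folds in the previous
    digest, so a match at block k implies the entire prefix up to block k
    matches. Incomplete trailing blocks are ignored."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = tokens[i:i + BLOCK_SIZE]
        h = hashlib.sha256(prev + b"|" + ",".join(map(str, block)).encode()).digest()
        hashes.append(h)
        prev = h
    return hashes
```

Because each digest is chained, two prompts that share the first k blocks produce identical first k hashes and diverge everywhere after the first differing block.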
Layer 2 — Message-aware stable-prefix (SESSION_TURN_STORE)
Secondary layer that recovers hits after mid-history mutations (tool-result injections, early whitespace drift, message_id churn).
- Canonical-form storage and diff: the turn record stores `_canonicalize_messages()` output, and `_message_diff()` compares canonical-form leading prefixes. This lets OpenClaw drift (per-turn `message_id`, Stats runtime, Inbound Context variants) collapse to equality because it IS equal in cache-key space.
- Exact token boundaries: `_compute_msg_token_boundaries()` converts the stable message count to an exact token count via cumulative `apply_chat_template(messages[:i+1])` rendering. Falls back to even distribution only for VLMs without a usable text template.
- Secondary lookup: if the global layer returns fewer cached tokens than the stable-prefix layer estimates, a secondary `fetch_nearest_cache` scoped to the stable prefix forces a match up to the change point.
- Strict-append write guard: only records when the new canonical list strictly extends the stored one. Prevents sub-agent or parallel-branch requests from clobbering the orchestrator's turn record.
Positional correctness (RoPE)
On every cache hit, `_kv_cache_offset(cache)` reads the physical KV depth from the first attention layer that exposes `.offset` (hybrid VLM caches mix `ArraysCache` + `KVCache`; layer 0 may lack it). That offset slices the original `model_tokens` for the remaining prefill. RoPE stays coherent even when canonical and original pipelines differ by thousands of tokens (e.g., a 200-token JSON Inbound Context block collapses to a single `__STABLE_INBOUND_CONTEXT_SECTION__` sentinel in the canonical pipeline).
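The offset walk described above can be sketched as follows (hypothetical duck-typed layer objects stand in for the real MLX cache layers):

```python
def kv_cache_offset(cache) -> int:
    """Walk cache layers and return the first physical KV offset found.
    Hybrid VLM caches mix layer types, so layer 0 may lack .offset;
    returns 0 (fresh prefill) when no layer exposes an offset."""
    for layer in cache:
        offset = getattr(layer, "offset", None)
        if offset is not None:
            return offset
    return 0
```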
Session management
- `SessionIndex`: per-session cache-key history with parent/branch lineage. Anchor prefixes at 2048-token strides support branch-return reuse.
- `SessionContext`: extracted from the request body (`session_id`, `conversation_id`, `thread_id`, `chat_id`, or nested `metadata`/`extra_body`). Falls back to a SHA-1 hash of the prompt's first 128 tokens when no explicit ID is provided.
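The fallback logic can be sketched as follows (simplified; per the description above, the server also searches nested `metadata`/`extra_body` fields):

```python
import hashlib

def resolve_session_id(body: dict, prompt_tokens: list[int]) -> str:
    """Return an explicit session identifier when the client supplies one,
    else derive a stable fallback from a SHA-1 of the prompt's first
    128 tokens."""
    for key in ("session_id", "conversation_id", "thread_id", "chat_id"):
        if body.get(key):
            return str(body[key])
    head = ",".join(map(str, prompt_tokens[:128])).encode()
    return hashlib.sha1(head).hexdigest()
```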
Reasoning and streaming
Streaming split: reasoning_content vs content
The server uses a character-level state machine (`_ReasoningStreamSplitter`) to classify every generated chunk into one of three streams — reasoning (to `delta.reasoning_content`), content (to `delta.content`), or tool-call buffer (held for end-of-stream extraction). The classifier:
- Handles `<think>...</think>` blocks emitted by Qwen3 and the orphan-close variant emitted by GLM ("reasoning\nanswer" with no opening tag).
- Holds a small lookback so tags split across two generated chunks are detected correctly — no premature emission of partial sentinels.
- Matches `_strip_thinking_from_content` byte-for-byte on the concatenated `content` stream for every non-pathological case (verified by an import-time self-test plus a 150-trial random-chunking soak). This means the stream a client receives is identical to the non-streaming `message.content`, keeping the healing hash consistent across `stream=true`/`stream=false` retries.
- Switches to a buffered tool-call mode on the first `<tool_call>` sentinel; end-of-stream extraction then emits `delta.tool_calls` plus any trailing non-tool text.
PI, LiteLLM's OpenAI-compatible wrapper, and other clients that understand reasoning_content render thinking live. Clients that don't understand the field see identical content and ignore the rest — no back-compat break.
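A minimal single-pass splitter illustrating the lookback idea (a toy sketch: it handles only well-formed `<think>...</think>` pairs, not the orphan-close variant or tool-call buffering):

```python
def split_reasoning(chunks):
    """Split streamed chunks into (reasoning, content). Text inside
    <think>...</think> goes to the reasoning stream; everything else to
    the content stream. A small tail of the buffer is held back so a tag
    split across two chunks is never emitted as partial text."""
    reasoning, content = [], []
    buf, in_think = "", False
    for chunk in chunks:
        buf += chunk
        while True:
            tag = "</think>" if in_think else "<think>"
            idx = buf.find(tag)
            if idx < 0:
                # hold back a possible partial tag at the buffer tail
                keep = max(0, len(buf) - (len(tag) - 1))
                (reasoning if in_think else content).append(buf[:keep])
                buf = buf[keep:]
                break
            (reasoning if in_think else content).append(buf[:idx])
            buf = buf[idx + len(tag):]
            in_think = not in_think
    (reasoning if in_think else content).append(buf)  # final flush
    return "".join(reasoning), "".join(content)
```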
Non-streaming responses
The non-streaming path still derives message.content from the legacy _strip_thinking_from_content + _extract_openai_tool_calls pipeline (byte-identical to prior versions). When thinking was requested, message.reasoning_content is populated additively via the same splitter.
Healing store
HEALING_STORE is a SHA-256-keyed OrderedDict (cap 2000). When a client echoes a stripped assistant message back in the next turn, _heal_messages looks up the hash of (content, canonicalised tool_calls) and swaps the stripped message for the full original (with <think> blocks intact). Tool-call arguments are canonicalised to parsed-dict form before hashing so clients that deserialise arguments differently from the server still hit the store.
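A simplified sketch of the store's shape (capped, SHA-256-keyed, LRU-evicted; the real keying canonicalises tool-call arguments more carefully than a plain JSON dump):

```python
import hashlib
import json
from collections import OrderedDict

HEALING_CAP = 2000

class HealingStore:
    """SHA-256-keyed LRU map from stripped assistant messages back to
    their full originals (with <think> blocks intact)."""

    def __init__(self):
        self._store = OrderedDict()

    @staticmethod
    def _key(content: str, tool_calls) -> str:
        # sort_keys gives a stable serialisation regardless of dict order
        canon = json.dumps([content, tool_calls], sort_keys=True)
        return hashlib.sha256(canon.encode()).hexdigest()

    def put(self, content, tool_calls, full_message):
        key = self._key(content, tool_calls)
        self._store[key] = full_message
        self._store.move_to_end(key)
        while len(self._store) > HEALING_CAP:
            self._store.popitem(last=False)  # evict least recently stored

    def get(self, content, tool_calls):
        return self._store.get(self._key(content, tool_calls))
```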
Tool calls
The server extracts structured OpenAI tool_calls from model-generated XML:
- Legacy: `<tool_call>name <arg_key>k</arg_key><arg_value>v</arg_value></tool_call>`
- Qwen: `<tool_call><function=name><parameter=key>value</parameter></function></tool_call>`
Both patterns are recognised independently of `model_family`; parser order is family-prioritised.
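As an illustration, a hypothetical extractor for the Qwen-style pattern only (the server's actual parser is family-prioritised and also handles the legacy form):

```python
import re

# Matches <tool_call><function=name>...</function></tool_call> and the
# <parameter=key>value</parameter> pairs inside the function body.
_FN = re.compile(
    r"<tool_call>\s*<function=([^>]+)>(.*?)</function>\s*</tool_call>", re.DOTALL
)
_PARAM = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)

def extract_tool_calls(text: str):
    """Return a list of {"name", "arguments"} dicts for each tool call."""
    calls = []
    for name, body in _FN.findall(text):
        args = {k: v for k, v in _PARAM.findall(body)}
        calls.append({"name": name, "arguments": args})
    return calls
```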
Reasoning control
Any of these request fields map to `enable_thinking` (bool) internally:
- `enable_thinking: true/false`
- `thinking: off|on|low|medium|high|xhigh|...` (OpenClaw / Anthropic style)
- `reasoning_effort: off|low|medium|high|xhigh` (OpenAI / LiteLLM style)
- `reasoning: {enabled|effort|level|type: ...}` (Responses-style)
- Nested `metadata.*` / `extra_body.*` variants of the above
Default is controlled by `DEFAULT_THINKING` (default `true`).
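The mapping can be sketched as a single resolution function (simplified; per the list above, the real server also checks the nested `metadata.*`/`extra_body.*` variants):

```python
def resolve_enable_thinking(body: dict, default: bool = True) -> bool:
    """Collapse the various client-side reasoning fields to one bool.
    Explicit enable_thinking wins; string effort levels map "off" to
    False and everything else to True."""
    if "enable_thinking" in body:
        return bool(body["enable_thinking"])
    for field in ("thinking", "reasoning_effort"):
        if field in body:
            return str(body[field]).lower() != "off"
    if isinstance(body.get("reasoning"), dict):
        r = body["reasoning"]
        if "enabled" in r:
            return bool(r["enabled"])
        return str(r.get("effort", "medium")).lower() != "off"
    return default
```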
VLM (vision-language model) support
VLMs are auto-detected from `config.json` via a whitelist of mlx-vlm `model_type` values (Qwen3.6, Qwen2.5-VL, GLM-4V, Llava, Gemma3, etc.). When a VLM is loaded:
- Messages with `image_url` or `input_image` content parts are processed via `mlx_vlm.utils.prepare_inputs`. Data URLs are base64-decoded to PIL; bare URLs pass through as strings.
- The canonical cache key is derived using the text-only tokeniser (CPU, no GPU lock needed) so the server doesn't re-acquire `model_lock` for cache lookup.
- Image identity (G1): each image payload yields a 2-token SHA-256-derived marker injected after its `<|image_pad|>` run in the canonical token stream. Two requests with the same message structure but different images now produce different cache keys. Without this, `<|image_pad|>` is a single token id that encodes identically for any image, and the block-hash index happily serves image A's KV for a request carrying image B.
- Partial-image safety net: if a cache cut lands mid-image-pad-run (diagnosed via `_vlm_cache_covers_partial_image`), the server evicts the offending entry, retreats to fresh prefill, and bumps the `vlm_retreat` counter. With G1 active, this should never fire on clean traffic — any bump is a signal.
- Metal sync (`torch.mps.synchronize()` + `mx.eval()`) runs before generation to prevent Metal encoder collisions between PyTorch/torchvision and MLX.
Safety and correctness
Generation watchdog (G2)
Every request arms a `threading.Timer(GENERATION_WATCHDOG_SECONDS)` after acquiring `model_lock`. If it fires, a cooperative abort flag is set; the generator yield loop checks it between tokens and exits cleanly. On abort: partial KV state is not inserted into the cache, the session turn record is not advanced, the healing store is not updated, and the response terminates with `finish_reason="length"`. The timer is cancelled unconditionally in the `finally` block so `model_lock` always releases. Default 600 s — set `GENERATION_WATCHDOG_SECONDS=0` to disable.
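The cooperative-abort pattern can be sketched as follows (a toy illustration; the real watchdog also handles KV-state eviction and turn-record rollback):

```python
import threading
import time

def generate_with_watchdog(steps, deadline_s: float):
    """Cooperative abort: a timer sets an Event; the token loop checks it
    between steps and exits cleanly instead of being killed mid-step.
    Returns (tokens_produced, finish_reason)."""
    aborted = threading.Event()
    timer = threading.Timer(deadline_s, aborted.set)
    timer.start()
    produced = []
    try:
        for tok in steps:
            if aborted.is_set():
                return produced, "length"  # watchdog fired: stop cleanly
            produced.append(tok)
        return produced, "stop"
    finally:
        timer.cancel()  # unconditional, so the caller's lock always releases
```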
Canonical-form write guard (G3)
_update_session_turn_store canonicalises the incoming message list and compares its leading prefix against the stored canonical form. Only strict appends in canonical space advance the record; any semantic divergence (new system prompt, mid-history edit, sub-agent branch) refuses the write to preserve the best-known linear record.
Cache-correctness counters (P4)
Three counters, atomically bumped and surfaced per-request in logs and at startup:
| Counter | Meaning | Expected value |
|---|---|---|
| `vlm_retreat` | Mid-image cache cut forced a fresh prefill | 0 on clean traffic (G1 effectiveness) |
| `hybrid_trim_miss` | Hybrid VLM cache rejected because `can_trim_prompt_cache` is False | 0 or low; justifies G4 (pre-generation checkpoint) |
| `exact_key_rejected_by_model_lcp` | Exact canonical key hit rejected because `model_tokens` diverged | 0 — any bump is a canonicalisation-over-normalisation bug |
Thread safety
- `model_lock` — serialises GPU access (prefill + decode).
- `prompt_cache_lock` — protects `PROMPT_CACHE`, `SESSION_TURN_STORE`, `SESSION_INDEX`.
- `HEALING_STORE_LOCK` — protects the healing OrderedDict.
- `console_lock` — serialises terminal output.
- `_CACHE_METRICS_LOCK` — protects the correctness counter dict.
HTTP server fairness
The server runs single-threaded (`HTTPServer`) because MLX streams are per-thread and mlx-vlm's `generate_step` issues bare `mx.async_eval` outside a `with mx.stream(...)` context. Every response sets `Connection: close` and `self.close_connection = True` so the accept loop doesn't park on a keep-alive socket from LiteLLM's connection pool. When upstream mlx-vlm moves to `mx.new_thread_local_stream`, this can revert to `ThreadingHTTPServer`.
Other
- `_assert_cache_key_safety` (opt-in via `CACHE_NORM_SAFETY_CHECK=true`): asserts the cache key is not dramatically shorter than the original prompt (>10% delta = over-match). Falls back to the original prompt as cache key on violation.
- Metal crash fix: PyTorch MPS is disabled at import time (`torch.backends.mps.is_built = lambda: False`) to prevent Metal command buffer collisions with MLX.
- Preserve-thinking probe: at startup, `_probe_preserve_thinking_support()` tests whether the active tokenizer/processor accepts the `preserve_thinking` kwarg, caching the result so `apply_chat_template` calls stay consistent across cache-key and model-input paths.
Environment variables
Core
| Variable | Default | Description |
|---|---|---|
| `MODEL_PATH` | `lmstudio-community/GLM-4.7-Flash-MLX-4bit` | HuggingFace repo or local path |
| `MODEL_FAMILY` | auto-detected from path | `qwen3`, `glm4`, or `generic` |
| `PROXY_MODEL_ID` | `openai/{MODEL_PATH}` | Model name exposed through LiteLLM |
| `MLX_HOST` | `127.0.0.1` | MLX server bind address |
| `MLX_PORT` | `8080` | MLX server port |
| `PROXY_PORT` | `4000` | LiteLLM proxy port |
| `PROXY_STARTUP_WAIT_SECONDS` | `2.0` | Seconds to wait for LiteLLM startup |
KV cache
| Variable | Default | Description |
|---|---|---|
| `MAX_KV_SIZE` | `196608` | Maximum KV cache size in tokens |
| `KV_GROUP_SIZE` | `64` | KV cache group size (uniform scheme) |
| `KV_BITS` | `OFF` | KV quantisation bits (`4`, `8`, or `OFF`) |
| `KV_QUANT_SCHEME` | `uniform` | `uniform` or `turboquant` |
| `QUANTIZED_KV_START` | `5000` | Token position where KV quantisation kicks in |
Prompt cache
| Variable | Default | Description |
|---|---|---|
| `PROMPT_CACHE_MAX_ENTRIES_GLOBAL` | `24` | Max entries in the global block-hash cache |
| `PROMPT_CACHE_MAX_ENTRIES_PER_SESSION` | `2` | Max cache keys tracked per session |
| `PROMPT_CACHE_TTL_SECONDS` | `1800` | Time-to-live for cache entries (0 = infinite) |
| `PROMPT_CACHE_SESSION_MAX_IDLE_SECONDS` | `1800` | Session idle timeout for turn store (0 = infinite) |
| `PROMPT_CACHE_BLOCK_SIZE` | `16` | Token block size for hash chains |
| `CACHE_USE_BLOCK_INDEX` | `true` | Enable block-hash index for O(blocks) lookup |
| `CACHE_CANONICALIZE_TOOL_CONTEXT` | `true` | Enable dual-pipeline structured normalisation |
| `CACHE_VLM_IMAGE_IDENTITY` | `true` | Inject per-image markers into canonical cache key (G1) |
| `CACHE_SESSION_PARTITIONING` | `true` | Session partitioning flag (read but not applied — lookup is always global) |
| `CACHE_NORM_SAFETY_CHECK` | `false` | Enable the 10%-delta cache-key safety assertion |
| `NORMALIZE_WRITE_TOOL_CONTENT_FOR_PROMPT` | `false` | Replace write tool's content arg with a digest placeholder at render time |
Generation
| Variable | Default | Description |
|---|---|---|
| `DEFAULT_TEMPERATURE` | `0.1` | Sampling temperature |
| `DEFAULT_TOP_P` | `0.9` | Top-p (nucleus) sampling |
| `DEFAULT_TOP_K` | `40` | Top-k sampling |
| `DEFAULT_MIN_P` | `0.05` | Minimum probability threshold |
| `DEFAULT_REPETITION_PENALTY` | `1.08` | Repetition penalty |
| `DEFAULT_REPETITION_CONTEXT_SIZE` | `256` | Context window for repetition penalty |
| `DEFAULT_MAX_TOKENS` | `2048` | Default max output tokens |
| `DEFAULT_THINKING` | `true` | Enable reasoning output by default |
| `PRESERVE_THINKING` | `true` | Forward `preserve_thinking=True` to `apply_chat_template` when the tokeniser supports it |
| `GENERATION_WATCHDOG_SECONDS` | `600` | Wall-clock deadline per request (0 = disabled) |
Logging and debug
| Variable | Default | Description |
|---|---|---|
| `ENABLE_REQUEST_LOGGING` | `true` | Write per-request session logs to `logs/` |
| `LOG_ROOT` | `logs` | Log output directory |
| `VLM_CACHE_DEBUG` | `false` | Log first 64 prompt token IDs and cache-diagnostic lines for VLM debugging |
Project structure
| File | Purpose |
|---|---|
| `start-llm.py` | Entire server (~4900 LOC): API handler, dual pipeline, prompt cache, session layers, streaming splitter, watchdog |
| `install_and_run.py` | Bootstrap: verifies Python 3.12, detects venv contamination, installs deps, execs server |
| `.env` / `.env.example` | Runtime config |
| `requirements.txt` | Dependencies: mlx, mlx-lm, mlx-vlm, litellm[proxy], python-dotenv |
| `scripts/pi_like_probe.py` | Primary cache-hit / divergence validation harness — PI-shaped agentic probe |
| `scripts/probe_session.py` | 4-scenario cache validation (normal, drift, semantic, insert) |
| `scripts/diff_turns.py` | Diff consecutive turns from cache session logs |
| `scripts/inspect_mlx_cache.py` | Inspect MLX KV cache state |
| `scripts/test_openclaw_integration.py` | Automated OpenClaw integration test harness |
| `.context/GOALS.md` | Strategic direction, active G-items (G1–G9) |
| `.context/TASKS.md` | Active portions and open items |
| `.context/LESSONS.md` | Hard-won knowledge — read before touching cache logic |
| `.claude/rules/server-core.md` | Auto-loaded rules when editing `start-llm.py` |
| `CLAUDE.md` | Project conventions and change protocol |
Supported models
Any MLX-compatible text LM and any mlx-vlm 0.4.x-supported VLM. Tested and actively tuned for:
- Qwen3 / Qwen3.6 (text and hybrid VLM) — primary `model_family`
- GLM-4 / GLM-4.7-Flash — primary `model_family` with orphan-close `</think>` handling
- Generic LMs via mlx-lm's `load()` path
Hybrid VLMs (Qwen3.6's `Qwen3_5MoeForConditionalGeneration`, mixing `ArraysCache` and `KVCache` layers) are handled with explicit offset scanning — layer 0 may lack `.offset`, so `_kv_cache_offset` walks layers to find the first attention cache.
Verification
No formal unit-test suite; verification is empirical via real agentic traffic.
- `python scripts/pi_like_probe.py` — PI-shaped agentic probe, primary harness.
- `python scripts/probe_session.py` — 4-scenario cache validation.
- `scripts/diff_turns.py` — diff session prompts turn-by-turn to pinpoint cache-key divergence.
- `logs/<date>/cache-session-*.log` — per-session cache trace; the first place to look when a fix may have regressed hit rate.
Healthy numbers: text T3 ≥ 97% hit, image T2 ≥ 94% hit, zero bumps on `vlm_retreat` and `exact_key_rejected_by_model_lcp`, `hybrid_trim_miss` small and only on cross-session cold transitions.
Troubleshooting
- Cache misses: inspect `logs/` for `stable_prefix_msg_count`. If 0, an early message is drifting — check `cache_key_delta_chars` to verify canonicalisation fired. If `exact_key_rejected_by_model_lcp` is bumping, the canonical pipeline is over-normalising.
- Server hangs: the watchdog should catch these after `GENERATION_WATCHDOG_SECONDS`. If `model_lock` is stuck, lower the timeout or investigate the Metal state via Activity Monitor.
- OOM: lower `PROMPT_CACHE_MAX_ENTRIES_GLOBAL` or enable `KV_BITS=4`. At 20k tokens each cache entry is ~2–3 GB of GPU memory.
- Reasoning not rendering in a client: verify the client reads `delta.reasoning_content` (OpenAI Qwen/DeepSeek convention). Clients that only read `delta.content` see the answer without the thinking block — this is expected.
- VLM Metal crash: the server disables PyTorch MPS at startup. If crashes persist, install mlx-vlm without the torch extra: `pip uninstall torch torchvision; pip install mlx-vlm`.
- Stale LiteLLM: auto-detected and killed on the proxy port at startup.
- Venv contamination (e.g. `python3.14` binaries inside `.venv/`): `install_and_run.py` refuses to proceed and points at the artefacts. Delete the venv and re-run, or scrub the bin entries and restore `pip -> pip3.12` symlinks.
Requirements
- Python 3.12 strict — enforced by the `install_and_run.py` preflight.
- macOS with Apple Silicon (M1/M2/M3/M4).
- Dependencies: `mlx`, `mlx-lm`, `mlx-vlm` (optional — only loaded for VLM models), `litellm[proxy]`, `python-dotenv`.