1bit-systems

mcp
Security Audit: Warn
Health: Warn
  • License — MIT
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Low visibility — Only 5 GitHub stars
Code: Pass
  • Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions: Pass
  • Permissions — No dangerous permissions requested
Purpose
This tool is a local, ternary-weight LLM inference stack built in C++ and Rust, specifically optimized for AMD Strix Halo hardware. It provides an OpenAI-compatible API server for local AI execution without relying on Python, Docker, or external cloud services.

Security Assessment
The automated code scan checked 12 files and found no dangerous patterns, hardcoded secrets, or risky permission requests. The project explicitly states it includes "no telemetry, ever." It runs locally as an HTTP server on port 8180, so it handles local network traffic as part of normal operation, but it does not phone home to external cloud servers. It does not appear to access unrelated sensitive data or execute hidden shell commands beyond its standard inference operations. Overall risk: Low.

Quality Assessment
The project is highly active, with its most recent code push happening today. It uses the permissive MIT license and features standard open-source best practices like a detailed README, automated CI testing, and an Arch Linux AUR package. However, it currently has very low community visibility with only 5 GitHub stars. This means it has not been widely peer-reviewed or battle-tested by a large audience, so undiscovered bugs or edge-case vulnerabilities are possible.

Verdict
Use with caution — the code is clean, licensed, and actively maintained, but its extremely low community adoption means it lacks the extensive peer review typically expected for production environments.
SUMMARY

Local, ternary-weight LLM inference on AMD Strix Halo. Rust above the kernels, HIP below, zero Python at runtime. https://discord.gg/EhQgmNePg

README.md

1bit.systems

ternary inference for the rest of us

Latest Release · CI · GitHub downloads · GitHub issues · PRs welcome · License: MIT · Star History · AUR



Install · Docs · Site · Discord


"Whoa."

You bought a Strix Halo because the spec sheet read like science fiction
— 128 GB of unified LPDDR5x, Radeon 8060S, an XDNA2 NPU welded onto the
die. Then you booted Linux and discovered the cloud-AI ecosystem still
thinks "local" means a 4090 and a 1500W PSU. We built this for the
other crowd. The mini-PC-on-the-desk crowd. The closet-server crowd.
The "I want a chat endpoint that doesn't phone home" crowd.
1bit.systems is a full ternary inference stack tuned for one machine —
gfx1151 plus its NPU — C++23 from kernels to desktop. No Python at
runtime. No Docker on the serving path. No telemetry, ever.

"There is no spoon."

comes in three flavors

  • lemond — the canonical local AI server. C++ HTTP front door.
    OpenAI / Ollama / Anthropic API surfaces on port :8180. Dispatches
    per-recipe to wrapped backends including the in-process rocm-cpp
    Engine. Forked from lemonade-sdk/lemonade and patched in-house —
    every wedge stays here.
  • 1bit-services — the apps tower above lemond, in cpp/.
    Operator CLI, Qt6 + FTXUI helms, landing page, voice loop, MCP
    bridge, power profile control, retrieval pipeline, watchdog, NPU
    dispatch. All C++23. All bare metal.
  • halo-arcade — a vanilla-JS canvas-game cabinet that ships in
    the same release. Because every good rig deserves a coin slot.

out of the box

Three models auto-load on boot via lemond-bootstrap.service.
Reachable on :8180/api/v1/* the moment the unit is green.

auto-loaded             role             backend
Bonsai-1.7B-gguf:Q1_0   chat             llama.cpp Vulkan
nomic-embed-text-v1.5   embeddings       llama.cpp Vulkan
halo-1bit-2b            chat (ternary)   rocm-cpp ternary

Eight .h1b ternary models ship with the release. All of them are
reachable via the unified /api/v1/* surface — pull the rest with
1bit pull <name> when you want them resident. max_loaded_models=12
so you can mix.
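
If you want to confirm what is actually resident, the sketch below lists models through the OpenAI-compatible surface. It assumes the standard /v1/models listing is exposed on :8180; the README promises OpenAI compatibility, so treat the exact route as an assumption rather than a documented contract.

# list available models via the OpenAI-compatible surface
# (assumes the standard /v1/models route; adjust if your build differs)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8180/v1",
    api_key="not-used-but-required",
)

for model in client.models.list():
    print(model.id)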

The companion UIs come up alongside lemond:

  • GAIA UI at http://localhost:8000/gaia/. Full FastAPI shim
    rewritten in C++. 11/11 ctest, 36+ endpoints. Chat, file picker,
    agent panel.
  • Lemonade UI at http://localhost:8000/app/. The classic lemonade
    console — model list, recipe inspector, live /metrics.

built by

A two-person crew on a Strix Halo box, plus the kindness of strangers
who write good open-source kernels. We use AMD hardware. We are not
affiliated with AMD. Anything that looks like a partnership is just us
reading their docs at 2am.

getting started

# build from source
cmake --preset release-strix
cmake --build --preset release-strix
ctest --preset release-strix

# or install the packaged release and grab a model
curl -fsSL https://1bit.systems/install.sh | bash
1bit install core
1bit pull bonsai-1.7b
1bit run bonsai-1.7b

Point any OpenAI-compatible client at http://localhost:8180/v1. Open
WebUI, Claude Code, Continue — they all just work.

1bit tunnel start mints a Headscale preauthkey and prints a QR. Scan
from the official Tailscale app, point its login URL at
https://headscale.strixhalo.local, and your phone is on the closet
box's mesh. No app store, no middleman.

"I know kung fu."

adding a service

Third parties extend the package manager without forking. Drop a
packages.local.toml next to packages.toml:

[my-sidecar]
binary = "/usr/local/bin/my-sidecar"
unit   = "my-sidecar.service"
port   = 9100

Then 1bit registry add ./packages.local.toml && 1bit install my-sidecar. The overlay survives 1bit update.

apps + integrations

Native first, then everything else.

native (we ship it)   description
lemond                C++ HTTP server. Forked from lemonade-sdk. OpenAI / Ollama / Anthropic surfaces.
1bit-helm             Qt6 desktop client. Plasma SNI tray icon. Start / stop / status.
1bit-landing          live /metrics probe + landing page on :8190.
1bit-voice            sentence-boundary streaming voice loop (LLM SSE → TTS chunks); a toy sketch follows these tables.
1bit-echo             browser WebSocket gateway over 1bit-voice.
1bit-mcp              stdio JSON-RPC MCP bridge for Claude Code and friends.
1bit-power            1bit power: RyzenAdj wrapper, profile control.
halo-arcade           vanilla JS canvas games. The good kind.

third-party (it just works)   how
Open WebUI                    point at http://localhost:8180/v1.
Claude Code                   Anthropic-compat surface.
Continue                      OpenAI-compat.
stable-diffusion.cpp          image gen on :8081, native HIP for SDXL.
whisper.cpp                   STT on :8082.
kokoro                        TTS on :8083.

supported platforms

platform          state
CachyOS           first-class. We dev here.
Arch Linux        AUR 1bit-systems-bin.
Fedora            AppImage path. ROCm 7.x on host.
Debian / Ubuntu   AppImage. .deb someday.
NixOS             flake.nix in tree. Untested by us.
Windows           use lemond upstream until we port.

CLI

# chat with a ternary model
1bit run bonsai-1.7b

# list everything we know how to pull
1bit list

# get models
1bit pull halo-1bit-2b

# launch a connected app from the catalog
1bit launch claude

# stack health
1bit status
1bit doctor
1bit logs lemond
# multi-modality, dispatched by lemond's recipe registry
1bit run kokoro-v1            # TTS
1bit run whisper-large-v3     # STT
1bit run sdxl-turbo           # image gen

hardware

The shipping target is a single SKU: AMD Strix Halo, Ryzen AI MAX+
Pro 395, Radeon 8060S iGPU (gfx1151), XDNA2 NPU, 128 GB LPDDR5x.

That is the closet machine. Everything in this repo is tuned around
its bandwidth, its kernels, its NPU control packets, its thermal
envelope.

The fat-binary build covers eight Wave32-WMMA AMD arches in one ship:
gfx1151 plus the rest of RDNA3 / RDNA3.5 / RDNA4. RX 9070 XT
(gfx1201) on a Ryzen host is the sibling target; same kernels,
more bandwidth.

NPU path: we author AIE2P kernels in C++ via Xilinx/llvm-aie
(Peano), dispatch through libxrt, and use IRON / MLIR-AIE at compile
time. AMD's VitisAI EP is the primary lane when it lands on Linux
STX-H; until then, the custom-kernel lane carries the load.

"Where we're going, we don't need racks."

honest numbers

C++ wins the lane. Doesn't matter which C++ — rocm-cpp, llama.cpp
Vulkan, ggml-hip — the family beats every other family on this box.
What changes inside the family is who wins which model.

Strix Halo · gfx1151 · 256-tok decode · median-of-3 · max_loaded_models=13 · 2026-04-25 (post Run 5)
model                          backend            tok/s
smollm2-135m                   llama.cpp Vulkan     530
gemma-3-270m-it                llama.cpp Vulkan     443
Bonsai-1.7B-gguf:Q1_0          llama.cpp Vulkan     330
Llama-3.2-1B-Instruct-GGUF     llama.cpp Vulkan     199
deepseek-r1-distill-qwen-1.5b  llama.cpp Vulkan     168
halo-1bit-2b-sherry-cpp        rocm-cpp ternary      73
Qwen3-4B-GGUF                  llama.cpp Vulkan      73
halo-1bit-2b-sherry-v3         rocm-cpp ternary      73
halo-1bit-2b-sherry-v4         rocm-cpp ternary      73
Phi-4-mini-instruct-GGUF       llama.cpp Vulkan      71
halo-1bit-2b                   rocm-cpp ternary      65
halo-1bit-2b-tq1               rocm-cpp ternary      60
bonsai-1.7b-tq2-h1b            rocm-cpp ternary      59
halo-bitnet-2b-tq2             rocm-cpp ternary      56

Caveat. Numbers above are essay-style 256-tok output,
median-of-3 — the regime you actually feel when you talk to the
thing. Cache-friendly prompts run higher.

Reading the table. Vulkan llama.cpp dominates raw throughput on
small or quantized GGUFs — that's where it is supposed to win and we
are not pretending otherwise. rocm-cpp ternary ties Vulkan in the
same GB band: sherry-cpp 73 tok/s vs Qwen3-4B 73 tok/s, but ours holds
2 B params at 1.65 GB vs Qwen3's 4 B at 2.38 GB. Same throughput,
smaller footprint.

Run 5 (Sherry retrain) headline: 104.7 tok/s sustained on
halo-1bit-2b-sherry-cpp cache-friendly prompts, final loss 4.8870,
PPL 9.18-ish on wikitext-103 1024 tok.

Memory side: ternary GEMV pulls 92% of LPDDR5x peak, and the split-KV
Flash-Decoding attention beats the naive path 6.78× at L=2048. The NPU
i8 matmul at 512×512 lands at 0.93 ms, bit-exact. The NPU ternary kernel
is the next ship-gate: toolchain proven, ternary bitnet_gemm not written yet.
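
A back-of-envelope check on why decode is bandwidth-bound, assuming roughly 256 GB/s theoretical LPDDR5x peak on this SKU (an assumed spec, not a number from this repo) and the 1.65 GB sherry weight file streamed once per token:

# rough roofline for memory-bound ternary decode (assumptions noted above)
peak_bw_gbs = 256.0                    # assumed theoretical LPDDR5x peak for Strix Halo
achieved_bw_gbs = 0.92 * peak_bw_gbs   # 92% of peak, per the GEMV number above
weights_gb = 1.65                      # halo-1bit-2b-sherry footprint from the table above

ms_per_token = weights_gb / achieved_bw_gbs * 1000
print(f"{ms_per_token:.1f} ms/token -> ~{1000 / ms_per_token:.0f} tok/s ceiling "
      "before attention, KV traffic, and everything else")

That works out to roughly 7 ms per token, a ceiling around 143 tok/s; the measured 73 tok/s essay and 104.7 tok/s cache-friendly rates both sit under it, which is what you expect once attention and KV traffic are added on top.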

Reproducible from benchmarks/ against checked-in recipes. Anything not
in this README lives in the Benchmarks wiki with raw JSON and methodology.

status, honestly

lane                                                     state
LLM · TTS · STT · image                                  shipping on :8180 / :8095 / :8190 / :1234
Sherry retrain (Run 5)                                   landed; halo-1bit-2b-sherry-cpp shipping at 73 tok/s essay / 104.7 tok/s cache-friendly
NPU toolchain (IRON + MLIR-AIE + Peano + libxrt, npu5)   axpy 160/160 green on Arch
NPU serve path (BitNet-1.58 end-to-end)                  kernel authoring in flight
Reddit / public launch                                   ship-gated until the NPU lane goes live
Wan 2.2 video lane                                       upstream-blocked on sd.cpp 5D ggml

If you came here from a Reddit post — there isn't one yet. We are not
announcing until the NPU demo trips the gate.

the rules of the house

  • Rule A. No Python at runtime. Scripts on a dev box are fine. A
    systemd unit serving HTTP is not.
  • Rule B. C++23 default. HIP kernels stay C++20 in rocm-cpp/.
  • Rule C. hipBLAS is banned in the runtime path. Native Tensile
    kernels only.
  • Rule E. NPU stack is ORT C++ with the VitisAI EP as the primary
    lane; Peano + libxrt + aie-rt is the custom-kernel lane.
  • Rule F. ISO C++ Core Guidelines — I.27 pImpl, F.55 exhaustive
    std::visit, std::expected on every fallible path,
    [[nodiscard]] on factories.

The full long-form lives in CLAUDE.md and
CONTRIBUTING.md. They are short on purpose.

connect a client

The server speaks OpenAI-compat. Anything that takes a base_url
works.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8180/v1",
    api_key="not-used-but-required",
)

resp = client.chat.completions.create(
    model="bonsai-1.7b",
    messages=[{"role": "user", "content": "hello, ternary world"}],
)
print(resp.choices[0].message.content)

Pick your language on the Clients wiki.
C++, Go, Node, Ruby, PHP, Java, C#. They all dial the same port.

standing on shoulders

We forked, patched, and bundled work from a lot of people. They didn't
ask for our patches and we don't push them upstream — our improvements
stay in our forks, theirs flow into ours. Asymmetric, friendly, no
relationship overhead.

read more

license + footer

Most of the source is MIT (see LICENSE). Sherry-specific
source (3:4 N:M sparse ternary GEMV, 1.25 bpw packer, L1-ratio rescale,
phantom-sign balance) is PolyForm Noncommercial 1.0.0 (see
LICENSE-SHERRY.md and
SHERRY-FILES.txt). Commercial use of Sherry
requires a paid license — contact
[email protected].

The relicense is forward-only (effective 2026-04-26). Pre-cut history
remains MIT for that snapshot.

Model weights follow upstream licenses (Microsoft MIT for BitNet
b1.58-2B-4T, etc.).

We don't transfer anything off your box without you asking. When you
1bit pull, we go to Hugging Face. That's it. No analytics, no crash
reporters, no "anonymous usage statistics."


1bit.systems · @bong-water-water-bong

no LLMs were harmed making this. one almost was.

Reviews (0)
