luna-agent
Health Pass
- License — License: MIT
- Description — Repository has a description
- Active repo — Last push 0 days ago
- Community trust — 13 GitHub stars
Code Pass
- Code scan — Scanned 12 files during light audit, no dangerous patterns found
Permissions Pass
- Permissions — No dangerous permissions requested
This project is a custom, lightweight AI agent built in Python. It provides persistent memory via SQLite, integrates with MCP tool servers, and can interact through a Discord bot or a local command-line interface, running entirely on local hardware.
Security Assessment
Overall Risk: Medium. The agent is designed with a security-conscious architecture (such as routing all LLM traffic through a single client to allow for an AI firewall). However, its native tools explicitly include bash and file execution, which means it can run shell commands and access the local filesystem. While the automated code scan (12 files) found no hardcoded secrets or dangerous patterns, the fundamental capability to execute code natively requires caution. Users should carefully restrict the agent's permissions in any deployment environment.
Quality Assessment
The project appears healthy and actively maintained, with repository activity as recent as today. It uses the permissive MIT license, which is excellent for open-source adoption. Community trust is currently low but growing, represented by 13 GitHub stars. The codebase is intentionally minimal at roughly 2,300 lines of Python and contains zero external frameworks, which makes it significantly easier for a developer to manually read and audit compared to larger alternatives.
Verdict
Use with caution: the lightweight, framework-free design is highly auditable, but you should sandbox the environment to mitigate the inherent risks of its native bash and file execution capabilities.
Custom minimal AI agent with persistent memory, MCP tools, and Discord
Luna Agent
A custom minimal AI agent with persistent memory, MCP tool integration, Discord/CLI interface, and structured observability.
Runs entirely on local hardware — no cloud API costs.
~2300 lines of Python. No frameworks.
Why Custom
We evaluated existing agent frameworks and rejected them all:
- OpenClaw: 400K lines of code, 42K exposed instances on Shodan. Too large to audit, too large to trust.
- ZeroClaw: 9 days old at time of evaluation. Too immature.
- NanoClaw: Too thin — would need to rebuild most of it anyway.
The core needs (memory, tools, chat, logging) are individually well-solved problems. No 400K-line framework needed.
Architecture
Discord (discord.py) CLI REPL (no token)
| |
v v
+---------------------------------+
| Luna Agent Core |
| |
| agent.py | agent loop: msg → memory → prompt → LLM → tools → respond
| ├── llm.py | single LLM client, configurable endpoint
| ├── memory.py | SQLite + FTS5 + sqlite-vec hybrid search
| ├── tools.py | native tools: bash, files, web, delegate, code_task
| ├── tool_output.py | smart output pipeline for large results
| ├── mcp_manager.py | MCP client for community tool servers
| └── observe.py | structured JSON logging
| |
+---------------------------------+
|
v
llama-server Qwen3.5-35B-A3B on 2x RTX 3090
All LLM traffic flows through a single LLMClient with a configurable endpoint URL. Today it points at localhost:8001 (llama-server). To insert an AI firewall later, change the URL to localhost:9000 — zero code changes required.
Thinking model support: Luna handles reasoning models (Qwen3.5, etc.) automatically — extracting reasoning_content, falling back to cleaned reasoning when content is empty, and stripping leaked markup (<thinking>, <tool_call>, etc.) from output.
Hardware
- Intel i7-13700K, 64GB DDR4
- 2x NVIDIA RTX 3090 (24GB each, 48GB total)
- Qwen3.5-35B-A3B Q8_0 via llama-server with layer split across both GPUs
- 131K context window, Q8_0 KV cache
Quick Start
cd ~/luna-agent
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run without Discord (interactive CLI REPL)
python -m luna
# Run with Discord
DISCORD_TOKEN=your-token-here python -m luna
Project Structure
luna-agent/
├── config.toml # All configuration
├── mcp_servers.json # MCP server registry
├── pyproject.toml # Dependencies
├── luna/
│ ├── __main__.py # Entry point (python -m luna)
│ ├── agent.py # Core agent loop
│ ├── llm.py # LLM client (OpenAI-compatible)
│ ├── memory.py # Memory (SQLite + FTS5 + sqlite-vec)
│ ├── tools.py # Native tools (bash, files, web)
│ ├── tool_output.py # Large output persistence + filtering
│ ├── mcp_manager.py # MCP tool client
│ ├── discord_bot.py # Discord interface
│ ├── observe.py # Structured JSON logging
│ └── config.py # Config loader
├── tests/
│ ├── test_agent.py # Agent loop tests
│ ├── test_llm.py # LLM client tests
│ ├── test_memory.py # Memory system tests
│ ├── test_tools.py # Native tool tests
│ └── test_tool_output.py # Output pipeline tests
├── luna-agent.service # systemd unit for the agent
├── worker-agent.service # systemd unit for llama-server (Qwen3.5-35B-A3B)
└── data/ # Created at runtime
├── memory.db # SQLite database
├── logs/ # JSON log files
│ └── luna-YYYY-MM-DD.jsonl
└── tool_outputs/ # Persisted large tool outputs
Configuration
All settings live in config.toml. Environment variables override for secrets:
| Env Var | Overrides | Required |
|---|---|---|
DISCORD_TOKEN |
Discord bot token | Yes (for Discord) |
LLM_ENDPOINT |
[llm] endpoint |
No |
LLM_MODEL |
[llm] model |
No |
MEMORY_DB_PATH |
[memory] db_path |
No |
LOG_DIR |
[observe] log_dir |
No |
See config.toml for all available settings and their defaults.
Components
Agent (agent.py)
The orchestrator. Receives a message and session ID, then:
- Saves the user message to memory
- Searches for relevant memories (hybrid FTS + vector)
- Retrieves the session summary (if any)
- Builds a system prompt with memories, summary, and current time
- Loads the last 20 messages for context
- Calls the LLM with all available tools (native + MCP)
- Enters a tool call loop (max 25 rounds):
- Executes each tool call (native or MCP)
- Feeds results back to the LLM
- Repeats until the LLM responds without tool calls
- Saves the assistant response
- Triggers conversation summarization if enough messages have accumulated
LLM Client (llm.py)
Thin async wrapper around the OpenAI-compatible API. Single chat() method that handles tool calls, thinking model output, and per-call temperature overrides. This is the only code that talks to the LLM — the AI firewall insertion point.
Returns structured LLMResponse objects with content, reasoning, tool calls, and token usage.
Memory (memory.py)
SQLite-based persistent memory with three search strategies combined via Reciprocal Rank Fusion:
- FTS5 keyword search — fast exact/stemmed term matching (Porter stemmer + Unicode61 tokenizer)
- sqlite-vec cosine similarity — semantic search via nomic-embed-text-v1.5 embeddings
- Recency + importance weighting — recent and important memories rank higher
Scoring formula:
final_score = rrf_score + (recency_weight × 2^(-age_days / 7)) + (importance / 10 × 0.1)
Database tables:
| Table | Purpose |
|---|---|
messages |
Every message persisted per session |
memories |
Extracted facts with embeddings and importance scores |
summaries |
LLM-generated compression of old message blocks |
memories_fts |
FTS5 virtual table for keyword search |
memories_vec |
sqlite-vec virtual table for vector search |
Conversation compression: Every N messages (default 50), the LLM summarizes the conversation and extracts facts with importance scores (1-10). Facts above the threshold (default 3.0) are stored as memories. This enables effectively infinite conversations — the agent always has a summary of what came before plus searchable memory of key facts.
All retrieval parameters (top_k, RRF k, recency weight, importance threshold, etc.) are in config.toml for experimentation.
Native Tools (tools.py)
Built-in tools that don't require external MCP servers:
| Tool | Description |
|---|---|
bash |
Execute shell commands with safety guardrails |
read_file |
Read files with optional offset/limit for large files |
write_file |
Write or append to files, creates parent directories |
list_directory |
List files/directories, optional recursion with depth limits |
web_fetch |
Fetch a URL and convert HTML to markdown via html2text |
web_search |
Search the web via DuckDuckGo, returns structured results |
delegate |
Hand off a self-contained subtask to a sub-agent with its own tool loop |
code_task |
Delegate a coding task to a sub-agent with a write-run-fix loop |
summarize_paper |
Fetch and summarize an arXiv paper |
list_available_tools |
Discover MCP tools available from connected servers |
use_tool |
Call a specific MCP tool by name |
Bash safety: Commands are checked against blocked patterns before execution:
rm -rf /,mkfs,dd if=,shutdown,reboot, fork bombs, writes to/dev/sda- Timeout enforcement: default 30s, max 120s
- Output capped at 50KB
Tool Output Pipeline (tool_output.py)
Handles large tool outputs so they don't overwhelm the LLM context:
- Small outputs (< 10KB) — passed through directly
- Large outputs — processed through a pipeline:
- Persist — full output saved to
data/tool_outputs/with a deterministic filename (content hash + source label) - Python filter — keyword matching against the user's query context, with structural detection (headers, code blocks). Includes 1 line of surrounding context per match.
- LLM extraction — if the Python filter finds fewer than 5 keyword matches, the LLM extracts relevant parts from the raw output
- File reference — a footer with the persisted file path is appended so the agent can inspect the full output later
- Persist — full output saved to
MCP Manager (mcp_manager.py)
Connects to community MCP servers via stdio transport. On startup it spawns configured servers, discovers their tools, and converts schemas to OpenAI function-calling format. Tool calls from the LLM are routed to the correct server automatically.
Tool namespacing: Tools are prefixed with the server name (browser__navigate, filesystem__read_file) to avoid collisions between servers.
Configure servers in mcp_servers.json:
{
"servers": {
"browser": {
"command": "npx",
"args": ["-y", "@playwright/mcp"],
"transport": "stdio"
}
}
}
Adding a new tool is editing JSON — no code changes.
Discord Bot (discord_bot.py)
Responds to DMs, @mentions, and replies in threads it created. Shows a typing indicator while the agent is processing.
Session isolation: Session IDs are derived from message context to keep memory separate:
| Context | Session ID |
|---|---|
| Thread | thread-{thread_id} |
| DM | dm-{user_id} |
| Channel | ch-{channel_id}-{user_id} |
Long responses are split at newlines (preferred), spaces, or hard-split at 2000 characters to stay within Discord's limit.
Observability (observe.py)
Every LLM call, tool execution, memory operation, and Discord message is logged as structured JSON.
Dual output:
- File —
data/logs/luna-YYYY-MM-DD.jsonl, one file per day, machine-parseable - Console — human-readable format for development
What's logged:
| Component | Events |
|---|---|
| LLM | llm_call, llm_response (tokens, latency, tools used) |
| Memory | memory_search (hits, method breakdown), memory_stored, summary_stored |
| Tools | tool_executing, native_tool_call, tool_call (server, tool, duration, errors) |
| Discord | discord_ready, discord_message (session, author, channel) |
| MCP | server_connected, tools_refreshed, mcp_shutdown |
| Agent | agent_process (latency), agent_response (memory hits, tool rounds) |
| Output | output_persisted, llm_extraction_triggered |
Inspection:
# Watch logs in real-time
tail -f data/logs/luna-*.jsonl
# Search with jq
jq 'select(.event == "llm_response")' data/logs/luna-*.jsonl
jq 'select(.latency_ms > 5000)' data/logs/luna-*.jsonl
Config (config.py)
Dataclass-based configuration loaded from config.toml with environment variable overrides. Relative paths are resolved against the project root. All fields have sensible defaults — the agent starts with zero configuration if a config.toml is present.
Deployment
Copy the systemd service files and enable them:
sudo cp luna-agent.service /etc/systemd/system/
sudo cp worker-agent.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now worker-agent # Start LLM server (Qwen3.5-35B-A3B) first
sudo systemctl enable --now luna-agent # Then the agent (depends on worker-agent)
Monitor:
journalctl -u luna-agent -f
journalctl -u worker-agent -f
CLI mode (no Discord token): The agent starts an interactive REPL where tool calls print inline as they execute, then the final response prints below. Useful for testing without Discord.
Dependencies
8 runtime packages, no heavy frameworks:
| Package | Purpose |
|---|---|
discord.py |
Discord API client |
openai |
OpenAI-compatible HTTP client |
mcp[cli] |
Model Context Protocol SDK |
sentence-transformers |
Embedding model runtime |
einops |
Tensor operations for embeddings |
sqlite-vec |
Vector search in SQLite |
html2text |
HTML to markdown conversion |
duckduckgo-search |
Web search |
Dev: pytest, pytest-asyncio
Python: >= 3.11
What This Doesn't Include (By Design)
- No AI firewall (future — just don't block the insertion point)
- No web dashboard (future phase of observability)
- No multi-user auth (single user)
- No cloud LLM fallback (local only)
- No containers for the agent (systemd is simpler)
Contributing
See CONTRIBUTING.md for development setup, testing, and pull request guidelines.
License
MIT — Fabio Nonato, 2026
Reviews (0)
Sign in to leave a review.
Leave a reviewNo results found