model-maestro

skill
Security Audit: Warn

Health: Warn
  • No license — Repository has no license file
  • Description — Repository has a description
  • Active repo — Last push 0 days ago
  • Low visibility — Only 5 GitHub stars

Code: Pass
  • Code scan — Scanned 12 files during light audit; no dangerous patterns found

Permissions: Pass
  • Permissions — No dangerous permissions requested
Purpose
This tool acts as a unified gateway and proxy for multiple Large Language Model (LLM) providers, allowing developers to route, load-balance, and manage AI requests through a single API with an included admin dashboard.

Security Assessment
The light code scan found no hardcoded secrets, dangerous permissions, or malicious execution patterns. Because it is a proxy server by design, it naturally makes network requests to external LLM APIs. It requires a PostgreSQL database and Redis, meaning it handles sensitive infrastructure data and routes conversational AI prompts. Given the nature of an API gateway, you should ensure any deployment is properly secured behind authentication. Overall risk: Low.

Quality Assessment
The project is actively maintained, with its most recent push happening today. However, there are significant concerns regarding maturity. The repository lacks a license file (the README declares MIT, but no LICENSE file is present), so there are no formal terms for usage, modification, or distribution. Additionally, it has very low community visibility, with only 5 GitHub stars, indicating it has not yet been widely peer-reviewed or battle-tested by the open-source community.

Verdict
Use with caution: the code appears safe and actively updated, but the lack of a license and low community adoption means it should be evaluated for internal use only until it matures.
SUMMARY

Unified LLM Gateway that proxies multiple providers (Ollama, OpenAI-compatible) behind a single API. Enables IDEs and tools to access multiple models via standard API formats. Manage LLM usage with per-user token limits, request logging, load balancing, model groups, and an admin dashboard.

README.md

Model Maestro

Config-driven Unified LLM Gateway

Route, load-balance and manage Ollama, OpenAI and other LLM providers through a single authenticated API. Model Maestro gives you user-based access control, model mapping, token usage tracking, health-checked node pooling and a modern Next.js admin dashboard — all wired to PostgreSQL + Redis.

Quick Start · Features · Architecture · API · Admin Panel


Quick Start

Requires Docker & Docker Compose.

# 1. Clone
git clone <repository-url> && cd model-maestro

# 2. Configure
cp .env.example .env

# 3. Launch full stack (PostgreSQL + Redis + FastAPI + Next.js)
docker compose -f docker-compose.dev.yml up --build -d

# 4. Seed the database
docker exec maestro python -m app.seeder

# 5. Open the admin panel at http://localhost:3000
Service           URL                              Notes
API               http://localhost:8000            FastAPI gateway
Admin Dashboard   http://localhost:3000            Next.js admin panel
API Docs          http://localhost:8000/api/docs   Basic-auth protected
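
The API docs are served behind HTTP basic auth using the DOCS_USERNAME / DOCS_PASSWORD values from .env; for example, with the defaults:

curl -u admin:admin http://localhost:8000/api/docs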

For a more detailed setup guide, see docs/SETUP.md.


Features

  • JWT Authentication — Bearer-token auth on every LLM request.
  • Admin Dashboard — Next.js 16 panel for visual management of users, nodes, models, groups and audit logs.
  • Model Mapping — Translate display names (gpt-oss:120b) to real names (gpt-oss:120b-cloud) via PostgreSQL with JSON-file caching.
  • Node-Scoped Model Mappings — Bind a mapping to a specific node so the same display name can resolve to different real names on different backends.
  • Node-Scoped Routing via Model Prefix — Force a request to a specific node by prefixing the model name: node:trmix:kimi-k2.6:latest routes directly to the node with code trmix.
  • Multi-Node Load Balancing — Round-robin, weighted and priority-based strategies across Ollama and vLLM nodes.
  • vLLM Support — Native vLLM (OpenAI-compatible) node type with automatic health checks, model discovery and Authorization: Bearer header forwarding.
  • Model Groups — Group models into logical units with fallback chains. Requests dynamically resolve to the best member based on capability tags (vision, tools) and strategy.
  • Node Health Management — Automatic health checks, model discovery and availability tracking for both Ollama and vLLM nodes.
  • Per-Node Warmup Toggle — Enable or disable model warmup per node via admin UI.
  • Drag-and-Drop Node Priority — Reorder node cards in the admin panel to update fallback priority visually.
  • User-Level Access Control — Per-user model allowlists and rate limits (requests / tokens per day).
  • Token Usage Tracking — Background-batched activity logs with prompt / completion / total token breakdowns, plus request source identification (Cursor, Claude, OpenClaw, Grafana, etc.).
  • Tool Set Filtering — Restrict which tools a model is allowed to invoke via configurable tool sets.
  • Context Length Config — Per-model context length stored in mappings (used by Cursor/Antigravity for usage bars).
  • Streaming — SSE-based streaming on /api/chat, /api/generate and /v1/chat/completions.
  • OpenAI Compatible — Drop-in /v1/chat/completions, /v1/completions, /v1/embeddings and /v1/models endpoints.
  • Full Ollama API — /api/generate, /api/chat, /api/embeddings, /api/tags, /api/show, /api/copy, /api/delete, /api/pull, /api/push, /api/create.
  • Grafana Assistant API — Full Grafana LLM Assistant compatibility endpoints (/grafana/assistant/*) for Grafana-native AI features.
  • DeepSeek Tool Call Parsing — Auto-detects and converts DeepSeek's raw XML tool call output (<tool_calls><invoke>, <CallMcpTool>, <tool_call name="...">) to OpenAI tool_calls format in streaming and non-streaming responses. Kimi/Moonshot <|tool_calls_section_begin|> format also supported.
  • Streaming-Aware Background Tasks — Health checks, model discovery and warmup defer when streams are active, preventing interruptions.
  • Node-Aware Model Warmup — Warmup requests target only models that exist on each node, eliminating 404 errors from stale model names.
  • Background Tasks — Redis-backed async queue for activity logging, node health checks, model discovery, model warmup and load cleanup.
  • Audit Logs — Every admin action is timestamped and queryable.
  • PostgreSQL + Alembic — Schema migrations run automatically on container startup.
  • Redis Cache — Hot-path caching for mappings, config and user usage data.
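
As a quick sanity check of the OpenAI-compatible surface, you can list the models visible to a user token (assuming the stack from Quick Start is running and $TOKEN holds a valid user JWT):

# Returns the user's allowed models in OpenAI /v1/models format
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer $TOKEN"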

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Cursor     │     │  Antigravity │     │   Claude     │
│   IDE        │     │   IDE        │     │   Code       │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       └────────────────────┼────────────────────┘
                            │
                     ┌──────┴──────┐
                     │  Load       │
                     │  Balancer   │
                     └──────┬──────┘
                            │
       ┌────────────────────┼────────────────────┐
       │                    │                    │
┌──────┴──────┐    ┌────────┴────────┐    ┌──────┴──────┐
│  Ollama     │    │    Ollama       │    │   OpenAI    │
│  Node 1     │    │    Node 2       │    │   / Other   │
└─────────────┘    └─────────────────┘    └─────────────┘

Request Flow

Client Request
      │
      ▼
┌─────────────────┐
│  JWT Middleware │
└────────┬────────┘
         │
         ▼
┌─────────────────┐        ┌───────────────┐
│ Model Group?    │──No──▶ │ Model Mapper  │
│ (resolve member)│        │ (display→real)│
└────────┬────────┘        └───────┬───────┘
         │ Yes                     │
         ▼                         ▼
┌─────────────────┐        ┌───────────────┐
│ Load Balancer   │──────▶ │ Node Pool     │
│ (pick healthy)  │        │ (health check │
└────────┬────────┘        │  + retry)     │
         │                 └───────┬───────┘
         │                         │
         ▼                         ▼
┌─────────────────┐        ┌───────────────┐
│ Ollama Proxy    │◀────── │ Ollama /      │
│ (reverse map)   │        │ Provider API  │
└────────┬────────┘        └───────────────┘
         │
         ▼
    Client Response

For the full architecture documentation, see docs/ARCHITECTURE.md.


Tech Stack

Layer              Technology
API Gateway        Python 3.11, FastAPI, Uvicorn
Async HTTP         httpx (HTTP/2)
Auth               JWT (PyJWT)
Database           PostgreSQL 15 + asyncpg + SQLAlchemy async
Migrations         Alembic
Cache              Redis 7
Frontend           Next.js 16, React 19, Tailwind CSS v4, shadcn/ui
Background Tasks   Redis-backed async queue
Deployment         Docker, Docker Compose

Configuration

Copy .env.example to .env and set:

# Ollama
OLLAMA_BASE_URL=http://host.docker.internal:11434

# App
JWT_SECRET_KEY=change-this-to-a-strong-secret
LOG_LEVEL=INFO

# PostgreSQL
DATABASE_URL=postgresql+asyncpg://maestro_user:maestro_password@postgres:5432/maestro

# Redis
REDIS_URL=redis://redis:6379/0

# Admin Token (for /admin/* endpoints)
ADMIN_TOKEN=change-this-for-production

# Admin Panel Login
ADMIN_USERNAME=admin
ADMIN_PASSWORD=admin

# Swagger / ReDoc Basic Auth
DOCS_USERNAME=admin
DOCS_PASSWORD=admin
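
For production, replace every change-this and admin default with a strong random value. A minimal sketch, assuming openssl is available:

# Generate random secrets for JWT_SECRET_KEY and ADMIN_TOKEN
openssl rand -hex 32
openssl rand -hex 32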

Admin Panel

The Next.js dashboard (http://localhost:3000) provides a visual interface for every part of the gateway.

Page                What you can do
Dashboard           Node health, model counts, user statistics
Users               Create users, manage tokens, assign models, set limits
Nodes               Add/edit Ollama and vLLM nodes, set codes, view health, trigger discovery, drag-and-drop priority
Models per Node     Browse discovered models per node
Models > Mappings   Display↔Real name mappings, node-scoped overrides, context length, capabilities
Models > Groups     Create groups, add members, set strategy, reorder fallbacks
Models > Config     Per-model tool restrictions and settings
Tool Sets           Create tool groups and assign to models
Request Logs        Filterable request history with source identification (Cursor, Claude, OpenClaw, Grafana, etc.)
Settings            System-wide configuration
Audit Logs          Filterable history of all admin actions

Default login: username admin, password from ADMIN_PASSWORD in .env.


API Reference

For the complete API reference with all request/response examples, see docs/API.md.

Authentication

Every LLM request requires:

Authorization: Bearer <jwt-token>

Admin endpoints require:

Authorization: Bearer <admin-token>
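
The examples below assume both tokens are exported in your shell; the user JWT is issued via the admin panel or the user token endpoint, and the admin token is the ADMIN_TOKEN value from .env:

# Values shown are placeholders, not working tokens
export TOKEN="<jwt-token-for-a-user>"
export ADMIN_TOKEN="<value-of-ADMIN_TOKEN-from-.env>"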

LLM Endpoints

Method   Endpoint          Description
POST     /api/chat         Chat completions (Ollama format)
POST     /api/generate     Text generation
POST     /api/embeddings   Generate embeddings
GET      /api/tags         List available models
POST     /api/show         Show model info
POST     /api/copy         Copy model
DELETE   /api/delete       Delete model
POST     /api/pull         Pull model
POST     /api/push         Push model
POST     /api/create       Create model from Modelfile
POST     /v1/completions   OpenAI-compatible completions
POST     /v1/embeddings    OpenAI-compatible embeddings

Example — Chat

curl -X POST http://localhost:8000/api/chat \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:120b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

Example — Streaming Chat

curl -X POST http://localhost:8000/api/chat \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:120b",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
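
Tip: when watching a stream interactively, add curl's -N (--no-buffer) flag so chunks print as they arrive instead of being buffered.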

Admin Endpoints

Users

# Create user
curl -X POST http://localhost:8000/admin/users \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"username": "john"}'

# List users
curl http://localhost:8000/admin/users \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Refresh token
curl -X PUT http://localhost:8000/admin/users/john/token \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Model Assignment

# Assign specific models
curl -X POST http://localhost:8000/admin/users/john/models \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"models": ["gpt-oss:120b", "deepseek-v3.1:671b"]}'

# Grant access to all models
curl -X POST http://localhost:8000/admin/users/john/models/all \
  -H "Authorization: Bearer $ADMIN_TOKEN"

User Limits

# Set limits (null = unlimited)
curl -X POST http://localhost:8000/admin/users/john/limits \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"request_limit": 1000, "token_limit": 1000000}'

Model Mappings

# Create mapping with context length
curl -X POST http://localhost:8000/admin/model-mappings \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "display_name": "gpt-oss:120b",
    "real_name": "gpt-oss:120b-cloud",
    "context_length": 128000,
    "capabilities": ["completion", "tools"]
  }'

# List
curl http://localhost:8000/admin/model-mappings \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Delete
curl -X DELETE http://localhost:8000/admin/model-mappings/gpt-oss:120b \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Nodes

# Add node (with optional code for prefix routing)
curl -X POST http://localhost:8000/admin/nodes \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "main",
    "base_url": "http://localhost:11434",
    "priority": 100,
    "code": "trmix",
    "node_type": "ollama"
  }'

# Toggle activation
curl -X PATCH http://localhost:8000/admin/nodes/1/toggle \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Reorder node priorities (drag-and-drop)
curl -X PATCH http://localhost:8000/admin/nodes/batch/priority \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"priorities": [{"id": 1, "priority": 200}, {"id": 2, "priority": 100}]}'

Model Groups

# Create group
curl -X POST http://localhost:8000/admin/model-groups \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "coding", "strategy": "round_robin", "description": "Code models"}'

# Add member
curl -X POST http://localhost:8000/admin/model-groups/coding/members \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model_display_name": "qwen3-coder:480b", "priority": 1}'

Grafana Assistant

# List chats
curl http://localhost:8000/grafana/assistant/chats \
  -H "Authorization: Bearer $TOKEN"

# Create chat
curl -X POST http://localhost:8000/grafana/assistant/chats \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello"}'

# Stream chat
curl -X POST http://localhost:8000/grafana/assistant/chat/stream \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello"}'

# Get LLM config
curl http://localhost:8000/grafana/assistant/config \
  -H "Authorization: Bearer $TOKEN"

# Update LLM config
curl -X POST http://localhost:8000/grafana/assistant/config \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss:120b", "temperature": 0.7}'

# Check infrastructure discovery status
curl http://localhost:8000/grafana/assistant/discovery \
  -H "Authorization: Bearer $TOKEN"

OpenAI Compatible

Method   Endpoint               Description
POST     /v1/chat/completions   Chat completions (OpenAI format)
POST     /v1/completions        Text completions (OpenAI format)
POST     /v1/embeddings         Embeddings (OpenAI format)
GET      /v1/models             Model list (OpenAI format)

Example — OpenAI Compatible

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:120b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Model Mapping & Routing

Display Name → Real Name

Client sends:       gpt-oss:120b
Proxy looks up:     gpt-oss:120b → gpt-oss:120b-cloud
Ollama receives:    gpt-oss:120b-cloud

Real Name → Display Name

Ollama returns:     gpt-oss:120b-cloud
Proxy translates:   gpt-oss:120b-cloud → gpt-oss:120b
Client sees:        gpt-oss:120b
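
A quick way to observe the reverse mapping (assuming a mapping for gpt-oss:120b exists, as in the admin examples above):

# /api/tags lists display names (gpt-oss:120b), not backend names (gpt-oss:120b-cloud)
curl http://localhost:8000/api/tags \
  -H "Authorization: Bearer $TOKEN"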

Node Prefix Routing

Force a request to a specific node by prefixing the model name with its code:

Client sends:       node:trmix:kimi-k2.6:latest
Gateway parses:     code = "trmix", model = "kimi-k2.6:latest"
Node lookup:        trmix → node #3
Model mapping:      kimi-k2.6:latest → kimi-k2.6:latest-cloud
Node #3 receives:   kimi-k2.6:latest-cloud
  • Syntax: node:{code}:{model_name}
  • The code is the unique short identifier set on each node in the admin panel.
  • If the code does not exist, the gateway returns a 404 error: Node with code 'x' not found.
  • When a prefix is present, the load balancer is skipped and the request goes directly to the matched node.
  • Prefix routing works on every endpoint that accepts a model parameter: /api/chat, /api/generate, /v1/chat/completions, /v1/embeddings, etc.
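
For example, a prefixed chat request (assuming a node with code trmix exists and hosts the model):

curl -X POST http://localhost:8000/api/chat \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "node:trmix:kimi-k2.6:latest",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'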

Model Groups

If the requested model is a group, the gateway resolves it dynamically:

  1. Detect if the request needs vision (image content in messages).
  2. Filter members by capability tags (vision, tools).
  3. Pick a member using the group's strategy:
    • round_robin — cycle through members
    • weighted — weighted random selection
    • priority — always pick lowest priority number
  4. If the selected model fails, retry with the next member in priority order.
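
Group resolution is transparent to the client: request the group name wherever a model name is accepted. A sketch, assuming the coding group from the admin examples above:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "coding",
    "messages": [{"role": "user", "content": "Write a quicksort in Python"}]
  }'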

Node-Scoped Mappings

A model mapping can be bound to a specific node so the same display name resolves to a different real name on different backends. This is useful when nodes host different variants of the same model (e.g. a CPU-quantized version on one node and a full-GPU version on another).
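
A hypothetical sketch of creating such a mapping via the admin API. The exact payload field for node scoping is not documented in this README, so the node_id field below is an assumption; see docs/API.md for the real schema:

# NOTE: "node_id" is a hypothetical field name, not confirmed by this README
curl -X POST http://localhost:8000/admin/model-mappings \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "display_name": "kimi-k2.6:latest",
    "real_name": "kimi-k2.6:latest-q4",
    "node_id": 2
  }'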


Troubleshooting

Restart the full stack

docker compose -f docker-compose.dev.yml down
docker compose -f docker-compose.dev.yml up --build -d

Run migrations manually

docker exec maestro alembic upgrade head

Re-run seeds

docker exec maestro python -m app.seeder --reset
docker exec maestro python -m app.seeder

Clear cache

docker exec maestro python scripts/clear_cache.py

Check PostgreSQL health

docker exec maestro-postgres pg_isready -U maestro_user -d maestro

Check Redis

docker exec maestro-redis redis-cli ping

View logs

# All services
docker compose -f docker-compose.dev.yml logs -f

# API only
docker compose -f docker-compose.dev.yml logs -f maestro

# Frontend only
docker compose -f docker-compose.dev.yml logs -f frontend

Development

Project Structure

model-maestro/
├── app/
│   ├── main.py              # FastAPI app, routers, docs auth
│   ├── proxy.py             # Proxy logic, model routing, failover, tool call parsing
│   ├── config.py            # Settings, ModelMappingManager, ModelGroupManager
│   ├── auth.py              # JWT authentication
│   ├── models.py            # Pydantic request/response models
│   ├── models_db.py         # SQLAlchemy ORM models
│   ├── database.py          # Async DB engine & session maker
│   ├── redis.py             # Redis client & queue
│   ├── load_balancer.py     # Node selection algorithms
│   ├── node_manager.py      # Health checks, discovery, node CRUD
│   ├── user_manager.py      # User CRUD
│   ├── background_tasks.py  # Activity log processor, health checks, model warmup
│   ├── openclaw.py          # OpenClaw integration
│   ├── admin*.py            # Admin API routers
│   ├── repositories/        # Data access layer
│   ├── services/            # Business logic layer
│   └── seeds/               # DB seed migrations
├── frontend/
│   ├── src/app/             # Next.js App Router pages
│   ├── src/components/      # React components (sidebar, shell, etc.)
│   └── public/              # Static assets (logo, favicon)
├── docs/                    # Documentation (architecture, API, setup)
├── alembic/                 # Alembic migrations
├── tests/                   # pytest suite
├── docker-compose.dev.yml   # Dev stack (PG + Redis + API + Frontend)
├── docker-compose.yml       # Production stack (API + Frontend only)
└── Dockerfile               # FastAPI container

Running Tests

python -m pytest tests/ -v
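
If the stack is running via Docker, the same suite can be run inside the API container (assuming the container name maestro from Quick Start):

docker exec maestro python -m pytest tests/ -v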

Lint & Format

# Backend
python -m black app/
python -m ruff check app/

# Frontend
cd frontend && npm run lint

Documentation

  • docs/ARCHITECTURE.md — System architecture, request flow, database schema
  • docs/API.md — Complete API reference with all endpoints, requests and responses
  • docs/SETUP.md — Detailed setup guide, environment variables, production deployment
  • QUICKSTART.md — Get running in under 5 minutes

License

MIT
