docker-ai-stack


SUMMARY

Deploy a complete, self-hosted AI stack on your own server with one command. Includes Ollama (LLM), LiteLLM (AI gateway), Whisper (STT), Kokoro (TTS), Embeddings (RAG), and MCP Gateway. Most services run locally; LiteLLM optionally routes to external providers. Supports NVIDIA GPU (CUDA) acceleration.

README.md

English | 简体中文 | 繁體中文 | Русский

Docker AI Stack


Deploy a complete, self-hosted AI stack on your own server with a single command.

  • Zero-config: all services auto-configure on first start
  • Secure: Ollama, LiteLLM, and MCP Gateway generate API keys automatically
  • Private: audio, embeddings, and LLM inference all run locally — no data sent to third parties
  • Optional auth: Whisper, Kokoro, and Embeddings work without API keys by default (set keys via env files for public deployments)
  • Lightweight stacks for lower memory requirements (as low as ~2.5 GB)
  • GPU acceleration via NVIDIA CUDA

Note: When using LiteLLM with external providers (e.g., OpenAI, Anthropic), your data will be sent to those providers.

Services included:

| Service | Role | Default port |
| --- | --- | --- |
| Ollama (LLM) | Runs local LLM models (llama3, qwen, mistral, etc.) | 11434 |
| LiteLLM | AI gateway — routes requests to Ollama, OpenAI, Anthropic, and 100+ providers | 4000 |
| Embeddings | Converts text to vectors for semantic search and RAG | 8000 |
| Whisper (STT) | Transcribes spoken audio to text | 9000 |
| Kokoro (TTS) | Converts text to natural-sounding speech | 8880 |
| MCP Gateway | Provides MCP tools (filesystem, fetch, GitHub, search, databases) to AI clients | 3000 |
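
Once the stack is up, you can confirm each service is listening with a plain TCP check against the default ports. This is a minimal bash sketch; it assumes the ports above are published to the host and left at their defaults:

# Check that each default port accepts connections (no curl required)
for port in 11434 4000 8000 9000 8880 3000; do
    if timeout 2 bash -c "</dev/tcp/localhost/$port" 2>/dev/null; then
        echo "Port $port: open"
    else
        echo "Port $port: not reachable"
    fi
done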


Architecture

graph LR
    A["🎤 Audio input"] -->|transcribe| W["Whisper<br/>(speech-to-text)"]
    D["📄 Documents"] -->|embed| E["Embeddings<br/>(text → vectors)"]
    E -->|store| VDB["Vector DB<br/>(Qdrant, Chroma)"]
    W -->|query| E
    VDB -->|context| L["LiteLLM<br/>(AI gateway)"]
    W -->|text| L
    L -->|routes to| O["Ollama<br/>(local LLM)"]
    L -->|response| T["Kokoro TTS<br/>(text-to-speech)"]
    T --> B["🔊 Audio output"]
    C["🤖 AI client<br/>(Cline, Claude, etc.)"] -->|MCP tools| M["MCP Gateway<br/>(MCP endpoint)"]
    C -->|chat| L
    L -->|MCP protocol| M

Quick start

Requirements:

  • A Linux server (local or cloud) with Docker installed
  • At least 8 GB of RAM (with small models). For larger models (8B+ parameters), 32 GB or more is recommended.
  • You can comment out services you don't need to reduce memory usage (see the example after this list).
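
For example, to run only a subset without editing the file, name the services on the command line. This is a sketch; the service names assume the compose file defines them as ollama and litellm, and Compose also starts any dependencies they declare:

# Start only selected services from the full compose file
docker compose up -d ollama litellm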

Start the full stack:

# Clone the repository to get the compose files
git clone https://github.com/hwdsl2/docker-ai-stack
cd docker-ai-stack
docker compose up -d

Pull a model (required before making LLM requests):

docker exec ollama ollama_manage --pull llama3.2:3b

Check the logs to confirm all services are ready:

docker compose logs

Get the API keys:

# Ollama API key
docker exec ollama ollama_manage --showkey

# LiteLLM API key
docker exec litellm litellm_manage --showkey

# MCP Gateway API key
docker exec mcp mcp_manage --showkey

Stop the stack:

docker compose down
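
docker compose down removes the containers but keeps the named volumes, so models and generated API keys survive a restart. To delete that data as well, add the -v flag:

# Remove containers AND named volumes (deletes downloaded models and keys)
docker compose down -v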

GPU acceleration (NVIDIA CUDA)

For NVIDIA GPU acceleration, use the CUDA compose file:

docker compose -f docker-compose.cuda.yml up -d

Requirements: NVIDIA GPU, NVIDIA driver 535+, and the NVIDIA Container Toolkit installed on the host. CUDA images are linux/amd64 only.
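
To verify the host can run GPU containers before starting the stack, run nvidia-smi in a throwaway CUDA container (the image tag below is only an example; any CUDA base image works):

# Should print your GPU table; errors usually mean a driver or toolkit problem
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi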

Lightweight stacks

Don't need the full stack? Use a pre-configured subset from the stacks/ folder:

| Stack | Services | Memory | Use case |
| --- | --- | --- | --- |
| voice-pipeline | Whisper + Ollama + LiteLLM + Kokoro | ~5 GB | Speech-to-text → LLM → text-to-speech |
| rag-pipeline | Ollama + LiteLLM + Embeddings | ~3 GB | Semantic search + LLM Q&A |
| ai-tools | Ollama + LiteLLM + MCP Gateway | ~3 GB | AI coding assistant with tool access |
| chat-only | Ollama + LiteLLM | ~2.5 GB | Minimal local ChatGPT replacement |

git clone https://github.com/hwdsl2/docker-ai-stack
cd docker-ai-stack/stacks/voice-pipeline  # or rag-pipeline, ai-tools, chat-only
docker compose up -d

Running without Docker Compose

If you prefer using docker run commands directly, first create a shared network so services can communicate:

docker network create ai-stack

Then start each service on the shared network:

# Ollama (LLM)
docker run -d --name ollama --restart always \
    --network ai-stack \
    -v ollama-data:/var/lib/ollama \
    hwdsl2/ollama-server

# LiteLLM (AI gateway)
docker run -d --name litellm --restart always \
    --network ai-stack \
    -p 4000:4000 \
    -e LITELLM_OLLAMA_BASE_URL=http://ollama:11434 \
    -v litellm-data:/etc/litellm \
    hwdsl2/litellm-server

# Embeddings
docker run -d --name embeddings --restart always \
    --network ai-stack \
    -p 8000:8000 \
    -v embeddings-data:/var/lib/embeddings \
    hwdsl2/embeddings-server

# Whisper (STT)
docker run -d --name whisper --restart always \
    --network ai-stack \
    -p 9000:9000 \
    -v whisper-data:/var/lib/whisper \
    hwdsl2/whisper-server

# Kokoro (TTS)
docker run -d --name kokoro --restart always \
    --network ai-stack \
    -p 8880:8880 \
    -v kokoro-data:/var/lib/kokoro \
    hwdsl2/kokoro-server

# MCP Gateway
docker run -d --name mcp --restart always \
    --network ai-stack \
    -p 3000:3000 \
    -v mcp-data:/var/lib/mcp \
    hwdsl2/mcp-gateway

Note: The shared network allows services to reach each other by container name (e.g., LiteLLM connects to Ollama via http://ollama:11434). You can start only the services you need — they don't all have to run together.
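
To confirm containers can reach each other on the ai-stack network, you can probe one by name from a throwaway container. This is a sketch using the public curlimages/curl image; since this Ollama image may require an API key, any HTTP status (even 401) proves connectivity:

# Any HTTP response proves name resolution and connectivity on the network
docker run --rm --network ai-stack curlimages/curl \
    -s -o /dev/null -w '%{http_code}\n' http://ollama:11434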

Pull a model (required before making LLM requests):

docker exec ollama ollama_manage --pull llama3.2:3b

Connect MCP Gateway to LiteLLM

LiteLLM and MCP Gateway are automatically wired when using the compose files in this repository. The LITELLM_MCP_URL=http://mcp:3000/mcp environment variable is pre-configured in the compose files, so LiteLLM injects the mcp_servers: block into its config on every start.

To complete the wiring, set the MCP API key after first start:

# 1. Get the MCP Gateway API key
docker exec mcp mcp_manage --showkey

# 2. Add it to litellm.env (or pass as environment variable) and restart:
#    LITELLM_MCP_API_KEY=mcp-xxxx...
docker compose restart litellm

Alternatively, pre-set a known key in mcp.env before starting (MCP_API_KEY=my-key) and use the same value for LITELLM_MCP_API_KEY in litellm.env — then no restart is needed.
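
A minimal sketch of that pre-set approach (the key value is an example; it assumes both env files are mounted as described under Customization below):

# Use one shared key on both sides before first start
echo 'MCP_API_KEY=my-shared-key' >> mcp.env
echo 'LITELLM_MCP_API_KEY=my-shared-key' >> litellm.env
docker compose up -d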

Once connected, AI clients that call LiteLLM can use MCP tools (filesystem, fetch, GitHub, etc.) directly through the LiteLLM proxy.

Voice pipeline example

Transcribe a spoken question, get a local LLM response via Ollama, and convert it to speech:

Tip: Need a sample audio file? Download this English speech sample (WAV, MIT License) from the Azure Samples repository:

curl -L -o sample_speech.wav \
    "https://github.com/Azure-Samples/cognitive-services-speech-sdk/raw/master/sampledata/audiofiles/katiesteve.wav"

LITELLM_KEY=$(docker exec litellm litellm_manage --showkey | grep '^sk-' | head -1)

# Step 1: Transcribe audio to text (Whisper)
TEXT=$(curl -s http://localhost:9000/v1/audio/transcriptions \
    -F file=@sample_speech.wav -F model=whisper-1 | jq -r .text)

# Step 2: Send text to Ollama via LiteLLM and get a response
RESPONSE=$(curl -s http://localhost:4000/v1/chat/completions \
    -H "Authorization: Bearer $LITELLM_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"ollama/llama3.2:3b\",\"messages\":[{\"role\":\"user\",\"content\":\"$TEXT\"}]}" \
    | jq -r '.choices[0].message.content')

# Step 3: Convert the response to speech (Kokoro TTS)
curl -s http://localhost:8880/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"tts-1\",\"input\":\"$RESPONSE\",\"voice\":\"af_heart\"}" \
    --output response.mp3
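
To hear the result, play the file with any local audio player, for example:

# Play the generated audio (requires mpv or ffplay installed on the host)
mpv response.mp3    # or: ffplay -nodisp -autoexit response.mp3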

RAG pipeline example

Embed documents for semantic search, retrieve context, then answer questions with a local Ollama model:

LITELLM_KEY=$(docker exec litellm litellm_manage --showkey | grep '^sk-' | head -1)

# Step 1: Embed a document chunk and store the vector in your vector DB
curl -s http://localhost:8000/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"input": "Docker simplifies deployment by packaging apps in containers.", "model": "text-embedding-ada-002"}' \
    | jq '.data[0].embedding'
# → Store the returned vector alongside the source text in Qdrant, Chroma, pgvector, etc.

# Step 2: At query time, embed the question, retrieve the top matching chunks from
#          the vector DB, then send the question and retrieved context to Ollama via LiteLLM.
curl -s http://localhost:4000/v1/chat/completions \
    -H "Authorization: Bearer $LITELLM_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "ollama/llama3.2:3b",
      "messages": [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "What does Docker do?\n\nContext: Docker simplifies deployment by packaging apps in containers."}
      ]
    }' \
    | jq -r '.choices[0].message.content'

MCP tools example

Use MCP Gateway to give your AI assistant access to files, web, and GitHub:

MCP_KEY=$(docker exec mcp mcp_manage --showkey | grep '^mcp-' | head -1)

# Use MCP endpoint with an AI client (e.g., Cline in VS Code)
# Set the MCP server URL: http://localhost:3000/mcp
# Set Authorization header: Bearer <api_key>

# Or test the MCP endpoint directly with an initialize request
curl -s http://localhost:3000/mcp \
    -X POST \
    -H "Authorization: Bearer $MCP_KEY" \
    -H "Content-Type: application/json" \
    -H "Accept: application/json, text/event-stream" \
    -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"test","version":"1.0"}}}'

Customization

Each service can be configured with an optional env file. Copy the example env file from the respective repository, edit it, and uncomment the volume mount in docker-compose.yml:

| Service | Env file | Repository |
| --- | --- | --- |
| Ollama | ollama.env | docker-ollama |
| LiteLLM | litellm.env | docker-litellm |
| Embeddings | embed.env | docker-embeddings |
| Whisper | whisper.env | docker-whisper |
| Kokoro | kokoro.env | docker-kokoro |
| MCP Gateway | mcp.env | docker-mcp-gateway |

For detailed configuration options, API reference, and model management, see the documentation in each service's repository.
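
For example, once you have copied and edited a service's env file and uncommented its volume mount (file names per the table above), restart just that service to apply it, mirroring the litellm restart shown earlier:

# Apply an edited ollama.env by restarting only the Ollama service
docker compose restart ollama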

Update images

To update all services to the latest versions:

docker compose pull
docker compose up -d

Your data is preserved in the Docker volumes.

License

Copyright (C) 2026 Lin Song
This work is licensed under the MIT License.

This project is an independent Docker configuration and is not affiliated with, endorsed by, or sponsored by Ollama, Berri AI (LiteLLM), Hugging Face, hexgrad (Kokoro), OpenAI, SYSTRAN, or MCPHub.
