Multi-Modal AI Voice Assistant
A multi-modal AI voice assistant supporting DeepSeek (default), OpenAI, Anthropic Claude, and local LM Studio LLMs with configurable text-to-speech (OpenAI streaming or Kokoro). Combines voice transcription, tool calling, clipboard extraction, screenshot analysis, and web search to respond with rich context.
Features
- Multi-provider LLM support: DeepSeek (default, fast & cheap), OpenAI (GPT-5), local LM Studio, Anthropic Claude
- Tool calling: Screenshot capture, webcam capture, clipboard extraction, DuckDuckGo search
- Flexible TTS: OpenAI streaming voices or offline Kokoro synthesis
- Model Context Protocol (MCP): Pluggable context providers for external integrations
- Wake word activation: Say "nova" followed by your prompt
- Graceful fallbacks: Models and TTS providers fall back automatically on failure (see the sketch after this list)
- `.env` support: All credentials/config can live in a single gitignored `.env` file
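The fallback behaviour isn't spelled out in this README; below is a minimal sketch of the general pattern, assuming hypothetical provider objects that expose a `chat()` method:

```python
# Hypothetical illustration of provider fallback, not the project's actual code.
def complete_with_fallback(providers, messages):
    """Try each provider in order; return the first successful response."""
    last_error = None
    for provider in providers:
        try:
            return provider.chat(messages)
        except Exception as exc:  # e.g. network error, rate limit, bad key
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```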
Installation
```bash
# Clone the repository
git clone https://github.com/tristan-mcinnis/Multimodal-voice-assistant
cd Multimodal-voice-assistant

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Or install as a package
pip install -e .
```
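As a quick sanity check after installing, you can confirm the audio stack imports cleanly (package names taken from the Dependencies section below):

```python
# Verify the core speech/audio packages are importable.
import faster_whisper       # Whisper transcription backend
import pyaudio              # microphone capture
import speech_recognition   # SpeechRecognition package
print("audio dependencies OK")
```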
Quick Start
Default mode (DeepSeek + Kokoro)
```bash
# 1. Copy the example .env and add your DeepSeek key
cp .env.example .env
# Edit .env, set DEEPSEEK_API_KEY=sk-...

# 2. (Optional) Download Kokoro TTS models for offline speech (~335MB)
mkdir -p models
curl -L -o models/kokoro-v1.0.onnx https://github.com/nazdridoy/kokoro-tts/releases/download/v1.0.0/kokoro-v1.0.onnx
curl -L -o models/voices-v1.0.bin https://github.com/nazdridoy/kokoro-tts/releases/download/v1.0.0/voices-v1.0.bin

# 3. Run the assistant
python run.py
```
Get a DeepSeek API key at https://platform.deepseek.com/. The default model is `deepseek-v4-flash` (1M context, low-latency). Override with `DEEPSEEK_PREFERRED_CHAT_MODEL=deepseek-v4-pro` for the higher-capability model.
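Under these defaults, a minimal `.env` might look like the following (values are placeholders; every variable shown appears in the Environment Variables table below):

```ini
# .env file (gitignored); see .env.example for the full template
LLM_PROVIDER=deepseek
DEEPSEEK_API_KEY=sk-...
DEEPSEEK_PREFERRED_CHAT_MODEL=deepseek-v4-flash
ASSISTANT_TTS_PROVIDER=kokoro
KOKORO_VOICE=af_sarah
```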
Cloud mode (OpenAI)
```bash
export OPENAI_API_KEY="sk-..."
export LLM_PROVIDER=openai
python run.py
```
Local mode (LM Studio)
```bash
# Start LM Studio and load a model first.
export LLM_PROVIDER=local
export LOCAL_LLM_BASE_URL=http://localhost:1234/v1
python run.py
```
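If the assistant can't connect, confirm LM Studio's server is actually listening; it serves the standard OpenAI-compatible model-listing route:

```bash
# Should return a JSON list of whatever models LM Studio has loaded
curl http://localhost:1234/v1/models
```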
The wake word is "nova". Say it followed by your request.
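The README doesn't show the gating code, but wake-word handling of this kind usually reduces to a prefix check on the transcription. A hedged sketch (function name and details are illustrative):

```python
WAKE_WORD = "nova"

def extract_prompt(transcription: str) -> str | None:
    """Return the user's request if it starts with the wake word, else None."""
    text = transcription.strip().lower()
    if text.startswith(WAKE_WORD):
        return text[len(WAKE_WORD):].strip(" ,.") or None
    return None

assert extract_prompt("Nova, what's on my clipboard?") == "what's on my clipboard?"
```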
Configuration
The assistant reads a `.env` file in the project root (loaded via `python-dotenv` before any submodule imports). Anything you can export you can also drop in `.env`. See `.env.example` for the full template — `.env` itself is gitignored.
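The load order matters because provider modules read environment variables at import time. The standard `python-dotenv` pattern achieves it (a sketch of the likely `run.py` structure; exact imports may differ):

```python
# run.py: populate the environment before importing anything that reads it
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root; existing env vars still win

from assistant.core import VoiceAssistant  # imported only after .env is loaded
```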
LLM Providers
```bash
# DeepSeek (default — set in .env)
export LLM_PROVIDER=deepseek
export DEEPSEEK_API_KEY="sk-..."
export DEEPSEEK_PREFERRED_CHAT_MODEL=deepseek-v4-flash  # or deepseek-v4-pro

# OpenAI
export LLM_PROVIDER=openai
export OPENAI_API_KEY="sk-..."

# Local LM Studio
export LLM_PROVIDER=local
export LOCAL_LLM_BASE_URL=http://localhost:1234/v1
export LOCAL_LLM_MODEL=your-model-name

# Claude/Anthropic
export LLM_PROVIDER=anthropic
export ANTHROPIC_API_KEY="sk-ant-..."
```
Text-to-Speech
```bash
# Kokoro TTS (default, local, offline)
# Requires downloading model files - see "Kokoro TTS Setup" below
export ASSISTANT_TTS_PROVIDER=kokoro
export KOKORO_VOICE=af_sarah
export KOKORO_STREAMING=true  # Low-latency ONNX mode

# OpenAI TTS (requires API key)
export ASSISTANT_TTS_PROVIDER=openai
```
Environment Variables
| Variable | Description |
|---|---|
| `LLM_PROVIDER` | `deepseek` (default), `openai`, `local`, or `anthropic` |
| `DEEPSEEK_API_KEY` | DeepSeek API key (default provider) |
| `DEEPSEEK_PREFERRED_CHAT_MODEL` | Default: `deepseek-v4-flash` |
| `DEEPSEEK_BASE_URL` | Override (default: `https://api.deepseek.com`) |
| `OPENAI_API_KEY` | OpenAI API key |
| `ANTHROPIC_API_KEY` | Anthropic API key (for Claude) |
| `LOCAL_LLM_BASE_URL` | LM Studio endpoint (default: `http://localhost:1234/v1`) |
| `LOCAL_LLM_MODEL` | Model name in LM Studio |
| `ASSISTANT_TTS_PROVIDER` | `openai` or `kokoro` |
| `ASSISTANT_DISABLE_TOOLS` | Set to `true` to disable tool calling |
| `ASSISTANT_SIMPLE_TOOLS` | Set to `true` for clipboard + search only (no vision) |
| `MCP_CONTEXT_FILE` | Path to MCP context file |
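A hedged sketch of how `config/settings.py` might map these variables onto a settings object (field names are illustrative, not the project's actual ones):

```python
import os
from dataclasses import dataclass, field

@dataclass
class Settings:
    llm_provider: str = field(
        default_factory=lambda: os.getenv("LLM_PROVIDER", "deepseek"))
    tts_provider: str = field(
        default_factory=lambda: os.getenv("ASSISTANT_TTS_PROVIDER", "kokoro"))
    local_llm_base_url: str = field(
        default_factory=lambda: os.getenv("LOCAL_LLM_BASE_URL", "http://localhost:1234/v1"))
    disable_tools: bool = field(
        default_factory=lambda: os.getenv("ASSISTANT_DISABLE_TOOLS", "").lower() == "true")
```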
Architecture
```
assistant/
├── core.py                 # VoiceAssistant orchestrator
├── config/settings.py      # Env-var parsing
├── providers/
│   ├── llm/
│   │   ├── openai_compatible.py   # Shared adapter for OpenAI-shaped APIs
│   │   ├── deepseek_provider.py   # DeepSeek (default)
│   │   ├── openai_provider.py     # OpenAI
│   │   ├── local_provider.py      # LM Studio
│   │   └── anthropic_provider.py  # Claude
│   └── tts/                # OpenAI, Kokoro
├── tools/
│   ├── loop.py             # Unified streaming tool-call loop
│   ├── registry.py         # Tool registry
│   └── …                   # clipboard, search, vision tools
├── context/                # Conversation and MCP context
├── speech/                 # Whisper recognition
├── media/                  # Screenshot, webcam capture
└── utils/                  # Logging, message helpers
```
DeepSeek, OpenAI, and LM Studio all speak the OpenAI Chat Completions wire format and share `OpenAICompatibleProvider` — see ADR 0001. The streaming + non-streaming tool-call loop lives behind one seam — see ADR 0002. The domain glossary is in `CONTEXT.md`.
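Sharing one adapter works because all three services accept the same request shape; only the endpoint and credentials change. A minimal illustration using the official `openai` Python SDK (model name taken from the defaults above):

```python
from openai import OpenAI

# The same client and call shape works for DeepSeek, OpenAI, and LM Studio.
deepseek = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")
lm_studio = OpenAI(api_key="not-needed", base_url="http://localhost:1234/v1")

resp = deepseek.chat.completions.create(
    model="deepseek-v4-flash",  # DEEPSEEK_PREFERRED_CHAT_MODEL default
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```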
Testing
```bash
pip install -e '.[dev]'
pytest
```
Tests cover the tool registry, the unified ToolLoop (with a scripted fake
provider — no network/audio needed), and the LLM provider factory.
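A scripted fake keeps those tests deterministic. A sketch of the style (this `FakeProvider` is an assumption, not the project's real test double):

```python
class FakeProvider:
    """Replays a pre-scripted sequence of responses: no network, no audio."""

    def __init__(self, scripted_responses):
        self._responses = iter(scripted_responses)

    def chat(self, messages, tools=None):
        return next(self._responses)

def test_fake_provider_replays_script():
    provider = FakeProvider(["first answer", "second answer"])
    assert provider.chat([{"role": "user", "content": "hi"}]) == "first answer"
    assert provider.chat([{"role": "user", "content": "more"}]) == "second answer"
```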
Extending
Adding Tools
Register tools in VoiceAssistant._register_builtin_tools() or create new files in assistant/tools/:
```python
self.tool_registry.register(
    name="my_tool",
    description="What this tool does",
    parameters={"type": "object", "properties": {...}},
    handler=lambda **kwargs: "result",
)
```
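For a concrete (hypothetical) example, here is the same call with the parameters schema fully written out:

```python
# "word_count" is an invented example, not one of the built-in tools.
self.tool_registry.register(
    name="word_count",
    description="Count the words in a piece of text",
    parameters={
        "type": "object",
        "properties": {
            "text": {"type": "string", "description": "Text to count words in"},
        },
        "required": ["text"],
    },
    handler=lambda text: str(len(text.split())),
)
```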
Adding LLM Providers
1. Create `assistant/providers/llm/my_provider.py`
2. Implement the `LLMProvider` interface from `base.py`
3. Update the factory in `assistant/providers/llm/__init__.py`
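Since `base.py` isn't reproduced in this README, the skeleton below guesses at a minimal shape (class and method names are assumptions):

```python
# assistant/providers/llm/my_provider.py (hypothetical skeleton)
from .base import LLMProvider  # the actual interface lives in base.py

class MyProvider(LLMProvider):
    """Adapter for a custom LLM backend."""

    def __init__(self, api_key: str, model: str):
        self.api_key = api_key
        self.model = model

    def chat(self, messages, tools=None):
        # Call your backend here and return the reply in the shape
        # the rest of the assistant expects (see base.py).
        raise NotImplementedError
```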
Kokoro TTS Setup
For offline text-to-speech, download the model files (~335MB total) to the models/ directory:
```bash
mkdir -p models
curl -L -o models/kokoro-v1.0.onnx https://github.com/nazdridoy/kokoro-tts/releases/download/v1.0.0/kokoro-v1.0.onnx
curl -L -o models/voices-v1.0.bin https://github.com/nazdridoy/kokoro-tts/releases/download/v1.0.0/voices-v1.0.bin
```
That's it! Kokoro is the default TTS provider and will automatically find these files.
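To confirm the files are usable, you can load them directly with the `kokoro-onnx` package (assuming its `Kokoro` class and `create()` call; the voice matches the `KOKORO_VOICE` default above):

```python
from kokoro_onnx import Kokoro

kokoro = Kokoro("models/kokoro-v1.0.onnx", "models/voices-v1.0.bin")
samples, sample_rate = kokoro.create("Hello from Nova.", voice="af_sarah", speed=1.0)
print(f"synthesized {len(samples)} samples at {sample_rate} Hz")
```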
Optional: Custom model location
If you prefer to store the models elsewhere:
```bash
export KOKORO_ONNX_MODEL_PATH=/path/to/kokoro-v1.0.onnx
export KOKORO_VOICES_BIN_PATH=/path/to/voices-v1.0.bin
```
Streaming mode (lower latency)
For reduced latency, enable streaming mode:
```bash
export KOKORO_STREAMING=true
```
Dependencies
Core: openai, faster-whisper, SpeechRecognition, pyaudio, rich, Pillow, pygame, duckduckgo-search, scikit-learn
Optional: kokoro-tts, kokoro-onnx, anthropic
Credits
- Kokoro TTS by Nazmus Sakib Dridoy
License
MIT License - see LICENSE for details.