Aelios Spark

Make any web app voice-controllable in 37 languages in 5 minutes.

Aelios Spark is an open-source voice control layer for web apps. Drop in a
widget, define a few tools, and your users can operate your software
by talking to it — creating records, navigating screens, running queries, all hands-free.

flowchart LR
    user(["🎙️ User"])
    subgraph host["Your web app (browser)"]
        widget["Aelios Spark widget<br/>+ your tool defs"]
    end
    subgraph server["Your machine / VPS"]
        agent["Aelios Spark agent server<br/>(Python / Pipecat)"]
        cfg[("aelios-spark.config.yaml<br/>prompt · persona · KB")]
        agent -.reads.-> cfg
    end
    user <-->|voice| widget
    widget <==>|"WebRTC<br/>(audio + tool RPC)"| agent
    agent -->|"LLM, STT, TTS"| providers[("OpenAI · Deepgram<br/>Cartesia · Daily")]

No hosted backend. No SaaS sign-up. You run the agent server yourself,
define tools in your app code, and the voice loop runs on your own
machine. Bring your own API keys for OpenAI, Daily, Deepgram, Cartesia,
and Google AI Studio (Gemini).

Desktop only today. The widget refuses to mount on viewports
narrower than 768px and tears down any live session if the window
shrinks below that threshold. Mobile support is on the
roadmap.
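
If your host page shows its own voice-related hints, it may want to mirror the same rule so they disappear on small screens. A minimal sketch, assuming the threshold is compared against the viewport width in CSS pixels; `MIN_VIEWPORT_PX` and `viewportDecision` are hypothetical names and not part of the Aelios Spark API:

```javascript
// Hypothetical helper, not part of the Aelios Spark API.
const MIN_VIEWPORT_PX = 768;

// Decide what the 768px rule implies for a given viewport width.
// "Narrower than 768px" means 768 itself is still allowed.
function viewportDecision(widthPx, sessionLive) {
    if (widthPx >= MIN_VIEWPORT_PX) return "ok";
    // Below the threshold: a live session is torn down, otherwise mounting is refused.
    return sessionLive ? "teardown" : "refuse-mount";
}
```

In a browser you would typically pass `window.innerWidth` and re-evaluate on the `resize` event.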

Looking for production scale? The managed version is
Aelios AI — autoscaling, multi-tenant
agents, hosted control plane, continuous-learning loops, and a
separate video demo agent that learns your software and
streams hands-free product demos 24/7. The OSS Aelios Spark widget is the
same code path; the managed platform adds the surfaces around it.

Quick start: get it running in 5 minutes
How it works: a session, end to end
Two registration patterns: how the host page wires everything up
Two modes (action and guide): the agent operates your app, or it narrates it
Languages: the 37 the widget ships with
What you need: bring-your-own-key providers
Repo layout
Deep documentation: the rest of the system, one doc per concern
Contributing
License

Quick start

You need three things running:

Agent server (Python, this repo)
Widget bundle (TypeScript, this repo — build once)
Your web app (where the widget gets embedded)

# Clone
git clone https://github.com/Aelios-AI/aelios-spark
cd aelios-spark

# 1. Agent server
cd packages/agent-server
cp .env.example .env       # paste in OPENAI_API_KEY, DAILY_API_KEY, etc.
uv sync
uv run python server.py    # serves :3002

# 2. Widget bundle (in another terminal, from the repo root)
cd packages/widget
npm install
npm run build              # produces dist/aelios-spark-widget.js

# 3. Try the example app
cd ../../examples/tracker
npm install && npm run copy-widget
npm run dev                # → http://localhost:5180

Open the example, click the launcher, and talk to your tasks app. Try
"create a task to ship the release notes by Friday" or
"list tasks assigned to Alice".

Full step-by-step with troubleshooting in
docs/quickstart.md.

How it works

sequenceDiagram
    autonumber
    participant User
    participant Page as Host page
    participant Widget as Aelios Spark widget
    participant Server as Agent server
    participant LLM as LLM + STT + TTS

    Page->>Widget: AeliosSpark.configure({...}), AeliosSpark.defineTool(...)
    User->>Widget: Click launcher, pick language/mode
    Widget->>Server: POST /start (tools + lang + mode)
    Server->>Server: Load aelios-spark.config.yaml<br/>(prompt + persona + KB)
    Server-->>Widget: Daily room URL + token
    Note over Widget,Server: WebRTC voice loop established

    loop Conversation turn
        User->>Widget: speaks
        Widget->>Server: audio (WebRTC)
        Server->>LLM: STT → reason → TTS
        LLM-->>Server: tool calls + reply
        Server->>Widget: tool_call_batch (RTVI)
        Widget->>Page: invoke registered tool fn
        Page-->>Widget: result
        Widget-->>Server: tool_result
        Server-->>Widget: spoken reply (audio)
        Widget-->>User: speaks
    end

A session has three layers:

Widget runs in your visitor's browser. It captures audio,
renders the chrome, holds the tool registry, and talks to the
agent server over WebRTC + RTVI.
Agent server runs on your machine (or VPS). It hosts a Pipecat
pipeline — STT → LLM → TTS → audio out — plus the
InAppAgentProcessor state machine that schedules tool calls,
manages demonstrations, requests screenshots, runs idle timers,
and applies schema-gated structured output.
Your web app (the host page) registers tools and calls
AeliosSpark.configure(...) to point at the agent server and adjust the
launcher pill's position and theme colors.

Tool calls flow over the RTVI data channel; audio flows over WebRTC.
Everything is one-session-per-process — no shared state.
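
The tool round trip in the loop above (batch in, registered functions invoked, results out) can be sketched as follows. This is an illustration under stated assumptions, not the widget's actual code: the `id`/`name`/`args` call shape, the `ok`/`value`/`error` result shape, and the local `defineTool` and `handleToolCallBatch` helpers are all hypothetical, and calls are run sequentially for simplicity.

```javascript
// Hypothetical message shapes; the real RTVI payloads may differ.
const registry = new Map();   // tool name -> execute function

// Local stand-in for AeliosSpark.defineTool, so the sketch runs on its own.
function defineTool(tool) {
    registry.set(tool.name, tool.execute);
}

// Run every call in a batch and collect one result per call, in order.
async function handleToolCallBatch(calls) {
    const results = [];
    for (const call of calls) {
        const execute = registry.get(call.name);
        if (!execute) {
            results.push({ id: call.id, ok: false, error: `unknown tool: ${call.name}` });
            continue;
        }
        try {
            results.push({ id: call.id, ok: true, value: await execute(call.args) });
        } catch (err) {
            // Report failures as results rather than letting one bad call abort the batch.
            results.push({ id: call.id, ok: false, error: String(err) });
        }
    }
    return results;
}

// Usage: one registered tool, one unknown tool in the same batch.
defineTool({ name: "list_tasks", execute: async ({ assignee }) => [`task for ${assignee}`] });
const pending = handleToolCallBatch([
    { id: "1", name: "list_tasks", args: { assignee: "Alice" } },
    { id: "2", name: "missing_tool", args: {} },
]);
```

Returning a result for every call, including failures, keeps the reply aligned one-to-one with the batch that was received.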

For the full architecture (priority queue, five wake modes,
demonstrations, screenshot service, tool dispatcher, watchdogs, the
RTVI custom-message protocol), read
docs/architecture.md.

Two registration patterns

The host page interacts with the widget through two patterns. They
serve different concerns and can be called in any order.

Pattern 1 — `AeliosSpark.configure({...})`: agent URL + widget look

Tells the widget where the agent server is and how it should look.
The full surface is small — see docs/configuration.md:

AeliosSpark.configure({
    agentUrl: "http://localhost:3002/start",
    branding: {
        position: "bottom-right",      // or "bottom-left"
        themeColors: {                 // optional palette override
            primary: "#F4F5F7",
            bg: "#0A0A0A",
            text: "#F4F5F7",
            muted: "#A0A0A0",
            onPrimary: "#0A0A0A",
        },
    },
});

Pattern 2 — `AeliosSpark.defineTool({...})`: callable functions

Registers one tool the agent can invoke during voice turns. Tools
accumulate in an in-memory registry; at session start, the registry is
forwarded to the agent server as the session's tool set.

AeliosSpark.defineTool({
    name: "create_contact",
    description: "Add a new contact. Use when the user says 'add' or names a new person.",
    parameters: {
        type: "object",
        properties: {
            name: { type: "string" },
            email: { type: "string" },
        },
        required: ["name"],
    },
    execute: async ({ name, email }) => myApi.createContact({ name, email }),
    requiresConfirmation: false,    // set true for destructive ops
});
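
For a destructive operation, the same shape with `requiresConfirmation: true` might look like the sketch below. The `contacts` map is a hypothetical in-memory stand-in for your real API client, and the returned object shape is an assumption; see docs/tools.md for the return values the agent actually expects.

```javascript
// Hypothetical in-memory store standing in for a real API client.
const contacts = new Map([["c1", { id: "c1", name: "Alice" }]]);

const deleteContactTool = {
    name: "delete_contact",
    description: "Permanently delete a contact. Use only when the user explicitly asks to delete or remove someone.",
    parameters: {
        type: "object",
        properties: {
            id: { type: "string" },
        },
        required: ["id"],
    },
    // Return a small JSON-serialisable object (assumed shape) describing the outcome.
    execute: async ({ id }) => {
        const existed = contacts.delete(id);
        return existed ? { deleted: id } : { error: `no contact with id ${id}` };
    },
    requiresConfirmation: true,     // destructive, so ask before running
};

// On a page with the widget loaded you would register it with:
//   AeliosSpark.defineTool(deleteContactTool);
```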

The `AeliosSparkReady` queue — order-independent setup

Both patterns work through a callback queue so they're safe to call
before the widget bundle has finished loading:

<script src="/aelios-spark-widget.js" data-agent-url="http://localhost:3002/start"></script>
<script>
  window.AeliosSparkReady = window.AeliosSparkReady || [];
  window.AeliosSparkReady.push((AeliosSpark) => {
    AeliosSpark.configure({ ... });
    AeliosSpark.defineTool({ ... });
    AeliosSpark.defineTool({ ... });
  });
</script>
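
Why this is order-independent: a callback queue of this kind is usually drained once by the bundle when it loads, after which `push` runs callbacks immediately. The sketch below simulates that with a fake `window` and a fake API object. `installReadyQueue` is a hypothetical name, and the real bundle's internals may differ.

```javascript
// Simulates a bundle draining a pre-load callback queue (assumed behaviour).
function installReadyQueue(globalObj, api) {
    const queued = Array.isArray(globalObj.AeliosSparkReady) ? globalObj.AeliosSparkReady : [];
    // After load, push() runs the callback right away instead of queueing it.
    globalObj.AeliosSparkReady = { push: (cb) => { cb(api); } };
    for (const cb of queued) cb(api);
}

// Usage: callbacks pushed before and after "load" both run, in order.
const fakeWindow = {};
const calls = [];
const fakeApi = { configure: (cfg) => { calls.push(cfg); } };

fakeWindow.AeliosSparkReady = fakeWindow.AeliosSparkReady || [];
fakeWindow.AeliosSparkReady.push((api) => api.configure({ early: true }));

installReadyQueue(fakeWindow, fakeApi);   // the bundle finishes loading here

fakeWindow.AeliosSparkReady = fakeWindow.AeliosSparkReady || [];
fakeWindow.AeliosSparkReady.push((api) => api.configure({ late: true }));
```

The `window.AeliosSparkReady = window.AeliosSparkReady || []` line stays safe after load because the replaced object is truthy and is therefore kept.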

Then, on the server side, tell the agent who it is, what your
software is, and what it should know about it in
packages/agent-server/aelios-spark.config.yaml. Both the agent's persona
and the host software's knowledge base live here, because both
get baked into the system prompt the LLM sees every turn:

agent:                          # who the agent is
  name: "Acme Assistant"
  personality: "Friendly and precise."

software:                       # the app the widget is embedded in
  name: "Acme CRM"
  tldr: "A simple CRM for small teams."
  docs_file: "./knowledge.md"   # KB the agent draws on for every reply

additional_instructions: |      # any extra business rules / style notes
  You operate Acme CRM on behalf of the user via voice. Be concise.

Restart the agent server and refresh your app — voice control is live.

Full tool-writing guide in docs/tools.md.
Full widget config schema in
docs/configuration.md.

Two modes — action and guide

Aelios Spark sessions run in one of two modes. The visitor picks at session
start; the choice is frozen for the session.

Capability	`action` (default)	`guide`
Calls your tools	yes	no
Sees the screen	only when the agent decides	every turn
Points to UI	no	yes (ghost cursor)
Best for	operating your app	narrating your app

Action mode is the agent operating your software on the
visitor's behalf — voice-driven CRUD, dictation-with-effects,
hands-free workflows. The agent only sees the screen when it
explicitly requests a screenshot.

Guide mode is read-only narration with on-screen pointing —
onboarding, accessibility, sales demos. The agent gets a screenshot
every turn and can drop a ghost cursor (an arrow + fixed "Agent"
tag) onto any element on the page; what to do there is conveyed by
the spoken reply itself. It cannot call tools; the schema literally
drops the tool_invocations field.

Both modes run through the same InAppAgentProcessor, but each has
its own Jinja system-prompt template
(IN_APP_AGENT_TURN_TEMPLATE for action, IN_APP_AGENT_GUIDE_TURN_TEMPLATE
for guide). Guide mode has no tools, no demonstrations, and no batches,
so a shared template would bury the relevant instructions under
sections the LLM has to skip every turn. Schema gating layers on top
of the template split: the guide-mode schema omits tool_invocations.
Full breakdown — when to use each, the schema differences, the
two-trigger rule, the confirmation flow — in
docs/modes.md.
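
To make the schema gating concrete, here is a small sketch of a per-mode output schema. The real schema lives in the Python agent server and is not shown in this README, so treat this purely as an illustration: `responseSchemaFor` and the `spoken_reply` field are hypothetical, and only `tool_invocations` comes from the text above.

```javascript
// Illustrative only: builds a JSON-Schema-like object per mode.
function responseSchemaFor(mode) {
    const properties = {
        spoken_reply: { type: "string" },   // hypothetical field name
    };
    if (mode === "action") {
        // Only action mode exposes the field; guide mode never sees it.
        properties.tool_invocations = { type: "array", items: { type: "object" } };
    }
    return { type: "object", properties, required: ["spoken_reply"] };
}
```

Because the field is absent from the guide-mode schema, a structured-output LLM call constrained to that schema has no slot in which to emit a tool call.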

Languages

The widget ships a hardcoded 37-language picker that visitors
choose from at session start. The chosen language code is sent in
the /start body; the agent server runs Deepgram Nova-3 STT for
all 37 (configured per-session via the language enum) and
Cartesia handles TTS.
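
As a rough picture of what travels in that request, the sketch below assembles a /start body from the tool registry, a language code, and a mode. The field names (`language`, `mode`, `tools`), the `"en"` code format, and `buildStartBody` itself are assumptions; the README only states that tools, language, and mode are sent.

```javascript
// Hypothetical request shape for POST /start.
function buildStartBody(tools, language, mode = "action") {
    if (mode !== "action" && mode !== "guide") {
        throw new Error(`unknown mode: ${mode}`);
    }
    return {
        language,
        mode,
        // Functions cannot be serialised to JSON, so only the declarative
        // part of each tool is included; execute stays in the browser.
        tools: tools.map(({ name, description, parameters }) => ({ name, description, parameters })),
    };
}

// Usage
const startBody = buildStartBody(
    [{ name: "create_contact", description: "Add a new contact.", parameters: { type: "object" }, execute: async () => ({}) }],
    "en",
);
```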

🇸🇦 Arabic · 🇧🇬 Bulgarian · 🇨🇳 Chinese · 🇭🇷 Croatian ·
🇨🇿 Czech · 🇩🇰 Danish · 🇳🇱 Dutch · 🇺🇸 English · 🇫🇮 Finnish ·
🇫🇷 French · 🇩🇪 German · 🇬🇷 Greek · 🇮🇳 Gujarati · 🇮🇱 Hebrew ·
🇮🇳 Hindi · 🇭🇺 Hungarian · 🇮🇩 Indonesian · 🇮🇹 Italian ·
🇯🇵 Japanese · 🇮🇳 Kannada · 🇰🇷 Korean · 🇲🇾 Malay ·
🇮🇳 Marathi · 🇳🇴 Norwegian · 🇵🇱 Polish · 🇵🇹 Portuguese ·
🇷🇴 Romanian · 🇷🇺 Russian · 🇸🇰 Slovak · 🇪🇸 Spanish ·
🇸🇪 Swedish · 🇵🇭 Tagalog · 🇮🇳 Tamil · 🇮🇳 Telugu · 🇹🇭 Thai ·
🇹🇷 Turkish · 🇻🇳 Vietnamese

All 37 ship with native Cartesia voices out of the box. All bundled
voices are female — if you set agent.name in aelios-spark.config.yaml,
pick a feminine name so the persona name and the spoken voice match.
Operators who want a different voice (different gender, different
accent, custom clone) should override per-agent via voice_languages
or edit CARTESIA_TTS_VOICES in
adapters/languages.py.

The picker list is fixed in Widget.tsx and is not host-configurable.

What you need

Bring-your-own-key. None of these are baked in:

Provider	What for	Required
OpenAI	Main LLM	yes
Daily	WebRTC transport	yes (free tier covers dev)
Deepgram	Speech-to-text — Nova-3 covers all 37 languages	yes
Cartesia	Agent's voice (text-to-speech)	yes
Google AI Studio	Gemini — conversation-history summarisation	yes

See packages/agent-server/.env.example.

Want a different LLM? The agent server talks to LLMs through
LangChain, so switching providers is a
LangChain swap — Anthropic, Google, Mistral, Cohere, local models via
Ollama / vLLM, anything LangChain supports. Two call sites:
brain/processor.py for
the main agent loop (currently ChatOpenAI) and
brain/conversation_history.py
for the cheap summarizer (currently ChatGoogleGenerativeAI).

Want a different STT/TTS/Transport provider? All voice services and the transport service are drop-in
Pipecat adapters — swap them in bot.py and you can run on Whisper,
ElevenLabs, Riva, AssemblyAI, SmallWebRTC, etc. See the
Pipecat services docs.

Repo layout

aelios-spark/
├── packages/
│   ├── widget/         the embeddable JS — runs in your users' browsers
│   └── agent-server/   the Python voice agent — you run this
├── examples/
│   └── tracker/        full sample app showing how to wire everything up
├── docs/               deep documentation (read these — see below)
├── CONTRIBUTING.md     dev setup, test architecture, PR process
└── LICENSE             Apache 2.0

Deep documentation

One doc per concern. The README is the orientation; these are the
manual.

Doc	What it covers
`docs/quickstart.md`	Step-by-step setup with troubleshooting
`docs/architecture.md`	The agent server end-to-end: Pipecat pipeline, processor state machine, priority queue, five wake modes, tool dispatcher, demonstrations, screenshot service, conversation history, watchdogs, RTVI custom-message protocol
`docs/modes.md`	Action vs guide mode — the schema differences, the two-trigger rule, confirmation flow, screenshot behaviour, when to use each
`docs/widget.md`	Widget bundle anatomy, connection state machine, session timing rules (90-min cap, 6-min connecting timeout, etc.), idle protocol, error states, mock mode, theming
`docs/tools.md`	Writing tool definitions — when to call, return values, parallel batches, confirmation flow, common patterns
`docs/configuration.md`	Every config knob — widget-side (`AeliosSpark.configure(...)`) and server-side (`aelios-spark.config.yaml`), env vars, provider swaps
`packages/agent-server/tests/README.md`	Three-layer test architecture (unit / processor / real-LLM-judge), when to add tests at which layer

Read in roughly that order if you want to understand the whole
system.

Built on Pipecat

The agent server is built on top of
Pipecat, the open-source
framework for voice + multimodal conversational AI. All STT/TTS/
transport wrappers live in packages/agent-server/adapters/ —
swap in any of Pipecat's services
and Aelios Spark keeps working.

Contributing

PRs welcome — see CONTRIBUTING.md for dev setup,
the three-layer test contract, the contributions matrix, and code
style.

Aelios Spark is a real OSS project backed by a real production agent loop, so
changes that touch the agent state machine get reviewed carefully.
The "Reviewed carefully" rows in CONTRIBUTING flag exactly which
areas those are.

Managed offering

For production, Aelios AI wraps the OSS agent
code path with the surfaces a serious deployment actually needs:

Autoscaling, multi-tenant agents, hosted control plane — no
infra to operate.
Observability — per-session traces, transcripts, tool
call/result audit, latency breakdowns.
Continuous-learning loops — session analytics feed back into
the agent's persona / KB / tool descriptions so the agent gets
better at your specific software over time.
Video demo agent — a separate agent product that learns your
software's UI from your docs + recorded screen flows, then drives
on-screen demo videos hands-free. Runs 24/7 so prospects can
watch a live product walk-through any time without sales-team
scheduling. Same conversational core as the widget; different
delivery surface.

Graduate when you outgrow self-hosting.

License

Apache 2.0.
