Voice Assistant
Real-time voice assistant for OpenClaw.
Version 1.0.0
# Voice Assistant
Real-time voice interface for your OpenClaw agent. Talk to your agent and hear it respond — with configurable STT and TTS providers, full streaming at every stage, and sub-2 second time-to-first-audio.
## Architecture

```
Browser Mic → WebSocket → STT (Deepgram / ElevenLabs) → Text
  → OpenClaw Gateway (/v1/chat/completions, streaming) → Response Text
  → TTS (Deepgram Aura / ElevenLabs) → Audio chunks → Browser Speaker
```
The voice interface connects to your running OpenClaw gateway's OpenAI-compatible endpoint. It's the same agent with all its context, tools, and memory — just with a voice.
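Because the gateway speaks the OpenAI-compatible protocol, its streaming responses arrive as standard `data: {...}` / `data: [DONE]` SSE chunks. A minimal sketch of parsing one such line (the helper name is hypothetical; the chunk shape is the standard chat-completions format):

```python
import json

def delta_from_sse_line(line: str):
    """Extract the text delta from one SSE line of an OpenAI-compatible
    /v1/chat/completions stream; return None for non-content lines."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        return None  # end-of-stream sentinel
    chunk = json.loads(payload)
    # Role-only or empty deltas yield None via .get()
    return chunk["choices"][0]["delta"].get("content")

# Example chunk as an OpenAI-compatible gateway would emit it:
line = 'data: {"choices": [{"delta": {"content": "Hello"}}]}'
print(delta_from_sse_line(line))  # Hello
```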
## Quick Start

```bash
cd {baseDir}
cp .env.example .env
# Fill in your API keys and gateway URL
uv run scripts/server.py
# Open http://localhost:7860 and click the mic
```
## Supported Providers

### STT (Speech-to-Text)
| Provider | Model | Latency | Notes |
|---|---|---|---|
| Deepgram | nova-2 (streaming) | ~200-300ms | WebSocket streaming, best accuracy/speed |
| ElevenLabs | Scribe v1 | ~300-500ms | REST-based, good multilingual |
### TTS (Text-to-Speech)
| Provider | Model | Latency | Notes |
|---|---|---|---|
| Deepgram | aura-2 | ~200ms | WebSocket streaming, low cost |
| ElevenLabs | Turbo v2.5 | ~300ms | Best voice quality, streaming |
## Configuration

All configuration is via environment variables in `.env`:
```bash
# === Required ===
OPENCLAW_GATEWAY_URL=http://localhost:4141/v1   # Your OpenClaw gateway
OPENCLAW_MODEL=claude-sonnet-4-5-20250929       # Model your gateway routes to

# === STT Provider (pick one) ===
VOICE_STT_PROVIDER=deepgram          # "deepgram" or "elevenlabs"
DEEPGRAM_API_KEY=your-key-here       # Required if STT=deepgram
ELEVENLABS_API_KEY=your-key-here     # Required if STT=elevenlabs

# === TTS Provider (pick one) ===
VOICE_TTS_PROVIDER=elevenlabs        # "deepgram" or "elevenlabs"
# Uses the same API keys as above

# === Optional Tuning ===
VOICE_TTS_VOICE=rachel               # ElevenLabs voice name/ID
VOICE_TTS_VOICE_DG=aura-2-theia-en   # Deepgram Aura voice
VOICE_VAD_SILENCE_MS=400             # Silence before end-of-turn (ms)
VOICE_SAMPLE_RATE=16000              # Audio sample rate
VOICE_SERVER_PORT=7860               # Server port
VOICE_SYSTEM_PROMPT=""               # Optional system prompt override
```
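A sketch of how the server might read and validate these variables (the helper and its dict layout are hypothetical; the variable names, defaults, and provider/key pairing come from the table above):

```python
import os

VALID_PROVIDERS = {"deepgram", "elevenlabs"}

def load_voice_config(env=os.environ):
    """Read the voice settings with their documented defaults and
    check that each chosen provider has its matching API key."""
    cfg = {
        "gateway_url": env.get("OPENCLAW_GATEWAY_URL", "http://localhost:4141/v1"),
        "model": env.get("OPENCLAW_MODEL", "claude-sonnet-4-5-20250929"),
        "stt": env.get("VOICE_STT_PROVIDER", "deepgram"),
        "tts": env.get("VOICE_TTS_PROVIDER", "elevenlabs"),
        "vad_silence_ms": int(env.get("VOICE_VAD_SILENCE_MS", "400")),
        "sample_rate": int(env.get("VOICE_SAMPLE_RATE", "16000")),
        "port": int(env.get("VOICE_SERVER_PORT", "7860")),
    }
    for role in ("stt", "tts"):
        if cfg[role] not in VALID_PROVIDERS:
            raise ValueError(f"VOICE_{role.upper()}_PROVIDER must be one of "
                             f"{sorted(VALID_PROVIDERS)}")
    # Each configured provider requires its own API key.
    key_var = {"deepgram": "DEEPGRAM_API_KEY", "elevenlabs": "ELEVENLABS_API_KEY"}
    for role in ("stt", "tts"):
        if key_var[cfg[role]] not in env:
            raise ValueError(f"{key_var[cfg[role]]} is required for {role}={cfg[role]}")
    return cfg
```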
## Provider Combinations
| Setup | Best For |
|---|---|
| Deepgram STT + ElevenLabs TTS | Best quality voice output |
| Deepgram STT + Deepgram TTS | Lowest latency, single vendor |
| ElevenLabs STT + ElevenLabs TTS | Best multilingual support |
## How It Works

1. **Browser captures mic audio** via Web Audio API and streams raw PCM over a WebSocket
2. **Server receives audio** and pipes it to the configured STT provider's streaming endpoint
3. **STT returns partial transcripts** in real time; on end-of-utterance the full text is sent to the OpenClaw gateway
4. **OpenClaw gateway streams** the LLM response token-by-token via SSE (Server-Sent Events)
5. **Tokens are accumulated** into sentence-sized chunks and streamed to the TTS provider
6. **TTS returns audio chunks** that are immediately forwarded to the browser over the same WebSocket
7. **Browser plays audio** using the Web Audio API with a jitter buffer for smooth playback
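The token-accumulation step can be sketched as a generator that buffers streamed tokens and emits a chunk at each sentence boundary once it is long enough to be worth a TTS request (a hypothetical sketch; the real server's chunking heuristics may differ):

```python
import re

SENTENCE_END = re.compile(r'([.!?])(\s|$)')

def sentence_chunks(tokens, min_chars=20):
    """Accumulate streamed LLM tokens; yield sentence-sized chunks
    suitable for streaming to a TTS provider."""
    buf = ""
    for tok in tokens:
        buf += tok
        m = SENTENCE_END.search(buf)
        # Flush complete sentences, but only once the chunk is long
        # enough to avoid firing TTS on tiny fragments like "OK."
        while m and m.end() >= min_chars:
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
            m = SENTENCE_END.search(buf)
    if buf.strip():
        yield buf.strip()  # flush whatever remains at end-of-stream
```

With `min_chars=5`, the token stream `["Hello", " world", ". ", "Bye", "."]` yields `"Hello world."` as soon as its boundary arrives, then `"Bye."` at end-of-stream.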
## Interruption Handling (Barge-In)
When the user starts speaking while the agent is still talking:
- Current TTS audio is immediately cancelled
- The agent stops its current response
- New STT session begins capturing the user's interruption
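The cancellation step above can be sketched with an asyncio task per speaking turn, cancelled the moment VAD detects user speech (class and method names are hypothetical; the server's internals may differ):

```python
import asyncio

class TurnManager:
    """Tracks the agent's current speaking turn so a user interruption
    (barge-in) can cancel TTS playback mid-stream."""

    def __init__(self):
        self._speaking = None  # asyncio.Task for the active turn, if any

    def start_speaking(self, coro):
        self._speaking = asyncio.create_task(coro)
        return self._speaking

    def barge_in(self):
        """Called when STT/VAD detects user speech during playback."""
        if self._speaking and not self._speaking.done():
            self._speaking.cancel()  # stops the TTS stream immediately

async def demo():
    tm = TurnManager()

    async def speak():
        try:
            await asyncio.sleep(10)   # stands in for streaming TTS audio
            return "finished"
        except asyncio.CancelledError:
            return "interrupted"      # clean up and report interruption

    task = tm.start_speaking(speak())
    await asyncio.sleep(0.01)         # user starts talking mid-playback
    tm.barge_in()
    return await task

print(asyncio.run(demo()))  # interrupted
```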
## Usage Examples

```text
User: "Hey, set up my voice assistant"
→ OpenClaw runs: cd {baseDir} && cp .env.example .env
→ Opens .env for the user to fill in API keys
→ Runs: uv run scripts/server.py

User: "Start a voice chat"
→ Opens http://localhost:7860 in the browser

User: "Switch TTS to Deepgram"
→ Updates VOICE_TTS_PROVIDER=deepgram in .env
→ Restarts the server
```
## Troubleshooting

- No audio output? Check that your TTS API key is valid and the provider is set correctly
- High latency? Use Deepgram for both STT and TTS; ensure your gateway is on the same network
- Cuts off speech? Increase `VOICE_VAD_SILENCE_MS` to 600-800ms
- Echo/feedback? Use headphones, or enable the built-in echo cancellation in the browser UI
## Latency Budget
| Stage | Target | Actual (typical) |
|---|---|---|
| Audio capture + VAD | <200ms | ~100-150ms |
| STT transcription | <400ms | ~200-400ms |
| OpenClaw LLM first token | <1500ms | ~500-1500ms |
| TTS first audio chunk | <400ms | ~200-400ms |
| Total first audio | <2.5s | ~1.0-2.5s |
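One way to check your own numbers against this budget is to timestamp each stage of a turn with a monotonic clock (a hypothetical instrumentation sketch, not part of the shipped server):

```python
import time

class StageTimer:
    """Record per-stage latencies (in ms) for one voice turn."""

    def __init__(self):
        self.t0 = time.monotonic()
        self.stages = {}

    def mark(self, name):
        """Close the current stage under `name` and start the next one."""
        now = time.monotonic()
        self.stages[name] = (now - self.t0) * 1000.0
        self.t0 = now

timer = StageTimer()
time.sleep(0.01)              # stands in for STT transcription work
timer.mark("stt")
time.sleep(0.01)              # stands in for waiting on the LLM
timer.mark("llm_first_token")
# sum(timer.stages.values()) approximates total time-to-first-audio
```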
## Installation

```bash
openclaw install voice-assistant
```