
Voice Assistant

Real-time voice assistant for OpenClaw.

Rating: 4.9 (367 reviews) · Downloads: 49,615 · Version: 1.0.0


# Voice Assistant

Real-time voice interface for your OpenClaw agent. Talk to your agent and hear it respond — with configurable STT and TTS providers, full streaming at every stage, and sub-2 second time-to-first-audio.

## Architecture

```text
Browser Mic → WebSocket → STT (Deepgram / ElevenLabs) → Text
  → OpenClaw Gateway (/v1/chat/completions, streaming) → Response Text
  → TTS (Deepgram Aura / ElevenLabs) → Audio chunks → Browser Speaker
```

The voice interface connects to your running OpenClaw gateway's OpenAI-compatible endpoint. It's the same agent with all its context, tools, and memory — just with a voice.
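Since the gateway speaks the OpenAI-compatible chat-completions protocol, each finished utterance becomes one streaming request. A minimal sketch of building that payload (the helper name is illustrative; field names follow the standard chat-completions API):

```python
def build_chat_request(model: str, transcript: str, system_prompt: str = "") -> dict:
    """Build a streaming payload for the gateway's OpenAI-compatible
    /v1/chat/completions endpoint from one finished user utterance."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": transcript})
    # stream=True makes the gateway return tokens via SSE as they arrive
    return {"model": model, "messages": messages, "stream": True}
```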

## Quick Start

```bash
cd {baseDir}
cp .env.example .env
# Fill in your API keys and gateway URL
uv run scripts/server.py
# Open http://localhost:7860 and click the mic
```

## Supported Providers

### STT (Speech-to-Text)

| Provider   | Model              | Latency    | Notes                                    |
|------------|--------------------|------------|------------------------------------------|
| Deepgram   | nova-2 (streaming) | ~200-300ms | WebSocket streaming, best accuracy/speed |
| ElevenLabs | Scribe v1          | ~300-500ms | REST-based, good multilingual            |

### TTS (Text-to-Speech)

| Provider   | Model      | Latency | Notes                         |
|------------|------------|---------|-------------------------------|
| Deepgram   | aura-2     | ~200ms  | WebSocket streaming, low cost |
| ElevenLabs | Turbo v2.5 | ~300ms  | Best voice quality, streaming |

## Configuration

All configuration is via environment variables in `.env`:

```bash
# === Required ===
OPENCLAW_GATEWAY_URL=http://localhost:4141/v1    # Your OpenClaw gateway
OPENCLAW_MODEL=claude-sonnet-4-5-20250929        # Model your gateway routes to

# === STT Provider (pick one) ===
VOICE_STT_PROVIDER=deepgram                      # "deepgram" or "elevenlabs"
DEEPGRAM_API_KEY=your-key-here                   # Required if STT=deepgram
ELEVENLABS_API_KEY=your-key-here                 # Required if STT=elevenlabs

# === TTS Provider (pick one) ===
VOICE_TTS_PROVIDER=elevenlabs                    # "deepgram" or "elevenlabs"
# Uses the same API keys as above

# === Optional Tuning ===
VOICE_TTS_VOICE=rachel                           # ElevenLabs voice name/ID
VOICE_TTS_VOICE_DG=aura-2-theia-en               # Deepgram Aura voice
VOICE_VAD_SILENCE_MS=400                         # Silence before end-of-turn (ms)
VOICE_SAMPLE_RATE=16000                          # Audio sample rate
VOICE_SERVER_PORT=7860                           # Server port
VOICE_SYSTEM_PROMPT=""                           # Optional system prompt override
```
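Because each provider needs its own API key, a startup check saves a confusing runtime failure later. A sketch of such a validator, assuming the variable names above (the function itself is illustrative, not part of the skill):

```python
# Which API key each provider requires (per the .env reference above)
REQUIRED_KEY = {"deepgram": "DEEPGRAM_API_KEY", "elevenlabs": "ELEVENLABS_API_KEY"}

def validate_voice_env(env: dict) -> list:
    """Return a list of configuration errors; empty means the env is usable."""
    errors = []
    if not env.get("OPENCLAW_GATEWAY_URL"):
        errors.append("OPENCLAW_GATEWAY_URL is required")
    for role in ("STT", "TTS"):
        provider = env.get(f"VOICE_{role}_PROVIDER", "deepgram")
        if provider not in REQUIRED_KEY:
            errors.append(f"VOICE_{role}_PROVIDER must be 'deepgram' or 'elevenlabs'")
        elif not env.get(REQUIRED_KEY[provider]):
            errors.append(f"{REQUIRED_KEY[provider]} is required when {role} provider is {provider}")
    return errors
```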

## Provider Combinations

| Setup                           | Best For                      |
|---------------------------------|-------------------------------|
| Deepgram STT + ElevenLabs TTS   | Best quality voice output     |
| Deepgram STT + Deepgram TTS     | Lowest latency, single vendor |
| ElevenLabs STT + ElevenLabs TTS | Best multilingual support     |
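The table above is a straightforward lookup; a setup script could encode it as follows (the function name and priority labels are illustrative, not part of the skill):

```python
def recommend_combo(priority: str) -> tuple:
    """Map a user priority to an (STT, TTS) provider pair per the table above."""
    combos = {
        "quality": ("deepgram", "elevenlabs"),       # best voice output
        "latency": ("deepgram", "deepgram"),          # lowest latency, single vendor
        "multilingual": ("elevenlabs", "elevenlabs"),  # best multilingual support
    }
    if priority not in combos:
        raise ValueError(f"unknown priority: {priority}")
    return combos[priority]
```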

## How It Works

1. **Browser captures mic audio** via the Web Audio API and streams raw PCM over a WebSocket
2. **Server receives audio** and pipes it to the configured STT provider's streaming endpoint
3. **STT returns partial transcripts** in real time; on end-of-utterance the full text is sent to the OpenClaw gateway
4. **OpenClaw gateway streams** the LLM response token by token via SSE (Server-Sent Events)
5. **Tokens are accumulated** into sentence-sized chunks and streamed to the TTS provider
6. **TTS returns audio chunks** that are immediately forwarded to the browser over the same WebSocket
7. **Browser plays audio** using the Web Audio API with a jitter buffer for smooth playback
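Step 5 (accumulating tokens into sentence-sized TTS chunks) can be sketched as a small generator. This is a minimal illustration, not the skill's actual chunker; the `min_len` threshold keeps tiny fragments like "OK." from producing choppy audio:

```python
import re

def sentence_chunks(tokens, min_len=20):
    """Accumulate streamed LLM tokens and yield sentence-sized chunks for TTS."""
    buf = ""
    for tok in tokens:
        buf += tok
        # Flush once the buffer ends at sentence punctuation and is long enough
        if len(buf) >= min_len and re.search(r"[.!?]\s*$", buf):
            yield buf.strip()
            buf = ""
    # Flush whatever remains when the stream ends mid-sentence
    if buf.strip():
        yield buf.strip()
```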

## Interruption Handling (Barge-In)

When the user starts speaking while the agent is still talking:

- Current TTS audio is immediately cancelled
- The agent stops its current response
- A new STT session begins capturing the user's interruption
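In an asyncio server, barge-in typically amounts to cancelling the in-flight LLM/TTS task. A minimal sketch of that pattern (the class is illustrative, not the skill's implementation):

```python
import asyncio

class TurnManager:
    """Tracks the agent's current speaking turn so barge-in can cancel it."""

    def __init__(self):
        self._turn = None

    def start_turn(self, coro):
        # One task drives the whole LLM → TTS → browser pipeline for this turn
        self._turn = asyncio.ensure_future(coro)
        return self._turn

    def barge_in(self):
        """Cancel the in-flight turn when the user starts speaking."""
        if self._turn and not self._turn.done():
            self._turn.cancel()
            return True
        return False
```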

## Usage Examples

```text
User: "Hey, set up my voice assistant"
→ OpenClaw runs: cd {baseDir} && cp .env.example .env
→ Opens .env for the user to fill in API keys
→ Runs: uv run scripts/server.py

User: "Start a voice chat"
→ Opens http://localhost:7860 in the browser

User: "Switch TTS to Deepgram"
→ Updates VOICE_TTS_PROVIDER=deepgram in .env
→ Restarts the server
```

## Troubleshooting

- **No audio output?** Check that your TTS API key is valid and the provider is set correctly
- **High latency?** Use Deepgram for both STT and TTS; ensure your gateway is on the same network
- **Cuts off speech?** Increase `VOICE_VAD_SILENCE_MS` to 600-800ms
- **Echo/feedback?** Use headphones, or enable the built-in echo cancellation in the browser UI

## Latency Budget

| Stage                    | Target  | Actual (typical) |
|--------------------------|---------|------------------|
| Audio capture + VAD      | <200ms  | ~100-150ms       |
| STT transcription        | <400ms  | ~200-400ms       |
| OpenClaw LLM first token | <1500ms | ~500-1500ms      |
| TTS first audio chunk    | <400ms  | ~200-400ms       |
| Total first audio        | <2.5s   | ~1.0-2.5s        |
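As a sanity check, the per-stage targets above sum exactly to the 2.5 s total time-to-first-audio budget:

```python
# Per-stage targets from the latency budget table, in milliseconds
BUDGET_MS = {
    "audio_capture_vad": 200,
    "stt_transcription": 400,
    "llm_first_token": 1500,
    "tts_first_audio": 400,
}

def total_budget_ms(budget: dict) -> int:
    """Time-to-first-audio is the sum of the sequential pipeline stages."""
    return sum(budget.values())
```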

## Installation

```bash
openclaw install voice-assistant
```


Tags

#coding_agents-and-ides

Quick Info

- Category: Development
- Model: Claude 3.5
- Complexity: One-Click
- Author: charantejmandali18
- Last Updated: 3/10/2026
