Building Conversational AI with WebRTC and LLMs: Real-Time Voice Agents
A technical guide to building real-time voice AI agents using WebRTC for audio transport, speech-to-text, LLM reasoning, and text-to-speech in a low-latency pipeline.
Voice Is the Next Interface for AI Agents
Text-based AI interactions dominate today, but voice is the natural human communication medium. Building voice AI agents that feel conversational — with low latency, natural turn-taking, and contextual understanding — requires integrating multiple real-time systems: audio transport (WebRTC), speech recognition (STT), language model reasoning (LLM), and speech synthesis (TTS).
The technical challenge is latency. A human-to-human conversation has roughly 200-300ms of silence between turns. To feel natural, a voice AI agent must perceive speech, understand it, reason about a response, generate speech, and deliver audio within a similar window.
Architecture Overview
User's Browser
      |
      |  WebRTC (audio stream)
      v
Media Server (audio processing)
      |
      +--> VAD (Voice Activity Detection) --> STT (Speech-to-Text)
      |                                              |
      |                                              v
      |                                        LLM Reasoning
      |                                              |
      |                                              v
      +<--- Audio Stream <--- TTS (Text-to-Speech) <-+
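One way to picture this pipeline in code is a chain of async stages connected by queues, with VAD/STT feeding the LLM and the LLM feeding TTS. The sketch below uses stub stage functions passed in by the caller; the names and signatures are illustrative, not any particular SDK's API:

```python
import asyncio

async def run_pipeline(audio_frames, stt, llm, tts):
    """Chain the stages: audio frames -> transcript -> reply text -> audio."""
    transcript_q: asyncio.Queue = asyncio.Queue()
    reply_q: asyncio.Queue = asyncio.Queue()

    async def stt_stage():
        async for frame in audio_frames:
            text = await stt(frame)          # stub: one transcript per frame
            if text:
                await transcript_q.put(text)
        await transcript_q.put(None)         # end-of-stream sentinel

    async def llm_stage():
        while (utterance := await transcript_q.get()) is not None:
            await reply_q.put(await llm(utterance))
        await reply_q.put(None)

    async def tts_stage():
        audio_out = []
        while (reply := await reply_q.get()) is not None:
            audio_out.append(await tts(reply))
        return audio_out

    # Stages run concurrently, so later frames overlap with earlier synthesis.
    _, _, audio = await asyncio.gather(stt_stage(), llm_stage(), tts_stage())
    return audio
```

Because the stages run concurrently rather than sequentially, a long user utterance can already be mid-transcription while earlier replies are being synthesized, which is the structural basis for the latency optimizations discussed later.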
WebRTC: The Audio Transport Layer
WebRTC provides peer-to-peer real-time communication with built-in handling for NAT traversal, codec negotiation, and network adaptation. For voice AI, it solves critical problems:
- Low latency: Sub-100ms audio delivery over UDP with adaptive bitrate
- Echo cancellation: Built-in AEC prevents the agent from hearing its own voice through the user's speakers
- Noise suppression: Reduces background noise before audio reaches the STT model
- Browser support: No plugins required; works in all modern browsers
Server-Side WebRTC with Mediasoup or LiveKit
For production deployments, a media server sits between the user and the AI pipeline:
// LiveKit server-side participant (simplified)
import { RoomServiceClient, Room } from 'livekit-server-sdk';

const roomService = new RoomServiceClient(LIVEKIT_URL, API_KEY, API_SECRET);

// Create a room for the voice session
await roomService.createRoom({ name: 'voice-session-123' });

// AI agent joins as a participant
const agentToken = generateToken({ identity: 'ai-agent', roomName: 'voice-session-123' });
const room = await Room.connect(LIVEKIT_URL, agentToken);

// Receive audio from user
room.on('trackSubscribed', async (track) => {
  const audioStream = track.getMediaStream();
  await processAudioStream(audioStream);
});
Voice Activity Detection (VAD)
VAD determines when the user starts and stops speaking. This is critical for turn-taking:
- Silero VAD: Open-source model with high accuracy and low latency (< 10ms). The most popular choice for voice agent pipelines.
- WebRTC's built-in VAD: Lower accuracy but zero additional compute cost.
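To make the frame-level decision concrete, here is a minimal energy-threshold VAD with hangover smoothing. This is the family of technique WebRTC's built-in VAD belongs to, shown only as a sketch; the threshold and hangover values are arbitrary, and a production pipeline would use Silero VAD instead:

```python
def energy_vad(frames, threshold=500.0, hangover=3):
    """Classify each 20ms frame (a list of int16 samples) as speech or silence.

    A frame counts as speech if the mean absolute amplitude exceeds the
    threshold; 'hangover' keeps the decision open for a few trailing quiet
    frames so a short pause inside a word is not treated as end-of-utterance.
    """
    decisions = []
    quiet_run = hangover          # start in the 'silence' state
    for frame in frames:
        energy = sum(abs(s) for s in frame) / max(len(frame), 1)
        if energy > threshold:
            quiet_run = 0
        else:
            quiet_run += 1
        decisions.append(quiet_run < hangover)
    return decisions
```

The hangover counter is what turns a per-frame classifier into usable turn-taking: end-of-utterance is declared only after several consecutive quiet frames, not on the first one.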
Handling Interruptions
Natural conversation includes interruptions. When the user starts speaking while the agent is talking:
- Detect user speech onset via VAD
- Immediately stop TTS playback
- Discard any un-played generated audio
- Process the user's new utterance
- Generate a fresh response that acknowledges the interruption if appropriate
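The steps above map naturally onto task cancellation: playback runs as a cancellable task, and VAD speech onset triggers an interrupt. A sketch, assuming an asyncio-based pipeline (the controller and playback coroutine are hypothetical, not a library API):

```python
import asyncio

class PlaybackController:
    """Owns the current TTS playback task so barge-in can cancel it."""

    def __init__(self):
        self._task = None

    def start(self, playback_coro):
        """Begin speaking; keep a handle for possible interruption."""
        self._task = asyncio.ensure_future(playback_coro)

    async def interrupt(self):
        """Called on VAD speech onset: stop playback, drop queued audio."""
        if self._task and not self._task.done():
            self._task.cancel()              # stop TTS playback immediately
            try:
                await self._task
            except asyncio.CancelledError:
                pass                         # un-played audio is discarded
        self._task = None
```

After `interrupt()` returns, the pipeline is free to process the user's new utterance and generate a fresh response.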
Speech-to-Text Pipeline
Streaming STT
For low latency, STT must process audio incrementally rather than waiting for the complete utterance:
- Deepgram: Streaming API with 200-300ms latency, strong accuracy, and speaker diarization
- OpenAI Whisper (self-hosted): whisper.cpp or faster-whisper for on-premise deployments
- AssemblyAI: Real-time transcription with under 300ms latency
Optimizing STT Latency
- Stream audio in small chunks (20-100ms frames) rather than waiting for silence
- Use endpointing models that detect end-of-utterance faster than fixed silence timeouts
- Pre-warm STT connections to eliminate cold-start latency on the first utterance
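As an illustration of the first point, carving a capture buffer into fixed-size frames is straightforward. This sketch assumes 16 kHz, 16-bit mono PCM and a 20ms frame size; the constants are typical values, not requirements of any specific STT provider:

```python
SAMPLE_RATE = 16_000                                  # Hz, common STT input rate
FRAME_MS = 20                                         # small frames, not utterances
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000    # 320 samples
BYTES_PER_FRAME = SAMPLES_PER_FRAME * 2               # 16-bit PCM -> 640 bytes

def frames(pcm: bytes):
    """Yield successive 20ms PCM frames, ready to push to a streaming STT socket."""
    for off in range(0, len(pcm) - BYTES_PER_FRAME + 1, BYTES_PER_FRAME):
        yield pcm[off:off + BYTES_PER_FRAME]
```

Streaming these 640-byte frames as they arrive, rather than buffering until silence, is what lets the STT provider return interim transcripts while the user is still speaking.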
LLM Reasoning Layer
The LLM processes the transcribed text and generates a response. For voice, two optimizations are critical:
Streaming Token Generation
Start TTS on the first generated tokens without waiting for the complete response. This time-to-first-audio optimization can reduce perceived latency by 1-3 seconds on longer responses:
async def stream_llm_to_tts(transcript: str):
    buffer = ""
    async for chunk in llm.stream(messages=[{"role": "user", "content": transcript}]):
        buffer += chunk.text
        # Send to TTS at sentence boundaries for natural speech
        if buffer.endswith((".", "!", "?", ":")):
            audio = await tts.synthesize(buffer)
            await send_audio_to_user(audio)
            buffer = ""
Voice-Optimized Prompting
LLM responses for voice agents should be:
- Concise: 1-3 sentences per turn, not paragraphs
- Conversational: Use contractions, simple vocabulary, and natural phrasing
- Action-oriented: Confirm actions clearly ("I've updated your appointment to Thursday at 3 PM")
- Turn-taking aware: End with a question or clear stopping point
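The guidelines above can be encoded directly in the system prompt sent with each LLM call. The wording below is illustrative only, one possible phrasing of these constraints:

```python
# A system prompt encoding the voice-style guidelines (illustrative wording).
VOICE_SYSTEM_PROMPT = (
    "You are a voice assistant. Your replies are spoken aloud, so:\n"
    "- Answer in 1-3 short sentences; never use lists or markdown.\n"
    "- Use contractions and plain, conversational wording.\n"
    "- Confirm every action you take in one clear sentence.\n"
    "- End your turn with a question or a clear stopping point."
)

def build_messages(transcript: str) -> list[dict]:
    """Prepend the voice-style system prompt to each LLM call."""
    return [
        {"role": "system", "content": VOICE_SYSTEM_PROMPT},
        {"role": "user", "content": transcript},
    ]
```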
Text-to-Speech
Low-Latency TTS Options
| Provider | Latency | Quality | Streaming |
|---|---|---|---|
| ElevenLabs | 200-400ms | Very high | Yes |
| OpenAI TTS | 300-500ms | High | Yes |
| Cartesia | 100-200ms | High | Yes |
| XTTS v2 (open source) | 300-600ms | Good | Yes |
Voice Cloning and Consistency
Production voice agents need consistent voice characteristics across sessions. Most TTS providers support voice cloning from a short audio sample (10-30 seconds), allowing organizations to create branded agent voices.
End-to-End Latency Budget
For a natural-feeling conversation, the total pipeline latency should stay as close to one second as possible:
| Component | Target Latency |
|---|---|
| WebRTC transport | 50-100ms |
| VAD + endpointing | 200-300ms |
| STT transcription | 200-300ms |
| LLM time-to-first-token | 200-400ms |
| TTS time-to-first-audio | 150-300ms |
| Total | 800-1400ms |
Achieving the lower end of this range requires careful optimization at every stage, geographic co-location of services, and streaming throughout the pipeline rather than sequential processing.
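The totals in the table are simply the sums of the per-component bounds, which a small budget helper makes explicit (values copied from the table; the best case assumes every stage hits its floor at once):

```python
# Per-component latency budget in milliseconds (low, high), from the table.
BUDGET_MS = {
    "webrtc_transport": (50, 100),
    "vad_endpointing":  (200, 300),
    "stt":              (200, 300),
    "llm_first_token":  (200, 400),
    "tts_first_audio":  (150, 300),
}

def total_range(budget):
    """Sum the floors and ceilings of each stage's latency range."""
    lo = sum(low for low, _ in budget.values())
    hi = sum(high for _, high in budget.values())
    return lo, hi
```

Tracking a budget like this per stage in production metrics makes it obvious which component is blowing the end-to-end target.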
Production Considerations
- Fallback handling: When any pipeline component fails, the agent should gracefully communicate the issue rather than going silent
- Session persistence: Maintain conversation state across WebRTC reconnections (mobile users switching between WiFi and cellular)
- Recording and transcription: Log complete conversations for quality review, with appropriate privacy disclosures
- Scalability: WebRTC media servers need horizontal scaling for concurrent sessions, typically 50-200 sessions per server
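The fallback point can be sketched as a wrapper that substitutes a canned spoken apology when a stage errors or times out, rather than leaving the caller in silence (stage names, the exception set, and the timeout value are illustrative):

```python
import asyncio

FALLBACK_TEXT = "Sorry, I'm having trouble right now. Could you say that again?"

async def respond_with_fallback(stage, transcript, timeout_s=5.0):
    """Run one pipeline stage (e.g. the LLM call); on failure, return
    a spoken apology instead of going silent."""
    try:
        return await asyncio.wait_for(stage(transcript), timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        return FALLBACK_TEXT
```

The same pattern wraps each stage of the pipeline, so a dead STT or TTS backend degrades into an audible apology the user can react to.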
Sources: LiveKit Documentation | Deepgram Streaming API | Silero VAD