What you’ll learn: Detailed breakdown of latencies across all Vodex system components including ASR, AI models, and TTS providers with performance characteristics and optimization notes.
ASR (Automatic Speech Recognition) Latencies
Speech recognition is the first step in processing user input. Different ASR providers offer varying latency characteristics based on their processing approach.

Streaming ASRs
Real-time speech recognition that processes audio as it’s being spoken:
- Alpha Echo V2
- Nova Echo
Ultra-Low Latency Processing
Characteristics:
| Metric | Value |
| --- | --- |
| Base Latency | 250ms |
| Processing Type | Streaming |
| Best For | Real-time conversations, interactive scenarios |
- Fastest streaming ASR option
- Optimized for conversational AI
- Minimal delay in speech recognition
Streaming ASR Timing: For streaming ASRs, we apply adaptive timing: a 300ms or 1-second silence window, chosen based on user activity, to avoid cutting users off mid-sentence.
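One way to picture this adaptive timing is as an endpointing decision. The sketch below is illustrative only: the 300ms and 1-second values come from the note above, but the punctuation-based heuristic for detecting a mid-sentence pause is an assumption, not the Vodex implementation.

```python
# Sketch of adaptive end-of-speech timing for a streaming ASR.
# The 300ms / 1s windows are from the guide; the activity heuristic
# below is a hypothetical example.

def endpoint_window_ms(user_is_mid_utterance: bool) -> int:
    """Pick the silence window to wait before finalizing a transcript."""
    # If the user appears to be mid-sentence, wait the longer 1s window
    # so we don't cut them off; otherwise finalize after 300ms of silence.
    return 1000 if user_is_mid_utterance else 300

def looks_mid_utterance(partial_transcript: str) -> bool:
    """Crude heuristic: no terminal punctuation suggests more speech is coming."""
    text = partial_transcript.strip()
    return bool(text) and text[-1] not in ".?!"

print(endpoint_window_ms(looks_mid_utterance("What is my balance")))   # longer window
print(endpoint_window_ms(looks_mid_utterance("What is my balance?")))  # short window
```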
Non-Streaming ASRs
Batch processing ASRs that analyze complete audio segments:

Google ASR
Enterprise-Grade Recognition
Best for: High-accuracy requirements, complex audio processing
| Metric | Value |
| --- | --- |
| Processing Latency | 2 seconds |
| VAD Time | 300ms (fixed) |
| Total Latency | ~2.3 seconds |
OpenAI Whisper
Advanced AI-Powered ASR
Best for: Multi-language support, challenging audio conditions
| Metric | Value |
| --- | --- |
| Processing Latency | 700ms |
| VAD Time | 300ms (fixed) |
| Total Latency | ~1 second |
ElevenLabs Whisper
Optimized Whisper Implementation
Best for: Fast Whisper processing, quality audio recognition
| Metric | Value |
| --- | --- |
| Processing Latency | 400ms |
| VAD Time | 300ms (fixed) |
| Total Latency | ~700ms |
Azure Speech Services
Microsoft Cloud ASR
Best for: Enterprise integrations, Microsoft ecosystem
| Metric | Value |
| --- | --- |
| Processing Latency | 800ms |
| VAD Time | 300ms (fixed) |
| Total Latency | ~1.1 seconds |
Genesis Echo
Specialized Voice Recognition
Best for: Specialized voice processing, custom implementations
| Metric | Value |
| --- | --- |
| Processing Latency | 400ms |
| VAD Time | 300ms (fixed) |
| Total Latency | ~700ms |
Non-Streaming ASR Note: All non-streaming ASRs include a fixed Voice Activity Detection (VAD) time of 300ms to ensure complete speech capture before processing.
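The totals in the tables above are simply processing latency plus the fixed 300ms VAD window. A quick sanity check, using the figures from this guide:

```python
# Total non-streaming ASR latency = processing latency + fixed VAD time.
VAD_MS = 300  # fixed Voice Activity Detection window (per the note above)

def total_asr_latency_ms(processing_ms: int) -> int:
    return processing_ms + VAD_MS

# Processing latencies from the tables in this section.
providers = [("Google ASR", 2000), ("OpenAI Whisper", 700),
             ("ElevenLabs Whisper", 400), ("Azure Speech Services", 800),
             ("Genesis Echo", 400)]

for name, processing in providers:
    print(f"{name}: ~{total_asr_latency_ms(processing)}ms total")
```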
AI Model Latencies
Time to first token generation - the critical metric for conversational responsiveness.

Important: Listed latencies represent time to first token. Complete response generation typically adds ~300ms after the first token is delivered.
Latency vs Capability Trade-off: Lower latency models are typically smaller LLMs that cannot handle longer prompts, complex rules, or extensive context. Choose based on your specific use case requirements.
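To make the first-token note concrete, a minimal calculation combining the table figures with the ~300ms completion tail mentioned above (the tail is this guide's rule of thumb, not a per-model measurement):

```python
# First-token latency is the caller's perceived "thinking time"; the
# complete reply typically lands ~300ms after the first token, per the
# note above.
RESPONSE_TAIL_MS = 300

def full_response_ms(first_token_ms: int) -> int:
    """Estimate time from request to complete response."""
    return first_token_ms + RESPONSE_TAIL_MS

print(full_response_ms(200))   # Spark Flash Lite: ~500ms to complete reply
print(full_response_ms(1500))  # Spark: ~1800ms to complete reply
```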
Vodex Optimized Models
Specially tuned models optimized for voice conversations:
- Spark Series
- Performance Comparison
Vodex Custom-Based Architecture
Performance Characteristics:
| Model | First Token Latency | Status | Best For |
| --- | --- | --- | --- |
| Spark | 1.5s | ✅ Stable | Complex reasoning, detailed conversations |
| Spark Flash | 400ms | ✅ Stable | Balanced performance, standard interactions |
| Spark Flash Lite | 200ms | ✅ Stable | Quick responses, simple tasks |
- Optimized for voice conversations
- Excellent reasoning capabilities
- Consistent performance across scenarios
OpenAI Models
Industry-leading AI models with varying performance characteristics:

GPT-4 Series
Advanced Reasoning Models
Best for: Complex conversations, detailed analysis, multimodal interactions
| Model | First Token Latency | Status | Characteristics |
| --- | --- | --- | --- |
| GPT-4o | 800ms | ✅ Stable | Multimodal capabilities, advanced reasoning |
| GPT-4o Mini | 500ms | ✅ Stable | Efficient multimodal processing |
| GPT-4.1 Mini | 500ms | ✅ Stable | Enhanced efficiency, cost-effective |
GPT-5 Series (Next Generation)
Cutting-Edge AI Models
| Model | First Token Latency | Status | Characteristics |
| --- | --- | --- | --- |
| GPT-5 Nano | 350ms | ⚠️ Latency issues | Ultra-efficient processing, limited context |
| GPT-5 Mini | 200ms | ⚠️ Latency issues | Fast responses, simplified reasoning |
| GPT-5 Chat | 500ms | ✅ Stable | Optimized for conversations |
GPT-5 Status: GPT-5 Nano and Mini are currently experiencing latency issues. Monitor performance closely if using these models.
Specialized Models
Purpose-Built AI Solutions
Best for: Specialized applications requiring high accuracy
| Model | First Token Latency | Status | Characteristics |
| --- | --- | --- | --- |
| OpenAI GSS | 1.2s | ✅ Stable | Specialized processing, high accuracy |
Open Source & Alternative Models
High-performance alternatives to proprietary models:
- Meta Llama
- DeepSeek
Open Source Excellence
| Model | First Token Latency | Status | Characteristics |
| --- | --- | --- | --- |
| Llama 3.3 70B | 450ms | ✅ Stable | Large parameter model, excellent reasoning |
| Llama 4 Maverick | 450ms | ✅ Stable | Next-generation open source |

Advantages:
- Open source flexibility
- Privacy-focused processing
- Customizable implementations
- Cost-effective scaling
TTS (Text-to-Speech) Latencies
Time to first audio chunk - critical for maintaining conversation flow.

TTS Streaming: All TTS providers support streaming, allowing audio playback to begin as soon as the first chunk is available, reducing perceived latency.
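The effect of streaming on perceived latency can be sketched as follows. The chunk arrival times here are made-up illustrative numbers, not measurements from any provider:

```python
# Why streaming TTS lowers perceived latency: playback starts at the
# first audio chunk instead of waiting for the whole utterance to be
# synthesized.

def perceived_latency_ms(chunk_arrival_times_ms: list[int], streaming: bool) -> int:
    """Time until the caller first hears audio."""
    if streaming:
        return chunk_arrival_times_ms[0]  # play as soon as chunk 1 arrives
    return chunk_arrival_times_ms[-1]     # wait for the final chunk

chunks = [250, 450, 650, 850]  # hypothetical arrival times of 4 chunks
print(perceived_latency_ms(chunks, streaming=True))   # 250
print(perceived_latency_ms(chunks, streaming=False))  # 850
```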
ElevenLabs (Default Provider)
Premium AI voice synthesis with multiple performance tiers:
- Turbo Series
- Flash Series
- Multilingual Series
High-Quality Voice Synthesis
Characteristics:
| Model | First Chunk Latency | Quality | Best For |
| --- | --- | --- | --- |
| Turbo 2 | 250ms | High | Professional conversations, customer service |
| Turbo 2.5 | 250ms | Enhanced | Premium voice quality, sales calls |
- Premium voice quality
- Natural emotional expression
- Multiple voice personalities
- Excellent for professional use
ElevenLabs Selection: Use Flash series for real-time interactions where speed is critical, and Turbo series for professional scenarios where voice quality is paramount.
Alternative TTS Providers
Additional voice synthesis options for specialized needs:

RimeLabs
Advanced Voice Technology
| Metric | Value |
| --- | --- |
| First Chunk Latency | 350ms |
| Quality | High |
| Availability | Contact support |

Best for:
- Specialized voice requirements
- Custom voice training
- Enterprise implementations
Google UR Realistic
WaveNet Technology
| Metric | Value |
| --- | --- |
| First Chunk Latency | 400ms |
| Quality | Ultra-realistic |
| Availability | Contact support |

Best for:
- Ultra-realistic voice requirements
- Multi-language campaigns
- Google Cloud integrations
Azure Cognitive Services
Microsoft Neural Voices
| Metric | Value |
| --- | --- |
| First Chunk Latency | 800ms |
| Quality | Enterprise-grade |
| Availability | Contact support |

Best for:
- Enterprise applications
- Microsoft ecosystem integration
- Multi-language support
Latency Optimization Strategies
Configuration Recommendations
- Real-Time Conversations
- Balanced Performance
- High-Quality Processing
Ultra-Low Latency Setup

Recommended Configuration:
- ASR: Alpha Echo V2 (250ms)
- Model: Spark Flash Lite (200ms) or GPT-5 Mini (200ms)*
- TTS: ElevenLabs Flash 2.5 (95ms)
Limitations: Ultra-low latency models cannot handle complex prompts, extensive context, or sophisticated rules. Best for simple, straightforward interactions only.
*GPT-5 Mini is currently experiencing latency issues.
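Summing the component figures above gives a rough end-to-end budget for this stack. Treat the result as a theoretical floor, since real calls add network and orchestration overhead on top:

```python
# Rough end-to-end latency budget for the ultra-low-latency stack
# (end of user speech -> first audio back). Figures are from this guide;
# network overhead is excluded, so this is a lower bound, not a guarantee.

pipeline = {
    "asr_alpha_echo_v2": 250,      # streaming ASR base latency (ms)
    "llm_spark_flash_lite": 200,   # first token (ms)
    "tts_flash_2_5_first_chunk": 95,  # first audio chunk (ms)
}

total_ms = sum(pipeline.values())
print(f"Theoretical floor: ~{total_ms}ms")  # ~545ms before network overhead
```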
Performance Monitoring
1. Baseline Measurement: Establish performance baselines.
- Monitor end-to-end conversation latency
- Track component-specific performance
- Document peak and average response times
2. Optimization Testing: A/B test configurations.
- Compare different ASR providers
- Test various AI model combinations
- Evaluate TTS provider performance
3. Continuous Monitoring: Set up ongoing performance tracking.
- Set up latency alerts
- Monitor for performance degradation
- Track user experience metrics
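The monitoring loop above can be sketched in a few lines: record per-turn latencies, track average and p95, and alert past a threshold. This is a minimal illustration, not a Vodex feature; the 1500ms threshold and sample values are arbitrary examples.

```python
# Minimal latency monitor: record samples, compute average and p95,
# and flag degradation past a configurable threshold.
import statistics

class LatencyMonitor:
    def __init__(self, alert_threshold_ms: float = 1500):
        self.samples: list[float] = []
        self.alert_threshold_ms = alert_threshold_ms

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def average(self) -> float:
        return statistics.mean(self.samples)

    def p95(self) -> float:
        # 19th of 19 cut points at n=20 is the 95th percentile.
        return statistics.quantiles(self.samples, n=20, method="inclusive")[-1]

    def should_alert(self) -> bool:
        return self.p95() > self.alert_threshold_ms

mon = LatencyMonitor()
for ms in [620, 700, 680, 900, 2400]:  # one slow outlier turn
    mon.record(ms)
print(round(mon.average()), mon.should_alert())  # 1060 True
```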
Technical Considerations
Latency Factors
Network Latency
External Factors
- Geographic distance to servers
- Network congestion and routing
- Internet service provider performance
- CDN and edge server optimization
Processing Load
System Performance
- Server capacity and utilization
- Concurrent request handling
- Model loading and initialization
- Resource allocation efficiency
Audio Quality
Input Characteristics
- Audio sample rate and quality
- Background noise levels
- Speaker clarity and volume
- Connection stability
Best Practices
ASR Optimization
Speech Recognition Best Practices
- Choose streaming ASR for real-time scenarios
- Consider non-streaming for accuracy-critical applications
- Account for VAD timing in latency calculations
- Test with representative audio samples
Model Selection
AI Model Optimization
- Match model complexity to use case requirements
- Monitor for model-specific latency issues
- Consider fallback options for unstable models
- Balance reasoning capability with response speed
TTS Configuration
Voice Synthesis Optimization
- Use streaming TTS for reduced perceived latency
- Select voice quality appropriate for use case
- Consider multiple TTS providers for redundancy
- Test voice quality across different scenarios
Next Steps
Ready to optimize your latency? Use this comprehensive guide to select the optimal configuration for your specific use case and performance requirements.
- Assess your requirements - Determine if you need real-time, balanced, or high-quality processing
- Test configurations - Experiment with different component combinations
- Monitor performance - Track latency metrics in production
- Optimize iteratively - Continuously refine based on real-world performance
Need help with configuration? Contact support@vodex.ai for personalized latency optimization assistance.
Related Configuration Settings
Configure these latency-related settings in your Vodex dashboard:

ASR Configuration
Speech Recognition Settings

Configure ASR providers and settings in Call Settings and Advanced Settings.
- Choose between streaming and non-streaming ASR
- Configure language codes and detection
- Set up custom ASR parameters
AI Model Selection
Model Configuration

Select and configure AI models in Call Settings.
- Choose from Vodex optimized models
- Configure OpenAI, Llama, and other providers
- Balance latency vs capability requirements
Voice/TTS Settings
Text-to-Speech Configuration

Configure voice providers and settings in Call Settings.
- Select ElevenLabs Turbo or Flash series
- Configure alternative TTS providers
- Optimize voice quality vs speed