Speech Stacks from LLM Vendors: Latency, Diarization Limits & Per-Minute Pricing Compared (2026 Guide)
By Sam Qikaka
Category: Models & Releases
Enterprise leaders building voice agents need clear comparisons of speech stacks from top LLM vendors like OpenAI and Google. This guide breaks down STT/TTS latency, streaming capabilities, diarization limits, and official per-minute costs as of May 2026.
Understanding Speech Stacks in LLM Ecosystems
Speech stacks refer to the integrated pipelines of speech-to-text (STT), text-to-speech (TTS), and streaming models offered by major LLM vendors. These components are essential for building low-latency voice agents, especially in enterprise settings like multi-agent platforms such as LUMOS, where real-time conversation handling is critical.
In LLM ecosystems, speech stacks enable end-to-end voice AI: audio input is transcribed via STT, processed by an LLM for reasoning or orchestration, and converted back to speech via TTS. Key challenges include achieving sub-500ms end-to-end latency, handling multi-speaker diarization, and managing costs at scale. For B2B operations, selecting the right stack involves balancing accuracy, speed, and integration with tools like RAG pipelines.
This guide focuses on vendors with native LLM-speech integrations,
drawing from official documentation as of May 4, 2026.

Key Vendors and Their Speech Model Lineups
Leading LLM vendors provide specialized speech models alongside their core LLMs. Here is a lineup of exact model IDs from official sources:
OpenAI: STT via [...], [...], and [...]; TTS with [...], [...], and [...]. The Realtime API unifies streaming STT/LLM/TTS for voice agents.
Google Cloud: Speech-to-Text (v2 with models like [...] or [...]), Text-to-Speech (WaveNet/Chirp-HD voices), and multimodal AudioPaLM integrations via Vertex AI.
IBM: Granite Speech 3.3 (open-source focused STT/TTS) via watsonx, with streaming support.
Microsoft Azure: Speech Services (Custom Neural Voice, Whisper-based STT) integrated with Azure OpenAI.
Amazon: Transcribe (medical/real-time streaming) and Polly TTS, accessible via Bedrock.
These lineups evolve; always check vendor pricing pages (e.g., openai.com/pricing, cloud.google.com/speech-to-text/pricing) for the latest model IDs. For LUMOS integration, OpenAI's Realtime API and Google's streaming endpoints plug directly into orchestration layers, reducing the need for custom VAD/endpointer logic.

Speech-to-Text (STT) Latency and Accuracy Breakdown
For streaming models, STT latency is measured as time-to-first-token (TTFT), which is critical for real-time agents. Official benchmarks (as of May 4, 2026):
OpenAI [...]: Batch mode 300-500ms TTFT; accuracy 96%+ on common benchmarks like LibriSpeech. Streaming via the Realtime API targets <250ms.
Google Speech-to-Text v2 ([...] model): Streaming latency <200ms, with 95%+ word error rate reduction via enhanced models.
IBM Granite Speech 3.3: Open-source streaming STT with 250ms latency in benchmarks, strong on noisy audio.
Tradeoffs: Whisper excels in multilingual accuracy but lags in ultra-low latency without the Realtime API. Google offers better noise robustness for enterprise calls. Test via vendor playgrounds with your own audio profiles.

Text-to-Speech (TTS) Models: Prosody, Speed & Naturalness
TTS quality hinges on prosody (natural intonation), synthesis speed, and voice variety:
OpenAI [...] and [...]: 2.5x faster than predecessors, with high naturalness scores (MOS 4.8). [...] adds emotional expressiveness.
Google Text-to-Speech (Chirp-HD): Neural2 voices with prosody control, synthesis <150ms for short utterances.
IBM Granite TTS: Open-weights models emphasize custom voice cloning, suitable for branded agents.
In LUMOS pipelines, low-latency TTS like OpenAI's [...] prevents "robotic pauses" in multi-turn dialogues. Benchmarks show Google's WaveNet edging ahead in long-form naturalness.

Streaming Capabilities and End-to-End Latency
Streaming speech models enable incremental processing, vital for <300ms voice loops:
OpenAI Realtime API: Combines STT/LLM/TTS in one WebSocket; end-to-end latency of 200-400ms reported in production (openai.com/docs/guides/realtime).
Google Vertex AI: Streaming STT/TTS with AudioPaLM; supports barge-in and semantic turn detection, TTFT <250ms.
Others: Azure RealTime supports WebRTC; AWS Transcribe Medical streams with <300ms latency.
Integration tip for LUMOS: Use WebRTC transport to minimize telephony latency (50ms), pairing it with vendor streaming to hit sub-500ms end-to-end. Sequential batch STT/TTS adds 1-2s of delay, so avoid it for agents.

Diarization Limits Across Vendor Models
Speaker diarization separates voices in multi-speaker audio, with limits on maximum speakers and accuracy:
OpenAI [...]: No native diarization; use post-processing or Realtime API extensions (up to 2-4 speakers reliably).
Google Speech-to-Text: Supports up to 30 speakers in enhanced models, with 85%+ diarization accuracy.
IBM Granite Speech: Handles 10+ speakers in open-source setups, ideal for meetings.
Azure Speech: Diarization for up to 10 speakers as standard.
Enterprise tradeoff: For call centers (2 speakers), any of these suffices; larger meetings need Google's higher speaker limit. LUMOS users can layer open-source diarization (e.g., pyannote) on Whisper for cost savings.

Per-Minute Pricing Comparison with Of