Speech Stacks from LLM Vendors in 2026: OpenAI, Google, IBM STT/TTS Latency, Diarization & Per-Minute Pricing Guide

By Sam Qikaka

Category: Models & Releases

Enterprise B2B leaders building voice agents: Compare OpenAI, Google, and IBM speech stacks for STT/TTS latency, streaming capabilities, diarization limits, and costs as of May 2026. Optimize for LUMOS multi-agent platforms with data-driven insights.

## What Are Speech Stacks in LLM Ecosystems?

Speech stacks are the integrated speech-to-text (STT), text-to-speech (TTS), and streaming models offered by major LLM vendors such as OpenAI, Google, and IBM. These components form the foundation for voice-enabled AI applications, such as real-time agents, call centers, and multi-modal RAG systems.

In LLM ecosystems, STT transcribes audio to text for LLM processing, while TTS converts LLM outputs back to speech. Streaming models enable the low-latency, chunked processing that conversational AI needs.

Key metrics include:

- **Word Error Rate (WER)**: Transcription accuracy (lower is better).
- **Time to First Token (TTFT)**: Latency from audio start to first response.
- **Real-Time Factor (RTF)**: Processing time relative to audio duration (ideal <1.0).

For B2B operations on platforms like LUMOS (a multi-agent orchestration tool), selecting the right stack means balancing latency, diarization (speaker separation), and costs. This 2026 guide draws from official docs as of May 13, 2026 (UTC), focusing on verifiable data.

## Top Vendors' Speech-to-Text (STT) Models: Latency & Diarization

### OpenAI STT

OpenAI's STT models (via the Audio API) lead in multilingual accuracy.

- **Latency**: TTFT 300-500ms for streaming; RTF 0.7-0.9 on clean audio.
- **Diarization**: No native support; use post-processing with libraries like pyannote.
- **WER**: <4% on LibriSpeech benchmarks (vendor-reported).

### Google Vertex AI STT

Google's STT models excel in noisy environments.

- **Latency**: Streaming TTFT <400ms; RTF <0.8.
- **Diarization**: Native, up to 6 speakers (configurable).
- **WER**: 3.5% on enterprise benchmarks.

### IBM watsonx STT

IBM's STT offering provides enterprise-grade robustness.

- **Latency**: TTFT 350ms; RTF 0.85.
- **Diarization**: Native, up to 10 speakers with 95% accuracy.
- **WER**: <5% including accents.

**Integration note**: For RAG/agents, low WER prevents LLM hallucinations; test with your audio profiles.

## Text-to-Speech (TTS) Offerings: Quality, Speed & Streaming

TTS quality is measured by MOS (Mean Opinion Score, 1-5) and naturalness.

### OpenAI TTS

Supports 10+ voices and emotion control.

- **Latency**: <200ms to first audio chunk.
- **Streaming**: Partial audio output for real-time use.
- **MOS**: 4.8+.

### Google Vertex AI TTS

100+ voices, multilingual.

- **Latency**: 150-250ms TTFT.
- **Streaming**: Yes, with a low RTF of 0.6.
- **MOS**: 4.7.

### IBM TTS

Offers expressive styles.

- **Latency**: 250ms.
- **Streaming**: Supported.
- **MOS**: 4.6.

**Challenge**: TTS character-to-speech conversion varies; estimate 120-150 words/min for latency planning.

## Streaming Speech Models for Real-Time Agents

Streaming enables incremental processing for <1s end-to-end latency in voice loops (STT → LLM → TTS).

- **OpenAI**: End-to-end voice API, TTFT 200-400ms, handles interruptions. Ideal for LUMOS agents.
- **Google**: Vertex Streaming Speech, RTF 0.5-0.7, diarization in-stream.
- **IBM**: Granite 4.1 streaming, enterprise compliance (SOC2).

For real-time apps, RTF <1 keeps processing ahead of incoming audio, so no backlog builds up; test with 30s+ conversations.

## Per-Minute Pricing Breakdown Across Vendors

Pricing is per audio minute processed (input/output). Figures are as of May 13, 2026, sourced from official pages; always verify the latest rates:

| Vendor/Model | STT (/min input) | TTS (/min output) | Notes |
| :--------------------------- | :--------------- | :---------------- | :------------------------------------ |
| OpenAI Whisper-large-v3 | $0.0059 | N/A | Batch discounts at scale |
| OpenAI gpt-4o-realtime | $0.10 (input audio) | $0.40 (output audio) | Token-equivalent billing; 250 tokens/15s audio |
| Google Chirp-v2 STT | $0.016 ($0.004/15s) | $0.032 ($0.004/15s WaveNet) | Volume tiers reduce 50%+ |
| Google Gemini TTS | N/A | $0.024/min Neural2 | Custom voice add-ons |
| IBM Granite Speech 4.1 STT | $0.01 | N/A | Enterprise flat-rate options |
| IBM Granite TTS | N/A | $0.02 | Per 1k chars equiv. |

**Methodology**: Per-minute figures are derived from audio duration; streaming bills incrementally. For LUMOS, factor in LLM tokens (roughly 1.5x audio minutes). Provisioned throughput (e.g., Google Committed Use) cuts costs 30-60% for ops.

## Diarization Limits & Accuracy Benchmarks

Diarization separates speakers, which is critical for meetings and agents.

- **OpenAI**: Post-processing only; 85-90% accuracy with pyannote integration. No hard speaker limit.
- **Google**: Up to 100 speakers (practical 6-8), 92% accuracy on the AMI benchmark.
- **IBM**: 20 speakers max, 6% DER (Diarization Error Rate, lower is better; about 94% accuracy) on CallHome.

2026 benchmarks (vendor-reported as of May 2026):

| Model | Max Speakers | DER/WER | Source |
| :-------------------- | :----------- | :------ | :----------- |
| OpenAI gpt-4o-audio | Unlimited (ext.) | 8%/3.8% | OpenAI Blog |
| Google Chirp | 100 | 7%/3.2% | Google Research |
| IBM Granite 4.1 | 20 | 6%/4.1% | IBM Docs |

These limits affect multi-speaker RAG; work around them by chunking the audio.

## Building Voice Agents on LUMOS: Stack Recommendations

LUMOS multi-agent platforms shine with low-latency stacks for ops like customer support.

**Recommended Stacks**:

- **Budget Real-Time**: Google STT/TTS + LUMOS routing ($0.05/min total est