xAI Speech-to-Text API in general availability: 25 languages, batch and streaming
Why it matters
xAI has announced the general availability of its Speech-to-Text API, supporting transcription in 25 languages through batch and streaming modes. The announcement comes one month after the Text-to-Speech API reached general availability in March 2026. With it, xAI completes its audio stack alongside the Grok language models and enters direct competition with OpenAI Whisper, Google Cloud Speech-to-Text, and Azure Speech.
xAI has announced that its Speech-to-Text (STT) API has moved from beta to general availability. The announcement appears in the service release notes on docs.x.ai, dated April 2026. Although it contains no pricing details or technical architecture specifications, it signals a clear direction: Elon Musk's AI company is rounding out its audio offering and entering direct competition with established ASR (Automatic Speech Recognition) platforms.
What STT brings
The release notes list two operational modes and language coverage:
- 25 languages supported for speech-to-text transcription
- Batch mode for processing complete audio files
- Streaming mode for live transcription from a continuous audio stream
Batch mode fits scenarios where processing can be deferred: transcribing podcasts, video files, or call center recordings. Streaming mode serves real-time applications: live captioning, voice assistants, and interactive dialog systems.
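The difference between the two modes can be sketched with a minimal client helper. Note that the endpoint path, field names, and chunk size below are illustrative assumptions, not documented xAI API details; the real request format lives in the Speech to Text docs:

```python
import io

# Assumed endpoint path for illustration only -- consult the xAI
# Speech to Text docs for the actual URL and request schema.
STT_URL = "https://api.x.ai/v1/audio/transcriptions"

def build_batch_request(audio_bytes: bytes, language: str = "en") -> dict:
    """Batch mode: the complete audio file is uploaded in a single
    request and the full transcript comes back in one response."""
    return {
        "url": STT_URL,
        "files": {"file": ("audio.wav", io.BytesIO(audio_bytes), "audio/wav")},
        "data": {"language": language, "mode": "batch"},  # assumed fields
    }

def stream_chunks(audio_bytes: bytes, chunk_size: int = 3200):
    """Streaming mode: audio is sent in small chunks as it is captured
    (here, 3200 bytes = 100 ms of 16 kHz 16-bit mono), and the server
    returns partial transcripts with low latency."""
    for offset in range(0, len(audio_bytes), chunk_size):
        yield audio_bytes[offset : offset + chunk_size]
```

In practice, the batch request would be one HTTP POST, while the streaming chunks would be pushed over a persistent connection (typically WebSocket or gRPC) as the microphone produces them.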
Context: completing the audio stack
About a month earlier, in March 2026, xAI announced the general availability of its Text-to-Speech (TTS) API, which produces natural-sounding speech from text using Grok. Together with today's STT announcement, xAI now has a complete audio pipeline:
- Audio input → STT → text
- Text → Grok (reasoning and response) → text
- Text → TTS → audio output
For developers, this means voice assistants, multilingual transcription services, and real-time dialog systems can be built without stitching together three different providers. All components work through the same API key and the same billing account.
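The three stages above compose as plain function application: audio in, transcript, Grok response, audio out. The function names and signatures here are illustrative stand-ins for the real API calls, not actual xAI SDK methods:

```python
from typing import Callable

def make_voice_pipeline(
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Callable[[bytes], bytes]:
    """Compose the full loop: audio input -> STT -> text -> Grok
    (reasoning and response) -> text -> TTS -> audio output."""
    def pipeline(audio_in: bytes) -> bytes:
        transcript = stt(audio_in)   # speech-to-text stage
        reply = llm(transcript)      # Grok reasoning stage
        return tts(reply)            # text-to-speech stage
    return pipeline

# Stub stages stand in for the network calls so the flow is runnable:
demo = make_voice_pipeline(
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: f"echo: {text}",
    tts=lambda text: text.encode("utf-8"),
)
```

The point of the single-provider story is that each of the three lambdas would be replaced by a call authenticated with the same API key, rather than by three clients from three vendors.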
Market position
The ASR market is already crowded: OpenAI Whisper dominates the open-source segment, Google Cloud Speech-to-Text is the enterprise standard, Microsoft Azure Speech covers complex multilingual use cases, and specialized players like Deepgram and AssemblyAI hold low-latency niches.
xAI enters this space with a strategy of deep integration with Grok rather than standalone ASR superiority. The goal is not for xAI STT to top every benchmark, but to be the easiest path to a complete multimodal application for developers already using xAI for text.
The 25-language figure leaves xAI well behind OpenAI Whisper, which supports roughly 100 languages, and Google's Speech-to-Text, which covers over 125. Still, for English, the major European languages, and several major Asian languages, the coverage is sufficient for the bulk of global applications.
xAI's documentation directs developers to the Speech to Text docs for details on pricing, quotas, and the full language list. The announcement is part of the platform's ongoing expansion during 2026, following the earlier releases of the Grok 3, 4, and 4.20 models.
This article was generated using artificial intelligence from primary sources.