xAI Speech-to-Text API in general availability: 25 languages, batch and streaming
Why it matters
xAI has announced the general availability of its Speech-to-Text API, supporting transcription in 25 languages through batch and streaming modes. The announcement comes one month after the Text-to-Speech API reached general availability in March 2026. With it, xAI completes its audio stack alongside the Grok language models and enters direct competition with OpenAI Whisper, Google Cloud Speech-to-Text, and Azure Speech.
xAI has announced that its Speech-to-Text (STT) API has moved from beta to general availability. The announcement appears in the service release notes on docs.x.ai, dated April 2026. Although it contains no pricing details or technical architecture specifications, it signals a clear direction: Elon Musk's AI company is rounding out its audio offering and entering direct competition with established ASR (Automatic Speech Recognition) platforms.
What STT brings
The release notes list two operational modes and language coverage:
- 25 languages supported for speech-to-text transcription
- Batch mode for processing complete audio files
- Streaming mode for live transcription from a continuous audio stream
Batch mode fits scenarios where processing can be deferred: transcribing podcasts, video files, or call center recordings. Streaming mode serves real-time applications: live captioning, voice assistants, and interactive dialog systems.
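The difference between the two modes can be sketched with a minimal client helper. Note that the endpoint path, field names, and chunk size below are illustrative assumptions, not documented xAI API details; the real request format lives in the Speech to Text docs:

```python
import io

# Assumed endpoint path for illustration only -- consult the xAI
# Speech to Text docs for the actual URL and request schema.
STT_URL = "https://api.x.ai/v1/audio/transcriptions"

def build_batch_request(audio_bytes: bytes, language: str = "en") -> dict:
    """Batch mode: the complete audio file is uploaded in a single
    request and the full transcript comes back in one response."""
    return {
        "url": STT_URL,
        "files": {"file": ("audio.wav", io.BytesIO(audio_bytes), "audio/wav")},
        "data": {"language": language, "mode": "batch"},  # assumed fields
    }

def stream_chunks(audio_bytes: bytes, chunk_size: int = 3200):
    """Streaming mode: audio is sent in small chunks as it is captured
    (here, 3200 bytes = 100 ms of 16 kHz 16-bit mono), and the server
    returns partial transcripts with low latency."""
    for offset in range(0, len(audio_bytes), chunk_size):
        yield audio_bytes[offset : offset + chunk_size]
```

In practice, the batch request would be one HTTP POST, while the streaming chunks would be pushed over a persistent connection (typically WebSocket or gRPC) as the microphone produces them.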
Context: completing the audio stack
About a month earlier, in March 2026, xAI announced the general availability of its Text-to-Speech (TTS) API, which produces natural-sounding speech from text using Grok. Together with today's STT announcement, xAI now has a complete audio pipeline:
- Audio input → STT → text
- Text → Grok (reasoning and response) → text
- Text → TTS → audio output
For developers, this means voice assistants, multilingual transcription services, and real-time dialog systems can be built without stitching together three different providers. All components work through the same API key and the same billing account.
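The three stages above compose as plain function application: audio in, transcript, Grok response, audio out. The function names and signatures here are illustrative stand-ins for the real API calls, not actual xAI SDK methods:

```python
from typing import Callable

def make_voice_pipeline(
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Callable[[bytes], bytes]:
    """Compose the full loop: audio input -> STT -> text -> Grok
    (reasoning and response) -> text -> TTS -> audio output."""
    def pipeline(audio_in: bytes) -> bytes:
        transcript = stt(audio_in)   # speech-to-text stage
        reply = llm(transcript)      # Grok reasoning stage
        return tts(reply)            # text-to-speech stage
    return pipeline

# Stub stages stand in for the network calls so the flow is runnable:
demo = make_voice_pipeline(
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: f"echo: {text}",
    tts=lambda text: text.encode("utf-8"),
)
```

The point of the single-provider story is that each of the three lambdas would be replaced by a call authenticated with the same API key, rather than by three clients from three vendors.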
Market position
The ASR market is already crowded: OpenAI Whisper dominates the open-source segment, Google Cloud Speech-to-Text is the enterprise standard, Microsoft Azure Speech covers complex multilingual use cases, and specialized players like Deepgram and AssemblyAI hold low-latency niches.
xAI enters this space with a strategy of deep integration with Grok rather than standalone ASR superiority. The goal is not for xAI STT to top every benchmark, but to be the easiest path to a complete multimodal application for developers already using xAI for text.
The 25-language figure leaves xAI well behind OpenAI Whisper, which supports roughly 100 languages, and Google's Speech-to-Text, which covers over 125. Still, for English, the major European languages, and several major Asian languages, the coverage is sufficient for the bulk of global applications.
xAI's documentation directs developers to the Speech to Text docs for details on pricing, quotas, and the full language list. The announcement is part of the platform's ongoing expansion during 2026, following the earlier releases of the Grok 3, 4, and 4.20 models.
This article was generated using artificial intelligence from primary sources.