What is Speech-to-Text (STT) and how does it differ from Text-to-Speech?

STT (Speech-to-Text) converts speech into text and is the foundation for voice assistants, transcription, and dictation. TTS (Text-to-Speech) works in reverse — it converts text into synthesized speech.

What is the difference between batch and streaming mode?

Batch mode processes an entire audio file at once and returns the transcript when processing finishes, which suits podcasts, interviews, and meeting recordings. Streaming mode returns the transcript in real time as the user speaks, which live assistants and dictation require.

How does xAI STT fit into the Grok ecosystem?

The Grok Voice Agent API has been generally available since December 2025. With STT now at GA as well, xAI has a complete voice stack: STT for input, Grok for reasoning, and TTS/Voice Agent for output, all from a single provider.

xAI Speech-to-Text API moves from beta to general availability

In April 2026, xAI announced in its release notes that its Speech-to-Text (STT) API is leaving beta and reaching general availability (GA). The service transcribes audio to text, supports 25 languages, and offers two operating modes: batch and streaming.

What exactly does the xAI STT API offer?

The key message from the documentation: “Transcribe audio to text in 25 languages with batch and streaming modes.” Batch mode is intended for processing entire audio files — meeting recordings, podcast episodes, interviews — where the entire file is sent to the API and the result is returned when transcription is complete.
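The release notes do not show the request format, so the following is only a sketch of the batch pattern: each complete file is sent once and a finished transcript comes back. The `transcribe_file` callable is a hypothetical stand-in for whatever call the xAI SDK or HTTP API actually exposes; it is injected so the surrounding logic can be read and tested without network access.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# Hypothetical signature: takes the raw bytes of one complete audio file and
# returns the finished transcript. This is NOT an actual xAI SDK function.
TranscribeFile = Callable[[bytes], str]


def transcribe_batch(paths: Iterable[Path], transcribe_file: TranscribeFile) -> Dict[str, str]:
    """Send each complete file once and collect the finished transcripts."""
    transcripts: Dict[str, str] = {}
    for path in paths:
        audio = path.read_bytes()  # batch mode: the whole file is available up front
        transcripts[path.name] = transcribe_file(audio)
    return transcripts
```

In a real integration, `transcribe_file` would wrap the provider call documented in the xAI console, and the loop would probably also need retries and error handling.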

Streaming mode, on the other hand, processes audio in real time. As the user speaks, partial transcripts are returned with low latency, which is essential for voice assistants, live subtitling, or dictation in applications.
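To illustrate what handling partial transcripts involves on the client side, here is a minimal sketch. It assumes a hypothetical event shape (`text` plus an `is_final` flag), which is a common convention among streaming speech APIs but is not confirmed by the xAI release notes. The point is that a newer partial replaces the previous one, while a final segment is committed for good.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptEvent:
    # Hypothetical event shape, not the actual xAI wire format.
    text: str
    is_final: bool


@dataclass
class LiveTranscript:
    committed: List[str] = field(default_factory=list)  # finalized segments
    interim: str = ""                                   # latest partial hypothesis

    def apply(self, event: TranscriptEvent) -> None:
        if event.is_final:
            # The segment will not change any more: keep it permanently.
            self.committed.append(event.text)
            self.interim = ""
        else:
            # A newer partial replaces the previous one rather than appending.
            self.interim = event.text

    def display(self) -> str:
        parts = self.committed + ([self.interim] if self.interim else [])
        return " ".join(parts)


live = LiveTranscript()
for ev in [
    TranscriptEvent("turn on", False),
    TranscriptEvent("turn on the lights", True),
    TranscriptEvent("in the", False),
]:
    live.apply(ev)
print(live.display())  # turn on the lights in the
```

A live-subtitling or dictation UI would call `display()` after every event, so the user sees text appear and settle as they speak.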

Support for 25 languages puts xAI in competitive territory with OpenAI Whisper and Google Cloud Speech-to-Text, although the exact list of languages is not specified in the published release notes.

What does GA status mean for developers?

The move from beta to GA carries several practical consequences. First, the API is available without a waitlist: any xAI user with an API key can start sending requests immediately. Second, GA typically brings firmer service-level commitments and a lower risk of breaking changes to the API contract.

Third, GA signals that xAI is ready to support production workloads, which matters for developers building commercial voice products. Specific pricing per minute of audio processing is not detailed in the published release notes, so developers need to check the current pricing in the xAI console.

How does it fit with Grok and the Voice Agent?

The Grok Voice Agent API has been generally available since December 2025, so STT reaching GA completes xAI's voice stack: STT for input (speech recognition), the Grok LLM for reasoning, and the Voice Agent for output (speech synthesis and conversation management).

This integration means developers building voice products can use a single provider instead of combining STT from one vendor (e.g., Whisper), an LLM from another, and TTS from a third. The advantages are one latency budget to tune, one SDK, and one bill.
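The three stages of the stack can be sketched as interfaces that one conversational turn passes through. All class and method names below are hypothetical and chosen for illustration only; none of them is an xAI SDK API. The sketch shows why the choice of provider is a wiring decision: the same `voice_turn` function works whether the three parts come from one vendor or three.

```python
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, prompt: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


def voice_turn(audio: bytes, stt: SpeechToText, llm: LanguageModel, tts: TextToSpeech) -> bytes:
    """One conversational turn: speech in, reasoning, speech out."""
    text = stt.transcribe(audio)   # input stage: speech recognition
    answer = llm.reply(text)       # reasoning stage
    return tts.synthesize(answer)  # output stage: speech synthesis
```

With a single provider, all three arguments would be backed by the same client and credentials. With mixed vendors, each stage adds its own network hop, SDK, and invoice.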

For xAI, this is strategically important because competitors such as OpenAI, with its Realtime API, already provide an integrated voice stack. STT reaching GA closes the gap and makes xAI a serious option for production voice assistant development.
