🤖 24 AI
In Practice · Friday, April 17, 2026

xAI Speech-to-Text API in general availability: 25 languages, batch and streaming

Why it matters

xAI has announced the general availability of its Speech-to-Text API supporting transcription in 25 languages through batch and streaming modes. The announcement comes one month after the Text-to-Speech API reached general availability in March 2026. With this, xAI completes its audio stack alongside the Grok language models and enters direct competition with OpenAI Whisper, Google Cloud Speech, and Azure Speech.

The move of the Speech-to-Text (STT) API from beta to general availability is recorded in the April 2026 service release notes on docs.x.ai. The notes contain no pricing details or technical architecture specifications, but the direction is clear: Elon Musk's AI company is rounding out its audio offering and competing directly with established automatic speech recognition (ASR) platforms.

What STT brings

The release notes list two operational modes and language coverage:

  • 25 languages supported for speech-to-text transcription
  • Batch mode for processing complete audio files
  • Streaming mode for live transcription from a continuous audio stream

Batch mode suits scenarios where processing can be deferred, such as transcribing podcasts, video files, and call center recordings. Streaming mode is needed for real-time applications such as live captioning, voice assistants, and interactive dialog systems.
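
The difference between the two modes can be sketched without reference to the real service. The example below is a minimal illustration, assuming a hypothetical `transcribe` callable that stands in for whatever the xAI API exposes; the release notes give no endpoint names or parameters, so none are shown here. All function names (`fake_transcribe`, `transcribe_batch`, `chunked`, `transcribe_stream`) are invented for illustration. Batch mode hands over the full recording once, while streaming mode feeds fixed-size chunks and collects partial results as they arrive.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical stand-in for a speech-to-text call: bytes in, text out.
# This is NOT the xAI API; the release notes do not describe the real interface.
Transcriber = Callable[[bytes], str]


def fake_transcribe(audio: bytes) -> str:
    """Placeholder recognizer that only reports how much audio it received."""
    return f"<{len(audio)} bytes>"


def transcribe_batch(audio: bytes, transcribe: Transcriber) -> str:
    """Batch mode: the complete recording is submitted in one call."""
    return transcribe(audio)


def chunked(audio: bytes, chunk_size: int) -> Iterator[bytes]:
    """Split a buffer into fixed-size chunks, as a live source would deliver them."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]


def transcribe_stream(chunks: Iterable[bytes], transcribe: Transcriber) -> Iterator[str]:
    """Streaming mode: each chunk yields a partial result as soon as it is processed."""
    for chunk in chunks:
        yield transcribe(chunk)


if __name__ == "__main__":
    recording = bytes(10_000)  # dummy audio: 10,000 zero bytes
    print(transcribe_batch(recording, fake_transcribe))
    for partial in transcribe_stream(chunked(recording, 4_096), fake_transcribe):
        print(partial)
```

In a real streaming client the chunks would come from a microphone or network socket rather than a buffer already in memory, and the partial results would likely be revised as more audio arrives; the sketch only shows the shape of the two call patterns.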

Context: completing the audio stack

About a month ago, in March 2026, xAI announced the general availability of its Text-to-Speech (TTS) API, which produces natural-sounding speech from text using Grok. Together with today’s STT announcement, xAI now has a complete audio pipeline:

  1. Audio input → STT → text
  2. Text → Grok (reasoning and response) → text
  3. Text → TTS → audio output
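
The three steps above can be sketched as a simple composition of stages. This is a minimal sketch under stated assumptions: the `VoicePipeline` class and the three stage signatures are hypothetical, and the stub functions stand in for the STT, Grok, and TTS calls, whose actual interfaces the announcement does not describe.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; the real xAI endpoints are not described
# in the announcement, so these types are assumptions for illustration.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class VoicePipeline:
    """Chains the three stages: audio -> text -> reply text -> audio."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)  # 1. audio input -> text
        reply = self.llm(transcript)     # 2. text -> model response
        return self.tts(reply)           # 3. text -> audio output


# Stub stages so the sketch runs offline; each would be replaced by a real call.
def stub_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are already text


def stub_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def stub_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend the encoded text is audio


if __name__ == "__main__":
    pipeline = VoicePipeline(stt=stub_stt, llm=stub_llm, tts=stub_tts)
    print(pipeline.respond(b"hello"))
```

Keeping each stage behind a plain callable means any one of them could be swapped for a different provider without touching the other two, which is the integration cost a single-vendor stack is meant to avoid.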

For developers, this means voice assistants, multilingual transcription services, and real-time dialog systems can be built without stitching together three different providers. All components work through the same API key and the same billing.

Market position

The ASR market is already crowded: OpenAI Whisper dominates the open-source segment, Google Cloud Speech-to-Text is the enterprise standard, Microsoft Azure Speech covers complex multilingual use cases, and specialized players like Deepgram and AssemblyAI hold low-latency niches.

xAI enters this space with a strategy of deep integration with Grok rather than standalone ASR superiority. The goal is not for xAI STT to top every benchmark, but to be the easiest path to a complete multimodal application for developers already using xAI for text.

At 25 languages, xAI’s coverage is well below that of OpenAI Whisper (which supports roughly 100 languages) and Google’s Speech-to-Text, which covers over 125 languages. Nevertheless, for English, major European languages, and several major Asian languages, the coverage is sufficient for the bulk of global applications.

xAI documentation directs developers to the Speech to Text docs for further details on pricing, quotas, and specific language data. The announcement is part of the ongoing expansion of the xAI platform during 2026, following earlier releases of Grok models 3, 4, and 4.20.

This article was generated using artificial intelligence from primary sources.