🔴 🤖 Models · Friday, May 8, 2026 · 2 min read

OpenAI: three new realtime voice models in the API with reasoning and translation

Editorial illustration: three new realtime voice models in the API with reasoning and translation

On May 7, 2026, OpenAI introduced three new realtime voice models in the API: GPT-Realtime-2, with GPT-5-class reasoning and a 128,000-token context; GPT-Realtime-Translate, which translates from 70+ input languages into 13 output languages; and GPT-Realtime-Whisper, for live speech transcription.

🤖

This article was generated using artificial intelligence from primary sources.

On May 7, 2026, OpenAI introduced a new generation of realtime voice models in the API, opening up a class of voice applications that reason, translate, and transcribe while the user speaks. The three separate models together cover the voice stack for enterprise voice agents.

What does GPT-Realtime-2 bring?

GPT-Realtime-2 is the first OpenAI voice model with GPT-5-class reasoning, capable of handling demanding requests and conducting natural conversations. The context window has been expanded from 32,000 to 128,000 tokens, enabling longer sessions and more complex instructions within a single call. The model offers adjustable reasoning levels from minimal to extra-high, allowing development teams to balance latency against cognitive depth. On the Big Bench Audio benchmark for audio intelligence, GPT-Realtime-2 (high) scores 15.2% higher than the previous GPT-Realtime-1.5, while the xhigh variant outperforms 1.5 by 13.8% on the Audio MultiChallenge instruction-following test.
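In practice, the latency/depth trade-off would be picked per session. The sketch below builds a hypothetical `session.update` event for a realtime WebSocket session; the field names (`reasoning_effort`, the model string, the exact level labels) are illustrative assumptions, not confirmed API details.

```python
# Sketch: configuring reasoning depth for a hypothetical GPT-Realtime-2
# session. All field names and values here are assumptions for illustration.

def build_session_update(reasoning_effort: str = "minimal") -> dict:
    """Return a session.update event trading latency for reasoning depth."""
    allowed = {"minimal", "low", "medium", "high", "xhigh"}
    if reasoning_effort not in allowed:
        raise ValueError(f"unknown reasoning effort: {reasoning_effort}")
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",             # assumed model identifier
            "reasoning_effort": reasoning_effort,  # minimal … xhigh
            "modalities": ["audio", "text"],
        },
    }

# Low-latency default for an interactive voice agent:
event = build_session_update("minimal")
```

An agent handling routine turns would stay at `minimal` and only escalate to `high` or `xhigh` for turns that need multi-step reasoning, since deeper levels cost latency.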

How do the Translate and Whisper models work?

GPT-Realtime-Translate translates from 70+ input languages into 13 output languages, matching the speaker's pace in real time, which positions it for scenarios such as multinational meetings and cross-border customer support. GPT-Realtime-Whisper is a streaming speech-to-text model that transcribes speech live as the user speaks, intended for applications that need immediate text output with minimal latency. Both models are separate from the main GPT-Realtime-2, giving development teams the freedom to combine or separate functionalities.

What is the pricing model?

GPT-Realtime-2 costs $32 per million input audio tokens, with $0.40 for cached input tokens, and $64 per million output audio tokens. The cached price represents an 80× reduction for repeated contexts and makes longer sessions economically viable. GPT-Realtime-Translate is priced per minute at $0.034/min, while GPT-Realtime-Whisper is set at $0.017/min. This pushes OpenAI directly into the enterprise voice agent market, where the Realtime API was previously limited by a shorter context and lower reasoning level.
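The published prices make session cost a straightforward calculation. The sketch below only encodes the price list above; how many audio tokens a minute of speech produces varies and is not specified here.

```python
# Sketch: estimating GPT-Realtime-2 session cost from the published prices.

INPUT_PER_M = 32.00    # $ per 1M input audio tokens
CACHED_PER_M = 0.40    # $ per 1M cached input tokens (80x cheaper)
OUTPUT_PER_M = 64.00   # $ per 1M output audio tokens

def realtime2_cost(input_tokens: int, cached_tokens: int,
                   output_tokens: int) -> float:
    """Dollar cost of a GPT-Realtime-2 session."""
    return (input_tokens * INPUT_PER_M
            + cached_tokens * CACHED_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000

# A session with 200k fresh input, 500k cached input, 100k output tokens:
cost = realtime2_cost(200_000, 500_000, 100_000)
# 6.40 + 0.20 + 6.40 = 13.00 dollars
```

Note how the cached half-million tokens cost $0.20 instead of $16.00, which is the 80× reduction that makes long repeated contexts economically viable.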

Frequently Asked Questions

What is new in GPT-Realtime-2?
It is the first OpenAI voice model with GPT-5-class reasoning, a context window expanded from 32,000 to 128,000 tokens, and adjustable reasoning levels from minimal to extra-high.
How many languages does GPT-Realtime-Translate support?
It translates from 70+ input languages into 13 output languages, in real time and matching the speaker's pace.
How much do the new models cost?
GPT-Realtime-2: $32 per 1M input audio tokens ($0.40 for cached) and $64 per 1M output. Translate $0.034/min, Whisper $0.017/min.