Amazon Nova 2 Sonic: Under 500ms Voice Agent Foundation

Amazon Nova 2 Sonic is a new-generation speech-to-speech foundation model announced on May 14, 2026, and available through Amazon Bedrock. It removes the need for separate speech-to-text and text-to-speech services and offers end-to-end latency typically under 500ms, audio latency under 30ms via the Stream edge network, native turn detection, barge-in support, and function calling during conversation. The Stream Vision Agents framework abstracts bidirectional audio stream management.

Amazon Web Services launched Amazon Nova 2 Sonic on May 14, 2026. It is a second-generation speech-to-speech foundation model available through Amazon Bedrock. The new model removes the pipeline complexity of classic voice agent stacks and brings latency below the thresholds needed for natural human conversation.

What does Nova 2 Sonic change in voice agent architecture?

Traditional voice agent stacks chain three separate services: speech-to-text (STT), LLM reasoning, and text-to-speech (TTS). Each stage adds latency and a point of failure. Nova 2 Sonic is a speech-to-speech foundation model: it understands input speech and generates output audio directly, eliminating the separate STT and TTS layers. Amazon describes the resulting end-to-end latency as “typically under 500 milliseconds.”
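To make the "each stage adds latency" point concrete, here is a minimal sketch of a latency budget for a cascaded stack. It assumes a deliberately simplified model in which the three stages run strictly in sequence with two network hops between them. The function name, its parameters, and the numbers in the usage line are illustrative placeholders, not measurements from Amazon or any vendor.

```python
def cascaded_latency_ms(stt_ms: float, llm_ms: float, tts_ms: float,
                        hop_ms: float = 0.0) -> float:
    """Latency of an STT -> LLM -> TTS chain under a strictly sequential model.

    Assumption: stages do not overlap, and there are two service-to-service
    hops (STT -> LLM and LLM -> TTS), each costing hop_ms.
    """
    return stt_ms + llm_ms + tts_ms + 2 * hop_ms


# Placeholder numbers chosen only to show the arithmetic.
total = cascaded_latency_ms(stt_ms=150, llm_ms=300, tts_ms=120, hop_ms=20)
print(total)  # 150 + 300 + 120 + 2 * 20 = 610
```

Real pipelines often stream partial results between stages, so the sequential sum is an upper-bound style estimate rather than a prediction.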

What specific latencies does Amazon cite?

Three key metrics position Nova 2 Sonic for production:

End-to-end latency: typically under 500 milliseconds
Audio latency: under 30 milliseconds via the Stream edge network
Join times: sub-500ms when establishing a connection
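The three figures above can be treated as a budget to check your own measurements against. The sketch below is one way to do that. The thresholds come from the article, while the `Measurement` class, its field names, and the `violations` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Thresholds quoted in the article, in milliseconds.
END_TO_END_MS = 500
AUDIO_MS = 30
JOIN_MS = 500


@dataclass
class Measurement:
    """One measured session (hypothetical structure, all values in ms)."""
    end_to_end_ms: float
    audio_ms: float
    join_ms: float


def violations(m: Measurement) -> list:
    """Return the names of the cited thresholds that a measurement misses.

    "Under N ms" is read strictly, so a value equal to the limit is a miss.
    """
    checks = [
        ("end_to_end", m.end_to_end_ms, END_TO_END_MS),
        ("audio", m.audio_ms, AUDIO_MS),
        ("join", m.join_ms, JOIN_MS),
    ]
    return [name for name, value, limit in checks if value >= limit]
```

For example, `violations(Measurement(420, 25, 480))` returns an empty list, while a session with 520ms end-to-end latency would be reported as missing the `end_to_end` threshold.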

These thresholds enable “natural conversational flow without perceptible delays”: the listener does not experience the pauses and cross-talk that degrade communication quality.

What capabilities does the model offer?

Nova 2 Sonic combines five capabilities in a single model:

Speech-to-speech conversion with understanding and reasoning
Voice activity detection to identify speech boundaries and interruptions
Barge-in support that lets the user interrupt the agent naturally
Function calling during conversation for API integration and backend actions
Contextual awareness that maintains the full conversation history
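To illustrate how barge-in, function calling, and conversation history might fit together on the application side, here is a toy sketch. It does not use the Bedrock or Nova 2 Sonic API: the event dictionaries, the `Turn` class, the `tool` decorator, and `order_status` are all hypothetical stand-ins for whatever events and tools a real integration exposes.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool registry: maps a tool name to a plain Python callable.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function so that a tool-call event can reach it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def order_status(order_id: str) -> str:
    # Stand-in for a real backend lookup.
    return f"order {order_id}: shipped"


class Turn:
    """Tracks who is speaking and keeps a simple conversation history."""

    def __init__(self) -> None:
        self.agent_speaking = False
        self.history: List[Tuple[str, str]] = []

    def handle(self, event: dict) -> Optional[str]:
        kind = event["type"]
        if kind == "agent_audio_start":
            self.agent_speaking = True
        elif kind == "agent_audio_end":
            self.agent_speaking = False
        elif kind == "user_speech_start" and self.agent_speaking:
            # Barge-in: the user talks over the agent, so stop playback.
            self.agent_speaking = False
            return "stop_playback"
        elif kind == "tool_call":
            result = TOOLS[event["name"]](**event["args"])
            self.history.append(("tool", result))
            return result
        return None
```

The design choice worth noting is that barge-in is just a state transition: user speech arriving while `agent_speaking` is true triggers a playback stop, and nothing else in the handler needs to know about it.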

What does the Stream Vision Agents framework add?

The Stream Vision Agents framework abstracts the complexity of managing bidirectional audio streams. It uses an event-driven bidirectional streaming API instead of traditional request-response patterns, enabling development teams to build production-grade voice applications with minimal code. The framework handles connection management, jitter buffering, packet loss recovery, and adaptive bitrate compression.
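The difference between event-driven bidirectional streaming and request-response can be sketched with plain `asyncio`. The example below does not use the Stream Vision Agents or Bedrock APIs. Instead, `fake_model`, `send_audio`, `receive_events`, and `run_session` are hypothetical names, and two in-memory queues stand in for the network connection.

```python
import asyncio


async def fake_model(uplink: asyncio.Queue, downlink: asyncio.Queue) -> None:
    """Stand-in for the remote model: answers each audio frame with an event."""
    while True:
        frame = await uplink.get()
        if frame is None:  # end-of-stream sentinel
            await downlink.put(None)
            return
        await downlink.put({"type": "audio", "bytes": len(frame)})


async def send_audio(frames, uplink: asyncio.Queue) -> None:
    """Push microphone frames upstream without waiting for replies."""
    for frame in frames:
        await uplink.put(frame)
    await uplink.put(None)


async def receive_events(downlink: asyncio.Queue, on_event) -> None:
    """Dispatch each incoming event to a callback as soon as it arrives."""
    while True:
        event = await downlink.get()
        if event is None:
            return
        on_event(event)


async def run_session(frames) -> list:
    uplink, downlink = asyncio.Queue(), asyncio.Queue()
    received = []
    # Sending and receiving run concurrently; neither blocks the other.
    await asyncio.gather(
        fake_model(uplink, downlink),
        send_audio(frames, uplink),
        receive_events(downlink, received.append),
    )
    return received


if __name__ == "__main__":
    print(asyncio.run(run_session([b"\x00" * 320, b"\x00" * 160])))
```

In a request-response design the client would wait for each reply before sending more audio. Here the sender and receiver are independent tasks, which is what allows the agent to keep listening while it is speaking.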

This approach positions Amazon in the real-time voice agent arena, where OpenAI Realtime API, ElevenLabs Conversational, and Google Gemini Live have dominated. The entry cost is integration with the Bedrock ecosystem, a trade-off that is easier to accept for customers already on AWS.

Frequently Asked Questions

How does Nova 2 Sonic differ from Nova Sonic 1?

Nova 2 Sonic is a new-generation foundation model with end-to-end latency below 500ms (Nova Sonic 1 was slower), native turn detection without external VAD libraries, barge-in support, and function calling during conversation. Nova Sonic 1 required the Stream Vision Agents framework for equivalent functionality.

What specific latencies does Amazon cite?

Amazon cites end-to-end latency typically under 500ms, audio latency under 30ms via the Stream edge network, and sub-500ms join times when establishing a connection. All three fall within thresholds that enable natural conversation without perceptible delays.
