Amazon Nova 2 Sonic: Speech-to-Speech Foundation Model with End-to-End Latency Below 500ms and 30ms Audio Latency

Editorial illustration: voice agent with sound waves and edge network graphic.

Amazon Nova 2 Sonic is a new-generation speech-to-speech foundation model announced on May 14, 2026, and available through Amazon Bedrock. It removes the need for separate speech-to-text and text-to-speech services and offers end-to-end latency typically below 500ms, audio latency below 30ms via the Stream edge network, native turn detection, barge-in support, and function calling during conversation. The Stream Vision Agents framework abstracts bidirectional audio stream management.

This article was generated using artificial intelligence from primary sources.

Amazon Web Services launched Amazon Nova 2 Sonic, a second-generation speech-to-speech foundation model, on May 14, 2026, through Amazon Bedrock. The model removes the pipeline complexity of classic voice agent stacks and brings latency below the thresholds needed for natural human conversation.

What does Nova 2 Sonic change in voice agent architecture?

Traditional voice agent stacks chain three separate services: speech-to-text (STT), LLM reasoning, and text-to-speech (TTS). Each stage adds latency and another point of failure. Nova 2 Sonic is a speech-to-speech foundation model: it understands input speech and generates output audio directly, eliminating the separate STT and TTS layers. The result, according to Amazon, is end-to-end latency “typically under 500 milliseconds.”
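The latency argument comes down to simple addition: in a cascaded stack the user waits for every stage in turn. The sketch below illustrates that arithmetic only. The `Stage` type, the function names, and every per-stage figure are hypothetical placeholders chosen for illustration, not measurements of Nova 2 Sonic or of any other service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One hop in a voice pipeline with an assumed latency in milliseconds."""
    name: str
    latency_ms: float


def end_to_end_ms(stages: list[Stage]) -> float:
    """Sequential stages add up: the user waits for every hop in turn."""
    return sum(stage.latency_ms for stage in stages)


def within_budget(stages: list[Stage], budget_ms: float = 500.0) -> bool:
    """True when the summed latency stays under the budget."""
    return end_to_end_ms(stages) < budget_ms


# Placeholder figures for illustration only, not measurements of any service.
cascaded = [Stage("stt", 200.0), Stage("llm", 300.0), Stage("tts", 150.0)]
speech_to_speech = [Stage("speech_to_speech_model", 400.0)]

print(end_to_end_ms(cascaded), within_budget(cascaded))
print(end_to_end_ms(speech_to_speech), within_budget(speech_to_speech))
```

With real measurements substituted for the placeholders, the same check shows how much of a 500ms budget each hop consumes.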

What specific latencies does Amazon cite?

Three key metrics position Nova 2 Sonic for production:

  • End-to-end latency: typically under 500 milliseconds
  • Audio latency: under 30 milliseconds via the Stream edge network
  • Join times: sub-500ms when establishing a connection

Amazon says these thresholds enable “natural conversational flow without perceptible delays”: the person talking to the agent does not notice the pauses and cross-talk that degrade communication quality.

What capabilities does the model offer?

Nova 2 Sonic combines five capabilities in a single model:

  • Speech-to-speech conversion with understanding and reasoning
  • Voice activity detection to identify speech boundaries and interruptions
  • Barge-in support that lets the user interrupt the agent naturally
  • Function calling during conversation for API integration and backend actions
  • Contextual awareness that maintains the full conversation history
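To make voice activity detection and barge-in concrete, here is a minimal conceptual sketch. It is not how Nova 2 Sonic implements these features, since the article describes them as native to the model. The sketch swaps in a simple energy threshold as a stand-in for real voice activity detection, and the class name, threshold, and frame count are all hypothetical.

```python
from enum import Enum


class Turn(Enum):
    LISTENING = "listening"      # waiting for or hearing the user
    AGENT_SPEAKING = "speaking"  # agent audio is being played


def frame_energy(samples: list[int]) -> float:
    """Mean squared amplitude of one frame of PCM samples."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


class BargeInController:
    """Stops agent playback once the user has spoken for several frames."""

    def __init__(self, energy_threshold: float = 1_000_000.0,
                 min_speech_frames: int = 3) -> None:
        self.energy_threshold = energy_threshold    # hypothetical tuning value
        self.min_speech_frames = min_speech_frames  # debounce against clicks
        self.state = Turn.LISTENING
        self._speech_run = 0                        # consecutive loud frames

    def start_agent_speech(self) -> None:
        """Mark that the agent has begun playing audio."""
        self.state = Turn.AGENT_SPEAKING
        self._speech_run = 0

    def on_user_frame(self, samples: list[int]) -> bool:
        """Feed one microphone frame; return True if it triggers barge-in."""
        if frame_energy(samples) >= self.energy_threshold:
            self._speech_run += 1
        else:
            self._speech_run = 0
        if (self.state is Turn.AGENT_SPEAKING
                and self._speech_run >= self.min_speech_frames):
            self.state = Turn.LISTENING  # cancel playback, hand the turn back
            return True
        return False
```

Requiring several consecutive loud frames keeps a single click or cough from cutting the agent off. A production system would typically rely on a trained detector rather than raw energy.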

What does the Stream Vision Agents framework add?

The Stream Vision Agents framework abstracts the complexity of managing bidirectional audio streams. It uses an event-driven bidirectional streaming API instead of traditional request-response patterns, enabling development teams to build production-grade voice applications with minimal code. The framework handles connection management, jitter buffering, packet loss recovery, and adaptive bitrate compression.
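The difference between event-driven bidirectional streaming and request-response can be sketched with nothing but Python's standard `asyncio` module. This is not the Stream Vision Agents or Bedrock API. `fake_model`, the event dictionary shape, and the `None` end-of-stream sentinel are assumptions made for illustration. The point is that sending and receiving run as independent tasks instead of alternating in lockstep.

```python
import asyncio


async def fake_model(inbound: asyncio.Queue, outbound: asyncio.Queue) -> None:
    """Stand-in for a speech model: emits one reply event per audio chunk."""
    while True:
        chunk = await inbound.get()
        if chunk is None:            # end-of-stream sentinel
            await outbound.put(None)
            return
        # Reversing the bytes is a dummy transformation, not real audio.
        await outbound.put({"type": "audio_out", "data": chunk[::-1]})


async def send_audio(chunks: list[bytes], inbound: asyncio.Queue) -> None:
    """Uplink: push microphone chunks without waiting for replies."""
    for chunk in chunks:
        await inbound.put(chunk)
        await asyncio.sleep(0)       # yield so the other tasks can run
    await inbound.put(None)


async def receive_events(outbound: asyncio.Queue) -> list[dict]:
    """Downlink: handle events as they arrive, independent of sending."""
    events = []
    while True:
        event = await outbound.get()
        if event is None:
            return events
        events.append(event)


async def run_session(chunks: list[bytes]) -> list[dict]:
    """Run uplink, model, and downlink concurrently over two queues."""
    inbound: asyncio.Queue = asyncio.Queue()
    outbound: asyncio.Queue = asyncio.Queue()
    results = await asyncio.gather(
        fake_model(inbound, outbound),
        send_audio(chunks, inbound),
        receive_events(outbound),
    )
    return results[2]                # events collected by the downlink


if __name__ == "__main__":
    print(asyncio.run(run_session([b"ab", b"cd"])))
```

In a real deployment the two queues would be replaced by a network stream, which is where the connection management and buffering concerns the framework is described as handling would come in.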

This approach positions Amazon in the real-time voice agent arena where OpenAI Realtime API, ElevenLabs Conversational, and Google Gemini Live have dominated. The entry cost is integration with the Bedrock ecosystem — a trade-off for customers already on AWS.

Frequently Asked Questions

How does Nova 2 Sonic differ from Nova Sonic 1?

Nova 2 Sonic is a new-generation foundation model with end-to-end latency typically below 500ms (higher with Nova Sonic 1), native turn detection without external VAD libraries, barge-in support, and function calling during conversation. Nova Sonic 1 required the Stream Vision Agents framework for equivalent functionality.

What specific latencies does Amazon cite?

End-to-end latency typically under 500ms, audio latency under 30ms via the Stream edge network, and sub-500ms join times when establishing a connection. All three fall within thresholds that enable natural conversation without perceptible delays.