Models · Saturday, April 18, 2026 · 4 min read

AWS Nova Multimodal Embeddings for video search: hybrid approach delivers 90 percent recall instead of 51 percent

Why it matters

AWS Nova Multimodal Embeddings is a new architecture that processes the visual, audio and text content of a video simultaneously into a shared 1024-dimensional vector space, without first converting everything to text. Combining semantic embeddings with BM25 lexical search yields 90 percent Recall@5, compared to 51 percent for the baseline combined-mode embeddings, with gains of 31 to 42 percentage points across all reported metrics.

Following its recent article on Nova distillation, AWS published the second key part of its video search story on April 17, 2026: Amazon Nova Multimodal Embeddings. The same author team (Amit Kalawat, Bimal Gajjar, James Wu) documents in detail an architecture that fundamentally changes how AWS approaches semantic search of video content.

What is different

A classic video search pipeline has a clear limitation: everything gets converted to text. Audio is transcribed, images are described, metadata is read — and then a text embedding model performs the search. The problem: in that process 90 percent of the original content is lost — sound effects, music, visual composition, colours, movement.

Nova Multimodal Embeddings changes that approach. The system processes text, documents, images, video and audio simultaneously into a shared 1024-dimensional vector space. There is no prior conversion to text — each modality retains its own semantics.

Two-phase pipeline

Ingestion phase treats video as a structured signal:

  1. Scene detection via FFmpeg — video is divided at natural transitions (typically 5–15 seconds)
  2. Three parallel processing branches:
    • 1024-dim embeddings for visual + audio signals
    • Transcription with aligned sentence-level embeddings
    • Celebrity ID + caption generation for additional metadata
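The post itself contains no code, but step 1 of the ingestion phase can be sketched with FFmpeg's built-in `scene` filter. This is an illustration under assumptions: the 0.4 scene-change threshold and the helper names (`detect_scenes`, `parse_scene_times`) are mine, not AWS's, and the real pipeline's settings are not published.

```python
import re
import subprocess

SCENE_THRESHOLD = 0.4  # assumption: the article does not state the threshold used

def parse_scene_times(showinfo_stderr: str) -> list[float]:
    """Extract pts_time timestamps from ffmpeg's showinfo log lines."""
    return [float(t) for t in re.findall(r"pts_time:([\d.]+)", showinfo_stderr)]

def detect_scenes(video_path: str, threshold: float = SCENE_THRESHOLD) -> list[float]:
    """Return timestamps (seconds) where ffmpeg's scene filter detects a cut."""
    # select frames whose scene-change score exceeds the threshold, then let
    # showinfo print their timestamps to stderr; no output file is written
    cmd = [
        "ffmpeg", "-i", video_path,
        "-vf", f"select='gt(scene,{threshold})',showinfo",
        "-f", "null", "-",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_scene_times(result.stderr)
```

Consecutive timestamps then delimit the 5–15-second segments that each get their own embedding.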

Search phase uses intent-aware routing:

  1. Intent analysis (Claude Haiku) assigns weights (0.0 to 1.0) to each modality — visual, audio, transcription, metadata
  2. Query embeddings are generated through three specific indexes
  3. Final score = w₁ × norm_bm25 + w₂ × norm_visual + w₃ × norm_audio + w₄ × norm_transcription
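The weighted-fusion formula in step 3 can be sketched in a few lines of Python. Two assumptions on top of the article: per-index min-max normalisation (the formula implies normalised scores but the method is not specified) and the hypothetical index names used in the example.

```python
def min_max_norm(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one index's raw scores to [0, 1] so modalities are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # if all scores are equal, everything maps to 0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(weights: dict[str, float], per_index: dict[str, dict[str, float]]):
    """final = w1*norm_bm25 + w2*norm_visual + w3*norm_audio + w4*norm_transcription."""
    fused: dict[str, float] = {}
    for index_name, scores in per_index.items():
        w = weights.get(index_name, 0.0)
        for doc, s in min_max_norm(scores).items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    # highest combined score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# hypothetical weights as the intent router might assign them
ranked = fuse(
    {"bm25": 0.6, "visual": 0.4},
    {"bm25": {"a": 2.0, "b": 1.0}, "visual": {"a": 0.2, "b": 0.9}},
)
```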

Hybrid approach: semantic + lexical

The key innovation is combining semantic and lexical search:

  • Semantic search (embeddings) — excellent for conceptual similarity (“dramatic scene”, “nostalgic tone”)
  • Lexical search (BM25) — excellent for exact entities (names, product codes, locations)

Without the BM25 layer, searching for specific people or product names would be unreliable. Embeddings capture conceptual similarity well, but often fail to distinguish between similar yet different names.
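To make the lexical side concrete, here is a minimal Okapi BM25 scorer over pre-tokenised documents. In the actual pipeline BM25 is served by OpenSearch; this standalone sketch only illustrates why exact-term matching catches names that embeddings blur together.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 over tokenised docs; rewards exact term matches, scaled by rarity."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N        # average document length
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            f = tf[t]  # 0 if the term is absent, so the term contributes nothing
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

A document that literally contains "nova" outscores one that merely talks about similar models, which is exactly the behaviour needed for names and product codes.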

Performance: massive jump

AWS tested the system on 10 long videos with 20 queries and compared the hybrid approach against the baseline combined-mode embedding solution:

| Metric | Hybrid approach | Baseline |
| --- | --- | --- |
| Recall@5 | 90% | 51% |
| Recall@10 | 95% | 64% |
| MRR | 90% | 48% |
| NDCG@10 | 88% | 54% |

Improvements of 31 to 42 percentage points across all reported metrics. This is not an incremental gain; it redefines what can be achieved with video search.
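The metrics in the table are standard retrieval measures and straightforward to reproduce. A minimal sketch of Recall@k and MRR (AWS's evaluation harness itself is not published):

```python
def recall_at_k(ranked, relevant, k):
    """Share of the relevant items that appear in the top-k results for one query."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr(ranked_per_query, relevant_per_query):
    """Mean reciprocal rank of the first relevant hit, averaged over all queries."""
    total = 0.0
    for ranked, relevant in zip(ranked_per_query, relevant_per_query):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_per_query)
```

With 20 queries, the reported 90 percent Recall@5 means the relevant scenes landed in the top five results for all but a couple of them.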

Infrastructure side

AWS designed a pipeline that is affordable at enterprise scale:

  • S3 Vectors as primary storage for the three index spaces — up to 90 percent cheaper than specialised vector databases
  • OpenSearch Service for kNN search and metadata indexing
  • AWS Fargate for processing workloads
  • Amazon Transcribe for audio-to-text
  • Amazon Rekognition for celebrity ID
  • Nova 2 Lite for generating descriptions and genres

The architecture supports scaling to massive content libraries through efficient vector storage and selective query routing — if the intent router determines that audio is not relevant to a query (weight below 0.05), the audio index is not searched at all.
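The selective routing described above reduces to a simple gate on the intent weights. A sketch, where the 0.05 cut-off is the only number taken from the post and the function names are hypothetical:

```python
MIN_WEIGHT = 0.05  # per the post: below this weight, the index is not searched at all

def route_query(weights: dict[str, float], search_fns: dict) -> dict:
    """Run only the per-modality searches whose intent weight clears the gate."""
    results = {}
    for name, search in search_fns.items():
        if weights.get(name, 0.0) < MIN_WEIGHT:
            continue  # e.g. skip the audio index for a purely visual query
        results[name] = search()
    return results
```

Skipping an entire index saves both the embedding call for that modality and the kNN lookup, which is where the claimed enterprise-scale cost efficiency comes from.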

Use cases cited by AWS

  • Sports producers searching for highlight moments in archives
  • Film studios searching scenes featuring specific actors
  • News organisations searching footage by mood, location or event

In all cases, the previous transcription-based approach missed visual and audio information that is often critical to finding the right scene.

Broader context

Together with the Nova Model Distillation article (see sister post), AWS published in a single day a complete video search pipeline: embedding architecture plus distilled routing. Both articles come from the same author team and constitute a complete enterprise solution for organisations managing large video libraries.

For AWS, this is a strategic move — Amazon has long struggled to position itself as an AI infrastructure leader relative to Google and Azure. The Nova model family plus multimodal embeddings plus distillation plus S3 Vectors forms a concrete, measurable stack with documented savings.


This article was generated using artificial intelligence from primary sources.