Models · Saturday, April 18, 2026 · 4 min read

AWS Nova Multimodal Embeddings for video search: hybrid approach delivers 90 percent recall instead of 51 percent

Why it matters

AWS Nova Multimodal Embeddings is a new architecture that processes the visual, audio and text content of a video simultaneously into a shared 1024-dimensional vector space, without first converting everything to text. Combining semantic embeddings with BM25 lexical search yields 90 percent Recall@5, compared to 51 percent for the baseline combined-mode embeddings, with gains of 31 to 42 percentage points across all reported metrics.

Following its recent article on Nova distillation, AWS published the second key part of its video search story on April 17, 2026: Amazon Nova Multimodal Embeddings. The same author team (Amit Kalawat, Bimal Gajjar, James Wu) documents in detail an architecture that fundamentally changes how AWS approaches semantic search of video content.

What is different

A classic video search pipeline has a clear limitation: everything gets converted to text. Audio is transcribed, images are described, metadata is read — and then a text embedding model performs the search. The problem: in that process 90 percent of the original content is lost — sound effects, music, visual composition, colours, movement.

Nova Multimodal Embeddings changes that approach. The system processes text, documents, images, video and audio simultaneously into a shared 1024-dimensional vector space. There is no prior conversion to text — each modality retains its own semantics.

Two-phase pipeline

Ingestion phase treats video as a structured signal:

  1. Scene detection via FFmpeg — video is divided at natural transitions (typically 5–15 seconds)
  2. Three parallel processing branches:
    • 1024-dim embeddings for visual + audio signals
    • Transcription with aligned sentence-level embeddings
    • Celebrity ID + caption generation for additional metadata
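The post itself contains no code, but step 1 of the ingestion phase can be sketched with FFmpeg's built-in `scene` filter. This is an illustration under assumptions: the 0.4 scene-change threshold and the helper names (`detect_scenes`, `parse_scene_times`) are mine, not AWS's, and the real pipeline's settings are not published.

```python
import re
import subprocess

SCENE_THRESHOLD = 0.4  # assumption: the article does not state the threshold used

def parse_scene_times(showinfo_stderr: str) -> list[float]:
    """Extract pts_time timestamps from ffmpeg's showinfo log lines."""
    return [float(t) for t in re.findall(r"pts_time:([\d.]+)", showinfo_stderr)]

def detect_scenes(video_path: str, threshold: float = SCENE_THRESHOLD) -> list[float]:
    """Return timestamps (seconds) where ffmpeg's scene filter detects a cut."""
    # select frames whose scene-change score exceeds the threshold, then let
    # showinfo print their timestamps to stderr; no output file is written
    cmd = [
        "ffmpeg", "-i", video_path,
        "-vf", f"select='gt(scene,{threshold})',showinfo",
        "-f", "null", "-",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_scene_times(result.stderr)
```

Consecutive timestamps then delimit the 5–15-second segments that each get their own embedding.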

Search phase uses intent-aware routing:

  1. Intent analysis (Claude Haiku) assigns weights (0.0 to 1.0) to each modality — visual, audio, transcription, metadata
  2. Query embeddings are generated through three specific indexes
  3. Final score = w₁ × norm_bm25 + w₂ × norm_visual + w₃ × norm_audio + w₄ × norm_transcription
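The weighted-fusion formula in step 3 can be sketched in a few lines of Python. Two assumptions on top of the article: per-index min-max normalisation (the formula implies normalised scores but the method is not specified) and the hypothetical index names used in the example.

```python
def min_max_norm(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one index's raw scores to [0, 1] so modalities are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # if all scores are equal, everything maps to 0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(weights: dict[str, float], per_index: dict[str, dict[str, float]]):
    """final = w1*norm_bm25 + w2*norm_visual + w3*norm_audio + w4*norm_transcription."""
    fused: dict[str, float] = {}
    for index_name, scores in per_index.items():
        w = weights.get(index_name, 0.0)
        for doc, s in min_max_norm(scores).items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    # highest combined score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# hypothetical weights as the intent router might assign them
ranked = fuse(
    {"bm25": 0.6, "visual": 0.4},
    {"bm25": {"a": 2.0, "b": 1.0}, "visual": {"a": 0.2, "b": 0.9}},
)
```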

Hybrid approach: semantic + lexical

The key innovation is combining semantic and lexical search:

  • Semantic search (embeddings) — excellent for conceptual similarity (“dramatic scene”, “nostalgic tone”)
  • Lexical search (BM25) — excellent for exact entities (names, product codes, locations)

Without the BM25 layer, searching for specific people or product names would be unreliable. Embeddings capture conceptual similarity well, but often fail to distinguish between similar yet different names.
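To make the lexical side concrete, here is a minimal Okapi BM25 scorer over pre-tokenised documents. In the actual pipeline BM25 is served by OpenSearch; this standalone sketch only illustrates why exact-term matching catches names that embeddings blur together.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 over tokenised docs; rewards exact term matches, scaled by rarity."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N        # average document length
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            f = tf[t]  # 0 if the term is absent, so the term contributes nothing
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

A document that literally contains "nova" outscores one that merely talks about similar models, which is exactly the behaviour needed for names and product codes.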

Performance: massive jump

AWS tested the system on 10 long videos with 20 queries and compared the hybrid approach against the baseline combined-mode embedding solution:

| Metric | Hybrid approach | Baseline |
| --- | --- | --- |
| Recall@5 | 90% | 51% |
| Recall@10 | 95% | 64% |
| MRR | 90% | 48% |
| NDCG@10 | 88% | 54% |

Improvements of 31 to 42 percentage points across all reported metrics. This is not an incremental gain; it redefines what can be achieved with video search.
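The metrics in the table are standard retrieval measures and straightforward to reproduce. A minimal sketch of Recall@k and MRR (AWS's evaluation harness itself is not published):

```python
def recall_at_k(ranked, relevant, k):
    """Share of the relevant items that appear in the top-k results for one query."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr(ranked_per_query, relevant_per_query):
    """Mean reciprocal rank of the first relevant hit, averaged over all queries."""
    total = 0.0
    for ranked, relevant in zip(ranked_per_query, relevant_per_query):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_per_query)
```

With 20 queries, the reported 90 percent Recall@5 means the relevant scenes landed in the top five results for all but a couple of them.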

Infrastructure side

AWS designed a pipeline that is affordable at enterprise scale:

  • S3 Vectors as primary storage for the three index spaces — up to 90 percent cheaper than specialised vector databases
  • OpenSearch Service for kNN search and metadata indexing
  • AWS Fargate for processing workloads
  • Amazon Transcribe for audio-to-text
  • Amazon Rekognition for celebrity ID
  • Nova 2 Lite for generating descriptions and genres

The architecture supports scaling to massive content libraries through efficient vector storage and selective query routing — if the intent router determines that audio is not relevant to a query (weight below 0.05), the audio index is not searched at all.
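The selective routing described above reduces to a simple gate on the intent weights. A sketch, where the 0.05 cut-off is the only number taken from the post and the function names are hypothetical:

```python
MIN_WEIGHT = 0.05  # per the post: below this weight, the index is not searched at all

def route_query(weights: dict[str, float], search_fns: dict) -> dict:
    """Run only the per-modality searches whose intent weight clears the gate."""
    results = {}
    for name, search in search_fns.items():
        if weights.get(name, 0.0) < MIN_WEIGHT:
            continue  # e.g. skip the audio index for a purely visual query
        results[name] = search()
    return results
```

Skipping an entire index saves both the embedding call for that modality and the kNN lookup, which is where the claimed enterprise-scale cost efficiency comes from.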

Use cases cited by AWS

  • Sports producers searching for highlight moments in archives
  • Film studios searching scenes featuring specific actors
  • News organisations searching footage by mood, location or event

In all cases, the previous transcription-based approach missed visual and audio information that is often critical to finding the right scene.

Broader context

Together with the Nova Model Distillation article (see sister post), AWS published in a single day a complete video search pipeline: embedding architecture plus distilled routing. Both articles come from the same author team and constitute a complete enterprise solution for organisations managing large video libraries.

For AWS, this is a strategic move — Amazon has long struggled to position itself as an AI infrastructure leader relative to Google and Azure. The Nova model family plus multimodal embeddings plus distillation plus S3 Vectors forms a concrete, measurable stack with documented savings.


This article was generated using artificial intelligence from primary sources.