Models Β· Friday, April 10, 2026 Β· 2 min read

Sentence Transformers v5.4 adds support for multimodal embedding and reranker models

In brief

Hugging Face has released version 5.4 of the Sentence Transformers library, introducing multimodal embedding and reranker models. Users can now map text, images, audio, and video into a shared embedding space and run cross-modal similarity, unifying search across different content types.

On April 9, Hugging Face released Sentence Transformers v5.4, bringing full multimodal support to one of the most popular NLP libraries: embedding and reranker models that handle text, images, audio, and video through the same API.

What’s new

The main advance is the ability to map different modalities into a shared embedding space, enabling cross-modal similarity β€” comparing, for example, text and images as if they were the same type of data. Users can search images using text queries, or find video segments relevant to an audio clip, all through a single API call.
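Under the hood, cross-modal comparison reduces to scoring vectors in the shared space: once a text query and a set of images are both encoded, "similarity" is just cosine similarity between their embeddings. A minimal, library-free sketch (the 4-dimensional vectors are toy values invented for illustration; real models produce hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors, normalized by their magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: a text query and two image vectors in the same space.
text_query = [0.9, 0.1, 0.0, 0.4]
image_embeddings = {
    "cat.jpg": [0.8, 0.2, 0.1, 0.5],
    "car.jpg": [0.0, 0.9, 0.8, 0.1],
}

# Rank images against the text query and pick the best match.
scores = {name: cosine_similarity(text_query, emb)
          for name, emb in image_embeddings.items()}
best = max(scores, key=scores.get)
```

Because both modalities live in one vector space, the same scoring loop works for text-to-image, audio-to-video, or any other pairing.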

Among the supported models are Qwen3-VL Embedding (2B and 8B versions, supporting text/image/video), NVIDIA llama-nemotron-embed-vl (1.7B), BAAI BGE-VL (from 100M to 8B parameters), and new multimodal rerankers such as jina-reranker-m0 and Qwen3-VL-Reranker-2B.

How it is used

Installation is modular by modality: pip install sentence-transformers[image] for images, [audio] for audio, [video] for video. Cross-modal search is very simple: encode images and text queries with model.encode(), then score them with model.similarity(). Backward compatibility is preserved, so existing text-only code runs unchanged.
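As a sketch, that flow might look like the following. The model ID and the multimodal behavior of encode() are assumptions pieced together from the article, not verified against the actual v5.4 API, and running it requires pip install sentence-transformers[image] plus a model download:

```python
from sentence_transformers import SentenceTransformer

# Hypothetical Hub ID based on the article's model list; the exact name may differ.
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

# Encode both modalities into the same embedding space.
image_embeddings = model.encode(["photos/beach.jpg", "photos/city.jpg"])
query_embedding = model.encode("a sunset over the ocean")

# Score the text query against each image; the highest score is the best match.
scores = model.similarity(query_embedding, image_embeddings)
print(scores)
```

The encode()/similarity() pair is the same interface existing text-only code already uses, which is what makes the upgrade path a model swap rather than a rewrite.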

Hardware requirements: 2B variants need ~8 GB of VRAM, 8B variants ~20 GB. CPU inference is possible but very slow; a GPU is recommended.

Why it matters

Sentence Transformers is the backbone of countless RAG (Retrieval Augmented Generation) systems and production semantic search deployments. Bringing multimodal support into the same library means developers do not have to change their architecture to add image or video search; they simply swap the model. It is a quiet but highly practical update, and one likely to turn a large share of RAG systems multimodal over the coming months.

πŸ€– This article was generated using artificial intelligence from primary sources.