Sentence Transformers v5.4 adds support for multimodal embedding and reranker models

Hugging Face's Sentence Transformers library has reached version 5.4, which introduces multimodal embedding and reranker models. Users can now map text, images, audio and video into a shared embedding space and compute cross-modal similarity, unifying search across different content types.

This article was generated using artificial intelligence from primary sources.

On April 9, Hugging Face released Sentence Transformers v5.4, which brings full support for multimodal models (embedding and reranker models that work with text, images, audio and video through the same API) to one of the most popular NLP libraries.

What’s new

The main advance is the ability to map different modalities into a shared embedding space, enabling cross-modal similarity: text and images, for example, can be compared as if they were the same type of data. Users can search images using text queries, or find video segments relevant to an audio clip, all through the same API.
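To make the idea of a shared embedding space concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the comparison (the library's default similarity function), and the three-dimensional vectors and file names are invented for illustration; real models produce embeddings with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings (assumption): in a shared space, a text query and an
# image of the same thing should point in a similar direction.
text_query = [0.9, 0.1, 0.0]  # stands in for "a photo of a dog"
image_embeddings = {
    "dog.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
}

# Score every image against the text query and pick the closest one.
scores = {
    name: cosine_similarity(text_query, vector)
    for name, vector in image_embeddings.items()
}
best_match = max(scores, key=scores.get)
```

Because both modalities live in one vector space, the comparison itself never needs to know which vector came from text and which from an image.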

Among the supported models are Qwen3-VL Embedding (2B and 8B versions, supporting text/image/video), NVIDIA llama-nemotron-embed-vl (1.7B), BAAI BGE-VL (from 100M to 8B parameters), and new multimodal rerankers such as jina-reranker-m0 and Qwen3-VL-Reranker-2B.

How it is used

Support for each modality is installed as an optional extra: pip install sentence-transformers[image] for images, [audio] for audio, [video] for video. Cross-modal search takes two steps: encode the images and the text queries with model.encode(), then call model.similarity() on the resulting embeddings. Backward compatibility is preserved, so existing text-only code works unchanged.
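The encode-then-compare flow described above might look roughly like the sketch below. The helper search_images is hypothetical (it is not part of the library) and relies only on the two calls the article names, model.encode() and model.similarity(). The model ID, file names and the use of PIL images in the commented usage are assumptions, not confirmed details of the release.

```python
def search_images(model, query_texts, images, top_k=3):
    """Rank images for each text query using a shared embedding space.

    `model` is assumed to be a multimodal SentenceTransformer; only
    model.encode() and model.similarity() are called on it.
    Returns, per query, a list of (image_index, score) pairs, best first.
    """
    image_embeddings = model.encode(images)
    query_embeddings = model.encode(query_texts)

    # Score matrix with one row per query and one column per image.
    scores = model.similarity(query_embeddings, image_embeddings)
    rows = scores.tolist() if hasattr(scores, "tolist") else scores

    results = []
    for row in rows:
        ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
        results.append(ranked[:top_k])
    return results


# Hypothetical usage (downloads a model; the ID and file names are assumptions):
# from sentence_transformers import SentenceTransformer
# from PIL import Image
# model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
# photos = [Image.open("beach.jpg"), Image.open("city.jpg")]
# hits = search_images(model, ["sunset over the sea"], photos, top_k=1)
```

Keeping the model as a parameter means the same helper should work unchanged when one embedding model is swapped for another, which is the workflow the release is aiming for.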

As for hardware, the 2B variants require roughly 8 GB of VRAM and the 8B variants roughly 20 GB. CPU inference is possible but extremely slow, so a GPU is recommended.
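A back-of-the-envelope check helps put those figures in context. The sketch below estimates the memory taken by the weights alone, assuming 16-bit weights (2 bytes per parameter); the article does not state a precision, so that is an assumption, and the helper name is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in gigabytes (10**9 bytes).

    Assumes 16-bit weights by default (2 bytes per parameter); pass
    bytes_per_param=4 for 32-bit weights.
    """
    return num_params * bytes_per_param / 1e9


weights_2b = weight_memory_gb(2e9)  # a 2B-parameter model
weights_8b = weight_memory_gb(8e9)  # an 8B-parameter model
```

Under that assumption the weights come to about 4 GB and 16 GB respectively; the gap to the quoted ~8 GB and ~20 GB is presumably taken up by activations, image or video inputs, and framework overhead.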

Why it matters

Sentence Transformers is the backbone of many RAG (Retrieval-Augmented Generation) systems and semantic search deployments in production. Bringing multimodal support into the same library means that developers do not have to change their architecture when they want to add image or video search; they simply swap the model. It may prove to be a quiet but highly practical update, one that could push many RAG systems toward multimodal retrieval over the coming months.