Sentence Transformers v5.4 adds support for multimodal embedding and reranker models
Hugging Face's Sentence Transformers library has reached version 5.4, which introduces multimodal embedding and reranker models. Users can now map text, images, audio, and video into a shared embedding space and compute cross-modal similarity, unifying search across different content types.
On April 9, Hugging Face released Sentence Transformers v5.4, a version that brings full support for multimodal models (embedding and reranker models that work with text, images, audio, and video through the same API) to one of the most popular NLP libraries.
What's new
The main advance is the ability to map different modalities into a shared embedding space, enabling cross-modal similarity: comparing, for example, text and images as if they were the same type of data. Users can search images using text queries, or find video segments relevant to an audio clip, all through a single API call.
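The shared-space idea can be illustrated with plain vectors: once every modality is projected into the same number of dimensions and normalized, cross-modal similarity reduces to a dot product. A minimal sketch with mock embeddings (the random vectors stand in for real model outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
dim = 8  # real models use hundreds to thousands of dimensions

# Mock embeddings: in a multimodal model, the text encoder and the
# image encoder both output vectors in the same dim-dimensional space.
text_embs = rng.normal(size=(2, dim))   # two text queries
image_embs = rng.normal(size=(3, dim))  # three candidate images

scores = cosine_similarity(text_embs, image_embs)  # shape (2, 3)
best = scores.argmax(axis=1)  # best-matching image for each query
```

Because both sides live in one space, ranking images against a text query is the same operation as ranking text against text.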
Among the supported models are Qwen3-VL Embedding (2B and 8B versions, supporting text/image/video), NVIDIA llama-nemotron-embed-vl (1.7B), BAAI BGE-VL (from 100M to 8B parameters), and new multimodal rerankers such as jina-reranker-m0 and Qwen3-VL-Reranker-2B.
How it is used
Installation extras are optional depending on the modality you need: pip install sentence-transformers[image] for images, [audio] for audio, [video] for video. Cross-modal search itself is very simple: encode images and text queries through model.encode(), then call model.similarity(). Backward compatibility is preserved: existing text-only code works unchanged.
As for hardware: 2B variants require ~8 GB of VRAM, 8B variants ~20 GB. CPU inference is possible but extremely slow; a GPU is recommended.
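Those figures are consistent with a back-of-the-envelope estimate: 16-bit weights take 2 bytes per parameter, which leaves roughly 4 GB in each case for activations and runtime overhead. A quick check (the weights/overhead split is my inference, not a figure from the release, and actual overhead varies with batch size and input length):

```python
def weight_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """VRAM needed just for model weights (fp16/bf16 = 2 bytes per parameter)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

for params, reported in [(2, 8), (8, 20)]:
    weights = weight_vram_gb(params)
    print(f"{params}B model: {weights:.0f} GB weights "
          f"+ ~{reported - weights:.0f} GB overhead of the reported ~{reported} GB")
```

Quantized (8-bit or 4-bit) weights would cut the first term further, at some cost in embedding quality.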
Why it matters
Sentence Transformers is the backbone of countless RAG (Retrieval-Augmented Generation) systems and production semantic search. Bringing multimodal support into the same library means that developers do not have to change their architecture when they want to add image or video search; they simply swap the model. This is likely the quietest but most practical update, one that will turn the vast majority of RAG systems into multimodal ones over the coming months.
Related news
arXiv: Process Reward Agents – real-time feedback improves AI reasoning in medicine without retraining
arXiv PRA: 4B model achieves 80.8% on medical benchmark – new SOTA for small scale
arXiv SPPO: Sequence-level PPO solves the credit assignment problem in long reasoning chains