Google announces GA of gemini-embedding-2: first multimodal embedding model with 5 modalities in one space
Why it matters
Google announced the general availability of gemini-embedding-2, the first multimodal embedding model that maps five modalities into a unified vector space: text, images, video, audio, and PDF documents. The model had been in preview since March 10, 2026, and is now available to everyone through the Gemini API.
The announcement marks a significant milestone for embedding models: previous work has mostly covered text or text-image pairs, very few models have handled audio and video consistently, and PDF as a first-class modality is nearly uncharted territory.
What is an embedding and why does it matter?
An embedding is a numerical representation of an input in vector form — a series of numbers describing the meaning of the content. Embeddings are used for semantic search, RAG (retrieval-augmented generation) systems, classification, duplicate detection, and recommendations.
The key idea is that similar inputs end up close to each other in the vector space. Until now, this mostly meant text-to-text or image-to-image comparisons. A multimodal embedding in a unified space means a text query like “cat jumping” can find a photo of a cat, a video clip of a cat, and an audio recording of meowing, all without any intermediate conversion.
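The "close in the vector space" idea can be made concrete with cosine similarity. The sketch below uses hand-made 4-dimensional toy vectors purely for illustration; a real embedding model returns vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (invented for this example, not real model output).
query_text = [0.9, 0.1, 0.0, 0.2]   # text query: "cat jumping"
cat_photo  = [0.8, 0.2, 0.1, 0.3]   # photo of a cat
car_audio  = [0.0, 0.9, 0.8, 0.1]   # recording of engine noise

print(cosine_similarity(query_text, cat_photo))  # high: related content
print(cosine_similarity(query_text, car_audio))  # low: unrelated content
```

In a unified multimodal space, this same comparison works across modalities: the text query vector and the photo vector live in the same space, so one similarity function serves all media types.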
Which inputs are supported?
The model supports five types of inputs:
- Text — the classic embedding source, typically used for search and RAG
- Images — photographs, screenshots, graphics
- Video — short clips or longer recordings
- Audio — voice, music, sound events
- PDF — full documents with a mix of text, images, and tables
The fact that PDF is a first-class modality means users don’t have to manually extract text and images from documents. The model does this internally and produces a single vector describing the entire document.
What are the practical applications?
The most obvious application is advanced semantic search over heterogeneous content. An organization with a mix of documents, images, and meeting recordings can index everything into the same vector index and search any media with any query.
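A single-index search over mixed media can be sketched as a brute-force similarity ranking. The item IDs and vectors below are invented for illustration; in practice the vectors would come from the embedding model, and a production system would use an approximate-nearest-neighbor index rather than a linear scan.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy index: (item_id, modality, embedding) for heterogeneous content.
index = [
    ("report.pdf",  "pdf",   [0.1, 0.9, 0.2]),
    ("standup.mp4", "video", [0.2, 0.8, 0.3]),
    ("logo.png",    "image", [0.9, 0.1, 0.0]),
]

def search(query_vec: list[float], index, top_k: int = 2):
    """Rank every item, regardless of modality, by similarity to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return [(item_id, modality) for item_id, modality, _ in ranked[:top_k]]

print(search([0.15, 0.85, 0.25], index))
# Items about the query's topic rank first, whatever their media type.
```

The point of the unified space is exactly this: one index, one ranking function, and any query modality can retrieve any content modality.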
For developers and companies building RAG applications, multimodal embedding simplifies the architecture. Instead of a pipeline that extracts text from a PDF, passes images through a separate model and audio through a third one, everything can go through a single API call. This reduces complexity and likely costs.
A GA release does not automatically mean the model fits every use case: accuracy depends on the specific data and domain. The recommendation is to test the model on your own dataset before migrating an entire production pipeline.
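Testing on your own dataset can be as simple as measuring recall@k on a small labeled sample: for each query with a known-relevant document, check whether that document appears in the top k retrieved results. The query and document IDs below are made up for illustration.

```python
def recall_at_k(results: dict[str, list[str]],
                relevant: dict[str, str],
                k: int = 5) -> float:
    """Fraction of queries whose known-relevant document appears in the top k."""
    hits = sum(1 for q, docs in results.items() if relevant[q] in docs[:k])
    return hits / len(results)

# Hypothetical retrieval output on a small labeled sample (IDs are invented).
results = {
    "q1": ["doc7", "doc2", "doc9"],
    "q2": ["doc1", "doc4", "doc8"],
    "q3": ["doc5", "doc3", "doc6"],
}
relevant = {"q1": "doc2", "q2": "doc8", "q3": "doc0"}

print(recall_at_k(results, relevant, k=3))  # 2 of 3 queries hit
```

Running a metric like this against the new model on a few dozen representative queries gives a concrete go/no-go signal before any migration.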