Google announces GA of gemini-embedding-2: first multimodal embedding model with 5 modalities in one space
Why it matters
Google announced the general availability of gemini-embedding-2, the first multimodal embedding model that maps five modalities into a unified vector space: text, images, video, audio, and PDF documents. The model had been in preview since March 10, 2026, and is now available to everyone through the Gemini API.
The announcement marks a significant milestone for embedding models: previous work has mostly covered text or text-image pairs, very few models have handled audio and video consistently, and PDF as a first-class modality is nearly uncharted territory.
What is an embedding and why does it matter?
An embedding is a numerical representation of an input in vector form — a series of numbers describing the meaning of the content. Embeddings are used for semantic search, RAG (retrieval-augmented generation) systems, classification, duplicate detection, and recommendations.
The key idea is that similar inputs end up close to each other in the vector space. Until now, this mostly meant text-to-text or image-to-image comparisons. A multimodal embedding in a unified space means a text query like “cat jumping” can find a photo of a cat, a video clip of a cat, and an audio recording of meowing, all without any intermediate conversion.
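The "close in the vector space" idea can be made concrete with cosine similarity. The sketch below uses hand-made 4-dimensional toy vectors purely for illustration; a real embedding model returns vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (invented for this example, not real model output).
query_text = [0.9, 0.1, 0.0, 0.2]   # text query: "cat jumping"
cat_photo  = [0.8, 0.2, 0.1, 0.3]   # photo of a cat
car_audio  = [0.0, 0.9, 0.8, 0.1]   # recording of engine noise

print(cosine_similarity(query_text, cat_photo))  # high: related content
print(cosine_similarity(query_text, car_audio))  # low: unrelated content
```

In a unified multimodal space, this same comparison works across modalities: the text query vector and the photo vector live in the same space, so one similarity function serves all media types.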
Which inputs are supported?
The model supports five types of inputs:
- Text — the classic embedding source, typically used for search and RAG
- Images — photographs, screenshots, graphics
- Video — short clips or longer recordings
- Audio — voice, music, sound events
- PDF — full documents with a mix of text, images, and tables
The fact that PDF is a first-class modality means users don’t have to manually extract text and images from documents. The model does this internally and produces a single vector describing the entire document.
What are the practical applications?
The most obvious application is advanced semantic search over heterogeneous content. An organization with a mix of documents, images, and meeting recordings can index everything into the same vector index and search any media with any query.
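A single-index search over mixed media can be sketched as a brute-force similarity ranking. The item IDs and vectors below are invented for illustration; in practice the vectors would come from the embedding model, and a production system would use an approximate-nearest-neighbor index rather than a linear scan.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy index: (item_id, modality, embedding) for heterogeneous content.
index = [
    ("report.pdf",  "pdf",   [0.1, 0.9, 0.2]),
    ("standup.mp4", "video", [0.2, 0.8, 0.3]),
    ("logo.png",    "image", [0.9, 0.1, 0.0]),
]

def search(query_vec: list[float], index, top_k: int = 2):
    """Rank every item, regardless of modality, by similarity to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return [(item_id, modality) for item_id, modality, _ in ranked[:top_k]]

print(search([0.15, 0.85, 0.25], index))
# Items about the query's topic rank first, whatever their media type.
```

The point of the unified space is exactly this: one index, one ranking function, and any query modality can retrieve any content modality.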
For developers and companies building RAG applications, multimodal embedding simplifies the architecture. Instead of a pipeline that extracts text from a PDF, passes images through a separate model and audio through a third one, everything can go through a single API call. This reduces complexity and likely costs.
A GA release does not automatically mean the model fits every use case: accuracy depends on the specific data and domain. The recommendation is to test the model on your own dataset before migrating an entire production pipeline.
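Testing on your own dataset can be as simple as measuring recall@k on a small labeled sample: for each query with a known-relevant document, check whether that document appears in the top k retrieved results. The query and document IDs below are made up for illustration.

```python
def recall_at_k(results: dict[str, list[str]],
                relevant: dict[str, str],
                k: int = 5) -> float:
    """Fraction of queries whose known-relevant document appears in the top k."""
    hits = sum(1 for q, docs in results.items() if relevant[q] in docs[:k])
    return hits / len(results)

# Hypothetical retrieval output on a small labeled sample (IDs are invented).
results = {
    "q1": ["doc7", "doc2", "doc9"],
    "q2": ["doc1", "doc4", "doc8"],
    "q3": ["doc5", "doc3", "doc6"],
}
relevant = {"q1": "doc2", "q2": "doc8", "q3": "doc0"}

print(recall_at_k(results, relevant, k=3))  # 2 of 3 queries hit
```

Running a metric like this against the new model on a few dozen representative queries gives a concrete go/no-go signal before any migration.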