Gemini File Search: multimodal search + webhooks

Google expanded File Search in the Gemini API to multimodal search, enabling native embedding and retrieval of images alongside text documents through the gemini-embedding-2 model. Two new grounding fields and event-driven webhook support for the Batch API were also added.

What did Google announce?

Google expanded the File Search feature in the Gemini API to multimodal search. Previously limited to text, it now enables native embedding and retrieval of images alongside traditional text documents using the gemini-embedding-2 model.

An embedding is a vector representation of content that allows semantic comparison, while grounding means linking a response to the specific source it draws on.
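To make "semantic comparison" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors are made up for illustration; they are not output from gemini-embedding-2, and the article does not say which similarity measure File Search uses internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors; real embeddings have far more dimensions.
query = [0.9, 0.1, 0.0]
diagram_image = [0.8, 0.2, 0.1]
unrelated_text = [0.0, 0.1, 0.9]

# The image vector points in nearly the same direction as the query,
# so it scores higher than the unrelated text vector.
print(cosine_similarity(query, diagram_image) > cosine_similarity(query, unrelated_text))
```

Because images and text are embedded into the same space, the same comparison works regardless of which modality each vector came from.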

What do the new grounding fields add?

The update introduces two new metadata fields in File Search results:

media_id: an identifier for visual citations, allowing a reference in the response to be linked to the exact image.
page_numbers: page numbers within the source document, making it easier to locate a citation in PDFs or other multi-page files.

For development teams, this means RAG applications can now cite an image from technical documentation as naturally as a text passage.
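As a rough sketch of how an application might turn these two fields into a readable citation, the helper below formats a single grounding entry. Only the field names media_id and page_numbers come from the announcement; the dict shape, the "title" key, and the function name are assumptions for illustration, not the actual response schema.

```python
def format_citation(chunk):
    """Build a human-readable citation from one grounding entry.

    `chunk` is a plain dict whose shape is assumed for illustration:
    {"title": str, "page_numbers": list[int] (optional), "media_id": str (optional)}
    """
    parts = [chunk.get("title", "untitled source")]
    pages = chunk.get("page_numbers") or []
    if pages:
        label = "p." if len(pages) == 1 else "pp."
        parts.append(f"{label} {', '.join(str(p) for p in pages)}")
    media_id = chunk.get("media_id")
    if media_id:
        parts.append(f"image {media_id}")
    return "; ".join(parts)

# Hypothetical entries: one text citation and one that also points at an image.
examples = [
    {"title": "pump-manual.pdf", "page_numbers": [12]},
    {"title": "pump-manual.pdf", "page_numbers": [12, 13], "media_id": "img-0042"},
]
for chunk in examples:
    print(format_citation(chunk))
```

Treating both fields as optional keeps the same code path working for plain text passages, paginated documents, and images.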

What does webhook support bring?

Alongside File Search, Google introduced event-driven webhook support in the Gemini API on May 4. It replaces traditional polling workflows for Batch API operations and other long-running processes.

Instead of the client asking “is it done?” every few seconds, Gemini sends an HTTP request to the configured URL when the status changes, which reduces client-side overhead and notification latency.
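The receiving side of this pattern can be sketched with Python's standard library alone. The payload shape is an assumption: the "name" and "state" keys, the handler class, and the port are hypothetical and chosen only for illustration, so check the official webhook documentation for the real event schema. A production receiver would also authenticate incoming requests, which this sketch omits.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_event(body):
    """Decode a webhook body and return (operation_name, state).

    The "name" and "state" keys are assumed for illustration only.
    """
    event = json.loads(body)
    if not isinstance(event, dict):
        raise ValueError("expected a JSON object")
    return event.get("name"), event.get("state")

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the sender declared.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            name, state = parse_event(body)
        except ValueError:
            # Malformed JSON or an unexpected top-level type.
            self.send_response(400)
            self.end_headers()
            return
        print(f"operation {name} is now {state}")
        # Acknowledge quickly; heavy work belongs in a background job.
        self.send_response(204)
        self.end_headers()

def make_server(port=8080):
    """Create (but do not start) a local server that accepts webhook calls."""
    return HTTPServer(("127.0.0.1", port), WebhookHandler)
```

Calling make_server().serve_forever() starts listening. The client then does nothing until a request arrives, in contrast to a polling loop that repeatedly asks for the status.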

Why does this matter?

Multimodal File Search removes the need for separate pipelines for images and text — one vector space covers both. This is valuable for enterprise scenarios such as searching product catalogs, medical documentation, or technical manuals with diagrams.

Webhook support, meanwhile, modernizes integration for batch processes and makes the Gemini API more compatible with event-driven architectures.

Frequently Asked Questions

Which model powers multimodal File Search?

The gemini-embedding-2 model, which natively embeds images and text into a shared vector space.

What are the new grounding fields?

media_id for visual citations and page_numbers for tracking position within a document.

What do webhooks add?

They replace polling workflows for the Batch API and other long-running processes, reducing client-side overhead.
