Multimodal model

A multimodal model is an AI system that processes or generates more than one type of data (modality), such as text, images, audio, or video, within a single architecture. Unlike a single-modality model, it can, for example, describe the contents of a photo, answer a question about a chart, or create an image from a text description.

Technically, each modality is converted into a shared representation space (see embedding) so that the same network can process all of them jointly. Modern “natively multimodal” models are trained on mixed data from the outset and are most often built on a transformer architecture, while image and video generation frequently relies on diffusion models.
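
The idea of a shared representation space can be illustrated with a minimal sketch. It is not how any particular model is implemented: the sizes, the random weights (standing in for learned parameters), and the helper names embed_text and embed_image are all hypothetical. It only shows that once text tokens and image patches are mapped to vectors of the same width, they can be placed in one sequence for a single network to process.

```python
import random

D_MODEL = 4  # width of the shared embedding space (illustrative value)


def make_matrix(rows, cols, rng):
    """Random weights standing in for parameters a real model would learn."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def embed_text(token_ids, table):
    """Look up one D_MODEL-wide vector per token id (hypothetical helper)."""
    return [table[t] for t in token_ids]


def embed_image(patches, proj):
    """Linearly project each flattened patch into the shared space.

    proj has shape (patch_dim, d_model); each output vector is patch @ proj.
    """
    return [
        [sum(p * w for p, w in zip(patch, col)) for col in zip(*proj)]
        for patch in patches
    ]


rng = random.Random(0)
VOCAB_SIZE, PATCH_DIM = 10, 6  # assumed toy sizes
text_table = make_matrix(VOCAB_SIZE, D_MODEL, rng)
patch_proj = make_matrix(PATCH_DIM, D_MODEL, rng)

token_ids = [3, 1, 7]                              # a toy "sentence"
patches = [[0.5] * PATCH_DIM, [0.1] * PATCH_DIM]   # two toy image patches

# Both modalities are now lists of D_MODEL-wide vectors, so they can be
# concatenated into one sequence that a single network could consume.
sequence = embed_image(patches, patch_proj) + embed_text(token_ids, text_table)
```

Here sequence holds five vectors (two from the image, three from the text), all of the same width. In a real system the projection and lookup table would be trained jointly with the rest of the network rather than sampled at random.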

Over 2025–2026, multimodality has become standard for leading foundation models: Gemini, GPT-4o, Claude, and others accept some combination of text, images, documents, audio, and video, though the supported set varies by model. This is a key step toward assistants that “see” and “hear,” and it underpins agentic systems that act on real, varied inputs.
