Vision-language model

A vision-language model (VLM) is an AI system that jointly processes images and text within a single model. Unlike a large language model limited to text, a VLM can describe a photograph, answer questions about a chart or diagram, and read and interpret text embedded in an image.

Technically, an image is passed through a vision encoder that turns it into a sequence of vector representations (see embedding). These vectors are inserted into the same token stream as the text embeddings, so a shared transformer backbone processes both jointly. The model thereby learns the relationships between what it “sees” and what it expresses in words. The output is usually text: a description, an answer, or an analysis.
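
The following is a minimal Python sketch of that flow, not how any particular model is implemented. It assumes a toy grayscale image whose sides are divisible by the patch size, and it uses random weights where a real system would use a trained vision encoder and a trained text embedding table. The names `D_MODEL`, `PATCH`, `image_to_patches`, `linear`, and `build_sequence` are hypothetical and chosen only for illustration.

```python
import random

D_MODEL = 8  # hypothetical embedding width shared by image and text vectors
PATCH = 2    # hypothetical patch side length, in pixels


def image_to_patches(image, patch=PATCH):
    """Split a 2D grayscale image (list of rows) into flattened patch vectors.

    Assumes the image height and width are multiples of `patch`.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches


def linear(vector, weights):
    """Project a vector with a weight matrix given as a list of output rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def build_sequence(image, token_ids, seed=0):
    """Return one sequence holding image vectors followed by text vectors."""
    rng = random.Random(seed)
    # Stand-in for a trained vision encoder: a single random linear projection.
    proj = [[rng.uniform(-1, 1) for _ in range(PATCH * PATCH)]
            for _ in range(D_MODEL)]
    # Stand-in for a trained text embedding table (toy vocabulary of 16 tokens).
    table = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(16)]

    image_vectors = [linear(p, proj) for p in image_to_patches(image)]
    text_vectors = [table[t] for t in token_ids]
    # Both kinds of vector have width D_MODEL, so they can share one stream
    # that a transformer backbone would then process jointly.
    return image_vectors + text_vectors


image = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]  # 4x4 toy image
sequence = build_sequence(image, token_ids=[3, 7, 1])
print(len(sequence), len(sequence[0]))  # 4 image patches + 3 text tokens, width 8
```

The point of the sketch is the last step of `build_sequence`: once image patches and text tokens are mapped to vectors of the same width, the backbone receives a single sequence and does not need separate pathways for each modality.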

Across 2025-2026, vision-language capability has become standard in leading foundation models: GPT-4o, Claude, and Gemini natively accept images, documents, and screenshots. This is a key step toward assistants that can “see,” and it also underpins agentic systems that act on visual inputs such as user interfaces and spreadsheets.
