AWS: semantic search of aerial imagery with Amazon Nova Multimodal Embeddings (Vexcel)
Vexcel and AWS demonstrated semantic search of aerial imagery using Amazon Nova Multimodal Embeddings. Across roughly 100 tested configurations, LLM-generated captions delivered the largest gains, improving the F1 score by 11% for swimming pools and by 13% for roads. The work evolved into the commercial product Vexcel Intelligence, available in preview in more than 45 countries.
This article was generated using artificial intelligence from primary sources.
How does Amazon Nova find swimming pools in aerial photos?
Amazon Nova Multimodal Embeddings, a model that maps text and images into a shared vector space, achieved an F1 score of 0.621 for swimming pool detection and 0.555 for roads in aerial imagery. Vexcel, a leading provider of aerial geospatial data, tested around 100 model configurations and parameter settings before selecting Amazon Nova as the foundation of the system.
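For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The following is a minimal sketch of how it can be computed from detection counts. The function name and the counts in the usage line are hypothetical and are not Vexcel's data.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the harmonic mean of precision and recall (0.0 when undefined)."""
    predicted = true_positives + false_positives   # everything the system returned
    actual = true_positives + false_negatives      # everything it should have returned
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not figures from the article)
example = f1_score(true_positives=60, false_positives=40, false_negatives=30)
```

Because F1 penalizes both missed objects and false alarms, a score such as 0.621 reflects a balance of the two rather than raw accuracy.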
Multimodal embeddings are numerical vectors that encode visual and textual content in one shared space. As a result, a user can search millions of aerial images with a simple text query, without manually labeling every image.
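The idea of text-to-image search in a shared vector space can be sketched in a few lines. This is an illustration only: the three-dimensional vectors are toy stand-ins for real embeddings, the tile names are invented, and cosine similarity is assumed as the ranking measure because the article does not state which one the system uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=3):
    """Rank indexed image vectors by similarity to the query vector."""
    scored = [(image_id, cosine_similarity(query_vector, vector))
              for image_id, vector in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for image embeddings (hypothetical tile names)
toy_index = {
    "tile_pool": [0.9, 0.1, 0.0],
    "tile_road": [0.1, 0.9, 0.1],
    "tile_forest": [0.0, 0.2, 0.9],
}
# Pretend this is the embedding of the text query "swimming pool"
toy_query = [0.8, 0.2, 0.1]
results = search(toy_query, toy_index, top_k=2)
```

In a real deployment the query and image vectors would come from the embedding model, and a vector database would replace the linear scan.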
LLM captions as the key factor
The single largest gain in the entire project came from captions generated automatically by a large language model: +11% F1 for pools and +13% for roads compared with configurations that used no text descriptions. This finding indicates that combining text with visual content outperforms purely visual approaches to searching satellite and aerial imagery.
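The article does not describe how captions were combined with the imagery. One common way to blend a caption embedding with an image embedding is a normalized weighted average, sketched below purely as an assumption. The function names and the `caption_weight` parameter are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def fuse_embeddings(image_vec, caption_vec, caption_weight=0.5):
    """Blend an image embedding with its caption embedding.

    Assumption: a simple weighted average of unit vectors; the source
    does not say this is the method Vexcel used.
    """
    if len(image_vec) != len(caption_vec):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= caption_weight <= 1.0:
        raise ValueError("caption_weight must be between 0 and 1")
    image_unit = l2_normalize(image_vec)
    caption_unit = l2_normalize(caption_vec)
    mixed = [(1 - caption_weight) * i + caption_weight * c
             for i, c in zip(image_unit, caption_unit)]
    return l2_normalize(mixed)


# Toy two-dimensional vectors, for illustration only
fused = fuse_embeddings([1.0, 0.0], [0.0, 1.0], caption_weight=0.5)
```

Setting `caption_weight` to 0 reproduces a purely visual index, which gives a baseline for measuring how much the captions help.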
Every location in Vexcel’s database is covered by seven perspectives: an orthophoto from directly above, four oblique captures at different angles, a digital surface model (DSM), and a digital terrain model (DTM).
Commercial outcome and technical stack
The research evolved directly into the commercial product Vexcel Intelligence, which is available in preview in more than 45 countries. The backend infrastructure relies on Amazon Bedrock for models, OpenSearch Serverless for vector search, and Amazon S3 for imagery storage.
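To make the vector search step more concrete, here is a sketch of the request body for a k-NN query in the OpenSearch query DSL. It only builds the JSON and sends nothing. The field name `embedding` and the helper name are hypothetical, since the article does not publish the index schema.

```python
import json


def build_knn_query(query_vector, k=10, vector_field="embedding"):
    """Build an OpenSearch k-NN search body for a query embedding.

    `vector_field` is a hypothetical field name; a real index would
    define its own knn_vector field in the mapping.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return {
        "size": k,
        "query": {
            "knn": {
                vector_field: {
                    "vector": list(query_vector),
                    "k": k,
                }
            }
        },
    }


# Toy vector standing in for the embedding of a text query
body = build_knn_query([0.12, -0.03, 0.88], k=5)
payload = json.dumps(body)
```

In such a setup the query vector would come from the embedding model via Amazon Bedrock, and each hit would point to an image stored in Amazon S3.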
Unlike traditional approaches that require manually labeled datasets for each object category, semantic search based on multimodal embeddings enables queries such as ‘industrial zone alongside a river’ without any prior annotation.
Frequently Asked Questions
- What are multimodal embeddings and why are they useful for image search?
- Multimodal embeddings are numerical vectors that encode both textual and visual content into a single shared space, enabling image search via text queries without manually labeling every photograph.
- How much did adding LLM-generated captions improve aerial photo search?
- LLM-generated captions delivered +11% F1 for pool detection and +13% for roads, the largest single gain across all of the roughly 100 configurations tested.