How can vision-language models (VLMs) retrieve, compose, and reason over images with text? We develop and apply vision-language methods for retrieval, captioning, and multimodal understanding in geospatial and natural-world settings. Recent work includes query-adaptive retrieval improvement for vision-language models (QuARI), learning from concepts in text for composed image retrieval (ConText-CIR), and vision-language pseudo-labels for single-positive multi-label learning. Our VLM research connects language to Earth observation and biodiversity tasks, complementing dedicated representation learning and generative modeling efforts elsewhere in the lab.
Spotlight publications tagged for this research area.