Multimodal Vision Research Laboratory

MVRL

Research Area: Vision-Language Modeling

How can vision-language models retrieve, compose, and reason over images with text? We develop and apply vision-language methods for retrieval, captioning, and multimodal understanding in geospatial and natural-world settings. Recent work includes query-adaptive retrieval improvement for vision-language models (QuARI), learning from concepts in text for composed image retrieval (ConText-CIR), and vision-language pseudo-labels for single-positive multi-label learning. Our vision-language model (VLM) research connects language to Earth observation and biodiversity tasks, complementing dedicated representation learning and generative modeling efforts elsewhere in the lab.

All Publications

  1. Xiong Z, Ye X, Yaman B, Cheng S, Lu Y, Luo J, Jacobs N, Ren L. 2026. UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving.
  2. Xing E, Stylianou A, Pless R, Jacobs N. 2025. QuARI: Query Adaptive Retrieval Improvement. In: Neural Information Processing Systems (NeurIPS).
  3. Sastry S, Xing X, Dhakal A, Khanal S, Ahmad A, Jacobs N. 2025. LD-SDM: Language-Driven Hierarchical Species Distribution Modeling. In: Computer Vision for Ecology (IEEE/CVF International Conference on Computer Vision (ICCV) Workshops).
  4. Xing E, Kolouju P, Pless R, Stylianou A, Jacobs N. 2025. ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  5. Sarkar A, Sastry S, Pirinen A, Zhang C, Jacobs N, Vorobeychik Y. 2024. GOMAA-Geo: GOal Modality Agnostic Active Geo-localization. In: Neural Information Processing Systems (NeurIPS).
  6. Levering A, Marcos D, Jacobs N, Tuia D. 2024. Prompt-guided and multimodal landscape scenicness assessments with vision-language models. PLOS ONE.
  7. Xing X, Xiong Z, Stylianou A, Sastry S, Gong L, Jacobs N. 2024. Vision-Language Pseudo-Labels for Single-Positive Multi-Label Learning. In: Workshop on Representation Learning with Very Limited Images.
  8. Dhakal A, Ahmad A, Khanal S, Sastry S, Kerner HR, Jacobs N. 2024. Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images. In: IEEE/ISPRS Workshop: Large Scale Computer Vision for Remote Sensing (EARTHVISION).