Multimodal Vision Research Laboratory

MVRL

Research Area: Vision-Language Modeling

This research area covers the development and use of vision-language models for vision tasks. We aim to develop new algorithms for these models and to apply them to real-world problems. Recent research includes query adaptive retrieval improvement (QuARI) for vision-language models, learning from concepts in text for composed image retrieval (ConText-CIR), and vision-language pseudo-labels for single-positive multi-label learning. Applications include natural world imagery understanding, geospatial analysis, and multimodal learning across image, text, and audio.

Publications

  1. Xiong Z, Ye X, Yaman B, Cheng S, Lu Y, Luo J, Jacobs N, Ren L. 2026. UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving. International Journal of Applied Earth Observation and Geoinformation.
  2. Xing E, Stylianou A, Pless R, Jacobs N. 2025. QuARI: Query Adaptive Retrieval Improvement. In: Neural Information Processing Systems (NeurIPS).
  3. Sastry S, Xing X, Dhakal A, Khanal S, Ahmad A, Jacobs N. 2025. LD-SDM: Language-Driven Hierarchical Species Distribution Modeling. In: Computer Vision for Ecology (IEEE International Conference on Computer Vision (ICCV) Workshops).
  4. Xing E, Kolouju P, Pless R, Stylianou A, Jacobs N. 2025. ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  5. Sarkar A, Sastry S, Pirinen A, Zhang C, Jacobs N, Vorobeychik Y. 2024. GOMAA-Geo: GOal Modality Agnostic Active Geo-localization. In: Neural Information Processing Systems (NeurIPS).
  6. Dhakal A, Ahmad A, Khanal S, Sastry S, Kerner HR, Jacobs N. 2024. Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images. In: IEEE/ISPRS Workshop: Large Scale Computer Vision for Remote Sensing (EARTHVISION).
  7. Xing X, Xiong Z, Stylianou A, Sastry S, Gong L, Jacobs N. 2024. Vision-Language Pseudo-Labels for Single-Positive Multi-Label Learning. In: Workshop on Representation Learning with Very Limited Images.
  8. Levering A, Marcos D, Jacobs N, Tuia D. 2024. Prompt-guided and multimodal landscape scenicness assessments with vision-language models. PLOS ONE.