Multimodal Vision Research Laboratory (MVRL)

Research Area: Audio

Our audio research explores multimodal learning that combines audio with visual and geospatial data, centered on zero-shot soundscape mapping: predicting what a place sounds like directly from overhead imagery by aligning modalities in a shared embedding space. Recent work includes PSM, which learns probabilistic embeddings for multi-scale zero-shot soundscape mapping; tri-modal (audio, image, text) embeddings for the same task; and Sat2Sound, a unified framework for soundscape mapping from satellite images. We also develop ProM3E, a probabilistic masked multimodal embedding model that integrates audio, image, and text for ecological applications, and earlier work mapped soundscapes from satellite imagery and social media data.
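
The common recipe behind these zero-shot mappers is to align satellite imagery, audio, and text in one embedding space, then answer queries by nearest-neighbor retrieval. The sketch below illustrates that recipe with a tri-modal contrastive model. It is a minimal, hypothetical illustration, not the published PSM/Sat2Sound code: the feature dimensions, the projection heads, the names TriModalEmbedder and infonce, and the retrieval step at the end are all assumptions made for the example.

```python
"""Minimal sketch of tri-modal contrastive alignment for zero-shot
soundscape mapping. Hypothetical stand-in, not the published code."""
import torch
import torch.nn as nn
import torch.nn.functional as F


class TriModalEmbedder(nn.Module):
    """Projects precomputed satellite-image, audio, and text features into
    one shared space. The real systems fine-tune pretrained vision/audio/
    text backbones; here each modality just gets a small MLP head."""

    def __init__(self, img_dim=768, audio_dim=512, text_dim=512, embed_dim=256):
        super().__init__()

        def head(d_in):
            return nn.Sequential(nn.Linear(d_in, embed_dim),
                                 nn.GELU(),
                                 nn.Linear(embed_dim, embed_dim))

        self.img_head = head(img_dim)
        self.audio_head = head(audio_dim)
        self.text_head = head(text_dim)
        # Learnable temperature, initialized near log(1/0.07) as in CLIP.
        self.log_temp = nn.Parameter(torch.tensor(2.659))

    def forward(self, img_feat, audio_feat, text_feat):
        # L2-normalize so dot products are cosine similarities.
        z_img = F.normalize(self.img_head(img_feat), dim=-1)
        z_aud = F.normalize(self.audio_head(audio_feat), dim=-1)
        z_txt = F.normalize(self.text_head(text_feat), dim=-1)
        return z_img, z_aud, z_txt


def infonce(za, zb, log_temp):
    """Symmetric InfoNCE: co-located pairs (row i with row i) are positives,
    every other item in the batch is a negative."""
    logits = log_temp.exp() * za @ zb.t()
    labels = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


model = TriModalEmbedder()
img, aud, txt = torch.randn(8, 768), torch.randn(8, 512), torch.randn(8, 512)
z_img, z_aud, z_txt = model(img, aud, txt)

# Sum the three pairwise contrastive terms to align all modalities.
loss = (infonce(z_img, z_aud, model.log_temp) +
        infonce(z_img, z_txt, model.log_temp) +
        infonce(z_aud, z_txt, model.log_temp))
loss.backward()

# Zero-shot mapping: embed every satellite tile once, then rank tiles by
# cosine similarity to a text (or audio) query embedding.
query = F.normalize(torch.randn(256), dim=-1)  # e.g. an embedding of "birdsong"
tile_scores = z_img @ query                    # higher = more likely soundscape
```

Note that this deterministic sketch only shows the shared-space retrieval idea; PSM and ProM3E go further by treating embeddings probabilistically (as distributions rather than points) to handle multi-scale ambiguity and missing modalities.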

Publications

  1. Sastry S, Khanal S, Dhakal A, Lin J, Cher D, Jarosz P, Jacobs N. 2025. ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology.
  2. Khanal S, Xing E, Sastry S, Dhakal A, Xiong Z, Ahmad A, Jacobs N. 2024. PSM: Learning Probabilistic Embeddings for Multi-scale Zero-shot Soundscape Mapping. In: ACM Multimedia. DOI: 10.1145/3664647.3681620.
  3. Khanal S, Sastry S, Dhakal A, Jacobs N. 2023. Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping. In: British Machine Vision Conference (BMVC).
  4. Salem T, Zhai M, Workman S, Jacobs N. 2018. A Multimodal Approach to Mapping Soundscapes. In: IEEE International Geoscience and Remote Sensing Symposium (IGARSS). DOI: 10.1109/IGARSS.2018.8517977.
  5. Song W, Salem T, Jacobs N, Johnson M. 2017. Detecting the Presence of Bird Vocalizations in Audio Segments Using a Convolutional Neural Network Architecture. In: International Symposium on Acoustic Communication by Animals.