Our audio research explores multimodal learning that combines audio with visual and geospatial data. Much of this work targets zero-shot soundscape mapping: recent contributions include tri-modal embeddings learned for the task, probabilistic embeddings for multi-scale mapping (PSM), and a unified framework that maps soundscapes directly from satellite images (Sat2Sound). We also develop probabilistic masked multimodal embedding models that integrate audio, image, and text for ecological applications (ProM3E), and explore multimodal approaches to mapping soundscapes from satellite imagery and social media data.
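To make the zero-shot setting concrete, here is a minimal sketch of the shared-embedding idea these projects build on: encoders for each modality are trained into one space, and a satellite-image tile can then be matched to sound concepts it was never labeled with. All names, shapes, and data below are hypothetical illustrations, not the PSM, Sat2Sound, or ProM3E implementations.

```python
# Illustrative sketch only: zero-shot soundscape "mapping" on top of a
# hypothetical pre-aligned embedding space (data and labels are made up).
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for the outputs of modality-specific encoders trained into one
# shared space: one embedding per satellite-image tile, one per sound concept.
d = 64                                                # embedding dimension (arbitrary)
tile_emb = l2_normalize(rng.normal(size=(1000, d)))   # 1000 map tiles
sound_emb = l2_normalize(rng.normal(size=(5, d)))     # 5 candidate sound concepts
sound_labels = ["birdsong", "traffic", "water", "wind", "voices"]

# Zero-shot step: no tile carries an audio label; each tile is assigned the
# sound concept whose embedding is nearest in the shared space.
similarity = tile_emb @ sound_emb.T     # (1000, 5) cosine similarities
predicted = similarity.argmax(axis=1)   # best-matching sound per tile

for tile_id in range(3):
    print(f"tile {tile_id}: {sound_labels[predicted[tile_id]]} "
          f"(sim={similarity[tile_id, predicted[tile_id]]:.2f})")
```

The probabilistic variants (PSM, ProM3E) replace each point embedding above with a distribution over the embedding space, so a tile's match to a sound concept comes with an uncertainty estimate rather than a single similarity score.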