How do we learn representations that transfer across sensors, modalities, and domains? We develop self-supervised and multimodal embedding methods that align vision, language, and audio to support robust learning at scale. Recent work includes Frobenius norm minimization for self-supervised learning (FroSSL), global and local entailment learning for natural-world imagery (RCME), probabilistic masked multimodal embedding models for ecology (ProM3E), and a unified embedding space linking ground-level and satellite views (TaxaBind). These representations underpin the lab's work on geospatial search, biodiversity monitoring, and generative Earth-data synthesis.
Spotlight publications tagged for this research area.