Using multiple input modalities can improve data-efficiency for ML with satellite imagery (Papers Track)

Arjun Rao (The University of Colorado Boulder); Esther Rolf (The University of Colorado Boulder)

Topics: Earth Observation & Monitoring · Computer Vision & Remote Sensing

Abstract

A large corpus of diverse geospatial data layers is available around the world, ranging from remotely sensed raster data such as satellite imagery, digital elevation maps, and predicted land cover maps, to human-annotated data such as OpenStreetMap, to data derived from environmental sensors such as air temperature or wind speed measurements. A large majority of geospatial machine learning (GeoML) models, however, are designed for optical modalities such as multi-spectral satellite imagery. We show improved GeoML model performance on classification and segmentation tasks when these geospatial inputs are fused with optical input imagery as additional contextual cues, either as an additional input band or as an auxiliary token passed to a Vision Transformer, within a supervised learning setting. Benefits are largest in settings where labeled data are limited, suggesting that multi-modal inputs may be especially valuable for the data-efficiency of GeoML models.
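
The abstract names two fusion strategies but does not give implementation details; the sketch below illustrates what each could look like in PyTorch. It is not the authors' code: the class name AuxTokenViT, the layer sizes, the patch size, and the auxiliary feature dimension are all illustrative assumptions.

    # Minimal sketch of the two fusion strategies described in the abstract.
    # All names and dimensions are hypothetical, not from the paper.
    import torch
    import torch.nn as nn

    # --- Strategy 1: fuse a geospatial layer as an additional input band ---
    # Stack a co-registered auxiliary raster (e.g. elevation) onto the
    # optical bands along the channel axis, and widen the first layer so
    # it accepts the extra band.
    optical = torch.randn(2, 4, 64, 64)    # batch of 4-band optical imagery
    elevation = torch.randn(2, 1, 64, 64)  # co-registered auxiliary raster
    fused = torch.cat([optical, elevation], dim=1)  # shape (2, 5, 64, 64)
    stem = nn.Conv2d(in_channels=5, out_channels=32, kernel_size=3, padding=1)
    features = stem(fused)

    # --- Strategy 2: pass auxiliary data as an extra token to a ViT ---
    class AuxTokenViT(nn.Module):
        """Toy ViT encoder that prepends one auxiliary token (hypothetical)."""
        def __init__(self, in_ch=4, embed_dim=64, aux_dim=8, patch=8):
            super().__init__()
            # Patch embedding via a strided convolution
            self.patchify = nn.Conv2d(in_ch, embed_dim,
                                      kernel_size=patch, stride=patch)
            # Project the auxiliary vector into the token embedding space
            self.aux_proj = nn.Linear(aux_dim, embed_dim)
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                           batch_first=True),
                num_layers=2,
            )

        def forward(self, x, aux):
            tokens = self.patchify(x).flatten(2).transpose(1, 2)  # (B, N, D)
            aux_token = self.aux_proj(aux).unsqueeze(1)           # (B, 1, D)
            # Auxiliary token attends to (and is attended by) image patches
            return self.encoder(torch.cat([aux_token, tokens], dim=1))

    model = AuxTokenViT()
    aux = torch.randn(2, 8)  # e.g. summary of air temperature / wind speed
    out = model(optical, aux)
    print(features.shape, out.shape)  # torch.Size([2, 32, 64, 64]) torch.Size([2, 65, 64])

In both cases the fusion is purely input-side, so a standard supervised classification or segmentation loss can be used unchanged; the band-stacking variant requires the auxiliary layer to be rasterized and spatially aligned with the imagery, while the token variant also accepts non-raster summaries.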