Rethinking Machine Learning for Climate Science: A Dataset Perspective
Aditya Grover (UCLA)
Abstract
The growing availability of data sources is a predominant factor enabling the widespread success of machine learning (ML) systems. Typically, the training data in such systems constitutes a source of ground truth, such as measurements of a physical object (e.g., natural images) or a human artifact (e.g., natural language). In this position paper, we take a critical look at the validity of this assumption for climate science datasets. We argue that many such datasets are uniquely biased due to the pervasive use of external simulation models (e.g., general circulation models) and proxy variables (e.g., satellite measurements) for imputing and extrapolating in-situ observational data. We discuss opportunities for mitigating this bias in the training and deployment of ML systems that use such datasets. Finally, we share views on improving the reliability and accountability of ML systems for climate science applications.