Data Gaps (Beta) - More Info
About
Artificial intelligence (AI) and machine learning (ML) offer a powerful suite of tools to accelerate climate change mitigation and adaptation across different sectors. However, the lack of high-quality, easily accessible, and standardized data often hinders the impactful use of AI/ML for climate change applications.
In this project, Climate Change AI, with the support of Google DeepMind, aims to identify and catalog critical data gaps that impede AI/ML applications in addressing climate change, and lay out pathways for filling these gaps. In particular, we identify candidate improvements to existing datasets, as well as "wishes" for new datasets whose creation would enable specific ML-for-climate use cases. We hope that researchers, practitioners, data providers, funders, policymakers, and others will join the effort to address these critical data gaps.
Our list of data gaps is available at this link. This page provides more details on the methodology, taxonomy of data gaps, and stakeholders consulted.
This project is currently in its beta phase, with ongoing improvements to content and usability. We encourage you to provide input and contributions via the routes listed below, or by emailing us at datagaps@climatechange.ai. We are grateful to the many stakeholders and interviewees who have already provided input.
- Contribute a new data gap by filling out this form.
- Provide updates to an existing data gap by clicking the "Give feedback" button within the Details view for that data gap on the Data Gaps main page.
- Provide general feedback (e.g., on content, usability, or actionability) by filling out this form.
Methodology
Climate Change AI's list of critical data gaps was compiled via a combination of desk research and stakeholder interviews. Please check back soon for more details on our methodology. A list of interviewees is provided below.
Taxonomy of Data Gaps
Data gaps are classified into six categories: Wish, Obtainability, Usability, Reliability, Sufficiency, and Miscellaneous/Other. (A minimal sketch of how these category codes might be used programmatically follows the list below.)
‣Type W: Wish - Dataset does not exist.
‣Type O: Obtainability - Dataset is not easily obtainable.
- O1: Findability - Dataset is not easy to find for humans and/or computers.
- Dataset is not Findable according to FAIR Principles (“metadata and data should be easy to find for both humans and computers”).
- O2: Accessibility - Dataset is difficult to access.
- Dataset is not Accessible according to FAIR Principles (“once the user finds the required data, she/he/they need to know how they can be accessed, possibly including authentication and authorisation”).
- Dataset is not freely available to the public.
- Obtaining access to the dataset requires lengthy bureaucratic approval or may otherwise be difficult/infeasible in practice.
‣Type U: Usability - Data is not readily usable.
- U1: Structure - Dataset is not machine-readable, well-formatted, and/or interoperable.
- Dataset is not in a machine-readable format.
- Dataset is not in a uniform, consistent, and standardized format. (See also FAIR Principles R1.3.)
- Dataset is not Interoperable according to FAIR Principles (“data [are amenable to being] integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing”).
- U2: Aggregation - Data is scattered and requires consolidation.
- Data is scattered and requires consolidation into a centralized dataset.
- U3: Usage Rights - Data usage rights are unclear or restrictive.
- Data usage rights are unclear (see also FAIR Principles R1.1).
- Dataset is released under a restrictive usage license.
- U4: Documentation - Documentation on data usage requires improvement.
- Documentation or other types of metadata required to help users understand how to use the data are incomplete, lacking in detail, or unclear (see also FAIR Principles R1 and Datasheets for Datasets).
- U5: Pre-processing - Data needs to be processed or cleaned prior to analysis.
- Data contains excessive missing values, noise, or duplicates that need to be cleaned.
- Data needs to be annotated.
- U6: Large Volume - Usability is impeded by large data volume.
- Dataset requires significant computational resources to process, presenting challenges for users who lack sufficient computing power.
- Dataset is too large to easily be downloaded and transferred to computing infrastructure that is not already co-located with the dataset.
- Data provider is unable to store or host the dataset effectively due to insufficient storage space.
- Dataset is not partitioned or searchable, requiring bulk download with significant computational resources.
‣Type R: Reliability - Data needs to be improved, validated, and/or verified.
- R1: Quality - Data quality needs to be improved, validated, and/or verified.
- Data may contain significant errors or inaccuracies.
- Data needs to be “ground truthed” or otherwise validated/verified.
- R2: Provenance - Data integrity needs to be validated/verified due to provenance.
- Data provenance is not properly documented (see also FAIR Principles R1.2 and Datasheets for Datasets).
- Integrity of data needs to be validated/verified by a trustworthy source due to provenance-related issues (e.g., data is self-reported or comes from an unverified source).
‣Type S: Sufficiency - Data is insufficient and needs to be collected or simulated.
- S1: Insufficient Volume - Data volume is insufficient for intended tasks.
- Amount of data is insufficient for intended machine learning tasks.
- S2: Coverage - Data coverage is limited (e.g., geographically, temporally, or demographically).
- Data is only available for certain regions, time periods, demographic groups, etc., thereby limiting its usefulness.
- S3: Granularity - Data is lacking in granularity/resolution.
- More granular data is needed, e.g., with respect to spatial or temporal resolution.
- S4: Timeliness - Data is not released promptly or is otherwise out of date.
- Data is not released promptly or is otherwise out of date.
- S5: Proxy - Data needs to be inferred or simulated.
- Ground truth data is difficult or impossible to collect, and instead needs to be inferred or simulated.
- S6: Missing Components - Dataset is missing important variables or types of information.
- Dataset is missing additional variables or types of information that are important for downstream analysis (e.g., because this information has not yet been collected).
‣Type M: Miscellaneous/Other - Challenges or gaps that do not fit into the other categories, including challenges that arise from the use of multiple datasets.
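For readers who want to work with the taxonomy programmatically (e.g., to filter or aggregate catalog entries by gap type), the following is a minimal sketch of how the category codes above might be encoded. The class names, fields, and the example entry are illustrative assumptions, not part of the actual catalog schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class GapType(Enum):
    """Top-level data gap categories from the taxonomy above."""
    WISH = "W"            # Dataset does not exist
    OBTAINABILITY = "O"   # Dataset is not easily obtainable
    USABILITY = "U"       # Data is not readily usable
    RELIABILITY = "R"     # Data needs to be improved, validated, and/or verified
    SUFFICIENCY = "S"     # Data is insufficient and needs to be collected or simulated
    MISC = "M"            # Miscellaneous/Other

@dataclass
class DataGap:
    """A hypothetical catalog entry tagged with one or more taxonomy codes."""
    title: str
    description: str
    gap_codes: List[str] = field(default_factory=list)  # e.g. ["O2", "S2"]

def top_level(code: str) -> GapType:
    """Map a subcategory code such as 'O2' or 'U4' to its top-level category."""
    return GapType(code[0])

# Hypothetical example entry (not taken from the actual catalog):
example = DataGap(
    title="Feeder-level electricity demand data",
    description="Demand data is restricted and only available for a few regions, "
                "limiting ML-based load forecasting.",
    gap_codes=["O2", "S2"],  # Accessibility + Coverage
)
print([top_level(c) for c in example.gap_codes])
```

In this sketch, an entry can carry several subcategory codes at once, mirroring the fact that a single dataset may exhibit multiple gaps (for example, both restricted access and limited geographic coverage).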
Acknowledgments
We would like to thank the following individuals for their input to this stocktake of critical data gaps.
- Aditya Jain (Mila - Quebec AI Institute)
- Adrian Kelly (EPRI Europe)
- Alexandre Lacoste (ServiceNow Research)
- Alexandre Parisot (Linux Foundation Energy)
- Alexis Groshenry (Kayrros)
- Amanda Sessim Parisenti (Climate Change AI)
- Amit Narayan (Climate AI Ventures)
- Ankita Shukla (University of Nevada Reno)
- Armin Aligholian (Recurve)
- Arthur Ouaknine (McGill University and Mila)
- Beichen Zhang (Lawrence Berkeley National Laboratory)
- Ben Weinstein (University of Florida, Department of Wildlife Ecology and Conservation)
- Bharathan Balaji (Amazon)
- Bright Aboh
- Carlos Silva (Pachama)
- Carly Batist (WildMon)
- Chinmay Adhvaryu
- Dan Morris (Google Research)
- Dan Stowell (Tilburg University/Naturalis Biodiversity Centre)
- Daniel Gebbran (Equilibrium Energy)
- Dara Farrell
- Dave Thau (WWF)
- David Dao (Gainforest.Earth)
- David Rolnick (McGill University and Mila)
- Pierre de Sainte Agathe (Environmental Data Scientist)
- Diane Cook (Washington State University)
- Edward Anderson (The World Bank)
- Erin Moreland (NOAA Fisheries)
- Felix Strieth Kalthoff (Bergische Universität Wuppertal)
- Filippo Varini (Imperial College London)
- Genevieve Flaspohler (Rhiza Research)
- Grace Colverd (University of Cambridge)
- Griffin Mooers (Massachusetts Institute of Technology)
- Hanyu Zhang (Georgia Tech)
- Ioana Colfescu (University of St Andrews)
- Issa Tingzon (The World Bank/GFDRR)
- Jan Drgona (Pacific Northwest National Lab)
- Jeremy Renshaw (EPRI)
- Jhi-Young Joo (Lawrence Livermore National Laboratory)
- Jingwen Yang (RETEC International New Energy Pte. Ltd.)
- Jonathan Weyn (Microsoft)
- Joshua Cortez (Thinking Machines Data Science)
- Juan Sebastián Cañas (University College London)
- Julia Gottfriedsen (OroraTech GmbH, LMU Munich)
- Julia Kaltenborn (McGill University and Mila)
- Kai Jeggle (ETH Zurich)
- Kaiping Chen (University of Wisconsin-Madison)
- Kakani Katija (MBARI)
- Kasia Tokarska de los Santos (CarbonPool)
- Katherine Lamb (Catalyst Cooperative)
- Kevin Barnard (MBARI)
- Konstantin Klemmer (Microsoft Research)
- Kris Sankaran (University of Wisconsin-Madison)
- Kristie Kaminski Kuster
- Leland Werden (ETH Zurich)
- Levente Klein (IBM Research)
- Lucy Yu (Centre for Net Zero)
- Lynn Kaack (Hertie School)
- Marcella Scoczynski (Federal University of Technology - Paraná, Brazil)
- Marcus Voss (Birds on Mars, TU Berlin)
- Maria João Sousa (Climate Change AI, Cornell Tech)
- Marissa Ramirez de Chanlatte (Lawrence Berkeley National Laboratory)
- Martin Horsky (Vattenfall)
- Mattia Baldini (Danish Energy Agency)
- Max Callaghan (Mercator Research Institute on Global Commons and Climate Change)
- Meareg Hailemariam (University of California, Berkeley)
- Mélisande Teng (Mila, Université de Montréal)
- Mercedeh Tariverdi (The World Bank)
- Merl Chandana (LIRNEasia and Lanka Electricity Company)
- Michael Bunsen (Mila - Quebec AI Institute)
- Michelle Audirac (Harvard T.H. Chan School of Public Health)
- Mike Harfoot (Vizzuality)
- Millie Chapman (National Center for Ecological Analysis and Synthesis)
- Ming Cong (Envision Energy)
- Mohit Anand (Helmholtz Center for Environmental Research — UFZ, Department of Compound Environmental Risks, Leipzig, Germany)
- Negar Sadrzadeh (University of British Columbia)
- Nikola Milojevic-Dupont (MCC Berlin)
- OceanLabs Seychelle
- Oliver Watt-Meyer (Allen Institute for AI)
- Olivier Francon (Cognizant AI Labs)
- Omar Younis (ETH Zurich)
- Oriana Chegwidden (CarbonPlan)
- Panos Moutis (The City College of New York)
- Peter van Lunteren (Addax Data Science)
- Qian Xiao (ADAPT Centre, Ireland Trinity College Dublin, School of Computer Science and Statistics, Maynooth University, Maynooth International Engineering College (MIEC))
- Ramya Srinivasan (Fujitsu Research of America)
- Rowan Converse (Center for the Advancement of Spatial Informatics Research and Education, University of New Mexico)
- Santiago Martinez Balvanera (University College London)
- Sara Beery (MIT)
- Sascha von Meier (Independent Consultant)
- Sergei Nozdrenkov (wildflow.ai)
- Sha Feng (Pacific Northwest National Laboratory)
- Shiheng Duan (Lawrence Livermore National Laboratory)
- Shiva Madadkhani (Technical University of Munich)
- Shuaiqi Wu (University of California, Davis)
- Shuting Zhai (Nori.com)
- Simon Barnasch (Vattenfall)
- Simon van Lierde (Leiden University)
- Simone Fobi (Microsoft)
- Somya Sharma Chatterjee (University of Minnesota)
- Soumyendu Sarkar (Hewlett Packard Enterprise)
- Spyros Chatzivasileiadis (DTU - Technical University of Denmark)
- Stephane Hallegatte (The World Bank)
- Stephen Haben (Energy Systems Catapult)
- Sungduk Yu (Intel Labs)
- Susana Rodríguez Buriticá (Instituto de Investigación de Recursos Biológicos, Alexander von Humboldt)
- Talia Speaker (WILDLABS)
- Theo Wolf (Oxford University)
- Thijs van der Plas (Alan Turing Institute)
- Tianyu Zhang (Mila)
- Tim Engleman (ETH Zurich)
- Tom Denton (Google DeepMind)
- Tse-Chun Chen (Pacific Northwest National Laboratory)
- Xiaodong Chen (Pacific Northwest National Laboratory)
- Xiaoli Zhou (Dalhousie University)
- Xiaoming Bill Shi (Hong Kong University of Science and Technology)
- Ye Liu (Pacific Northwest National Laboratory)
- Ying-Jung Chen (Georgia Institute of Technology)
- Yogeesh Beeharry (University of Mauritius)
- Yue Dong (UCLA)
- Yue Li (UCLA)
- Yuyan Chen (McGill University and Mila)
- Zane Selvans (Catalyst Cooperative)
- Zikri Bayraktar (Schlumberger Doll Research)
- Zoltan Nagy (The University of Texas at Austin)