Data Gaps (Beta)

Artificial intelligence (AI) and machine learning (ML) offer a powerful suite of tools to accelerate climate change mitigation and adaptation across different sectors. However, the lack of high-quality, easily accessible, and standardized data often hinders the impactful use of AI/ML for climate change applications.

In this project, Climate Change AI, with the support of Google DeepMind, aims to identify and catalog critical data gaps that impede AI/ML applications in addressing climate change, and lay out pathways for filling these gaps. In particular, we identify candidate improvements to existing datasets, as well as "wishes" for new datasets whose creation would enable specific ML-for-climate use cases. We hope that researchers, practitioners, data providers, funders, policymakers, and others will join the effort to address these critical data gaps.

This project is currently in its beta phase, with ongoing improvements to content and usability. We encourage you to provide input and contributions via the routes listed below, or by emailing us at datagaps@climatechange.ai. We are grateful to the many stakeholders and interviewees who have already provided input.

Analysis of grid reliability events

Renewables introduce variability into the grid because their power output fluctuates rapidly. These fluctuations can trigger the safety monitoring systems that protect grid stability. Power grid control centers receive multiple streams of semi-structured data from these systems (e.g., alarms, sensors, and field reports), arriving at high volume. For operators, it can be overwhelming to rationalize, reduce, and contextualize these alarm triggers and associated data in order to diagnose grid conditions. ML can assist in interpreting these data to better understand the sequence of events leading up to an incident and to detect and identify the causes of system disturbances affecting grid reliability.

Dataset | Data Gap Summary
EPRI10: Transmission control center alarm and operational data set

Access to EPRI10 grid alarm data is currently restricted to EPRI. Usability gaps stem from redundancies in grid alarm codes, which require significant preprocessing and analysis of code IDs, alarm priority, location, and timestamps. Alarm codes can vary by sensor, asset, and line. Actions taken in response to alarm events need field verification to determine whether a fault actually occurred.
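The kind of preprocessing described above might begin by collapsing repeated alarms. The following is a minimal sketch only: the field names (`code_id`, `location`, `priority`, `timestamp`) and the 60-second window are hypothetical and do not reflect the actual EPRI10 schema.

```python
from datetime import datetime, timedelta

def collapse_repeated_alarms(alarms, window_seconds=60):
    """Keep an alarm only if the same (code_id, location) pair has not
    been kept within the preceding `window_seconds`."""
    window = timedelta(seconds=window_seconds)
    last_kept = {}  # (code_id, location) -> timestamp of the last kept alarm
    kept = []
    for alarm in sorted(alarms, key=lambda a: a["timestamp"]):
        key = (alarm["code_id"], alarm["location"])
        previous = last_kept.get(key)
        if previous is None or alarm["timestamp"] - previous > window:
            kept.append(alarm)
            last_kept[key] = alarm["timestamp"]
    return kept

# Made-up alarm records for illustration.
t0 = datetime(2024, 1, 1, 12, 0, 0)
example_alarms = [
    {"code_id": "A17", "location": "SUB-1", "priority": 2, "timestamp": t0},
    {"code_id": "A17", "location": "SUB-1", "priority": 2, "timestamp": t0 + timedelta(seconds=10)},
    {"code_id": "B03", "location": "SUB-1", "priority": 1, "timestamp": t0 + timedelta(seconds=20)},
    {"code_id": "A17", "location": "SUB-1", "priority": 2, "timestamp": t0 + timedelta(seconds=120)},
]
reduced = collapse_repeated_alarms(example_alarms)
```

A fixed time window is only one possible rule; real alarm streams may need asset-specific or priority-specific handling.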

Assessing forest restoration outcomes

Efforts are being made to restore ecosystems like forests and mangroves. ML can be used to monitor biodiversity changes before and after restoration efforts, in order to quantify their effectiveness and outcomes.

Dataset | Data Gap Summary
Bioacoustic recordings

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Camera trap images

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Drone images for biodiversity

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Assessment of climate impacts on public health

Climate change has major implications for public health. ML can help analyze the relationships between climate variables and health outcomes to assess how changes in climate conditions affect public health.

Dataset | Data Gap Summary
Health data

The biggest issue with health data is that access to it is limited and restricted.

Historical climate observations

Processing climate data and integrating it with health data are major challenges.

Automatic individual re-identification for wildlife

Individual re-identification in wildlife refers to the process of recognizing and confirming the identity of an individual animal during subsequent encounters. It is crucial for identifying and monitoring endangered species, in order to better understand their needs and threats and to aid conservation efforts. Computer vision-based ML techniques are widely used for automatic individual identification.

Dataset | Data Gap Summary
Camera trap images

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Drone images for biodiversity

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

eDNA

One data gap is the incompleteness of barcoding reference databases.

Bias-correction of climate projections

Climate projections provide essential information about future climate conditions, guiding mitigation and adaptation efforts such as disaster risk assessment and power grid optimization. ML can improve the accuracy of these projections by bias-correcting the output of physics-based climate models (e.g., CMIP6). It does so by learning the relationship between historical climate simulations (e.g., CMIP6 data) and observed ground-truth data (such as ERA5 or weather station observations).
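As a rough illustration of learning a simulation-to-observation relationship, the sketch below fits a least-squares linear correction on a historical period and applies it to later simulated values. This is a deliberately simple stand-in for the ML models mentioned above, and all numbers are synthetic, not real CMIP6 or ERA5 data.

```python
def fit_linear_correction(simulated, observed):
    """Least-squares fit of observed ≈ slope * simulated + intercept."""
    n = len(simulated)
    mean_s = sum(simulated) / n
    mean_o = sum(observed) / n
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(simulated, observed))
    var = sum((s - mean_s) ** 2 for s in simulated)
    slope = cov / var
    intercept = mean_o - slope * mean_s
    return slope, intercept

def apply_correction(values, slope, intercept):
    """Apply the fitted correction to new simulated values."""
    return [slope * v + intercept for v in values]

# Synthetic illustrative temperatures in degrees C.
historical_sim = [10.0, 12.0, 14.0, 16.0]
historical_obs = [11.5, 13.0, 14.5, 16.0]
slope, intercept = fit_linear_correction(historical_sim, historical_obs)

future_sim = [18.0, 20.0]
future_corrected = apply_correction(future_sim, slope, intercept)
```

A correction fitted this way assumes the historical relationship still holds in the future period, which is itself a source of uncertainty.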

Dataset | Data Gap Summary
CMIP6

Large uncertainty in future climate projections is a major problem with CMIP6. In addition, the large volume of data and the lack of uniform structure (such as inconsistent variable names, data formats, and resolutions across different CMIP6 models) make it challenging to use data from multiple models effectively.
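One way to picture the naming problem is a small alias table that maps differing field names onto a single canonical name before data from several models is combined. The sketch below uses plain dictionaries; the alias entries and the two example records are hypothetical and would need to be replaced by whatever names actually appear in the model outputs being used.

```python
# Hypothetical alias table: maps names that might appear in different
# model outputs onto one canonical name. Extend it for real datasets.
CANONICAL_NAMES = {
    "latitude": "lat",
    "nav_lat": "lat",
    "longitude": "lon",
    "nav_lon": "lon",
}

def harmonize_record(record, aliases=CANONICAL_NAMES):
    """Return a copy of `record` with aliased keys renamed."""
    harmonized = {}
    for key, value in record.items():
        canonical = aliases.get(key, key)
        if canonical in harmonized:
            raise ValueError(f"duplicate field after renaming: {canonical}")
        harmonized[canonical] = value
    return harmonized

# Made-up records standing in for output from two different models.
model_a = {"lat": [0.0, 1.0], "lon": [10.0, 11.0], "tas": [288.1, 287.9]}
model_b = {"latitude": [0.0, 1.0], "longitude": [10.0, 11.0], "tas": [288.4, 288.0]}
harmonized = [harmonize_record(m) for m in (model_a, model_b)]
```

Renaming addresses only one of the inconsistencies listed above; differing formats and resolutions would need separate handling, such as regridding.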

ERA5

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.

Weather station data in general

Data is not regularly gridded and needs to be preprocessed before being used in an ML model.
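One common preprocessing option is to interpolate the scattered station values onto a regular grid. The sketch below uses inverse distance weighting with planar distances; the station coordinates and values are made up, and real applications would typically need great-circle distances, elevation handling, and quality control.

```python
import math

def idw_value(x, y, stations, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).
    `stations` is a list of (x, y, value) tuples."""
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in stations:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # grid point coincides with a station
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

def grid_stations(stations, xs, ys, power=2.0):
    """Interpolate onto the regular grid defined by `xs` and `ys`.
    Returns rows indexed by y and columns indexed by x."""
    return [[idw_value(x, y, stations, power) for x in xs] for y in ys]

# Two hypothetical stations on a line, gridded at three points.
stations = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
grid = grid_stations(stations, xs=[0.0, 1.0, 2.0], ys=[0.0])
```

The midpoint between the two stations receives the average of their values because both are equally distant from it.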

Bias-correction of weather forecasts

ML can be used to improve the fidelity of high-impact weather forecasts by post-processing the outputs of physics-based numerical forecast models and learning to correct their systematic biases.

Dataset | Data Gap Summary
ENS

As with HRES, the biggest challenge with ENS is that only a portion of it is available to the public for free.

HRES

The biggest challenge with using HRES data is that only a portion of it is available to the public for free.

Weather station data in general

Data is not regularly gridded and needs to be preprocessed before being used in an ML model.

Data-driven generation of climate simulations

Generating climate simulations by running physics-based climate models is time-consuming. ML can be used to more quickly generate climate simulations corresponding to different greenhouse gas emissions scenarios. Specifically, ML can be used to learn a surrogate model that approximates the computationally intensive climate simulations generated by Earth system models.

Dataset | Data Gap Summary
ClimateBench v1.0

The dataset currently includes simulations from only one model. To enhance accuracy and reliability, it is important to include simulations from multiple models.

CMIP6

The large data volume and lack of uniform structure (no consistent variable names, data structure, or data resolution across models) make it difficult to use data from more than one CMIP6 model.

Detection of climate-induced ecosystem changes

Climate change is inducing significant changes in ecosystems. ML can be used to assess the impact of climate change on biodiversity and identify critical areas for conservation.

Dataset | Data Gap Summary
Bioacoustic recordings

The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.

Camera trap images

The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.

Ground survey of land use and land management

Data access is restricted due to institutional barriers and other restrictions.

Historical climate observations

For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.

Development of hybrid-climate models

Physics-based climate models incorporate numerous complex components, such as radiative transfer, subgrid-scale cloud processes, deep convection, and subsurface ocean eddy dynamics. These components are computationally intensive, which limits the spatial resolution achievable in climate simulations. ML models can emulate these physical processes, providing a more efficient alternative to traditional methods. By integrating ML-based emulations into climate models, we can achieve faster simulations and enhanced model performance.

Dataset | Data Gap Summary
ClimSim

An ML-ready benchmark dataset designed for hybrid ML-physics research, e.g. emulation of subgrid clouds and convection processes.

DYAMOND (DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains)

Intercomparison of global storm-resolving (5 km or less) model simulations; used as the target of the emulator.

ERA5

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.

Large-eddy simulations

Extremely high-resolution simulations, such as large-eddy simulations, are needed. By explicitly resolving processes that are not resolved in current climate models, these simulations may serve as better ground truth for training machine learning models that emulate the physical processes and offer a more accurate basis for understanding and predicting climate phenomena.

Regularly gridded high-resolution atmospheric observations

An enhanced version of ERA5 with higher resolution and fidelity is needed. 

Digital reconstruction of the environment

Modeling digital representations of environmental conditions and habitats using remote sensing data, such as satellite images, is crucial for understanding how environmental factors impact animal behavior and conservation efforts. This approach provides valuable insights into habitat conditions and changes, which are essential for effective wildlife conservation and management. ML can enhance this process by efficiently processing large volumes of data from various sources, leading to more detailed and accurate environmental reconstructions.

Dataset | Data Gap Summary
Drone images for biodiversity

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

eDNA

One data gap is the incompleteness of barcoding reference databases.

Satellite Images

Satellite images provide environmental information for habitat monitoring. Combined with other data, e.g. bioacoustic data, they have been used to model and predict species distribution, richness, and interaction with the environment. High-resolution images are needed but most of them are not open to the public for free.

Disaster risk assessment

As climate change progresses, extreme weather events and related hazards are expected to become more frequent and severe. To effectively address these challenges, robust disaster risk assessment and management are crucial. ML can be used within these efforts to analyze satellite imagery and geographic data in order to pinpoint vulnerable areas and produce comprehensive risk maps.

Dataset | Data Gap Summary
Building footprint

More information, such as age of the building, should be included in the dataset.

Exposure data

Accessibility and reliability are major issues.

Financial loss datasets related to the impacts of disasters

The financial loss data is usually proprietary and not open to the public.

Hazard data

The resolution of current hazard data is not sufficient for effective physical risk assessment.

OpenStreetMap

OpenStreetMap lacks metadata on when infrastructure (e.g., a building) was built. This information is needed to determine building age, which in turn characterizes exposure to hazards.

Socioeconomic data

The availability, usability, and reliability of socioeconomic data are all problematic. In general, there is a notable scarcity of data from the Global South, while data at a more granular scale is usually missing for the Global North. When data does exist, it lacks consistency across multiple sources.

Surface elevation data

Very high-resolution reference data, such as digital elevation models (DEMs), are currently not freely available to the public.

Distribution-side hosting capacity estimation

Historically, the power grid has been designed for unidirectional flow from carbon-based generating sources to consumers. However, in the effort to lower greenhouse gas emissions, the integration of renewable generation has become increasingly important in all parts of the grid (e.g., transmission and distribution), from large-scale generation farms to consumer-level rooftop solar and community wind turbine installations. This transition requires restructuring the grid from a unidirectional to a bidirectional energy network, which stresses pre-existing systems, especially at the low-voltage distribution level.

Because renewable generation is intermittent, its integration at the low-voltage consumer level depends on the hosting capacity of the nearest substation feeder circuit. The hosting capacity is the amount of generation from distributed energy resources (DERs) that a circuit can safely accommodate without setting off safety equipment. Safety equipment can be triggered when generation exceeds consumption, leading to overvoltage conditions, or when sudden peaks in demand draw high current, leading to voltage sags. Faults may also lead to voltage sags. Operationally, distribution-level substation feeders must withstand these conditions to ensure power quality.

Traditional methods of assessing the hosting capacity of low-voltage distribution networks involve power flow analysis simulations, which can be computationally expensive and difficult to perform under real-time operating conditions for large distribution circuits. For example, to analyze a particular feeder circuit, scenarios must be built by varying loads, DER generation, environmental conditions, power equipment availability, and human activity. Violations must then be identified with respect to voltage limits, thermal loads, and protection equipment to estimate hosting capacity.

Machine learning models can serve as surrogates for traditional models by capturing the spatio-temporal patterns in multiple streams of data for each node in the distribution network, enabling real-time estimation. Additionally, reinforcement learning can enable accelerated scenario building and online evaluation of control strategies. One such strategy, for example, may use inverter technology to modulate generation to match the larger power system’s needs and protect it from faults and overloads.
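The scenario-and-violation procedure can be illustrated with a toy calculation. The sketch below assumes a single radial line at unity power factor and uses the common per-unit approximation that end-of-feeder voltage changes by roughly R times the net injected power. The resistance, load scenarios, and voltage limits are made-up values, and thermal and protection limits are ignored. It is not a substitute for a full power flow study.

```python
def end_voltage_pu(p_der, p_load, r_pu, v_source=1.0):
    """Approximate end-of-feeder voltage in per unit:
    V_end ≈ V_source + R * (P_der - P_load)."""
    return v_source + r_pu * (p_der - p_load)

def estimate_hosting_capacity(load_scenarios, r_pu, v_min=0.95, v_max=1.05,
                              step=0.01, max_der=5.0, tol=1e-9):
    """Largest DER injection (per unit, in multiples of `step`) for which
    every load scenario keeps the voltage within [v_min, v_max]."""
    feasible = []
    n_steps = int(round(max_der / step))
    for i in range(n_steps + 1):
        p_der = i * step
        within_limits = all(
            v_min - tol <= end_voltage_pu(p_der, load, r_pu) <= v_max + tol
            for load in load_scenarios
        )
        if within_limits:
            feasible.append(p_der)
    return max(feasible) if feasible else 0.0

# Hypothetical light, medium, and heavy load scenarios (per unit).
capacity = estimate_hosting_capacity(load_scenarios=[0.2, 0.6, 1.0], r_pu=0.05)
```

In this toy setup the lightest load scenario sets the overvoltage limit, which is why hosting capacity studies pay particular attention to minimum-load conditions.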

Dataset | Data Gap Summary
Distribution system simulators

While OpenDSS and GridLAB-D are free to use as alternatives when real distribution substation feeder circuit data is unavailable, site-specific or scenario studies may need data from other sources to verify simulation results. Actual hosting capacity may differ from simulation results due to differences in load, environmental conditions, and the level of DER penetration.

Early detection of fire

Climate change is expected to increase both the frequency and intensity of wildfires, as well as lengthen the fire season due to rising temperatures and shifting precipitation patterns. ML can play a crucial role in wildfire detection and monitoring by synthesizing data from various sources in order to provide more timely and precise information. For instance, ML algorithms can analyze satellite imagery from different regions to detect early signs of fires and track their progression. Additionally, ML can enhance automatic fire detection systems, improving their accuracy and responsiveness.

Dataset | Data Gap Summary
Drone images

Thermal images captured by drones have high value but the cost of high-resolution sensors is high.

Energy data fusion for policy and market analysis in energy systems

Data collected by energy regulatory committees from public utilities, energy companies, and government agencies can provide detailed information on generation, fuel consumption, emissions, and finances. This information can better inform domestic policies that enforce and promote emissions reductions through carbon pricing and renewable incentives, grid modernization and resilience planning for severe weather events, and equitable energy transitions. With continuously updated, well-curated, analysis-ready energy system data, climate advocates will have better quantitative tools to influence political and administrative processes, thereby encouraging the energy transition.

Dataset | Data Gap Summary
The Public Utility Data Liberation (PUDL)

Public datasets from government agencies such as the EIA, EPA, FERC, and PHMSA are not analysis-ready. Data is often distributed as zipped tabular files in different formats that may not share the common identifiers or schemas needed to join them readily. Collecting, collating, and merging these datasets can provide greater context on the state of the energy system and the effectiveness of policy measures. Data can also be missing because of reporting gaps and redacted per-plant pricing information. While PUDL seeks to overcome these gaps by merging datasets through entity matching and interpolation, maintenance challenges remain, as usability can be sensitive to format changes, updates, and new initiatives in the original source data. The data gaps experienced in maintaining this dataset will be highlighted with respect to the source data that PUDL mines.
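To give a sense of what entity matching involves when two tables lack a shared identifier, the sketch below links records by fuzzy name similarity using Python's standard `difflib` module. It is an illustrative assumption, not a description of PUDL's actual pipeline; the plant names and the 0.8 similarity cutoff are made up.

```python
import difflib

def normalize(name):
    """Lowercase and drop punctuation so trivial differences do not matter."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())

def match_entities(left_names, right_names, cutoff=0.8):
    """Map each name in `left_names` to its closest name in `right_names`,
    or to None when nothing scores at least `cutoff`."""
    normalized_right = {normalize(name): name for name in right_names}
    matches = {}
    for name in left_names:
        candidates = difflib.get_close_matches(
            normalize(name), list(normalized_right), n=1, cutoff=cutoff
        )
        matches[name] = normalized_right[candidates[0]] if candidates else None
    return matches

# Made-up plant names standing in for tables from two agencies.
table_a = ["Riverbend Gen. Station", "Oak Hill Solar"]
table_b = ["RIVERBEND GEN STATION", "Pine Ridge Wind"]
links = match_entities(table_a, table_b)
```

Name similarity alone can produce false matches, so in practice it is usually combined with other attributes such as location or capacity and followed by manual review.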

Energy-efficient new building design

The built environment contributes significantly to global carbon dioxide emissions, both through the embodied carbon associated with building materials and through operational emissions associated with thermal comfort, ventilation, and lighting. Detailed analysis is often applied too late in the building design process, leaving significant energy-saving potential untapped. Integrating building performance simulation (BPS) in the initial phase can be critical to sustainable and energy-efficient design, influencing subsequent construction as well as the overall building lifecycle. However, traditional BPS relies on complex physics models of fluid dynamics, thermodynamics, sunlight, and acoustics, which increases the computational complexity and processing time needed to evaluate a candidate design. Machine learning models can significantly enhance evaluation by emulating BPS based on synthetic and real-world data, enabling rapid prototyping and optimization of building topology along multiple comfort, consumption, and environmental objectives. Machine learning can also be introduced at the prototyping phase in response to evaluation, with layouts refined using generative models and genetic algorithms.

Dataset | Data Gap Summary
Benchmark datasets of building environmental conditions and occupancy

The types of data gaps vary across the featured datasets, depending on content, coverage area, location, building type, building spatial plan, quantity measured, ambient environment, and the availability of power consumption or metered data.

Computational fluid dynamics simulation

Despite their usefulness in ventilation studies for new construction, CFD simulations are computationally expensive, making them difficult to include in the early phase of the design process, when building morphology can be optimized to reduce the future operational consumption associated with lighting, heating, and cooling. Simulations require accurate input information on material properties that may not be present in traditional urban building types. Interpreting model output requires domain knowledge, and the large volumes of synthetic data generated for different wind directions can become challenging to manage. Future data collection to verify simulation output can benefit surrogate or proxy approaches that replace computationally expensive solutions of the Navier-Stokes equations. Coverage is also often restricted to modern building approaches, leaving passive building techniques from indigenous communities, known as vernacular architecture, out of design consideration.

Residential daylight performance metric (DPM) data

Daylight performance metrics (DPMs) have been developed by building researchers and architects, based on daylight access simulation output, to quantify the illumination of indoor spaces by natural light. While DPM evaluation is an important step in the planning of commercial buildings, residential buildings do not receive similar attention, which is unusual given that most new building construction occurs within the residential sector. The data gaps described here concern residential DPMs, which lack metrics associated with direct sunlight access, rely on annual averages for seasons, and use fixed occupancy schedules that are overly simplified for residential spaces. Additionally, the illuminance metrics and thresholds used in commercial spaces do not translate well to residential spaces, where people may prefer higher or lower illuminances depending on their location and lifestyles. Lastly, DPM optimization is based on operational metrics and assumptions about illumination in a space and its effects on the thermal comfort and operational consumption of traditional urban residential spaces. Vernacular architecture, which is specific to a local region and culture, may not share these objectives, instead favoring more indoor-outdoor transitional spaces, earthen materials, and less focus on windows and incident natural sunlight.

Estimation of forest carbon stock

Forests are one of the Earth’s major carbon sinks, absorbing carbon dioxide (CO₂) from the atmosphere through photosynthesis and storing it in biomass (trees and vegetation) and soil. Accurate estimates of carbon stock help quantify the amount of CO₂ forests are sequestering, which is essential for climate change mitigation efforts. ML can help by providing more precise and large-scale estimates of forest carbon through the analysis of satellite imagery. This approach can significantly improve upon traditional, labor-intensive forest inventory surveys, making carbon stock assessments more efficient and scalable.

Dataset | Data Gap Summary
GEDI lidar

There is uncertainty in the data.

Ground-survey based forest inventory data

The data is manually collected and recorded, resulting in errors, missing values, and duplicates. Additionally, it is limited in coverage and collection frequency.
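A first cleaning pass over such records might look like the sketch below, which drops entries with missing values, implausible measurements, or duplicate identifiers. The field names (`plot_id`, `tree_id`, `dbh_cm`) and the example records are hypothetical, and a real workflow would likely flag suspect records for review rather than silently discard them.

```python
def clean_inventory(records, required=("plot_id", "tree_id", "dbh_cm")):
    """Drop records with missing required fields, non-positive diameters,
    or a (plot_id, tree_id) pair that was already seen."""
    seen = set()
    cleaned = []
    for record in records:
        if any(record.get(field) is None for field in required):
            continue  # missing value
        if record["dbh_cm"] <= 0:
            continue  # physically implausible entry
        key = (record["plot_id"], record["tree_id"])
        if key in seen:
            continue  # duplicate entry
        seen.add(key)
        cleaned.append(record)
    return cleaned

# Made-up survey rows illustrating each problem type.
raw = [
    {"plot_id": "P1", "tree_id": 1, "dbh_cm": 23.5},
    {"plot_id": "P1", "tree_id": 1, "dbh_cm": 23.5},   # duplicate
    {"plot_id": "P1", "tree_id": 2, "dbh_cm": None},   # missing value
    {"plot_id": "P2", "tree_id": 1, "dbh_cm": -4.0},   # recording error
    {"plot_id": "P2", "tree_id": 2, "dbh_cm": 41.0},
]
cleaned = clean_inventory(raw)
```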

Estimation of methane emissions from rice paddies

Rice paddies are a major source of global anthropogenic methane emissions. Accurate quantification of CH₄ emissions, especially how they vary with different agricultural practices, is crucial for addressing climate change. ML can enhance methane emission estimation by automatically processing and analyzing remote-sensing data, leading to more efficient assessments.

Dataset | Data Gap Summary
Direct measurement of methane emission of rice paddies

There is a lack of direct observations of methane emissions from rice paddies.

Extreme heat prediction

Extreme heat is becoming more common in a changing climate, but predicting and accurately modeling extreme heat is difficult. ML can help by improving extreme heat prediction.

Dataset | Data Gap Summary
NEX-GDDP-CMIP6

The major challenge is handling the size of the data.

Fault detection in low voltage distribution grids

The low-voltage distribution portion of the grid directly supplies power to consumers. As consumers integrate more distributed energy resources (DERs) and dynamic loads (such as electric vehicles), low-voltage distribution systems become susceptible to power quality issues that can affect the stability and reliability of the grid. Fault-inducing harmonics can be challenging to monitor, diagnose, and control due to the number of nodes/buses that connect various grid assets and the short distances between them. Traditional fault detection and localization use impedance-based or traveling-wave methods. Both methods assess deviations between two points with respect to line-specific thresholds, and they work well when faults have low fault resistance and networks have a limited number of branches. As low-voltage distribution network topologies grow increasingly complex, line parameters can vary, making it increasingly difficult for traditional methods to accurately diagnose and isolate faults. Machine learning methods can overcome these limitations, as they can be trained on large amounts of data, extract relevant features, and recognize patterns to automate fault diagnosis independently of specific line thresholds and topologies. If integrated into advanced monitoring systems, fault detection and localization can accelerate adaptive protection and network reconfiguration efforts to ensure reliability and stability.

Dataset | Data Gap Summary
Micro-synchrophasors (µPMU data)

For effective fault localization using µPMU data, it is crucial that distribution circuit models supplied by utility partners or Distribution System Operators (DSOs) are accurate, though they often lack detailed phase identification and impedance values, leading to imprecise fault localization and time series contextualization. Noise and errors from transformers can degrade µPMU measurement accuracy. Integrating additional sensor data affects data quality and management due to high sampling rates and substantial data volumes, necessitating automated analysis strategies. Cross-verifying µPMU time series data, especially around fault occurrences, with other data sources is essential due to the continuous nature of µPMU data capture. Additionally, µPMU installations can be costly, limiting deployments to pilot projects over a small network. Furthermore, latencies in the monitoring system due to communication and computational delays challenge real-time data processing and analysis.

Identification and mapping of climate policy

Laws and regulations relevant to climate change mitigation and adaptation are essential for assessing progress on climate action and addressing various research and practical questions. ML can be employed to identify climate-related policies and categorize them according to different focus areas.

Dataset | Data Gap Summary
Academic literature databases

Data is not available in machine-readable formats and is limited to English-language literature from major journals.

Climate-related laws and regulations

Laws and regulations for climate action are published in various formats by national and subnational governments, and most are not labeled as “climate policy”. A number of initiatives take on the challenge of selecting, aggregating, and structuring these laws to provide a better overview of the global policy landscape. This, however, requires a great deal of work and continual updating, and the resulting datasets are not complete.

Improving battery management systems

With the shift from carbon-based generation to renewables, energy storage becomes crucial to counter the intermittent availability of renewable energy. Battery efficiency and lifetime have a direct impact on the effectiveness of transportation electrification. Machine learning can be a valuable tool for improving operational efficiency by estimating state of charge (SoC), state of health (SoH), and remaining useful life (RUL). Techniques such as reinforcement learning can optimize charge/discharge strategies for battery management systems (BMS). ML can also process large real-world datasets that may contain battery health parameters, charge/discharge measurements, and load demand. If the load is a vehicle, the vehicle type and driving behavior may also be available.
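For context on what SoC estimation computes, the sketch below implements coulomb counting, a conventional non-ML baseline that integrates measured current over time. It is included only to make the quantity concrete; the cell capacity, current, and sampling interval are hypothetical, and the method ignores sensor drift, temperature, and aging, which is part of what data-driven estimators aim to address.

```python
def coulomb_count_soc(initial_soc, currents_a, dt_s, capacity_ah):
    """Track state of charge by integrating current over time.
    Positive current means discharge. Returns the SoC after each sample,
    clipped to the range [0, 1]."""
    soc = initial_soc
    history = []
    for current in currents_a:
        # Charge removed in this interval, converted from A*s to Ah.
        soc -= (current * dt_s / 3600.0) / capacity_ah
        soc = min(1.0, max(0.0, soc))
        history.append(soc)
    return history

# Hypothetical 50 Ah cell discharged at 25 A, sampled every 6 minutes.
soc_trace = coulomb_count_soc(
    initial_soc=1.0, currents_a=[25.0] * 4, dt_s=360.0, capacity_ah=50.0
)
```

Each 6-minute sample at 25 A removes 2.5 Ah, or 5% of the assumed capacity.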

Dataset | Data Gap Summary
Improving power grid optimization

Traditionally, optimal power flow (OPF) seeks to minimize the cost of power generation needed to meet a given load (economic dispatch), subject to line limits (thermal, voltage, or stability) and generation limits, while maintaining power balance at each bus in the transmission system. Traditional techniques formulate OPF as a non-linear, constrained, non-convex optimization problem, which can be solved for AC and DC systems separately. Traditional OPF solvers use a linear program to determine the generation needed to minimize cost and satisfy load demand while adhering to the physical constraints of the system. However, as the grid integrates more renewable generation sources, there is a trend towards hybrid AC/DC power grids, driven by the limitations of traditional AC transmission systems and the desire to access remote renewables. Such hybrid systems present new challenges to traditional OPF by enabling bidirectional power flow, which requires adapting the OPF objective function and constraints to account for new losses, increased costs, and congestion. ML can be used to approximate OPF problems, allowing them to be solved at greater speed, scale, and fidelity.
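The economic dispatch part of the problem can be shown in its simplest form. When generation costs are linear and line limits and losses are ignored, the cost-minimizing linear program reduces to merit-order dispatch: generators are filled from cheapest to most expensive until demand is met. The sketch below implements only that simplified case, with a made-up three-generator fleet; it is not a full OPF solver.

```python
def economic_dispatch(generators, demand_mw):
    """Merit-order dispatch: fill demand from the cheapest generator up.
    `generators` is a list of (name, capacity_mw, cost_per_mwh) tuples.
    Line limits and losses are ignored."""
    remaining = demand_mw
    dispatch = {}
    total_cost = 0.0
    for name, capacity, cost in sorted(generators, key=lambda g: g[2]):
        output = min(capacity, remaining)
        dispatch[name] = output
        total_cost += output * cost
        remaining -= output
    if remaining > 1e-9:
        raise ValueError("demand exceeds total generation capacity")
    return dispatch, total_cost

# Hypothetical fleet: (name, capacity in MW, marginal cost per MWh).
fleet = [("gas", 100.0, 60.0), ("wind", 50.0, 0.0), ("coal", 80.0, 40.0)]
dispatch, total_cost = economic_dispatch(fleet, demand_mw=150.0)
```

Adding network constraints breaks this simple ordering, which is where the full OPF formulation described above becomes necessary.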

Dataset | Data Gap Summary
Grid2Op and PandaPower

Grid2Op is a reinforcement learning framework that builds an environment based on topologies, selected grid observations, a selected reward function, and selected actions for an agent to choose from. The framework relies on control laws rather than direct system observations, which are subject to multiple constraints and changing load demand. Time steps representing 5 minutes are unable to capture complex transients and can limit the effectiveness of certain actions within the action space relative to others. Furthermore, customizing Grid2Op can be challenging, as the platform does not allow for single-agent to multi-agent conversion and is not a suitable environment for cascading failure scenarios due to its game-over rules.

Optimal power flow simulators

Traditional OPF simulation software may require the purchase of licenses for advanced features and functionalities. To simulate more complex systems or regions, additional data regarding energy infrastructure, region-specific load demand, and renewable generation may be needed to conduct studies. OPF simulation output would require verification and performance evaluation to assess results in practice. Increasing the granularity of the simulation model by increasing the number of buses, limits, or additional parameters increases the complexity of the OPF problem, thereby increasing the computational time and resources required.

Power Grid Lib: Optimal power flow benchmark library

While the network datasets are open source, maintaining the repository requires continuous curation and the collection of more complex benchmark data to enable diverse AC-OPF simulation and scenario studies. Industry engagement can assist in developing more realistic data, though such data may be hard to obtain without a cooperative effort.

Marine wildlife detection and species classification

Marine wildlife detection and species classification are crucial for understanding the impacts of climate change on marine ecosystems. These processes involve identifying and categorizing different marine species. ML can significantly enhance these efforts by automatically processing large volumes of data from diverse sources, improving accuracy and efficiency in monitoring and analyzing marine life.

Dataset | Data Gap Summary
Copernicus Marine Data Store

Data downloading is a bottleneck because it requires familiarity with APIs, which not all users possess.

FathomNet

The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.

Ocean biodiversity data

As with terrestrial biodiversity data, the lack of well-annotated data is the biggest bottleneck. Regarding existing data, enabling broader data sharing is the most critical challenge to address. Data collection efforts should also be strategic, targeting places where biodiversity is high but currently available data is sparse.

Sofar spotter archive

Data access is restricted.

Modeling effects of soil processes on soil organic carbon

Understanding the causal relationship between soil organic carbon and soil management or farming practices is crucial for enhancing agricultural productivity and evaluating agriculture-based climate mitigation strategies. ML can significantly contribute to this understanding by integrating data from diverse sources to provide more precise spatial and temporal analyses.

Dataset | Data Gap Summary
Emission dataset compiled from FAO statistics

Data is extrapolated from national-level statistics. Its accuracy at local scales is unknown.

Simulated variables from process-based models

Data collection is extremely expensive for some variables, leading to the use of simulated variables. Unfortunately, simulated values have large uncertainties due to the assumptions and simplifications made within simulation models.

Soil Survey Geographic Database (SSURGO)

In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity).

Non-intrusive electricity load monitoring

Non-intrusive load monitoring (NILM) is a strategy to disaggregate the total electricity consumption profile of a building into individual appliance load profiles. This strategy can provide insight into individual consumer behavior for the purposes of real-time electricity pricing, can help target customers who may be due for an appliance upgrade, and can enable building energy management systems (EMS) to enact demand response strategies such as load shifting for sheddable or curtailable loads. These strategies can foster energy efficiency, reduce peaks in electricity demand, and help increase the utilization of low-carbon power by enabling better supply/demand matching, thereby fostering grid decarbonization and maintaining grid stability.
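
As a rough illustration of what disaggregation means, the sketch below uses a simple event-based heuristic: it looks for step changes in the aggregate power signal and matches each step to the closest known appliance rating. This is a toy baseline, not an ML method from this catalog; the appliance names, power ratings, and tolerance are hypothetical.

```python
def detect_appliance_events(aggregate_w, signatures_w, tolerance_w=50.0):
    """Match step changes in aggregate power to appliance signatures.

    aggregate_w: building-level power readings in watts.
    signatures_w: dict mapping appliance name to its rated power (W).
    Returns a list of (time_index, appliance, "on" or "off") tuples.
    """
    events = []
    for t in range(1, len(aggregate_w)):
        delta = aggregate_w[t] - aggregate_w[t - 1]
        # Find the appliance whose rating is closest to the step size.
        best = min(signatures_w, key=lambda n: abs(abs(delta) - signatures_w[n]))
        if abs(abs(delta) - signatures_w[best]) <= tolerance_w:
            events.append((t, best, "on" if delta > 0 else "off"))
    return events


# Hypothetical readings: a kettle and a fridge switching on and off.
readings = [100, 100, 2100, 2100, 2250, 250, 250, 100]
signatures = {"kettle": 2000.0, "fridge": 150.0}
events = detect_appliance_events(readings, signatures)
```

Heuristics like this break down when appliances overlap, have variable power draw, or when local generation masks the load, which is why sub-metered ground-truth data is needed to train and benchmark learned models.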

Dataset | Data Gap Summary
Pecan Street

Pecan Street DataPort requires both academic and non-academic users to purchase access through licensing, which varies depending on the building data features requested. Coverage is concentrated primarily in the Mueller planned housing community in Austin, Texas, a modern built environment that is not representative of older, historical buildings that may need energy-efficiency upgrades and retrofits. For customer segmentation studies and consumer-in-the-loop load consumption modeling, the annual socio-demographic survey data may be too coarse to provide insight into how the behavior of household members affects consumption profiles over time.

Sub-metered appliance-level data

For accurate NILM studies, benchmark datasets need to include not only consumption but also local power generation (e.g., from rooftop solar), as it can affect the aggregate load observed at the building level. While some datasets may include generation information, most studies do not take rooftop solar generation into account. Additionally, devices that can behave as both a load and a generator, such as electric vehicles or stationary batteries, are also not included. The majority of building types are single-family housing units, which limits the diversity of representation. Furthermore, most datasets are no longer maintained once the study closes.

Offshore wind power forecasting: Long-term (3 hours-1 year)

Long-term wind forecasting can allow for resource assessment studies for offshore energy production, wind resource mapping, and wind farm modeling.

Dataset | Data Gap Summary
Floating INfrastructure for Ocean observations FINO3

Due to their offshore location, FINO platform measurement sensors are prone to failure under adverse conditions such as high wind and high waves. When sensors fail, manual recalibration or repair is nontrivial, since weather conditions must permit human intervention. This can degrade data quality for periods lasting from several weeks to a season. Coverage is also constrained to the platform and the associated wind farm location.

Offshore wind meteorological data and LiDAR wind mapping

Spatiotemporal coverage of the offshore meteorological and wind speed platform data is restricted to the dimensions of the platform itself as well as the time of construction. Depending on the data provider, access to the data may require signing a non-disclosure agreement.

Offshore wind power forecasting: Short-term (10 min)

Short-term wind forecasting can enable estimation of active power generated by wind farms in the absence of curtailment.

Dataset | Data Gap Summary
Orsted: Offshore wind SCADA operation data

Data can be obtained by requesting access via the Orsted form. The sufficiency of the dataset is constrained by volume, as data exists for only a limited number of offshore wind farms. Expanding the coverage area and volume of the data, and refining its time granularity to under 10 minutes, may enable transient detection from generated active power.

Post-disaster damage assessment

Post-disaster evaluations are crucial for identifying vulnerabilities exposed by climate-related events, which is essential for enhancing resilience and informing climate adaptation strategies. ML can help by rapidly identifying and quantifying damage, such as structural collapse or vegetation loss, thereby improving response and recovery efforts.

Dataset | Data Gap Summary
Financial loss datasets related to the impacts of disasters

Data is proprietary and not open to the public.

Satellite Images

The resolution of publicly available datasets is insufficient for accurate damage assessments. To improve this, some commercial high-resolution images should be made accessible for research purposes.

xBD

Data is highly biased towards North America. Similar datasets focusing on other parts of the world are needed. Additionally, the dataset should include more detailed information on the severity of the damage.

Short-term electricity load forecasting

Short-term load forecasting (STLF) is critical for utilities to balance demand with supply. Utilities need accurate forecasts (e.g., on the scale of hours, days, or weeks, up to a month) to plan, schedule, and dispatch energy while decreasing costs and avoiding service interruptions. Furthermore, for grids that are partly privatized, utilities rely on forecasts to procure (i.e., source and purchase) energy to meet demand. In peak conditions where loads have been underestimated, utilities have limited options. One option is to use reserve capacity, or additional electric supply, to ensure reliable power to customers. This usually entails dispatching expensive, fossil-fuel-dependent peaker plants in city centers to meet immediate demand over short distances. Another option is for the utility to initiate an outage to clip peaks. In the worst case, grid assets can be overloaded, resulting in system failure and unplanned blackouts. Because STLF relies on historical electricity load data, weather forecasts, the time of day, week, or month, and continuous streams of advanced metering infrastructure (AMI) data, machine learning models are well suited to the task: they can handle large amounts of data and capture non-linearities that traditional linear models may struggle with.

Dataset | Data Gap Summary
Advanced metering infrastructure data

AMI data is challenging to obtain without pilot study partnerships with utilities, since collecting data on individual building consumer behavior can infringe upon customer privacy, especially at the residential level. The granularity of the time series data can also vary depending on the level of access (for example, whether the data is aggregated and anonymized) and on the resolution of the readings and metering system. Additionally, data coverage is limited to utility pilot test service areas, which restricts the scope and scale of demand studies.

Building data genome project

The Building Data Genome Project 2 compiles building data from public open datasets along with privately curated building data specific to universities and higher education institutions. While the dataset has clear documentation, its lack of diversity in building types calls for further development to include other types of buildings, as well as expansion to coverage areas and time periods beyond those currently available.

Faraday: Synthetic smart meter data

Faraday synthetic AMI data is a response to the bottlenecks faced in retrieving building-level demand data due to consumer privacy. However, despite the model being trained on actual AMI data, synthetic data generation requires a priori knowledge of building type, efficiency, and the presence of low-carbon technology. Furthermore, since the model is trained on UK building data, the AMI time series it generates may not accurately represent load demand in regions outside the UK. Finally, since the data is synthetically generated, studies will require validation and verification using real data or data aggregated at the substation level to assess its effectiveness. Faraday is currently accessible through the Centre for Net Zero’s API.

Smart inverter management for distributed energy resources

Distributed energy resources (DERs) such as solar photovoltaics and energy storage systems are part of low-inertia power systems that do not rely on traditional rotating components. These DERs rely on distributed inverters to convert power from DC to AC, which are typically configured to operate at unity power factor. As an alternative to unity power factor, inverters can be “smart” by dynamically managing the effects of intermittency before feeding power back to feeder circuits at the distribution substation level. Smart inverters can perform Volt-VAR (Voltage-VAR) and Volt-Watt (Voltage-Watt) operations, which involve adjusting the reactive and active power output of the inverter, respectively, in response to the measured voltage in order to maintain grid stability. In other words, the DER inverter is controlled to dynamically adjust reactive power injection back into the grid. This is crucial for preventing voltage sags and swells that can occur due to the integration of DERs into the grid.
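
The Volt-VAR behavior described above is commonly expressed as a piecewise-linear curve that maps measured voltage to a reactive power setpoint. The sketch below is one such curve with a deadband around nominal voltage; the breakpoints and the reactive power limit are illustrative assumptions, not settings taken from this catalog or from any specific device.

```python
def volt_var(v_pu, v1=0.92, v2=0.98, v3=1.02, v4=1.08, q_max=0.44):
    """Reactive power setpoint (per unit) for a measured voltage (per unit).

    Positive values inject reactive power to raise a low voltage;
    negative values absorb reactive power to lower a high voltage.
    All breakpoints and q_max are illustrative assumptions.
    """
    if v_pu <= v1:
        return q_max  # Full injection at very low voltage.
    if v_pu < v2:
        # Linear ramp from q_max at v1 down to zero at v2.
        return q_max * (v2 - v_pu) / (v2 - v1)
    if v_pu <= v3:
        return 0.0  # Deadband around nominal voltage.
    if v_pu < v4:
        # Linear ramp from zero at v3 down to -q_max at v4.
        return -q_max * (v_pu - v3) / (v4 - v3)
    return -q_max  # Full absorption at very high voltage.


# Setpoints for a sag, nominal voltage, and a swell.
setpoints = [volt_var(v) for v in (0.95, 1.00, 1.05)]
```

A learned controller would typically tune or replace such a fixed curve, which is why operational inverter data and realistic simulation tools matter for this use case.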

Dataset | Data Gap Summary
Simulation tools for distribution connected inverter systems

There is a need to enhance existing simulation tools to study inverter-based power systems rather than traditional machine-based ones. Simulations should be able to represent a large number of distribution-connected inverters and to incorporate DER models into the larger power system. Uncertainty surrounding model outputs will require verification and testing. Furthermore, access to simulation and hardware-in-the-loop facilities and systems requires submitting a user access proposal, as is the case for NREL’s Energy Systems Integration Facility. Similar testing laboratories may require access requests and funding.

Smart inverter (UL1741-SB compliant) devices database

Smart inverter operational data is not publicly available and requires partnerships with research labs, utilities, and smart inverter manufacturers. However, the California Energy Commission maintains a database of UL 1741-SB compliant smart inverter models and their manufacturers, who can then be contacted for research partnerships. In terms of coverage area, while California and Hawaii are now moving towards standardizing smart inverter technology in their power systems, researchers in regions outside of the United States may need to locate similar manufacturers through partnerships and collaborations.

Solar installation site assessment

Statistical analysis on solar PV system components for pricing, logistics, planning, and site capacity studies is an important part of the process for siting solar PV systems. Spatiotemporal generation forecasting using pre-existing site data can be used to inform future site recommendations, policy, and decision making with respect to new developments.

Dataset | Data Gap Summary
LBNL: Solar panel PV system dataset

The LBNL solar panel PV system dataset excludes third-party-owned systems and systems with battery backup. Since the data is self-reported, component cost data may be imprecise. The dataset also has a timeliness gap, as some of the data is historical and may not reflect current pricing and costs of PV systems.

US large-scale Solar Photovoltaic Database (USPVDB)

The USPVDB data must be accessed through the United States Geological Survey (USGS) mapper browser application or downloaded as GIS data in the form of shapefiles or GeoJSON files. Tabular data and metadata are provided in CSV and XML formats. Coverage is limited to the US, specifically to densely populated regions. Supplementing the data with international large-scale photovoltaic satellite imagery could expand the coverage area of the dataset.

Solar power forecasting: Long-term (>24 hours)

Longer-term solar forecasts are beneficial for energy market pricing, investment decisions, and integration with other renewable energy sources such as hydroelectric plants to allow for larger scale coordination and grid operational studies. Additionally, inclusion of energy storage systems to harvest solar energy on longer time scales can be better aligned with longer term demand forecasting and predicted solar peaks.

Dataset | Data Gap Summary
NREL solar power data for integration studies

While the synthetic PV plant data is beneficial for forecasting and control simulation case studies when actual data is not available, there are limitations with respect to verification for site-specific projects, representation of coverage areas outside of the US, and modeling assumptions based on data proxies, all of which have to be taken into account when interpreting results.

Solar power forecasting: Medium-term (6-24 hours)

Medium-term solar forecasts can be beneficial for simulation case studies in demand response, microgrid behavior, electricity markets, and solar site planning.

Dataset | Data Gap Summary
Satellite remote sensing data

Depending on the region of interest, data can be retrieved from different open-data satellites, both geostationary and swath-based, which may differ in spatial and temporal resolution and in coverage area. Additionally, multispectral data may pose challenges in preprocessing and preparation for analysis. Specifically for medium-term solar forecasting, actual ground irradiance may differ from approximations made by models that use satellite-derived cloud cover products, because different cloud types can have different impacts on irradiance. Suggested solutions are supplementation with ground-based measurements for verification and improvements in granularity.

Solar power forecasting: Short-term (30 min-6 hours)

Hourly site-specific solar forecasting can assist with solar energy estimates based on measured irradiance, photovoltaic inverter output energy, and turbine level output. Forecasting at this level can prove beneficial for joint distributed energy resource and energy storage microgrid scheduling studies, and system reliability studies.

Dataset | Data Gap Summary
NOAA's SOLRAD network

While NOAA’s SOLRAD is an excellent data source for long-term solar irradiance and climate studies, data gaps exist for the short-term solar forecasting use case (which requires hourly averages). The data quality of hourly averages is lower than that of native-resolution data, which impairs short-term forecasting for real-time energy management for grid stability, demand response, real-time market price predictions, and dispatch. Coverage is also constrained to certain parts of the United States based on the SURFRAD network location.

NREL Solar Radiation Database (NSRDB)

While data coverage is global and based on satellite imagery used as input to the Fast All-sky Radiation Model (FARM), a radiative transfer model, the output is calculated over specific time frames and would need to be recalculated and updated for more recent periods. Furthermore, the data is unbalanced, as the region with the longest temporal coverage is the United States. Satellite-based estimation of solar resource information may be affected by cloud cover, snow, and bright surfaces, which would require additional verification from ground-based measurements and collation of outside data sources. Additionally, since the data is derived from satellites, it may require preprocessing to account for parallax effects in particular regions, depending on the field of view of the covering satellite and the region of interest; these effects may not be reflected in the higher-level FARM tabular products.

NREL Solar Radiation Research Laboratory (SRRL): Baseline Measurement System (BMS)

While NREL’s SRRL BMS provides real-time joint variable data from ground-based sensors, coverage is restricted to the sensor network in Golden, CO, in the United States. Since the measurement system is composed of diverse sensors, individual sensors may malfunction or drift out of calibration, requiring human intervention and maintenance once the problem is detected. Detection may be delayed, leading to inaccuracies in the data.

PV Anlage-Reinhart system

Accessing the PV Anlage-Reinhart system information, which SMA collates and compiles together with PV inverter data, requires creating a user profile and requesting access to specific systems. The platform may lack clear instructions in languages other than German, and systems located in Germany, the Netherlands, and Australia are more heavily represented, despite the presence of data globally. Furthermore, a subset of the systems contains joint energy storage data, which may be valuable for DER-specific load forecasting studies.

SOLETE

While SOLETE is advantageous for joint wind-solar DER forecasting in inverter-level generation studies, the dataset can be improved by addressing several gaps in data sufficiency, namely by expanding the temporal coverage to include seasonal variations, which may be addressed with additional outside data or simulation. Outside data or simulation may also help scale studies to multiple generation sources (more than one PV array and wind turbine) and the coordination between them needed to maintain grid reliability and stability. Additionally, a data wish for SOLETE is the addition of maintenance schedules or system downtime data to model system dynamics with DERs more realistically.

Solar power forecasting: Very-short-term (0-30 min)

Very-short-term solar power forecasting is critical for time series irradiance forecasting and solar ramp event identification. Solar irradiance ramp events can be defined as sudden changes in solar irradiance within a short time interval. These events are often caused by transient clouds that can lead to abrupt fluctuations in the incoming solar energy. Cloud analysis using cloud segmentation and classification as a proxy for solar irradiance attenuation can assist in determining solar generation for photovoltaics and concentrated solar power towers. Solar generation predictions are important for real-time electricity market and pricing studies, real-time dispatch of other generating sources, and energy storage control studies.
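
The ramp event definition above can be turned into a simple detector: flag any pair of samples a fixed number of steps apart whose irradiance differs by at least a threshold. This is a minimal sketch; the window length, threshold, and sample values are hypothetical, and practical definitions often normalize by clear-sky irradiance.

```python
def find_ramp_events(irradiance_wm2, window, threshold_wm2):
    """Flag sudden irradiance changes within a short interval.

    irradiance_wm2: evenly sampled irradiance readings (W/m^2).
    window: number of samples spanning the interval of interest.
    threshold_wm2: minimum absolute change that counts as a ramp.
    Returns a list of (start_index, end_index, change) tuples.
    """
    events = []
    for start in range(len(irradiance_wm2) - window):
        end = start + window
        change = irradiance_wm2[end] - irradiance_wm2[start]
        if abs(change) >= threshold_wm2:
            events.append((start, end, change))
    return events


# Hypothetical readings: a passing cloud causes a down-ramp and an up-ramp.
series = [800, 790, 400, 410, 780, 800]
ramps = find_ramp_events(series, window=1, threshold_wm2=300)
```

Labels produced this way are only as good as the temporal resolution of the underlying data, which is why the sampling intervals of the datasets below matter for this use case.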

Dataset | Data Gap Summary
DOE Atmospheric Radiation Measurement (ARM) research facility data products

The ARM dataset includes data from various DOE sites, with sensor information from sun-tracking photometers, radiometers, and spectrometers that is helpful in understanding hyperspectral solar irradiance and cloud dynamics. ARM sites generate large datasets, which can be challenging to store, stream, analyze, and archive; the data may also be sensitive to sensor noise and require further measurement verification, especially with respect to aerosol composition. Additionally, ARM data coverage is limited to ARM sites, motivating future collaboration with partner networks to enhance observational spatial coverage.

NIST campus photovoltaic (PV) arrays and weather station data sets

Data coverage is limited to the NIST campus in Gaithersburg, MD, and the dataset has not been maintained since July 2017.

Solcast

Data from Solcast is accessible via an academic or research institution. Solcast uses coarse surface elevation models aligned with reanalysis data, leading to significant elevation differences between ground data sites and cell height. Although it is a global dataset, coverage is limited to 33 sites, with 18 in tropical/subtropical locations and 15 in temperate locations. Time granularity ranges from 5 to 60 minutes.

SRRL TSI-880 sky imager gallery

Data coverage and granularity are limited by the location of the cameras and constrained to 10-minute increments. Resolution is also limited to 352x288 24-bit JPEG images (see device specifications).

SWINySEG (Singapore whole sky Nychthemeron image SEGmentation database)

There is a need for annotated sky image data for cloud detection and segmentation, in order to improve local and PV site-specific irradiance predictions. The data is ultimately constrained to the coverage area of Singapore, and its commercial use is restricted.

Terrestrial wildlife detection and species classification

Terrestrial wildlife detection and species classification are essential for understanding the impacts of climate change on terrestrial ecosystems. Similarly to marine wildlife studies, ML can greatly improve these efforts by automatically processing large volumes of data from diverse sources, enhancing the accuracy and efficiency of monitoring and analyzing terrestrial species.

Dataset | Data Gap Summary
Bioacoustic recordings

The foremost challenge of bioacoustic data is its sheer volume, which makes sharing especially difficult due to limited storage options and high costs. Cheaper and more reliable data hosting and sharing platforms are urgently needed.

Additionally, there is a significant shortage of large and diverse annotated datasets, one much more severe than for image data such as camera trap, drone, and crowd-sourced images.

Camera trap images

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Community science data

The main challenge with community science data is its lack of diversity. Data tends to be concentrated in accessible areas and primarily focuses on charismatic or commonly encountered species.

Drone images for biodiversity

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

eDNA

One data gap is the incompleteness of barcoding reference databases.

GBIF

While GBIF provides a common standard, the accuracy of species classification in the data can vary, and classifications may change over time.

Satellite Images

Some commercial high-resolution satellite images can also be used to identify large animals such as whales, but those images are not open to the public.

Variability analysis of wind power generation

The shift from high-inertia generation sources such as thermal plants to low-inertia, inverter-coupled generation from distributed energy resources introduces new stability and reliability issues. It is imperative to maintain the frequency of the system at a nominal level to prevent damage, instability, and blackouts. Wind turbines can benefit the grid by providing a combination of synthetic inertial response and primary frequency response.

Dataset | Data Gap Summary
Simulation tools for active power control by wind

To gain access, particularly to NREL’s FESTIV model, permission must be requested. Since FESTIV is a simulation model, it may not account for all real-time system dynamics and complexities, so its outputs require validation and verification against real-world data. Furthermore, since the granularity of the model is hourly, it may not be able to account for very short-term impacts, frequencies, and reactive power flows that can affect power system stability.

Weather forecasting: Near-term (< 24 hours)

Near-term weather forecasting (< 24 hours ahead) of temperature, precipitation, and other variables at km-level spatial and minute-level temporal resolution, in an accurate and computationally efficient manner, has implications for many climate change mitigation and adaptation applications. ML can help provide more accurate near-term weather forecasts.

Dataset | Data Gap Summary
Automated Surface Observing Systems (ASOS)

Data volume is large and only data specific to the US is available.

High-resolution weather forecast (HRRR)

Data volume is large, and only data covering the US is available.

Radar data (MRMS)

Obtaining and integrating radar data from various sources is challenging.

Regularly gridded high-resolution atmospheric observations

An enhanced version of ERA5 with higher granularity and fidelity is needed. Many of the surface observations and remote sensing data needed to develop such a dataset already exist.

Weather forecasting: Short-to-medium term (1-14 days)

Weather forecasting at 1-14 days ahead has implications for real-time response and planning applications within both climate change mitigation and adaptation. ML can help improve short-to-medium-term weather forecasts.

Dataset | Data Gap Summary
ENS

The biggest challenge of ENS is that only a portion of it is available to the public for free.

ERA5

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.

HRES

The biggest challenge of HRES is that only a portion of it is available to the public for free.

WeatherBench 2

WeatherBench 2 is based on ERA5, so it inherits the issues of ERA5; in particular, the data is biased over regions where there are no observations.

Weather forecasting: Subseasonal horizon

High-fidelity weather forecasts at subseasonal to seasonal scales (3-4 weeks ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.

Dataset | Data Gap Summary
ERA5

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.

SubX

More data is needed to develop more accurate and robust ML models. It is also important to note that SubX data contains biases and uncertainties, which can be inherited by ML models trained on it.

Weather forecasting: Subseasonal-to-seasonal horizon

High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.

DatasetData Gap Summary
CPC Precipitation

CPC Global Unified gauge-based analysis of daily precipitation https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html

S2S forecast data

More data is needed to take advantage of large ML models.

Wildfire prediction: Short-term (3-7 days)

Wildfires are becoming more frequent and severe due to climate change. Accurate early prediction of these events enables timely evacuations and resource allocation, thereby reducing risks to lives and property. ML can enhance fire prediction by providing more precise forecasts quickly and efficiently.

DatasetData Gap Summary
Active fire data

A major data gap is the absence of active fire data in the afternoon (1-5 pm), when most fires ignite.

ERA5

ERA5 is now the most widely used reanalysis dataset thanks to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of data is another common challenge for users of this dataset.

ESA land cover map

Yearly land cover classification gridded map at 300-m resolution from 1992 to present produced by European Space Agency (ESA) Climate Change Initiative (CCI) https://catalogue.ceda.ac.uk/uuid/b382ebe6679d44b8b0e68ea4ef4b701c.

Higher resolution land cover maps (at 10-m resolution) are also available for years 2020 and 2021 (https://esa-worldcover.org/en).

For fire prediction, this provides fine-grained information on available fuel.

Socioeconomic data

Socioeconomic data, e.g., on human behavior, are significant predictors of fire. Beyond the inherent challenges and gaps of socioeconomic data, aggregating these datasets and harmonizing them spatially with other fire predictors is especially tricky.
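One common harmonization step is to aggregate point-level records onto the same grid as the other predictors. The following is a minimal sketch of that idea, not a recommended pipeline: the record layout (latitude, longitude, value), the 0.25-degree resolution, and the function names are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def grid_cell(lat, lon, res=0.25):
    """Return the (row, col) index of the grid cell containing a point."""
    return (math.floor((lat + 90.0) / res), math.floor((lon + 180.0) / res))

def aggregate_to_grid(records, res=0.25):
    """Average record values per grid cell.

    records: iterable of (lat, lon, value) tuples (hypothetical layout).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lat, lon, value in records:
        cell = grid_cell(lat, lon, res)
        sums[cell] += value
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}

# Two points fall in the same 0.25-degree cell, one in a different cell.
records = [(37.10, -120.10, 10.0), (37.20, -120.05, 30.0), (40.00, -105.00, 5.0)]
gridded = aggregate_to_grid(records)
```

Averaging is only one possible choice; counts or sums may be more appropriate for variables such as population.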

Dataset Gap Types Modalities Sectors
Academic literature databases

Academic literature databases, such as OpenAlex, Web of Science, and Scopus.

Use CaseData Gap Summary
Identification and mapping of climate policy

Data is not available in machine-readable formats and is limited to English-language literature from major journals.

Active fire data

Active fire data derived from images taken by satellite instruments such as MODIS, VIIRS, and Landsat. The products differ in spatial resolution and temporal coverage. Data can be downloaded here: https://firms.modaps.eosdis.nasa.gov/active_fire.

Use CaseData Gap Summary
Wildfire prediction: Short-term (3-7 days)

A major data gap is the absence of active fire data in the afternoon (1-5 pm), when most fires ignite.

Advanced metering infrastructure data

Advanced Metering Infrastructure (AMI) facilitates communication between utilities and customers through smart meter device systems which collect, store, and analyze per building energy consumption.

AMI data can be retrieved through public data portals, individual data collection, or research partnerships with local utilities. Examples of utility research partnerships include the Irvine Smart Grid Demonstration (ISGD) project conducted by Southern California Edison (SCE) and the smart meter pilot test from the Sacramento Municipal Utility District. An example of publicly available data that is aggregated and anonymized is the Commission for Energy Regulation (CER) Smart Metering Project hosted by the Irish Social Science Data Archive (ISSDA).

Use CaseData Gap Summary
Short-term electricity load forecasting

AMI data is challenging to obtain without pilot study partnerships with utilities, since collecting data on individual building consumer behavior can infringe upon customer privacy, especially at the residential level. The granularity of the time series can also vary with the level of access, i.e., whether the data is aggregated and anonymized, and with the resolution of the meter readings and system. Additionally, coverage is limited to utility pilot test service areas, restricting the scope and scale of demand studies.
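To illustrate how aggregation changes granularity, the sketch below reduces sub-hourly per-meter readings to hourly totals across meters, dropping meter identifiers along the way. The record layout (meter ID, ISO timestamp, kWh) and the function name are assumptions for illustration; actual utility anonymization procedures may differ.

```python
from collections import defaultdict
from datetime import datetime

def hourly_aggregate(readings):
    """Sum kWh across all meters for each hour.

    readings: iterable of (meter_id, iso_timestamp, kwh) tuples
    (a hypothetical layout). Meter IDs are dropped in the output,
    so only the aggregate load per hour remains.
    """
    totals = defaultdict(float)
    for _meter_id, ts, kwh in readings:
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        totals[hour] += kwh
    return dict(sorted(totals.items()))

readings = [
    ("m1", "2024-01-01T00:00:00", 0.5),
    ("m1", "2024-01-01T00:15:00", 0.25),
    ("m2", "2024-01-01T00:30:00", 1.0),
    ("m2", "2024-01-01T01:00:00", 0.75),
]
hourly = hourly_aggregate(readings)
```

Once aggregated this way, per-building behavior can no longer be recovered, which is the trade-off between privacy and granularity described above.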

Automatic surface observation (ASOS)

1-minute observations from Automated Surface Observing System (ASOS) stations: https://madis.ncep.noaa.gov/madis_OMO.shtml

Use CaseData Gap Summary
Weather forecasting: Near-term (< 24 hours)

The data volume is large, and coverage is limited to the US.

Benchmark datasets for short-term wildfire prediction

Benchmark datasets for wildfire prediction are standardized collections of data that include historical and real-time wildfire occurrences, remote sensing imagery, fuel information, and meteorological data. These datasets provide a common framework for training, validating, and testing machine learning models. By integrating various modalities and sources of data, benchmark datasets simplify the process of data collection, integration, and preprocessing, ensuring consistency and efficiency in developing and evaluating wildfire prediction models.

Use CaseData Gap Summary
Benchmark datasets of building environmental conditions and occupancy

The US Office of Energy Efficiency and Renewable Energy hosts 15 building datasets for 10 states covering 7 climate zones and 11 different building types. The data covers indoor air quality, occupancy, environment, HVAC, lighting, and energy consumption, to name a few. Datasets are organized by name and points of contact.

All data featured on the platform is open access with standardization on metadata format to allow for ease of use and information specific to buildings based on type, location, and climate zone. Data quality and guidance on curation and cleaning in addition to access restrictions are specified in the metadata of each hosted dataset. Licensing information for each individual featured dataset is provided.

Use CaseData Gap Summary
Energy-efficient new building design

Datasets featured can vary in types of data gaps depending on the content, coverage area, location, building type, building spatial plan, quantity measured, ambient environment, or power consumption or metered data availability.

Bioacoustic recordings

Passive acoustic recording provides continuous monitoring of both the environment and the species.

In general, there is a lack of large, robust, and diverse annotated datasets. Some such datasets are hosted at https://arbimon.org/, www.macaulaylibrary.org, and www.xeno-canto.org.

Use CaseData Gap Summary
Assessing forest restoration outcomes

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Detection of climate-induced ecosystem changes

The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.

Terrestrial wildlife detection and species classification

The foremost challenge of bioacoustic data is its sheer volume, which makes sharing especially difficult due to limited storage options and high costs. Cheaper and more reliable data hosting and sharing platforms are urgently needed.

Additionally, there is a significant shortage of large and diverse annotated datasets, far more severe than for image data such as camera trap, drone, and crowd-sourced images.

Building data genome project

The Building Data Genome Project 2 dataset contains hourly whole-building data from 3,053 energy meters across 1,636 non-residential buildings, covering two years' worth of metered electricity, water, and solar data, along with metadata on primary building use category, floor area, time zone, weather, and smart meter type. The goal of the dataset is to allow for the development of generalizable building models for energy efficiency analysis studies.

Use CaseData Gap Summary
Short-term electricity load forecasting

The Building Data Genome Project 2 compiles building data from public open datasets along with privately curated building data specific to universities and other higher education institutions. While the dataset is clearly documented, its lack of diversity in building types calls for further development to include other types of buildings and to expand coverage to areas and time periods beyond those currently available.

Building footprint

The foundation for characterizing exposure involves understanding where people live and in what conditions. Building footprint data serves as a crucial layer in this context, offering detailed attributes of buildings, such as their age, materials, heights, rooftop material, and basement features. Notable sources of such data include OpenStreetMap, USGS Building Footprint, Google Open Buildings, Microsoft Building Footprints, and Meta open building footprint data.

Use CaseData Gap Summary
Disaster risk assessment

More information, such as age of the building, should be included in the dataset.

CMIP6

Climate simulations from a consortium of state-of-the-art climate models. Data can be found here.

Use CaseData Gap Summary
Bias-correction of climate projections

The large uncertainty in future climate projections is a major problem with CMIP6. The large data volume and the lack of uniform structure (such as inconsistent variable names, data formats, and resolutions across different CMIP6 models) make it challenging to use data from multiple models effectively.

Data-driven generation of climate simulations

The large data volume and lack of uniform structure (no consistent variable names, data structure, or data resolution across all models) make it difficult to use data from more than one CMIP6 model.
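A typical first step when combining output from several models is to map model-specific variable names onto a shared vocabulary. The sketch below shows the idea with plain dictionaries; the alias table and the common names are hypothetical and chosen only for illustration, and a real workflow would also need to reconcile grids, units, and calendars.

```python
# Hypothetical alias table: the names below are illustrative only.
ALIASES = {
    "tas": "temperature_2m",
    "t2m": "temperature_2m",
    "pr": "precipitation",
    "precip": "precipitation",
}

def harmonize(model_output):
    """Rename known variables to common names; keep unknown ones unchanged."""
    harmonized = {}
    for name, values in model_output.items():
        common = ALIASES.get(name, name)
        if common in harmonized:
            raise ValueError(f"duplicate variable after renaming: {common}")
        harmonized[common] = values
    return harmonized

# Two made-up model outputs that use different names for the same quantities.
model_a = {"tas": [280.1, 281.3], "pr": [1.2e-5, 0.0]}
model_b = {"t2m": [279.8, 280.9], "precip": [1.0e-5, 2.0e-6]}
a, b = harmonize(model_a), harmonize(model_b)
```

After harmonization both outputs expose the same keys, so downstream code can treat them uniformly.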

CPC Precipitation

CPC Global Unified gauge-based analysis of daily precipitation https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html

Use CaseData Gap Summary
Weather forecasting: Subseasonal-to-seasonal horizon

High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.

Cable inspection robot data

Cable inspection robot LiDAR data is beneficial for Specific Power Line (SPL) partitions, which include dampers, insulators, broken strands, and attachments that may have degraded due to exposure to natural elements. Specific Fitting Detection partition data focuses on assessing risk at the lowest part of the power line, near trees, roofs, and other power lines that may cross. Since the robots physically crawl along the lines, their data is useful for detecting degradation of high-voltage transmission lines, scheduling maintenance, and detecting obstructions at the lower levels of the power line.

Use CaseData Gap Summary
Grid asset management: Assessing vegetation-related wildfire risk

Using grid inspection robot imagery may require coordination with local utilities to gain access over multiple robot trips, image preprocessing to remove ambient artifacts, and position and location calibration. The identification of degradation patterns is also limited by the resolution of the robot-mounted camera.

Camera trap images

Camera traps are likely the most widely used sensors in automated biodiversity monitoring due to their low cost and simple installation. This medium offers close-range monitoring over long time scales. The image sequences can be used not only to classify species but also to identify specifics about the individual, e.g., sex, age, health, behavior, and predator-prey interactions. Camera trap data has been used to estimate species occurrence, richness, distribution, and density.

In general, the raw images from camera traps need to be annotated before they can be used to train ML models. Some of the available annotated camera trap images are shared via Wildlife Insights (www.wildlifeinsights.org) and LILA BC (www.lila.science), while others are listed on GBIF (https://www.gbif.org/dataset/search?q=). However, the majority of camera trap data is likely scattered across individual research labs or organizations and not publicly available. Sharing such images could make significant progress towards filling the gaps associated with the lack of annotated data that currently hinders the efficient use of ML in biodiversity studies. This is what initiatives like Wildlife Insights are looking to do.

Use CaseData Gap Summary
Assessing forest restoration outcomes

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Automatic individual re-identification for wildlife

The scarcity of publicly available, well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Detection of climate-induced ecosystem changes

The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.

Terrestrial wildlife detection and species classification

The scarcity of publicly available, well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Carbon stock estimate

The ESA aboveground biomass (AGB) estimate is the most up-to-date public dataset on AGB.

Use CaseData Gap Summary
Changes in marine ecosystems

Annual data on changes (e.g. extent) in marine ecosystems such as mangroves, seagrasses, salt marshes, and wetlands due to various factors including coastal erosion, aquaculture, and others.

Use CaseData Gap Summary
ClimSim

An ML-ready benchmark dataset designed for hybrid ML-physics research, e.g. emulation of subgrid clouds and convection processes.

Use CaseData Gap Summary
Development of hybrid-climate models

Physics-based climate models incorporate numerous complex components, such as radiative transfer, subgrid-scale cloud processes, deep convection, and subsurface ocean eddy dynamics. These components are computationally intensive, which limits the spatial resolution achievable in climate simulations. ML models can emulate these physical processes, providing a more efficient alternative to traditional methods. By integrating ML-based emulations into climate models, we can achieve faster simulations and enhanced model performance.

ClimateBench v1.0

A benchmark dataset derived from a full-complexity Earth System Model (NorESM2, a participant in CMIP6) for emulation of key climate variables: https://zenodo.org/records/7064308.

Use CaseData Gap Summary
Data-driven generation of climate simulations

The dataset currently includes simulations from only one model. To enhance accuracy and reliability, it is important to include simulations from multiple models.

Community science data

Images and recordings contributed by citizen scientists and volunteers represent another significant source of biodiversity and ecosystem data. Crowdsourcing platforms, such as iNaturalist, eBird, Zooniverse, and Wildbook, facilitate the sharing of community science data. Many of these platforms also serve as hubs for collating and annotating datasets.

Use CaseData Gap Summary
Terrestrial wildlife detection and species classification

The main challenge with community science data is its lack of diversity. Data tends to be concentrated in accessible areas and primarily focuses on charismatic or commonly encountered species.

Computational fluid dynamics simulation

Computational fluid dynamics (CFD) simulation output is a means of assessing natural ventilation for new building construction in relation to layout geometry, terrain, the presence of neighboring buildings and infrastructure, and materials. Multi-directional CFD simulations are often run to account for seasonal variation in wind. Given the building geometry, terrain, neighboring buildings, and boundary conditions, the Navier-Stokes or Reynolds-Averaged Navier-Stokes equations can be solved over a lattice or grid superimposed on the layout.

Use CaseData Gap Summary
Energy-efficient new building design

Despite their usefulness in ventilation studies for new construction, CFD simulations are computationally expensive, making them difficult to include in the early phase of the design process, when building morphology can be optimized to reduce future operational consumption associated with lighting, heating, and cooling. Simulations require accurate input on material properties, which may not be available for traditional urban building types. Interpreting model output requires domain knowledge, and the large volumes of synthetic data generated for different wind directions become challenging to manage. Future data collection for verifying simulation output could benefit surrogate or proxy approaches to the computationally expensive Navier-Stokes equations. Coverage is also often restricted to modern building approaches, leaving passive building techniques from indigenous communities, known as vernacular architecture, out of design consideration.

Copernicus Marine Data Store

Free-of-charge, state-of-the-art data on the state of the Blue (physical), White (sea ice), and Green (biogeochemical) ocean, on a global and regional scale: https://data.marine.copernicus.eu/products

Use CaseData Gap Summary
Marine wildlife detection and species classification

Data downloading is a bottleneck because it requires familiarity with APIs, which not all users possess.

DOE Atmospheric Radiation Measurement (ARM) research facility data products

ARM represents data from various field measurement programs sponsored by the US Department of Energy with a focus on ground-based pyrheliometer and spectrometer data which is useful for solar radiation time series forecasting and solar potential assessment.

Use CaseData Gap Summary
Solar power forecasting: Very-short-term (0-30min)

The ARM dataset includes data from various DOE sites, with sensor information from sun-tracking photometers, radiometers, and spectrometers that is helpful in understanding hyperspectral solar irradiance and cloud dynamics. ARM sites generate large datasets that can be challenging to store, stream, analyze, and archive; the data may be sensitive to sensor noise and require further measurement verification, especially with respect to aerosol composition. Additionally, ARM data coverage is limited to ARM sites, motivating future collaboration with partner networks to enhance observational spatial coverage.

DYAMOND (DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains)

Intercomparison of global storm-resolving (5 km or finer) model simulations; used as the target of the emulator. Data can be found here.

Use CaseData Gap Summary
Development of hybrid-climate models

Physics-based climate models incorporate numerous complex components, such as radiative transfer, subgrid-scale cloud processes, deep convection, and subsurface ocean eddy dynamics. These components are computationally intensive, which limits the spatial resolution achievable in climate simulations. ML models can emulate these physical processes, providing a more efficient alternative to traditional methods. By integrating ML-based emulations into climate models, we can achieve faster simulations and enhanced model performance.

Direct measurement of methane emission of rice paddies

Direct measurement of methane emissions from rice paddies, using instruments and sampling systems placed in the paddies to measure methane concentrations in the air above the fields or in the soil.

Use CaseData Gap Summary
Estimation of methane emissions from rice paddies

There is a lack of direct observation of methane emissions from rice paddies.

Distribution system simulators

Distribution system simulators such as OpenDSS and GridLAB-D are crucial for understanding the hosting capacity of distribution-level substation feeders because they allow for the analysis of various factors that can affect the stability and reliability of the power grid. These factors include voltage limits, thermal capability, control parameters, and fault current, among others. By simulating different scenarios and conditions, such as the integration of distributed energy resources (DERs) like photovoltaic (PV) solar panels, these tools can provide insights into how the grid can be optimized to accommodate these resources without compromising safety and reliability. OpenDSS is free to use as an alternative when real circuit feeder data from distribution utilities is unavailable.

Use CaseData Gap Summary
Distribution-side hosting capacity estimation

While OpenDSS and GridLAB-D are free to use as an alternative when real distribution substation circuit feeder data is unavailable, site-specific or scenario studies may need data from other sources to verify simulation results. Actual hosting capacity may differ from simulation results due to differences in load, environmental conditions, and the level of DER penetration.

Drone images

Drone images have revolutionized various fields by providing high-resolution, aerial perspectives that were previously difficult to obtain. Equipped with advanced cameras and sensors, drones capture detailed visual data from above, offering insights into landscapes, infrastructure, and environmental changes.

Use CaseData Gap Summary
Early detection of fire

Thermal images captured by drones have high value but the cost of high-resolution sensors is high.

Drone images for biodiversity

Like camera traps, drone images can offer high-resolution and relatively close-range images for species identification, individual identification, and environment reconstruction. As with camera traps, most drone images are scattered across disparate sources. Some such data is hosted on www.lila.science.

Use CaseData Gap Summary
Assessing forest restoration outcomes

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Automatic individual re-identification for wildlife

The scarcity of publicly available, well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Digital reconstruction of the environment

The scarcity of publicly available, well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Terrestrial wildlife detection and species classification

The scarcity of publicly available, well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

ENS

Ensemble forecast up to 15 days ahead, generated by ECMWF numerical weather prediction model; used as a benchmark/baseline for evaluating ML-based weather forecasts. Data can be found here.

Use CaseData Gap Summary
Bias-correction of weather forecasts

As with HRES, the biggest challenge of ENS is that only a portion of it is available to the public for free.

Weather forecasting: Short-to-medium term (1-14 days)

The biggest challenge of ENS is that only a portion of it is available to the public for free.

EPRI10: Transmission control center alarm and operational data set

Supervisory Control and Data Acquisition (SCADA) systems collect data from sensors throughout the power grid. Alarm operational data, a portion of the data received by SCADA, provides discrete event-based information on the status of protection and monitoring devices in a tabular format which includes semi-structured text descriptions of individual alarm events. Often the data is formatted based on timestamp (in milliseconds), station, signal identification information, location, description, and action. Encoded within the identification information is the alarm message.

Use CaseData Gap Summary
Analysis of grid reliability events

Access to EPRI10 grid alarm data is currently limited to within EPRI. Usability gaps stem from redundancies in grid alarm codes, which require significant preprocessing and analysis of code IDs, alarm priority, location, and timestamps. Alarm codes can vary by sensor, asset, and line. Actions taken in response to alarm trigger events need field verification to determine whether a fault actually occurred.
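One simple form of the preprocessing mentioned here is collapsing repeated alarms from the same station and signal that arrive within a short time window. The sketch below is an illustration under stated assumptions, not the procedure used for EPRI10: the field names (`timestamp_ms`, `station`, `signal_id`), the sample signal IDs, and the 1000 ms window are all hypothetical.

```python
def deduplicate_alarms(events, window_ms=1000):
    """Drop repeats of the same (station, signal_id) within window_ms.

    events: list of dicts with 'timestamp_ms', 'station', 'signal_id'
    (field names are assumptions, loosely based on the alarm columns
    described for this dataset).
    """
    last_kept = {}
    kept = []
    for event in sorted(events, key=lambda e: e["timestamp_ms"]):
        key = (event["station"], event["signal_id"])
        previous = last_kept.get(key)
        if previous is None or event["timestamp_ms"] - previous >= window_ms:
            kept.append(event)
            last_kept[key] = event["timestamp_ms"]
    return kept

events = [
    {"timestamp_ms": 0, "station": "A", "signal_id": "BRK-1"},
    {"timestamp_ms": 200, "station": "A", "signal_id": "BRK-1"},   # repeat, dropped
    {"timestamp_ms": 300, "station": "B", "signal_id": "BRK-1"},   # other station, kept
    {"timestamp_ms": 1500, "station": "A", "signal_id": "BRK-1"},  # outside window, kept
]
unique = deduplicate_alarms(events)
```

The window is measured from the last kept event for each key, so a long burst of repeats is reduced to roughly one event per window rather than a single event.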

ERA5

Atmospheric reanalysis data integrates both in-situ and remote sensing observations, including data from weather stations, satellites, and radar. This comprehensive dataset can be downloaded from the provided link.

View dataset

Use CaseData Gap Summary
Bias-correction of climate projections

ERA5 is now the most widely used reanalysis dataset thanks to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of data is another common challenge for users of this dataset.
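One way users commonly cope with long waits and large volumes is to split a big download into many small requests that can be queued and retried independently. The sketch below only builds per-month request descriptions and does not contact any service; the function name and the dictionary keys are illustrative placeholders, not the actual request schema of the Copernicus Climate Data Store.

```python
def monthly_chunks(start_year, end_year, variable):
    """Yield one small request description per calendar month.

    The dictionary keys are illustrative placeholders, not the actual
    schema of any download service.
    """
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            yield {"variable": variable, "year": str(year), "month": f"{month:02d}"}

# Two years of one variable become 24 independent requests.
chunks = list(monthly_chunks(2019, 2020, "2m_temperature"))
```

Smaller chunks make a failed or stalled request cheaper to repeat, at the cost of more requests to track.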

Development of hybrid-climate models

ERA5 is now the most widely used reanalysis dataset thanks to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of data is another common challenge for users of this dataset.

Weather forecasting: Subseasonal horizon

ERA5 is now the most widely used reanalysis dataset thanks to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of data is another common challenge for users of this dataset.

Wildfire prediction: Short-term (3-7 days)

ERA5 is now the most widely used reanalysis dataset thanks to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of data is another common challenge for users of this dataset.

ESA land cover map

Yearly land cover classification gridded map at 300-m resolution from 1992 to present produced by European Space Agency (ESA) Climate Change Initiative (CCI) https://catalogue.ceda.ac.uk/uuid/b382ebe6679d44b8b0e68ea4ef4b701c.

Higher resolution land cover maps (at 10-m resolution) are also available for years 2020 and 2021 (https://esa-worldcover.org/en).

For fire prediction, this provides fine-grained information on available fuel.

Use CaseData Gap Summary
Wildfire prediction: Short-term (3-7 days)

Wildfires are becoming more frequent and severe due to climate change. Accurate early prediction of these events enables timely evacuations and resource allocation, thereby reducing risks to lives and property. ML can enhance fire prediction by providing more precise forecasts quickly and efficiently.

ESRI land cover map

Sentinel-2 10-m annual map of Earth’s land surface from 2017 to 2023.

There are also other land cover maps available: https://gisgeography.com/free-global-land-cover-land-use-data/.

Use CaseData Gap Summary
Emission dataset compiled from FAO statistics

Dataset taken from FAO statistics and extrapolated spatially

Use CaseData Gap Summary
Modeling effects of soil processes on soil organic carbon

Data is extrapolated from statistics on a national level. It is unknown how accurate this data is when focusing on local information.

Exposure data

Exposure is defined as the representative value of assets potentially exposed to a natural hazard occurrence. It can be described by a wide range of features, such as GDP, population, buildings, and agriculture, depending on the risk in question.

Both global open data and proprietary data are available; the latter carries more detailed information and comes from well-established insurance markets.

Exposure data can be socioeconomic or structural (building occupancy and construction class). Two open sources of structural data are OpenStreetMap and the OpenQuake GEM project.

Use CaseData Gap Summary
Disaster risk assessment

Accessibility and reliability are major issues.

Faraday: Synthetic smart meter data

Due to consumer privacy protections, advanced metering infrastructure (AMI) data is unavailable for realistic demand response studies. In an effort to open up smart meter data, Octopus Energy’s Centre for Net Zero has generated a synthetic dataset conditioned on the presence of low carbon technologies, energy efficiency, and property type, using a model trained on 300 million actual smart meter readings from a United Kingdom (UK) energy supplier.

Use Case | Data Gap Summary
Short-term electricity load forecasting

Faraday synthetic AMI data is a response to the privacy-related bottlenecks in retrieving building-level demand data. However, although the model is trained on actual AMI data, synthetic data generation requires a priori knowledge of building type, efficiency, and presence of low carbon technology. Furthermore, since the model is trained on UK building data, the generated AMI time series may not accurately represent load demand in regions outside the UK. Finally, since the data is synthetically generated, studies will require validation and verification using real data or data aggregated at the substation level to assess its effectiveness. Faraday is currently accessible through the Centre for Net Zero’s API.

FathomNet

FathomNet is an open-source image database that standardizes and aggregates expertly curated labeled data. It can be used to train, test, and validate state-of-the-art artificial intelligence algorithms to help us understand our ocean and its inhabitants.

Use Case | Data Gap Summary
Marine wildlife detection and species classification

The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.

Floating INfrastructure for Ocean observations FINO3

FINO3 is a research platform based on an offshore wind mast. Its datasets include time series of wind speed and wind direction as well as temperature, air pressure, relative humidity, global radiation, and precipitation. Images taken from the platform provide a direct snapshot of environmental conditions. The platform is located in the northern part of the German Bight, 80 km northwest of the island of Sylt, in the midst of wind farms. Wind measurements are taken between 32 and 102 meters above sea level, with wind speed measured every 10 meters. Data has been collected from August 2009 to the present day.

Use Case | Data Gap Summary
Offshore wind power forecasting: Long-term (3 hours-1 year)

Due to their location, FINO platform measurement sensors are prone to failure under adverse outdoor conditions such as high wind and high waves. When sensors fail, manual recalibration or repair is often nontrivial, because weather conditions must be amenable to human intervention. This can directly affect data quality for periods lasting from several weeks to a season. Coverage is also constrained to the platform and associated wind farm location.

GBIF

GBIF—the Global Biodiversity Information Facility—is an international network and data infrastructure funded by the world’s governments. It offers open access to global biodiversity data. It sets common standards for sharing species records collected from various sources, like museum specimens and modern technologies. Using standards like Darwin Core, GBIF.org indexes millions of species records, accessible under open licenses, supporting scientific research and policy-making.

Use Case | Data Gap Summary
Terrestrial wildlife detection and species classification

While GBIF provides a common standard, the accuracy of species classification in the data can vary, and classifications may change over time.

GEDI lidar

The Global Ecosystem Dynamics Investigation (GEDI) is a joint mission between NASA and the University of Maryland. It uses three lasers to capture and then construct detailed three-dimensional (3D) maps of forest canopy height and the distribution of branches and leaves. By accurately measuring forests in 3D, GEDI data play an important role in estimating the forest height as well as canopy height, and thus understanding the amounts of biomass and carbon forests store and how much they lose when disturbed.

Use Case | Data Gap Summary
Estimation of forest carbon stock

There is uncertainty in the data.

Grid event signature library

The Grid Event Signature Library

Grid2Op and PandaPower

Grid2Op is a power systems simulation framework for applying reinforcement learning to electricity network operation, with a focus on using topology to control flows on the grid. Grid2Op allows users to control voltages by manipulating shunts or changing generator setpoints, to influence active generation through redispatching, and to operate storage units such as batteries or pumped storage to produce or absorb energy when needed. The grid is represented as a graph in which nodes are buses and edges are power lines and transformers. Grid2Op offers several environments with different network topologies and different variables that can be monitored as observations. Each environment is designed for reinforcement learning agents to act upon through a variety of actions, some binary and some continuous. These include changing bus assignments, changing or setting line status, setting storage, curtailment, redispatching, and setting bus values. Multiple reward functions are also available for experimenting with different agents. It is important to note that Grid2Op itself does not model the grid equations or prescribe a particular solver. Data on how the power grid evolves over time is represented by the “Chronics.” The solver that computes the state of the grid is represented by the “Backend,” which uses PandaPower to compute power flows.
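The graph view of the grid can be illustrated with a minimal sketch in plain Python. It does not use the Grid2Op API: the four-bus ring, the line names, and the helper functions are hypothetical and exist only to show how a binary line-status action changes the topology an agent would observe.

```python
from collections import deque

# Hypothetical 4-bus ring; not one of Grid2Op's bundled environments.
BUSES = ["bus_0", "bus_1", "bus_2", "bus_3"]
LINES = {
    "line_0": ("bus_0", "bus_1"),
    "line_1": ("bus_1", "bus_2"),
    "line_2": ("bus_2", "bus_3"),
    "line_3": ("bus_3", "bus_0"),
}


def build_adjacency(line_status):
    """Map each bus to its neighbours, using only in-service lines."""
    adjacency = {bus: set() for bus in BUSES}
    for name, (a, b) in LINES.items():
        if line_status[name]:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def is_connected(line_status):
    """Breadth-first search from the first bus; True if every bus is reached."""
    adjacency = build_adjacency(line_status)
    seen = {BUSES[0]}
    queue = deque([BUSES[0]])
    while queue:
        bus = queue.popleft()
        for neighbour in adjacency[bus] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return len(seen) == len(BUSES)


def apply_line_action(line_status, line_name, in_service):
    """Return a new status dict with one line switched (a binary action)."""
    updated = dict(line_status)
    updated[line_name] = in_service
    return updated


all_in_service = {name: True for name in LINES}
one_out = apply_line_action(all_in_service, "line_0", False)
two_out = apply_line_action(one_out, "line_2", False)

print(is_connected(all_in_service), is_connected(one_out), is_connected(two_out))
```

Opening one line of the ring keeps every bus reachable, while opening a second line splits the grid into two islands.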

Use Case | Data Gap Summary
Improving power grid optimization

Grid2Op is a reinforcement learning framework that builds an environment from a topology, selected grid observations, a selected reward function, and a set of actions for an agent to choose from. The framework relies on control laws rather than direct system observations, which are subject to multiple constraints and changing load demand. Time steps of 5 minutes cannot capture complex transients and can limit the effectiveness of some actions in the action space relative to others. Furthermore, customizing Grid2Op can be challenging: the platform does not allow single-agent to multi-agent conversion, and its game-over rules make it unsuitable for cascading failure scenarios.

Ground survey of building information

On-site collection of data to accurately map and measure the physical dimensions and boundaries of buildings. This survey is typically conducted using a variety of methods and tools to ensure precise and detailed mapping.

Ground survey of land use and land management

The direct collection of data through field observations to understand how land is utilized and managed.

Use Case | Data Gap Summary
Detection of climate-induced ecosystem changes

Data access is restricted due to institutional barriers and other restrictions.

Ground-survey based forest inventory data

Forest information collected directly from forested areas through on-the-ground observations and measurements serves as ground truth for training and validating estimates. This data is crucial for accurate assessments, such as estimating forest canopy height using machine learning models.

Use Case | Data Gap Summary
Estimation of forest carbon stock

The data is manually collected and recorded, resulting in errors, missing values, and duplicates. Additionally, it is limited in coverage and collection frequency.

HRES

A single high-resolution forecast up to 10 days ahead, generated by the ECMWF numerical weather prediction model, the Integrated Forecasting System (IFS). It is usually used as a benchmark/baseline for evaluating ML-based weather forecasts. Data can be found here.

Use Case | Data Gap Summary
Bias-correction of weather forecasts

The biggest challenge with using HRES data is that only a portion of it is available to the public for free.

Weather forecasting: Short-to-medium term (1-14 days)

The biggest challenge with HRES data is that only a portion of it is available to the public for free.

Hazard data

Hazard data used for risk assessments is usually presented as a catalog of hypothetical events with characteristics derived from, and statistically consistent with, the observational record. Some hazard data catalogs can be found at https://sedac.ciesin.columbia.edu/theme/hazards/data/sets/browse, as well as in the Risk Data Library of the World Bank.

Use Case | Data Gap Summary
Disaster risk assessment

The resolution of current hazard data is not sufficient for effective physical risk assessment.

Health data

Health data refers to information related to individuals’ physical and mental well-being. This can include a wide range of data, such as medical records, health surveys, healthcare utilization, and epidemiological data.

Use Case | Data Gap Summary
Assessment of climate impacts on public health

The biggest issue for health data is its limited and restricted access.

High-resolution weather forecast (HRRR)

Near-term weather forecasts from the High-Resolution Rapid Refresh (HRRR) model. HRRR is a real-time, 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model. Radar data is assimilated into the HRRR every 15 min over a 1-h period.

Use Case | Data Gap Summary
Weather forecasting: Near-term (< 24 hours)

Data volume is large, and only data covering the US is available.

Historical climate observations

Climate observations of the past. Reanalysis datasets like ERA5 provide global-scale data at coarse resolution. Climate data aggregated from local weather station observations offers a more granular view.

Use Case | Data Gap Summary
Assessment of climate impacts on public health

Processing climate data and integrating it with health data are major challenges.

Detection of climate-induced ecosystem changes

For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.

LBNL: Solar panel PV system dataset

Lawrence Berkeley National Lab (LBNL) Solar Panel PV System Dataset is a small tabular dataset that includes specific feature data on PV system size, rebate, construction, tracking, mounting, module types, number of inverters and types, capacity, electricity pricing, and battery rated capacity. The LBNL solar panel PV system dataset was created by collecting and cleaning data for 1.6 million individual PV systems, representing 81% of all U.S. distributed PV systems installed through 2018. The analysis of installed prices focused on a subset of roughly 680,000 host-owned systems with available installed price data, of which 127,000 were installed in 2018. The dataset was sourced primarily from state agencies, utilities, and organizations administering PV incentive programs, solar renewable energy credit registration systems, or interconnection processes.

Use Case | Data Gap Summary
Solar installation site assessment

The LBNL solar panel PV system dataset excludes third-party-owned systems and systems with battery backup. Since the data is self-reported, component cost data may be imprecise. Timeliness is also a gap: some of the data is historical and may not reflect the current pricing and costs of PV systems.

Large-eddy simulations

Very high resolution (finer than 150 m) atmospheric simulations where atmospheric turbulence is explicitly resolved in the model.

Use Case | Data Gap Summary
Development of hybrid-climate models

Extremely high-resolution simulations, such as large-eddy simulations, are needed. By explicitly resolving processes that are not resolved in current climate models, these simulations may serve as better ground truth for training machine learning models that emulate the physical processes and offer a more accurate basis for understanding and predicting climate phenomena.

LiDAR

LiDAR (Light Detection and Ranging) data provides high-resolution, three-dimensional information about surfaces and objects. Some open datasets that can be used for roof classification include OpenTopography and the USGS 3D Elevation Program (3DEP). Many cities, such as Boston and London, also have their own LiDAR datasets.

Micro-synchrophasors (µPMU data)

Micro-phasor measurement units (µPMUs) provide synchronized voltage and current measurements with higher accuracy, precision, and sampling rate than classic PMUs, making them well suited to distribution network monitoring. For example, µPMUs have an angle accuracy of 0.01 degrees and a total vector error allowance of 0.05%, compared with 1 degree and a 1% total vector error allowance for classic PMUs. With sampling rates of 10-120 samples per second, µPMUs can capture dynamic and transient states within the low voltage distribution network, allowing for improved event and fault detection and localization. Today, most µPMU datasets can be accessed through manual field deployments in test-beds, collaborative research studies, or publicly available datasets.
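To make the accuracy figures concrete, the sketch below computes total vector error (TVE) as the magnitude of the difference between a measured and a reference phasor, divided by the magnitude of the reference, which is the usual synchrophasor definition. The phasor values are illustrative assumptions, not measurements from any µPMU dataset.

```python
import cmath
import math


def phasor(magnitude, angle_deg):
    """Build a complex phasor from a magnitude and an angle in degrees."""
    return cmath.rect(magnitude, math.radians(angle_deg))


def total_vector_error(measured, reference):
    """TVE as a fraction: |measured - reference| / |reference|."""
    return abs(measured - reference) / abs(reference)


# Assumed example: a unit reference phasor and a measurement that is
# off by 0.01 degrees in angle, with no magnitude error.
reference = phasor(1.0, 0.0)
measured = phasor(1.0, 0.01)

tve_percent = 100.0 * total_vector_error(measured, reference)
print(f"TVE = {tve_percent:.4f} %")
```

Under these assumptions, a pure angle error of 0.01 degrees corresponds to a TVE of roughly 0.017%, which falls inside the 0.05% allowance quoted for µPMUs.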

Use Case | Data Gap Summary
Fault detection in low voltage distribution grids

For effective fault localization using µPMU data, it is crucial that distribution circuit models supplied by utility partners or Distribution System Operators (DSOs) are accurate, though they often lack detailed phase identification and impedance values, leading to imprecise fault localization and time series contextualization. Noise and errors from transformers can degrade µPMU measurement accuracy. Integrating additional sensor data affects data quality and management due to high sampling rates and substantial data volumes, necessitating automated analysis strategies. Cross-verifying µPMU time series data, especially around fault occurrences, with other data sources is essential due to the continuous nature of µPMU data capture. Additionally, µPMU installations can be costly, limiting deployments to pilot projects over a small network. Furthermore, latencies in the monitoring system due to communication and computational delays challenge real-time data processing and analysis.

NASA-USDA global soil moisture data

The NASA-USDA Global soil moisture data offers detailed global soil moisture information at a 0.25°x0.25° resolution, including surface and subsurface moisture, moisture profiles, and anomalies. This dataset integrates satellite observations from SMAP and SMOS with a modified Palmer model using the Ensemble Kalman Filter to enhance soil moisture predictions, especially in areas with sparse precipitation data.
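As a rough illustration of the Ensemble Kalman Filter idea mentioned above, the sketch below applies a scalar, perturbed-observation update to a small ensemble of soil moisture values. This is a toy under stated assumptions, not the NASA-USDA processing chain: the ensemble values, the observation, the error variance, and the function name `enkf_update` are all hypothetical.

```python
import random
import statistics


def enkf_update(forecast, observation, obs_error_var, perturbations=None, seed=0):
    """Scalar EnKF analysis step with perturbed observations.

    forecast:       list of model soil moisture values (one per ensemble member)
    observation:    a single satellite-style observation of the same quantity
    obs_error_var:  assumed observation error variance
    perturbations:  optional per-member observation noise; drawn from
                    N(0, obs_error_var) with a fixed seed when omitted
    """
    if perturbations is None:
        rng = random.Random(seed)
        perturbations = [rng.gauss(0.0, obs_error_var ** 0.5) for _ in forecast]
    forecast_var = statistics.variance(forecast)  # sample variance of the ensemble
    gain = forecast_var / (forecast_var + obs_error_var)  # Kalman gain in [0, 1)
    return [x + gain * (observation + eps - x) for x, eps in zip(forecast, perturbations)]


# Hypothetical volumetric soil moisture values (m^3/m^3).
forecast_ensemble = [0.20, 0.24, 0.28, 0.32]
analysis_ensemble = enkf_update(forecast_ensemble, observation=0.30, obs_error_var=0.001)
print([round(x, 3) for x in analysis_ensemble])
```

The gain weights the observation more heavily when the model ensemble is widely spread relative to the assumed observation error, which is why assimilation helps most where the model forcing is poorly constrained.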

NEX-GDDP-CMIP6

NASA Earth Exchange Global Daily Downscaled Projections from CMIP6. See https://www.nccs.nasa.gov/services/data-collections/land-based-products/nex-gddp-cmip6 for more information.

Use Case | Data Gap Summary
Extreme heat prediction

The major challenge is handling the size of the data.

NIST campus photovoltaic (PV) arrays and weather station data sets

The National Institute of Standards and Technology (NIST) campus photovoltaic array datasets, collected from August 2014 to July 2017, contain electrical, temperature, meteorological, spectral curve, UV light, and infrared radiation measurements from PV sensors, along with solar inverter power data, from multiple testbeds on the NIST campus. The testbeds include a parking lot canopy array, a ground mount array, a roof-tilted array, a rooftop weather station, and a rooftop module test station. Measurements are sampled and saved at high frequency, with one-minute averages. The dataset includes latitude, longitude, and elevation metadata.

Use Case | Data Gap Summary
Solar power forecasting: Very-short-term (0-30min)

Data coverage is limited to the NIST campus in Gaithersburg, MD, and the dataset has not been maintained since July 2017.

NOAA's SOLRAD network

The National Oceanic and Atmospheric Administration’s (NOAA) SOLRAD Network monitors surface radiation in various regions of the United States as part of NOAA’s SURface RADiation budget measurement network. The data includes measurements from different types of instruments and sensors, such as pyrheliometers, pyranometers, radiometers, and UV radiometers. These instruments collect data on incoming radiation, including both visible and UV components, with specific measurement resolutions and accuracy requirements to characterize the Earth’s surface radiation budget. By taking minute-interval measurements of incoming solar radiation and accounting for reflection, absorption, and emission, the solar energy available for power generation can be accurately forecast for solar farms and large-scale solar grid planning projects.

Use Case | Data Gap Summary
Solar power forecasting: Short-term (30 min-6 hours)

While NOAA’s SOLRAD is an excellent data source for long-term solar irradiance and climate studies, data gaps exist for the short-term solar forecasting use case, which requires hourly averages. The data quality of hourly averages is lower than that of native-resolution data, which hampers effective short-term forecasting for real-time energy management, including grid stability, demand response, real-time market price prediction, and dispatch. Coverage is also constrained to certain parts of the United States based on the SURFRAD network locations.
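One way hourly averages can lose quality relative to native-resolution data is through missing minutes. The sketch below aggregates minute-interval irradiance readings into hourly means and flags hours with too few valid samples. The input layout, the function name `hourly_means`, and the 45-sample threshold are assumptions made for illustration, not SOLRAD conventions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hourly_means(samples, min_valid=45):
    """Aggregate (timestamp, irradiance) pairs into hourly averages.

    samples:   iterable of (datetime, value in W/m^2 or None for a missing reading)
    min_valid: assumed minimum number of valid minutes for a trustworthy hour
    Returns {hour_start: (mean, is_reliable)}; hours with no valid reading are omitted.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        if value is not None:
            hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
            buckets[hour_start].append(value)
    return {
        hour: (sum(values) / len(values), len(values) >= min_valid)
        for hour, values in sorted(buckets.items())
    }


# Hypothetical two hours of data: the first hour is complete, the second
# hour loses its last 30 minutes (for example, to a sensor outage).
start = datetime(2020, 6, 1, 12, 0)
example_samples = [
    (start + timedelta(minutes=m), 600.0 if m < 60 else (400.0 if m < 90 else None))
    for m in range(120)
]
example_result = hourly_means(example_samples)
for hour, (mean, reliable) in example_result.items():
    print(hour, mean, reliable)
```

Carrying a reliability flag alongside each hourly mean lets a forecasting pipeline down-weight or exclude hours whose average rests on only part of the underlying minute data.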

NREL Solar Radiation Database (NSRDB)

The National Renewable Energy Laboratory (NREL) Solar Radiation Database is part of the NREL solar radiation resource assessment project. It includes hourly and half-hourly data modeled using NREL’s Physical Solar Model (PSM), with inputs derived from the Geostationary Operational Environmental Satellite (GOES) of the National Oceanic and Atmospheric Administration (NOAA), the Interactive Multisensor Snow and Ice Mapping System (IMS), the Moderate Resolution Imaging Spectroradiometer (MODIS), and the Modern Era Retrospective Analysis for Research and Applications v2 (MERRA-2). PSM derives cloud and aerosol properties and feeds them as input into a radiative transfer model, the Fast All-sky Radiation Model for Solar applications (FARMS). The dataset can provide users with on-demand spectral irradiances based on time, location, and photovoltaic (PV) orientation.

Use Case | Data Gap Summary
Solar power forecasting: Short-term (30 min-6 hours)

While data coverage is global and based on satellite-derived inputs to the Fast All-sky Radiation Model for Solar applications (FARMS), a radiative transfer model, the output is calculated over specific time frames and would need to be recalculated and updated for recent periods. Furthermore, the data is unbalanced, as the United States has the longest temporal coverage. Satellite-based estimates of solar resource information may be affected by cloud cover, snow, and bright surfaces, and therefore require additional verification against ground-based measurements and collation of outside data sources. Additionally, since the data is derived from satellites, it may require preprocessing to account for parallax effects, which depend on the field of view of the covering satellite and the region of interest and which may not be expressed in the higher-level FARMS tabular products.

NREL Solar Radiation Research Laboratory (SRRL): Baseline Measurement System (BMS)

SRRL BMS has 130 data at 60sec intervals for joint variable studies on Golden, CO site specific environmental factors that may be used in photovoltaic potential studies and renewable resource climatology studies. Joint datasets available include co-located data from sensors detecting temperature, pressure, precipitation, wind speed, wind direction, humidity, UV index, aerosol optical depth (AOD), albedo, and percent cloud cover (by category consisting of opaque, thin, and clear).

Use Case | Data Gap Summary
Solar power forecasting: Short-term (30 min-6 hours)

While NREL’s SRRL BMS provides real-time joint variable data from ground-based sensors, coverage is restricted to the sensor network in Golden, CO, in the United States. Since the measurement system comprises diverse sensors, individual sensors may malfunction or drift out of calibration. Such faults require human intervention and maintenance once detected, and delays in detection can lead to inaccuracies in the data.

NREL solar power data for integration studies

The NREL Solar Power Data for Integration Studies consists of one year (2006) of 5-minute solar power data and hourly day-ahead forecasts for 6,000 simulated PV plants, whose locations were based on the capacity expansion plan for high-penetration renewables in Phase 2 of the Western Wind and Solar Integration Study and the Eastern Renewable Generation Integration Study in the United States. NREL generated the data using a sub-hour irradiance algorithm and day-ahead solar forecast data generated by 3TIER based on Numerical Weather Prediction (NWP) simulations for Phase 1 of the Western Wind and Solar Integration Study. The data contains utility-scale and distributed-scale PV, which differ in that utility-scale PV uses single-axis tracking while distributed-scale PV has fixed tilt. This data is beneficial for reliability studies, dispatch, and electricity market bidding and clearing models.

Use Case | Data Gap Summary
Solar power forecasting: Long-term (>24 hours)

While the synthetic PV plant data is useful for forecasting and control simulation case studies when actual data is not available, there are limitations that must be taken into account when interpreting results: verification for site-specific projects, representation of coverage areas outside the US, and modeling assumptions based on data proxies.

Ocean biodiversity data

Data in various formats (image, audio, video) containing information on the biodiversity of the ocean.

Use Case | Data Gap Summary
Marine wildlife detection and species classification

As with terrestrial biodiversity data, the lack of well-annotated data is the biggest bottleneck. For existing data, enabling broader data sharing is the most critical challenge to address. Data collection efforts should also be strategic, targeting places where biodiversity is high but currently available data is sparse.

Offshore wind meteorological data and LiDAR wind mapping

Offshore wind and meteorological data and Light Detection and Ranging (LiDAR) based wind mapping data are available from several providers. LiDAR-based wind mapping has advantages over traditional wind mast tower measurements, namely higher resolution, larger coverage, and improved data quality. This is because LiDAR can measure wind speeds at various heights above the ground, reducing the impact of turbulence that would typically affect mast measurements. Furthermore, LiDAR-based wind mapping can provide near-real-time wind data suitable for control optimization and load forecasting applications. Datasets include:

TNO: Offshore wind measurements

Lichteiland Goeree (LEG)

Europlatform (EPL)

K13A

L2-FA-1

Meetmast IJmuiden (MMIJ)

Offshore Wind Egmond aan Zee (OWEZ)

Orsted Offshore Meteorological Data:

Anholt offshore wind farm (ANH)

Westermost Rough offshore wind farm (WMR)

FINO2 offshore meteorological data

Access to the above datasets can be requested from TNO and Orsted, both of which are based in Europe.

Use Case | Data Gap Summary
Offshore wind power forecasting: Long-term (3 hours-1 year)

The spatiotemporal coverage of the offshore meteorological and wind speed platform data is restricted to the dimensions of the platform itself and the period since its construction. Depending on the data provider, access to the data may require signing a non-disclosure agreement.

On-animal sensors

On-animal sensors are used to acquire movement trajectories which can then be classified into activity types that relate to the behavior of individuals or social groups.

Movebank (https://www.movebank.org/cms/movebank-main) is the global repository for animal tracking data.

Open Street Map

Open Street Map is an open-source map that provides data about geographic features such as roads and rails all over the world. It is built and maintained by a community of mappers using aerial imagery, GPS devices, and low-tech field maps.

Use Case | Data Gap Summary
Disaster risk assessment

It does not include metadata on when infrastructure, such as a building, was built. This information is important for identifying the age of a building, which ultimately characterises its exposure to hazards.

Open datasets on supply chain traceability

Information on where a commodity comes from and who produced it. Trase provides some of this data, but only for certain types of commodities, and it does not provide traceability down to individual farms.

OpenAerialMap

Crowd-sourced open aerial images https://openaerialmap.org/

Optimal power flow simulators

PowerWorld Simulator and MATPOWER are software tools for optimizing power systems, and both include representations of alternating current (AC) and direct current (DC) systems. PowerWorld Simulator models, analyzes, and optimizes power systems across a wide range of configurations and scenarios, from small distribution networks to transmission systems. MATPOWER is an open source alternative that also solves both the AC and DC versions of optimal power flow (OPF). In MATPOWER, DC OPF is simplified into a quadratic program by applying DC modeling assumptions, reducing polynomial costs to second order, and expressing real power flows as a function of voltage angles (thereby eliminating voltage magnitude and reactive power). PowerWorld Simulator utilizes a combination of iterative algorithms (Newton-Raphson) with traditional power flow equations.
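The DC modeling assumption, in which real power flows depend only on voltage angles, can be sketched in a few lines of plain Python. The example below solves a DC power flow (not a full OPF, and not MATPOWER's or PowerWorld's implementation) on a hypothetical three-bus network; the line reactances, injections, and function names are assumptions chosen for illustration.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def dc_power_flow(n_buses, lines, injections, slack=0):
    """Return (angles, line flows) under the DC approximation.

    lines:      list of (from_bus, to_bus, reactance in per unit)
    injections: net real power injection per bus (generation minus load, per unit)
    """
    # Susceptance matrix B, so that injections = B * angles.
    b = [[0.0] * n_buses for _ in range(n_buses)]
    for i, j, x in lines:
        b[i][i] += 1 / x
        b[j][j] += 1 / x
        b[i][j] -= 1 / x
        b[j][i] -= 1 / x
    # Fix the slack bus angle at zero and solve for the remaining angles.
    keep = [k for k in range(n_buses) if k != slack]
    reduced = [[b[r][c] for c in keep] for r in keep]
    solved = solve_linear(reduced, [injections[k] for k in keep])
    theta = [0.0] * n_buses
    for k, value in zip(keep, solved):
        theta[k] = value
    flows = {(i, j): (theta[i] - theta[j]) / x for i, j, x in lines}
    return theta, flows


# Hypothetical network: bus 0 generates 1.5 p.u.; buses 1 and 2 are loads.
LINES = [(0, 1, 0.1), (0, 2, 0.1), (1, 2, 0.1)]
INJECTIONS = [1.5, -1.0, -0.5]
theta, flows = dc_power_flow(3, LINES, INJECTIONS)
for line, flow in flows.items():
    print(line, round(flow, 4))
```

Because the flows are linear in the voltage angles, adding a quadratic generation cost on top of these equations yields a quadratic program, which is the simplification described above.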

MATPOWER is open source, and PowerWorld Simulator has several options for industry practitioners as well as for academic use. Its demo software, licensed for educational use, includes simulator features such as available transfer capability, optimal power flow, security-constrained OPF, OPF reserves, the PV/QV curve tool, transient stability, and geomagnetically induced current. In terms of topology, the free version handles up to 13 buses, while the full version of the simulator can handle 250,000 buses.

Use Case | Data Gap Summary
Improving power grid optimization

Traditional OPF simulation software may require the purchase of licenses for advanced features and functionalities. To simulate more complex systems or regions, additional data regarding energy infrastructure, region-specific load demand, and renewable generation may be needed to conduct studies. OPF simulation output would require verification and performance evaluation to assess results in practice. Increasing the granularity of the simulation model by increasing the number of buses, limits, or additional parameters increases the complexity of the OPF problem, thereby increasing the computational time and resources required.

Orsted: Offshore wind SCADA operation data

Orsted’s offshore operation data provides 2 years of 10-minute Supervisory Control and Data Acquisition (SCADA) data on nacelle wind speed, electrical power, rotor speed, yaw position, and pitch angle for turbines, together with on-site wave buoy data and ground-based LiDAR, from different offshore wind farm sites. At one site, the Anholt offshore wind farm, data is collected from 111 Siemens SWT-120-3.6 MW wind turbines arranged in a 20 km by 8 km layout, with internal spacing between turbines of 5-7 rotor diameters and a water depth of 15-19 m. The other site, the Westermost Rough offshore wind farm, located northeast of Withernsea off the Holderness coast in the North Sea, England, has a 35 km by 35 km spatial coverage area.

Use Case | Data Gap Summary
Offshore wind power forecasting: Short-term (10 min)

Access to the data can be requested via the Orsted form. The sufficiency of the dataset is constrained by volume, as only a finite number of offshore wind farms exist. Expanding the coverage area, the volume, and the time granularity of the data to under 10 minutes may enable transient detection from generated active power.

PV Anlage-Reinhart system

PV Anlage-Reinhart System provides hourly PV power, energy, CO2 avoided, and PV system information for publicly available PV systems, collated and compiled by SMA Solar Technology AG (System, Mess und Anlagentechnik), a German company specializing in solar energy equipment based in Niestetal, Northern Hesse, Germany. SMA is a leading manufacturer and supplier of solar inverters for photovoltaic (PV) systems and has publicly contributed PV system data from its international locations (Germany, the US, Chile, Brazil, Mexico, Canada, Spain, Italy, France, China, Australia, Belgium, India, Poland, Japan, UK, South Africa, Türkiye, and the United Arab Emirates) as well as user-contributed systems. Its curated PV power generation data, with identified SMA inverter, module, and battery information (depending on the availability of the dataset selected from the portal), has the potential to be used in micro-grid studies in addition to time series forecasting of distributed energy resources (DERs).

Use Case | Data Gap Summary
Solar power forecasting: Short-term (30 min-6 hours)

Accessing the PV Anlage-Reinhart System information and PV inverter data collated and compiled by SMA requires creating a user profile and requesting access to specific systems. The portal may lack clear instructions in languages other than German, and systems located in Germany, the Netherlands, and Australia are more heavily represented despite the presence of data globally. Furthermore, only a subset of the curated systems contains joint energy storage data, which may be valuable for DER-specific load forecasting studies.

Pecan Street

Pecan Street DataPort began as a Smart Grid Demonstration program through the Pecan Street energy research nonprofit organization which worked closely with the University of Texas at Austin. Funded by the DOE in 2014, the project signed up 1000 research participants from the Mueller community in Austin, Texas to share green button, smart meter, and home energy management system (HEMS) data in 750 homes and 25 commercial properties. Financial incentivization of plug-in electric vehicle use and rooftop solar installation by Austin Energy encouraged residential lifestyle shifts. In addition to providing access to sub-metered appliance level consumption data, Pecan Street includes electric vehicle charging, rooftop solar, heating, cooling, and water usage data. Data coverage has expanded to volunteer households from California, New York and Colorado. Previously open for use, Pecan Street has been privatized and now data access and products are available for commercial and academic purchase depending on the level of access requested.

Use CaseData Gap Summary
Non-intrusive electricity load monitoring

Pecan Street DataPort requires both non-academic and academic users to purchase access through licensing, which varies depending on the building data features requested. Data coverage is primarily concentrated in the Mueller planned housing community in Austin, Texas, a modern built environment that is not representative of older historical buildings that may need energy efficiency upgrades and retrofits. For customer segmentation studies and consumer-in-the-loop load consumption modeling, the annual socio-demographic survey data may be too coarse to provide insight into how the behavior of household members affects consumption profiles over time.

Power Grid Lib: Optimal power flow benchmark library

The Power Grid Library (PGLib-OPF) is a collection of git repositories that house benchmark data for validating power system simulations. It contains 36 networks with 3 to 13,659 buses sourced from the IEEE Power Flow Test Cases, IEEE Dynamic Test Cases, IEEE Reliability Test System, Polish Test Cases, PEGASE Test Cases, and RTE Test Cases. The networks have been modified to raise optimality gaps to between 1% and 10%, which makes the AC optimal power flow (AC-OPF) problems more challenging to solve. Users who want to study more realistic AC-OPF scenarios can directly retrieve compiled bus IDs, branch IDs, generator IDs, power demand, shunt admittance, bus voltage magnitude ranges, generator power injection ranges, quadratic active power cost function coefficients for generators, and branch parameters such as series admittance, line charging, transformer parameters, thermal limits, and voltage angle difference ranges. All parameters are standardized to the MATPOWER data file format for direct use. PGLib-OPF is open source.
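
Because the cases are distributed as MATPOWER data files, their matrices can be read with a few lines of code. The following is a minimal sketch rather than an official PGLib-OPF or MATPOWER tool: the function name `read_matpower_matrix` and the two-bus sample text are hypothetical, and the sketch assumes the usual `mpc.<name> = [ ... ];` layout with `%` comments and the standard bus column order (bus_i, type, Pd, Qd, ...).

```python
import re

# Hypothetical two-bus excerpt written in MATPOWER case-file style.
SAMPLE_CASE = """
mpc.baseMVA = 100.0;
%% bus data
%  bus_i type Pd Qd Gs Bs area Vm Va baseKV zone Vmax Vmin
mpc.bus = [
    1 3   0.0  0.0  0.0 0.0 1 1.0 0.0 230.0 1 1.1 0.9;
    2 1 300.0 98.61 0.0 0.0 1 1.0 0.0 230.0 1 1.1 0.9;
];
"""


def read_matpower_matrix(text, name):
    """Return the rows of `mpc.<name> = [ ... ];` as lists of floats."""
    pattern = r"mpc\." + re.escape(name) + r"\s*=\s*\[(.*?)\];"
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        raise KeyError(name)
    rows = []
    for line in match.group(1).splitlines():
        # Drop trailing comments and the row-terminating semicolon.
        line = line.split("%", 1)[0].strip().rstrip(";").strip()
        if line:
            rows.append([float(token) for token in line.split()])
    return rows


bus = read_matpower_matrix(SAMPLE_CASE, "bus")
# Column index 2 is Pd (active power demand, MW) in the MATPOWER bus matrix.
total_demand_mw = sum(row[2] for row in bus)
```

The same function can be pointed at `gen` or `branch` blocks, assuming they follow the same bracketed layout.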

Use CaseData Gap Summary
Improving power grid optimization

While the network datasets are open source, maintaining the repository requires continuous curation and the collection of more complex benchmark data to enable diverse AC-OPF simulation and scenario studies. Industry engagement can help develop more realistic data, which may be hard to find without such a cooperative effort.

Radar data (MRMS)

Radar observations from the MRMS (Multi-Radar Multi-Sensor) system: https://www.nssl.noaa.gov/projects/mrms/

Use CaseData Gap Summary
Weather forecasting: Near-term (< 24 hours)

Obtaining and integrating radar data from various sources is challenging.

Regularly gridded high-resolution atmospheric observations

Though a lot of data is available, a set of regularly gridded 3D high-resolution observations of the atmospheric state (like a higher-resolution version of ERA5) is still needed. Such a dataset is essential both for an improved understanding of atmospheric processes and for the development of ML-based weather forecast and climate models.

Use CaseData Gap Summary
Development of hybrid-climate models

An enhanced version of ERA5 with higher resolution and fidelity is needed. 

Weather forecasting: Near-term (< 24 hours)

An enhanced version of ERA5 with higher granularity and fidelity is needed. Many surface observations and remote sensing data are already available to support the development of such a dataset.

Residential daylight performance metric (DPM) data

Residential daylight performance metric (DPM) data, covering daylight autonomy (DA), continuous daylight autonomy (cDA), spatial daylight autonomy (sDA), and useful daylight illuminance (UDI), can be generated using physics-based ray tracing simulations that calculate illuminances over a prototype building layout. Simulation software available for calculating DPMs includes IES Virtual Environment (IESVE), DesignBuilder, VELUX Daylight Visualizer, and the open source RADIANCE 5.0. To generate synthetic data from these simulation frameworks, the user must provide a geometric model of the building, climate data for the building location, reflectance and transmittance values for materials, the desired Radiance parameters, an occupancy schedule, and a virtual sensor grid over which the incident illuminance is calculated. Strategies based on the simulation output can assist architects in optimizing window placement and size, incorporating shading devices, and designing floor plans to control direct and diffuse natural light in the building.
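
To make the four metrics concrete, here is a minimal sketch of how they could be computed from simulated illuminance values at occupied hours. It is an illustration under stated assumptions, not the method of any of the tools named above: the function names are hypothetical, and the 300 lux target, the 100 to 2000 lux useful range, and the 50% time fraction for sDA are commonly quoted defaults that should be replaced with whatever thresholds a given study adopts.

```python
def daylight_autonomy(lux, threshold=300.0):
    """Fraction of occupied hours at or above the target illuminance."""
    return sum(e >= threshold for e in lux) / len(lux)


def continuous_daylight_autonomy(lux, threshold=300.0):
    """Like DA, but hours below the target earn partial credit e / threshold."""
    return sum(min(e / threshold, 1.0) for e in lux) / len(lux)


def useful_daylight_illuminance(lux, low=100.0, high=2000.0):
    """Fraction of occupied hours with illuminance inside the useful range."""
    return sum(low <= e <= high for e in lux) / len(lux)


def spatial_daylight_autonomy(sensor_grid, threshold=300.0, min_fraction=0.5):
    """Fraction of sensor points whose DA reaches at least min_fraction."""
    passing = sum(
        daylight_autonomy(lux, threshold) >= min_fraction for lux in sensor_grid
    )
    return passing / len(sensor_grid)


# Hypothetical illuminance (lux) at two sensor points over four occupied hours.
sensor_a = [0.0, 150.0, 300.0, 600.0]
sensor_b = [50.0, 100.0, 200.0, 2500.0]

da_a = daylight_autonomy(sensor_a)
cda_a = continuous_daylight_autonomy(sensor_a)
udi_a = useful_daylight_illuminance(sensor_a)
sda = spatial_daylight_autonomy([sensor_a, sensor_b])
```

In a real study the input lists would come from the simulated sensor grid, filtered by the occupancy schedule.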

Use CaseData Gap Summary
Energy-efficient new building design

Daylight performance metrics (DPMs) have been developed by building researchers and architects from daylight access simulation output to quantify the illumination of indoor spaces by natural light. While DPM evaluation is an important step in the planning of commercial buildings, residential buildings do not receive similar focus, which is unusual given that most new building construction occurs within the residential sector. The data gaps here concern residential DPMs, which lack metrics for direct sunlight access, rely on annual averages across seasons, and use fixed occupancy schedules that are overly simplified for residential spaces. Additionally, illuminance metrics and thresholds used in commercial spaces do not translate well to residential spaces, where people may prefer higher or lower illuminances depending on their location and lifestyle. Lastly, DPM optimization is based on operational metrics and assumptions about the illumination of a space and its effects on the thermal comfort and operational consumption of traditional urban residential spaces. Vernacular architecture, which is specific to a local region and culture, may not share these objectives and may instead favor indoor-outdoor transitional spaces and earthen materials, with less focus on windows and incident natural sunlight.

S2S forecast data

NWP model output from S2S experiment https://confluence.ecmwf.int/display/S2S/Models

Use CaseData Gap Summary
Weather forecasting: Subseasonal-to-seasonal horizon

More data is needed to take advantage of large ML models.

SAR

Synthetic aperture radar images captured by sensors on satellites. Visit https://www.earthdata.nasa.gov/learn/backgrounders/what-is-sar for more information.

Use CaseData Gap Summary
SOLETE

SOLETE, part of the Energy System Integration Lab (SYSLAB) at Technical University of Denmark (DTU) Wind and Energy Systems, includes 15 months of measurements at different resolutions (from one second to hourly), from 1 June 2018 to 1 September 2019. It covers timestamp, air temperature, relative humidity, pressure, wind speed, wind direction, global horizontal irradiance, plane of array irradiance, and active power recorded from an 11 kW Gaia wind turbine and a 10 kW PV inverter. This dataset is useful for time series forecasting at the inverter level for joint solar and wind DER systems.
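
Working across resolutions usually means aggregating the finer series onto the coarser time step. The sketch below shows one way to do that with the standard library; the function name `hourly_mean` and the sample values are hypothetical and are not taken from SOLETE.

```python
from collections import defaultdict
from datetime import datetime


def hourly_mean(samples):
    """Average (timestamp, value) samples into bins keyed by the hour start."""
    bins = defaultdict(list)
    for timestamp, value in samples:
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        bins[hour_start].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(bins.items())}


# Hypothetical second-level active power samples (kW).
samples = [
    (datetime(2018, 6, 1, 12, 0, 0), 4.0),
    (datetime(2018, 6, 1, 12, 0, 1), 6.0),
    (datetime(2018, 6, 1, 13, 0, 0), 8.0),
]
hourly = hourly_mean(samples)
```

A plain mean is only one choice; a forecasting study might instead keep the maximum or the variance within each hour to preserve ramp information.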

Use CaseData Gap Summary
Solar power forecasting: Short-term (30 min-6 hours)

While SOLETE is well suited to inverter-level forecasting studies of joint wind and solar DER generation, the dataset can be improved by addressing several gaps in data sufficiency. Temporal coverage could be expanded to capture seasonal variations, for example with additional outside data or simulation. Outside data or simulation may also help scale studies to multiple generation sources (more than one PV array and wind turbine) and the coordination between them needed to maintain grid reliability and stability. Additionally, a data wish for SOLETE is the addition of maintenance schedules or system downtime data to model system dynamics with DERs more realistically.

SWINySEG (Singapore whole sky Nychthemeron image SEGmentation database)

SWINySEG contains 6768 daytime and nighttime images of sky/cloud patches with corresponding binary ground truth maps, taken from SWIMSEG and SWINSEG. The images were captured in Singapore using WAHRSIS, a calibrated ground-based whole sky imager, over a period of 12 months from January 2016 to December 2016. Ground truth annotations were completed with expertise from the Singapore Meteorological Services.
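
Binary ground truth maps of this kind are typically used in two ways: to derive a cloud cover fraction and to score a segmentation model against the annotation. The following is a small sketch of both, assuming masks are nested lists with 1 for cloud and 0 for sky; the function names and the toy masks are hypothetical.

```python
def cloud_fraction(mask):
    """Share of pixels labeled as cloud (1) in a binary mask."""
    pixels = [p for row in mask for p in row]
    return sum(pixels) / len(pixels)


def intersection_over_union(pred, truth):
    """IoU of the cloud class between a predicted and a ground truth mask."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0


# Toy 2x3 masks: 1 = cloud, 0 = sky.
truth = [[1, 1, 0], [0, 1, 0]]
pred = [[1, 0, 0], [0, 1, 1]]

fraction = cloud_fraction(truth)
iou = intersection_over_union(pred, truth)
```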

Use CaseData Gap Summary
Solar power forecasting: Very-short-term (0-30min)

Annotated sky image data is needed for cloud detection and segmentation to improve local, PV site-specific irradiance predictions. This dataset is constrained to the coverage area of Singapore, and its license restricts commercial use.

Satellite Images

Satellite imagery of various spatial and spectral resolutions with global coverage and at different (granular) time slices. There are numerous applications of this type of data in Earth monitoring (glacier, ice extent, snow cover), agriculture (soil composition, crop yield, and crop type detection), forest and high-risk ecosystem monitoring (tree height/type and land cover estimation), flood mapping and management (risk analysis, loss assessment, disaster management during floods), wildfire detection and prediction, cities (building height estimation), energy (wind turbine and solar panel localization), and across sectors (methane, CO2, and N2O measurement).

Different applications require different spatial, spectral, and temporal resolutions and different kinds of labels. For example, for many monitoring applications in the energy, building, and transport sectors, spatial resolution is far more important, and very high-resolution RGB images are often used. Spectral resolution is more relevant for vegetation, land use, and land cover monitoring (forest, vegetation, etc.).

Some of the most widely used satellite imagery comes from Sentinel-1 and 2, MODIS, VIIRS, and Landsat, which are open to the public with resolutions down to 5m. Commercial satellites can provide much higher-resolution images (e.g., 30 cm from Maxar), but they are not open to the public and do not have global coverage. It is worth noting that Planet NICFI provides free high-resolution, analysis-ready mosaics of the world’s tropics for non-commercial use.

Use CaseData Gap Summary
Digital reconstruction of the environment

Satellite images provide environmental information for habitat monitoring. Combined with other data, e.g. bioacoustic data, they have been used to model and predict species distribution, richness, and interaction with the environment. High-resolution images are needed but most of them are not open to the public for free.

Earth observation for climate-related applications

Satellite images are intensively used for Earth system monitoring. One of the two biggest challenges of using satellite images is the sheer volume of data which makes downloading, transferring, and processing data all difficult. The other one is the lack of annotated data. For many use cases, the lack of publicly open high-resolution imagery is also a bottleneck.

Post-disaster damage assessment

The resolution of publicly available datasets is insufficient for accurate damage assessments. To improve this, some commercial high-resolution images should be made accessible for research purposes.

Terrestrial wildlife detection and species classification

Some commercial high-resolution satellite images can also be used to identify large animals such as whales, but those images are not open to the public.

Satellite remote sensing data

Remote sensing data with a focus on visible and infrared channels may be utilized to approximate solar irradiance based on cloud cover over a particular region of interest for solar site planning and simulation case studies.

Use CaseData Gap Summary
Solar power forecasting: Medium-term (6-24 hours)

Depending on the region of interest, data can be retrieved from different open data satellites, both geostationary and swath-based, which may differ in spatial resolution, temporal resolution, and coverage area. Additionally, multispectral data may be challenging to preprocess and prepare for analysis. Specifically for medium-term solar forecasting, actual ground irradiance may differ from approximations made by models that use satellite-derived cloud cover products, because different cloud types can have different impacts on irradiance. Suggested solutions are supplementing with ground-based measurements for verification and improving granularity.
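
Verification against ground measurements usually comes down to a couple of error statistics. The sketch below computes mean bias error and root mean square error between satellite-derived and ground-measured irradiance; the function names and the sample values (in W/m²) are hypothetical and only illustrate the calculation.

```python
import math


def mean_bias_error(estimates, observations):
    """Average signed error; negative means the estimates run low."""
    errors = [e - o for e, o in zip(estimates, observations)]
    return sum(errors) / len(errors)


def root_mean_square_error(estimates, observations):
    """Square root of the mean squared error between paired values."""
    squared = [(e - o) ** 2 for e, o in zip(estimates, observations)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical paired irradiance values (W/m^2) for one site.
satellite_ghi = [100.0, 200.0, 300.0]
ground_ghi = [110.0, 190.0, 330.0]

mbe = mean_bias_error(satellite_ghi, ground_ghi)
rmse = root_mean_square_error(satellite_ghi, ground_ghi)
```

Splitting these statistics by cloud type or sky condition, where such labels exist, would show whether the mismatch is concentrated in particular conditions.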

Simulated variables from process-based models

Soil data generated by a physics-based (also called process-based) soil model.

Use CaseData Gap Summary
Modeling effects of soil processes on soil organic carbon

Data collection is extremely expensive for some variables, leading to the use of simulated variables. Unfortunately, simulated values have large uncertainties due to the assumptions and simplifications made within simulation models.

Simulation tools for active power control by wind

Simulation can assist in understanding the effects of wind power on system frequency of an interconnection. NREL has been developing and conducting dynamic simulations using traditional commercial software tools and developing their own to perform a variety of wind generation studies on active power control of the grid.

These simulation tools include:

NREL Flexible Energy Scheduling Tool for Integrating Variable Generation (FESTIV)

NREL Multi-Area Frequency Response Integration Tool (MAFRIT)

Use CaseData Gap Summary
Variability analysis of wind power generation

To gain access, particularly to NREL’s FESTIV model, permission must be requested. Since FESTIV is a simulation model, it may not account for all real-time system dynamics and complexities, so its results require validation and verification against real-world data. Furthermore, since the granularity of the model is hourly, it may not be able to account for very short-term impacts, frequencies, and reactive power flows that can affect power system stability.

Simulation tools for distribution connected inverter systems

There is a need to enhance existing simulation tools to study inverter-based power systems rather than traditional machine-based ones. Simulations should be able to represent a large number of distribution-connected inverters that incorporate DER models with the larger power system. Uncertainty surrounding model outputs will require verification and testing.

NREL’s PREconfiguring and Controlling Inverter SEt-points (PRECISE) tool can identify the interconnection location on the network from a PV customer’s address, model the distribution feeder, and preconfigure advanced inverter modes to provide grid support and minimize energy curtailment. The tool allows utilities to perform power flow analysis and analyze inverter modes.

Furthermore, NREL’s Energy Systems Integration Facility (ESIF) has real-time simulation connected with power hardware that allows for smart inverter manufacturers to test operational control with simulated dynamics and scenarios.

Use CaseData Gap Summary
Smart inverter management for distributed energy resources

There is a need to enhance existing simulation tools to study inverter-based power systems rather than traditional machine-based ones. Simulations should be able to represent a large number of distribution-connected inverters that incorporate DER models with the larger power system. Uncertainty surrounding model outputs will require verification and testing. Furthermore, access to simulations and hardware-in-the-loop facilities is limited: NREL’s Energy Systems Integration Facility requires submitting a user access proposal, and similar testing laboratories may require access requests and funding.

Smart inverter (UL1741-SB compliant) devices database

The California Energy Commission maintains a database of UL 1741-SB compliant smart inverters designed to meet the interconnection requirements set by IEEE 1547-2018, which adds testing requirements for the communications protocols used to store or send information and to control adjustable inverter functions. These functions include watt/VAR mode, voltage magnitude and time trip (a trip disconnects an electrical device or system from the power source when certain conditions are met), frequency magnitude and time trip, electromagnetic interference (EMI), surge, rate of change of frequency, dynamic voltage support, enter service, synchronization, open phase, harmonics, DC injection, ground fault overvoltage, load rejection overvoltage, prioritization of DER responses, fault current, and persistence of DER parameter settings.

CEC Grid Support Solar Inverters

CEC Grid Support Battery Inverters

CEC Grid Support Solar/Battery Inverters

CEC Inverters with Power Control Systems functionality

Additional vendors can also be contacted for smart inverter information:

SMA-America Solar Inverters
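
As an illustration of the voltage magnitude and time trip function described above, the sketch below checks a sampled voltage trace and reports when a trip would occur. It is a simplified model under stated assumptions: the function name is hypothetical, and the voltage limits and clearing time are placeholder values chosen for the example, not settings taken from IEEE 1547-2018 or UL 1741-SB.

```python
def first_trip_time(voltages_pu, dt_s, v_low=0.88, v_high=1.10, clearing_time_s=2.0):
    """Return the time (s) at which the inverter trips, or None if it never does.

    The inverter trips once the voltage has stayed outside [v_low, v_high]
    continuously for at least clearing_time_s. All limits are illustrative.
    """
    out_of_range_s = 0.0
    for index, voltage in enumerate(voltages_pu):
        if voltage < v_low or voltage > v_high:
            out_of_range_s += dt_s
            if out_of_range_s >= clearing_time_s:
                return (index + 1) * dt_s
        else:
            # Voltage recovered, so the excursion timer resets.
            out_of_range_s = 0.0
    return None


# Hypothetical per-unit voltage samples taken every 0.5 s.
brief_sag = [1.0, 0.85, 1.0]
sustained_sag = [1.0, 0.85, 0.85, 1.0, 0.85, 0.85, 0.85, 0.85, 1.0]

no_trip = first_trip_time(brief_sag, dt_s=0.5)
trip_at = first_trip_time(sustained_sag, dt_s=0.5)
```

A frequency magnitude and time trip could be sketched the same way by swapping the voltage trace and limits for frequency values.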

Use CaseData Gap Summary
Smart inverter management for distributed energy resources

Smart inverter operational data is not publicly available and requires partnerships with research labs, utilities, and smart inverter manufacturers. However, the California Energy Commission maintains a database of UL 1741-SB compliant smart inverter models and their manufacturers, who can be contacted about research partnerships. In terms of coverage area, California and Hawaii are now moving towards standardizing smart inverter technology in their power systems; researchers in regions outside of the United States may be able to locate similar manufacturers through partnerships and collaborations.

Socioeconomic data

Socioeconomic data refers to information related to the social and economic conditions of individuals or communities, such as population, GDP, housing, and health facilities.

Use CaseData Gap Summary
Disaster risk assessment

The availability, usability, and reliability of socioeconomic data are all challenging. In general, there is a notable scarcity of data from the Global South, while data at a more granular scale is usually missing for the Global North. When data does exist, it lacks consistency across multiple sources.

Wildfire prediction: Short-term (3-7 days)

Socioeconomic data, e.g., on human behavior, are significant predictors of fire. Beyond the inherent challenges and gaps of socioeconomic data, aggregating those datasets and harmonizing them with other fire predictors in the spatial domain is especially tricky.

Sofar spotter archive

Sofar spotter archive is a publicly accessible repository of historical data collected by Sofar’s global network of Spotter buoys.

Use CaseData Gap Summary
Marine wildlife detection and species classification

Data access is restricted.

Soil Survey Geographic Database (SSURGO)

Soil organic carbon data based on the SSURGO and STATSGO2 databases, where data was gathered by walking over the land and observing the soil. Many soil samples were analyzed in laboratories.

Use CaseData Gap Summary
Modeling effects of soil processes on soil organic carbon

In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity).

Solcast

Solcast is a global solar forecasting and historical solar irradiance data company that compiles and extracts data from Himawari 8, GOES-16, GOES-17, and inputs from Numerical Weather Prediction (NWP) models to update tabular solar data products at a 10-15 minute scale. It provides forecasts and sub-hourly estimates that can be used for site-specific direct normal irradiance prediction and solar energy production potential.

Use CaseData Gap Summary
Solar power forecasting: Very-short-term (0-30min)

Data from Solcast is accessible through an academic or research institution. Solcast uses coarse surface elevation models aligned with reanalysis data, leading to significant elevation differences between ground data sites and cell height. While the dataset is global, coverage is limited to 33 sites, with 18 in tropical/subtropical locations and 15 in temperate locations. Time granularity ranges from 5 to 60 min.

Species interaction data

Species interaction data is important for understanding how and why climate change impacts species. GloBI (https://www.globalbioticinteractions.org/) provides open access to species interaction data, yet there remains an urgent need for further collection and sharing of this valuable information.

Use CaseData Gap Summary
Street-level images

Street-level images taken by GoPros installed in cars. As the vehicle moves, the images capture rooftops from perspectives and angles that satellite images do not, which gives a more comprehensive view of roof structures and features.

Use CaseData Gap Summary
Sub-metered appliance-level data

- Almanac of Minutely Power dataset (AMPds2): A single building electricity, water, and natural gas consumption dataset from a home in Burnaby, British Columbia, Canada from 2012-2014 which includes environment and utility billing data as well. 

- Commercial building energy dataset (COMBED): A dataset of 6 commercial buildings on the Indraprastha Institute of Information Technology (IIIT-Delhi) campus, from August 2013 to the present, containing total power consumption and sub-metered data for elevators, air handling units (AHUs), uninterruptible power supplies (UPS), and central campus heating, ventilation, and air conditioning (HVAC) pumps and chillers at a 30-second cadence.

- DEDDIAG: A dataset comprised of aggregate and disaggregated power consumption from 15 southern German homes monitored at 1Hz containing 50 appliances including dishwashers, washing machines, refrigerators and dryers over a span of 3.5 years (2016-2020). Aggregated data includes three-phase measurements. This dataset also contains event start and stop timestamps for 14 appliances.

- Dutch Residential Energy Dataset (DRED): Data collected from a single household in the Netherlands, containing appliance-level and total energy consumption over two months. The appliances measured were a refrigerator, washing machine, central heating, microwave, oven, cooker, blender, toaster, television, fan, living room outlets, and a laptop, recorded at a sampling frequency of 1 Hz. DRED additionally has data on human occupancy based on WiFi and Bluetooth signals received from occupant smartphones and wearable devices, which allows locating the consumer without setting up the home with more intrusive monitoring devices. DRED can be accessed by request.

- Electricity Consumption and Occupation (ECO): A dataset collected from June 2012-January 2013 covering 6 homes in Switzerland, where 6-10 smart plugs were deployed in each household. Aggregate consumption at the building level was measured in three phases to capture voltage, current, and phase shifts. Occupancy data was tracked by residents manually and via a passive infrared entry door sensor.

- GREEND: A dataset of 9 households in Austria and Italy for one year covering December 2013-April 2014. The data includes aggregated and sub-metered appliance-level active power measurements taken at a frequency of 1 Hz, with the appliances covered varying by household inventory. GREEND can be requested by form.

- HIPE: A dataset from October 2017-December 2017 recording smart meter measurements from 10 machines and the main terminal of an electronics production site operated by the Institute of Data Processing and Electronics (IPE) at Karlsruhe Institute of Technology (KIT) in Germany at a cadence of 5 seconds with measurements with respect to active power, reactive power, voltage, frequency, and distortion.

- Indian data for Ambient Water and Electricity Sensing (iAWE): Total, appliance-level, and circuit-panel-level consumption in a single-family home in New Delhi, India, collected over 73 days in the summer of 2013. Additional quantities such as water usage from an overhead tank and network strength based on packet loss were also jointly measured.

- IDEAL: A joint electricity, gas, temperature, humidity, and light dataset for 255 homes in the UK from August 2016 to June 2018. Aggregate and sub-metered consumption was measured at 1 second intervals, while temperature, humidity and light were measured at 12 second intervals. Household occupancy was measured through initial surveys with respect to socio-demographic data and self-reported updates to the data in the event that there was a change in occupancy.

- Reference Energy Disaggregation Dataset (REDD): Contains 119 days worth of aggregate consumption taken in 2011 from 10 residential buildings located in the greater Boston area. The data includes meter level phases of power, and voltage recorded at 15kHz as well as sub-meter level 24 circuits labeled by appliance category and measured at a cadence of 0.5Hz and 1Hz for large and small plug level appliances respectively.

- REFIT: A dataset containing aggregate and individual appliance monitor sub-meter data taken every 8 seconds from 20 UK households from September 2013 to September 2015. Of the 8 households, 6 households had rooftop solar panels; however, 3 were rewired to remove the effect of generation.

- UMass Smart Home data set: A dataset of metered and sub-metered data from three homes in western Massachusetts taken over a period of three years. Measurements included average household load, circuit-level load, and plug load per second. Accompanying generation data from solar panels and wind turbines is available for one of the three homes. Environmental data on outdoor weather and indoor temperature and humidity are provided, as well as occupancy information from wall switches, doors, and motion sensors. HVAC trigger events and corresponding temperature settings and operational status are also provided.

- UK Domestic Appliance-Level Electricity data set (UK-DALE): A dataset of aggregated and individual appliance-level consumption measurements recorded every 6 seconds from 5 UK homes, collected by researchers at Imperial College. Continuous coverage varied per house, ranging from 39 to 786 days between 2012 and 2015. Data included whole-house active power, apparent power, and RMS voltage. Appliance-level measurements were taken using individual appliance monitors for up to 54 appliances per residence.

Use CaseData Gap Summary
Non-intrusive electricity load monitoring

For accurate NILM studies, benchmark datasets need to include not only consumption but also local power generation (e.g., from rooftop solar), as it can affect the overall aggregate load observed at the building level. While some datasets include generation information, most studies do not take rooftop solar generation into account. Additionally, devices that can behave as both a load and a generator, such as electric vehicles or stationary batteries, are not included. The majority of buildings are single-family housing units, which limits the diversity of representation. Furthermore, most datasets are no longer maintained after the study closes.
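
A short numerical sketch shows why unmetered generation matters for disaggregation: the building-level meter sees consumption minus local generation, so appliance loads no longer sum to the aggregate. The function name and all wattages below are hypothetical.

```python
def net_meter_reading(appliance_loads_w, solar_generation_w=0.0):
    """Aggregate power seen at the meter: total load minus local generation."""
    return sum(appliance_loads_w.values()) - solar_generation_w


# Hypothetical instantaneous appliance loads (W).
loads = {"refrigerator": 150.0, "oven": 2000.0}

gross = net_meter_reading(loads)
net = net_meter_reading(loads, solar_generation_w=1500.0)

# Load that a NILM model cannot explain if it only sees the net reading.
masked_load_w = sum(loads.values()) - net
```

In this example the oven is running, yet the net reading is well below the oven's own draw, which is the kind of case a model trained without generation data would misread.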

Surface elevation data

Surface elevation data, such as the Shuttle Radar Topography Mission and srtm-90m-digital-elevation-database-v4-1, provides the background or reference information needed to create a model or run simulations. It is one of the most essential types of reference data. Other reference data includes geographic feature data (OpenStreetMap), soil velocities (USGS data), and administrative boundaries (Global Administrative Areas).

Use CaseData Gap Summary
Disaster risk assessment

Very high-resolution reference data, for example DEMs, is currently not freely open to the public.

TIGGE

Global ensemble weather forecast data from 13 numerical weather prediction centers, starting from October 2006 (hindcast instead of real-time forecast). The GEFS forecast by NOAA is also part of this dataset. It is usually used to train an ML model for postprocessing weather forecasts.

Use CaseData Gap Summary
The Public Utility Data Liberation (PUDL)

Collected, collated, and cleaned by Catalyst Cooperative, the Public Utility Data Liberation (PUDL) project compiles and organizes data from the power and energy sector from a variety of sources such as the US Energy Information Administration (EIA), the Federal Energy Regulatory Commission (FERC), the Environmental Protection Agency (EPA), Regional Transmission Organizations (RTOs), Independent System Operators (ISOs), the Pipeline and Hazardous Materials Safety Administration (PHMSA), and the US Mine Safety and Health Administration (MSHA).

Specific datasets include:

- ![EIA Form 860](https://www.eia.gov/electricity/data/eia860/): Data on the generation of electricity by fuel type on the quantity generated, the amount of fuel consumed, and the emissions produced.

- ![EIA Form 861](https://www.eia.gov/electricity/data/eia861/): Utility level data on electricity sales in megawatt-hours, customer count of electricy supplied to end-users by state. It also inclused information on the number of automated meter readings (AMR) and advanced metering infrastructure (AMI) by state.

- ![EIA Form 923](https://www.eia.gov/electricity/data/eia923/): Monthly and annual fuel-based thermal plant generation, fuel consumption, fossil fuel stocks, non-utility source and disposition of electricity in addition to environmental data.

- ![EIA Form 176](https://www.eia.gov/naturalgas/ngqs/#?year1=1997&year2=2018&company=Name): State specific natural gas deliveries and sectors.

- ![EIA Cooling Water](https://www.eia.gov/electricity/data/water/): Per plant statewise water cooling by generator and boiler data.

- ![FERC Form1](https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual): Utility specific income statements, balance sheets, expenditures, depreciation, and financial metrics to assess utility operation and compliance to regulation.

- ![FERC Form2](https://www.ferc.gov/industries-data/natural-gas/overview/general-information/natural-gas-industry-forms/form-22a-data): Financial and operational data from natural gas interstate transmission pipeline companies.

- ![FERC Form714](https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data): Interconnected balancing authority area operations data with respect to generation, actual and scheduled power transfers, load, and status reports on bulk power trade filed by each electric utility.

- ![FERC Form920](https://www.ferc.gov/industries-data/electric/power-sales-and-markets/electric-quarterly-reports-eqr): Electric quarterly reports provided by public utilities which contain data on contractual terms and conditions, cost-based sales, market-based rate sales, transmission service, and transaction information for short-term and long-term market-based power sales.

- ![EPA Clean Air Markets Program Data (CAMPD)](https://campd.epa.gov/): Hourly emissions of carbon dioxide, nitrous oxide, sulfate oxide, and mercury emissions data per facility based on type and source, heat input, gross generated electricity, and monitoring method.

- [PHMSA Pipelines](https://www.phmsa.dot.gov/data-and-statistics/pipeline/gas-distribution-gas-gathering-gas-transmission-hazardous-liquids): Annual reports on total pipeline mileage, facilities, storage, commodities transported, miles by material, and installation dates, used to monitor natural gas industry operators.

- ISO/RTO: Locational marginal pricing information from system operators.

- [MSHA Mines](https://www.msha.gov/mine-data-retrieval-system): US mine production, employment, health, safety, and operational data from 2000 to the present.


Use Case | Data Gap Summary
Energy data fusion for policy and market analysis in energy systems

Public datasets from government agencies such as the EIA, EPA, FERC, and PHMSA are not analysis-ready. Data is often distributed as zipped tabular files in differing formats that may lack the common identifiers or schemas needed to join them readily. Collating and merging these datasets can provide greater context on the state of the energy system and the effectiveness of policy measures. Data can also be missing because of reporting gaps and redacted per-plant pricing information. While PUDL seeks to overcome these gaps by merging datasets through entity matching and interpolation, maintenance challenges remain, because usability is sensitive to format changes in the original source data, updates, and new initiatives. The data gaps encountered in maintaining this dataset are highlighted with respect to the source data that PUDL mines.
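To illustrate why missing common identifiers make joins difficult, here is a minimal sketch of name-based entity matching between two tables. It is not PUDL's actual method; the record fields (`plant_name`, `name`, `state`, `utility_id`, `net_mwh`), the sample rows, and the list of suffixes to drop are all hypothetical.

```python
import re

# Hypothetical filler words to ignore when comparing plant names.
STOPWORDS = {"llc", "inc", "co", "corp", "station", "plant"}


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop filler words."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(t for t in cleaned.split() if t not in STOPWORDS)


def match_records(left, right):
    """Join two lists of dicts on (normalized name, state).

    Rows with no candidate, or more than one candidate, are returned
    as unmatched so they can be reviewed by hand.
    """
    index = {}
    for row in right:
        key = (normalize_name(row["name"]), row["state"])
        index.setdefault(key, []).append(row)

    matched, unmatched = [], []
    for row in left:
        key = (normalize_name(row["plant_name"]), row["state"])
        candidates = index.get(key, [])
        if len(candidates) == 1:
            matched.append({**row, "utility_id": candidates[0]["utility_id"]})
        else:
            unmatched.append(row)
    return matched, unmatched


# Hypothetical rows standing in for two agency tables without a shared ID.
generation = [
    {"plant_name": "Big Creek Plant, LLC", "state": "CA", "net_mwh": 1200.0},
    {"plant_name": "Ridge Wind", "state": "TX", "net_mwh": 800.0},
]
financial = [
    {"name": "BIG CREEK", "state": "CA", "utility_id": 17},
    {"name": "Ridgeline Wind", "state": "TX", "utility_id": 42},
]

matched, unmatched = match_records(generation, financial)
```

In this sketch the second plant stays unmatched because its names differ beyond simple normalization, which is the kind of case that needs fuzzier matching or manual curation.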

UAV LiDAR image data over power lines

LiDAR remote sensing data from unmanned aerial vehicles (UAVs) for inspecting the rights-of-way of transmission and distribution lines can be accessed from private providers such as LUMA Energy and COR3. China Southern Power Grid has UAV transmission line data from the Yunnan RoW-1, Yunnan RoW-2, and Hubei RoW 4. Open-source EPRI distribution inspection imagery is available and labeled with polyline classes (conductor, other wire) and polygon classes (pole, crossarm, insulator, cutouts, transformer, and background structure). These datasets make it possible to analyze images paired with geolocated GIS data to identify vegetation management areas near lines.
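As a rough illustration of identifying vegetation near lines, the sketch below flags LiDAR points that fall within a clearance distance of a conductor span. It assumes the span can be approximated by a straight 3D segment (real conductors sag), that vegetation points have already been classified, and that coordinates are in meters; the function names and the 3 m clearance are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest 3D distance from point p to the segment from a to b."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    ab_len2 = sum(c * c for c in ab)
    if ab_len2 == 0.0:
        # Degenerate span: both endpoints coincide.
        return math.dist(p, a)
    # Project p onto the line, then clamp the projection to the segment.
    t = sum(ap[i] * ab[i] for i in range(3)) / ab_len2
    t = max(0.0, min(1.0, t))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)


def flag_encroaching_points(points, span_start, span_end, clearance_m):
    """Return the points closer than clearance_m to the conductor span."""
    return [
        p for p in points
        if point_segment_distance(p, span_start, span_end) < clearance_m
    ]


# Hypothetical span 100 m long at 10 m height, and three vegetation points.
span_start, span_end = (0.0, 0.0, 10.0), (100.0, 0.0, 10.0)
vegetation = [(50.0, 1.0, 8.0), (50.0, 10.0, 2.0), (120.0, 0.0, 10.0)]

flagged = flag_encroaching_points(vegetation, span_start, span_end, clearance_m=3.0)
```

Only the first point is flagged here; the other two lie well outside the assumed clearance.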

Use Case | Data Gap Summary
Grid asset management: Assessing vegetation-related wildfire risk

Unmanned aerial vehicle (UAV) or drone imagery for vegetation management near transmission and distribution lines may require partnerships with private companies and utilities for access and usage. LiDAR data is sparse and may only partially scan power transmission lines, resulting in poor data quality. Coverage is often limited to rights-of-way (RoWs) of interest, which may require continuous monitoring and inspection for future vegetation growth.

US large-scale Solar Photovoltaic Database (USPVDB)

The US Large-scale Solar Photovoltaic Database (USPVDB) contains georectified, digitized, and verified polygons of large-scale photovoltaic facilities, associated with facility-specific attributes drawn from US Energy Information Administration (EIA) Form 860 and facility type designations from the US Environmental Protection Agency (EPA). The dataset also indicates whether each large-scale PV installation is agrivoltaic. Overall, 3,699 US ground-mounted facilities with capacity greater than or equal to 1 MWdc are represented.

Use Case | Data Gap Summary
Solar installation site assessment

The USPVDB data must be accessed through the United States Geological Survey (USGS) mapper browser application or downloaded as GIS data in shapefile or GeoJSON format. Tabular data and metadata are provided in CSV and XML format. Coverage is limited to the US, specifically densely populated regions. Supplementing the data with international large-scale photovoltaic satellite imagery could expand the dataset's coverage.
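For readers working from a GeoJSON download, the following sketch shows one way to filter facility polygons by capacity and compute a rough location for each. The sample features and the property names `name` and `capacity_mwdc` are hypothetical and do not reflect the actual USPVDB schema; the location is a simple vertex average, not an area-weighted centroid.

```python
import json

# Hypothetical two-facility sample; property names are assumptions,
# not the real USPVDB attribute names.
SAMPLE_GEOJSON = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"name": "Site A", "capacity_mwdc": 5.0},
     "geometry": {"type": "Polygon", "coordinates": [[
       [-100.0, 35.0], [-99.99, 35.0], [-99.99, 35.01],
       [-100.0, 35.01], [-100.0, 35.0]]]}},
    {"type": "Feature",
     "properties": {"name": "Site B", "capacity_mwdc": 0.5},
     "geometry": {"type": "Polygon", "coordinates": [[
       [-90.0, 30.0], [-89.99, 30.0], [-89.99, 30.01],
       [-90.0, 30.01], [-90.0, 30.0]]]}}
  ]
}
"""


def summarize_facilities(geojson_text, min_capacity_mwdc=1.0):
    """Keep facilities at or above the capacity threshold.

    Returns one dict per facility with its name, capacity, and the mean
    longitude/latitude of the outer ring's vertices.
    """
    collection = json.loads(geojson_text)
    summaries = []
    for feature in collection["features"]:
        props = feature["properties"]
        if props["capacity_mwdc"] < min_capacity_mwdc:
            continue
        # Outer ring, minus the closing vertex that repeats the first one.
        ring = feature["geometry"]["coordinates"][0][:-1]
        summaries.append({
            "name": props["name"],
            "capacity_mwdc": props["capacity_mwdc"],
            "lon": sum(v[0] for v in ring) / len(ring),
            "lat": sum(v[1] for v in ring) / len(ring),
        })
    return summaries


facilities = summarize_facilities(SAMPLE_GEOJSON)
```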

Weather station data in general

Usually used as the ground truth for calibrating weather forecast data.

Use Case | Data Gap Summary
Bias-correction of climate projections

Data is not regularly gridded and needs to be preprocessed before being used in an ML model.

Bias-correction of weather forecasts

Data is not regularly gridded and needs to be preprocessed before being used in an ML model.
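One common preprocessing step for irregularly located station data is interpolation onto a regular grid. The sketch below uses inverse distance weighting (IDW) as an example; it is only one of several possible methods, and it makes simplifying assumptions: distances are measured directly in degrees rather than along the Earth's surface, and the station values and grid are hypothetical.

```python
import math


def idw_to_grid(stations, lats, lons, power=2.0):
    """Interpolate station values onto a regular lat/lon grid with IDW.

    stations: list of (lat, lon, value) tuples.
    Returns grid[i][j], the estimate at (lats[i], lons[j]).
    """
    grid = []
    for lat in lats:
        row = []
        for lon in lons:
            weighted_sum = 0.0
            weight_total = 0.0
            exact = None
            for s_lat, s_lon, value in stations:
                d = math.hypot(lat - s_lat, lon - s_lon)
                if d == 0.0:
                    # Grid point coincides with a station: use its value.
                    exact = value
                    break
                w = 1.0 / d ** power
                weighted_sum += w * value
                weight_total += w
            row.append(exact if exact is not None else weighted_sum / weight_total)
        grid.append(row)
    return grid


# Hypothetical stations: (lat, lon, temperature in deg C).
stations = [(0.0, 0.0, 10.0), (0.0, 2.0, 20.0)]
grid = idw_to_grid(stations, lats=[0.0], lons=[0.0, 1.0, 2.0])
```

The midpoint between the two stations receives the average of their values, while grid points that coincide with a station take that station's value directly.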

WeatherBench 2

Benchmark for global, medium-range (1-14 day), data-driven weather forecasting: https://weatherbench2.readthedocs.io/en/latest/data-guide.html

Use Case | Data Gap Summary
Weather forecasting: Short-to-medium term (1-14 days)

WeatherBench 2 is based on ERA5, so it inherits ERA5's issues: the data is biased over regions where there are no observations.

eDNA

Environmental DNA (eDNA) refers to genetic material shed by living or dead organisms and obtained from environmental samples such as soil and water. By analyzing this genetic material, researchers can detect and monitor the species present in a non-invasive and efficient manner, aiding biodiversity studies, conservation efforts, and environmental monitoring. Some eDNA data can be found in GBIF. BIOSCAN-5M is a comprehensive multi-modal dataset that includes DNA barcode sequences and taxonomic labels for over 5 million insect specimens, serving as a large reference library for species- and genus-level classification tasks.

Use Case | Data Gap Summary
Automatic individual re-identification for wildlife

One data gap is that barcoding reference databases are incomplete.

Digital reconstruction of the environment

One data gap is that barcoding reference databases are incomplete.

Terrestrial wildlife detection and species classification

One data gap is that barcoding reference databases are incomplete.

subX

NWP model output from the SubX subseasonal forecast experiment: https://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/.

Use Case | Data Gap Summary
Weather forecasting: Subseasonal horizon

More data is needed to develop more accurate and robust ML models. Note also that SubX data contains biases and uncertainties, which ML models trained on it can inherit.

xBD

Annotated pre- and post-disaster imagery; a benchmark dataset: https://paperswithcode.com/dataset/xbd.

Use Case | Data Gap Summary
Post-disaster damage assessment

Data is highly biased toward North America. Similar datasets covering other parts of the world are needed. Additionally, the dataset should include more detailed information on damage severity.
