ExioNAICS: Enterprises Level Emission Estimation Dataset with Large Language Models (Papers Track)

Yanming Guo (University of Sydney); Jin Ma (University of Sydney); Qiao Xiao (Maynooth University); Kevin Credit (Maynooth University)

Climate Finance & Economics; Natural Language Processing

Abstract

Accurate greenhouse gas (GHG) emission reporting is increasingly important for governments, businesses, and investors. However, mainstream adoption, particularly among small and medium enterprises, remains limited by high implementation costs, fragmented emission factor databases, and a lack of robust classification tools. To address these challenges, we introduce ExioNAICS, the first large-scale NLP benchmark dataset for enterprise-level GHG emission estimation. ExioNAICS integrates validated North American Industry Classification System (NAICS) labels for over 20,850 companies with a concordance to an economic model of carbon intensity factors. By framing the classification task as an Information Retrieval problem and fine-tuning Sentence-BERT with a contrastive learning approach, we achieve state-of-the-art performance on NAICS categories, notably 77.51% Top-1 accuracy and 91.33% Top-10 accuracy in our most challenging setting of 1,114 classes. We make ExioNAICS publicly available to lower the entry barrier for GHG reporting and to facilitate broader collaboration between machine learning researchers and climate experts. The dataset, code, and trained models are available at: https://huggingface.co/datasets/Yvnminc/ExioNAICS
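
As a rough illustration of the retrieval framing and the Top-k metric mentioned in the abstract, the sketch below ranks class embeddings by cosine similarity to each company embedding and counts a hit when the true class appears among the k best matches. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `top_k_accuracy` is hypothetical, and the two-dimensional vectors are toy stand-ins for the sentence embeddings a fine-tuned model would produce.

```python
import numpy as np

def top_k_accuracy(query_emb, label_emb, true_idx, k):
    """Fraction of queries whose true label is among the k most similar labels.

    Hypothetical helper for illustration; embeddings are assumed to be
    precomputed row vectors (one row per company description or class label).
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    sims = q @ c.T  # shape: (num_queries, num_labels)
    # Indices of the k highest-scoring labels for each query.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (top_k == np.asarray(true_idx)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy data (assumed values): three class embeddings and three company embeddings.
label_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_emb = np.array([[0.9, 0.1], [0.2, 0.8], [0.9, 0.2]])
true_idx = [0, 1, 2]

top1 = top_k_accuracy(query_emb, label_emb, true_idx, k=1)
top2 = top_k_accuracy(query_emb, label_emb, true_idx, k=2)
print(f"Top-1: {top1:.3f}, Top-2: {top2:.3f}")
```

In this toy case the third company is ranked closest to the wrong class, so it counts as a miss at k=1 but as a hit at k=2, which is the same mechanism that separates the reported Top-1 and Top-10 figures.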