Towards the Curation of Environment-related Knowledge Graphs: Fine-tuning General-domain Language Models for Biodiversity Named Entity Recognition (Papers Track)

Geilah Tabanao (University of the Philippines Diliman); Andrew Miguel Pagdanganan (University of the Philippines Diliman); Riza Batista-Navarro (University of Manchester); Roselyn Gabud (University of the Philippines Diliman)

Natural Language Processing; Ecosystems & Biodiversity; Forests

Abstract

The availability of climate data fuels timely, science-based climate action. Providing policymakers and regulators with easy-to-digest, structured climate data, e.g., in the form of a knowledge graph, is critical to mitigating the adverse effects of climate change on the natural environment. Natural language processing (NLP) applications that employ Named Entity Recognition (NER) systems can help uncover information hidden in millions of textual documents. In this paper, we evaluated the NER performance of models based on Bidirectional Encoder Representations from Transformers (BERT) that were pre-trained on general-domain data. We fine-tuned these BERT-based models on the COPIOUS dataset for the specialist task of biodiversity NER. In our experiments, the DeBERTa NER model performed best, obtaining a micro-averaged F1-score of 84.18% under entity-level evaluation. We then employed this model in a biodiversity Information Extraction (IE) pipeline and applied it to the forestry compendium of the Centre for Agricultural and Biosciences International (CABI) Digital Library. We demonstrate that the pipeline enables the extraction of structured information on the reproductive conditions and habitats of tree species.
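To make the reported metric concrete, the following is a minimal sketch of how an entity-level, micro-averaged F1-score can be computed. It is not the authors' evaluation code. It assumes BIO-tagged sequences and exact span-and-type matching, which is one common convention; the function names `extract_entities` and `micro_f1` and the example labels are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple

# An entity is identified by its type and token span: (type, start, end_exclusive).
Entity = Tuple[str, int, int]


def extract_entities(tags: List[str]) -> Set[Entity]:
    """Collect entity spans from one BIO-tagged sequence (assumed tagging scheme)."""
    entities: Set[Entity] = set()
    start, etype = None, None
    # A trailing "O" sentinel flushes an entity that ends at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # An "I-" tag with no open entity of the same type is ignored here.
    return entities


def micro_f1(gold_seqs: List[List[str]], pred_seqs: List[List[str]]) -> float:
    """Micro-averaged F1: counts are pooled over all sequences and entity types."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)  # a prediction counts only if span and type both match
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only; one gold entity is missed by the prediction.
gold = [["B-Taxon", "I-Taxon", "O", "B-Habitat"]]
pred = [["B-Taxon", "I-Taxon", "O", "O"]]
score = micro_f1(gold, pred)
```

In this toy case precision is 1.0 and recall is 0.5, so `score` is 2/3. Because the counts are pooled before the ratio is taken, frequent entity types weigh more heavily in the result than rare ones, which is what distinguishes micro-averaging from macro-averaging.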