KGEval: Evaluating Scientific Knowledge Graphs with Large Language Models (Academic Article)

abstract

  • This paper explores the novel application of large language models (LLMs) as evaluators for structured scientific summaries, a task where traditional natural language evaluation metrics may not readily apply. Leveraging the Open Research Knowledge Graph (ORKG) as a repository of human-curated properties, we augment a gold-standard dataset by generating corresponding properties using three distinct LLMs (Llama, Mistral, and Qwen) under three contextual settings: context-lean (research problem only), context-rich (research problem with title and abstract), and context-dense (research problem with multiple similar papers). To assess the quality of these properties, we employ LLM evaluators (Deepseek, Mistral, and Qwen) to rate them on criteria including similarity, relevance, factuality, informativeness, coherence, and specificity. This study addresses key research questions: How do LLM-as-a-judge rubrics transfer to the evaluation of structured summaries? How do LLM-generated properties compare to human-annotated ones? What are the performance differences among various LLMs? How does the amount of contextual input affect generation quality? The resulting evaluation framework, KGEval, offers a customizable approach that can be extended to diverse knowledge graphs and application domains. Our experimental findings reveal distinct patterns in evaluator biases, contextual sensitivity, and inter-model performance, thereby highlighting both the promise and the challenges of integrating LLMs into structured science evaluation.
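
For readers who want a concrete picture of the LLM-as-a-judge setup the abstract describes, the sketch below scores a set of generated ORKG-style properties against human-curated ones on the six named criteria. The prompt wording, the 1 to 5 scale, and the `call_llm` placeholder are assumptions made for illustration, not the paper's implementation.

```python
"""Minimal sketch of a KGEval-style LLM-as-a-judge rubric (illustrative only).

Assumptions not taken from the paper: the prompt template, the 1-5 scale,
and the `call_llm` placeholder, which stands in for a real evaluator model
such as Deepseek, Mistral, or Qwen.
"""

from dataclasses import dataclass

# The six evaluation criteria named in the abstract.
CRITERIA = [
    "similarity",
    "relevance",
    "factuality",
    "informativeness",
    "coherence",
    "specificity",
]

# Hypothetical rubric prompt; the paper's actual wording is not given here.
RUBRIC_PROMPT = """You are evaluating a structured scientific summary.
Research problem: {problem}
Human-curated (gold) properties: {gold}
LLM-generated properties: {generated}

Rate the generated properties for {criterion} on a scale of 1 (poor) to 5 (excellent).
Answer with a single integer."""


@dataclass
class EvalRecord:
    """One evaluation item: a research problem with gold and generated properties."""
    problem: str
    gold: list[str]
    generated: list[str]


def call_llm(prompt: str) -> str:
    """Placeholder for an actual evaluator-model call; replace with a real client."""
    return "3"  # dummy score so the sketch runs end to end


def score_record(record: EvalRecord) -> dict[str, int]:
    """Ask the evaluator model for one rubric score per criterion."""
    scores = {}
    for criterion in CRITERIA:
        prompt = RUBRIC_PROMPT.format(
            problem=record.problem,
            gold="; ".join(record.gold),
            generated="; ".join(record.generated),
            criterion=criterion,
        )
        scores[criterion] = int(call_llm(prompt).strip())
    return scores


if __name__ == "__main__":
    record = EvalRecord(
        problem="knowledge graph construction from scholarly articles",
        gold=["method", "dataset", "evaluation metric"],
        generated=["approach", "corpus", "metric", "baseline"],
    )
    print(score_record(record))
```

In a setup like the one described, the same scoring loop would be run with each evaluator model and each generator/context combination, so that per-criterion scores can be compared across evaluators to surface the evaluator biases and contextual sensitivity the abstract mentions.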

publication date

  • 2026

start page

  • 35

volume

  • 17

issue

  • 1