KGEval: Evaluating Scientific Knowledge Graphs with Large Language Models (Academic Article)

abstract

  • This paper explores the novel application of large language models (LLMs) as evaluators for structured scientific summaries, a task where traditional natural language evaluation metrics may not readily apply. Leveraging the Open Research Knowledge Graph (ORKG) as a repository of human-curated properties, we augment a gold-standard dataset by generating corresponding properties using three distinct LLMs (Llama, Mistral, and Qwen) under three contextual settings: context-lean (research problem only), context-rich (research problem with title and abstract), and context-dense (research problem with multiple similar papers). To assess the quality of these properties, we employ LLM evaluators (Deepseek, Mistral, and Qwen) to rate them on criteria including similarity, relevance, factuality, informativeness, coherence, and specificity. This study addresses key research questions: How do LLM-as-a-judge rubrics transfer to the evaluation of structured summaries? How do LLM-generated properties compare to human-annotated ones? What are the performance differences among various LLMs? How does the amount of contextual input affect generation quality? The resulting evaluation framework, KGEval, offers a customizable approach that can be extended to diverse knowledge graphs and application domains. Our experimental findings reveal distinct patterns in evaluator biases, contextual sensitivity, and inter-model performance, thereby highlighting both the promise and the challenges of integrating LLMs into structured science evaluation.
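
For readers who want a concrete picture of the LLM-as-a-judge setup the abstract describes, the sketch below scores a set of generated ORKG-style properties against human-curated ones on the six named criteria. The prompt wording, the 1 to 5 scale, and the `call_llm` placeholder are assumptions made for illustration, not the paper's implementation.

```python
"""Minimal sketch of a KGEval-style LLM-as-a-judge rubric (illustrative only).

Assumptions not taken from the paper: the prompt template, the 1-5 scale,
and the `call_llm` placeholder, which stands in for a real evaluator model
such as Deepseek, Mistral, or Qwen.
"""

from dataclasses import dataclass

# The six evaluation criteria named in the abstract.
CRITERIA = [
    "similarity",
    "relevance",
    "factuality",
    "informativeness",
    "coherence",
    "specificity",
]

# Hypothetical rubric prompt; the paper's actual wording is not given here.
RUBRIC_PROMPT = """You are evaluating a structured scientific summary.
Research problem: {problem}
Human-curated (gold) properties: {gold}
LLM-generated properties: {generated}

Rate the generated properties for {criterion} on a scale of 1 (poor) to 5 (excellent).
Answer with a single integer."""


@dataclass
class EvalRecord:
    """One evaluation item: a research problem with gold and generated properties."""
    problem: str
    gold: list[str]
    generated: list[str]


def call_llm(prompt: str) -> str:
    """Placeholder for an actual evaluator-model call; replace with a real client."""
    return "3"  # dummy score so the sketch runs end to end


def score_record(record: EvalRecord) -> dict[str, int]:
    """Ask the evaluator model for one rubric score per criterion."""
    scores = {}
    for criterion in CRITERIA:
        prompt = RUBRIC_PROMPT.format(
            problem=record.problem,
            gold="; ".join(record.gold),
            generated="; ".join(record.generated),
            criterion=criterion,
        )
        scores[criterion] = int(call_llm(prompt).strip())
    return scores


if __name__ == "__main__":
    record = EvalRecord(
        problem="knowledge graph construction from scholarly articles",
        gold=["method", "dataset", "evaluation metric"],
        generated=["approach", "corpus", "metric", "baseline"],
    )
    print(score_record(record))
```

In a setup like the one described, the same scoring loop would be run with each evaluator model and each generator/context combination, so that per-criterion scores can be compared across evaluators to surface the evaluator biases and contextual sensitivity the abstract mentions.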

publication date

  • 2026

start page

  • 35

volume

  • 17

issue

  • 1