
Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle?


Abstract

Large Language Models (LLMs) are advancing at a rapid pace, with significant improvements in natural language processing and coding tasks. Yet, their ability to work with formal languages representing data, specifically within the realm of knowledge graph engineering, remains under-investigated. To evaluate the proficiency of various LLMs, we created a set of five tasks that probe their ability to parse, understand, analyze, and create knowledge graphs serialized in Turtle syntax. These tasks, each embodying distinct degrees of complexity and being able to scale with the size of the problem, have been integrated into our automated evaluation system, the LLM-KG-Bench. The evaluation encompassed four commercially available LLMs - GPT-3.5, GPT-4, Claude 1.3, and Claude 2.0 - as well as two freely accessible offline models, GPT4All Vicuna and GPT4All Falcon 13B. This analysis offers an in-depth understanding of the strengths and shortcomings of LLMs in relation to their application within RDF knowledge graph engineering workflows utilizing Turtle representation. While our findings show that the latest commercial models outperform their forerunners in terms of proficiency with the Turtle language, they also reveal an apparent weakness: these models fall short when it comes to adhering strictly to output formatting constraints, a crucial requirement in this context.
1 Institute for Applied Informatics, Goerdelerring 9, 04109 Leipzig, Germany, https://infai.org
2 Agile Knowledge Engineering and Semantic Web (AKSW), https://aksw.org
3 Leipzig University, Institute for Informatics, Germany, https://www.uni-leipzig.de
4 eccenca GmbH, Leipzig, Germany, https://eccenca.com
1. Introduction
dl4kg2023 @ ISWC: Workshop Deep Learning for Knowledge Graphs, November 6th-7th, 2023, Athens, Greece
CEUR Workshop Proceedings (http://ceur-ws.org), ISSN 1613-0073
2. Related Work
3. Benchmark Tasks
The benchmark comprises five tasks of varying complexity: T1 TurtleConnectionExplainStatic, T2 TurtleErrorsStatic, T3 TurtleSampleGeneration, T4 TurtleFriendCount, and T5 FactExtractStatic. Static tasks use a fixed problem instance, while scalable tasks grow with a size parameter.
3.1. Task T1: Find Connection in Small Turtle File
The TurtleConnectionExplainStatic task presents a small Turtle file and asks the model (Prompt 1) to find and explain the connection between :anne and :bob:
    PREFIX : <https://abc.def/ghi/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
    PREFIX org: <http://www.w3.org/ns/org#>

    :anne a foaf:Person ; foaf:firstName "Anne" ; foaf:surname "Miller" ;
        vcard:hasAddress [ a vcard:Home ; vcard:country-name "UK" ] .

    :bob a foaf:Person ; foaf:firstName "Bob" ; foaf:surname "Tanner" ;
        vcard:hasAddress [ a vcard:Home ; vcard:country-name "US" ] .

    :wonderOrg a org:Organization .

    :researchDep a org:OrganizationalUnit ; org:unitOf :wonderOrg ;
        rdfs:label "Research Department" .

    :marketingDep a org:OrganizationalUnit ; org:unitOf :wonderOrg ;
        rdfs:label "Marketing Department" .

    :chiefResearchOfficer a org:Role . :marketingManager a org:Role .

    [ a org:Membership ; org:member :anne ; org:organization :researchDep ;
        org:role :chiefResearchOfficer ] .

    [ a org:Membership ; org:member :bob ; org:organization :marketingDep ;
        org:role :marketingManager ] .
The expected connection path through the graph is:

    anne <-org:member- _:bnode1 -org:organization-> researchDep -org:unitOf-> wonderOrg
         <-org:unitOf- marketingDep <-org:organization- _:bnode2 -org:member-> bob

yielding the connecting entities: anne, researchDep, wonderOrg, marketingDep, bob.
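The expected connection can be found mechanically with a breadth-first search that treats each triple as an undirected edge. A minimal stdlib-only sketch, using a hand-transcribed edge list from the example graph (the blank nodes are given the illustrative names bnode1 and bnode2):

```python
from collections import deque

# Simplified edge list of the example graph; blank nodes are named
# bnode1/bnode2 purely for illustration.
triples = [
    ("bnode1", "org:member", "anne"),
    ("bnode1", "org:organization", "researchDep"),
    ("researchDep", "org:unitOf", "wonderOrg"),
    ("marketingDep", "org:unitOf", "wonderOrg"),
    ("bnode2", "org:organization", "marketingDep"),
    ("bnode2", "org:member", "bob"),
]

def find_path(triples, start, goal):
    """BFS over the triples, treating every triple as an undirected edge."""
    neighbors = {}
    for s, _, o in triples:
        neighbors.setdefault(s, set()).add(o)
        neighbors.setdefault(o, set()).add(s)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection

print(find_path(triples, "anne", "bob"))
# ['anne', 'bnode1', 'researchDep', 'wonderOrg', 'marketingDep', 'bnode2', 'bob']
```

This reproduces the expected entity chain above; a real checker would of course operate on the parsed RDF graph rather than a hand-written tuple list.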
3.2. Task T2: Find Errors in Small Turtle File
The TurtleErrorsStatic task presents a small Turtle file containing syntax errors and asks the model (Prompt 2) to identify and correct them. The returned file is checked for syntactic validity by parsing it with rdflib.
3.3. Task T3: Create Sample Graphs
The TurtleSampleGeneration task asks the model (Prompt 3) to create a sample FOAF graph containing n instances of foaf:Person connected via foaf:knows relations. Answers are scored with a persons_relative_error measure on the number of entities typed rdf:type foaf:Person: it is = 0 for an exact match, > 0 when too many persons are generated, < 0 when too few are generated, and −1 when none are found.
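A plausible reading of the scoring fragments above is a signed relative error on the person count; a sketch under that assumption (the benchmark's exact formula may differ):

```python
def persons_relative_error(requested: int, generated: int) -> float:
    """Signed relative deviation of the generated foaf:Person count:
    0 for an exact match, positive when too many persons were generated,
    negative when too few, and -1 when no persons were found at all."""
    return (generated - requested) / requested

print(persons_relative_error(10, 10))  # 0.0
print(persons_relative_error(10, 12))  # 0.2
print(persons_relative_error(10, 0))   # -1.0
```

Note that the −1 case for "no persons found" falls out of the formula naturally, since generated = 0 gives −requested/requested.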
3.4. Task T4: Count Links in Person Graph
The TurtleFriendCount task presents a graph of foaf:Person instances connected by foaf:knows links and asks the model (Prompt 4) to count, for each foaf:Person, how many foaf:knows links that person has.
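The required counting can be sketched with a simple tally over parsed triples (stdlib-only; the hand-written triple list stands in for an rdflib-parsed graph):

```python
from collections import Counter

def knows_counts(triples):
    """Count outgoing foaf:knows links per subject."""
    return Counter(s for s, p, o in triples if p == "foaf:knows")

# Illustrative triples; in practice these would come from a Turtle parser.
triples = [
    ("alice", "foaf:knows", "bob"),
    ("alice", "foaf:knows", "carol"),
    ("bob",   "foaf:knows", "alice"),
    ("carol", "rdf:type",   "foaf:Person"),
]
print(knows_counts(triples))  # Counter({'alice': 2, 'bob': 1})
```

Comparing such a reference tally against the model's answer gives a per-person correctness score.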
3.5. Task T5: Create Knowledge Graph from Factsheet
The FactExtractStatic task provides a textual factsheet and asks the model (Prompt 5) to extract the stated facts and represent them as an RDF knowledge graph serialized in Turtle.
4. Benchmark Study Results and Discussion
All tasks were executed with the automated LLM-KG-Bench framework. Besides the commercial models (GPT-3.5, GPT-4, Claude 1.3, and Claude 2.0), two offline models were run via GPT4All: ggml-vicuna-13b-1.1-q4_2.bin (Vicuna 13B) and ggml-model-gpt4all-falcon-q4_0.bin (Falcon).
5. Conclusion and Future Work
... LLM performances is evaluated by BigBench [3], on the Open LLM Leaderboard [4], and the Chatbot-Arena [5], but these are very general and do not focus on KGs or SPARQL. In existing work, KG related benchmarking is done specifically on the Turtle format [6] or often as discussion and comparison of very specific solutions. LLM benchmarking related to SPARQL was done in a small scale by Meyer et al. [7] and by Kovriguina et al. [8]. ...
... Several version iterations of LLMs were compared with focus on their RDF Turtle language capabilities and the answers were conserved in a time capsule for future (re)evaluation. Hofer et al. [26] showed in an LLM-driven RML mapping (in Turtle format) generation experiment in alignment with [25,6], that syntactical errors occurred when generating Turtle files, but most LLMs were capable to repair them. Given that SPARQL is based on Turtle, we designed our experiments as multi turn conversations with error feedback loops. ...
... Brei et al. [27] published as well another dataset based on a small subset of the CoyPu KG 6 . This sub graph is small enough to fit into context size of LLMs evaluated here. ...
Preprint
Full-text available
The integration of Large Language Models (LLMs) with Knowledge Graphs (KGs) offers significant synergistic potential for knowledge-driven applications. One possible integration is the interpretation and generation of formal languages, such as those used in the Semantic Web, with SPARQL being a core technology for accessing KGs. In this paper, we focus on measuring out-of-the box capabilities of LLMs to work with SPARQL and more specifically with SPARQL SELECT queries applying a quantitative approach. We implemented various benchmarking tasks in the LLM-KG-Bench framework for automated execution and evaluation with several LLMs. The tasks assess capabilities along the dimensions of syntax, semantic read, semantic create, and the role of knowledge graph prompt inclusion. With this new benchmarking tasks, we evaluated a selection of GPT, Gemini, and Claude models. Our findings indicate that working with SPARQL SELECT queries is still challenging for LLMs and heavily depends on the specific LLM as well as the complexity of the task. While fixing basic syntax errors seems to pose no problems for the best of the current LLMs evaluated, creating semantically correct SPARQL SELECT queries is difficult in several cases.
... While there exist works that employ or study LLMs for KGC/KGE tasks, investigating the performance of LLMs w.r.t. such low-level interfaces and basic graph comprehension still remains under-explored, although studies [11,3] showed that syntactical issues hinder the usefulness of semantically meaningful responses. This study compared leading GPT4All models to leading commercial models w.r.t. ...
... - We released an evolved version [9] of the LLM-KG-Bench framework [10] with updated prompts (enhanced clarity), a novel SPARQL task, a feature to rerun (modified) evaluations on captured model responses (e.g. using the time capsule), and support for instantiation-based tasks. - We performed a replication experiment of the findings in [3], thereby verifying and reinforcing the original research outcomes and the soundness of the benchmark setup and tasks. ...
... Additionally, for the original tasks from [3], we performed a replication study using the same model versions. Due to randomness at the default temperature there is slight variation, but the results remain in the same interval (see [4]). ...
Preprint
In this article, we evaluate the evolution of LLM capabilities w.r.t. the RDF Turtle and SPARQL languages as foundational skills to assist with various KGE tasks. We measure the LLM response quality using 6 LLM-KG-Bench tasks for a total of 15 LLM versions available over the course of 2023, covering 5 different "major version" LLM classes (GPT-3.5 Turbo, GPT-4, Claude 1.x, Claude 2.x, and Claude Instant 1.x).
... Several recent works cover the use of LLM prompting to address specific knowledge engineering tasks, like ontology engineering [49,8,5,19], ontology learning [11,2], named entity recognition and linking [21,9,43], knowledge graph construction including mapping generation [25,18]; some works specifically focus on benchmark, metrics and evaluation of such methods [31,12]. ...
... P2: generate the ontology-based knowledge graph of the procedure. In order to obtain the intended output, i.e. a KG of the procedure linked to its steps, actions, direct objects, equipment, and temporal information according to the given ontology, we assign the LLM the new role of "expert in knowledge graph construction, with a special background in ontologies on procedural knowledge", and we ask it to convert the semi-structured output of the first prompt into RDF formatted in Turtle syntax (similarly to [12]). Rather than providing the entire ontology to be used, we showed the language model an example translation from its initial output to RDF, so that the LLM could find in the example all classes and properties to be used, and how they needed to be mapped to the annotation. ...
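The kind of example-driven translation asked of the model in P2 can be sketched with a tiny serializer; the `p:` prefix, the property names (`p:hasStep`, `p:action`, `p:object`), and the step dictionaries are hypothetical illustrations, not the paper's actual ontology.

```python
def steps_to_turtle(procedure_iri: str, steps: list[dict]) -> str:
    """Serialize extracted procedure steps into Turtle, mimicking the
    semi-structured-output-to-RDF translation the LLM is asked to perform.
    All vocabulary terms here are illustrative placeholders."""
    lines = ["@prefix p: <http://example.org/procedure#> ."]
    for i, step in enumerate(steps, start=1):
        step_iri = f"p:step{i}"
        # Link the procedure to its step, then attach the step's details.
        lines.append(f"{procedure_iri} p:hasStep {step_iri} .")
        lines.append(f'{step_iri} p:action "{step["action"]}" ;')
        lines.append(f'    p:object "{step["object"]}" .')
    return "\n".join(lines)
```

An example translation of this shape, shown once in the prompt, is enough for the model to infer which classes and properties to use for the remaining steps.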
Preprint
Procedural Knowledge is the know-how expressed in the form of sequences of steps needed to perform some tasks. Procedures are usually described by means of natural language texts, such as recipes or maintenance manuals, possibly spread across different documents and systems, and their interpretation and subsequent execution is often left to the reader. Representing such procedures in a Knowledge Graph (KG) can be the basis to build digital tools to support those users who need to apply or execute them. In this paper, we leverage Large Language Model (LLM) capabilities and propose a prompt engineering approach to extract steps, actions, objects, equipment and temporal information from a textual procedure, in order to populate a Procedural KG according to a pre-defined ontology. We evaluate the KG extraction results by means of a user study, in order to qualitatively and quantitatively assess the perceived quality and usefulness of the LLM-extracted procedural knowledge. We show that LLMs can produce outputs of acceptable quality and we assess the subjective perception of AI by human evaluators.
... Ontologies (logically well-formed controlled vocabularies designed to represent entities and the relationships among them) and knowledge graphs (ontologies merged with actual data about entities and associated relationships) have been identified as crucial for advancing research on and applications of LLMs [3,6]. Recent research has explored the application of LLMs for ontology alignment [14][15][16][17][18], mining unstructured data for knowledge graph creation [19][20][21][22][23][24][25][26][27], the generation of ontological classes using LLMs [25,28,29,30,31,32], and the creation of ontologies using AI models [4]. However, to our knowledge there have been no serious attempts to generate ontologies or knowledge graphs that extend from an upper-level ontology or substantially reuse ontology content from domain ontologies extending from such an upper level. ...
Preprint
Generative artificial intelligence (AI), exemplified by the release of GPT-3.5 in 2022, has significantly advanced the potential applications of large language models (LLMs), including in the realms of ontology development and knowledge graph creation. Ontologies, which are structured frameworks for organizing information, and knowledge graphs, which combine ontologies with actual data, are essential for enabling interoperability and automated reasoning. However, current research has largely overlooked the generation of ontologies extending from established upper-level frameworks like the Basic Formal Ontology (BFO), risking the creation of non-integrable ontology silos. This study explores the extent to which LLMs, particularly GPT-4, can support ontologists trained in BFO. Through iterative development of a specialized GPT model named "My Ontologist," we aimed to generate BFO-conformant ontologies. Initial versions faced challenges in maintaining definition conventions and leveraging foundational texts effectively. My Ontologist 3.0 showed promise by adhering to structured rules and modular ontology suites, yet the release of GPT-4o disrupted this progress by altering the model's behavior. Our findings underscore the importance of aligning LLM-generated ontologies with top-level standards and highlight the complexities of integrating evolving AI capabilities in ontology engineering.
Conference Paper
Knowledge Graphs (KGs) provide us with a structured, flexible, transparent, cross-system, and collaborative way of organizing our knowledge and data across various domains in society and in industrial as well as scientific disciplines. KGs surpass any other form of representation in terms of effectiveness. However, Knowledge Graph Engineering (KGE) requires in-depth experience with graph structures, web technologies, existing models and vocabularies, rule sets, logic, as well as best practices. It also demands a significant amount of work. Considering the advancements in large language models (LLMs) and their interfaces and applications in recent years, we have conducted comprehensive experiments with ChatGPT to explore its potential in supporting KGE. In this paper, we present a selection of these experiments and their results to demonstrate how ChatGPT can assist us in the development and management of KGs.
Article
Large language models (LLMs), such as ChatGPT and GPT-4, are making new waves in the field of natural language processing and artificial intelligence, due to their emergent ability and generalizability. However, LLMs are black-box models, which often fall short of capturing and accessing factual knowledge. In contrast, Knowledge Graphs (KGs), Wikipedia and Huapu for example, are structured knowledge models that explicitly store rich factual knowledge. KGs can enhance LLMs by providing external knowledge for inference and interpretability. Meanwhile, KGs are difficult to construct and evolve by nature, which challenges the existing methods in KGs to generate new facts and represent unseen knowledge. Therefore, it is complementary to unify LLMs and KGs together and simultaneously leverage their advantages. In this article, we present a forward-looking roadmap for the unification of LLMs and KGs. Our roadmap consists of three general frameworks, namely, 1) KG-enhanced LLMs, which incorporate KGs during the pre-training and inference phases of LLMs, or for the purpose of enhancing understanding of the knowledge learned by LLMs; 2) LLM-augmented KGs, that leverage LLMs for different KG tasks such as embedding, completion, construction, graph-to-text generation, and question answering; and 3) Synergized LLMs + KGs, in which LLMs and KGs play equal roles and work in a mutually beneficial way to enhance both LLMs and KGs for bidirectional reasoning driven by both data and knowledge. We review and summarize existing efforts within these three frameworks in our roadmap and pinpoint their future research directions.
S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, Y. Zhang, Sparks of artificial general intelligence: Early experiments with GPT-4 (2023). arXiv:2303.12712.
P. Groth, E. Simperl, M. van Erp, D. Vrandečić, Knowledge graphs and their role in the knowledge engineering of the 21st century (Dagstuhl Seminar 22372) (2023). URL: https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22372. doi:10.4230/DAGREP.12.9.60.
L.-P. Meyer, J. Frey, K. Junghanns, F. Brei, K. Bulert, S. Gründer-Fahrer, M. Martin, Developing a scalable benchmark for assessing large language models in knowledge graph engineering, 2023. arXiv:2308.16622. To appear in the poster proceedings of SEMANTiCS 2023, 20-22.9.2023, Leipzig, Germany.
M. Trajanoska, R. Stojanov, D. Trajanov, Enhancing knowledge graph construction using large language models (2023). arXiv:2305.04676.
S. Carta, A. Giuliani, L. Piano, A. S. Podda, L. Pompianu, S. G. Tiddia, Iterative zero-shot llm prompting for knowledge graph construction (2023). arXiv:2307.01128.
Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen, N. Zhang, Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities, 2023. arXiv:2305.13168.
J. Guo, L. Du, H. Liu, M. Zhou, X. He, S. Han, GPT4Graph: Can large language models understand graph structured data? An empirical evaluation and benchmarking (2023). arXiv:2305.15066.
A. Srivastava, et al., Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, Transactions on Machine Learning Research (2023). arXiv:2206.04615.