Developing a Scalable Benchmark for Assessing Large Language Models in Knowledge Graph Engineering

Abstract

As the field of Large Language Models (LLMs) evolves at an accelerated pace, the need to assess and monitor their performance becomes critical. We introduce a benchmarking framework focused on knowledge graph engineering (KGE), accompanied by three challenges addressing syntax and error correction, fact extraction, and dataset generation. We show that, while useful as a tool, LLMs are not yet fit to assist in knowledge graph generation with zero-shot prompting. Our LLM-KG-Bench framework therefore provides automatic evaluation and storage of LLM responses, as well as statistical data and visualization generation, to support tracking of prompt engineering and model performance.
1 Institute for Applied Informatics, Goerdelerring 9, 04109 Leipzig, Germany, https://infai.org
2 https://AKSW.org

                 
           
           
                 
Semantics ’23: 19th International Conference on Semantic Systems, September 20–22, 2023, Leipzig, Germany
CEUR Workshop Proceedings (http://ceur-ws.org), ISSN 1613-0073
Related benchmarks and evaluation efforts named in the text: the Knowledge Base Construction from Pre-trained Language Models (LM-KBC) Challenge, the Beyond the Imitation Game (BIG-bench) Benchmark, and the Language Model Evaluation Harness. In connection with LLM-KG-Bench and BIG-bench, the text also names the functions generate_text and evaluate_model.
Figure: Basic LLM-KG-Bench framework architecture. Components shown: Benchmark-Runner, Bench Task (connector, size), Query generator, AI-Model-connector, Answer Evaluator, plot, Stats, Storage, Benchmark Collection, and Connector Collection. Data-flow labels: AI, Text, API, Stats, Task-Info, addon queries. The Benchmark-Runner iterates according to the config over Sizes x Iterations x Connectors x Benchmarks. Example Benchmark Config: Iterations=10, Sizes={1k, 10k, 1m}, Connectors={1,2}, Benchmarks={1,3,4}.
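The iteration scheme in the architecture figure (Sizes x Iterations x Connectors x Benchmarks) can be sketched as a Cartesian product over the configured values. This is a minimal illustration and not the framework's actual code: the dictionary layout, the key names, and the `plan_runs` helper are assumptions made for the example; only the configuration values come from the figure.

```python
from itertools import product

# Values mirror the example "Benchmark Config" in the architecture figure.
# The dictionary layout and key names are assumptions for illustration.
config = {
    "iterations": 10,
    "sizes": ["1k", "10k", "1m"],
    "connectors": [1, 2],
    "benchmarks": [1, 3, 4],
}


def plan_runs(cfg):
    """Enumerate every (size, iteration, connector, benchmark) combination."""
    return list(product(
        cfg["sizes"],
        range(1, cfg["iterations"] + 1),
        cfg["connectors"],
        cfg["benchmarks"],
    ))


runs = plan_runs(config)
print(len(runs))  # 3 sizes * 10 iterations * 2 connectors * 3 benchmarks = 180
print(runs[0])    # ('1k', 1, 1, 1)
```

With the example configuration, a single benchmark run would therefore issue 180 prompt/evaluation cycles, which is why automatic evaluation and storage of the responses matters.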
              
        
Figure: Subset of metrics from initial tasks. Shown are the F1 scores and mean error of person count.
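The caption does not say what the F1 scores are computed over. As a rough sketch, assuming the score compares a set of expected triples with the set of triples found in a model's answer, it could be calculated as follows. The `f1_score` helper and the sample triples are hypothetical.

```python
def f1_score(expected, actual):
    """F1 of two collections of items, compared as sets."""
    expected, actual = set(expected), set(actual)
    true_positives = len(expected & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(actual)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Hypothetical triples, written as (subject, predicate, object) tuples.
expected = {
    (":anne", "rdf:type", "foaf:Person"),
    (":bob", "rdf:type", "foaf:Person"),
    (":anne", "foaf:knows", ":bob"),
}
actual = {
    (":anne", "rdf:type", "foaf:Person"),
    (":anne", "foaf:knows", ":bob"),
    (":anne", "foaf:knows", ":carl"),
}
print(round(f1_score(expected, actual), 3))  # 0.667
```

Here two of the three returned triples are correct and two of the three expected triples are found, so precision and recall are both 2/3.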
       
           
        
    
          
              
             
              
             
             
             
               
             
  
     
               
              
  
Vocabulary terms used: foaf:Person, foaf:knows.
Metric persons_relative_error: cases = 0, > 0, and < 0, plus the value −1.
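The cases listed for persons_relative_error (= 0, > 0, < 0, and −1) are consistent with a signed relative error of the form (generated − requested) / requested, where −1 is the lowest possible value and corresponds to no persons being generated at all. The formula itself is not stated here, so the sketch below is an assumption, and the function signature is hypothetical.

```python
def persons_relative_error(generated, requested):
    """Signed relative deviation of the generated person count (assumed formula)."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return (generated - requested) / requested


print(persons_relative_error(10, 10))  # 0.0  -> exactly the requested count
print(persons_relative_error(15, 10))  # 0.5  -> too many persons
print(persons_relative_error(7, 10))   # -0.3 -> too few persons
print(persons_relative_error(0, 10))   # -1.0 -> no persons at all
```

Under this reading, averaging the value over repeated runs gives the "mean error of person count" mentioned in the figure caption.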
            
    
            
References
OpenAI, GPT-4 technical report, 2023. arXiv:2303.08774.
A. Srivastava, et al., Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, Transactions on Machine Learning Research (2023). arXiv:2206.04615.
S. Pan, et al., Unifying large language models and knowledge graphs: A roadmap, 2023. arXiv:2306.08302.
L.-P. Meyer, et al., LLM-assisted knowledge graph engineering: Experiments with ChatGPT, 2023. Accepted and presented in AI-Tomorrow track on Data Week 2023 in Leipzig.
J. H. Caufield, et al., Structured prompt interrogation and recursive extraction of semantics (SPIRES): A method for populating knowledge bases using zero-shot learning, 2023. arXiv:2304.02711.