The rapid expansion of unstructured and semi-structured textual data in technical documentation, industrial datasheets, and regulatory reports has created an urgent need for automated knowledge extraction and representation systems. Traditional rule-based and keyword-driven approaches often fail to capture semantic relationships, hierarchical structures, and contextual dependencies, limiting their effectiveness in structured data retrieval. This thesis explores AI-driven structured
knowledge extraction using Large Language Models (LLMs), specifically GPT-4o and Gemini 2.0 Flash, to generate XML-based knowledge graphs from unstructured PDFs. The proposed methodology consists of a multi-stage AI pipeline that integrates text extraction, structured representation, confidence-aware entity extraction, and question-answering (QA) capabilities:
• Text Extraction and Preprocessing: Layout-aware text extraction with pdfplumber retrieves textual content from multi-column, tabular, and graphics-heavy PDFs. The system preserves context, maintains structural consistency, and handles complex document formats efficiently (a minimal extraction sketch follows this list).
• Structured Knowledge Graph Generation: Extracted text is processed with GPT-4o and Gemini 2.0 Flash to transform unstructured content into hierarchically structured XML representations, ensuring that the extracted information is machine-readable and semantically rich (see the prompting sketch after this list).
• Confidence-Based Entity Extraction: Gemini 2.0 Flash introduces a confidence-aware extraction framework in which each extracted attribute is assigned a confidence score (0.0–1.0), enabling uncertainty estimation, ranking of high-confidence attributes, and filtering of unreliable extractions (a filtering sketch follows this list).
• Question-Answering (QA) over Structured Data: The thesis implements two QA systems: (i) Rule-Based Querying, which maps structured queries directly to XML elements for fast, precise information retrieval, and (ii) AI-Powered Semantic QA, in which GPT-4o and Gemini 2.0 Flash interpret natural-language queries and dynamically extract the relevant information from the structured knowledge graphs (a rule-based querying sketch follows this list).
• Performance Benchmarking and Evaluation: The structured extraction and QA models are evaluated using (i) precision, recall, and F1-score to assess extraction accuracy, (ii) processing time and scalability to measure computational efficiency, (iii) schema compliance to verify adherence to predefined XML structures, and (iv) confidence-score reliability to validate uncertainty estimation in entity extraction (the accuracy metrics are sketched below).
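A minimal sketch of the layout-aware extraction step, assuming a Python pipeline built on pdfplumber; the input file name and the per-page dictionary layout are illustrative rather than the thesis implementation:

```python
# Minimal sketch: layout-aware extraction of text and tables with pdfplumber.
# "datasheet.pdf" and the per-page dict layout are illustrative.
import pdfplumber

def extract_pages(pdf_path: str) -> list[dict]:
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            pages.append({
                "page": number,
                "text": page.extract_text() or "",   # layout-aware text (may be None)
                "tables": page.extract_tables(),     # list of rows of cell strings
            })
    return pages

pages = extract_pages("datasheet.pdf")
```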
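A hedged sketch of the structured-representation step, assuming the OpenAI Python SDK is used to prompt GPT-4o; the prompt wording and XML tag set are assumptions, not the thesis schema, and the Gemini 2.0 Flash path would follow the same pattern with its own client:

```python
# Hedged sketch: prompting GPT-4o to emit an XML knowledge graph for one
# chunk of extracted text. Prompt wording and tag names are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "Convert the given document text into XML using only <document>, <entity>, "
    "and <attribute name=... unit=...> elements. Output XML and nothing else."
)

def text_to_xml(page_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": page_text},
        ],
        temperature=0,  # deterministic output helps schema compliance
    )
    return response.choices[0].message.content
```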
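A sketch of confidence-based filtering and ranking, assuming the model output attaches a confidence attribute (0.0–1.0) to each extracted element; the element and attribute names are illustrative:

```python
# Sketch: keep and rank high-confidence attributes, assuming each extracted
# <attribute> element carries a confidence="0.0-1.0" attribute.
import xml.etree.ElementTree as ET

def rank_attributes(xml_text: str, threshold: float = 0.7):
    root = ET.fromstring(xml_text)
    scored = [
        (float(el.get("confidence", "0.0")), el.get("name"), el.text)
        for el in root.iter("attribute")
    ]
    kept = [item for item in scored if item[0] >= threshold]    # drop unreliable extractions
    return sorted(kept, key=lambda item: item[0], reverse=True) # best-first ranking
```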
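A sketch of the rule-based querying path, which maps a structured query directly onto XML elements; the tag names, attribute names, and example query are assumptions:

```python
# Sketch: rule-based querying that maps a structured query straight onto
# XML elements. Tag names, attribute names, and the example are assumptions.
import xml.etree.ElementTree as ET

def query_attribute(xml_text: str, entity: str, attribute: str):
    root = ET.fromstring(xml_text)
    node = root.find(f".//entity[@name='{entity}']/attribute[@name='{attribute}']")
    return node.text if node is not None else None

# query_attribute(xml_doc, "PowerSupply", "output_voltage")  ->  e.g. "24 V"
```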
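For reference, the extraction-accuracy metrics can be computed by treating extraction as a set comparison between predicted and gold (attribute, value) pairs; this is a generic sketch, not the thesis evaluation harness:

```python
# Generic sketch of the extraction-accuracy metrics over (attribute, value) pairs.
def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```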
Key Findings and Contributions: Experimental results demonstrate that GPT-4o excels in structured knowledge graph generation, producing highly accurate and semantically coherent XML representations. Gemini 2.0 Flash is more computationally efficient and introduces confidence-based entity extraction, improving reliability in large-scale document processing. However, challenges
remain:
• Schema Inconsistencies: AI-generated XML structures sometimes deviate from predefined schema formats, requiring post-processing validation.
• Numerical Misinterpretation: Models occasionally misinterpret numerical attributes, unit conversions, and measurement relationships.
• Ambiguity in Query-Based Extraction: AI-powered QA systems struggle with vague or multi-context queries, requiring a hybrid retrieval approach.
• Scalability Constraints: Processing large document corpora with LLM-based structured extraction incurs high computational costs, necessitating optimization strategies.
To address these limitations, this thesis explores post-processing validation techniques, fine-tuning strategies, and confidence-based entity ranking to improve AI-driven structured extraction.
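As one concrete form of post-processing validation, AI-generated XML can be checked against the predefined schema before it enters the knowledge graph; the sketch below assumes an XSD schema and uses lxml, and the schema file name is hypothetical:

```python
# Hedged sketch: validate AI-generated XML against a predefined XSD with lxml.
# "knowledge_graph.xsd" is a hypothetical schema file name.
from lxml import etree

def schema_errors(xml_text: str, xsd_path: str = "knowledge_graph.xsd") -> list[str]:
    schema = etree.XMLSchema(etree.parse(xsd_path))
    document = etree.fromstring(xml_text.encode("utf-8"))
    if schema.validate(document):
        return []                                       # fully schema-compliant
    return [str(error) for error in schema.error_log]   # feed into repair or re-prompting
```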
Future research directions include:
• Hybrid AI-Rule-Based Knowledge Extraction: Combining deep learning models with schema-enforced rule-based validation to ensure structural consistency and interpretability.
• Reinforcement Learning for Schema Compliance: Optimizing model outputs to match predefined XML standards dynamically.
• Adaptive Domain-Specific Tuning: Fine-tuning LLMs on specialized corpora (e.g., technical manuals, financial reports, scientific research) to improve extraction accuracy.
• Scalable Processing Pipelines: Implementing distributed and parallelized extraction workflows to enable large-scale structured document processing.
Impact and Applications: The findings of this research contribute to the advancement of AI-powered structured knowledge extraction, offering a scalable, interpretable, and queryable approach to information retrieval. This methodology has significant applications in:
• Enterprise Knowledge Management – Automating the structuring and retrieval of technical and regulatory documentation.
• AI-Assisted Information Retrieval – Enabling intelligent question-answering for business intelligence and decision-support systems.
• Semantic Search and Knowledge Graph Integration – Enhancing ontology-based search engines and domain-specific knowledge bases.
• Scientific Research and Digital Archiving – Facilitating structured extraction from research papers, patents, and legal documents.
This thesis bridges the gap between unstructured document content and structured knowledge representation, paving the way for scalable AI-driven knowledge management systems. Future refinements in hybrid AI pipelines, confidence-aware ranking, and model explainability will further enhance the accuracy, efficiency, and reliability of automated structured information extraction.