April 2025
Aim: This study aimed to develop and evaluate an automated large language model (LLM)-based system for assessing the quality of medical imaging guidelines and consensus (GACS) in different languages, with the goal of improving evaluation efficiency and consistency while reducing manual workload.

Method: We developed the QPC-HASE-GuidelineEval algorithm, which integrates a Four-Quadrant Questions Classification Strategy and Hybrid Search Enhancement. The model was validated on 45 medical imaging guidelines (36 in Chinese and 9 in English) published in 2021 and 2022. Key evaluation metrics included agreement with expert assessments, paragraph-matching accuracy and information completeness of the hybrid search, comparisons among different paragraph-matching approaches, and cost and time efficiency.

Results: The algorithm achieved an average accuracy of 77%, performing well on simpler tasks but showing lower accuracy (29%-40%) on complex evaluations, such as assessing explanations and visual aids. Average accuracy for the English and Chinese versions of the GACS was 74% and 76%, respectively (p = 0.37). Hybrid search achieved the best paragraph-matching accuracy (4.42) and information completeness (4.42), significantly outperforming keyword-based search (1.05 / 1.05) and sparse-dense retrieval (4.26 / 3.63). The algorithm reduced evaluation time to 8 min 30 s per guideline and cost to approximately 0.5 USD per guideline, a considerable advantage over traditional manual methods.

Conclusion: The QPC-HASE-GuidelineEval algorithm, powered by LLMs, showed strong potential for improving the efficiency, scalability, and multi-language capability of guideline evaluations, though further enhancements are needed to handle more complex tasks that require deeper interpretation.
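To make the hybrid search comparison in the Results more concrete, the sketch below illustrates one common way to fuse a sparse keyword signal with a dense embedding signal when matching a question to guideline paragraphs. It is a minimal illustration only: the keyword scorer, the cosine stand-in for embeddings, the reciprocal rank fusion (RRF) constant, and all function names are assumptions for exposition, not the paper's QPC-HASE-GuidelineEval implementation.

```python
"""Toy sketch of hybrid paragraph retrieval via reciprocal rank fusion (RRF).
Not the paper's implementation; scorers, names, and the RRF constant k
are illustrative assumptions."""
import math
from collections import Counter


def keyword_score(query: str, paragraph: str) -> float:
    """Sparse signal: count of query terms that appear in the paragraph."""
    q = Counter(query.lower().split())
    p = Counter(paragraph.lower().split())
    return float(sum(min(q[t], p[t]) for t in q))


def cosine(u: list[float], v: list[float]) -> float:
    """Dense signal: cosine similarity between precomputed embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rrf_fuse(rankings: list[list[int]], k: int = 60) -> dict[int, float]:
    """Reciprocal rank fusion: score(doc) = sum over rankings of 1/(k + rank)."""
    fused: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused


def hybrid_search(query, paragraphs, query_vec, para_vecs, top_n=3):
    """Rank paragraphs by fusing the keyword ranking with the embedding ranking."""
    ids = list(range(len(paragraphs)))
    sparse_rank = sorted(ids, key=lambda i: -keyword_score(query, paragraphs[i]))
    dense_rank = sorted(ids, key=lambda i: -cosine(query_vec, para_vecs[i]))
    fused = rrf_fuse([sparse_rank, dense_rank])
    return sorted(ids, key=lambda i: -fused[i])[:top_n]
```

Fusing rankings rather than raw scores sidesteps the scale mismatch between keyword counts and cosine similarities, which is one reason rank-based fusion is a popular default for hybrid retrieval; whether the paper uses this or a weighted score blend is not stated in the abstract.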