February 2025
Large Language Models (LLMs) hold potential as clinical decision support tools, particularly when integrated with domain-specific knowledge. In radiology, there is limited research on LLMs for assessing imaging appropriateness. This study evaluates the performance of a contextualized GPT-4-based LLM in assessing the appropriateness of musculoskeletal MRI scan requests, comparing it against standard models and successively optimized versions. The LLMs’ performance was also compared against that of human clinicians with varying levels of experience (two radiology residents, two subspecialist attendings, and an orthopaedic surgeon). Using a retrieval-augmented generation framework, the LLM was provided with a domain-specific knowledge base drawn from 33 American College of Radiology Appropriateness Criteria guidelines. A test dataset of 70 fictional case scenarios was created, including cases with insufficient clinical information. Quantitative analysis using the McNemar mid-P test revealed that the optimized LLM achieved 92.86% accuracy, significantly outperforming the baseline model (61.43%, P < .001) and the standard GPT-4 model (51.29%, P < .001). The optimized model also excelled at identifying cases with insufficient clinical information. Compared with the human clinicians, the optimized LLM performed better than all but one radiologist. This study demonstrates that, with contextualization and optimization, GPT-4-based LLMs can improve performance in assessing imaging appropriateness and show promise as clinical decision support tools in radiology.
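The retrieval-augmented generation setup described above can be illustrated with a minimal sketch. The abstract does not specify the study's retriever, embedding model, or prompt wording, so everything below is an assumption: a toy keyword-overlap retriever over hypothetical guideline snippets, building a prompt that would then be sent to a GPT-4-class model.

```python
# Minimal RAG sketch (illustrative only; not the study's actual pipeline).
# KNOWLEDGE_BASE holds hypothetical stand-ins for ACR Appropriateness
# Criteria excerpts; the real study used 33 full guidelines.
KNOWLEDGE_BASE = [
    "Chronic knee pain, suspected meniscal tear: MRI knee without contrast is usually appropriate.",
    "Acute low back pain without red flags: MRI lumbar spine is usually not appropriate initially.",
    "Suspected rotator cuff tear after failed conservative therapy: MRI shoulder without contrast is usually appropriate.",
]

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank guideline snippets by the number of lowercase tokens shared with the query."""
    q_tokens = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_tokens & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(case: str) -> str:
    """Prepend the best-matching guideline excerpt to the case scenario.

    The resulting string is what would be sent to the LLM; the three-way
    label set mirrors the abstract's task (including the 'insufficient
    clinical information' option).
    """
    context = "\n".join(retrieve(case, KNOWLEDGE_BASE))
    return (
        "Using the guideline excerpt below, classify this MRI request as "
        "appropriate, not appropriate, or insufficient clinical information.\n\n"
        f"Guideline: {context}\n\nCase: {case}"
    )

if __name__ == "__main__":
    print(build_prompt(
        "Patient with chronic knee pain and a suspected meniscal tear; MRI requested."
    ))
```

In practice, a semantic retriever (e.g. vector embeddings) would replace the keyword overlap here; the structure of the step, retrieve relevant guideline text and inject it into the prompt, is the same.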