xAID Chest CT: retrospective clinical utility assessment

Abstract

This study analyzed the potential clinical benefit of the xAID Chest CT software in a non-balanced selection of cases assessed by board-certified radiologists from four different European countries.
Introduction
xAID is a radiological AI tool that detects multiple types of findings on tomographic images. While several studies of its Chest CT module are available as preprints submitted to peer-reviewed journals, all of them retrospectively assess formal metrics of sensitivity and specificity defined against a per-case consensus of three radiologists. Meanwhile, many studies and opinion pieces point out that sensitivity and specificity may poorly reflect the real-world clinical value of radiological software [1, 2].
This is especially true for modern multi-feature AI solutions, which serve purposes beyond the detection of a single pathology. Such a tool works rather like a “junior intern” for radiologists: presenting the case, highlighting key areas of concern, and performing routine measurements [3]. For such tools, the overall contribution of AI to a correct radiological diagnosis, rather than the individual performance of separate program components, appears to be the more meaningful evaluation target.
This study analyzed the potential clinical benefit of the AI software in a non-balanced selection of cases assessed by board-certified radiologists from four different European countries.
Goal
To assess the potential impact of the xAID Chest CT product on the clinical performance of board-certified radiologists using real-world data.
Design
The study analyzed the performance of the research-only version of the xAID Chest CT product, which covered 12 findings. All findings, with their default thresholds and limitations, are listed in Table 1. The software takes a non-contrast chest CT series and sends it to the cloud, where it is processed and returned as three additional series, as shown in Image 1:
1. A summary series (DICOM-SC) containing information on the most prominent findings (or their absence), with visual examples, in anatomical order.
2. A DICOM-SC series containing all axial slices with each pathology annotated, a scrollbar indicating the most critical pathology at a given slice, and a summary side panel.
3. A DICOM-SR report in English containing textual information on the study findings.
The images could then be reviewed with any DICOM viewer.
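As an illustration only (not part of the study protocol), the returned series can be inspected with standard DICOM tooling such as pydicom. The sketch below assumes the three output series have been exported as plain DICOM files into a local folder; the folder name and the flat traversal of the structured-report content tree are assumptions made for illustration.

```python
# Illustrative post-processing check (not part of the product): separate the
# returned image series from the DICOM-SR report and print the report's text
# items. Assumes the results were exported as plain DICOM files in ./xaid_output/.
from pathlib import Path

import pydicom

for path in sorted(Path("xaid_output").glob("*.dcm")):
    ds = pydicom.dcmread(path)
    if ds.Modality == "SR":
        # Structured report: walk the content tree and print any TEXT items.
        stack = list(ds.get("ContentSequence", []))
        while stack:
            item = stack.pop()
            if item.get("ValueType") == "TEXT":
                print(f"{path.name}: {item.TextValue}")
            stack.extend(item.get("ContentSequence", []))
    else:
        # Image series (e.g. secondary capture): just list what was received.
        print(f"{path.name}: {ds.Modality}, {ds.get('SeriesDescription', 'no description')}")
```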
Four board-certified radiologists from four European countries (France, Greece, Slovakia and the United Kingdom) were enrolled in the study on a pay-for-service basis. All of them work in both elective and emergency settings in different hospitals and have experience with at least one AI product.
They were encouraged to upload between 20 and 25 anonymized non-contrast chest CT studies via a dedicated demo account. The sources were selected by the radiologists themselves (e.g. open datasets, university datasets, personal case collections), with no limitations on findings or image type, except that patient age, where known, was suggested to be over 18 years and the maximum slice thickness 3 mm.
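These suggested inclusion criteria can be checked before upload with a few lines of standard DICOM tooling. The sketch below is only an illustration and assumes the anonymized files still carry the PatientAge and SliceThickness attributes (anonymization often strips the former).

```python
# Hypothetical pre-upload screen for the suggested inclusion criteria:
# adult patient (> 18 years) and slice thickness of at most 3 mm.
from pathlib import Path

import pydicom


def eligible(path: Path) -> bool:
    ds = pydicom.dcmread(path, stop_before_pixels=True)
    age = ds.get("PatientAge")            # e.g. "052Y"; often removed by anonymization
    thickness = ds.get("SliceThickness")  # in millimetres; may be missing
    is_adult = age is not None and str(age).endswith("Y") and int(str(age)[:-1]) > 18
    is_thin = thickness is not None and float(thickness) <= 3.0
    return is_adult and is_thin


eligible_slices = [p for p in Path("anonymized_cases").rglob("*.dcm") if eligible(p)]
print(f"{len(eligible_slices)} slices meet the suggested age and thickness criteria")
```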
Prior to uploading studies, the research participants received instructions for using the software and were then asked to adjust the program’s detection/reporting thresholds to fit their standards. All radiologists took part in a 30-minute training session and were encouraged to analyze no more than two studies per day to make allowance for the learning curve.
The results of each individual study were assessed by the radiologists themselves using a formalized table (see Table 2). The instructions suggested entering definite answers (such as “yes” or “no”) in each column. Any ambiguous answers were clarified and reduced to binary options during a follow-up online session. The radiologists were also able to report the specific reasons why they agreed or disagreed with the program’s decision. The radiologist’s decision on a case was the sole final evaluation point; no independent analysis of the radiologists’ performance was performed.
Statistical analysis included estimation of binary prevalence with 95% confidence intervals computed as Wilson score intervals, chosen because the number of cases was below 100 and several proportions approached the upper and lower 20% bounds [4, 5].
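A minimal sketch of the Wilson score interval is shown below; the counts used in the example call are illustrative, back-calculated from the reported primary outcome (about 66 positive answers out of 81 cases), not taken from the raw study data.

```python
# Wilson score interval for a binomial proportion (default: 95% confidence).
from math import sqrt


def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = (z / denom) * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return center - half_width, center + half_width


# Illustrative check against the primary outcome (about 66 of 81 cases positive).
low, high = wilson_ci(66, 81)
print(f"66/81 = {66 / 81:.1%}, 95% CI [{low:.1%}, {high:.1%}]")  # roughly [71.7%, 88.5%]
```

The same interval is also available off the shelf, for example via statsmodels.stats.proportion.proportion_confint(count, nobs, method="wilson").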
Primary outcome:
AI’s potential contribution to establishing a clinical diagnosis (the share of cases in which AI would have contributed to clinically significant findings, based on the radiologists’ judgment).
Secondary outcomes:
1. Miss rate for clinically significant findings
2. Overall satisfaction
3. Detection rate by pathology
4. Segmentation quality (by pathology)
5. Measurement quality (by pathology)
Results
The study included 81 cases assessed by four board-certified radiologists from different
European countries. The primary outcome demonstrated that AI segmentation contributed to
establishing a clinical diagnosis in 81.5% [71.7–88.5%] of cases. Radiologists reported a 47.4%
[36.9–58.1%] miss rate for clinically significant findings.
Regarding usability, the image layout was approved in 89.7% [81.2–94.6%] of case assessments, and in 94.9% [87.8–98.0%] of cases the radiologists indicated that DICOM-SR components could be integrated into their reports with minor modifications. AI detected findings outside routine clinical practice in 28.2% [19.6–38.8%] of cases.
For specific pathologies, the AI demonstrated high accuracy in detection and measurement for pleural effusion (89.7% [81.2–94.6%]) and thoracic aorta measurements (89.7% [81.2–94.6%]), whereas its performance was lower for pulmonary nodules (66.7% [55.9–76.0%]) and pulmonary opacification (73.1% [62.6–81.5%]). Measurement accuracy and visualization consistency averaged 81.3% and 81.8%, respectively, while correct detection of normal/pathological features was lower at 74.1%.
These results highlight the AI's potential as a clinical support tool while identifying areas for
improvement, particularly in false positives for lung nodules and precision in pathology
differentiation.
Discussion
The findings of this study support the utility of the xAID Chest CT Module as a clinical
decision-support tool for radiologists. AI segmentation contributed to establishing a clinical
diagnosis in over 80% of cases, underscoring its role in enhancing radiological workflows
beyond simple pathology detection. However, limitations were observed in specific areas,
particularly in the precision of pulmonary nodule detection and differentiation of certain
pathologies.
False positives in nodule detection were frequently reported, with vessels being misclassified as
nodules, and subpleural nodules proving particularly challenging. Similarly, discrepancies in the
interpretation of age-adjusted findings, such as coronary artery calcium and vertebral
compression fractures, suggest a need for improved standardization in measurement and
classification algorithms. The lower detection accuracy for pulmonary opacification further
highlights areas for refinement, particularly in terms of feature differentiation and measurement
formalization.
Despite these limitations, radiologists found the AI-generated DICOM-SR reports useful, with
nearly 95% indicating that parts of the output could be incorporated into their clinical reports
with minor modifications. The high approval rate for the image layout (nearly 90%) suggests that
the AI’s visualization approach aligns well with radiologists’ expectations. Additionally, the AI
system identified findings that might not typically be considered in routine practice in over a
quarter of cases, suggesting its potential to improve comprehensiveness in reporting.
These results highlight the evolving role of AI in radiology—not as a standalone diagnostic tool
but as an augmentative system that enhances efficiency, consistency, and accuracy. Future
improvements should focus on optimizing nodule detection precision, refining measurement
algorithms, and addressing pathology interpretation discrepancies to further enhance clinical
applicability.
Conclusions
This study demonstrates that the xAID Chest CT Module provides meaningful clinical support
for radiologists, with AI segmentation contributing to diagnosis in over 80% of cases. The tool
effectively enhances workflow efficiency by automating routine measurements and highlighting
key findings, making it a valuable adjunct in radiological practice.
However, challenges remain in certain areas, particularly in reducing false-positive nodule
detections and improving the precision of pathology measurements. Addressing these limitations
through algorithm refinement and improved interpretability will be essential for maximizing AI’s
clinical impact.
Overall, the findings reinforce the potential of multifeatured AI systems as integral components
of modern radiology, assisting clinicians in decision-making rather than replacing their expertise.
Future research should focus on refining AI-driven pathology differentiation and expanding
real-world validation studies to further enhance its diagnostic utility.
Supplement
Table 1. Product basic functionality
Pathology Name | Measurements | Limitations
Lung nodules | Solid nodules only; detects all nodules including inflammatory, ≥ 4-6 mm | For patients with known malignancy or infectious disease, size criteria for malignancy cannot be used
Pleural effusion | Detects crescent-shaped liquid accumulations in gravity-dependent areas | Limited detection in pulmonary-only series or noisy images
Pulmonary trunk dilatation | Measures at the widest part; normal upper limit: 29-33 mm | Some anatomical features may obstruct the identification
Pulmonary opacification | Detects consolidated lung tissue areas; <0.5% volume may be 0% | Motion artifacts limit detection; boundaries of large consolidations are difficult to determine
Pneumothorax | Detects gas accumulation in the pleural cavity | -
Pulmonary emphysema | Detects voxels with CT density ≤ -950 HU, ≥ 6% | Cannot differentiate between cysts, bronchiectasis, and cavities
Coronary artery calcification | Detects calcifications, calculates Agatston index, classifies severity (CAC-DRS) | Stents may cause false positives
Pericardial and epicardial fat | Identifies and segments pericardial/epicardial fat, ≥ 200 ml | Vessels within the pericardium may be included in segmentation
Hydropericardium | Detects and measures ≥ 50 ml | Heartbeat artifacts limit detection
Dilatation or aneurysm of thoracic aorta | Measures ascending/descending aorta | Limited accuracy in asthenic patients, esophageal pathology, aortic dissection, and post-surgery cases
Adrenal gland lesions | Detects lesions ≥ 10 mm in the adrenal gland | False positives due to tumors from nearby organs
Spinal compression fractures | Classifies fractures by height reduction: <25% (Genant 0-I), 25-40% (Genant 2), >40% (Genant 3) | Errors are possible in patients with extensive scoliosis
Table 2. Questions for research participants.
For each study (Study #), the radiologists answered a set of per-pathology questions and a set of general reporting questions. For every pathology listed below, the answer options were whether normal/pathological features were detected correctly and whether they were visualized correctly, plus free-text comments.
Per-pathology questions:
- Pulmonary nodules
- Pulmonary trunk measurement
- Pulmonary opacification
- Pleural effusion
- Spinal compression fractures
- Coronary artery calcification
- Pericardial and epicardial fat
- Ascending and descending thoracic aorta measurement
- Adrenal gland lesions
- Pericardial effusion
General reporting questions:
- Could AI segmentation contribute to establishing a diagnosis in this clinical case?
- What clinically significant findings were missed by AI on this image?
- Is there clinical significance for this case?
- Did you like the layout of the images?
- In real practice, would you be able to use parts of the DICOM-SR for your own report (assuming it was in your working language)? What would need to be changed for you to be able to do so?
- Has the AI tool found anything you wouldn’t look for or report in everyday clinical practice?
- For the first 3 cases, please provide the report for this case as you would submit it in real-world practice (possibly in your working language).
Table 3. Primary outcome and general data
Question | % of positive answers
Could AI segmentation contribute to establishing diagnosis in this clinical case? | 81.5 [71.7–88.5]%
What clinically significant findings were missed by AI on this image? | 47.4 [36.9–58.1]%
Did you like the layout of the images? | 89.7 [81.2–94.6]%
In real practice, would you be able to use parts of DICOM-SR for your own report (assuming they were in your working language)? What would need to be changed for you to be able to do so? | 94.9 [87.8–98.0]%
Has the AI tool found anything you wouldn’t look for or report in everyday clinical practice? | 28.2 [19.6–38.8]%
Table 4. Secondary outcome (specific metrics)
Values are the percentage of results answered positively.
Question | Measured correctly | Visualized correctly | Normal/pathological features detected correctly
Pulmonary trunk measurement | 85.9 [76.7–91.9]% | 87.2 [78.2–92.8]% | 87.2 [78.2–92.8]%
Pleural effusion | 89.7 [81.2–94.6]% | 89.7 [81.2–94.6]% | 75.6 [65.2–83.7]%
Spinal compression fractures | 88.5 [79.7–93.8]% | 88.5 [79.7–93.8]% | 71.8 [61.2–80.4]%
Coronary artery calcification | 83.3 [73.7–89.9]% | 84.6 [75.2–90.9]% | 65.4 [54.6–74.8]%
Pericardial and epicardial fat | 85.9 [76.7–91.9]% | 85.9 [76.7–91.9]% | 82.1 [72.4–88.9]%
Ascending and descending thoracic aorta measurement | 89.7 [81.2–94.6]% | 89.7 [81.2–94.6]% | 88.5 [79.7–93.8]%
Adrenal gland lesions | 75.6 [65.2–83.7]% | 75.6 [65.2–83.7]% | 70.5 [59.8–79.3]%
Pericardial effusion | 74.4 [63.9–82.6]% | 75.6 [65.2–83.7]% | 67.9 [57.1–77.1]%
Pulmonary nodules | 66.7 [55.9–76.0]% | 67.9 [57.1–77.1]% | 66.7 [55.9–76.0]%
Pulmonary opacification | 73.1 [62.6–81.5]% | 73.1 [62.6–81.5]% | 65.4 [54.6–74.8]%
Mean | 81.28 [71.4–88.3]% | 81.78 [72.0–88.7]% | 74.11 [63.6–82.4]%
Images
Image 1. Visual interface of the program. A – summary series (separate DICOM-SC series). B – vertebral compression and bone density (1st slice of the general DICOM-SC series). C – main DICOM-SC axial series. D – summary reports (DICOM-SR series).
Literature
1. Vasilev, Y., et al., AI-Based CXR First Reading: Current Limitations to Ensure Practical Value.
Diagnostics, 2023. 13(8): p. 1430.
2. Bernstein, M.H., et al., Can incorrect artificial intelligence (AI) results impact radiologists, and if
so, what can we do about it? A multi-reader pilot study of lung cancer detection with chest
radiography. Eur Radiol, 2023. 33(11): p. 8263-8269.
3. Siepmann, R., et al., The virtual reference radiologist: comprehensive AI assistance for clinical
image reading and interpretation. European Radiology, 2024. 34(10): p. 6652-6666.
4. Hazra, A., Using the confidence interval confidently. J Thorac Dis, 2017. 9(10): p. 4125-4130.
5. Agresti, A. and B. Coull, Approximate is Better than “Exact” for Interval Estimation of Binomial
Proportions. The American Statistician, 1998. 52: p. 119-126.