Article · PDF Available

AI-Based CXR First Reading: Current Limitations to Ensure Practical Value

MDPI
Diagnostics
Authors:
  • Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department

Abstract

We performed a multicenter external evaluation of the practical and clinical efficacy of a commercial AI algorithm for chest X-ray (CXR) analysis (Lunit INSIGHT CXR). A retrospective evaluation was performed with a multi-reader study. For a prospective evaluation, the AI model was run on CXR studies; the results were compared to the reports of 226 radiologists. In the multi-reader study, the area under the curve (AUC), sensitivity, and specificity of the AI were 0.94 (CI95%: 0.87–1.0), 0.9 (CI95%: 0.79–1.0), and 0.89 (CI95%: 0.79–0.98); the AUC, sensitivity, and specificity of the radiologists were 0.97 (CI95%: 0.94–1.0), 0.9 (CI95%: 0.79–1.0), and 0.95 (CI95%: 0.89–1.0). In most regions of the ROC curve, the AI performed slightly worse than, or on par with, an average human reader. The McNemar test showed no statistically significant differences between the AI and the radiologists. In the prospective study with 4752 cases, the AUC, sensitivity, and specificity of the AI were 0.84 (CI95%: 0.82–0.86), 0.77 (CI95%: 0.73–0.80), and 0.81 (CI95%: 0.80–0.82). The lower accuracy values obtained during the prospective validation were mainly associated with false-positive findings considered by experts to be clinically insignificant and with the false-negative omission of human-reported “opacity”, “nodule”, and “calcification” findings. In this large-scale prospective validation of a commercial AI algorithm in clinical practice, lower sensitivity and specificity values were obtained than in the prior retrospective evaluation of data from the same population.
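As a rough illustration of how this kind of comparison can be set up (not the authors' analysis code), the Python sketch below computes AUC, sensitivity, and specificity with percentile-bootstrap 95% CIs and applies a McNemar test to paired AI and radiologist calls. All arrays (y_true, ai_score, rad_label) are hypothetical placeholders, not the study data.

```python
# Minimal sketch of the evaluation described above (illustrative data only).
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)

def sens_spec(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

def bootstrap_ci(metric_fn, y_true, y_hat, n_boot=2000, alpha=0.05):
    """Percentile bootstrap 95% CI for a scalar metric."""
    idx = np.arange(len(y_true))
    stats = []
    for _ in range(n_boot):
        s = rng.choice(idx, size=len(idx), replace=True)
        stats.append(metric_fn(y_true[s], y_hat[s]))
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

# Hypothetical example data: ground truth, AI scores, radiologist binary calls
y_true = rng.integers(0, 2, 200)
ai_score = np.clip(y_true * 0.6 + rng.normal(0.3, 0.25, 200), 0, 1)
ai_label = (ai_score >= 0.5).astype(int)
rad_label = np.where(rng.random(200) < 0.9, y_true, 1 - y_true)

auc = roc_auc_score(y_true, ai_score)
se, sp = sens_spec(y_true, ai_label)
print(f"AI AUC={auc:.2f}, sensitivity={se:.2f}, specificity={sp:.2f}")
print("AUC 95% CI:", bootstrap_ci(roc_auc_score, y_true, ai_score))

# McNemar test on paired correctness of AI vs. radiologist
ai_correct, rad_correct = ai_label == y_true, rad_label == y_true
table = [[np.sum(ai_correct & rad_correct), np.sum(ai_correct & ~rad_correct)],
         [np.sum(~ai_correct & rad_correct), np.sum(~ai_correct & ~rad_correct)]]
print("McNemar p-value:", mcnemar(table, exact=True).pvalue)
```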
... One primary obstacle is the resource-intensive nature of manual labeling for large datasets. This process is not only costly but also extremely time-consuming, potentially limiting the scalability of AI models [12]. Furthermore, attempts to automate label extraction from radiology reports have proven challenging due to the nuanced nature of medical terminology, where semantically similar words can lead to misinterpretation, and the frequent occurrence of incomplete annotated data [12]. ...
Article
Full-text available
Background/Objectives: This study investigated the diagnostic capabilities of two AI-based tools, M4CXR (research-only version) and ChatGPT-4o, in chest X-ray interpretation. M4CXR is a specialized cloud-based system using advanced large language models (LLMs) for generating comprehensive radiology reports, while ChatGPT, built on the GPT-4 architecture, offers potential in settings with limited radiological expertise. Methods: This study evaluated 826 anonymized chest X-ray images from Inha University Hospital. Two experienced radiologists independently assessed the performance of M4CXR and ChatGPT across multiple diagnostic parameters. The evaluation focused on diagnostic accuracy, false findings, location accuracy, count accuracy, and the presence of hallucinations. Interobserver agreement was quantified using Cohen’s kappa coefficient. Results: M4CXR consistently demonstrated superior performance compared to ChatGPT across all evaluation metrics. For diagnostic accuracy, M4CXR achieved approximately 60–62% acceptability ratings compared to ChatGPT’s 42–45%. Both systems showed high interobserver agreement rates, with M4CXR generally displaying stronger consistency. Notably, M4CXR showed better performance in anatomical localization (76–77.5% accuracy) compared to ChatGPT (36–36.5%) and demonstrated fewer instances of hallucination. Conclusions: The findings highlight the complementary potential of these AI technologies in medical diagnostics. While M4CXR shows stronger performance in specialized radiological analysis, the integration of both systems could potentially optimize diagnostic workflows. This study emphasizes the role of AI in augmenting human expertise rather than replacing it, suggesting that a combined approach leveraging both AI capabilities and clinical judgment could enhance patient care outcomes.
... In this study, radiologists' sensitivity increased from 0.51 to 0.60, while specificity remained unchanged at 0.96 [23]. Nevertheless, when another commercial AI algorithm was evaluated, no significant difference in sensitivity or specificity was found between the AI algorithm and the radiologists [24]. The algorithm's sensitivity was 0.90 and its specificity 0.89, while the sensitivity of the radiologists' reports was 0.90 and their specificity 0.95. ...
Article
Lung cancer is one of the leading causes of morbidity and mortality worldwide. Lung cancer risk factors are numerous and include heredity, active and passive smoking, environmental and occupational exposures, and other conditions such as chronic obstructive pulmonary disease and HIV infection. Due to prolonged asymptomatic progression, lung cancer is often diagnosed at advanced stages, which significantly worsens the prognosis for these patients. Screening programs have been established to reduce mortality from lung cancer. These involve radiology modalities such as fluorography, radiography, and computed tomography. However, despite their widespread use and availability, fluorography and radiography have technical limitations in detecting small neoplasms. Computed tomography is the most informative method for detecting lung cancer, and many countries have used low-dose computed tomography as a screening method. Some such programs have proven to be effective. Artificial intelligence algorithms can act as an additional tool to improve screening, and additional research to identify the optimal group of patients at risk may increase the effectiveness of screening programs.
... Radiology as a data-driven specialty can greatly benefit from these models in both interpretive and non-interpretive uses, from report generation to streamlining administrative processes [3], [4]. Task-specific specialised models such as those that can detect radiologic findings on chest x-rays [5] or do real-time detection and triaging of cerebral haemorrhage cases on CT [6], [7] are already being seen in clinical practice. ...
Preprint
Full-text available
Background: Publicly available artificial intelligence (AI) Visual Language Models (VLMs) are constantly improving. The advent of vision capabilities on these models could enhance workflows in radiology. Evaluating their performance in radiological image interpretation is vital to their potential integration into practice. Aim: This study aims to evaluate the proficiency and consistency of the publicly available VLMs, Claude and GPT, across multiple iterations in basic image interpretation tasks. Method: Subsets from publicly available datasets, ROCOv2 and MURAv1.1, were used to evaluate 6 VLMs. A system prompt and image were input into each model three times. The outputs were compared to the dataset captions to evaluate each model's accuracy in recognising the modality, anatomy, and detecting fractures on radiographs. The consistency of the output across iterations was also analysed. Results: Evaluation of the ROCOv2 dataset showed high accuracy in modality recognition, with some models achieving 100%. Anatomical recognition ranged between 61% and 85% accuracy across all models tested. On the MURAv1.1 dataset, Claude-3.5-Sonnet had the highest anatomical recognition with 57% accuracy, while GPT-4o had the best fracture detection with 62% accuracy. Claude-3.5-Sonnet was the most consistent model, with 83% and 92% consistency in anatomy and fracture detection, respectively. Conclusion: Given Claude and GPT's current accuracy and reliability, integration of these models into clinical settings is not yet feasible. This study highlights the need for ongoing development and establishment of standardised testing techniques to ensure these models achieve reliable performance.
... In a prospective multicenter study on the performance of chest X-rays using a commercial AI, Vasilev et al. [56] found a kappa value between AI and radiologists of 0.42 (CI 0.38-0.45). Though images were only classified as either normal or pathological, results are comparable to our kappa values, which range between 0.39 and 0.63. ...
Article
Full-text available
Background: The integration of artificial intelligence (AI) into radiology aims to improve diagnostic accuracy and efficiency, particularly in settings with limited access to expert radiologists and in times of personnel shortage. However, challenges such as insufficient validation in actual real-world settings or automation bias should be addressed before implementing AI software in clinical routine. Methods: This cross-sectional study in a maximum care hospital assesses the concordance between diagnoses made by a commercial AI-based software and conventional radiological reading augmented by AI for four major thoracic pathologies on chest X-ray: fracture, pleural effusion, pulmonary nodule, and pneumonia. Chest radiographs of 1506 patients (median age 66 years, 56.5% men) consecutively obtained between January and August 2023 were re-evaluated by the AI software InferRead DR Chest®. Results: Overall, the AI software detected thoracic pathologies more often than radiologists (18.5% vs. 11.1%). In detail, it detected fractures, pneumonia, and nodules more frequently than radiologists, while radiologists identified pleural effusions more often. Reliability was highest for pleural effusions (0.63, 95%-CI 0.58–0.69), indicating good agreement, and lowest for fractures (0.39, 95%-CI 0.32–0.45), indicating moderate agreement. Conclusions: The tested software shows a high detection rate, particularly for fractures, pneumonia, and nodules, but thereby produces a non-negligible number of false positives. Thus, AI-based software shows promise in enhancing diagnostic accuracy; however, cautious interpretation and human oversight remain crucial.
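Agreement of this kind is typically quantified with Cohen's kappa per finding. The short Python sketch below is purely illustrative (simulated labels, not the study's data) and shows one way such per-pathology kappa values can be computed.

```python
# Illustrative only: per-pathology agreement between AI and radiologist calls,
# quantified with Cohen's kappa. All labels below are simulated placeholders.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)
pathologies = ["fracture", "pleural_effusion", "nodule", "pneumonia"]

for name in pathologies:
    radiologist = rng.integers(0, 2, 1500)               # 0 = absent, 1 = present
    # Simulate an AI that agrees with the radiologist most of the time
    ai = np.where(rng.random(1500) < 0.85, radiologist, 1 - radiologist)
    kappa = cohen_kappa_score(radiologist, ai)
    print(f"{name}: kappa = {kappa:.2f}")
```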
Article
Background Publicly available artificial intelligence (AI) Vision Language Models (VLMs) are constantly improving. The advent of vision capabilities on these models could enhance radiology workflows. Evaluating their performance in radiological image interpretation is vital to their potential integration into practice. Aim This study aims to evaluate the proficiency and consistency of the publicly available VLMs, Anthropic's Claude and OpenAI's GPT, across multiple iterations in basic image interpretation tasks. Method Subsets from publicly available datasets, ROCOv2 and MURAv1.1, were used to evaluate 6 VLMs. A system prompt and image were input into each model three times. The outputs were compared to the dataset captions to evaluate each model's accuracy in recognising the modality, anatomy, and detecting fractures on radiographs. The consistency of the output across iterations was also analysed. Results Evaluation of the ROCOv2 dataset showed high accuracy in modality recognition, with some models achieving 100%. Anatomical recognition ranged between 61% and 85% accuracy across all models tested. On the MURAv1.1 dataset, Claude‐3.5‐Sonnet had the highest anatomical recognition with 57% accuracy, while GPT‐4o had the best fracture detection with 62% accuracy. Claude‐3.5‐Sonnet was the most consistent model, with 83% and 92% consistency in anatomy and fracture detection, respectively. Conclusion Given Claude and GPT's current accuracy and reliability, the integration of these models into clinical settings is not yet feasible. This study highlights the need for ongoing development and establishment of standardised testing techniques to ensure these models achieve reliable performance.
Article
Full-text available
Background: The implementation of radiological artificial intelligence (AI) solutions remains challenging due to limitations in existing testing methodologies. This study assesses the efficacy of a comprehensive methodology for performance testing and monitoring of commercial-grade mammographic AI models. Methods: We utilized a combination of retrospective and prospective multicenter approaches to evaluate a neural network based on the Faster R-CNN architecture with a ResNet-50 backbone, trained on a dataset of 3641 mammograms. The methodology encompassed functional and calibration testing, coupled with routine technical and clinical monitoring. Feedback from testers and radiologists was relayed to the developers, who made updates to the AI model. The test dataset encompassed 593,365 studies from 112 medical organizations, representing 10 manufacturers of mammography equipment. The evaluation metrics included the area under the curve (AUC), accuracy, sensitivity, specificity, technical defects, and clinical assessment scores. Results: The results demonstrated significant enhancement of the AI model's performance through collaborative efforts among developers, testers, and radiologists. Notable improvements included functionality, diagnostic accuracy, and technical stability. Specifically, the AUC rose by 24.7% (from 0.73 to 0.91), accuracy improved by 15.6% (from 0.77 to 0.89), sensitivity grew by 37.1% (from 0.62 to 0.85), and specificity increased by 10.7% (from 0.84 to 0.93). The average proportion of technical defects declined from 9.0% to 1.0%, while the clinical assessment score improved from 63.4 to 72.0. Following 2 years and 9 months of testing, the AI solution was integrated into the compulsory health insurance system. Conclusions: The multi-stage, lifecycle-based testing methodology demonstrated substantial potential for software enhancement and integration into clinical practice. Key elements of this methodology include robust functional and diagnostic requirements, continuous testing and updates, systematic feedback collection from testers and radiologists, and prospective monitoring.
Article
Full-text available
BACKGROUND: We propose a novel model for processing chest radiographs acquired through population screening, using classification by an autonomous artificial intelligence (AI) medical device operating at a sensitivity of 1.0 (95% CI 1.0–1.0). The system categorizes screening examinations (chest X-ray and photofluorography) into two groups: 'abnormal' and 'normal'. The abnormal category encompasses all deviations (i.e., pathological conditions, post-surgical changes, age-related variations, and congenital features) to be reviewed by radiologists. The normal category includes studies without pathological findings that implicitly do not require radiologist interpretation. AIM: This study aims to evaluate the feasibility and effectiveness of autonomous AI classification of chest radiography studies acquired through population screening. METHODS: We conducted a prospective, multicenter diagnostic study to assess the safety and performance of autonomous AI classification of screening chest radiography studies. The study analyzed 575,549 screening images acquired on fluorography and radiography machines and processed by three different AI-powered medical devices. Results were assessed using statistical and analytical methods. RESULTS: The autonomous AI system classified 54.8% of screening chest radiographs as 'normal', proportionally reducing the radiologist workload that would otherwise have been spent on interpretation and reporting. The autonomous system demonstrated 99.95% accuracy in classification. Clinically significant discrepancies occurred in 0.05% of cases (95% CI: 0.04, 0.06%). CONCLUSION: This study demonstrates the clinical and economic effectiveness of autonomous classification of screening chest radiographs by AI-powered medical devices. Future steps should focus on regulatory framework updates to legitimize the implementation of AI-powered medical devices for specific preventive screening tasks.
Preprint
Full-text available
This study analyzed the potential clinical benefit of the xAID Chest software in a non-balanced selection of cases, as assessed by board-certified radiologists from four different European countries.
Article
Full-text available
Background: Chest X-ray examination was one of the first areas of radiology in which artificial intelligence began to be used, and it remains in active use today. However, when interpreting these studies with artificial intelligence, radiologists still face a number of limitations daily that must be taken into account when forming a medical opinion, and that developers need to address in order to further improve the algorithms and increase their efficiency. Aims: To identify limitations of currently available artificial intelligence services for chest X-ray examinations and promising directions for their further development. Materials and methods: A retrospective analysis was performed of 155 cases of disagreement between the conclusions of artificial intelligence services and radiologists' reports on chest X-ray examinations. All cases included in the study were obtained from the Unified Radiological Information Service of the Unified Medical Information and Analytical System of Moscow. Results: Among the 155 analyzed cases of disagreement, 48 (31.0%) were false positives and 78 (50.3%) were false negatives. The remaining 29 (18.7%) cases were excluded from further study because they turned out to be true positives (27) or true negatives (2). Among the 48 false-positive cases, the majority (93.8%) arose because the artificial intelligence service mistook normal anatomical structures of the chest (97.8% of these cases) or a catheter shadow (2.2%) for pneumothorax. Among the false-negative studies, the proportion of missed clinically significant pathologies was 22.0%; almost half of these cases (44.4%) were associated with missed lung nodules. The most common clinically insignificant pathology was calcifications in the lungs (60.9%). Conclusions: The AI services showed a tendency towards overdiagnosis. All false-positive cases were associated with erroneous detection of clinically significant pathology: pneumothorax, lung nodules, and pulmonary consolidation. Among false-negative cases, the proportion of missed clinically significant pathology was small, amounting to less than one-fourth.
Article
Full-text available
Limitations of the chest X-ray (CXR) have resulted in attempts to create machine learning systems to assist clinicians and improve interpretation accuracy. An understanding of the capabilities and limitations of modern machine learning systems is necessary for clinicians as these tools begin to permeate practice. This systematic review aimed to provide an overview of machine learning applications designed to facilitate CXR interpretation. A systematic search strategy was executed to identify research into machine learning algorithms capable of detecting >2 radiographic findings on CXRs published between January 2020 and September 2022. Model details and study characteristics, including risk of bias and quality, were summarized. Initially, 2248 articles were retrieved, with 46 included in the final review. Published models demonstrated strong standalone performance and were typically as accurate, or more accurate, than radiologists or non-radiologist clinicians. Multiple studies demonstrated an improvement in the clinical finding classification performance of clinicians when models acted as a diagnostic assistance device. Device performance was compared with that of clinicians in 30% of studies, while effects on clinical perception and diagnosis were evaluated in 19%. Only one study was prospectively run. On average, 128,662 images were used to train and validate models. Most classified less than eight clinical findings, while the three most comprehensive models classified 54, 72, and 124 findings. This review suggests that machine learning devices designed to facilitate CXR interpretation perform strongly, improve the detection performance of clinicians, and improve the efficiency of radiology workflow. Several limitations were identified, and clinician involvement and expertise will be key to driving the safe implementation of quality CXR machine learning systems.
Article
Full-text available
Due to its widespread availability, low cost, feasibility at the patient’s bedside and accessibility even in low-resource settings, chest X-ray is one of the most requested examinations in radiology departments. Whilst it provides essential information on thoracic pathology, it can be difficult to interpret and is prone to diagnostic errors, particularly in the emergency setting. The increasing availability of large chest X-ray datasets has allowed the development of reliable Artificial Intelligence (AI) tools to help radiologists in everyday clinical practice. AI integration into the diagnostic workflow would benefit patients, radiologists, and healthcare systems in terms of improved and standardized reporting accuracy, quicker diagnosis, more efficient management, and appropriateness of the therapy. This review article aims to provide an overview of the applications of AI for chest X-rays in the emergency setting, emphasizing the detection and evaluation of pneumothorax, pneumonia, heart failure, and pleural effusion.
Article
Full-text available
BACKGROUND: In radiology, important information can be found not only in medical images but also in the accompanying text reports created by radiologists. Identifying reports that contain certain data and extracting these data can be useful primarily for clinical problems; however, given the large volume of such data, machine analysis algorithms are needed. AIM: To estimate the possibilities and limitations of using a tool for machine processing of radiology reports to search for pathological findings. MATERIALS AND METHODS: To create an algorithm for automatic analysis of radiology reports, use cases were selected from the 2020 experiment on the use of innovative computer vision technologies for the analysis of medical images. Mammography, chest X-ray, chest computed tomography (CT), and low-dose CT (LDCT) were among the use cases performed in Moscow. A dictionary of keywords was compiled. After the automatic labeling of the reports by the developed tool, the results were assessed by a radiologist. The number of reports analyzed by the radiologist for training and validation of the algorithms was 977 for mammography, 4,804 for chest X-ray, 4,074 for chest CT, and 398 for chest LDCT. For the final testing of the developed algorithms, test datasets of 1,032 studies for mammography, 544 for chest X-ray, 5,000 for chest CT, and 1,082 for chest LDCT were additionally labeled. RESULTS: The best results were achieved in the search for viral pneumonia in chest CT reports (accuracy 0.996, sensitivity 0.998, and specificity 0.989) and breast cancer in mammography reports (accuracy 1.0, sensitivity 1.0, and specificity 1.0). When searching for signs of lung cancer, the metrics were accuracy 0.895, sensitivity 0.829, and specificity 0.936; when searching for pathological changes in the chest organs in radiography and fluorography reports, they were accuracy 0.912, sensitivity 1.000, and specificity 0.844. CONCLUSIONS: Machine methods can classify radiology reports of mammography and of chest CT with viral pneumonia with high accuracy. The achieved accuracy is sufficient for automatically comparing the conclusions of physicians and artificial intelligence models when searching for signs of lung cancer in chest CT and LDCT and for pathological findings in chest X-ray.
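To give a sense of what keyword-based report labeling looks like in practice, here is a toy Python sketch (not the tool evaluated in the study): the keyword dictionary and negation list are illustrative placeholders, and real systems handle terminology and negation far more thoroughly.

```python
# Toy sketch of keyword-based labelling of radiology reports (illustrative only).
import re

KEYWORDS = {
    "viral_pneumonia": [r"ground[- ]glass", r"viral pneumonia", r"covid"],
    "lung_cancer": [r"\bmass\b", r"\bnodule\b", r"neoplasm"],
}
NEGATIONS = [r"\bno\b", r"\bwithout\b", r"not seen"]

def label_report(text: str) -> dict:
    """Return {finding: True/False} based on non-negated keyword hits."""
    labels = {}
    sentences = re.split(r"[.;\n]", text.lower())
    for finding, patterns in KEYWORDS.items():
        hit = False
        for s in sentences:
            if any(re.search(p, s) for p in patterns) and \
               not any(re.search(n, s) for n in NEGATIONS):
                hit = True
                break
        labels[finding] = hit
    return labels

print(label_report("Bilateral ground-glass opacities; no discrete nodule."))
# -> {'viral_pneumonia': True, 'lung_cancer': False}
```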
Article
Full-text available
In medical practice, chest X-rays are the most ubiquitous diagnostic imaging tests. However, the current workload in large health care facilities and the lack of well-trained radiologists are significant challenges in the patient care pathway. Therefore, an accurate, reliable, and fast computer-aided diagnosis (CAD) system capable of detecting abnormalities in chest X-rays is crucial to improving the radiological workflow. In this prospective multicenter quality-improvement study, we evaluated whether artificial intelligence (AI) can be used as a chest X-ray screening tool in real clinical settings. Methods: A team of radiologists used the AI-based chest X-ray screening tool (qXR) as part of their daily reporting routine to report consecutive chest X-rays for this prospective multicentre study. The study took place in a large radiology network in India between June 2021 and March 2022. Results: A total of 65,604 chest X-rays were processed during the study period. The overall performance of the AI in detecting normal and abnormal chest X-rays was good, and a high negative predictive value (NPV) of 98.9% was achieved. The AI performance in terms of area under the curve (AUC) and NPV for the corresponding sub-abnormalities was: blunted CP angle (0.97, 99.5%), hilar dysmorphism (0.86, 99.9%), cardiomegaly (0.96, 99.7%), reticulonodular pattern (0.91, 99.9%), rib fracture (0.98, 99.9%), scoliosis (0.98, 99.9%), atelectasis (0.96, 99.9%), calcification (0.96, 99.7%), consolidation (0.95, 99.6%), emphysema (0.96, 99.9%), fibrosis (0.95, 99.7%), nodule (0.91, 99.8%), opacity (0.92, 99.2%), pleural effusion (0.97, 99.7%), and pneumothorax (0.99, 99.9%). Additionally, the turnaround time (TAT) decreased by about 40.63% from the pre-qXR period to the post-qXR period. Conclusions: The AI-based chest X-ray solution (qXR) screened chest X-rays and assisted in ruling out normal patients with high confidence, allowing the radiologists to focus more on assessing pathology on abnormal chest X-rays and on treatment pathways.
Article
Full-text available
Importance: The efficient and accurate interpretation of radiologic images is paramount. Objective: To evaluate whether a deep learning-based artificial intelligence (AI) engine used concurrently can improve reader performance and efficiency in interpreting chest radiograph abnormalities. Design, setting, and participants: This multicenter cohort study was conducted from April to November 2021 and involved radiologists, including attending radiologists, thoracic radiology fellows, and residents, who independently participated in 2 observer performance test sessions. The sessions included a reading session with AI and a session without AI, in a randomized crossover manner with a 4-week washout period in between. The AI produced a heat map and the image-level probability of the presence of the referrable lesion. The data used were collected at 2 quaternary academic hospitals in Boston, Massachusetts: Beth Israel Deaconess Medical Center (The Medical Information Mart for Intensive Care Chest X-Ray [MIMIC-CXR]) and Massachusetts General Hospital (MGH). Main outcomes and measures: The ground truths for the labels were created via consensual reading by 2 thoracic radiologists. Each reader documented their findings in a customized report template, in which the 4 target chest radiograph findings and the reader's confidence in the presence of each finding were recorded. The time taken to report each chest radiograph was also recorded. Sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) were calculated for each target finding. Results: A total of 6 radiologists (2 attending radiologists, 2 thoracic radiology fellows, and 2 residents) participated in the study. The study involved a total of 497 frontal chest radiographs: 247 from the MIMIC-CXR data set (demographic data for patients were not available) and 250 chest radiographs from MGH (mean [SD] age, 63 [16] years; 133 men [53.2%]), from adult patients with and without 4 target findings (pneumonia, nodule, pneumothorax, and pleural effusion). The target findings were found in 351 of 497 chest radiographs. The AI was associated with higher sensitivity for all findings compared with the readers (nodule, 0.816 [95% CI, 0.732-0.882] vs 0.567 [95% CI, 0.524-0.611]; pneumonia, 0.887 [95% CI, 0.834-0.928] vs 0.673 [95% CI, 0.632-0.714]; pleural effusion, 0.872 [95% CI, 0.808-0.921] vs 0.889 [95% CI, 0.862-0.917]; pneumothorax, 0.988 [95% CI, 0.932-1.000] vs 0.792 [95% CI, 0.756-0.827]). AI-aided interpretation was associated with significantly improved reader sensitivities for all target findings, without negative impacts on specificity. Overall, the AUROCs of readers improved for all 4 target findings, with significant improvements in the detection of pneumothorax and nodule. The reporting time with AI was 10% lower than without AI (36.9 seconds with AI vs 40.8 seconds without; difference, 3.9 seconds; 95% CI, 2.9-5.2 seconds; P < .001). Conclusions and relevance: These findings suggest that AI-aided interpretation was associated with improved reader performance and efficiency in identifying major thoracic findings on a chest radiograph.
Article
Full-text available
Background Chest x-rays are the most commonly used type of x-rays today, accounting for up to 26% of all radiographic tests performed. However, chest radiography is a complex imaging modality to interpret. Several studies have reported discrepancies in chest x-ray interpretations among emergency physicians and radiologists. It is of vital importance to be able to offer a fast and reliable diagnosis for this kind of x-ray, using artificial intelligence (AI) to support the clinician. Oxipit has developed an AI algorithm for reading chest x-rays, available through a web platform called ChestEye. This platform is an automatic computer-aided diagnosis system in which the submitted chest x-ray is read and an automatic report is returned, with the capacity to detect 75 pathologies, covering 90% of diagnoses. Objective The overall objective of the study is to perform validation with prospective data of the ChestEye algorithm as a diagnostic aid. We wish to validate the algorithm for a single pathology and for multiple pathologies by evaluating the accuracy, sensitivity, and specificity of the algorithm. Methods A prospective validation study will be carried out to compare the diagnosis of the reference radiologists for the users attending the primary care center in the Osona region (Spain) with the diagnosis of the ChestEye AI algorithm. Anonymized chest x-ray images will be acquired and fed into the AI algorithm interface, which will return an automatic report. A radiologist will evaluate the same chest x-ray, and both assessments will be compared to calculate the precision, sensitivity, specificity, and accuracy of the AI algorithm. Results will be represented globally and individually for each pathology using a confusion matrix and the One-vs-All methodology. Results Patient recruitment was conducted from February 7, 2022, and it is expected that data can be obtained in 5 to 6 months. By June 2022, more than 450 x-rays had been collected, so it is expected that 600 samples will be gathered in July 2022. We hope to obtain sufficient evidence to demonstrate that the use of AI in the reading of chest x-rays can be a good tool for diagnostic support. However, there is a decreasing number of radiology professionals and, therefore, it is necessary to develop and validate tools to support professionals who have to interpret these tests. Conclusions If the results of the validation of the model are satisfactory, it could be implemented as a support tool and allow an increase in the accuracy and speed of diagnosis, patient safety, and agility in the primary care system, while reducing the cost of unnecessary tests. International Registered Report Identifier (IRRID) PRR1-10.2196/39536
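The One-vs-All evaluation mentioned in this protocol reduces to computing one binary confusion matrix per pathology. The Python sketch below illustrates the idea on simulated multiclass labels (not study data); the pathology names and agreement rate are placeholders.

```python
# Sketch of a one-vs-all evaluation per pathology (illustrative data only).
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

labels = ["pneumonia", "effusion", "nodule", "normal"]
rng = np.random.default_rng(2)
y_true = rng.integers(0, 4, 600)                        # reference diagnosis
y_pred = np.where(rng.random(600) < 0.8, y_true,        # AI agrees ~80% of the time
                  rng.integers(0, 4, 600))

# One confusion matrix per class: [[TN, FP], [FN, TP]]
for lab, cm in zip(labels, multilabel_confusion_matrix(y_true, y_pred)):
    (tn, fp), (fn, tp) = cm
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / cm.sum()
    print(f"{lab}: sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```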
Article
Full-text available
There have been few independent evaluations of computer-aided detection (CAD) software for tuberculosis (TB) screening, despite the rapidly expanding array of available CAD solutions. We developed a test library of chest X-ray (CXR) images which was blindly re-read by two TB clinicians with different levels of experience and then processed by 12 CAD software solutions. Using Xpert MTB/RIF results as the reference standard, we compared the performance characteristics of each CAD software against both an Expert and Intermediate Reader, using cut-off thresholds which were selected to match the sensitivity of each human reader. Six CAD systems performed on par with the Expert Reader (Qure.ai, DeepTek, Delft Imaging, JF Healthcare, OXIPIT, and Lunit) and one additional software (Infervision) performed on par with the Intermediate Reader only. Qure.ai, Delft Imaging and Lunit were the only software to perform significantly better than the Intermediate Reader. The majority of these CAD software showed significantly lower performance among participants with a past history of TB. The radiography equipment used to capture the CXR image was also shown to affect performance for some CAD software. TB program implementers now have a wide selection of quality CAD software solutions to utilize in their CXR screening initiatives.
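Matching a CAD operating point to a human reader, as done in this comparison, amounts to picking the score threshold whose sensitivity reaches the reader's measured sensitivity and then reading off the corresponding specificity. The Python sketch below illustrates this with hypothetical scores and labels; the reference standard and reader sensitivity value are placeholders, not figures from the study.

```python
# Sketch: choose a CAD cut-off that matches a human reader's sensitivity
# (hypothetical scores and labels; illustrative only).
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(3)
y_true = rng.integers(0, 2, 1000)                    # reference standard (placeholder)
cad_score = np.clip(y_true * 0.5 + rng.normal(0.3, 0.2, 1000), 0, 1)
reader_sensitivity = 0.90                            # measured for the human reader

fpr, tpr, thresholds = roc_curve(y_true, cad_score)
# First (highest) threshold whose sensitivity reaches the reader's sensitivity
idx = np.argmax(tpr >= reader_sensitivity)
print(f"threshold={thresholds[idx]:.3f} "
      f"sensitivity={tpr[idx]:.2f} specificity={1 - fpr[idx]:.2f}")
```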
Article
Full-text available
The review considers the possible use of artificial intelligence for the interpretation of chest X-rays by analyzing 45 publications. Experimental and commercial diagnostic systems for pulmonary tuberculosis, pneumonia, neoplasms and other diseases have been analyzed.
Article
Full-text available
Introduction: For the management of patients referred to respiratory triage during the early stages of the SARS-CoV-2 pandemic, either chest radiography (CXR) or computed tomography (CT) was used as the first-line diagnostic tool. The aim of this study was to compare the impact on triage, diagnosis, and prognosis of patients with suspected COVID-19 when clinical decisions were derived from reconstructed CXR or from CT. Methods: We reconstructed CXR (r-CXR) from high-resolution CT (HRCT) scans. Five clinical observers independently reviewed the clinical charts of 300 subjects with suspected COVID-19 pneumonia, integrated with either the r-CXR or the HRCT report, in two consecutive blinded and randomised sessions; clinical decisions were recorded for each session. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and prognostic value were compared between r-CXR and HRCT. The best radiological integration was also examined to develop an optimised respiratory triage algorithm. Results: Interobserver agreement was fair (Kendall's W = 0.365; p < 0.001) with the r-CXR-based protocol and good (Kendall's W = 0.654; p < 0.001) with the CT-based protocol. The NPV assisted by r-CXR (31.4%) was lower than that of HRCT (77.9%). In cases of indeterminate or typical radiological appearance for COVID-19 pneumonia, the extent of disease on r-CXR or HRCT were the only two imaging variables similarly linked to mortality in adjusted multivariable models. Conclusions: The present findings suggest that clinical triage is safely assisted by CXR. An integrated algorithm using first-line CXR and contingent use of HRCT can help optimise the management and prognostication of COVID-19.
Article
Full-text available
The application of machine learning (ML) technologies in medicine generally but also in radiology more specifically is hoped to improve clinical processes and the provision of healthcare. A central motivation in this regard is to advance patient treatment by reducing human error and increasing the accuracy of prognosis, diagnosis and therapy decisions. There is, however, also increasing awareness about bias in ML technologies and its potentially harmful consequences. Biases refer to systematic distortions of datasets, algorithms, or human decision making. These systematic distortions are understood to have negative effects on the quality of an outcome in terms of accuracy, fairness, or transparency. But biases are not only a technical problem that requires a technical solution. Because they often also have a social dimension, the 'distorted' outcomes they yield often have implications for equity. This paper assesses different types of biases that can emerge within applications of ML in radiology, and discusses in what cases such biases are problematic. Drawing upon theories of equity in healthcare, we argue that while some biases are harmful and should be acted upon, others might be unproblematic and even desirable-exactly because they can contribute to overcome inequities.