Figure 5
Typical data sets used for development and testing of an artificial intelligence algorithm. AI, artificial intelligence.
Source publication
Artificial intelligence (AI) will likely affect various fields of medicine. This article aims to explain the fundamental principles of clinical validation, device approval, and insurance coverage decisions of AI algorithms for medical diagnosis and prediction. Discrimination accuracy of AI algorithms is often evaluated with the Dice similarity coef...
Similar publications
Introduction: The need for accurate three-dimensional data on anatomical structures is increasing in the surgical field. The development of convolutional neural networks (CNNs) has been helping to fill this gap by providing efficient tools to clinicians. Nonetheless, the lack of fully accessible datasets and open-source algorithms is slowi...
Citations
... Typically, AI-device approval emphasises technical performance rather than determining its benefit for patient care improvement [81]. Pathology and laboratory medicine are experiencing a transformative shift, positioning pathologists and laboratory scientists at the forefront of patient care. This evolution redefines their role beyond diagnostic specialisation to active involvement in predicting disease risk, assessing prognosis, guiding treatment decisions and overseeing patient follow-up post-treatment. ...
Clinical laboratory testing advances have improved diagnostic accuracy, emphasising the need for effective methodologies within evidence-based medicine. Evaluation of methods is an essential function of the monitoring system that guarantees the continuous excellence of a clinical laboratory. This mini-review addresses the basic principles of method evaluation in clinical laboratories, highlighting key factors that support accurate diagnoses. The systematic process of method evaluation involves validation and verification, while considering scope, parameters and regulatory compliance. Analytical performance characteristics, sensitivity, specificity, accuracy, precision, linearity, method comparison, matrix effect and carryover determine test suitability, requiring reference materials and standardisation for accuracy. This comprehensive process includes assessing test performance when there are alterations in reagents, instruments or methodologies. Furthermore, the success of the evaluation procedure hinges on the careful definition of experimental design and subsequent statistical analysis, emphasising their intimate connection, which must be established before initiation. Current trends highlight automation and the transformative impact of artificial intelligence (AI), reshaping the role of laboratory professionals in predicting disease risk and guiding treatment decisions. While advances are celebrated, emphasis is given to the inherent limits of evaluation and clinical methods. Challenges persist, including variations in test accuracy and inconsistencies between laboratories. To address these, proposed solutions include clear guidance, harmonising terminology, setting minimum standards and standardising study design components.
... Image segmentation was considered suboptimal if a myocardium with a valid shape was not produced or if areas other than the LV myocardium were included. Second, we used the Dice similarity coefficient (DSC) to measure the degree of overlap between automated segmentation and the reference standard [14]. ...
Objective:
T1 mapping provides valuable information regarding cardiomyopathies. Manual drawing is time-consuming and prone to subjective errors. Therefore, this study aimed to test a deep learning (DL) algorithm for the automated measurement of native T1 and extracellular volume (ECV) fractions in cardiac magnetic resonance (CMR) imaging with a temporally separated dataset.
Materials and methods:
CMR images obtained from 95 participants (mean age ± standard deviation, 54.5 ± 15.2 years), including 36 with left ventricular hypertrophy (12 hypertrophic cardiomyopathy, 12 Fabry disease, and 12 amyloidosis), 32 with dilated cardiomyopathy, and 27 healthy volunteers, were included. A commercial DL algorithm based on a 2D U-Net (Myomics-T1 software, version 1.0.0) was used for the automated analysis of T1 maps. Four radiologists, as study readers, performed manual analysis. The reference standard was the consensus result of the manual analysis by two additional expert readers. The segmentation performance of the DL algorithm and the correlation and agreement between the automated measurements and the reference standard were assessed. Interobserver agreement among the four radiologists was also analyzed.
Results:
The DL algorithm successfully segmented the myocardium in 99.3% of slices in the native T1 map and 89.8% of slices in the post-T1 map, with Dice similarity coefficients of 0.86 ± 0.05 and 0.74 ± 0.17, respectively. Native T1 and ECV showed strong correlation and agreement between DL and the reference standard: for T1, r = 0.967 (95% confidence interval [CI], 0.951-0.978) and a bias of 9.5 msec (95% limits of agreement [LOA], -23.6 to 42.6 msec); for ECV, r = 0.987 (95% CI, 0.980-0.991) and a bias of 0.7% (95% LOA, -2.8% to 4.2%) on a per-subject basis. Agreement between DL and each of the four radiologists was excellent (intraclass correlation coefficient [ICC] of 0.98-0.99 for both native T1 and ECV), comparable to the pairwise agreement between the radiologists (ICC of 0.97-1.00 and 0.99-1.00 for native T1 and ECV, respectively).
Conclusion:
The DL algorithm allowed automated T1 and ECV measurements comparable to those of radiologists.
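For readers unfamiliar with the agreement metrics quoted above, the following is a minimal sketch, assuming NumPy and purely hypothetical arrays, of how a Dice similarity coefficient and a Bland-Altman bias with 95% limits of agreement are commonly computed; it is illustrative only and not the pipeline used in the study.

```python
import numpy as np

def dice_coefficient(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|) for binary masks."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def bland_altman(auto: np.ndarray, reference: np.ndarray):
    """Bias (mean difference) and 95% limits of agreement between two methods."""
    diff = auto - reference
    bias = diff.mean()
    half_width = 1.96 * diff.std(ddof=1)
    return bias, (bias - half_width, bias + half_width)

# Hypothetical example: two 2D segmentation masks and paired T1 measurements.
rng = np.random.default_rng(0)
seg_auto = rng.random((128, 128)) > 0.5
seg_ref = rng.random((128, 128)) > 0.5
print("DSC:", dice_coefficient(seg_auto, seg_ref))

t1_auto = rng.normal(1000, 50, size=30)
t1_ref = t1_auto + rng.normal(10, 15, size=30)
print("bias, 95% LOA:", bland_altman(t1_auto, t1_ref))
```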
... diagnostic cohort studies [81] and two were clinical trials, with cohort sizes ranging between 86 and 29,138 patients (median: 524). Other characteristics (including imaging purpose, imaging modality, imaging target, and whether a subgroup analysis was performed) are provided in the Supplement. ...
Objective:
"Diagnostic yield," also referred to as the detection rate, is a parameter positioned between diagnostic accuracy and diagnosis-related patient outcomes in research studies that assess diagnostic tests. Unfamiliarity with the term may lead to incorrect usage and delivery of information. Herein, we evaluate the level of proper use of the term "diagnostic yield" and its related parameters in articles published in Radiology and Korean Journal of Radiology (KJR).
Materials and methods:
Potentially relevant articles published since 2012 in these journals were identified using MEDLINE and PubMed Central databases. The initial search yielded 239 articles. We evaluated whether the correct definition and study setting of "diagnostic yield" or "detection rate" were used and whether the articles also reported companion parameters for false-positive results. We calculated the proportion of articles that correctly used these parameters and evaluated whether the proportion increased with time (2012-2016 vs. 2017-2022).
Results:
Among 39 eligible articles (19 from Radiology and 20 from KJR), 17 (43.6%; 11 from Radiology and 6 from KJR) correctly defined "diagnostic yield" or "detection rate." The remaining 22 articles used "diagnostic yield" or "detection rate" with incorrect meanings such as "diagnostic performance" or "sensitivity." The proportion of correctly used diagnostic terms was higher in the studies published in Radiology than in those published in KJR (57.9% vs. 30.0%). The proportion improved with time in Radiology (33.3% vs. 80.0%), whereas no improvement was observed in KJR over time (33.3% vs. 27.3%). The proportion of studies reporting companion parameters was similar between journals (72.7% vs. 66.7%), and no considerable improvement was observed over time.
Conclusion:
Overall, a minority of articles accurately used "diagnostic yield" or "detection rate." Incorrect usage of the terms was more frequent without improvement over time in KJR than in Radiology. Therefore, improvements are required in the use and reporting of these parameters.
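To illustrate the distinction the article is concerned with, here is a small arithmetic sketch with invented counts: diagnostic yield (detection rate) is computed over all patients who undergo the test, whereas sensitivity is computed over diseased patients only.

```python
# Hypothetical cohort: 1000 patients tested, 120 actually have the target disease.
tested = 1000
diseased = 120
true_positives = 100      # disease correctly detected by the test
false_positives = 30      # companion parameter for false-positive results

# Diagnostic yield (detection rate): diagnoses established per patient tested.
diagnostic_yield = true_positives / tested          # 0.10  -> 10%

# Sensitivity: diagnoses established per diseased patient.
sensitivity = true_positives / diseased             # ~0.83 -> 83%

print(f"diagnostic yield = {diagnostic_yield:.1%}, sensitivity = {sensitivity:.1%}")
```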
... Techniques in the field are already being streamlined into production at various stages of the life cycle of pharmaceutical products [2], namely drug discovery [3-5], pharmaceutical development [6], and clinical trial development and monitoring [7,8]. The use of AI, and specifically ML algorithms, depends on high volumes of data [9]. Nonetheless, regulatory agencies still mostly accept data acquired through RCTs. ...
Introduction:
Randomized clinical trials (RCTs) are limited in how well they reflect results observable outside controlled settings, which requires further lengthy observational studies. The use of real-world data (RWD) has recently been considered a viable alternative to overcome these issues and complement certain clinical conclusions. Transcriptomics and other high-throughput data contain a molecular description of medical conditions and disease states. When linked to RWD, including demographic information, transcriptomics data can elucidate nuances in disease pathways in specific patient populations. This work focuses on the construction of a patient repository database with clinical information resulting from the integration of publicly available transcriptomics datasets.
Results:
Patient samples were integrated into the repository using a new post-processing technique that allows the combined use of samples originating from different Gene Expression Omnibus (GEO) datasets. RWD were mined from the GEO samples' metadata, and a clinical and demographic characterization of the database was obtained. Our post-processing technique, which we have called MACAROON, aims to uniformize and integrate transcriptomics data (accounting for batch effects and possible processing-originated artefacts). This process reproduced downstream biological conclusions better, with a 10% improvement compared with other available methods. RWD mining was done through a manually curated synonym dictionary, allowing the correct assignment of medical conditions (95.33% median accuracy).
Conclusion:
Our strategy produced an RWD repository that includes molecular information together with clinical and demographic RWD. Exploring these data helps shed light on clinical outcomes and pathways specific to predetermined patient populations by integrating multiple public datasets.
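As a hedged illustration of synonym-dictionary-based condition assignment of the kind described above (the dictionary entries, metadata strings, and function name below are hypothetical and do not reflect MACAROON's actual implementation):

```python
# A minimal sketch of assigning a medical condition to a sample from its
# free-text metadata using a manually curated synonym dictionary.
synonyms = {
    "type 2 diabetes": ["t2d", "type ii diabetes", "niddm", "type 2 diabetes"],
    "breast cancer": ["breast carcinoma", "mammary tumor", "breast cancer"],
}

def assign_condition(metadata_text: str):
    """Return the first condition whose synonym appears in the metadata, else None."""
    text = metadata_text.lower()
    for condition, terms in synonyms.items():
        if any(term in text for term in terms):
            return condition
    return None

print(assign_condition("Primary breast carcinoma biopsy, ER+"))  # -> "breast cancer"
print(assign_condition("Peripheral blood, healthy donor"))       # -> None
```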
... Some publications address individual problems in compiling test datasets in pathology, e.g., how to avoid bias in the performance evaluation caused by site-specific image features in test datasets [29]. Other publications provide general recommendations for evaluating AI methods for medical applications without considering the specific challenges of pathology [30-34]. ...
... The present paper focuses exclusively on compiling test datasets. For advice on other issues related to validating AI solutions in pathology, such as how to select an appropriate performance metric, how to make algorithmic results interpretable, or how to conduct a clinical performance evaluation with end users, we also refer to other works [30,31,33,34,133,134]. ...
Artificial intelligence (AI) solutions that automatically extract information from digital histology images have shown great promise for improving pathological diagnosis. Prior to routine use, it is important to evaluate their predictive performance and obtain regulatory approval. This assessment requires appropriate test datasets. However, compiling such datasets is challenging and specific recommendations are missing. A committee of various stakeholders, including commercial AI developers, pathologists, and researchers, discussed key aspects and conducted extensive literature reviews on test datasets in pathology. Here, we summarize the results and derive general recommendations on compiling test datasets. We address several questions: Which and how many images are needed? How to deal with low-prevalence subsets? How can potential bias be detected? How should datasets be reported? What are the regulatory requirements in different countries? The recommendations are intended to help AI developers demonstrate the utility of their products and to help pathologists and regulatory agencies verify reported performance measures. Further research is needed to formulate criteria for sufficiently representative test datasets so that AI solutions can operate with less user intervention and better support diagnostic workflows in the future.
... Participation in planning the National Artificial Intelligence Strategy is essential. This task aims to develop Indonesian human resources in the field of AI. Park et al. (2021) assessed that Indonesia should focus on applying AI in every industrial sector to support the national system. This step is viewed as able to build Indonesia's competitiveness in tackling various issues in each sector, including protection. ...
This study examines various published sources to complement the discussion of legal literacy studies on applying artificial intelligence in health insurance and public services, with regard to the challenges and opportunities in consumer protection. Several studies have been published on artificial intelligence, and some address health, but the legal status of consumer protection for technology-based insurance has not been established; this gap motivates the present research. We obtained data from several electronic searches and analyzed them to answer the research problem. Our approach was to first analyze the data against the research questions, then search and review the data electronically, involving a process of data coding, interpretation, and in-depth evaluation. Based on the discussion of the results, we conclude that the use of AI applications in public health insurance services supports the implementation of health insurance, including highly transparent data handling through appropriately designed algorithms.
... First, although our DL model was developed and validated using two public datasets and one private dataset, it was not evaluated with external validation. The clinical usefulness of our DL model should be further evaluated by external validation [32]. Second, our DL model focused on the three-category classification of COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy. ...
This retrospective study aimed to develop and validate a deep learning model for the classification of coronavirus disease-2019 (COVID-19) pneumonia, non-COVID-19 pneumonia, and the healthy using chest X-ray (CXR) images. One private and two public datasets of CXR images were included. The private dataset included CXR images from six hospitals. A total of 14,258 and 11,253 CXR images were included in the two public datasets, and 455 in the private dataset. A deep learning model based on EfficientNet with noisy student training was constructed using the three datasets. The test set of 150 CXR images in the private dataset was evaluated by the deep learning model and six radiologists. Three-category classification accuracy and class-wise area under the curve (AUC) for COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy were calculated. The consensus of the six radiologists was used for calculating class-wise AUC. The three-category classification accuracy of our model was 0.8667, and those of the six radiologists ranged from 0.5667 to 0.7733. For our model and the consensus of the six radiologists, the class-wise AUCs for the healthy, non-COVID-19 pneumonia, and COVID-19 pneumonia were 0.9912, 0.9492, and 0.9752 and 0.9656, 0.8654, and 0.8740, respectively. The difference in class-wise AUC between our model and the consensus of the six radiologists was statistically significant for COVID-19 pneumonia (p = 0.001334). Thus, an accurate deep learning model for the three-category classification could be constructed; the diagnostic performance of our model was significantly better than that of the consensus interpretation by the six radiologists for COVID-19 pneumonia.
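Class-wise AUC for a three-category classifier of this kind is typically computed in a one-vs-rest fashion; the sketch below, assuming scikit-learn and hypothetical labels and softmax probabilities, shows one common way to do it (it is not the study's actual evaluation code).

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

classes = ["healthy", "non-COVID-19 pneumonia", "COVID-19 pneumonia"]

# Hypothetical ground-truth labels and softmax probabilities for 6 images.
y_true = np.array([0, 1, 2, 2, 0, 1])
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.2, 0.2, 0.6],
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
])

# One-vs-rest: binarize the labels and score each class column separately.
y_bin = label_binarize(y_true, classes=[0, 1, 2])
for i, name in enumerate(classes):
    auc = roc_auc_score(y_bin[:, i], y_prob[:, i])
    print(f"AUC ({name}): {auc:.3f}")
```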
... Over the course of nearly a decade, DOD funded and supported multiple research contracts that resulted in several subsequent FDA clearances; in these contracts, the company followed a robust, evidence-based regulatory pathway to FDA clearance and the eventual introduction of a novel, handheld, ruggedized, multimodal medical device incorporating AI-derived algorithms to aid the clinician in objective, rapid assessment and diagnosis of mTBI. These AI-derived algorithms, of course, did not come without risk; clinical validation was necessary to prove accuracy [7]. The PPP between DOD and the company further illustrated a financial symbiosis and mutual leveraging that benefitted the healthcare mission of each party. ...
Given the convergence of the long and challenging development path for medical devices with the need for diagnostic capabilities for mild traumatic brain injury (mTBI/concussion), the effective role of public-private partnership (PPP) can be demonstrated to yield Food and Drug Administration (FDA) clearances and innovative product introductions. An overview of the mTBI problem and landscape was conducted, and a detailed situation analysis of an example of a PPP yielding an innovative product was presented. This example of a PPP has led to multiple FDA clearances and product introductions in the TBI diagnostic product category, where there was an urgent military and public need. Important lessons included defining the primary public and military health objective for new product introduction; the importance of the government-academia-industry PPP triad with a "collaboration towards solutions" Quality-by-Design (QbD) mindset to assure clinical validity with regulatory compliance; the development of device comparators and the integration of measurements into a robust, evidence-based statistical and FDA pathway; and the utility of top-down, flexible, practical action while operating within governmental guidelines and maintaining patient safety.
... Two of them were correctly diagnosed by all the radiologists. Nevertheless, the relatively low sensitivity in the external test set is reasonable, as overestimating the model's performance during internal validation due to overfitting is a well-known problem in deep learning [35]. ...
Objective:
To develop and evaluate a deep learning-based artificial intelligence (AI) model for detecting skull fractures on plain radiographs in children.
Materials and methods:
This retrospective multi-center study consisted of a development dataset acquired from two hospitals (n = 149 and 264) and an external test set (n = 95) from a third hospital. Datasets included children with head trauma who underwent both skull radiography and cranial computed tomography (CT). The development dataset was split into training, tuning, and internal test sets in a ratio of 7:1:2. The reference standard for skull fracture was cranial CT. Two radiology residents, a pediatric radiologist, and two emergency physicians participated in a two-session observer study on an external test set with and without AI assistance. We obtained the area under the receiver operating characteristic curve (AUROC), sensitivity, and specificity along with their 95% confidence intervals (CIs).
Results:
The AI model showed an AUROC of 0.922 (95% CI, 0.842-0.969) in the internal test set and 0.870 (95% CI, 0.785-0.930) in the external test set. The model had a sensitivity of 81.1% (95% CI, 64.8%-92.0%) and specificity of 91.3% (95% CI, 79.2%-97.6%) for the internal test set and 78.9% (95% CI, 54.4%-93.9%) and 88.2% (95% CI, 78.7%-94.4%), respectively, for the external test set. With the model's assistance, a significant AUROC improvement was observed in the radiology residents (pooled results) and emergency physicians (pooled results), with differences from reading without AI assistance of 0.094 (95% CI, 0.020-0.168; p = 0.012) and 0.069 (95% CI, 0.002-0.136; p = 0.043), respectively, but not in the pediatric radiologist, with a difference of 0.008 (95% CI, -0.074 to 0.090; p = 0.850).
Conclusion:
A deep learning-based AI model improved the performance of inexperienced radiologists and emergency physicians in diagnosing pediatric skull fractures on plain radiographs.
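The 95% confidence intervals reported for metrics such as AUROC in studies like this one are often estimated by bootstrapping the test set; the following minimal sketch, with made-up labels and scores, illustrates a percentile bootstrap under that assumption (it is not the authors' code).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUROC."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # resample must contain both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)

# Hypothetical external test set: 1 = fracture present, 0 = no fracture.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=95)
scores = np.clip(labels * 0.4 + rng.normal(0.4, 0.2, size=95), 0, 1)
print(bootstrap_auc_ci(labels, scores))
```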
Commodification and thinning margins of insurers have increased competition in the industry. Accessing accurate data and analysing it meaningfully is becoming vital. The rise of digital technologies driven by the fourth industrial revolution (Industry 4.0) enables this but also poses new risks to insurers. The literature is evolving, and most reviews have focused on technologies or insurance value chain aspects. This systematic review of research on digital technologies in insurance discusses their benefits, enablers and inhibitors with specific reference to Industry 4.0–driven changes and identifies opportunities and imminent changes in the industry. This article discusses directions for future research.