Fig 3 - uploaded by Davide Chicco
a Example of Precision-Recall (PR) curve, with the precision score on the y axis and the recall score on the x axis (Tip 8). The grey area is the area under the PR curve (AUPRC). b Example of receiver operating characteristic (ROC) curve, with the recall (true positive rate) score on the y axis and the fallout (false positive rate) score on the x axis (Tip 8). The grey area is the area under the ROC curve (AUROC).
Source publication
Machine learning has become a pivotal tool for many projects in computational biology, bioinformatics, and health informatics. Nevertheless, beginners and biomedical researchers often do not have enough experience to run a data mining project effectively, and can therefore follow incorrect practices that may lead to common mistakes or over-optimis...
Contexts in source publication
Context 1
... you will end up having a real valued array for each FN, TN, FP, TP classes. To measure the quality of your performance, you will be able to choose between two common curves, of which you will be able to compute the area under the curve (AUC): receiver operating characteristic (ROC) curve (Fig. 3a), and Precision-Recall (PR) curve (Fig. 3b) ...
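The two curves named in this excerpt can be sketched in a few lines of Python. This is a minimal illustration, not the source publication's code: the function names `roc_pr_points` and `trapezoid_area` are hypothetical, labels are assumed to be 0/1 with at least one example of each class, and a higher score is assumed to mean "more likely positive".

```python
def roc_pr_points(y_true, scores):
    """Sweep the decision threshold from high to low.

    Returns (roc, pr): roc holds (false positive rate, true positive rate)
    pairs and pr holds (recall, precision) pairs.
    Assumes 0/1 labels and at least one positive and one negative.
    """
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp = fp = 0
    roc = [(0.0, 0.0)]
    pr = []
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Examples sharing the same score cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        roc.append((fp / neg, tp / pos))
        pr.append((tp / pos, tp / (tp + fp)))
    return roc, pr


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc, pr = roc_pr_points(y_true, scores)
auroc = trapezoid_area(roc)
```

The trapezoidal rule is the usual choice for AUROC. For the PR curve, linear interpolation between points is only a rough approximation, so the sketch returns the PR points without computing an area.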
Citations
... R-square shows the degree to which the dependent variable is explained by the independent variables (Chicco, 2021). Table 4.5 shows that 69.9% (R-square = 0.699) of the variation in competitive performance of the banking sector is explained by their internal and external CSR activities. ...
... Cross-validation analysis: K-fold cross-validation [35] was used to evaluate the performance of different models. The K-fold method involved randomly dividing the training dataset into k parts without replacement: k-1 parts were used for training the model, and the remaining part was used for testing. ...
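The split described in this excerpt can be sketched as follows. This is an illustrative sketch only, not the cited study's code: `k_fold_indices` is a hypothetical helper, and the fixed seed and round-robin fold assignment are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Indices are shuffled once and dealt into k disjoint folds, so every
    sample appears in exactly one test fold (no replacement).
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # The remaining k-1 folds form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=10, k=5))
```

Each of the k iterations trains on k-1 folds and tests on the held-out fold. The k test scores are then typically averaged.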
Cardiotoxicity, which leads to irreversible myocardial damage, is a major adverse effect associated with chemotherapy. The electrocardiogram (ECG) is an inexpensive, rapid, and simple tool that may provide valuable diagnostic information pertinent to cardiotoxicity. Automatic interpretation and classification of ECG signals by machine learning algorithms is considered superior to human interpretation, which may fail to detect subtle ECG alterations early and varies with the experience of the specialist. The present work aimed at using different machine learning algorithms to classify ECG signals recorded from doxorubicin-injected rats. Rats were divided into four groups and each group was intraperitoneally injected with a different cumulative dose of doxorubicin (0, 6, 12, and 18 mg/kg). ECG signal classification depended on multiple features that were extracted from the recorded signals under different conditions. The k-nearest neighbors algorithm achieved higher classification accuracy (99.83%) than random forest (99.56%), decision tree (99.54%), artificial neural network (99.50%), and support vector machine (99.38%). Furthermore, the dose-dependent cardiotoxicity was validated via a histopathological examination of the left ventricle that indicated significant pathological alterations in the cardiac tissue. The present findings emphasized the potential of machine learning-based detection of cardiotoxicity and validated the dose-dependent toxicity of doxorubicin in the cardiac left ventricle. This approach might be applicable clinically to avoid cardiotoxicity in chemotherapy-treated patients.
... In the context of clinical validation using a MRMC study design, the above two examples rarely occur. The population prevalence of a disease is often low (< 10%), and dataset imbalance is expected [11][12][13]. Therefore, accurate disease detection requires strong performance on all four conditional probabilities, not just sensitivity and specificity. ...
Background
In medical device validation and verification studies, the area under the receiver operating characteristic curve (AUROC) is often used as a primary endpoint despite multiple reports showing its limitations. Hence, researchers are encouraged to consider alternative metrics as primary endpoints. A new metric called G4 is presented, which is the geometric mean of sensitivity, specificity, the positive predictive value, and the negative predictive value. G4 is part of a balanced metric family which includes the Unified Performance Measure (also known as P4) and the Matthews’ Correlation Coefficient (MCC). The purpose of this manuscript is to unveil the benefits of using G4 together with the balanced metric family when analyzing the overall performance of binary classifiers.
Results
Simulated datasets encompassing different prevalence rates of the minority class were analyzed under a multi-reader-multi-case study design. In addition, data from an independently published study that tested the performance of a unique ultrasound artificial intelligence algorithm in the context of breast cancer detection was also considered. Within each dataset, AUROC was reported alongside the balanced metric family for comparison. When the dataset prevalence and bias of the minority class approached 50%, all three balanced metrics provided equivalent interpretations of an AI’s performance. As the prevalence rate increased / decreased and the data became more imbalanced, AUROC tended to overvalue / undervalue the true classifier performance, while the balanced metric family was resistant to such imbalance. Under certain circumstances where data imbalance was strong (minority-class prevalence < 10%), MCC was preferred for standalone assessments while P4 provided a stronger effect size when evaluating between-groups analyses. G4 acted as a middle ground for maximizing both standalone assessments and between-groups analyses.
Conclusions
Use of AUROC as the primary endpoint in binary classification problems provides misleading results as the dataset becomes more imbalanced. This is explicitly noticed when incorporating AUROC in medical device validation and verification studies. G4, P4, and MCC do not share this limitation and paint a more complete picture of a medical device’s performance in a clinical setting. Therefore, researchers are encouraged to explore the balanced metric family when evaluating binary classification problems.
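The abstract above defines G4 as the geometric mean of sensitivity, specificity, the positive predictive value, and the negative predictive value. A minimal Python sketch of that definition follows. The function name `g4_score` is hypothetical, and returning 0.0 when a component is undefined is an assumption made here, not something the abstract specifies.

```python
def g4_score(tp, fp, tn, fn):
    """Geometric mean of sensitivity, specificity, PPV, and NPV.

    Assumption for this sketch: if any component has a zero denominator,
    the score is reported as 0.0.
    """
    denominators = (tp + fn, tn + fp, tp + fp, tn + fn)
    if any(d == 0 for d in denominators):
        return 0.0
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return (sensitivity * specificity * ppv * npv) ** 0.25


# Imbalanced example: 10 positives, 90 negatives.
score = g4_score(tp=8, fp=2, tn=88, fn=2)
```

Because the geometric mean is a product, a low value in any one of the four components pulls the score down.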
... Finally, we have the best parameters for the tuned MLP in Figure 2c, which were obtained using a grid-search algorithm with a 10-fold cross-validation [13,14]. We found that the Huber Loss function [15], which is a combination of Mean Squared Error (MSE) and Mean Absolute Error (MAE) losses, worked the best for our model. ...
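The Huber loss mentioned in this excerpt behaves like a squared error for small residuals and like an absolute error for large ones. A minimal sketch of the standard definition follows. The function name `huber_loss` and the default `delta=1.0` are assumptions for illustration; the excerpt does not state which threshold the authors used.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss over paired targets and predictions.

    Quadratic (MSE-like) for |residual| <= delta,
    linear (MAE-like) for |residual| > delta.
    """
    total = 0.0
    for target, prediction in zip(y_true, y_pred):
        residual = abs(target - prediction)
        if residual <= delta:
            total += 0.5 * residual * residual
        else:
            total += delta * (residual - 0.5 * delta)
    return total / len(y_true)


small = huber_loss([0.0], [0.5])   # quadratic branch
large = huber_loss([0.0], [3.0])   # linear branch
```

The linear branch limits the influence of outliers, while the quadratic branch keeps the loss smooth near zero.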
Artificial intelligence (AI) has the ability to predict rheological properties and constituent composition of 3D-printed materials with appropriately trained models. However, these models are not currently available for use. In this work, we trained deep learning (DL) models to (1) predict the rheological properties, such as the storage (G') and loss (G") moduli, of 3D-printed polyacrylamide (PAA) substrates, and (2) predict the composition of materials and associated 3D printing parameters for a desired pair of G' and G". We employed a multilayer perceptron (MLP) and successfully predicted G' and G" from seven gel constituent parameters in a multivariate regression process. We used a grid-search algorithm along with 10-fold cross-validation to tune the hyperparameters of the MLP, and found the R² value to be 0.89. Next, we adopted two generative DL models named variational autoencoder (VAE) and conditional variational autoencoder (CVAE) to learn data patterns and generate constituent compositions. With these generative models, we produced synthetic data with the same statistical distribution as the real data of actual hydrogel fabrication, which was then validated using Student's t-test and an autoencoder (AE) anomaly detector. We found that none of the seven generated gel constituents were significantly different from the real data. Our trained DL models were successful in mapping the input-output relationship for the 3D-printed hydrogel substrates, which can predict multiple variables from a handful of input variables and vice versa.
... Furthermore, the accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC) were calculated. The MCC is suitable for the imbalanced dataset because this metric considers the ratio of confusion matrix size [54]. We used AUC and MCC as performance indices throughout this paper. ...
Autism spectrum disorder (ASD) is a lifelong condition with elusive biological mechanisms. The complexity of factors, including inter-site and developmental differences, hinders the development of a generalizable neuroimaging classifier for ASD. Here, we developed a classifier for ASD using a large-scale, multisite resting-state fMRI dataset of 730 Japanese adults, aiming to capture neural signatures that reflect pathophysiology at the functional network level, neurotransmitters, and clinical symptoms of the autistic brain. Our adult ASD classifier was successfully generalized to adults in the United States, Belgium, and Japan. The classifier further demonstrated its successful transportability to children and adolescents. The classifier contained 141 functional connections (FCs) that were important for discriminating individuals with ASD from typically developing controls. These FCs and their terminal brain regions were associated with difficulties in social interaction and dopamine and serotonin, respectively. Finally, we mapped attention-deficit/hyperactivity disorder (ADHD), schizophrenia (SCZ), and major depressive disorder (MDD) onto the biological axis defined by the ASD classifier. ADHD and SCZ, but not MDD, were located proximate to ASD on the biological dimensions. Our results revealed functional signatures of the ASD brain, grounded in molecular characteristics and clinical symptoms, achieving generalizability and transportability applicable to the evaluation of the biological continuity of related diseases.
... The effectiveness of these tools depends on how well the human user is capable of building and exploiting them [12]. Current literature provides general guidelines on using ML models in different areas like chemical science, COVID-19 data, etc. [13][14][15][16], which focus more on input data, leakage, reproducibility, class imbalance, parameter tuning, and choosing an appropriate metric. Unfortunately, there is a notable absence of best practices concerning the creation of high-quality training datasets, the evaluation of trained models, and the calibration for interpreting model performance in real-world situations. ...
Author summary
Artificial Intelligence (AI) and Machine Learning (ML) models are increasingly deployed on biomedical and health data to shed light on biological mechanisms, predict disease outcomes, and support clinical decision-making. However, ensuring model validity is challenging. The 10 quick tips described here discuss useful practices on how to check AI/ML models from 2 perspectives: the user and the developer.
... Ranges [−1, 1] and produces a high score if the model obtained good results in all boxes of the confusion matrix [58,59]. 2. Receiver Operating Characteristic (ROC) Curve: Another metric extremely useful in classification models is the Area Under the Receiver Operating Characteristic (ROC) curve. The ROC curve represents a graphical depiction of a classification model's performance using True Positive Rate TP/(TP + FN) vs. False Positive Rate FP/(FP + TN) across various threshold levels. ...
In this paper, we present Iterative Classification of Graph-Set-Based Design (IC-GSBD), a framework utilizing graph-based techniques with geometric deep learning (GDL) integrated within a set-based design (SBD) approach for the classification and down-selection of complex engineering systems represented by graphs. We demonstrate this approach on aircraft thermal management systems (TMSs) utilizing previous datasets created using an enumeration or brute-force graph generation procedure to represent novel aircraft TMSs as graphs. However, as with many enumerative approaches, combinatorial explosion limits its efficacy in many real-world problems, particularly when simulations and optimization must be performed on the many (automatically generated) physics models. Therefore, the approach uses the directed graphs representing aircraft TMSs and GDL to predict on a subset of the graph-based dataset through graph classification. This paper's findings demonstrate that incorporating additional graph-based features using principal component analysis (PCA) enhances GDL model performance, achieving an accuracy of 98% for determining a graph's compilability and simulatability while using only 5% of the data for training. By applying iterative classification methods, we also successfully segmented the total set of graphs into more specific groups with an average inclusion of 75.5 of the top 100 highest-performing graphs, achieved by training on 40% of the data.
... 34 Preprocessing of heterogeneous datasets is a critical step in every data analysis. 35,36 This includes the aspect of completeness and clearing data from empty fields. Sufficient and well-prepared data is essential for accurate AI models or their validation. ...
... Sufficient and well-prepared data is essential for accurate AI models or their validation. 35 Scientific integrity can function practically by adhering to standards encompassing reproducibility next to objectivity, clarity, and utility. 37 While FAIR repositories are one step toward reproducibility, specialized tools are necessary to ensure the portability of software and system dependencies for code execution. ...
... An imbalance can lead to a problematic dataset ratio. 35 In the exemplary single dataset used for the qualitative comparison of features it has been observed that TERT mutations are underrepresented compared to wildtype TERT. While connections between specific features have been observed as for the example of IDH1, others could not be replicated. ...
Objective
Data sharing promotes scientific progress. However, not all data can be shared freely due to privacy issues. This work is intended to foster FAIR sharing of sensitive data, using the biomedical domain as an example, via an integrated computational approach that lets scientists without coding experience use and enrich individual datasets.
Methods
We present an in silico pipeline for openly sharing controlled materials by generating synthetic data. Additionally, it addresses the issue of inexperience with computational methods in a non-IT-affine domain by making use of a cyberinfrastructure that runs and enables sharing of computational notebooks without the need for local software installation. The use of a digital twin based on cancer datasets serves as an exemplary use case for making biomedical data openly available. Quantitative and qualitative validation of model output as well as a study on user experience are conducted.
Results
The metadata approach describes generalizable descriptors for computational models, and outlines how to profit from existing data resources for validating computational models. A virtual lab book, cooperatively developed using a cloud-based data management and analysis system, serves as a showcase enabling easy interaction between users. Qualitative testing revealed a need for comprehensive guidelines to further acceptance by various users.
Conclusion
The introduced framework presents an integrated approach for data generation and interpolating incomplete data, promoting Open Science through reproducibility of results and methods. The system can be expanded from the biomedical to any other domain while future studies integrating an enhanced graphical user interface could increase interdisciplinary applicability.
... More specifically, the binary (RF) classifier was evaluated by computing the 2 × 2 confusion matrix (CM) and metrics derived from CM such as accuracy, recall (sensitivity), specificity, and MCC (the Matthews correlation coefficient). MCC is generally regarded as being one of the best measures of describing the confusion matrix of true and false positives and negatives by a single number [52][53][54]. The MCC ranges from −1 to 1, where 1 indicates a perfect prediction, 0 no better than a random prediction, and −1 indicates total disagreement between prediction and observation. ...
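The metrics listed in this excerpt all follow directly from the four cells of the 2 × 2 confusion matrix. A minimal sketch follows. The function name `confusion_metrics` is hypothetical, the sketch assumes at least one actual positive and one actual negative, and setting MCC to 0.0 when its denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, recall, specificity, and MCC from confusion-matrix counts.

    Assumes tp + fn > 0 and tn + fp > 0. MCC is set to 0.0 when its
    denominator is zero (a convention, not part of the excerpt).
    """
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


perfect = confusion_metrics(tp=5, fp=0, tn=5, fn=0)
chance = confusion_metrics(tp=25, fp=25, tn=25, fn=25)
inverted = confusion_metrics(tp=0, fp=5, tn=0, fn=5)
```

The three example calls correspond to the three reference points named in the excerpt: perfect prediction, random-level prediction, and total disagreement.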
Purpose
Irritable bowel syndrome (IBS) is a diagnosis defined by gastrointestinal (GI) symptoms like abdominal pain and changes associated with defecation. The condition is classified as a disorder of the gut-brain interaction (DGBI), and patients with IBS commonly experience psychological distress. The present study focuses on this distress, defined from reports of fatigue, anxiety, depression, sleep disturbances, and performance on cognitive tests. The aim was to investigate the joint contribution of these features of psychological distress in predicting IBS versus healthy controls (HCs) and to disentangle clinically meaningful subgroups of IBS patients.
Methods
IBS patients (n=49) and HCs (n=28) completed the Chalder Fatigue Scale (CFQ), the Hospital Anxiety and Depression Scale (HADS), and the Bergen Insomnia Scale (BIS), and performed tests of memory function and attention from the Repeatable Battery for the Assessment of Neuropsychological Status (RBANS). An initial exploratory data analysis was followed by supervised (Random Forest) and unsupervised (K-means) classification procedures.
Results
The exploratory data analysis showed that the group of IBS patients obtained significantly more severe scores than HCs on all included measures, with the strongest pairwise correlation between fatigue and a quality measure of sleep disturbances. The supervised classification model correctly predicted membership in the IBS group in 80% of the cases in a test set of unseen data. Two methods for calculating feature importance in the test set gave mental and physical fatigue and anxiety the strongest weights. An unsupervised procedure with K=3 showed that one cluster contained 24% of the patients and all but two HCs. In the two other clusters, the IBS members were overall more impaired, with the following differences. One of the two clusters showed more severe cognitive problems and anxiety symptoms than the other, which experienced more severe problems related to the quality of sleep and fatigue. The three clusters did not differ on a severity measure of IBS or on age.
Conclusion
The results showed that psychological distress is an integral component of IBS symptomatology. The study should inspire future longitudinal studies to further dissect clinical patterns of IBS to improve the assessment and personalized treatment for this and other patient groups defined as disorders of the gut-brain interaction. The project is registered at https://classic.clinicaltrials.gov/ct2/show/NCT04296552 20/05/2019.
... Big data, characterized by large and complex datasets from various resources, is effectively managed by AI (38). Data preprocessing involves adjusting datasets and removing problematic data points (39). ML involves training AI to classify inputs by learning from correct input-output behavior (40). ...
This paper explores the transformative impact of artificial intelligence (AI), particularly machine learning (ML), on diagnosing and treating hearing loss, which affects over 5% of the global population across all ages and demographics. AI encompasses various applications, from natural language processing models like ChatGPT to image recognition systems; however, this paper focuses on ML, a subfield of AI that can revolutionize audiology by enhancing early detection, formulating personalized rehabilitation plans, and integrating electronic health records for streamlined patient care. The integration of ML into audiometry, termed "computational audiology," allows for automated, accurate hearing tests. AI algorithms can process vast data sets, provide detailed audiograms, and facilitate early detection of hearing impairments. Research shows ML's effectiveness in classifying audiograms, conducting automated audiometry, and predicting hearing loss based on noise exposure and genetics. These advancements suggest that AI can make audiological diagnostics and treatment more accessible and efficient. The future of audiology lies in the seamless integration of AI technologies. Collaborative efforts between audiologists, AI experts, and individuals with hearing loss are essential to overcome challenges and leverage AI's full potential. Continued research and development will enhance AI applications in audiology, improving patient outcomes and quality of life worldwide.