Results of MCC and Brier score for the BS1, BS2, BS3, BS4, BS5, and BS6 use cases. normMCC = (MCC + 1)/2. complBS = 1 - BS. The values of both normMCC and complBS lie in the [0, 1] interval, with the worst value equal to 0 and the best value equal to 1. We report the details of these use cases in Table 4.
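As a quick illustration of the two normalizations in this caption, here is a minimal Python sketch (the helper names are ours, not from the paper):

    def norm_mcc(mcc):
        # Map MCC from [-1, 1] onto [0, 1]: worst case 0, best case 1.
        return (mcc + 1.0) / 2.0

    def compl_bs(bs):
        # Complement the Brier score (lower BS is better) so that,
        # like normMCC, higher values mean better predictions.
        return 1.0 - bs

    # A perfect classifier (MCC = +1, BS = 0) scores 1.0 on both scales;
    # a maximally wrong one (MCC = -1, BS = 1) scores 0.0 on both.
    print(norm_mcc(1.0), compl_bs(0.0))   # 1.0 1.0
    print(norm_mcc(-1.0), compl_bs(1.0))  # 0.0 0.0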

Source publication
Article
Although measuring the outcome of binary classifications is a pivotal task in machine learning and statistics, no consensus has yet been reached about which statistical rate to employ to this end. In the last century, the computer science and statistics communities have introduced several scores summing up the correctness of the predictions with res...

Context in source publication

Context 1
... 100   0   0        -1.000   0.000   1.000
K2    0   90  10   0   -1.000  -0.220   0.780
K3    0   80  20   0   -1.000  -0.471   0.529
K4    0   70  30   0   -1.000  -0.724   0.276
K5    0   60  40   0   -1.000  -0.923   0.077
K6    0   50  50   0   -1.000  -1.000   0.000
... zero. To highlight these differences, we represent them as barplots in Figure 5. ...
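The rows above can be reproduced from their four counts. A minimal Python sketch, assuming the counts are ordered TP, FN, FP, TN (under this reading the middle metric column matches Cohen's kappa; the MCC is unaffected by swapping FN and FP here):

    import math

    def mcc(tp, fn, fp, tn):
        # Matthews correlation coefficient from the four confusion-matrix cells.
        num = tp * tn - fp * fn
        den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return num / den if den else 0.0

    def cohen_kappa(tp, fn, fp, tn):
        # Observed agreement corrected for chance agreement.
        n = tp + fn + fp + tn
        po = (tp + tn) / n
        pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
        return (po - pe) / (1 - pe)

    # K2 row: counts 0, 90, 10, 0
    print(round(mcc(0, 90, 10, 0), 3))          # -1.0
    print(round(cohen_kappa(0, 90, 10, 0), 3))  # -0.22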

Citations

... TP and TN are when the ML algorithm makes a correct classification, whereas FP and FN represent classification errors. In addition to macro-F1 and micro-F1, a micro-averaged Matthews correlation coefficient (MCC) [93] was used, ...
... are net classification outcomes, that is, sums over the outcomes after each rating class was treated as the positive class and all others as negative. The metric was resistant to imbalance in rating classes through its inclusion of all classification outcomes in the MCC [93,94]. A final metric called kappa, derived from Cohen's kappa statistic, considers the observed agreement with respect to a random-guess classifier and is interpreted as a comparison of overall accuracy (OA) to the expected accuracy (EA) under a random guess [95]: kappa = (OA - EA) / (1 - EA) ...
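A micro-averaged MCC of the kind described in these excerpts can be sketched as follows: treat each rating class in turn as the positive class, sum the four outcome counts over classes, and apply the binary MCC formula to the sums. This is one common reading of micro-averaging, not necessarily the exact procedure of [93,94]:

    import numpy as np

    def micro_mcc(cm):
        # cm: square multiclass confusion matrix; cm[i, j] counts samples
        # of true class i predicted as class j.
        cm = np.asarray(cm, dtype=float)
        n = cm.sum()
        tp = fn = fp = tn = 0.0
        for c in range(cm.shape[0]):
            # One-vs-rest outcomes with class c as the positive class.
            tp_c = cm[c, c]
            fn_c = cm[c, :].sum() - tp_c
            fp_c = cm[:, c].sum() - tp_c
            tn_c = n - tp_c - fn_c - fp_c
            tp, fn, fp, tn = tp + tp_c, fn + fn_c, fp + fp_c, tn + tn_c
        num = tp * tn - fp * fn
        den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return num / den if den else 0.0

    # Toy 3-class (e.g., 3 rating levels) confusion matrix.
    print(micro_mcc([[50, 2, 3], [4, 40, 6], [1, 2, 42]]))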
Article
Optimization of medication therapy depends on maximizing benefits and minimizing side effects of medications. This research showed how a joint approach using text mining, natural language processing, and machine learning can provide information for personalized and optimized medication therapy. Reviews on the benefits and side effects of prescription and over-the-counter medications were used to determine how well an integrated supervised and unsupervised learning method could learn medication satisfaction. Supervised learning with naïve Bayes, non-linear support vector machine with radial basis function kernels, and random forests with CART decision trees was measured by a micro-aggregated Matthews correlation coefficient and a macro-averaged F1 measure. Random forests outperformed support vector machines by almost 250% and naïve Bayes by 600% on the two evaluation metrics. All models did better with three rating levels instead of five. Topic modeling and stacked cluster analysis were coupled with parts-of-speech tagging and text mining operations to establish a robust data preprocessing procedure to eliminate noisy features from the data. Unsupervised topic modeling and clustering represented an exploratory validation of how easy supervised classification would be. Well-defined latent topics were discovered, including topics on "sleep quality", "the opportunity to get back to work", and "weight gain". Overlapping clusters revealed that incorporating more information on social, demographic, or medical history variables could improve classifier performance. This research provided evidence that medication satisfaction can be learned with carefully designed joint supervised, unsupervised, and natural language learning techniques.
... In addition, we have observed that metrics such as balanced accuracy (B. Acc), the F-measure Fβ, Cohen's kappa coefficient κ, and the Matthews correlation coefficient (MCC) have proven to be adequate in similar contexts, although they have their own limitations (Chicco et al., 2021; Chicco and Jurman, 2020; Lee et al., 2021). For this reason, Table 1 presents the results of these metrics in relation to the parameters of our model and the final results obtained in our real fire. ...
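All four metrics named in this excerpt are available in scikit-learn; a minimal usage sketch with toy labels (not the authors' data):

    from sklearn.metrics import (balanced_accuracy_score, fbeta_score,
                                 cohen_kappa_score, matthews_corrcoef)

    y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
    y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

    print(balanced_accuracy_score(y_true, y_pred))  # mean of per-class recalls
    print(fbeta_score(y_true, y_pred, beta=1.0))    # F-beta; beta=1 gives F1
    print(cohen_kappa_score(y_true, y_pred))        # agreement beyond chance
    print(matthews_corrcoef(y_true, y_pred))        # MCC, in [-1, +1]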
... Given their impact on public health, early identification of these diseases is crucial to prevent complications and save lives. These conditions are often diagnosed through invasive tests, which can be uncomfortable and costly for the patient, as well as consume significant resources from the healthcare system [Chicco and Jurman, 2020; Chicco et al., 2021]. ...
... In this work we used a public database (DB) called the Heart failure clinical records Data Set. The data were collected from 299 patients with heart failure, and machine learning (ML) algorithms were applied to predict the survival of patients with cardiovascular diseases [Chicco and Jurman, 2020; Chicco et al., 2021]. Investing in information systems (IS) ...
Article
This article introduces an approach to diagnose heart diseases utilizing the K-Nearest Neighbor algorithm and diverse correlation filters for selecting the most pertinent attributes. Results high- light that meticulous filter selection enhances survival predictions in patients with heart diseases. Employing K = 5 and correlation filter CF = 0.1, key attributes for classification were identified as anemia, high blood pressure, serum creatinine, and sex. Omitting the 'time' attribute led to information loss but was crucial to prevent biases and generalize predictions across various clinical scenarios. Utilizing these classification parameters, we designed an Android mobile application called “Heart Info System”, functioning as an artificial intelligence service. It employs the K-Nearest Neighbor algorithm with optimal parameters to evaluate the probability of survival in the progression of heart disease. The main activity of the application retrieves data from a Firebase database. While the study results show promise, the accuracy of the application may be influenced by inaccurate or incomplete input data. Nevertheless, this application has the potential to improve the early detection of heart diseases, paving the way for life-saving interventions.
... In addition, to draw more robust conclusions on the predictive performance, we also included the Adjusted Rand Index (ARI) and the complemented Brier score (1 - Brier score) [22,23]. Note that the MCC is considered more informative than both the ARI and the Brier score in binary classification evaluations [24]. When regularization techniques were applied, the definitive metric score was established by identifying the highest performance value among the three regularization techniques: Lasso, Ridge, and Elastic Net. ...
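Both supplementary scores are also a one-liner each in scikit-learn; a minimal sketch with toy values:

    from sklearn.metrics import adjusted_rand_score, brier_score_loss

    y_true = [0, 0, 1, 1, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0]
    y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2]  # predicted P(class = 1)

    print(adjusted_rand_score(y_true, y_pred))     # ARI: chance-corrected agreement
    print(1.0 - brier_score_loss(y_true, y_prob))  # 1 - Brier score: higher is better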
Article
Background: Numerous transcriptomic-based models have been developed to predict or understand the fundamental mechanisms driving biological phenotypes. However, few models have successfully transitioned into clinical practice due to challenges associated with generalizability and interpretability. To address these issues, researchers have turned to dimensionality reduction methods and have begun implementing transfer learning approaches.
Methods: In this study, we aimed to determine the optimal combination of dimensionality reduction and regularization methods for predictive modeling. We applied seven dimensionality reduction methods to various datasets, including two supervised methods (linear optimal low-rank projection and low-rank canonical correlation analysis), two unsupervised methods [principal component analysis and consensus independent component analysis (c-ICA)], and three methods [autoencoder (AE), adversarial variational autoencoder, and c-ICA] within a transfer learning framework, trained on > 140,000 transcriptomic profiles. To assess the performance of the different combinations, we used a cross-validation setup encapsulated within a permutation testing framework, analyzing 30 different transcriptomic datasets with binary phenotypes. Furthermore, we included datasets with small sample sizes and phenotypes of varying degrees of predictability, and we employed independent datasets for validation.
Results: Our findings revealed that regularized models without dimensionality reduction achieved the highest predictive performance, challenging the necessity of dimensionality reduction when the primary goal is to achieve optimal predictive performance. However, models using AE and c-ICA with transfer learning for dimensionality reduction showed comparable performance, with enhanced interpretability and robustness of predictors, compared to models using non-dimensionality-reduced data.
Conclusion: These findings offer valuable insights into the optimal combination of strategies for enhancing the predictive performance, interpretability, and generalizability of transcriptomic-based models.
... While widely used, F1 score and accuracy can lead to overly optimistic performance estimates, particularly in datasets with a positive class imbalance [84]. Previous research has demonstrated that MCC offers a more informative and reliable evaluation compared to OA [85], F1 score [85], and Cohen's kappa [86], especially when dealing with challenging imbalanced classification tasks. This is because MCC provides a more balanced assessment of classifiers, no matter which class is positive [84]. ...
Article
Accurate urban land cover information is crucial for effective urban planning and management. While convolutional neural networks (CNNs) demonstrate superior feature learning and prediction capabilities using image-level annotations, the inherent mixed-category nature of input image patches leads to classification errors along object boundaries. Fully convolutional neural networks (FCNs) excel at pixel-wise fine segmentation, making them less susceptible to heterogeneous content, but they require fully annotated dense image patches, which may not be readily available in real-world scenarios. This paper proposes an object-based semi-supervised spatial attention residual UNet (OS-ARU) model. First, multiscale segmentation is performed to obtain segments from a remote sensing image, and segments containing sample points are assigned the categories of the corresponding points, which are used to train the model. Then, the trained model predicts class probabilities for all segments. Each unlabeled segment’s probability distribution is compared against those of labeled segments for similarity matching under a threshold constraint. Through label propagation, pseudo-labels are assigned to unlabeled segments exhibiting high similarity to labeled ones. Finally, the model is retrained using the augmented training set incorporating the pseudo-labeled segments. Comprehensive experiments on aerial image benchmarks for Vaihingen and Potsdam demonstrate that the proposed OS-ARU achieves higher classification accuracy than state-of-the-art models, including OCNN, 2OCNN, and standard OS-U, reaching an overall accuracy (OA) of 87.83% and 86.71%, respectively. The performance improvements over the baseline methods are statistically significant according to the Wilcoxon Signed-Rank Test. Despite using significantly fewer sparse annotations, this semi-supervised approach still achieves comparable accuracy to the same model under full supervision. The proposed method thus makes a step forward in substantially alleviating the heavy sampling burden of FCNs (densely sampled deep learning models) to effectively handle the complex issue of land cover information identification and classification.
... The search for Vth was carried out as a sequential scan within the range of the entropy feature, selecting the value that maximized the MCC (Matthews correlation coefficient [53]). We calculated the MCC for the entire database without dividing it into test and training data, which is equivalent to calculating the MCC on training data. ...
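The threshold search described here amounts to scanning candidate values of Vth across the feature's observed range and keeping the one that maximizes the MCC. A minimal sketch of that idea (the function name and grid size are ours, not the authors' code):

    import numpy as np
    from sklearn.metrics import matthews_corrcoef

    def best_threshold(feature, labels, n_steps=1000):
        # Sequentially scan thresholds over the feature's range and keep
        # the one whose induced binary split maximizes the MCC.
        grid = np.linspace(feature.min(), feature.max(), n_steps)
        scores = [matthews_corrcoef(labels, (feature >= t).astype(int))
                  for t in grid]
        best = int(np.argmax(scores))
        return grid[best], scores[best]

    # Toy data: an entropy-like feature that separates two classes imperfectly.
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0, 1, 200), rng.normal(2, 1, 200)])
    y = np.concatenate([np.zeros(200, int), np.ones(200, int)])
    vth, score = best_threshold(x, y)
    print(vth, score)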
Article
The classification of time series using machine learning (ML) analysis and entropy-based features is an urgent task for the study of nonlinear signals in the fields of finance, biology and medicine, including EEG analysis and Brain–Computer Interfacing. As several entropy measures exist, the problem is assessing the effectiveness of entropies used as features for the ML classification of nonlinear dynamics of time series. We propose a method, called global efficiency (GEFMCC), for assessing the effectiveness of entropy features using several chaotic mappings. GEFMCC is a fitness function for optimizing the type and parameters of entropies for time series classification problems. We analyze fuzzy entropy (FuzzyEn) and neural network entropy (NNetEn) for four discrete mappings, the logistic map, the sine map, the Planck map, and the two-memristor-based map, with a base length time series of 300 elements. FuzzyEn has greater GEFMCC in the classification task compared to NNetEn. However, NNetEn classification efficiency is higher than FuzzyEn for some local areas of the time series dynamics. The results of using horizontal visibility graphs (HVG) instead of the raw time series demonstrate the GEFMCC decrease after HVG time series transformation. However, the GEFMCC increases after applying the HVG for some local areas of time series dynamics. The scientific community can use the results to explore the efficiency of the entropy-based classification of time series in “The Entropy Universe”. An implementation of the algorithms in Python is presented.
... We considered the following classifiers: Logistic Regression (LR), Support Vector Machine (SVM) [34], Decision Tree (Tree), Random Forest (RF) [35], and XGBoost (XGB) [36] (evaluation metric: logloss; objective function: binary/logistic). To evaluate the performance of the classifiers, the Matthews correlation coefficient (MCC) [38] on the cross-validated test sets was considered because of its proven ability to summarize results from contingency tables and its invariance to class swapping [39-41,60]. Specifically, the MCC can take values ranging from -1 to +1, where -1 represents the misclassification of all observations, 0 represents random association, and +1 perfect classification. ...
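Evaluating classifiers by the MCC on cross-validated test sets can be done with a custom scorer; a minimal sketch on a synthetic dataset (all parameters illustrative only):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import make_scorer, matthews_corrcoef
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=300, n_features=10, random_state=42)
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    mcc_scorer = make_scorer(matthews_corrcoef)

    scores = cross_val_score(clf, X, y, cv=5, scoring=mcc_scorer)
    print(scores.mean(), scores.std())  # MCC over the 5 held-out folds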
Article
Background: Systemic inflammatory response syndrome (SIRS) and sepsis are the most common causes of in-hospital death. However, the characteristics associated with improvement in patient conditions during the ICU stay have not been fully elucidated for each population, nor have the possible differences between the two.
Goal: The aim of this study is to highlight the differences between the prognostic clinical features for the survival of patients diagnosed with SIRS and those of patients diagnosed with sepsis by using a multi-variable predictive modeling approach with a reduced set of easily available measurements collected at admission to the intensive care unit (ICU).
Methods: Data were collected from 1,257 patients (816 non-sepsis SIRS and 441 sepsis) admitted to the ICU. We compared the performance of five machine learning models in predicting patient survival. The Matthews correlation coefficient (MCC) was used to evaluate model performance and feature importance, applying Monte Carlo stratified cross-validation.
Results: Extreme Gradient Boosting (MCC = 0.489) and Logistic Regression (MCC = 0.533) achieved the highest results for the SIRS and sepsis cohorts, respectively. In order of importance, APACHE II, mean platelet volume (MPV), eosinophil counts (EoC), and C-reactive protein (CRP) showed higher importance for predicting sepsis patient survival, whereas SOFA, APACHE II, platelet counts (PLTC), and CRP obtained higher importance in the SIRS cohort.
Conclusion: By using complete blood count parameters as predictors of ICU patient survival, machine learning models can accurately predict the survival of SIRS and sepsis ICU patients. Interestingly, feature importance highlights the role of CRP and APACHE II in both SIRS and sepsis populations. In addition, MPV and EoC are shown to be important features for the sepsis population only, whereas SOFA and PLTC have higher importance for SIRS patients.
... When the MCC results are observed, there is a difference of 4.75% between XGB and KNN in favor of the proposed model. The MCC is one of the most reliable statistical indices, yielding a high value only when the classifier performs well across all four categories of the confusion matrix [45]. The differences are significantly higher when comparing the values of the F1 score, Kappa, and DYI. ...
Article
Simple Summary: Non-alcoholic fatty liver disease (NAFLD) is the most prevalent chronic liver condition globally. The increasing incidence of NAFLD suggests that in the upcoming years, NAFLD-related hepatocellular carcinoma (HCC) is poised to become the leading cause of this type of tumor. The aim of this study is to evaluate the survival rates of these patients and identify the primary risk factors contributing to a less favorable prognosis. To accomplish this, we have employed machine learning techniques. This introduces a novel approach for identifying these factors that can be targeted to enhance the life expectancy of these patients, offering a more personalized and effective management strategy. This enhanced management approach not only aids in the optimization of patient care but also facilitates the delivery of the most effective available treatments.
Abstract: Non-alcoholic fatty liver disease (NAFLD) is the most common chronic liver disease worldwide, with an incidence that is exponentially increasing. Hepatocellular carcinoma (HCC) is the most frequent primary tumor. There is an increasing relationship between these entities due to the potential risk of developing NAFLD-related HCC and the prevalence of NAFLD. There is limited evidence regarding prognostic factors at the diagnosis of HCC. This study compares the prognosis of HCC in patients with NAFLD against other etiologies. It also evaluates the prognostic factors at the diagnosis of these patients. For this purpose, a multicenter retrospective study was conducted involving a total of 191 patients. Out of the total, 29 presented NAFLD-related HCC. The extreme gradient boosting (XGB) method was employed to develop the reference predictive model. Patients with NAFLD-related HCC showed a worse prognosis compared to other potential etiologies of HCC. Among the variables with the worst prognosis, alcohol consumption in NAFLD patients had the greatest weight within the developed predictive model. In comparison with other studied methods, XGB obtained the highest values for the analyzed metrics. In conclusion, patients with NAFLD-related HCC and alcohol consumption, obesity, cirrhosis, and clinically significant portal hypertension (CSPH) exhibited a worse prognosis than other patients. XGB developed a highly efficient predictive model for the assessment of these patients.
... The Matthews Correlation Coefficient (MCC) takes into account true positives, true negatives, false positives, and false negatives to measure the quality of the algorithm in classifying attack and non-attack. It returns a value between -1 and 1, where 1 indicates a perfect prediction, 0 represents a random prediction, and -1 indicates total disagreement between prediction and observation; it is given by Equation (21) [56]. ...
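Equation (21) itself is not reproduced in this excerpt; the standard binary MCC definition it describes is:

MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

with MCC conventionally set to 0 when the denominator vanishes.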
Article
The rise of smart cities, smart homes, and smart health powered by the Internet of Things (IoT) presents significant challenges in design, deployment, and security. The seamless data processing across a complex network of interconnected devices in unprotected conditions makes it vulnerable to potential breaches, underscoring the need for robust security at various levels of the network. Traditional security methods based on statistics often struggle to comprehend data patterns and provide the desired level of security. This work proposes a novel hybrid framework that combines Whale Optimization and Deep Learning with a trust index to identify malicious nodes engaging in various attacks such as DoS, DDoS, drop attacks, and tamper attacks, thus enhancing IoT node security. The developed framework first calculates a trust-index score for IoT nodes based on drop attack, tamper attack, replay attack, and multiple-max attack. Subsequently, it utilizes the trust-index score in the Optimized Neural Network model to effectively identify the malicious IoT node. The neural network optimization is achieved through a fitness function that determines optimal weights using the Whale Optimization Algorithm. The proposed framework has been tested across varying network sizes, comprising 100, 500, and 1000 nodes. The resulting outcomes were evaluated against benchmark security methods such as Logistic Regression, Random Forest, Support Vector Machine, Bayesian models, ANN, Elephant herding optimization, and the Lion algorithm using metrics like specificity, sensitivity, accuracy, precision, False Positive Rate, False Negative Rate, False Discovery Rate, Error, F1 score, Matthews Correlation Coefficient, and Negative Predictive Value. The results reveal a notable enhancement in accuracy (26.63%, 13.04%, 17.78%, 30.52%, 22.45%, 4.26%, and 2.24%) for a 100-node network when compared to the benchmark security methods. Furthermore, the proposed framework consistently demonstrates strong performance even when applied to larger IoT networks with a higher node count.
... Cohen's Kappa is a popular measurement used in qualitative coding to calculate inter-rater reliability, which is also often used to compare human and AI coding (Kolesnyk & Khairova, 2022). Several scholars criticised the use of Cohen's Kappa to measure the performance of classification models in favour of the more robust Matthews correlation coefficient (Delgado & Tibau, 2019; Chicco et al., 2021). Others suggested the use of Gwet's AC1, a metric developed to correct for the influence of prevalence on Cohen's Kappa (Gwet, 2008; Blood & Spratt, 2007). ...
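A minimal sketch contrasting Cohen's Kappa with Gwet's AC1 on binary codes from two raters, using the common two-rater formulas (our illustration, not code from the cited works); with one highly prevalent code, Kappa collapses despite 80% raw agreement, while AC1 stays high:

    def kappa_and_ac1(a, b):
        # a, b: binary code vectors from two raters (e.g., human vs. AI coder).
        n = len(a)
        po = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
        p1, p2 = sum(a) / n, sum(b) / n              # prevalence of code 1 per rater
        pe_kappa = p1 * p2 + (1 - p1) * (1 - p2)     # chance agreement (Kappa)
        pi = (p1 + p2) / 2
        pe_ac1 = 2 * pi * (1 - pi)                   # chance agreement (AC1)
        return ((po - pe_kappa) / (1 - pe_kappa),
                (po - pe_ac1) / (1 - pe_ac1))

    a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
    b = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
    print(kappa_and_ac1(a, b))  # Kappa ~ -0.11, AC1 ~ 0.76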