Figure - available from: Scientific Reports
The workflow of the proposed logistic regression model combining SSL and AL.

Source publication
Article
Full-text available
Traditional supervised learning classifiers need a large number of labeled samples to achieve good performance; however, in many biological datasets only a small number of samples are labeled and the remaining samples are unlabeled. Labeling these unlabeled samples manually is difficult or expensive. Technologies such as active learning and semi-supervise...
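For readers unfamiliar with how the two strategies interact, the following is a minimal Python sketch (scikit-learn on synthetic data, not the authors' exact algorithm): active learning queries an expert label for the least-confident unlabeled sample, while semi-supervised learning pseudo-labels highly confident samples at no labeling cost, both around a logistic regression classifier.

# Minimal sketch of combining semi-supervised self-labeling (SSL) and
# active learning (AL) around logistic regression; illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Start with only a handful of labeled samples; the rest act as unlabeled.
labeled = rng.choice(len(X), size=20, replace=False)
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)
y_oracle = y.copy()  # true labels, queried only for AL samples

X_lab, y_lab = list(X[labeled]), list(y[labeled])

clf = LogisticRegression(max_iter=1000)
for _ in range(10):  # a few AL/SSL rounds
    clf.fit(np.array(X_lab), np.array(y_lab))
    proba = clf.predict_proba(X[unlabeled])
    conf = proba.max(axis=1)

    # AL step: ask the oracle (expert) to label the least confident sample.
    q = unlabeled[np.argmin(conf)]
    X_lab.append(X[q]); y_lab.append(y_oracle[q])

    # SSL step: pseudo-label a few highly confident samples (no oracle cost).
    sure = unlabeled[conf > 0.95][:5]
    for s in sure:
        X_lab.append(X[s]); y_lab.append(clf.predict(X[s].reshape(1, -1))[0])

    unlabeled = np.setdiff1d(unlabeled, np.append(sure, q))

print("final training-set size:", len(X_lab))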

Similar publications

Preprint
Full-text available
Compared to supervised learning, semi-supervised learning reduces the dependence of deep learning on a large number of labeled samples. In this work, we use a small number of labeled samples and perform data augmentation on unlabeled samples to achieve image classification. Our method constrains all samples to the predefined evenly-distributed clas...
Article
Full-text available
Compared to supervised learning, semi-supervised learning reduces the dependence of deep learning on a large number of labeled samples. In this work, we use a small number of labeled samples and perform data augmentation on unlabeled samples to achieve image classification. Our method constrains all samples to the predefined evenly-distributed clas...
Article
Full-text available
Early risk tagging is crucial in maternal health, especially because it threatens both the mother and the long-term development of the baby. By tagging high-risk pregnancies, mothers would be given extra care before, during, and after pregnancies, thus reducing the risk of complications. In the Philippines, where the fertility rate is high, especia...

Citations

... More specifically, the goal is to model the probability of one of the binary outcomes based on the predictor variables. In machine learning and various scientific applications, logistic regression appears in numerous settings, including online learning (Zhang et al. 2012), feature selection (Koh, Kim, and Boyd 2007), anomaly detection (Hendrycks, Mazeika, and Dietterich 2019; Feng et al. 2014), disease classification (Liao and Chin 2007; Chai et al. 2018), image & signal processing (Dong, Zhu, and Gong 2019; Rosario 2004), probability calibration (Kull et al. 2019) and many more. ...
Article
In statistics and machine learning, logistic regression is a widely-used supervised learning technique primarily employed for binary classification tasks. When the number of observations greatly exceeds the number of predictor variables, we present a simple, randomized sampling-based algorithm for the logistic regression problem that guarantees high-quality approximations to both the estimated probabilities and the overall discrepancy of the model. Our analysis builds upon two simple structural conditions that boil down to randomized matrix multiplication, a fundamental and well-understood primitive of randomized numerical linear algebra. We analyze the properties of the estimated probabilities of logistic regression when leverage scores are used to sample observations, and prove that accurate approximations can be achieved with a sample whose size is much smaller than the total number of observations. To further validate our theoretical findings, we conduct comprehensive empirical evaluations. Overall, our work sheds light on the potential of using randomized sampling approaches to efficiently approximate the estimated probabilities in logistic regression, offering a practical and computationally efficient solution for large-scale datasets.
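The following is a hedged Python sketch of the leverage-score sampling idea described in the abstract (synthetic data; the estimator, sample size, and weighting below are illustrative choices, not the paper's exact algorithm): row leverage scores are computed from the thin SVD of the design matrix, a small subsample is drawn with probabilities proportional to them, and a reweighted logistic regression is fit on the subsample.

# Leverage-score sampling for logistic regression; illustrative sketch.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=20000, n_features=10, random_state=1)

# Row leverage scores: squared row norms of U from the thin SVD of X.
U, _, _ = np.linalg.svd(X, full_matrices=False)
lev = (U ** 2).sum(axis=1)
p = lev / lev.sum()

# Sample a small subset of observations with probability ~ leverage score
# and reweight them (importance weights) when fitting.
rng = np.random.default_rng(1)
idx = rng.choice(len(X), size=1000, replace=True, p=p)
w = 1.0 / (len(idx) * p[idx])

full = LogisticRegression(max_iter=1000).fit(X, y)
sub = LogisticRegression(max_iter=1000).fit(X[idx], y[idx], sample_weight=w)

print("coefficient gap between full and sampled fits:",
      np.linalg.norm(full.coef_ - sub.coef_))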
... Here, we discuss the logistic regression model (LRM) with respect to the UCM [11]. For example, the coal dataset contains N samples, comprising a few labelled samples N_F (non-fire samples) and unlabelled samples F (fire samples), where N = N_F + F. ...
... 1) Uncertainty Sampling: Here, we use uncertainty sampling to choose samples from the unlabelled dataset in active learning (AL). In the LRM, the sample whose predicted probability lies closest to the decision boundary is considered the most uncertain sample in AL [11]. Therefore, the model can be expressed as ...
... Therefore, SSL selects high-confidence unlabelled samples, which can make the classifier nearsighted, whereas AL selects the low-confidence samples [11]. Therefore, the model is represented as ...
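A compact, illustrative Python sketch of the uncertainty criterion quoted in these excerpts (synthetic data, not the cited model): with a logistic regression classifier, the unlabeled sample whose predicted probability is closest to 0.5, i.e. closest to the decision boundary, is the most uncertain and is queried first.

# Uncertainty sampling with a logistic regression model; illustrative sketch.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=2)
labeled, unlabeled = np.arange(30), np.arange(30, len(X))  # small labeled pool

clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
p = clf.predict_proba(X[unlabeled])[:, 1]

# The sample closest to the decision boundary (p ~ 0.5) is queried first.
most_uncertain = unlabeled[np.argmin(np.abs(p - 0.5))]
print("sample to send to the expert:", most_uncertain)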
Article
Full-text available
Continuous monitoring is crucial for the early detection of mine fires in underground coal mines (UCMs). The Internet of Things (IoT) is widely used for continuous monitoring of the environment of UCMs. However, coverage and connectivity issues among the nodes in UCMs pose challenges for sustained monitoring and communication. To address this, the strategy involves partitioning the UCM using designated target points. The key idea is to confirm that nodes adequately cover these predetermined targets, ensuring comprehensive coverage and connectivity throughout the entire UCM. This study addresses the challenges of node coverage and connectivity in UCMs by proposing a novel k-coverage and m-connectivity model using the Fish Swarm Optimization (FSO) algorithm. The methodology involves partitioning the UCM into designated target points and deploying nodes to ensure comprehensive coverage and connectivity. Sensor nodes gather local environmental data, which is then processed to predict potential fire hazards using a logistic regression model (LRM). The FSO algorithm optimizes node deployment by achieving k-coverage, where each target point is covered by at least k nodes, and m-connectivity, ensuring robust communication routes even when some connections fail. A fitness function is formulated to minimize node count, maximize coverage, and maintain connectivity. The proposed method demonstrates high accuracy and efficiency in simulations, outperforming traditional linear regression and Naive Bayes models with an accuracy of 98% in fire prediction. This approach is scalable and can be adapted for various Consumer IoT applications to enhance safety and operational efficiency.
... These selected samples are then labeled manually by experts, and the model is retrained using the newly labeled data. 34 Active learning could be applied to various tasks in dementia prevention, such as identifying relevant risk factors or biomarkers, or selecting targeted samples for clinical trial recruitment. Other methods include self-training, where a model is initially trained on a small, labeled data set. ...
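As a brief illustration of the self-training idea mentioned in this excerpt, the sketch below wraps a logistic regression model in scikit-learn's SelfTrainingClassifier on synthetic data; samples marked -1 are unlabeled and are pseudo-labeled only when predicted with high confidence. This is a generic example, not a dementia-specific pipeline.

# Self-training around logistic regression; illustrative sketch.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=400, n_features=12, random_state=6)

y_train = np.full_like(y, -1)   # -1 marks unlabeled samples
y_train[:30] = y[:30]           # only a small labeled subset

self_training = SelfTrainingClassifier(
    LogisticRegression(max_iter=1000), threshold=0.9)
self_training.fit(X, y_train)

# Evaluate on the samples that were unlabeled during training.
print("accuracy on formerly unlabeled samples:",
      accuracy_score(y[30:], self_training.predict(X[30:])))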
Article
Full-text available
INTRODUCTION A wide range of modifiable risk factors for dementia have been identified. Considerable debate remains about these risk factors, possible interactions between them or with genetic risk, and causality, and how they can help in clinical trial recruitment and drug development. Artificial intelligence (AI) and machine learning (ML) may refine understanding. METHODS ML approaches are being developed in dementia prevention. We discuss exemplar uses and evaluate the current applications and limitations in the dementia prevention field. RESULTS Risk‐profiling tools may help identify high‐risk populations for clinical trials; however, their performance needs improvement. New risk‐profiling and trial‐recruitment tools underpinned by ML models may be effective in reducing costs and improving future trials. ML can inform drug‐repurposing efforts and prioritization of disease‐modifying therapeutics. DISCUSSION ML is not yet widely used but has considerable potential to enhance precision in dementia prevention. Highlights Artificial intelligence (AI) is not widely used in the dementia prevention field. Risk‐profiling tools are not used in clinical practice. Causal insights are needed to understand risk factors over the lifespan. AI will help personalize risk‐management tools for dementia prevention. AI could target specific patient groups that will benefit most for clinical trials.
... This latter approach is called active learning and has been successfully used in ECG beat classification, image classification, gene expression, and artefact detection. 22,23 A fourth subtype of learning models is RL, which constitutes a totally different paradigm in terms of how the model learns. In general, there will be an agent which observes and learns the best policy (e.g. ...
Article
Full-text available
Developing functional machine learning (ML)-based models to address unmet clinical needs requires unique considerations for optimal clinical utility. Recent debates about the rigour, transparency, explainability, and reproducibility of ML models, terms which are defined in this article, have raised concerns about their clinical utility and suitability for integration in current evidence-based practice paradigms. This featured article focuses on increasing the literacy of ML among clinicians by providing them with the knowledge and tools needed to understand and critically appraise clinical studies focused on ML. A checklist is provided for evaluating the rigour and reproducibility of the four ML building blocks: data curation, feature engineering, model development, and clinical deployment. Checklists like this are important for quality assurance and to ensure that ML studies are rigorously and confidently reviewed by clinicians and are guided by domain knowledge of the setting in which the findings will be applied. Bridging the gap between clinicians, healthcare scientists, and ML engineers can address many shortcomings and pitfalls of ML-based solutions and their potential deployment at the bedside.
... To make full use of the massive tunneling data and limited borehole data, the semi-supervised learning (SSL) approach [9] shows promising potential for the soil identification problem. The SSL approach is based on the idea that unlabeled data (tunneling data without borehole information), when used in conjunction with a small amount of labeled data (tunneling data with borehole information), can produce a considerable improvement in learning accuracy; SSL has been widely used in disease classification [10], structural health monitoring [11], and extreme disaster prediction [12]. The remainder of the paper is organized as follows. ...
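A minimal illustration of this SSL idea, using label spreading from scikit-learn on synthetic data (the dataset, kernel, and neighbor count are illustrative choices, not those of the cited soil-identification work): a handful of labeled samples propagate their labels to a much larger unlabeled pool.

# Label spreading: few labeled samples, many unlabeled; illustrative sketch.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

X, y = make_classification(n_samples=500, n_features=6, random_state=5)

y_partial = np.full_like(y, -1)     # -1 marks unlabeled samples
y_partial[:25] = y[:25]             # only 25 samples keep their labels

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)
acc = (model.transduction_[25:] == y[25:]).mean()
print("accuracy of propagated labels on the unlabeled pool:", acc)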
... Its algorithms are employed in many fields, such as speech recognition, visual object recognition, and object detection [1,2]. Usually, ML algorithms accomplish these tasks by training models on their own datasets [3][4][5][6][7][8][9][10][11][12][13]. Typical examples include logistic regression [8], neural networks [5,6,9,10], support vector machines [11,12], and Bayesian classifiers [13]. ...
Article
Full-text available
Quantum machine learning (QML) has aroused great interest because it has the potential to speed up established classical machine learning processes. However, present QML models can only be trained on a dataset from a single domain of interest. This severely limits the application of QML to scenarios where only small datasets are available. In this work, we have proposed a QML model that allows the transfer of knowledge from one domain encoded by quantum states to another, which is called quantum transfer learning. Using such a model, we demonstrate that the classification accuracy can be greatly improved for the training process on small datasets, compared with the results obtained by a former QML algorithm. Last but not least, we have proved that the complexity of our algorithm is essentially logarithmic, which can be considered an exponential speedup over the related classical algorithms.
... [33] explored semi-supervised and active learning based on Gaussian mixture models for microalgae classification. [34] proposed a logistic regression model combining SSL and AL. They used unlabeled samples at the least labeling cost in an attempt to improve disease classification. ...
Article
Full-text available
As datasets have continuously grown, efforts have been made to address the problem of the large amount of unlabeled data in disproportion to the scarcity of labeled data. Another important issue is the trade-off between the difficulty of obtaining annotations provided by a specialist and the need for a significant amount of annotated data to obtain a robust classifier. In this context, active learning techniques used jointly with semi-supervised learning are interesting. A smaller number of more informative samples, previously selected (by the active learning strategy) and labeled by a specialist, can propagate the labels to a set of unlabeled data (through the semi-supervised one). However, most works in the literature neglect the need for interactive response times that can be required by certain real applications. We propose a more effective and efficient active semi-supervised learning framework, including a new active learning method. An extensive experimental evaluation was performed in the biological context (using the ALL-AML, Escherichia coli and PlantLeaves II datasets), comparing our proposals with state-of-the-art literature works and different supervised (SVM, RF, OPF) and semi-supervised (YATSI-SVM, YATSI-RF and YATSI-OPF) classifiers. From the obtained results, we can observe the benefits of our framework, which allows the classifier to achieve higher accuracies more quickly with a reduced number of annotated samples. Moreover, the selection criterion adopted by our active learning method, based on diversity and uncertainty, enables the prioritization of the most informative boundary samples for the learning process. We obtained a gain of up to 20% against other learning techniques. The active semi-supervised learning approaches presented a better trade-off (accuracies and competitive, viable computational times) when compared with the active supervised learning ones.
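The sketch below illustrates one plausible way to combine uncertainty and diversity when selecting samples, in the spirit of the criterion described in the abstract but not its exact method: rank unlabeled samples by classification margin, then use k-means over the most uncertain candidates so that the queried samples are spread across the feature space.

# Uncertainty + diversity sample selection; illustrative sketch only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=10, random_state=3)
labeled, unlabeled = np.arange(40), np.arange(40, len(X))

clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
p = clf.predict_proba(X[unlabeled])
margin = p.max(axis=1) - np.sort(p, axis=1)[:, -2]   # small margin = uncertain

candidates = unlabeled[np.argsort(margin)[:50]]       # 50 most uncertain samples
km = KMeans(n_clusters=5, n_init=10, random_state=3).fit(X[candidates])

# One query per cluster: the candidate closest to each cluster centre.
queries = [candidates[np.argmin(np.linalg.norm(X[candidates] - c, axis=1))]
           for c in km.cluster_centers_]
print("diverse, uncertain samples to annotate:", queries)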
... The idea of this method is very naive, namely that if samples are closer in the feature space, they are more likely to be in the same class. Then, we observed two linear classification models' performance, linear support vector classifier (linear SVC) [3,17] and logistic regression (LR) [7]. Both identify hyperplanes in the feature space to split samples into different classes. ...
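For concreteness, a small Python comparison of the two linear classifiers named in the excerpt (synthetic data; parameters are illustrative): both learn a separating hyperplane, the linear SVC with the hinge loss and logistic regression with the log loss.

# Comparing two linear classifiers on the same data; illustrative sketch.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=15, random_state=4)

svc = LinearSVC(C=1.0, max_iter=5000)
lr = LogisticRegression(C=1.0, max_iter=1000)

print("linear SVC accuracy:", cross_val_score(svc, X, y, cv=5).mean())
print("logistic regression accuracy:", cross_val_score(lr, X, y, cv=5).mean())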
Preprint
Full-text available
Facial appearance matters in social networks. Individuals frequently make trait judgments from facial clues. Although these face-based impressions lack the evidence to determine validity, they are of vital importance, because they may relate to human network-based social behavior, such as seeking certain individuals for help, advice, dating, and cooperation, and thus they may relate to centrality in social networks. However, little to no work has investigated the apparent facial traits that influence network centrality, despite the large amount of research on attributions of the central position including personality and behavior. In this paper, we examine whether perceived traits based on facial appearance affect network centrality by exploring the initial stage of social network formation in a first-year college residential area. We took face photos of participants who are freshmen living in the same residential area, and we asked them to nominate community members linking to different networks. We then collected facial perception data by requiring other participants to rate facial images for three main attributions: dominance, trustworthiness, and attractiveness. Meanwhile, we proposed a framework to discover how facial appearance affects social networks. Our results revealed that perceived facial traits were correlated with network centrality and that they were indicative for predicting the centrality of people in different networks. Our findings provide psychological evidence regarding the interaction between faces and network centrality. Our findings also offer insights into a combination of psychological and social network techniques, and they highlight the function of facial bias in cuing and signaling social traits. To the best of our knowledge, we are the first to explore the influence of facial perception on centrality in social networks.
... Finally, expectation maximization co-training (co-EM) was employed to automatically label instances that showed a low disagreement between the two classifiers. Other studies exploited certainty-based AL with self-training, aiming to minimize the human cost of manual labeling in spoken language understanding [28], natural language processing [29], sound classification [30], disease classification [31] and cell segmentation [32]. In another study [33], the authors addressed the problem of imbalanced training data in object detection. ...
Article
Full-text available
One of the major aspects affecting the performance of classification algorithms is the amount of labeled data available during the training phase. It is widely accepted that the labeling procedure of vast amounts of data is both expensive and time-consuming since it requires the employment of human expertise. For a wide variety of scientific fields, unlabeled examples are easy to collect but hard to handle in a useful manner, thus improving the contained information for a subject dataset. In this context, a variety of learning methods have been studied in the literature aiming to efficiently utilize the vast amounts of unlabeled data during the learning process. The most common approaches tackle problems of this kind by individually applying active learning or semi-supervised learning methods. In this work, a combination of active learning and semi-supervised learning methods is proposed, under a common self-training scheme, in order to efficiently utilize the available unlabeled data. Effective and robust metrics based on the entropy and the probability distribution of the unlabeled set are used to select the most informative unlabeled examples for augmenting the initial labeled set. The superiority of the proposed scheme is validated by comparing it against the base approaches of supervised, semi-supervised, and active learning on a wide range of fifty-five benchmark datasets.
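A hedged sketch of an entropy-based selection step like the one described in the abstract (synthetic multiclass data; not the proposed framework itself): samples whose predicted class distribution has the highest entropy are treated as the most informative and are sent for expert labeling.

# Entropy-based selection of unlabeled samples; illustrative sketch.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_features=10, n_classes=3,
                           n_informative=6, random_state=7)
labeled, unlabeled = np.arange(60), np.arange(60, len(X))

clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
p = clf.predict_proba(X[unlabeled])
entropy = -(p * np.log(p + 1e-12)).sum(axis=1)

# The highest-entropy (most uncertain) samples are labeled first.
to_label = unlabeled[np.argsort(entropy)[-10:]]
print("samples selected for expert labeling:", to_label)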