Article

Experiments with a New Boosting Algorithm

... Schapire in 1990 (Schapire, 1990) comes from sequentially converting weak learners into stronger ones to create a higher-accuracy ensemble model through reduction in variance and bias (Zounemat-Kermani et al., 2021; Zhou, 2012; Bhagat et al., 2021a,b). Boosting techniques, like bagging, use bootstrap resampling from the primary dataset to train commonly homogeneous base learners (DT models), but in a sequential way; a linear aggregation rule is then applied to combine the outputs (Freund and Schapire, 1996; Zhou, 2012; Onainor, 2019). Gradient Boosting, Extreme Gradient Boosting, Stochastic Gradient Boosting, Bayesian Additive Regression Trees, and AdaBoost are prominent examples of this technique (Ganaie et al., 2021; Freund and Schapire, 1996). ...
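Since this snippet (and several below) describe the same sequential reweight-and-combine loop, a compact from-scratch sketch may help. This is a generic discrete-AdaBoost loop with decision-tree stumps, not code from any of the citing papers; the dataset names are placeholders.

```python
# Minimal sketch of the sequential boosting loop described above:
# reweight the training set after each weak learner, then combine
# the learners with a weighted (linear) aggregation rule.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Discrete AdaBoost; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                # start from uniform weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)   # weak learner on weighted data
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))      # weighted training error
        if err >= 0.5:                     # no longer better than chance
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)     # up-weight misclassified samples
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # weighted linear aggregation of the weak learners' votes
    scores = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(scores)
```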
... By weighted voting of weak learners, the final strong output model is created (Zounemat-Kermani et al., 2021; Yan et al., 2020; Li et al., 2022b; Elmousalami, 2020). C) Stacking: Stacking is defined as a two-level procedure in which ensemble members or base learners are trained in parallel at the first level, and the base learners' outputs are then fed into a metamodel for combination at the second level (Re and Valentini, 2012; Breiman, 2001; Li et al., 2022a; Schapire, 1990; Elmousalami, 2020; Freund and Schapire, 1996). Contrary to the previous ensemble techniques, this method is commonly applied to heterogeneous base learners (models of different types). ...
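As a concrete illustration of the two-level procedure just described, here is a minimal scikit-learn sketch (an illustrative assumption, not taken from the cited works): heterogeneous base learners at level one, a logistic-regression metamodel at level two.

```python
# Two-level stacking: parallel heterogeneous base learners (level 1),
# whose out-of-fold predictions feed a metamodel (level 2).
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("svm", SVC(probability=True)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),  # the metamodel
    cv=5,  # base-learner outputs are produced out-of-fold
)
# Usage: stack.fit(X_train, y_train); stack.predict(X_test)
```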
... Adaptive boosting (AdaBoost) was first proposed by Yoav Freund and Robert Schapire in 1996 [33]. This study mainly uses this model for the regression prediction task. ...
... In the iterative process, training data with larger prediction errors is given more weight, and the weights of the training data are updated whenever a new weak learner is created [34]. When calculating the weight of the learner, the maximum error on the training set and the relative error of each sample are first calculated [33]. ...
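The weight computation this snippet refers to matches the AdaBoost.R2 scheme (Drucker, 1997), which scikit-learn's AdaBoostRegressor also follows; below is a hedged numpy sketch of one boosting round under that assumption (linear loss, illustrative variable names).

```python
# One AdaBoost.R2-style round: maximum error on the training set,
# per-sample relative errors, learner weight, and sample reweighting.
import numpy as np

def r2_round(y_true, y_pred, w):
    abs_err = np.abs(y_true - y_pred)
    e_max = abs_err.max()                 # maximum error on the training set
    rel = abs_err / e_max                 # relative error of each sample
    eps = np.sum(w * rel)                 # weighted error of this learner
    beta = eps / (1.0 - eps)              # (training stops if eps >= 0.5)
    w_new = w * beta ** (1.0 - rel)       # shrink weights of well-fit samples
    w_new /= w_new.sum()
    learner_weight = np.log(1.0 / beta)   # used when combining predictions
    return w_new, learner_weight
```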
Article
Full-text available
Soil water content is an important indicator used to maintain the ecological balance of farmland. The efficient spatial prediction of soil water content is crucial for ensuring crop growth and food production. To this end, 104 farmland soil samples were collected in the Yellow River Delta (YRD) in China, and the soil water content was determined using the drying method. A gradient boosting decision tree (GBDT) model based on tree-structured Parzen estimator (TPE) hyperparameter optimization was developed, and then the soil water content was predicted and mapped based on the soil texture and vegetation index from Sentinel-2 remote sensing images. The results of statistical analysis showed that the soil water content had a high coefficient of variation (55.30%), a non-normal distribution, and complex spatial variability. Compared with other models, the TPE-GBDT model had the highest prediction accuracy (RMSE = 6.02% and R2 = 0.71), and its mapping results showed that the areas with high soil water content were distributed on both sides of the river and near the estuary. Furthermore, the results of Shapley additive explanation (SHAP) analysis showed that the soil texture (PC2 and PC5), modified normalized difference vegetation index (MNDVI), and Sentinel-2 red edge position (S2REP) index provided important contributions to the spatial prediction of soil water content. We found that the hydraulic physical properties of soil texture and the vegetation characteristics (such as vegetation coverage, root action, and transpiration) are the key factors affecting the spatial migration and heterogeneity of the soil water content in the study area. The above results show that the TPE algorithm can quickly capture the hyperparameters that are most suitable for the GBDT model, so that the GBDT model can ensure prediction accuracy, reduce the loss function with less training data, and accurately learn the nonlinear relationship between soil water content and environmental factors. This paper proposes a machine learning method for hyperparameter optimization that shows considerable potential to predict the spatial heterogeneity of soil water content, which can effectively support regional farmland soil and water conservation and high-quality agricultural development.
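A minimal sketch of how such a TPE search over GBDT hyperparameters can look, assuming the hyperopt library and scikit-learn's GradientBoostingRegressor; the search space, synthetic data, and scoring below are illustrative assumptions, not the paper's configuration.

```python
# TPE hyperparameter optimization for a GBDT regressor (illustrative).
from hyperopt import fmin, hp, tpe
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=104, n_features=8, noise=0.1)  # synthetic stand-in

space = {
    "n_estimators": hp.quniform("n_estimators", 50, 500, 10),
    "max_depth": hp.quniform("max_depth", 2, 8, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
}

def objective(params):
    model = GradientBoostingRegressor(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
    )
    # TPE minimizes the objective, so return cross-validated RMSE
    rmse = -cross_val_score(model, X, y,
                            scoring="neg_root_mean_squared_error", cv=5).mean()
    return rmse

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)
```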
... After applying the Synthetic Minority Over-Sampling Technique (SMOTE) to balance the training set, we used the Adaptive Boosting algorithm (AdaBoost) [40] and a fully connected Artificial Neural Network (ANN) to predict the clusters emerged from the previous steps and test the replicability of the solution. We compared the performances of these two supervised ML algorithms, with AdaBoost being selected for its good performance in imbalance classification problems [41,42]. ...
... The Adaptive Boosting algorithm is an ensemble method that operates within a boosting framework. Boosting is a technique that can significantly reduce the error of any weak learning algorithm, creating classifiers that need only be slightly better than random guessing [40]. The AdaBoost algorithm assigns weights to each sample based on its importance and places the most weight on those examples that are most frequently misclassified by the previous classifiers. ...
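A minimal sketch of the SMOTE-then-AdaBoost setup these snippets describe, assuming the imbalanced-learn package; wrapping both steps in an imblearn pipeline keeps the oversampling confined to the training folds during cross-validation.

```python
# SMOTE oversampling followed by AdaBoost classification (illustrative).
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

clf = Pipeline(steps=[
    ("smote", SMOTE(random_state=0)),        # balance the training set only
    ("ada", AdaBoostClassifier(n_estimators=100)),
])
# Usage: cross_val_score(clf, X, y, scoring="f1_macro", cv=5)
```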
Article
Full-text available
Assessing the cognitive abilities of students in academic contexts can provide valuable insights for teachers to identify their cognitive profile and create personalized teaching strategies. While numerous studies have demonstrated promising outcomes in clustering students based on their cognitive profiles, effective comparisons between various clustering methods are lacking in the current literature. In this study, we aim to compare the effectiveness of two clustering techniques to group students based on their cognitive abilities, including general intelligence, attention, visual perception, working memory, and phonological awareness. 292 students, aged 11–15 years, participated in the study. A two-level approach based on the joint use of Kohonen's Self-Organizing Maps (SOMs) and the k-means clustering algorithm was compared with an approach based on the k-means clustering algorithm only. The resulting profiles were then predicted via AdaBoost and ANN supervised algorithms. The results showed that the two-level approach provides the best solution for this problem, while the ANN algorithm was the winner in the classification problem. These results lay the foundations for developing a useful instrument for predicting students' cognitive profiles.
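A hedged sketch of the two-level clustering idea the abstract describes (SOM first, then k-means on the SOM prototypes), assuming the third-party minisom package; the grid size, iteration count, and number of clusters are illustrative.

```python
# Level 1: fit a Self-Organizing Map; level 2: k-means on the SOM
# prototypes; each sample inherits the cluster of its best-matching unit.
import numpy as np
from minisom import MiniSom
from sklearn.cluster import KMeans

def two_level_clustering(X, grid=10, k=4):
    som = MiniSom(grid, grid, X.shape[1], sigma=1.0, learning_rate=0.5)
    som.train_random(X, 5000)                       # level 1: fit the map
    prototypes = som.get_weights().reshape(-1, X.shape[1])
    km = KMeans(n_clusters=k, n_init=10).fit(prototypes)  # level 2
    bmus = [som.winner(x) for x in X]               # best-matching unit per sample
    return np.array([km.labels_[i * grid + j] for i, j in bmus])
```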
... Two different feature extraction methods are used: one directly extracts the descriptive statistical features of various variables (e.g., the mean, median, maximum, minimum, and standard deviation of the speed), and the other directly takes time in the time series data as a dimension of the data. The AdaBoost algorithm (Freund and Schapire, 1996) is used to classify whether a potential crash event is a real crash using the constructed features. Models trained with the two different features are fused for the final classification. ...
... Reported results (two evaluation metrics, as listed):
Xgboost (Chen and Guestrin, 2016): 0.890, 0.100
LightGBM (Ke et al., 2017): 0.884, 0.073
Decision Tree (Quinlan, 1986): 0.820, 0.120
Random Forest (Breiman, 2001): 0.036, 0.073
AdaBoost (Freund and Schapire, 1996): 0.911, 0.050
[Fig. 10. Illustration of the model ensemble.] ...
Article
As the automobile market gradually develops towards intelligence, networking, and information-orientation, intelligent identification based on connected vehicle data becomes a key technology. Specifically, real-time crash identification using vehicle operation data can enable automotive companies to obtain timely information on the safety of user vehicle usage so that timely customer service and roadside rescue can be provided. In this paper, an accurate vehicle crash identification algorithm is developed based on machine learning techniques using electric vehicles’ operation data provided by SAIC-GM-Wuling. The point of battery disconnection is identified as a potential crash event. Data before and after the battery disconnection is retrieved for feature extraction. Two different feature extraction methods are used: one directly extracts the descriptive statistical features of various variables, and the other directly unfolds the multivariate time series data. The AdaBoost algorithm is used to classify whether a potential crash event is a real crash using the constructed features. Models trained with the two different features are fused for the final outputs. The results show that the final model is simple, effective, and has a fast inference speed. The model has an F1 score of 0.98 on testing data for crash classification, and the identified crash times are all within 10 s around the true crash times. All data and code are available at https://github.com/MeixinZhu/vehicle-crash-identification.
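A minimal sketch of the first feature-extraction route described above (descriptive statistics over the signal window around the battery disconnection) feeding AdaBoost; the signal names and window handling are placeholders, not the released code linked above.

```python
# Descriptive-statistics features from a time-series window, then AdaBoost.
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier

def window_features(window: pd.DataFrame) -> dict:
    feats = {}
    for col in ["speed", "acceleration"]:      # placeholder signal names
        s = window[col]
        feats[f"{col}_mean"] = s.mean()
        feats[f"{col}_median"] = s.median()
        feats[f"{col}_max"] = s.max()
        feats[f"{col}_min"] = s.min()
        feats[f"{col}_std"] = s.std()
    return feats

# X = pd.DataFrame(window_features(w) for w in event_windows)  # hypothetical windows
# clf = AdaBoostClassifier().fit(X, y_is_real_crash)           # placeholder labels
```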
... It works on the principle that combining several weak classifiers results in a strong and accurate classifier. AdaBoost is an abbreviation for adaptive boosting, a machine learning algorithm formulated by Yoav Freund and Robert Schapire [22]. In our proposed method, 50 decision trees are employed to develop the AdaBoost classifier. ...
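That configuration maps directly onto scikit-learn (an assumption; the paper does not name its library). A short sketch, noting that the parameter is called `estimator` from scikit-learn 1.2 onward (`base_estimator` in older versions):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 50 decision-tree weak learners, as in the passage above
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=50)
```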
... Given the low number of input and output variables, tree-based algorithms such as random forest are selected, and the hyperparameters of the models (e.g., number of trees) are fine-tuned to avoid overfitting the ML model to training data. Using a random selection of features to split each node yields error rates that are more robust with respect to noise (Freund and Schapire 1996). Additionally, the generalization error for forests converges to a limit as the number of trees in the forest increases. ...
Article
Full-text available
Petrophysical interpretation of borehole geophysical measurements in the presence of deep mud-filtrate invasion remains a challenge in formation evaluation. Traditional interpretation methods often assume a piston-like radial resistivity model to estimate the radial length of invasion, resistivities in the flushed and virgin zones, and the corresponding fluid saturations from apparent resistivity logs. Such assumptions often introduce notable inaccuracies, especially when the radial distribution of formation resistivity exhibits a deep and smooth radial front. Numerical simulation of mud-filtrate invasion and well logs combined with inversion methods can improve the estimation accuracy of petrophysical properties from borehole geophysical measurements affected by the presence of mud-filtrate invasion. We develop a new method to quantify water saturation in the virgin zone, residual hydrocarbon saturation, and permeability from borehole geophysical measurements. This method combines the numerical simulation of well logs with the physics of mud-filtrate invasion to quantify the effect of petrophysical properties and drilling parameters on nuclear and resistivity logs. Our approach explicitly considers the different volumes of investigation associated with the borehole geophysical measurements included in the interpretation. The new method is successfully applied to a tight-gas sandstone formation invaded with water-base mud (WBM). Petrophysical properties were estimated in three closely spaced vertical wells that exhibited different invasion conditions (i.e., different times of invasion and different overbalance pressures). Available rock-core laboratory measurements were used to calibrate the petrophysical models and obtain realistic spatial distributions of petrophysical properties around the borehole. This approach assumes that initial water saturation is equal to irreducible water saturation. Based on the calibrated petrophysical models, thousands of invasion conditions were numerically simulated for a wide range of petrophysical properties, including porosity and permeability. Based on the large data set of numerical simulations, analytical and machine-learning (ML) models were combined to infer unknown rock properties in each well. Mean-absolute-percent errors (MAPE) of the analytical and ML models for the estimation of water saturation in the virgin zone are 5% and 2%, respectively, while the MAPE of the analytical models for the estimation of residual hydrocarbon saturation is 10%. Synthetic and field examples are examined to benchmark the successful application and verification of the new interpretation method. Estimates of water saturation in the virgin zone using the new method are in good agreement with core-based models.
... In [58], a reverse boosting algorithm is introduced. This method distinguishes between safe, noisy, and borderline patterns, and assigns them different weights during boosting. ...
Article
The development of image classification is one of the most important research topics in remote sensing. The prediction accuracy depends not only on the appropriate choice of the machine learning method but also on the quality of the training datasets. However, real-world data is not perfect and often suffers from noise. This paper gives an overview of noise filtering methods. Firstly, the types of noise and the consequences of class noise on machine learning are presented. Secondly, class noise handling methods at both the data level and the algorithm level are introduced. Then ensemble-based class noise handling methods including class noise removal, correction, and noise robust ensemble learners are presented. Finally, a summary of existing data-cleaning techniques is given.
... We evaluated several potential classifiers, mainly concerned with their explainability and ability to learn on small data sets. Later, Section VI-A shows the classification metrics for the DT [60], logistic regression (LR) [61], k-nearest neighbors (KNNs) [62], RF [63], and tree-based AdaBoost [64]. The final model was trained on a data set of 160,110 benign instances and 1,990 C&C instances. ...
Article
Explainability and alert reasoning are essential but often neglected properties of intrusion detection systems. The lack of explainability reduces security personnel’s trust, limiting the overall impact of alerts. This paper proposes the BOTA (Botnet Analysis) system, which uses the concepts of weak indicators and heterogeneous meta-classifiers to maintain accuracy compared with state-of-the-art systems while also providing explainable results that are easy to understand. To evaluate the proposed system, we have implemented a demonstration set of weak-indication intrusion detectors, each working on a different principle to ensure robustness. We tested the architecture with various real-world and lab-created datasets, and it correctly identified 94.3% of infected IoT devices without false positives. Furthermore, the implementation is designed to work on top of extended bidirectional flow data, making it deployable on 100 Gbps large-scale networks at the level of Internet Service Providers. Thus, a single instance of BOTA can protect millions of devices connected to end-users’ local networks and significantly reduce the threat arising from powerful IoT botnets.
... In our previous research [13, 21-25] we described and experimented with a wide diversity of algorithms such as K-Nearest Neighbors (KNN) [36], Support Vector Machine (SVM) [37], AdaBoost [38], Random Forest (RF) [39], Light Gradient Boosting Machine (LGBM) [40], Extreme Gradient Boosting (XGB) [41], and Logistic Regressor (LR) [42]. Here, we compared the results obtained by these algorithms with the ones achieved by classifiers synthesized by Decision Trees (DT), GP and GE to check whether the results of the interpretable classifiers are competitive. ...
Article
Full-text available
Background In this work, we developed many machine learning classifiers to assist in diagnosing respiratory changes associated with sarcoidosis, based on results from the Forced Oscillation Technique (FOT), a non-invasive method used to assess pulmonary mechanics. In addition to accurate results, there is a particular interest in their interpretability and explainability, so we used Genetic Programming since the classification is made with intelligible expressions and we also evaluate the feature importance in different experiments to find the more discriminative features. Methodology/principal findings We used genetic programming in its traditional tree form and a grammar-based form. To check if interpretable results are competitive, we compared their performance to K-Nearest Neighbors, Support Vector Machine, AdaBoost, Random Forest, LightGBM, XGBoost, Decision Trees and Logistic Regressor. We also performed experiments with fuzzy features and tested a feature selection technique to bring even more interpretability. The data used to feed the classifiers come from the FOT exams in 72 individuals, of which 25 were healthy, and 47 were diagnosed with sarcoidosis. Among the latter, 24 showed normal conditions by spirometry, and 23 showed respiratory changes. The results achieved high accuracy (AUC > 0.90) in two analyses performed (controls vs. individuals with sarcoidosis and normal spirometry and controls vs. individuals with sarcoidosis and altered spirometry). Genetic Programming and Grammatical Evolution were particularly beneficial because they provide intelligible expressions to make the classification. The observation of which features were selected most frequently also brought explainability to the study of sarcoidosis. Conclusions The proposed system may provide decision support for clinicians when they are struggling to give a confirmed clinical diagnosis. Clinicians may reference the prediction results and make better decisions, improving the productivity of pulmonary function services by AI-assisted workflow.
... Boosting, one of the best-known Perturb and Combine methods, originated from the question posed by Kearns & Valiant (1994) of whether a set of weak classifiers could be converted into a robust classifier. Freund & Schapire (1996, 1997) designed AdaBoost (which stands for adaptive boosting), an ensemble algorithm aiming to drive the training set error rapidly to zero. The key idea consists of repeatedly using the base weak learning algorithm on differently weighted versions of the training data, yielding a sequence of weak classifiers that are finally combined (Galar et al., 2011). ...
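For reference, the reweighting scheme this passage summarizes, in the standard notation of the original AdaBoost formulation (D_t is the weight distribution over training examples at round t, h_t the weak classifier of that round, and ε_t its weighted error):

```latex
\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}, \qquad
D_{t+1}(i) = \frac{D_t(i)\,\exp\bigl(-\alpha_t\, y_i\, h_t(x_i)\bigr)}{Z_t}, \qquad
H(x) = \operatorname{sign}\Bigl(\sum_{t=1}^{T}\alpha_t\, h_t(x)\Bigr)
```

Here Z_t normalizes D_{t+1} to a distribution; each misclassified example (y_i ≠ h_t(x_i)) has its weight multiplied by e^{α_t}, which is what drives the training error rapidly toward zero.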
Article
Label Ranking (LR) is an emerging non-standard supervised classification problem with practical applications in different research fields. The Label Ranking task aims at building preference models that learn to order a finite set of labels based on a set of predictor features. One of the most successful approaches to tackling the LR problem consists of using decision tree ensemble models, such as bagging, random forest, and boosting. However, these approaches, which build on classical unweighted rank correlation measures, are not sensitive to label importance. Nevertheless, in many settings, failing to predict the ranking position of a highly relevant label should be considered more serious than failing to predict a negligible one. Moreover, an efficient classifier should be able to take into account the similarity between the elements to be ranked. The main contribution of this paper is to formulate, for the first time, a more flexible label ranking ensemble model which encodes the similarity structure and a measure of individual label importance. Precisely, the proposed method consists of three item-weighted versions of the AdaBoost boosting algorithm for label ranking. The predictive performance of our proposal is investigated both through simulations and applications to three real datasets.
... Because of their high performance on small-sized training data, Support Vector Machines (SVM) (Hearst et al., 1998) were commonly used. Additionally, in the classification stage, various classification approaches such as bagging (Opitz and Maclin, 1999), cascade learning (Dalal and Triggs, 2005), and AdaBoost (Freund and Schapire, 1996) were utilised, resulting in better detection accuracy. ...
Article
Full-text available
Surveillance systems do not give a rapid response to deal with suspicious activities such as armed robbery in public places. Consequently, there is a need for technology that can recognize criminal activities from Closed Circuit Television (CCTV) footage without the need of human help. Various high-performance computing algorithms have been developed but are limited to specific conditions. In this paper, we have identified gaps between existing technologies for weapon detection. The automatic detection of guns/weapons could help in the investigation of crime scenes. A new and difficult area of study is identifying the specific type of firearm used in an attack, known as intra-class detection. The study examines and classifies the strengths and shortcomings of several existing algorithms using classical machine learning and deep learning approaches, employed in the detection of different kinds of weapons. We thoroughly compare and analyze the performance of several recent state-of-the-art methods on different datasets, along with their future scope. We observed that deep learning techniques outperform traditional machine learning techniques in terms of speed and accuracy.
... In this study, term frequency-inverse document frequency (TF-IDF) (Joachims, 1996) was used for sentence embedding, and the classification performance was confirmed using three classification models: random forest (RF) (Breiman, 2001), XGBoost (Chen & Guestrin, 2016), and adaptive boosting (AdaBoost) (Freund and Schapire, 1996). In addition, among the libraries actively used in recent NLP studies, fastText (Joulin et al., 2016) and the Korean pre-trained BERT model (KoBERT), applicable to Korean, were used to compare the performance. ...
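A minimal sketch of the TF-IDF embedding plus classifier comparison described in this snippet, using scikit-learn only (the XGBoost, fastText, and KoBERT baselines are omitted here); the data names are placeholders.

```python
# TF-IDF sentence embedding, then a small classifier comparison.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

classifiers = {
    "RF": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
}
for name, clf in classifiers.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    # scores = cross_val_score(pipe, sentences, labels, cv=5)  # placeholder data
```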
Article
As e-commerce markets have gradually expanded, online shopping malls have provided various services aiming to secure competitiveness. A service for providing an accurate and prompt response when a customer writes an inquiry regarding a product represents a space directly connected to the customer and plays an important role, as it is directly related to product sales. However, the current online shopping mall answering service has disadvantages, e.g., it takes time for an administrator to write an answer directly, or to provide an answer within a set of answers. In this paper, we propose an answer framework for solving this problem, based on customer reviews. When a user writes a query, the framework provides an appropriate answer in real time through the system’s question-and-answer pairs and customer reviews. The framework’s performance is verified through a qualitative evaluation. In addition, it is confirmed that a customized model for reflecting the characteristics of each shopping mall can be created by using additional information from the collected data. The proposed framework is expected to support customers’ online shopping through more reliable and efficient information retrieval, and to reduce shopping mall operation and maintenance costs.
... Boosted Regression Trees (BRT) is a machine learning method that combines the strengths of a regression tree algorithm and a boosting algorithm (Freund, 1996; Schapire, 2003). The construction of multiple regression models in the boosting algorithm makes BRT differ fundamentally from conventional techniques that aim to produce a single "best" parsimonious model (Schapire, 2003; Elith et al., 2008). ...
Article
Full-text available
Long‐term hydrological partitioning of catchments can be well described by the Budyko framework with a parameter (e.g., Fu's equations with parameter ω). The Budyko framework considers aridity index as the dominant control on hydrological partitioning, while the parameter represents integrated influences of catchment properties. Our understanding regarding the controls of catchment properties on the parameter is still limited. In this study, two machine learning methods, that is, boosted regression tree (BRT) and CUBIST, were used to model ω. Interpretable machine learning methods were adopted for better physical understanding including feature importance, accumulated local effects (ALE), and local interpretable model‐agnostic explanations. Among the 15 properties of 443 Australian catchments, analysis of feature importance showed that root zone storage capacity (SR), vapor pressure, vegetation coverage (M), precipitation depth, climate seasonality and asynchrony index (SAI), and water use efficiency (WUE) were the six primary control factors on ω. ALE showed that ω varied nonlinearly with all factors, and varied non‐monotonically with M, SAI, and WUE. LIME showed that the importance of the six dominant factors on ω varied between regions. CUBIST was further used to build regionally varying relationships between ω and the primary factors. Continental scale ω and evapotranspiration were further mapped across Australia based on the most robust BRT‐trained parameterization scheme with a resolution of 0.05°. Instead of using the machine learning method as a black box, we employed interpretability approaches to identify the controls. Our findings not only improved the capability of the Budyko method for hydrological partitioning across Australia, but also demonstrated that the controls of catchment properties on hydrological partitioning vary in different regions.
... In boosting, weak learners are combined sequentially in an adaptive way, i.e., each new model gives more importance to the misclassified examples by assigning lower weights to correctly classified examples and higher weights to examples that are difficult to classify. AdaBoost [33] is the best-known boosting method. ...
Article
The outbreak of the novel coronavirus 2019 (COVID-19) has been treated as a public health crisis of global concern by the World Health Organization (WHO). The COVID-19 pandemic hugely affected countries worldwide, raising the need to exploit novel, alternative and emerging technologies to respond to the emergency created by weak health-care systems. In this context, Artificial Intelligence (AI) techniques can give valid support to public health authorities, complementing traditional approaches with advanced tools. This study provides a comprehensive review of methods, algorithms, applications, and emerging AI technologies that can be utilized for forecasting and diagnosing COVID-19. The main objectives of this review are summarized as follows: (i) understanding the importance of AI approaches such as machine learning and deep learning for the COVID-19 pandemic; (ii) discussing the efficiency and impact of these methods for COVID-19 forecasting and diagnosing; (iii) providing an extensive background description of AI techniques to help non-experts better grasp the underlying concepts; (iv) giving, for each work surveyed, a detailed analysis of the rationale behind the approach, highlighting the method used, the type and size of data analyzed, the validation method, the target application and the results achieved; (v) focusing on some future challenges in COVID-19 forecasting and diagnosing.
... 2) GB: GB is a powerful ML model for different regression and classification problems. The basic idea of GB originated from AdaBoost, proposed by Freund and Schapire [22]. While RF constructs an ensemble of deep individual trees, GB builds an ensemble of shallow trees. ...
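The contrast drawn here is easy to see in scikit-learn terms; a two-line sketch (tree counts and depths illustrative):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, max_depth=None)   # deep trees, grown independently
gb = GradientBoostingClassifier(n_estimators=300, max_depth=3)  # shallow trees, fitted sequentially
```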
Article
Introduction Chronic diseases have become one of the main causes of premature death all around the world in recent years. The diagnosis of chronic diseases is time-consuming and costly. Therefore, timely diagnosis and prediction of chronic diseases are very necessary. Methods In this paper, a new method for chronic disease diagnosis is proposed by combining convolutional neural network (CNN) and ensemble learning. This method utilizes random forest (RF) as the base classifier to improve classification performance and diagnostic accuracy, and then combines AdaBoost to successfully replace the Softmax layer of CNN to generate multiple accurate base classifiers while determining their optimal attributes, achieving high-quality classification and prediction of chronic diseases. Results To verify the effectiveness of the proposed method, real-world Electronic Medical Records dataset (C-EMRs) was used for experimental analysis. The results show that compared with other traditional machine learning methods such as CNN, K-Nearest Neighbor, and RF, the proposed method can effectively improve the accuracy of diagnosis and reduce the occurrence of missed diagnosis and misdiagnosis. Conclusions This study will provide effective information for the diagnosis of chronic diseases, assist doctors in making clinical decisions, develop targeted intervention measures, and reduce the probability of misdiagnosis.
Article
Alzheimer's disease and related dementias (ADRD) present a looming public health crisis, affecting roughly 5 million people and 11% of older adults in the United States. Despite nationwide efforts for timely diagnosis of patients with ADRD, more than 50% of them are not diagnosed and unaware of their disease. To address this challenge, we developed ADscreen, an innovative speech-processing based ADRD screening algorithm for the proactive identification of patients with ADRD. ADscreen consists of five major components: (i) noise reduction for reducing background noises from the audio-recorded patient speech, (ii) modeling the patient's ability in phonetic motor planning using acoustic parameters of the patient's voice, (iii) modeling the patient's ability in semantic and syntactic levels of language organization using linguistic parameters of the patient speech, (iv) extracting vocal and semantic psycholinguistic cues from the patient speech, and (v) building and evaluating the screening algorithm. To identify important speech parameters (features) associated with ADRD, we used Joint Mutual Information Maximization (JMIM), an effective feature selection method for high dimensional, small sample size datasets. Modeling the relationship between speech parameters and the outcome variable (presence/absence of ADRD) was conducted using three different machine learning (ML) architectures capable of joining informative acoustic and linguistic features with contextual word embedding vectors obtained from DistilBERT (Bidirectional Encoder Representations from Transformers). We evaluated the performance of ADscreen on audio-recorded patient speech (verbal descriptions) for the Cookie-Theft picture description task, which is publicly available in the dementia databank. The joint fusion of acoustic and linguistic parameters with contextual word embedding vectors of DistilBERT achieved an F1-score of 84.64 (standard deviation [std] = ±3.58) and AUC-ROC of 92.53 (std = ±3.34) on the training dataset, and an F1-score of 89.55 and AUC-ROC of 93.89 on the test dataset. In summary, ADscreen has a strong potential to be integrated into clinical workflows to address the need for an ADRD screening tool, so that patients with cognitive impairment can receive appropriate and timely care.
Article
Full-text available
Polycystic ovary syndrome (PCOS) is the most frequent endocrinological anomaly in reproductive women that causes persistent hormonal secretion disruption, leading to the formation of numerous cysts within the ovaries and serious health complications. However, real-world clinical detection of PCOS is critical, since the accuracy of interpretation is substantially dependent on the physician's expertise. Thus, an artificially intelligent PCOS prediction model might be a feasible additional technique to the error-prone and time-consuming diagnostic technique. In this study, a modified ensemble machine learning (ML) classification approach is proposed utilizing a state-of-the-art stacking technique for PCOS identification with patients' symptom data, employing five traditional ML models as base learners and then one bagging or boosting ensemble ML model as the meta-learner of the stacked model. Furthermore, three distinct types of feature selection strategies are applied to pick different sets of features with varied numbers and combinations of attributes. To evaluate and explore the dominant features necessary for predicting PCOS, the proposed technique with five varieties of models and ten other types of classifiers is trained, tested and assessed utilizing different feature sets. As outcomes, the proposed stacking ensemble technique significantly enhances the accuracy in comparison to the other existing ML-based techniques for all varieties of feature sets. However, among the various models investigated to categorize PCOS and non-PCOS patients, the stacking ensemble model with ‘Gradient Boosting’ classifier as meta-learner outperforms the others with 95.7% accuracy while utilizing the top 25 features selected using the Principal Component Analysis (PCA) feature selection technique.
Article
Autism spectrum disorder (ASD) is a complex neurological developmental disorder in children, and is associated with social isolation and restricted interests. The etiology of this disorder is still unknown. There is neither any confirmed laboratory test nor any effective therapeutic strategy to diagnose or cure it. We performed data independent acquisition (DIA) and multiple reaction monitoring (MRM) analysis of plasma from children with ASD and controls. The result showed that 45 differentially expressed proteins (DEPs) were identified between autistic subjects and controls. Among these, only one DEP was down-regulated in ASD; the other DEPs were up-regulated in the plasma of children with ASD. These proteins are associated with complement and coagulation cascades, vitamin digestion and absorption, cholesterol metabolism, platelet degranulation, the selenium micronutrient network, extracellular matrix organization and the inflammatory pathway, which have been reported to be related to ASD. After MRM verification, five key proteins in the complement pathway (PLG, SERPINC1, and A2M) and the inflammatory pathway (CD5L, ATRN, SERPINC1, and A2M) were confirmed to be significantly up-regulated in the ASD group. Through the screening of machine learning models and MRM verification, we found that two proteins (biotinidase and carbonic anhydrase 1) can be used as early diagnostic markers of ASD (AUC = 0.8, p = 0.0001). SIGNIFICANCE: ASD is the fastest growing neurodevelopmental disorder in the world and has become a major public health problem worldwide. Its prevalence has been steadily increasing, with a global prevalence rate of 1%. Early diagnosis and intervention can achieve a better prognosis. In this study, data independent acquisition (DIA) and multiple reaction monitoring (MRM) analysis was applied to analyze the plasma proteome of ASD patients (31 (±5) months old), and 378 proteins were quantified. 45 differentially expressed proteins (DEPs) were identified between the ASD group and the control group. They were mainly associated with platelet degranulation, ECM proteoglycans, complement and coagulation cascades, the selenium micronutrient network, regulation of insulin-like growth factor (IGF) transport and uptake by insulin-like growth factor binding proteins (IGFBPs), cholesterol metabolism, vitamin metabolism, and the inflammatory pathway. Through the integrated machine learning methods and the MRM verification of independent samples, it is considered that biotinidase and carbonic anhydrase 1 have the potential to become biomarkers for the early diagnosis of ASD. These results complement the proteomics database of ASD patients, broaden our understanding of ASD, and provide a panel of biomarkers for the early diagnosis of ASD.
Article
Full-text available
The United Nations' (UN) Sustainable Development Goals (SDGs) agenda has triggered numerous countries to harness solar energy from solar photovoltaic (PV) modules to increase the share of renewable energy in the global energy mix. However, geographical and climatic factors have a significant impact on the electrical performance of solar PV modules. In addition, since solar PV energy production models are the only physics-based approach to transferring ground-measured PV energy production to other locations, the authors developed 294 physical models from six different PV power technologies and validated them for the models' adaptability. To facilitate the possible determination of PV electric energy generation in the unique geographical and climatic environment of the experiment site, these models were built using machine learning, Gumbel's probabilistic approach (GPM), and hybridization of the two. The major challenge in this study lies in developing the hybridized machine learning and Gumbel probabilistic functional model, whose mathematical transformation process required extensive mathematical groundwork to arrive at the final transformed and efficient model for predicting potential solar PV output. With a coefficient of determination (R2) of 0.9998 and a root mean square error (RMSE) of 0.0063 kWh, the hybrid model with only the measurable solar radiation parameter is the closest to the measured PV energy production of all technologies. The best hybridized model was used to explore the potential impacts of climate change on the different solar PV technologies. This was achieved by using energy parameters from the Australian Community Climate and Global System Simulation (ACCESS-CM2) in Phase 6. On an annual basis, the effects of climate change on various PV technologies have had a small adverse impact (less than 1%) on these renewable energy technologies. It was also found that, compared to other technologies, CIGS thin film technology produced the least negative effects on climate change, with 10.94%–36.75% in the best-case, 35.71%–36.36% in the moderate-case, and 33.33–40.00% in the worst-case scenario for shared socioeconomic pathways (SSP126, SSP245, and SSP585) emissions. This suggests the intrinsic properties of Copper Indium Gallium Selenide (CIGS) thin film modules are more effective at withstanding high temperatures, as they contribute 60.00–89.66% of their intrinsic module properties to PV energy production compared to other technologies. However, taking into account the time, resource availability, cost-effectiveness, commercialization, and consumption of the various PV technologies studied in this era of global sustainability, poly-crystalline (p-Si) technology is highly recommended for harvesting solar PV energy in Alice Springs, Australia.
Article
To explore the effects of the hydration shell on the surface tension of electrolyte solutions and to build an effective prediction model, a machine learning based model is proposed to accurately predict and explain the surface tension of electrolyte solutions. The model combines machine learning (ML) algorithms, force field parameters from molecular dynamics simulations, and the radial distribution function (RDF) to accurately capture the structural features of electrolyte solutions. The predictions achieved an extremely low average relative deviation. The SHapley Additive exPlanations (SHAP) method is used to rank the features in order of importance from strong to weak. Notably, the influence of the second hydration shell on surface tension may exceed that of the first hydration shell. This work provides a method for one-step acquisition of surface tension data that can not only accurately predict physical and chemical properties of materials, but also extend the application of molecular dynamics simulations, providing enlightening insights for detecting underlying physical mechanisms.
Article
Full-text available
Ensemble learning combines several individual models to obtain better generalization performance. Currently, deep learning architectures are showing better performance compared to shallow or traditional models. Deep ensemble learning models combine the advantages of both deep learning models and ensemble learning such that the final model has better generalization performance. This paper reviews the state-of-the-art deep ensemble models and hence serves as an extensive summary for researchers. The ensemble models are broadly categorized into bagging, boosting, stacking, negative correlation based deep ensemble models, explicit/implicit ensembles, homogeneous/heterogeneous ensembles, and decision fusion strategies based deep ensemble models. Applications of deep ensemble models in different domains are also briefly discussed. Finally, we conclude this paper with some potential future research directions.
Article
A reliable and efficient forecasting system can be used to warn the general public against the increasing PM2.5 concentration. This paper proposes a novel AdaBoost‐ensemble technique based on a hybrid data preprocessing‐analysis strategy, with the following contributions: (i) a new decomposition strategy is proposed based on the hybrid data preprocessing‐analysis strategy, which combines the merits of two popular decomposition algorithms and has been proven to be a promising decomposition strategy; (ii) the LSTM, as a powerful deep learning forecasting algorithm, is applied to individually forecast the decomposed components, which can effectively capture the long‐short patterns of complex time series; and (iii) a novel AdaBoost‐LSTM ensemble technique is then developed to integrate the individual forecasting results into the final forecasting results, which provides significant improvement to the forecasting performance. To evaluate the proposed model, a comprehensive and scientific assessment system with several evaluation criteria, comparison models and experiments is designed. The experimental results indicate that our developed hybrid model considerably surpasses the compared models in terms of forecasting precision and statistical testing, and that its excellent forecasting performance can guide in developing effective control measures to decrease environmental contamination and prevent the health issues caused by a high PM2.5 concentration.
Article
Crop prescription data contains an extensive amount of information on crops, environment and pests, and has notable diagnostic capabilities. At present, there is a lack of feasible methods for efficiently mining crop prescription data to perform accurate diagnoses. In view of the above problems, the purpose of our study is to mine prescription data and assist the accurate diagnosis of crop diseases. In this paper, six tomato diseases and pests, namely, the tomato virus disease, tomato late blight, tomato gray mold, aphids, thrips and whiteflies, were explored to construct a diagnosis model based on prescription data mining. Original prescription data was subjected to pre-processing, text labeling and one-hot coding. The recursive feature elimination (RFE) method was then employed to extract 37 key features relating to crop diseases and pests from the original 50 features. We constructed a tomato disease and pest diagnosis model based on two-stage Stacking ensemble learning to improve the diagnosis accuracy. The experimental results demonstrated that the proposed diagnosis model exhibits slightly superior performance compared to the best model (LGBM) among ten diagnosis models. The optimal Stacking model is composed of two layers: base-classifiers including GBDT, XGBoost and LGBM, and the meta-classifier RF. The diagnosis accuracy of the proposed model for the tomato virus disease reached 94.84%, with an F1-score of 95.98% and overall accuracy of 80.36%. It also performed well on the multi-classification metrics: macro average (Precision: 76.55%, Recall: 78.17%, F1-score: 77.05%) and weighted average (Precision: 80.96%, Recall: 80.36%, F1-score: 80.50%). Moreover, following feature selection, the Stacking-based diagnosis model can reduce the running time by 12.08% with unchanged diagnosis accuracy. The proposed diagnosis model meets real-world diagnosis requirements. This work provides new research concepts and a methodological foundation for future crop disease and pest diagnosis.
Article
Cities are undergoing huge shifts in technology and operations in recent days, and ‘data science’ is driving the change in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting insights or actionable knowledge from city data and building a corresponding data-driven model is the key to making a city system automated and intelligent. Data science is typically the study and analysis of actual happenings with historical data using a variety of scientific methodology, machine learning techniques, processes, and systems. In this paper, we concentrate on and explore “Smart City Data Science”, where city data collected from various sources like sensors and Internet-connected devices, is being mined for insights and hidden correlations to enhance decision-making processes and deliver better and more intelligent services to citizens. To achieve this goal, various machine learning analytical modeling can be employed to provide deeper knowledge about city data, which makes the computing process more actionable and intelligent in various real-world services of today’s cities. Finally, we identify and highlight ten open research issues for future development and research in the context of data-driven smart cities. Overall, we aim to provide an insight into smart city data science conceptualization on a broad scale, which can be used as a reference guide for the researchers, professionals, as well as policy-makers of a country, particularly, from the technological point of view.
Article
Given the large amount of customer data available to financial companies, the use of traditional statistical approaches (e.g., regressions) to predict customers’ credit scores may not provide the best predictive performance. Machine learning (ML) algorithms have been explored in the credit scoring literature to increase predictive power. In this paper, we predict commercial customers’ credit scores using hybrid ML algorithms that combine unsupervised and supervised ML methods. We implement different approaches and compare the performance of the hybrid models to that of individual supervised ML models. We find that hybrid models outperform their individual counterparts in predicting commercial customers’ credit scores. Further, while the existing literature ignores past credit scores, we find that the hybrid models’ predictive performance is higher when these features are included.
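One common hybrid recipe consistent with this description (an illustrative assumption, not necessarily the authors' design) derives cluster membership unsupervised and feeds it to a supervised model as an extra feature:

```python
# Hybrid ML: unsupervised clustering stage, then supervised prediction
# with the cluster label appended as a feature.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

def hybrid_fit(X, y, k=5):
    km = KMeans(n_clusters=k, n_init=10).fit(X)        # unsupervised stage
    X_aug = np.column_stack([X, km.labels_])           # cluster id as extra feature
    model = GradientBoostingRegressor().fit(X_aug, y)  # supervised stage
    return km, model

def hybrid_predict(km, model, X):
    X_aug = np.column_stack([X, km.predict(X)])
    return model.predict(X_aug)
```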
Conference Paper
This paper proposes a method that uses adaptive decision-making of the wavelet level for detecting high-voltage direct current (HVDC) discharge in the wavelet transform. Identification and detection of HVDC discharge are essential study subjects for pipeline safety and optimal operation of electrical power systems. This method overcomes the disadvantage that the wavelet packet transform needs its decomposition level determined in advance. The decomposition level of the wavelet packet transform is controlled by calculating the relative wavelet energy change to decide the wavelet level. Our proposal extracts richer features of HVDC discharge than other feature extraction algorithms. The second primary contribution is a wavelet-based application framework designed to detect HVDC discharge and further protect the energy pipeline. These findings have application value in the protection of power systems and provide opportunities and brighter perspectives, along with valuable studies, in the detection and classification of time-series data.
Article
Parameter identification plays an important role in electric power transmission systems. Existing approaches for parameter identification tasks typically have two limitations: (1) they generally ignore the development trend of historical data and do not mine the characteristics of the corresponding power grid branches; (2) they do not consider the constraints of the power grid topology and treat different branches independently. Therefore, they cannot characterize the correlations between the center node and its neighborhoods. To overcome these limitations, this work proposes a multi-task graph convolutional neural network (MT-GCN) which utilizes the graph convolutional network (GCN) and the fully convolutional network (FCN) as building blocks for parameter identification. Specifically, the GCN extracts structure information to enhance local feature extraction. The FCN is a decoding module following the GCN module, and it is used to identify the parameters of each branch according to its characteristics. Compared with previous methods, the proposed method is significantly improved in accuracy. Besides, this method is robust to measurement noise and errors, and can cope with multiple conditions in real power transmission systems.