Figure 3 - uploaded by Hemant Kumar Gianey
Entropy of a decision tree. Source: entropy of a DT algorithm [22]. Algorithm for information gain: i. Calculate the entropy of the target.

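The caption's information-gain procedure (compute the entropy of the target, then subtract the weighted entropy of the subsets produced by splitting on a feature) can be sketched as follows; the toy feature and target values are hypothetical.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy H = -sum(p * log2(p)) over class frequencies."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the target minus the weighted entropy after splitting:
    IG = H(target) - sum(|subset| / |total| * H(subset))."""
    total = len(labels)
    split_entropy = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        split_entropy += len(subset) / total * entropy(subset)
    return entropy(labels) - split_entropy

# Hypothetical toy data: a perfectly informative binary feature.
feature = ['a', 'a', 'b', 'b']
target = ['yes', 'yes', 'no', 'no']
print(entropy(target))                    # 1.0 for a 50/50 split
print(information_gain(feature, target))  # 1.0: the split removes all uncertainty
```

A decision tree grows by repeatedly choosing the feature with the highest information gain at each node.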

Similar publications

Article
Full-text available
Sample size estimation is a crucial step in experimental design but is understudied in the context of deep learning. Currently, estimating the quantity of labeled data needed to train a classifier to a desired performance is largely based on prior experience with similar models and problems or on untested heuristics. In many supervised machine lea...
Preprint
Full-text available
Sensor and control data of modern mechatronic systems are often available as heterogeneous time series with different sampling rates and value ranges. Suitable classification and regression methods from the field of supervised machine learning already exist for predictive tasks, for example in the context of condition monitoring, but their performa...

Citations

... The authors reviewed supervised and unsupervised machine learning algorithms in [32,33] and presented their findings. Like other systems and processes, there are some limitations of machine learning, some of which are highlighted in [34,35]. ...
Article
Full-text available
Electronic manufacturing and design companies maintain test sites for a range of products. These products are designed according to end-user requirements, which in turn determine which proof-of-design and manufacturing tests are needed. Test sites are designed to carry out these two things, i.e., proof-of-design and manufacturing tests. The team responsible for designing test sites considers several parameters, such as deployment cost, test time, and test coverage. In this study, an automated test site using a supervised machine learning algorithm for testing an ultra-high-frequency (UHF) transceiver is presented. The test site is designed in three steps: firstly, an initial manual test site is designed; secondly, the manual design is upgraded into a fully automated test site; and finally, supervised machine learning is applied to the automated design to further enhance its capability. The manual test site setup is required to streamline the test sequence and validate the control and measurements taken from the test equipment and the performance of the unit under test (UUT). The manual test results showed a high test time, and some inconsistencies were observed when the test operator was required to change component values to tune the UUT. There was also a sudden increase in UUT quantities; to cater for this, the test site was upgraded to an automated test site, while the issue of inconsistencies was resolved through the application of machine learning. The automated test site significantly reduced the test time per UUT. To support the test operator in selecting the correct component value the first time, a supervised machine learning algorithm is applied. The results show an overall improvement in terms of reduced test time, increased consistency, and improved quality through automation and machine learning.
... Precision is the ratio between the correctly predicted positive values and the overall number of positive predictions [54]. Thus, while accuracy expresses the proximity of the model to the actual results, precision quantifies the consistency of the results, ignoring the achievement of the goal. ...
... Recall is the ratio between the number of correctly predicted positive values and the total number of actual positive results in the test set [54]. ...
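The two definitions quoted above translate directly into code. A minimal sketch, assuming binary labels where 1 marks the positive class and the example predictions are hypothetical:

```python
def precision(y_true, y_pred, positive=1):
    """Correctly predicted positives / all positive predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    predicted_pos = sum(1 for p in y_pred if p == positive)
    return tp / predicted_pos

def recall(y_true, y_pred, positive=1):
    """Correctly predicted positives / all actual positives in the test set."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return tp / actual_pos

y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
print(precision(y_true, y_pred))  # 2/3: three positive predictions, two correct
print(recall(y_true, y_pred))     # 2/3: three actual positives, two found
```

The two metrics can disagree sharply: a model that predicts positive for everything has perfect recall but poor precision.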
Article
Full-text available
Electrical utilities and system operators (SOs) are constantly looking for solutions to problems in the management and control of the power network. For this purpose, SOs are exploring new research fields, which might bring contributions to the power system environment. A clear example is the field of computer science, within which artificial intelligence (AI) has been developed and is being applied to many fields. In power systems, AI could support the fault prediction of cable joints. Despite the availability of many legacy methods described in the literature, fault prediction is still critical, and it needs new solutions. For this purpose, in this paper, the authors made a further step in the evaluation of machine learning methods (ML) for cable joint health assessment. Six ML algorithms have been compared and assessed on a consolidated test scenario. It simulates a distributed measurement system which collects measurements from medium-voltage (MV) cable joints. Typical metrics have been applied to compare the performance of the algorithms. The analysis is then completed considering the actual in-field conditions and the SOs’ requirements. The results demonstrate: (i) the pros and cons of each algorithm; (ii) the best-performing algorithm; (iii) the possible benefits from the implementation of ML algorithms.
... Because of its ease of interpretation and short calculation time, the K-nearest neighbour (KNN) method is well known for its simplicity. It stores the available cases and categorises new instances based on homogeneity, using a distance function [10]. A majority vote of the object's neighbours determines its classification, a process known as class integration; the object is then assigned to the class with the highest similarity among the K nearest neighbours [10]. ...
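The majority-vote rule described in the snippet can be sketched in a few lines; the training points and labels below are hypothetical, and Euclidean distance stands in for the distance function.

```python
from collections import Counter
import math

def knn_classify(train_X, train_y, query, k=3):
    """Classify `query` by a majority vote of its k nearest neighbours,
    ranked by Euclidean distance to the stored training cases."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    votes = Counter(y for _, y in dists[:k])   # class integration: count the k votes
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data with two well-separated classes.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ['A', 'A', 'A', 'B', 'B', 'B']
print(knn_classify(train_X, train_y, (0.5, 0.5)))  # 'A'
print(knn_classify(train_X, train_y, (5.5, 5.5)))  # 'B'
```

No training phase is needed beyond storing the cases, which is why KNN is often called a lazy learner.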
Conference Paper
In the current era of computation, machine learning is the most commonly used technique to find patterns in highly complex datasets. The present paper shows some existing applications, such as stock data mining, undergraduate admission, and breast lesion detection, where different supervised machine learning algorithms are used to classify various patterns. A performance analysis, in terms of accuracy, precision, sensitivity, and specificity, is given for all three applications. It is observed that the support vector machine (SVM) is the most commonly used supervised learning method and shows good results on these performance metrics. A comparative analysis of SVM classifiers on the above-mentioned applications is shown in the paper.
... Societal interest in machine learning (ML), especially the subtopic of deep learning (DL), has surged within recent years. This is partially driven by the continuing success of these approaches in many application areas [240,398,424,439], facilitated by both fundamental advances [147,152,323,506] and the increasing availability of computational resources. Unsurprisingly, on the academic side, the field of artificial intelligence (AI) continues to dominate research outputs, as noted by the 2021 UNESCO Science Report [487]. ...
Preprint
With most technical fields, there exists a delay between fundamental academic research and practical industrial uptake. Whilst some sciences have robust and well-established processes for commercialisation, such as the pharmaceutical practice of regimented drug trials, other fields face transitory periods in which fundamental academic advancements diffuse gradually into the space of commerce and industry. For the still relatively young field of Automated/Autonomous Machine Learning (AutoML/AutonoML), that transitory period is under way, spurred on by a burgeoning interest from broader society. Yet, to date, little research has been undertaken to assess the current state of this dissemination and its uptake. Thus, this review makes two primary contributions to knowledge around this topic. Firstly, it provides the most up-to-date and comprehensive survey of existing AutoML tools, both open-source and commercial. Secondly, it motivates and outlines a framework for assessing whether an AutoML solution designed for real-world application is 'performant'; this framework extends beyond the limitations of typical academic criteria, considering a variety of stakeholder needs and the human-computer interactions required to service them. Thus, additionally supported by an extensive assessment and comparison of academic and commercial case-studies, this review evaluates mainstream engagement with AutoML in the early 2020s, identifying obstacles and opportunities for accelerating future uptake.
... In PFI, the impact of shuffling the values of a feature, e.g., impervious locations (x_m) within the contributing area, on the target variable (y_i1) is quantified to observe the response in the output variables due to the change in the input variables. The score of the error metric (RMSE) is derived from the observed and predicted values [91]. In this study, the F1-score and Jaccard similarity score are used to evaluate the ML classifiers. ...
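The permutation feature importance (PFI) idea in the snippet, shuffle one feature column and measure how much the error metric degrades, can be sketched as follows; the model and data are hypothetical stand-ins.

```python
import math
import random

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def permutation_importance(model, X, y, feature_idx, n_repeats=10, seed=0):
    """Average increase in RMSE when one feature column is shuffled.
    The larger the increase, the more the model depends on that feature."""
    rng = random.Random(seed)
    base = rmse(y, [model(row) for row in X])
    increases = []
    for _ in range(n_repeats):
        column = [row[feature_idx] for row in X]
        rng.shuffle(column)
        X_perm = [list(row) for row in X]
        for row, v in zip(X_perm, column):
            row[feature_idx] = v
        increases.append(rmse(y, [model(r) for r in X_perm]) - base)
    return sum(increases) / n_repeats

# Hypothetical fitted model that uses only feature 0: y = 2 * x0.
model = lambda row: 2 * row[0]
X = [[i, i % 3] for i in range(20)]
y = [2 * i for i in range(20)]
print(permutation_importance(model, X, y, 0))  # large: shuffling x0 hurts
print(permutation_importance(model, X, y, 1))  # 0.0: feature 1 is unused
```

Libraries such as scikit-learn provide an equivalent `permutation_importance` utility with a configurable scoring metric.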
Article
Full-text available
As urbanization increases across the globe, urban flooding is an ever-pressing concern. Urban fluvial systems are highly complex, depending on a myriad of interacting variables. Numerous hydraulic models are available for analyzing urban flooding; however, meeting the demand for high spatial extent and finer discretization while solving the physics-based numerical equations is computationally expensive. Computational effort increases drastically with model dimension and resolution, preventing current solutions from fully realizing the data revolution. In this research, we demonstrate the effectiveness of artificial intelligence (AI), in particular machine learning (ML) methods including the emerging deep learning (DL), to quantify urban flooding, considering the lower part of Darby Creek, PA, USA. Training datasets comprise multiple geographic and urban hydraulic features (e.g., coordinates, elevation, water depth, flooded locations, discharge, average slope, the impervious area within the contributing region, and downstream distance from stormwater outfalls and dams). ML classifiers such as logistic regression (LR), decision tree (DT), support vector machine (SVM), and K-nearest neighbors (KNN) are used to identify the flooded locations. A deep neural network (DNN)-based regression model is used to quantify the water depth. The values of the evaluation metrics indicate satisfactory performance for both the classifiers and the DNN model (F1-scores of 0.975, 0.991, 0.892, and 0.855 for the binary classifiers; root mean squared error of 0.027 for the DNN regression). In addition, blocked K-fold cross-validation (CV) of the ML classifiers in detecting flooded locations showed satisfactory performance, with an average accuracy of 0.899, which validates the models' ability to generalize to unseen areas. This approach is a significant step towards resolving the complexities of urban fluvial flooding with a large multi-dimensional dataset in a highly computationally efficient manner.
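The blocked K-fold cross-validation mentioned in the abstract differs from ordinary K-fold in that samples are not shuffled: each validation fold is a contiguous block, so spatially (or temporally) correlated samples stay together and do not leak between training and validation. A minimal index-generating sketch, assuming samples are already ordered by location:

```python
def blocked_kfold_indices(n_samples, k):
    """Split sample indices into k contiguous blocks (no shuffling).
    Returns a list of (train_indices, validation_indices) pairs."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, val_idx))
        start += size
    return folds

# 10 ordered samples, 3 folds: validation blocks stay contiguous.
for train_idx, val_idx in blocked_kfold_indices(10, 3):
    print(val_idx)
```

Each pair can then be fed to any classifier's fit/score loop; averaging the fold accuracies gives the CV estimate reported in the abstract.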
... Supervised machine learning is the most prevalent in medicine. [11][12][13][14][15][16][17][18][19] In supervised learning, labeled datasets are used to train an algorithm to correctly classify data. [11][12][13][14][15][16][17][18][19] To train an algorithm, the labeled data is used to deduce any association between the independent and dependent variables. The respective weights of the independent variables are adjusted within the algorithm until it arrives at the best fit, with the least error in predicting the dependent variable. ...
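The weight-adjustment loop described in the snippet can be illustrated with the simplest case, fitting one weight and a bias by gradient descent on squared error; the training data below is hypothetical, generated from y = 3x + 1.

```python
def fit_linear(xs, ys, lr=0.01, epochs=2000):
    """Iteratively adjust weight and bias to minimise the squared error
    between the predicted and actual dependent variable."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical labeled data from y = 3x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 4, 7, 10, 13]
w, b = fit_linear(xs, ys)
print(round(w, 2), round(b, 2))  # close to 3.0 and 1.0
```

Each pass reduces the prediction error a little; training stops when the fit no longer improves, which is the "best fit with the least error" the snippet refers to.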
Article
The field of pediatric critical care has been hampered in the era of precision medicine by our inability to accurately define and subclassify disease phenotypes. This has been caused by heterogeneity across age groups, which further challenges the ability to perform randomized controlled trials in pediatrics. One approach to overcoming these inherent challenges is the use of machine learning algorithms that can assist in generating more meaningful interpretations from clinical data. This review summarizes the machine learning and artificial intelligence techniques currently in use for clinical data modeling with relevance to pediatric critical care. Focus is placed on the differences between techniques and the role of each in the clinical arena. The various forms of clinical decision support that utilize machine learning are also described. We review the applications and limitations of machine learning techniques to empower clinicians to make informed decisions at the bedside. Critical care units generate large amounts of under-utilized data that can be processed through artificial intelligence; this review highlights the applications and limitations of these techniques within a clinical context to aid providers in making more informed decisions at the bedside.
... ML techniques are proposed to detect AM-related issues like geometric tolerancing [15], in-situ defect detection [16] and fatigue crack growth rate [17]. Most of the training data are available with labels, which enables the use of supervised learning tasks [12,18,19]. Typical ML models used for defect detection tasks include linear regression (LR), support vector regression (SVR) [20], k-nearest neighbours (KNN) [17,21,22], decision trees (DT) [17], random forest (RF) [17,23], gradient boosting (GB), and artificial neural networks (ANN) [24], which learn from the training dataset to predict the desired output parameters for a new test dataset. ...
Article
Full-text available
Additive manufacturing (AM) / 3D printing is a game-changing technology for developing new, improved product-innovation solutions with smart manufacturing advancements. One of the major challenges for AM-manufactured metallic parts using laser powder bed fusion (LPBF) is predicting the quality of the printed samples under varying process parameters. This paper focuses on predicting part density from pyrometer-based data using machine learning (ML) models, including linear regression (LR), random forest (RF), K-nearest neighbour (KNN), support vector regression (SVR), extreme gradient boosting (XGB), and artificial neural network (ANN). Different pre-processing methods, such as a Butterworth filter and thresholding, have been compared with analysis based on the raw pyrometer data. Time-domain statistical features, including mean, standard deviation, root mean square, entropy, etc., have been used as inputs to the ML models. The six ML models were trained with and without feature selection (FS) to predict part density. Among the regression algorithms used in this study, the best performance metrics (R² of 0.85 and 0.86) were obtained by RF regression using raw and filtered data, respectively, while thresholding reduced model performance. The analysis reveals that the combined effect of laser power and scanning speed most influences the quality of the printed parts. A subsequent experiment with new process parameters chosen based on the data analysis was able to print parts with improved quality, thereby confirming the validity of our ML framework.
... The third type of analysis is done using the lift curve. The lift curve, as defined, provides a measure of the effectiveness of the generated classifier model [5]. It is the ratio of the results obtained with and without the classifier model. ...
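That "with and without the model" ratio can be computed at a single point on the lift curve as follows; the labels and classifier scores below are hypothetical.

```python
def lift_at(y_true, scores, fraction):
    """Lift at a given targeting fraction: the positive rate among the top
    `fraction` of cases ranked by classifier score, divided by the overall
    (random-targeting) positive rate."""
    n = len(y_true)
    top_n = max(1, int(n * fraction))
    ranked = [y for _, y in sorted(zip(scores, y_true), reverse=True)]
    top_rate = sum(ranked[:top_n]) / top_n   # with the model
    base_rate = sum(y_true) / n              # without the model
    return top_rate / base_rate

# Hypothetical scores that rank both positives first.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(lift_at(y_true, scores, 0.2))  # 5.0: top 20% all positive vs 20% base rate
```

Plotting lift against the targeting fraction gives the full lift curve; a lift of 1.0 everywhere means the classifier is no better than random selection.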
Article
Fake news detection continues to be a major problem that affects our society today. Fake news can be classified using a variety of methods. Predicting and detecting fake news has proven to be challenging even for machine learning algorithms. This research employs Legitimacy, a unique ensemble machine learning model to accomplish the task of Credibility-Based Fake News Detection. The Legitimacy ensemble combines the learning potential of a Two-Class Boosted Decision Tree and a Two-Class Neural Network. The ensemble technique follows a pseudo-mixture-of-experts methodology. For the gating model, an instance of Two-Class Logistic Regression is implemented. This study validates Legitimacy using a standard dataset with features relating to the credibility of news publishers to predict fake news. These features are analysed using the ensemble algorithm. The results of these experiments are examined using four evaluation methodologies. The analysis of the results reveals positive performance with the use of the ensemble ML method with an accuracy of 96.9%. This ensemble’s performance is compared with the performance of the two base machine learning models of the ensemble. The performance of the ensemble surpasses that of the two base models. The performance of Legitimacy is also analysed as the size of the dataset increases to demonstrate its scalability. Hence, based on our selected dataset, the Legitimacy ensemble model has proven to be most appropriate for Credibility-Based Fake News Detection.
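The pseudo-mixture-of-experts structure described in the abstract (two base classifiers blended by a gating model) can be sketched generically. Everything below is a hypothetical illustration of the pattern, not the Legitimacy model itself: the gate and both experts stand in for trained classifiers that output a probability.

```python
import math

def gated_ensemble(gate, expert_a, expert_b, x):
    """Pseudo-mixture-of-experts: the gating model outputs a weight in [0, 1]
    that blends the two experts' probability estimates for input x."""
    g = gate(x)  # weight assigned to expert A for this input
    return g * expert_a(x) + (1 - g) * expert_b(x)

# Hypothetical trained components for a 1-D feature x:
gate = lambda x: 1 / (1 + math.exp(-(2 * x - 1)))  # logistic-regression-style gate
expert_a = lambda x: 0.9 if x > 0.5 else 0.2       # e.g., a boosted-tree score
expert_b = lambda x: 0.6                           # e.g., a neural-network score

p = gated_ensemble(gate, expert_a, expert_b, x=1.0)
print(round(p, 3))  # blend weighted toward expert A when the gate output is high
```

Unlike simple averaging, the gate lets different experts dominate in different regions of the input space, which is the core idea of a mixture of experts.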
... LR is the simplest form of supervised ML algorithm; it predicts the dependent variable when provided with independent variables [37]. It can predict single or multiple dependent variables; predicting multiple output variables is called multi-output regression [38]. ...
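The single-output and multi-output cases the snippet distinguishes differ only in whether there is one weight vector or one per output. A minimal prediction sketch, with hypothetical weights as if already fitted by least squares:

```python
def predict_single(weights, bias, features):
    """One dependent variable from several independent variables:
    y = w1*x1 + w2*x2 + ... + b."""
    return sum(w * f for w, f in zip(weights, features)) + bias

def predict_multi(weight_rows, biases, features):
    """Multi-output regression: one weight vector and bias per output
    variable, applied to the same independent variables."""
    return [predict_single(w, b, features) for w, b in zip(weight_rows, biases)]

features = [2.0, 3.0]
print(predict_single([0.5, 1.0], 0.1, features))                       # 4.1
print(predict_multi([[0.5, 1.0], [1.0, -1.0]], [0.1, 0.0], features))  # [4.1, -1.0]
```

Fitting, by least squares or gradient descent, determines the weights; prediction itself is just this weighted sum per output.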
Article
Full-text available
The decision tree (DT), linear regression (LR), and K-nearest neighbours (KNN) models were employed in this work to estimate the thermal performance of a tubular solar still with a wicked rotating drum. These three models were developed using real-world experimental data and calculated values. This study used a dataset containing 95 experimental iterations in total. Five input parameters (solar intensity, basin water temperature, wind speed, ambient temperature, and glass temperature) were used as the independent variables of the DT, LR, and KNN models, and two dependent variables, thermal efficiency and productivity, were predicted. The DT model was the most significant, with the lowest error and the highest R² value compared to the LR and KNN models. The MAE, RMSE, and R² values for the DT model were 0.566828, 0.85135, and 0.9602, respectively, with a model efficiency of 0.961, the highest among the models compared. These results suggest that the DT model is a good fit for forecasting the thermal performance of tubular solar stills.
... It hypothesizes that a large amount of training data allows better classification model performance than smaller training data. However, if the amount of training data is disproportionate, there will be model overfitting [33]. Table 6 shows the performance comparison between the algorithms in experiment I: the SVM algorithm has the best accuracy but the worst performance. ...
... It hypothesizes that a large amount of training data produces a classification model with better performance than smaller training data. However, if the amount of training data is disproportionate, there will be model overfitting [33]. Fig. 5 is a summary of Table 7, showing the performance comparison between the algorithms in experiment II. ...
... Then, it resulted in a sparse matrix. As a result, this work confirms that a weakness of naïve Bayes is handling sparse datasets [33,23]. Experiment III is KSA using a machine learning algorithm with TF-IDF and is the final experiment of this study. ...
Article
Full-text available
The number of active social media users is always increasing, and they come from various backgrounds. A common habit of active users on social media is to use their local or national language to express their thoughts, social conditions, ideas, and perspectives, and to publish their opinions. Karonese is a non-English language prevalent mostly in North Sumatra, Indonesia, with unique morphology and phonology. Sentiment analysis has frequently been used in the study of local or national languages to obtain an overview of broader public opinion on a particular topic. Good-quality Karonese resources are needed to provide good Karonese sentiment analysis (KSA), and limited resources are an obstacle in KSA research. This work provides a Karonese dataset from multi-domain social media, with sentiment labels annotated by Karonese transcribers. Three kinds of experiments were applied: KSA using machine learning, and KSA using machine learning with two variants of feature extraction methods. The machine learning algorithms include logistic regression, naïve Bayes, support vector machine, and K-nearest neighbour. Feature extraction improves model performance in the range of 0.1 to 7.4 percent. Overall, TF-IDF as feature extraction contributes more to machine learning performance than BoW. The combination of the SVM algorithm with TF-IDF gives the highest performance: accuracy of 58.1 percent, precision of 58.5 percent, recall of 57.2 percent, and F1 score of 57.84 percent.
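The TF-IDF weighting that outperformed bag-of-words in the abstract is usually produced with a library such as scikit-learn's `TfidfVectorizer`; a from-scratch sketch of the underlying idea, with hypothetical example documents, looks like this:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF weights: term frequency times inverse document
    frequency, idf = log(N / df). Terms appearing in every document get a
    weight of zero; rarer terms get larger weights."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for doc in tokenized for term in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({term: (count / len(doc)) * math.log(n / df[term])
                        for term, count in tf.items()})
    return weights

# Hypothetical two-document corpus.
docs = ["mejuah juah kita kerina", "mejuah juah man banta"]
w = tf_idf(docs)
print(w[0]["kita"] > 0)       # True: the term appears in only one document
print(w[0]["mejuah"] == 0.0)  # True: appears in both, so idf = log(1) = 0
```

Unlike raw bag-of-words counts, these weights down-rank terms common to all documents, which is one reason TF-IDF features tend to help classifiers such as SVMs.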