C4.5: Programs for Machine Learning
... They were used in statistical research in 1984, and afterwards decision trees became part of machine learning and gained popularity through Quinlan's works, e.g., (Quinlan, 1993). Decision trees are currently among the most popular and most efficient methods of Computational Intelligence; they are used for solving prediction problems, explaining data structure, extracting knowledge from data, and also for explaining models obtained from other methods (Wieczorek, 2008). ...
... A number of different methods for constructing decision trees have already been published. Some of them are: Classification and Regression Trees (CART) (Breiman et al., 1984), ID3 and C4.5 (Quinlan, 1993), Separability of Split Value (SSV) Trees, the Fast Algorithm for Classification Trees (FACT), the Quick Unbiased Efficient Statistical Tree (QUEST), and Cal5 (Jankowski & Grąbczewski, 2006). Apart from conventional rules, there are also systems used for building alternative types of rules, for example the fuzzy tree algorithm (Ichihashi, 1996). ...
... The expected information value after splitting the set $E$ into subsets $E^{(m)}$, $m = 1, \ldots, |V_f|$, for which feature $f$ takes the value $v_m$, is defined as (Quinlan, 1993): ...
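For reference, the quantity described in that excerpt is the class entropy of each subset weighted by the subset's relative size; a standard rendering of the C4.5 expected-information criterion, using the notation of the excerpt, is

$$\mathrm{Info}_f(E) \;=\; \sum_{m=1}^{|V_f|} \frac{|E^{(m)}|}{|E|}\,\mathrm{Info}\big(E^{(m)}\big),$$

where $\mathrm{Info}(\cdot)$ denotes the entropy of the class distribution within a subset; the information gain of feature $f$ is then $\mathrm{Info}(E) - \mathrm{Info}_f(E)$.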
Decision trees are one of the computational intelligence methods that have proved very reliable for solving complicated multidimensional problems. Therefore, these methods are often used for extracting rules and predicting variables, which makes them useful for production automation. In this paper the authors discuss the possibility of using decision trees for the electric arc steelmaking process. The main goal is to predict the temperature in the electric arc furnace by means of decision trees. Proper automatic temperature prediction may reduce the number of temperature measurements during the process and, consequently, may shorten the process time. Optimization of production processes leads to real benefits, such as lower production costs. Calculations were done using six types of regression decision trees available in the Statistica Data Miner software. The algorithms were evaluated with respect to the minimum temperature prediction error, but also with respect to a less complicated tree structure. The structure of a decision tree is also important owing to computational complexity.
... A Hoeffding test is performed at each split attempt to determine whether the leaf will split on the feature that maximizes the purity of the child nodes. This mechanism distributes the cost of the greedy best-split evaluation throughout the stream, and it yields an efficient algorithm for dealing with data streams compared to batch decision trees, such as CART [7] or C4.5 [8]. ...
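To make the split test concrete, the following is a minimal sketch of the Hoeffding-bound check that Hoeffding-based stream learners perform at a leaf; the function names, the confidence parameter delta, and the tie-breaking threshold are illustrative assumptions (for information gain on a two-class problem, the statistic's range is 1):

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound epsilon after observing n examples of a statistic
    whose value lies in an interval of width value_range."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best: float, gain_second: float, n: int,
                 delta: float = 1e-7, value_range: float = 1.0,
                 tie_threshold: float = 0.05) -> bool:
    """Split the leaf if the best candidate feature beats the runner-up by more
    than epsilon with confidence 1 - delta, or if epsilon has shrunk below the
    tie-breaking threshold (the two candidates are effectively equivalent)."""
    eps = hoeffding_bound(value_range, delta, n)
    return (gain_best - gain_second > eps) or (eps < tie_threshold)
```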
... Yet, monitoring the class distribution purity did not produce trees with good predictive quality. We argue that either the detectors are not ... PLASTIC has shown the worst ranking compared to Hoeffding Trees, as the authors run experiments on what they define as "lightweight versions of state-of-the-art decision trees" by setting a maximum depth of 20; however, the "amnesia" effect of Hoeffding Trees, which discard instances when splitting, mitigates overfitting and constructs a model that constantly learns from new samples [35]. Limiting the tree depth to 20 suppresses the amnesia effect, which contributes to the good results of Hoeffding Trees, and we judge that a depth of 20 is too high to prevent overfitting. ...
Decision trees are leading in state-of-the-art architectures for classifying high-speed data streams. Hoeffding-based trees have dominated the field and are more efficient than batch counterparts because they are incremental and distribute the cost of greedy best-split evaluations throughout the stream through periodic evaluation. However, this splitting mechanism is invariant to the state of the stream and the tree, as periodic evaluations cannot capture the state of the data or the performance of the tree in between assessments. In this work, we outline the main behaviors of a novel decision tree that can deal with high-speed data streams and adapt to performance decays considering the tree state, namely the local adaptive streaming tree (LAST), and of Hoeffding-based trees. We also provide a comprehensive benchmark of online decision trees, analyze how split moments affect performance, how the trees react to increases in the number of features and classes, and how the trees behave in concept drift scenarios. Results show that (i) LAST presents the best results, regardless of the change detector selected, (ii) LAST's strategy of branching the tree based on performance decays is the reason it outperforms other decision trees, (iii) decision trees present similar CPU time, but trees with split reevaluation are more costly in memory, and (iv) LAST presents superior results in datasets with abrupt changes, while datasets with gradual changes depend on choosing detectors that are more suitable for gradual changes.
... There are numerous methods to check the predictive power of variables influencing the occurrence of a phenomenon. In this research, the Information Gain Ratio (IGR) has been utilized (Quinlan 1993). Quinlan (1993) proposed the IGR index in such a way that higher IGR values indicate the enhanced predictive power of that factor in the modeling process. Here, SplitInfo represents the potential information generated by splitting the training data into subsets. ...
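As a point of reference for these excerpts, the following is a minimal sketch of the gain-ratio computation, assuming the standard C4.5 definitions for a categorical feature (function names and the toy example are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Information Gain Ratio of a categorical feature:
    (entropy before split - weighted entropy after split) / SplitInfo."""
    n = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    expected = sum(len(s) / n * entropy(s) for s in subsets.values())  # info after split
    split_info = -sum((len(s) / n) * math.log2(len(s) / n) for s in subsets.values())
    gain = entropy(labels) - expected
    return gain / split_info if split_info > 0 else 0.0

# A perfectly predictive factor yields the maximal gain ratio of 1.0:
print(gain_ratio(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 1.0
```

Higher values indicate that splitting on the factor removes more class uncertainty per unit of SplitInfo, which is the sense in which the excerpt interprets higher IGR as higher predictive power.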
Effectively reducing damage and making decisions on land development policies can be achieved by identifying areas at high risk. Thus, the aim of the current research is to explore landslide risk zonation in Sarvabad using advanced machine learning algorithms based on statistical models. In the present research, landslide hazard zonation was conducted using hybrid algorithms, including Frequency Ratio-Random Forest (FR-RF), Frequency Ratio-Support Vector Machine (FR-SVM), Weight of Evidence-Random Forest (WoE-RF), and Weight of Evidence-Support Vector Machine (WoE-SVM). First, the point shapefiles of 166 landslides in Sarvabad, prepared by the Natural Resources Department of Kurdistan State, were taken as the landslide inventory map. The landslide point data were split into a training set comprising 70% and a validation set comprising 30%. To construct the landslide hazard zonation, a total of 16 factors were used, as follows: slope, aspect, elevation, distance to stream, distance to road, river density, distance to fault, fault density, roadway density, precipitation, land utilization, Normalized Difference Vegetation Index (NDVI), lithology, earthquake, Stream Power Index (SPI), and topographic wetness index (TWI). Finally, the performance of the models was evaluated using the ROC curve. Among the FR-RF, WoE-RF, FR-SVM, and WoE-SVM models, the FR-RF model showed the highest performance. In conclusion, obtaining an accurate and reasonable spatial prediction map can help managers and urban planners identify zones susceptible to landslide occurrence so that they can manage the potential crises of landslide-prone zones.
... To assess the performance of the knockoff algorithm, we compare the results of various prevalent machine learning methods using both the original and knockoff feature sets. Such performance is shown for models including logistic regression (Cox, 1958), decision trees (Quinlan, 2014), random forest (Breiman, 2001), and Extreme Gradient Boosting (XGBoost) (Chen & Guestrin, 2016). We provide a brief description of all the methods in Table 2. ...
... A decision tree (Breiman et al., 1984; Quinlan, 2014) is a versatile supervised machine learning method used for regression and classification tasks. It constructs a tree-like model where ...
The rapid integration of black-box Machine Learning (ML) models into critical decision-making scenarios has triggered an urgent call for transparency from stakeholders in Artificial Intelligence (AI). This call stems from growing concerns about the deployment of models whose decisions lack justification, legitimacy, and detailed explanations of their behavior. To address these concerns, Explainable Artificial Intelligence (XAI) has emerged as a crucial field, focusing on methods and processes that enable the comprehension of how AI systems make decisions, generate predictions, and execute their functions. The importance of XAI lies in its ability to provide explanations that justify a model’s outputs, thereby ensuring trust and accountability in AI systems. In this work, we propose a novel XAI framework that leverages state-of-the-art statistical knockoff techniques to identify the most informative predictors while maintaining a controlled False Discovery Rate (FDR). This framework enhances informed decision-making by ensuring robust and interpretable insights. We validate our approach through synthetic data experiments, demonstrating that it can effectively identify important features with high power while providing finite-sample FDR control across various scenarios. We demonstrate the efficacy of our approach by applying it to predict the outcomes of National Football League (NFL) playoffs, a domain of significant importance in sports analytics. Our method provides invaluable insights that support strategic decision-making in the highly competitive field of professional football.
... In recent years, machine learning and deep learning methods have revolutionized the area by enabling models to learn directly from raw financial data. Machine learning algorithms such as decision trees [48,49] and ensemble methods [7,24,66,75] have been effectively applied to identify patterns with better robustness compared to traditional statistical methods. Deep learning models, ...
... Traditional approaches primarily rely on fundamental analysis, utilizing manually engineered features and macroeconomic indicators [3,20,23]. The advent of machine learning introduced models such as decision trees [48,49] and gradient boosting trees (GBTs) [7,24], improving predictive performance by capturing the dynamic and nonlinear nature of market behavior [44,66]. Deep learning models have recently revolutionized SPMF by enabling the direct utilization of raw time-series data, reducing reliance on manually engineered features. ...
The stock market, as a cornerstone of the financial markets, places forecasting stock price movements at the forefront of challenges in quantitative finance. Emerging learning-based approaches have made significant progress in capturing the intricate and ever-evolving data patterns of modern markets. With its rapid expansion, the stock market presents two characteristics, i.e., stock exogeneity and volatility heterogeneity, that heighten the complexity of price forecasting. Specifically, while stock exogeneity reflects the influence of external market factors on price movements, volatility heterogeneity showcases the varying difficulty of movement forecasting against price fluctuations. In this work, we introduce the framework of Cross-market Synergy with Pseudo-volatility Optimization (CSPO). Specifically, CSPO implements an effective deep neural architecture to leverage external futures knowledge. This enriches stock embeddings with cross-market insights and thus enhances CSPO's predictive capability. Furthermore, CSPO incorporates pseudo-volatility to model stock-specific forecasting confidence, enabling a dynamic adaptation of its optimization process to improve accuracy and robustness. Our extensive experiments, encompassing industrial evaluation and public benchmarking, highlight CSPO's superior performance over existing methods and the effectiveness of all the proposed modules contained therein.
... The C4.5 algorithm was developed by Quinlan as an extension of the ID3 algorithm, with the ability to handle continuous attributes and missing data [5]. The algorithm can handle data with both continuous and discrete attributes and produces models that are easy to interpret. ...
... C. The C4.5 Algorithm. The C4.5 algorithm is a machine learning algorithm developed by Ross Quinlan in 1993 and used to build decision trees, which are widely applied to classification problems. The algorithm produces decision trees that are easy to understand because they present the decision-making process as logical branching. ...
Divorce is a complex social phenomenon that has significant impacts on individuals and society. In Indonesia, divorce is a serious concern for the government and society because the divorce rate has continued to increase in recent decades. Understanding the factors that contribute to divorce is essential to developing effective prevention strategies. The objective of this study is to classify the factors of divorce in Indonesia using the C4.5 algorithm, an algorithm used for data mining in building decision trees. This study used divorce data from 20 provinces in Indonesia in 2023 with various factors causing divorce. Divorce data was obtained from the Central Statistics Agency of Indonesia. The research process includes data collection, data pre-processing, and application of the C4.5 algorithm to build a classification model. The results of the study showed that continuous disputes and quarrels, apostasy, and economy are the most significant factors in divorce. The research results are expected to be a reference for policy makers and marriage counselors in formulating appropriate interventions to reduce divorce rates. The findings of this study can also serve as a foundation for establishing more effective policies and interventions to reduce divorce rates and enhance family institutions in Indonesia.
... • C4.5: Developed by Quinlan [10], C4.5 extends the earlier ID3 algorithm. It builds trees using information gain and handles both categorical and numerical data. ...
... The confusion matrices for the C4.5, CART, and C5.0 decision tree algorithms provide insights into their performance for heart disease prediction. The C4.5 algorithm has a confusion matrix of (50,10,15, 25), which corresponds to 50 true positives (TP), 10 false negatives (FN), 15 false positives (FP), and 25 true negatives (TN). This indicates a moderate ability to identify heart disease cases and a reasonable number of false alarms. ...
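For reference, the metrics implied by the quoted matrix follow directly from the stated counts (TP = 50, FN = 10, FP = 15, TN = 25, so 100 cases in total):

$$\text{Accuracy} = \frac{TP + TN}{100} = 0.75, \qquad \text{Precision} = \frac{TP}{TP + FP} = \frac{50}{65} \approx 0.77,$$

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{50}{60} \approx 0.83, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \approx 0.80.$$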
Accurately predicting heart disease is crucial for effective diagnosis and treatment. Decision tree algorithms, such as C4.5, CART, and C5.0, are widely used in medical diagnostics due to their interpretability and performance. This study compares these three prominent decision tree algorithms on a heart disease dataset. The research aims to assess and compare their effectiveness in predicting heart disease using various performance metrics, including accuracy, precision, recall, and F1 score. The analysis involves training and validating each algorithm on the dataset, followed by a detailed examination of their classification results. Our findings reveal distinct strengths and weaknesses among the algorithms, providing insights into their suitability for heart disease prediction. The results suggest that while all three algorithms perform well, C5.0 exhibits superior accuracy and robustness, making it a potentially more effective tool for heart disease prediction. This paper contributes valuable information for selecting the most appropriate decision tree algorithm for medical diagnostics and highlights the importance of performance metrics in evaluating predictive models.
... There are numerous extensions and variants around which various shortcomings or specific types of data and applications address" (Berthold et al., 2020, p. 225). However, the general approach of all DT algorithms is similar, originating from Quinlan (1993) and Breiman et al. (1984), which are typically referred to in introductions to data-based DTs. ...
This study explores how high school students construct decision trees using data cards and the software CODAP (codap.concord.org) in interviews after attending a teaching unit. We conceptualized data-based decision tree construction using nine key aspects that we intended to teach, tested variations of two design elements in teaching, and analyzed the interviews qualitatively to compare student behavior to intended outcomes. We found high alignment to intentions but also deviations in data activities and informal or context-based rather than data-based reasoning. The design element of context-free (blinded) data seems to enhance data-based reasoning, while the design element of data card use showed diagnostic potential.
... The most common approaches to building a decision tree are ID3 [34], CART [6], and C4.5 [33]. They are based on some purity measures as splitting criteria. ...
Differentially private selection mechanisms offer strong privacy guarantees for queries aiming to identify the top-scoring element r from a finite set R, based on a dataset-dependent utility function. While selection queries are fundamental in data science, few mechanisms effectively ensure their privacy. Furthermore, most approaches rely on global sensitivity to achieve differential privacy (DP), which can introduce excessive noise and impair downstream inferences. To address this limitation, we propose the Smooth Noisy Max (SNM) mechanism, which leverages smooth sensitivity to yield provably tighter (upper bounds on) expected errors compared to global sensitivity-based methods. Empirical results demonstrate that SNM is more accurate than state-of-the-art differentially private selection methods in three applications: percentile selection, greedy decision trees, and random forests.
... Tree-based models (TBMs) in the VFL scenario are mainly constructed with multiple decision trees [86][87][88], which have been successfully used in a large range of real-world applications for solving classification and regression tasks [89][90][91][92][93]. For a decision tree, an internal node (including the root node) represents a splitting rule consisting of a splitting feature and a splitting value, and the branches attached to this internal node indicate the results according to the splitting rule. ...
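As an illustration of the node structure described above, here is a minimal sketch of a decision-tree node holding a splitting rule (feature plus value) and its two branches; the class and field names are illustrative, not taken from any of the surveyed systems:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    # Internal node: the splitting rule is a (feature index, splitting value) pair.
    split_feature: Optional[int] = None
    split_value: Optional[float] = None
    left: Optional["TreeNode"] = None    # branch for samples with x[split_feature] <= split_value
    right: Optional["TreeNode"] = None   # branch for samples with x[split_feature] > split_value
    # Leaf node: the prediction (class label or regression value) stored at the leaf.
    prediction: Optional[float] = None

    def predict(self, x):
        """Route a sample down the tree according to the splitting rules."""
        if self.prediction is not None:
            return self.prediction
        child = self.left if x[self.split_feature] <= self.split_value else self.right
        return child.predict(x)
```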
Tree-based models have achieved great success in a wide range of real-world applications due to their effectiveness, robustness, and interpretability, which inspired people to apply them in vertical federated learning (VFL) scenarios in recent years. In this paper, we conduct a comprehensive study to give an overall picture of applying tree-based models in VFL, from the perspective of their communication and computation protocols. We categorize tree-based models in VFL into two types, i.e., feature-gathering models and label-scattering models, and provide a detailed discussion regarding their characteristics, advantages, privacy protection mechanisms, and applications. This study also focuses on the implementation of tree-based models in VFL, summarizing several design principles for better satisfying various requirements from both academic research and industrial deployment. We conduct a series of experiments to provide empirical observations on the differences and advances of different types of tree-based models.
... In clinical practice, electronic health records (EHRs) are extensively utilized by physicians for research and patient care management. The integration of ML techniques with EHR data enables personalized patient care and enhances hospital performance monitoring [14][15]. Support Vector Machines (SVMs) represent a cutting-edge ML approach widely applied in cancer diagnostics. ...
Lung cancer is one of the most prevalent causes of mortality worldwide, making early detection essential for improving patient survival rates. Computed tomography (CT) imaging serves as a crucial diagnostic tool; however, the large volume of generated images poses challenges in precise interpretation by radiologists. This study evaluates the effectiveness of lung cancer classification by utilizing various feature extraction techniques in combination with support vector machine (SVM) and k-nearest neighbours (KNN) classifiers. By analysing different feature sets, the research aims to identify the most effective combination for enhanced classification accuracy. The findings indicate notable improvements in classification performance, facilitating more reliable lung cancer detection.
... 3) Decision Tree (DT): A Decision Tree is a hierarchical model where internal nodes represent tests on attributes, branches signify outcomes, and leaf nodes correspond to class labels. Although interpretable, Decision Trees are prone to overfitting, especially with noisy data [18]. 4) Random Forest (RF): Random Forest is an ensemble method that constructs multiple decision trees and combines their predictions to improve accuracy while mitigating overfitting by introducing randomness in the training process. ...
This study explores sentiment analysis of Bengali book reviews, addressing a critical research gap in analyzing Bengali text for this domain. Customer reviews and ratings significantly influence online platforms, and automated sentiment analysis offers an efficient method to understand user emotions and opinions. To facilitate this, a dataset of 16,600 Bengali reviews was developed by compiling three datasets and augmenting them with data from online bookstores and Facebook pages. The Word2Vec technique was utilized for feature extraction, employing Continuous Bag of Words (CBOW) and Skip-gram models. Various machine learning classifiers were applied to classify sentiment polarity as positive or negative, including Logistic Regression (LR), k-Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), AdaBoost, XGBoost, Bernoulli Naive Bayes (BNB), Stochastic Gradient Descent (SGD), and a Hybrid model. The Skip-gram model paired with the Hybrid classifier achieved the highest accuracy of 92%. Furthermore, classifiers were evaluated using metrics such as accuracy, recall, precision, F1-score, and ROC curves. This research advances sentiment analysis for Bengali text, demonstrating effective techniques for predicting sentiment and contributing to broader natural language processing applications for underrepresented languages.
... For learning a DT, numerous methods exist such as CART [15], ID3 [53], and C4.5 [54]. These algorithms all evaluate different predicates by calculating some impurity measure and then greedily pick the most promising before splitting the dataset on that predicate and recursively continuing with the two children. ...
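The greedy evaluation the excerpt describes can be sketched in a few lines; the version below uses Gini impurity (as in CART), whereas ID3 and C4.5 would use information gain or gain ratio instead, and the function names are illustrative:

```python
import numpy as np

def gini(y):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedily pick the (feature, threshold) predicate that most reduces the
    weighted impurity of the two children; the learner would then split the
    dataset on that predicate and recurse into the children."""
    n, d = X.shape
    parent = gini(y)
    best = (None, None, 0.0)  # (feature index, threshold, impurity decrease)
    for j in range(d):
        for t in np.unique(X[:, j])[:-1]:  # candidate thresholds for feature j
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
            if parent - weighted > best[2]:
                best = (j, t, parent - weighted)
    return best
```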
Safety-critical controllers of complex systems are hard to construct manually. Automated approaches such as controller synthesis or learning provide a tempting alternative but usually lack explainability. To this end, learned decision trees (DTs) have prevalently been used to obtain an interpretable model of the generated controllers. However, DTs do not exploit shared decision-making, a key concept exploited in binary decision diagrams (BDDs) to reduce their size and thus improve explainability. In this work, we introduce predicate decision diagrams (PDDs) that extend BDDs with predicates and thus unite the advantages of DTs and BDDs for controller representation. We establish a synthesis pipeline for efficient construction of PDDs from DTs representing controllers, exploiting reduction techniques for BDDs also for PDDs.
... Various models, including J48 (decision tree) [22], Naive Bayes [23], NBTree [24], Random Forest [25], Random Tree [26], Multi-Layer Perceptron (MLP) [27], and Support Vector Machine (SVM) [28], all from the Weka toolkit [29], were evaluated using the KDDTest+ and KDDTest-21 datasets. The corresponding results of these evaluations are presented in Table I. ...
With the rapid growth of IoT devices, ensuring robust network security has become a critical challenge. Traditional intrusion detection systems (IDSs) often face limitations in detecting sophisticated attacks within high-dimensional and complex data environments. This paper presents a novel approach to network anomaly detection using hyperdimensional computing (HDC) techniques, specifically applied to the NSL-KDD dataset. The proposed method leverages the efficiency of HDC in processing large-scale data to identify both known and unknown attack patterns. The model achieved an accuracy of 91.55% on the KDDTrain+ subset, outperforming traditional approaches. These comparative evaluations underscore the model's superior performance, highlighting its potential in advancing anomaly detection for IoT networks and contributing to more secure and intelligent cybersecurity solutions.
... For instance, at each node, the tree asks a question about a single feature: "Is feature $x_j$ greater than some value $t$?" This question divides the feature space into two regions, $x_j \le t$ and $x_j > t$, through hyperplanes parallel to the axis, $x_j - t = 0$. Due to the unparalleled simplicity and interpretability of the decision tree model, algorithms that can learn an accurate decision tree model (for instance, classification and regression trees (CART) [19], C4.5 [48], and random forests [17]) are very widely used across various fields. Breiman [18] aptly noted, "On interpretability, trees rate an A+. ...
In this paper, we introduce a generic data structure called decision trees, which integrates several well-known data structures, including binary search trees, K-D trees, binary space partition trees, and decision tree models from machine learning. We provide the first axiomatic definition of decision trees. These axioms establish a firm mathematical foundation for studying decision tree problems. We refer to decision trees that satisfy the axioms as proper decision trees. We prove that only proper decision trees can be uniquely characterized as K-permutations. Since permutations are among the most well-studied combinatorial structures, this characterization provides a fundamental basis for analyzing the combinatorial and algorithmic properties of decision trees. As a result of this advancement, we develop the first provably correct polynomial-time algorithm for solving the optimal decision tree problem. Our algorithm is derived using a formal program derivation framework, which enables step-by-step equational reasoning to construct side-effect-free programs with guaranteed correctness. The derived algorithm is correct by construction and is applicable to decision tree problems defined by any splitting rules that adhere to the axioms and any objective functions that can be specified in a given form. Examples include the decision tree problems where splitting rules are defined by axis-parallel hyperplanes, arbitrary hyperplanes, and hypersurfaces. By extending the axioms, we can potentially address a broader range of problems. Moreover, the derived algorithm can easily accommodate various constraints, such as tree depth and leaf size, and is amenable to acceleration techniques such as thinning method.
... Fourth is the lack of stability under small changes in the training data (Quinlan, 1993). Small changes in the data can lead to a completely different tree. ...
Earnings management (EM) is a strategy deliberately used by management to adjust a company's reported income toward predetermined targets. While some regard it as a useful tool in financial reporting, others view it as deceptive behavior that distorts the company's true financial position. Studying EM is therefore of great importance to users of financial statements. Using a method of analyzing and synthesizing related studies, this research focuses on methods of detecting EM through different approaches, helping the author gain a deeper and more concrete understanding of the nature of the research, from traditional methods to modern data mining models. Each method has its own advantages and disadvantages when applied in practical research. Combining these methods can provide a comprehensive understanding of the practices and their implications in current EM research.
... In this study, air objects indicated as black flights are identified using the single machine learning method considered to perform best. Five machine learning methods are tested for recognition and identification: Decision Tree (DT) [23], Random Forest (RF) [24], k-Nearest Neighbor (kNN) [25], Support Vector Machine (SVM) [26], and a Neural Network (NN) with a backpropagation mechanism, or BPNN [27]. ...
Identification of aircraft entering a country's sovereign airspace when they shut down their identification systems, either the Identification Friend or Foe system and/or the Automatic Dependent Surveillance Broadcast system, has long been a challenge for the National Air Operations Command. Aircraft that do not want their identities to be revealed are called black flights and generally have certain missions that can interfere with the sovereignty of a country's airspace. Military radar units tasked with monitoring airspace are generally equipped with Primary Surveillance Radar, which detects the presence of aircraft in their operating area, and Secondary Surveillance Radar, which identifies the aircraft. In the case of black flights, data from the radar in the form of airspeed, altitude, and position cannot help establish the identity of the black flight. The contributions of this research are:
• a new method of black flight identification that combines air speed data and altitude with Radar Cross Section (RCS) data using machine learning,
• a new information system that combines the display of the Plan Position Indicator (PPI) of military radar and ADS-B to accelerate decision-making on black flight,
• a new approach to national air defense procedures.
... Kharbat et al. [35] applied an accuracy-based LCS (XCS) [36] to analyze the Frenchay Breast Cancer dataset, grouping the resulting rules using the quality-threshold clustering algorithm [37] to provide insightful feedback to medical experts. Their results demonstrated that the XCS outperformed C4.5 [38], a well-known decision-tree induction learning technique, in knowledge discovery. Iqval et al. [39] leveraged LCS for object recognition by developing an extraction technique for image feature points. ...
Air-transport infrastructure is facing increasing congestion, necessitating innovative approaches to optimize arrival management and minimize delays near airports. To address this issue, this study developed an en route arrival management framework that employs a learning classifier system (LCS) to reduce arrival delays and fuel consumption by generating adaptive and interpretable speed-control rules. A virtual design database, created from multiobjective optimization and integrated with a cellular automaton (CA)-based air-traffic model, facilitates the development of these rules. Within this database, a binary target variable represents speed-control decisions, whereas explanatory variables encompass factors such as aircraft proximity and congestion levels. The LCS is trained on this dataset to generate speed-control rules, and their effectiveness is evaluated through CA-based traffic simulations. The results indicate that the LCS achieved 96.8% accuracy in predicting appropriate speed-control actions, significantly outperforming traditional decision-tree methods (70.5% accuracy). Furthermore, experimental findings demonstrate reductions in average flight times by 10–20 s and up to 5 fewer spacing adjustments for traffic within 600 s before and after the arrival of speed-controlled flights. Overall, the LCS reduced total flight time by 1,801 s and total fuel consumption by 1,650.1 kg per operational cycle, resulting in an estimated annual cost savings of $466,000. Furthermore, the LCS approach successfully extracts both sector-specific and common rules, offering enhanced adaptability to various traffic scenarios. These results suggest that the ability to generalize rules across different airspace sectors can improve the safety and efficiency of air-traffic management.
... This strategy not only shortens the duration needed for feature extraction and classification but also boosts overall efficiency. Feature correlation is typically evaluated using information gain [30], derived from the concept of entropy. ...
In this research, we address the sophisticated task of reconstructing the semantic elements within complex scenes, a critical endeavor for a broad range of artificial intelligence (AI) applications. Our objective is to effortlessly blend multi-channel perceptual visual features for accurately adjusting to scenic images with detailed spatial layouts. Central to our approach is the development of a deep hierarchical model, carefully crafted to replicate human gaze movements with high accuracy. Utilizing the BING objectness metric, our model excels at rapidly and accurately identifying semantically and visually important scenic patches by detecting objects or their parts at various scales in diverse settings. Following this, we formulate a time-sensitive feature selector to obtain high-quality visual features from different scenic patches. To emulate the human ability to pinpoint essential scene segments, we implement a technique termed locality-preserved learning (LRAL). This method effectively generates gaze shift paths (GSP) for each scene by 1) preserving the local coherence of varied scenes, and 2) intelligently selecting scene segments that match human visual attention. With LRAL, we systematically create a GSP for each scene and derive its deep feature set through a deep aggregation model. These deep GSP features are then incorporated into a probabilistic transfer model for retargeting a variety of sceneries. Our methodology’s efficacy is confirmed through comprehensive empirical studies, highlighting its substantial advantages.
... The following actions were undertaken: • Using Quinlan's C4.5 algorithm to create a decision tree. The cases were classified into groups based on the similarity of their characteristics using the C4.5 algorithm, one of the data mining methods [38]. This algorithm operates on the assumption that a large dataset may contain latent knowledge that conventional statistical methods cannot uncover. ...
A case study analysis was conducted to develop a comprehensive empirical model of transitions in adolescent development, essential for understanding teenage mothers’ responsibility. Data used to reconstruct individual life trajectories were collected through exploratory interviews and questionnaires and analyzed using the Strategy of Process Transformation Reconstruction (PTR). PTR is an innovative methodological tool that integrates qualitative case analysis with Quinlan’s C4.5 algorithm to enable generalizations. The resulting empirical model identifies six distinct variants of teenage motherhood, each reflecting specific transformations in the sense of responsibility. These transitions are shaped by cognitive and emotional development, influencing both identity formation and caregiving behaviors. In Variants M1-M2, teenage mothers internalize their role, integrating responsibility for both themselves and their child through proactive caregiving, future-oriented planning, and emotional bonding. These stages reflect the emergence of reflective responsibility, where decision-making is guided by an evolving sense of self-efficacy and attachment to the child. In Variants M3-M4, responsibility becomes more segmented, driven by external expectations rather than intrinsic motivation. While personal accountability and caregiving duties are acknowledged, decision-making often lacks long-term planning and emotional investment, reflecting a more task-focused approach. Finally, Variants M5-M6 represent a stage of rejecting motherhood, characterized by an avoidance of responsibility and delegation of childcare to others, often due to emotional distress, cognitive overload, or overwhelming social pressures. This model highlights the dynamic nature of responsibility development in teenage mothers, illustrating how cognitive and emotional processes shape their caregiving roles. By recognizing these diverse trajectories, the study provides valuable insights into intervention strategies that support teenage mothers in fostering both personal development and stable caregiving practices.
... • Multinomial logistic regression (Böhning, 1992) • Naive Bayes classifier (Xu, 2018) • Decision trees in the Gini Index and Information Gain versions to split tree nodes (Wu et al., 2010) • Bagged trees (Abellan and Masegosa, 2010) • Decision trees C4.5 (Quinlan, 2014) • C50 decision trees (Kuhn and Quinlan, 2023) • Support Vector Machines (SVM) with radial and sigmoid kernels (Cortes and Vapnik, 1995) • Random forests (Ho, 1995) • K-Nearest Neighbors (Mucherino et al., 2009) • Artificial Neural Networks (ANN) (Haykin, 2009) • eXtreme Gradient Boosted Trees (XGBoost) (Chen and Guestrin, 2016). ...
In this paper we present the results of an experiment aimed to use machine learning methods to obtain models that can be used for the automatic classification of products. In order to apply automatic classification methods, we transformed the product names from a text representation to numeric vectors, a process called word embedding. We used several embedding methods: Count Vectorization, TF-IDF, Word2Vec, FASTTEXT, and GloVe. Having the product names in a form of numeric vectors, we proceeded with a set of machine learning methods for automatic classification: Logistic Regression, Multinomial Naive Bayes, kNN, Artificial Neural Networks, Support Vector Machines, and Decision trees with several variants. The results show an impressive accuracy of the classification process for Support Vector Machines, Logistic Regression, and Random Forests. Regarding the word embedding methods, the best results were obtained with the FASTTEXT technique.
... J48 constructs decision trees using information entropy, choosing attributes with the highest normalized information gain, and halting the splitting when all instances in a subset share the same class [32]. J48 Consolidated is an improved version of J48 that consolidates redundant branches, leading to a more compact decision tree. ...
This study delves into the significance of face milling tools in machining, emphasizing the need for timely fault diagnosis to enhance the efficiency of manufacturing processes. By examining defect scenarios such as flank wear, breakage and chipping, along with a reference for good tool condition, the research aims to improve diagnostic accuracy and optimize manufacturing performance.
Vibration signals generated during milling operations are analyzed to identify tool faults. A feature extraction process incorporating statistical, histogram, and ARMA features is employed to gain a nuanced understanding of tool behavior. Feature selection is performed using the J48 decision tree algorithm which helps identify the most relevant features. Subsequently, 13 tree-based classifiers are applied to classify tool faults effectively.
A comparative analysis of classification outcomes provides practical insights into the most effective features for fault diagnosis in milling tools. The study’s findings show that the combination of ARMA features with Extra trees achieved an impressive accuracy of 96.88% for milling tool fault diagnosis. The outcomes from the study contribute to real-world applications by enhancing diagnostic methodologies, ultimately advancing fault detection and classification in machining processes.
... There are many approaches to selecting the attributes used for decision tree induction, and most approaches assign a quality measure directly to the attribute. The attribute selection measures most frequently used in decision tree induction are the Information Gain Ratio (29) and the Gini Index (30). ...
The main problems in emergency department workflows can be summarized as overcrowding, tendencies toward unnecessary use, and long waiting times. Emergency department management, which reached a breaking point during the Covid-19 pandemic, has seen new approaches come to the fore. Healthcare providers worldwide have begun incorporating artificial intelligence applications into emergency department work processes as a solution to these challenges. AI-based machine learning models appear poised to be integrated into clinical decision support systems in the future, reducing physicians' workload and playing a supporting role in emergency department operations. In this article, we attempt to summarize the current state of such models in emergency services, based on the factors that have led to the pairing of emergency medicine and machine learning. The prevailing view is that machine learning models improve clinicians' decision-making abilities and reduce diagnostic errors and cognitive load.
... This approach allows for integrating diverse hypotheses, often yielding superior predictive results. RF offers robustness against noise and is less prone to overfitting thanks to the averaged predictions across multiple trees (Liakos et al., 2018; Quinlan, 1993). GNB belongs to the family of Bayesian models, which are probabilistic graphical models employed within the framework of Bayesian inference. ...
The advent of machine learning technologies in conjunction with the advancements in UAV-based remote sensing pioneered a new era of research in agriculture. The escalating concern for water management in drought-prone areas such as California underscores the urgent need for sustainable solutions. Stem water potential (SWP) measurement using pressure chambers is one of the most common methods used to directly determine tree water status and the optimal timing for irrigation in orchards. However, this approach is inefficient due to its labor-intensive nature. To address this problem, we used weather, thermal, and multispectral data as inputs to machine learning (ML) algorithms to predict the SWP of pistachio and almond trees. For each crop, we first deployed six supervised ML classification models: Random Forest (RF), Support Vector Machine (SVM), Gaussian Naive Bayes (GNB), Decision Tree (DT), K-Nearest Neighbors (KNN), and Artificial Neural Network (ANN). All classifiers provided more than 79% accuracy, while RF showed high performance in both pistachio and almond orchards at 88% and 89%, respectively. The feature importance results from the RF model revealed that the weather features were the most influential factors in the decision-making process. In both crops, canopy temperature (Tc) was the next most important feature, closely followed by OSAVI in pistachios and NDVI in almonds. The RF regression model predicted SWPs with an R² of 0.70 in the pistachio orchard and an R² of 0.55 in the almond orchard. Our results demonstrate that ML models are practical tools for irrigation scheduling decisions. This study offered a data-driven approach that effectively balances minimal data requirements with accuracy to facilitate optimal water management for end-users.
... In MIS, real-time decision-making is critical, particularly in sectors like finance, healthcare, and retail, where decisions must be made based on the most up-to-date information available. Data reduction techniques allow these systems to generate actionable insights more quickly, enabling businesses to respond promptly to emerging opportunities or risks [43]. ...
This article investigates the role of advanced data preprocessing techniques and technological innovation in enhancing decision-making capabilities within Management Information Systems (MIS). As organizations increasingly rely on data-driven insights, the accuracy and reliability of the information processed by MIS become essential for effective decision-making. Advanced data preprocessing methods, such as data cleansing, transformation, and normalization, play a critical role in ensuring data quality and consistency. With the advent of artificial intelligence (AI) and machine learning (ML), these preprocessing steps can now be automated, enabling faster and more efficient handling of large data sets. By automating data preparation, AI and ML can significantly reduce human error, improve processing speed, and support real-time data integration, which is particularly valuable in sectors such as finance, healthcare, and manufacturing. This study explores how integrating these technologies into MIS enhances data quality, speeds up information retrieval, and generates actionable insights, ultimately improving decision-making processes across industries. Through case studies and practical examples, the article illustrates the benefits of advanced data preprocessing and the strategic role that AI and ML play in transforming raw data into valuable business intelligence. The conclusion discusses potential future developments in MIS, emphasizing how continuous advancements in data processing and technology could shape the future of data-driven decision-making.
Artificial Intelligence (AI) models have reached a very significant level of accuracy. While their superior performance offers considerable benefits, their inherent complexity often decreases human trust, which slows their application in high-risk decision-making domains, such as finance. The field of eXplainable AI (XAI) seeks to bridge this gap, aiming to make AI models more understandable. This survey, focusing on published work from 2018 to 2024, categorizes XAI approaches that predict financial time series. In this paper, explainability and interpretability are distinguished, emphasizing the need to treat these concepts separately as they are not applied the same way in practice. Through clear definitions, a rigorous taxonomy of XAI approaches, a complementary characterization, and examples of XAI’s application in the finance industry, this paper provides a comprehensive view of XAI’s current role in finance. It can also serve as a guide for selecting the most appropriate XAI approach for future applications.
The prevalence of dementia is increasing due to the aging of the population worldwide; the proportion of people aged 65 years or older is expected to be approximately 16% of the world population by 2050. It is estimated that approximately 100 million people will suffer from dementia (Prince et al. 2013; United Nations Population Division 2002). Therefore, the social cost of dementia is increasing annually, and early detection and preventive treatment are challenging.
This paper presents TSRF-Dist, a novel distance between time series based on Random Forests (RFs). We extend concepts and tools of RF distances, a recent class of robust data-dependent distances defined for vectorial representations, to the time-series domain, thus proposing the first RF distance for time series. The distance is determined by (i) creating an RF to model a set of time series, and (ii) exploiting the trained RF to quantify the similarity between time series. As for the first step, we introduce in this paper the Extremely Randomized Canonical Interval Forest (ERCIF), a novel extension of Canonical Interval Forests that can model time series and can be trained without labels. We then exploit three different schemes, following ideas already employed in the vectorial case. The proposed distance, in different variants, has been thoroughly evaluated on 128 datasets from the UCR Time Series archive, showing promising results compared with literature alternatives.
The ability to adapt machine learning models trained on one domain to perform well on another, a process commonly referred to as domain adaptation, is an important area of research in machine learning. Domain adaptation addresses the challenge of transferring knowledge from a source domain (where ample labeled data is available) to a target domain (where labeled data may be scarce or unavailable). This paper reviews key techniques for domain adaptation, including instance-based methods, feature-based methods, and classifier-based methods. Additionally, we examine the challenges of domain mismatch, distribution shift, and the curse of dimensionality. We provide an overview of early approaches, evaluate their strengths and weaknesses, and highlight real-world applications across areas such as computer vision, natural language processing, and speech recognition. The paper concludes with a discussion of future trends in domain adaptation research, especially in the context of deep learning and transfer learning.
The increasing complexity of organizational systems creates new opportunities for insider threats to exploit vulnerabilities and cause significant damage. Insider threat detection (ITD) has become a critical first line of defense for organizations to prevent security breaches. Researchers have developed numerous methodologies targeting specific types of network activities, such as file transfers, login attempts, and network traffic patterns, to address these threats. User behavioral-based insider threat detection (UBITD) is a critical research and development direction in cybersecurity. Despite the abundance of research on ITD methods, there is a notable scarcity of systematic reviews focusing on the latest advancements and the data used to train them. Although numerous review papers have explored various ITD approaches, most adopt a non-systematic approach, merely comparing existing techniques without providing a comprehensive analytical synthesis of methodologies and performance outcomes. Consequently, these reviews fall short of delivering a holistic understanding of the current ITD landscape, as much of the existing literature emphasizes signature-based ITD with a focus on machine learning and deep learning models, while UBITD remains minimally explored. This paper presents an in-depth analysis of UBITD by systematically reviewing 101 of the most influential research papers published on the topic. Our analysis rigorously examines the technical advancements, data preprocessing techniques, detection approaches, evaluation metrics, researcher collaborations, datasets, and future trends in this field. The findings reveal unsolved research challenges and uncharted research areas within each of these perspectives. By outlining several high-impact future research endeavors, this study aims to strengthen ITD's role in cybersecurity, contributing to the development of more robust and proactive defenses against insider threats.
Decision trees are part of decision theory and are excellent tools in the decision-making process. The majority of decision tree learning methods were developed within the last 30 years by scholars like Quinlan, Mitchell, and Breiman, just to name a few (Ozgulbas & Koyuncugil, 2006). There are a number of methods and sophisticated software packages used to graphically present decision trees. Decision trees have a great number of benefits and are widely used in many business functions as well as different industries. However, there are also disagreements and various concerns as to how useful decision trees really are. As technology evolves, so do decision trees. Therefore, not only do many controversies arise, but so do solutions and new proposals addressing these arguments.
Convolutional Neural Networks (CNNs) have proven to be one of the state-of-the-art systems in image understanding and other complex tasks where input patterns must undergo convolutions. CNNs have highlighted the "vertical" development of the classical ANN, significantly increasing the number of processing layers between the input (the pattern) and the output (its correct classification). Their intermediate layers, including convolutional, pooling, and dropout layers, are inspired by how the visual cortex processes light signals.
Capsicum L. varieties and species are so closely related that there has been some confusion among different taxonomists about their taxonomic status. This study aimed to determine the taxonomic standing of members of the genus Capsicum L. in Nigeria and to assess genetic divergences and similarities among them, in order to provide some insight into their identification and the infrageneric classification (INC) of the genus. Seeds of five cultivars of Capsicum spp., collected from various sources and authenticated, were regenerated and nurtured to fruit. Variations in their foliar organic compounds were identified quantitatively using Gas Chromatography Mass Spectroscopy (GCMS). A total of 17 organic chemical characters (12 esters, 2 alkanols, and 1 each of alkanoate, alkanoic acid, and alkane) were detected. The percentage peak area values obtained ranged from 1.75 to 21.88 for esters, 2.24 to 11.99 for alkanols, 5.8 for alkanoate, 17.48 to 55.15 for alkanoic acid, and 4.9 for alkane. The cultivars of the genus were hierarchically clustered as operational taxonomic units (OTUs) using squared Euclidean distance computed through PASTatistics software (Ward's method). Artificial keys were also constructed for the identification of the species in the genus. The categories of chemical characters adopted gave useful insights into the INC of the genus, as their combination was sufficiently diagnostic of the species, as evidenced by the artificial keys. The taxonomic status of Nigerian representatives of the genus Capsicum L. was successfully determined in connection with the distribution of their fruit capsaicin concentration (FCC), similar to what was previously reported for morphology and phytochemicals in Capsicum. The challenge of vague infrageneric boundaries has also been partially resolved in the Nigerian Capsicum spp. studied.
This research investigates the efficacy of an Academic Predictor with a Decision Support System (APDSS) in optimizing student management and academic planning in a secondary school setting. The study focuses on incoming Grade 11 students, comprising 10 sections in TVL, 3 sections in HUMSS, and 2 sections in STEM, with 20 selected students per section participating. Demographic profiles, including General Weighted Average (GWA), Science grade, Math grade, and Career test results, were analyzed using frequency and percentage scores. Additionally, the study evaluates the extent of compliance of the developed system with ISO 25010:2011 Software Quality Standards, assessed by IT experts and users. Results indicate that the APDSS demonstrates a high degree of compliance with software quality standards, particularly in functionality, performance efficiency, compatibility, reliability, security, maintainability, and portability. Challenges faced in academic contexts, both before and after implementation, are identified through frequency and percentage distributions among faculty and students. The pre-implementation challenges include workload management, communication gaps, career guidance inconsistencies, and manual tracking of student progress, while the post-implementation challenges encompass strategic alignment, resource allocation, data integrity, privacy compliance, communication, training, leadership, data transparency, administrative processes, information access, academic planning, and student engagement. Despite challenges during implementation, students appreciate the APDSS for its quick access to academic information, assistance in course selection and career planning, and modernization of administrative processes. Thus, the APDSS shows promise in enhancing academic engagement and efficiency among students, highlighting its potential to streamline academic processes and improve outcomes.
Diabetes is an increasing global health issue, with millions at risk due to factors such as lifestyle, genetics, and other health conditions. Early diagnosis is essential for timely treatment, avoiding complications, and easing the strain on healthcare systems. The disease’s complexity, with its different stages, requires advanced models that can distinguish between diabetic, non-diabetic, and pre-diabetic individuals. This study aimed to develop a precise multiclass classification model to predict a patient’s diabetes status based on various health indicators. In addition to standard factors such as blood sugar level, BMI, cholesterol, and age, external risk factors were also considered for better accuracy. The target variable categorizes patients as Diabetic, Non-Diabetic, or Pre-Diabetic, and Logistic Regression, SVM, Decision Tree, Random Forest, and Gradient Boosting models were applied to address the classification challenge. After training and testing the models, Random Forest was found to deliver the highest accuracy, at 98%, outperforming the others. These findings highlight the power of machine learning in effectively classifying patients by diabetes status.
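A brief sketch of the kind of model comparison described above, assuming scikit-learn and a synthetic dataset in place of the (unspecified) patient data; the model settings are illustrative defaults, not the study’s configuration.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the health-indicator features (blood sugar, BMI, cholesterol,
# age, external risk factors, ...) with three classes:
# 0 = Non-Diabetic, 1 = Pre-Diabetic, 2 = Diabetic.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: accuracy = {accuracy_score(y_te, model.predict(X_te)):.3f}")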
Ambiguity is considered an indispensable attribute of all natural languages. The process of associating the precise interpretation with an ambiguous word, taking into consideration the context in which it occurs, is known as word sense disambiguation (WSD). Supervised approaches to WSD show better performance than their counterparts; these approaches, however, require a sense-annotated corpus to carry out the disambiguation process. This paper presents the first-ever standard WSD dataset for the Kashmiri language. The raw corpus used to develop the sense-annotated dataset was collected from different resources and contains about one million tokens. A sense-annotated corpus was then created from this raw corpus for 124 commonly used ambiguous Kashmiri words. Kashmiri WordNet, an important lexical resource for the Kashmiri language, was used to obtain the senses applied in the annotation process. The developed sense-tagged corpus is diverse in nature and has 19,854 sentences. Based on this annotated corpus, the Lexical Sample WSD task for Kashmiri is carried out using different machine-learning algorithms (J48, IBk, Naive Bayes, Dl4jMlpClassifier, SVM). To train these models for the WSD task, bag-of-words (BoW) features and word embeddings obtained with the Word2Vec model are used. We used standard measures, viz. accuracy, precision, recall, and F1-measure, to evaluate the performance of these algorithms, and the algorithms reported different values for these measures depending on the features used. With the BoW model, SVM reported better results than the other algorithms, whereas Dl4jMlpClassifier performed better with word embeddings.
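A minimal sketch of a lexical-sample WSD classifier in the BoW-plus-SVM configuration mentioned above, assuming scikit-learn and a few English stand-in sentences, since the Kashmiri corpus and Kashmiri WordNet senses are not reproduced here; the study itself uses Weka-style learners (J48, IBk, etc.).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in contexts for one ambiguous target word; each sentence is labelled with
# the sense used in it. The real task uses Kashmiri sentences and Kashmiri WordNet senses.
contexts = [
    "he deposited money at the bank before noon",
    "the bank approved the loan application quickly",
    "they had a picnic on the bank of the river",
    "erosion slowly wore away the river bank",
]
senses = ["bank_finance", "bank_finance", "bank_river", "bank_river"]

# Bag-of-words features feeding a linear SVM, mirroring the BoW/SVM configuration.
clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(contexts, senses)
print(clf.predict(["she opened a savings account at the bank"]))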
In today’s fast-changing educational landscape, data science and predictive analytics are critical tools for fostering student success and transforming educational systems. This chapter explores how predictive analytics can be utilized to anticipate and improve student outcomes. It covers methodologies for collecting and analyzing student data, algorithms for predicting academic performance, and insights that enable early interventions and tailored support by educators and administrators. Predictive models built on historical and real-time data can identify at-risk students, estimate their chances of success, and inform individualized learning paths. The chapter also tackles data privacy issues, ethical implications, and the process of integrating AI technology in schools. Finally, it explains how predictive analytics can help create a more personalized, fair, and effective learning environment that improves student success and retention.
A decision tree is a typical “divide-and-conquer” approach. It uses a tree-like graph or model to recursively partition the feature space, also called the search space.
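A small illustration of that recursive partitioning, assuming scikit-learn and the Iris data purely for demonstration: each internal node tests one feature against a threshold, so the leaves tile the feature space into axis-aligned regions.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree and print its splits: every internal node is a threshold test on
# one feature, and each root-to-leaf path defines one region of the feature space.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))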
The governing Partial Differential Equation (PDE) for wave propagation, the wave equation, involves multi-scale and multi-dimensional oscillatory phenomena. The wave PDE challenges traditional computational methods because of their high computational costs and rigid assumptions. The advent of scientific machine learning (SciML) presents a novel paradigm by embedding physical laws within neural network architectures, enabling efficient and accurate solutions. This study explores the evolution of SciML approaches, focusing on Physics-Informed Neural Networks (PINNs), and evaluates their application in modeling acoustic, elastic, and guided wave propagation. A PINN is a gray-box predictive model that offers the strong predictive capabilities of data-driven models while also adhering to physical laws. Through theoretical analysis and problem-driven examples, the findings demonstrate that PINNs address key limitations of traditional methods, including discretization errors and computational inefficiencies, while offering robust predictive capabilities. Despite current challenges, such as optimization difficulties and scalability constraints, PINNs hold transformative potential for advancing wave propagation modeling. This comprehensive study underscores that potential, provides recommendations on why and how PINNs could advance elastic, acoustic, and guided wave propagation modeling, and sets the stage for future research in Structural Health Monitoring (SHM) and Nondestructive Evaluation (NDE).
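A minimal PINN sketch for the 1D wave equation u_tt = c^2 u_xx on (x, t) in [0, 1]^2, assuming PyTorch, a unit wave speed, a sine initial displacement, fixed ends, and an arbitrary small network; the initial-velocity term is omitted for brevity. This only illustrates how the PDE residual enters the loss and is not the architecture of any specific study.

import math
import torch

# Wave speed, network size, and sampling counts are arbitrary illustrative choices.
torch.manual_seed(0)
c = 1.0
net = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

def residual(xt):
    # PDE residual u_tt - c^2 u_xx at collocation points xt = (x, t).
    xt = xt.clone().requires_grad_(True)
    u = net(xt)
    g = torch.autograd.grad(u, xt, torch.ones_like(u), create_graph=True)[0]
    u_x, u_t = g[:, :1], g[:, 1:]
    u_xx = torch.autograd.grad(u_x, xt, torch.ones_like(u_x), create_graph=True)[0][:, :1]
    u_tt = torch.autograd.grad(u_t, xt, torch.ones_like(u_t), create_graph=True)[0][:, 1:]
    return u_tt - c ** 2 * u_xx

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    xt_f = torch.rand(256, 2)                                          # interior collocation points
    x0 = torch.rand(64, 1); t0 = torch.zeros_like(x0)                  # t = 0 (initial condition)
    xb = torch.randint(0, 2, (64, 1)).float(); tb = torch.rand(64, 1)  # x = 0 or 1 (fixed ends)
    loss = (residual(xt_f) ** 2).mean() \
        + ((net(torch.cat([x0, t0], dim=1)) - torch.sin(math.pi * x0)) ** 2).mean() \
        + (net(torch.cat([xb, tb], dim=1)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(float(loss))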
Currently, big data stream processing is increasingly critical because of the incremental nature of the heterogeneous and massive volumes of data generated by various sources. C4.5 is one of the most renowned algorithms for addressing data stream classification in online mode by constructing decision trees. However, the traditional C4.5 algorithm significantly slows data stream processing in real-time scenarios. To handle this problem, we propose a new variant of the C4.5 classifier, named the Clustered-C4.5 (CC4.5) algorithm, to classify data streams in online mode. This work integrates a Correlation-based Feature selection and Bat Optimization Technique (CFBOT) with the CC4.5 algorithm to improve classification performance. The CFBOT is employed to measure the dependencies among selected features and to determine the ideal subset for the training and testing tasks. Meanwhile, CC4.5 organizes the data into smaller clusters instead of building a single decision tree. The integration of the CC4.5 algorithm and CFBOT helps the system classify imbalanced and multi-class datasets effectively. The performance of the proposed classifier is carefully studied on a real-world dataset (the UCI Heart Disease dataset) by comparing it with many state-of-the-art approaches in terms of classification accuracy, sensitivity, specificity, the area under the receiver operating characteristic curve (AUC), and intersection over union (IoU). Additionally, the Wilcoxon rank-sum test is employed to assess whether our classification technique offers a statistically significant improvement over other classifier methods. The experimental results demonstrate that the proposed classifier outperforms other approaches with accuracy, sensitivity, specificity, AUC, and IoU of 93.6%, 96.2%, 79.3%, 92.3%, and 90.3%, respectively. Accordingly, we conclude that CC4.5 is a better data stream classification model than the classification approaches recently reported in the literature.
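A heavily simplified sketch of the "cluster, then grow one tree per cluster" idea described above, assuming scikit-learn: a plain correlation ranking stands in for CFBOT (no bat optimization), k-means stands in for the unspecified clustering step, and a public dataset replaces the UCI Heart Disease data. It is an illustration of the general scheme, not the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer   # public stand-in for UCI Heart Disease
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Plain correlation ranking stands in for CFBOT: keep the features whose absolute
# Pearson correlation with the class label is highest (no bat optimization here).
corr = np.array([abs(np.corrcoef(X_tr[:, j], y_tr)[0, 1]) for j in range(X_tr.shape[1])])
keep = np.argsort(corr)[-10:]
X_tr, X_te = X_tr[:, keep], X_te[:, keep]

# Partition the training data into small clusters and grow one entropy-based tree
# (a C4.5-style criterion) per cluster instead of a single global tree.
k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_tr)
trees = {c: DecisionTreeClassifier(criterion="entropy", random_state=0)
              .fit(X_tr[km.labels_ == c], y_tr[km.labels_ == c]) for c in range(k)}

# At prediction time, route each test sample to the tree of its nearest cluster.
pred = np.array([trees[c].predict(x.reshape(1, -1))[0]
                 for x, c in zip(X_te, km.predict(X_te))])
print("accuracy:", round(accuracy_score(y_te, pred), 3))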
This paper compares five methods for pruning decision trees, developed from sets of examples. When used with uncertain rather than deterministic data, decision-tree induction involves three main stages—creating a complete tree able to classify all the training examples, pruning this tree to give statistical reliability, and processing the pruned tree to improve understandability. This paper concerns the second stage—pruning. It presents empirical comparisons of the five methods across several domains. The results show that three methods—critical value, error complexity and reduced error—perform well, while the other two may cause problems. They also show that there is no significant interaction between the creation and pruning methods.
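A short sketch of one of the pruning families compared above, error-complexity (cost-complexity) pruning, assuming scikit-learn and an arbitrary public dataset: grow the full tree, enumerate its pruning sequence, and keep the subtree that scores best on held-out data.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Grow a full tree, enumerate its cost-complexity pruning sequence, and keep the
# subtree that scores best on a held-out validation split.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print("leaves after pruning:", best.get_n_leaves(),
      "| validation accuracy:", round(best.score(X_val, y_val), 3))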
The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.
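A minimal sketch of the entropy and information-gain computation at the heart of ID3, corresponding to the expected-information criterion cited earlier; the toy weather-style examples below are illustrative only.

import math
from collections import Counter

def entropy(labels):
    # Class entropy in bits of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    # Expected reduction in class entropy from splitting the examples on one attribute.
    groups = {}
    for ex, lab in zip(examples, labels):
        groups.setdefault(ex[attribute], []).append(lab)
    expected = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - expected

examples = [
    {"outlook": "sunny", "windy": "false"}, {"outlook": "sunny", "windy": "true"},
    {"outlook": "overcast", "windy": "false"}, {"outlook": "rain", "windy": "false"},
    {"outlook": "rain", "windy": "true"},
]
labels = ["no", "no", "yes", "yes", "no"]
for attr in ("outlook", "windy"):
    print(attr, round(information_gain(examples, labels, attr), 3))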