Article

Induction of Decision Trees

Authors:
  • J. R. Quinlan (RuleQuest Research)

Abstract

The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.
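The abstract refers to ID3's attribute-selection methodology without spelling it out. As a hedged illustration (not code from the paper), the following is a minimal sketch of the entropy-based information-gain criterion ID3 uses for categorical attributes; the function names and toy data are illustrative only:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from partitioning (rows, labels) on attribute attr."""
    n = len(labels)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def best_attribute(rows, labels, attrs):
    """ID3's selection step: expand the attribute with the highest information gain."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))

# Toy data: each row is a dict of categorical attribute values (illustrative only).
rows = [{"outlook": "sunny", "windy": "false"},
        {"outlook": "rain", "windy": "true"},
        {"outlook": "sunny", "windy": "true"},
        {"outlook": "overcast", "windy": "false"}]
labels = ["no", "yes", "no", "yes"]
print(best_attribute(rows, labels, ["outlook", "windy"]))  # -> outlook
```

On the toy rows above, splitting on outlook drives the class entropy to zero, so an ID3-style learner would expand that attribute first.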

... Decision tree [2], [10] is a typical machine learning algorithm which uses a tree-like graph/model to classify unknown instances. The most widely used crisp decision tree induction algorithms include ID3 [2], C4.5 [3], and CART [1], which use information gain, gain ratio, and Gini impurity, respectively, to select the attribute to expand [12]. ...
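The snippet above names information gain, gain ratio, and Gini impurity without defining them. The sketch below, which assumes small in-memory label lists and uses helper names that are illustrative rather than taken from the cited algorithms' implementations, shows how the three criteria score the same candidate split:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_scores(parent, children):
    """Score one candidate split: parent label list vs. list of child label lists."""
    n = len(parent)
    weights = [len(c) / n for c in children]
    # ID3-style information gain: reduction in entropy
    info_gain = entropy(parent) - sum(w * entropy(c) for w, c in zip(weights, children))
    # C4.5-style gain ratio: normalise by the split's own entropy ("split info")
    split_info = -sum(w * math.log2(w) for w in weights if w > 0)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    # CART-style criterion: weighted decrease in Gini impurity
    gini_drop = gini(parent) - sum(w * gini(c) for w, c in zip(weights, children))
    return info_gain, gain_ratio, gini_drop

parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["no"] * 4 + ["yes"]]
print(split_scores(parent, children))
```

Roughly speaking, ID3 ranks splits by the first score, C4.5 by the second, and CART by the third.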
... While crisp decision trees are useful and perform well in building knowledge-based expert systems, they often express and handle inadequately the uncertainty associated with human thinking and understanding [11]. As pointed out by Quinlan [2], they "do not convey potential uncertainties in classification". Thus, in order to handle the uncertainties arising in classification problems effectively, fuzzy decision tree (FDT) induction has been proposed by many authors [4], [5], [6], [7], [8], [9], [11]. ...
Article
Full-text available
In this paper, we present a fuzzy decision tree (FDT) induction algorithm, named FDTAmbig, that handles classification with discrete attributes through uncertainty reduction. In FDTAmbig, uncertainty is measured by classification ambiguity, and the attribute that yields the greatest further reduction of uncertainty is selected as the expanded attribute at each decision node. Experimental results show that FDTAmbig has better generalization capability than the FDT induced with classification entropy (FDTEntr).
... Quinlan [11]'s C4.5 algorithm has been widely studied for its applicability in medical diagnostics. For instance, a study by Jia et al. [5] applied C4.5 to the Cleveland Heart Disease dataset, demonstrating its ability to classify patients accurately. ...
... The C5.0 algorithm, an enhancement of C4.5, has been recognized for its improved performance and efficiency. Quinlan [11] detailed the advancements of C5.0 over its predecessor, including faster processing and better handling of large datasets. Studies like Loh and Shih [8] demonstrated that C5.0 offered superior predictive accuracy and computational efficiency compared to C4.5 and CART. ...
Article
Full-text available
Accurately predicting heart disease is crucial for effective diagnosis and treatment. Decision tree algorithms, such as C4.5, CART, and C5.0, are widely used in medical diagnostics due to their interpretability and performance. This study applies these three prominent decision tree algorithms to a heart disease dataset and compares their effectiveness in predicting heart disease using various performance metrics, including accuracy, precision, recall, and F1 score. The analysis involves training and validating each algorithm on the dataset, followed by a detailed examination of their classification results. Our findings reveal distinct strengths and weaknesses among the algorithms, providing insights into their suitability for heart disease prediction. The results suggest that while all three algorithms perform well, C5.0 exhibits superior accuracy and robustness, making it a potentially more effective tool for heart disease prediction. This paper contributes valuable information for selecting the most appropriate decision tree algorithm for medical diagnostics and highlights the importance of performance metrics in evaluating predictive models.
... A collection of hundreds of decision trees is called an RF model [42], [43]. In a decision tree, each node stands for a data query, and the branches provide potential responses to that question. ...
... This means that, based on the presented results, the MLP model exhibits improved feature sensitivity and classification in the higher frequency domains. The tendency of performance to grow from the lowest to the higher frequency bands, with slightly lower values in the 40–50 Hz band, suggests that the model depends on the frequency of the input, which is important for achieving the best results in frequency-based classification. ...
Article
Full-text available
Rotating machines are crucial in industries for reliable system operation, but unexpected failures can result in significant financial losses and personnel injuries. Hence, fault diagnosis is important. Among the common types of faults in rotating machines, imbalance and misalignment are important in operation but rarely studied. This research utilizes data from four distinct operating conditions: normal operation, imbalanced condition, imbalanced condition combined with horizontal misalignment, and imbalanced condition combined with vertical misalignment. The study proposes a fault diagnosis method that uses the Continuous Wavelet Transform (CWT) to convert vibration signals into RGB (Red - Green - Blue) images and Convolutional Neural Network (CNN) models for fault diagnosis. Additionally, Internet of Things (IoT) integration is implemented using the Message Queuing Telemetry Transport (MQTT) protocol, enabling the update of fault status to the cloud for efficient monitoring. A mobile application provides an intuitive platform for visualizing and interacting with fault data. Moreover, several machine learning (ML) models, such as Random Forest (RF), Multilayer Perceptron (MLP), and Xtreme Gradient Boosting (XGBoost), are used for comparison with the proposed method. The results indicate that CNN yields an overall accuracy of 93.27% compared to other ML methods.
... The high performance of Logistic Regression and ensemble-based models like Random Forest indicates their potential for real-time monitoring of public sentiment trends, enabling health agencies to track misinformation patterns and adjust communication strategies accordingly [31]. These models can help detect shifts in public debates, allowing policymakers to intervene before misinformation narratives escalate [30][31][32]. However, future applications should explore deep-learning approaches, such as transformer-based models, to improve sentiment classification accuracy and to interpret the sarcastic language used in social media texts [33]. ...
Preprint
Full-text available
Social media platforms like Facebook and Instagram are pivotal in shaping public opinion on health interventions, including Community Water Fluoridation (CWF). Despite its recognition as a safe and effective public health measure, CWF remains a polarising topic, with misinformation on these platforms contributing to public mistrust. This study collected 109,117 Facebook and Instagram posts from 2014 to 2023 to examine public sentiment surrounding CWF. The analysis revealed a mix of opinions, with 42.1% positive, 39.1% negative, and 18.8% neutral sentiments. Trends highlighted a surge in negative sentiment during 2017–2019, likely influenced by misinformation and significant public events, while positive sentiment has gradually regained ground in recent years. Key themes included health benefits, safety concerns, and government trust, with positive discussions emphasising CWF’s role in public health and negative discussions focusing on risks and chemical exposure. The study used advanced sentiment analysis models to highlight the importance of monitoring public discourse and addressing misinformation to build trust and support for evidence-based health policies like CWF. These findings provide digital data-driven insights for public health communication strategies to enhance community understanding and acceptance of vital health interventions.
... The decision tree model [20] is a model that learns a hierarchical tree structure composed of conditional expressions with explanatory variables. Table 3 shows the accuracy, recall, F-score, and AUC when the decision tree model showed the highest percentage of accuracy responses after fivefold cross validation. ...
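The snippet describes five-fold cross-validation of a decision tree reported with accuracy, recall, F-score, and AUC. A minimal sketch of that kind of evaluation loop, assuming scikit-learn and a stand-in binary dataset (this is not the cited study's code or data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Stand-in binary classification dataset; the cited study used its own data.
X, y = load_breast_cancer(return_X_y=True)
scores = cross_validate(
    DecisionTreeClassifier(max_depth=4, random_state=0),
    X, y, cv=5,
    scoring=["accuracy", "recall", "f1", "roc_auc"],
)
for metric in ("accuracy", "recall", "f1", "roc_auc"):
    print(metric, scores[f"test_{metric}"].mean())
```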
Article
Full-text available
In recent years, the market for online flea markets, which are consumer-to-consumer (C2C) services where goods are bought and sold among users, has been expanding. In such services, sellers (individuals that offer goods or services for sale) create product details, including price, condition, shipping method, and images, when listing their items for sale. We consider the product image to be the first thing users see when selecting a product as a thumbnail, significantly impacting their purchase decisions. In this study, we proposed a discriminant model of purchase decisions for online flea market data to clarify the factors influencing these decisions based on product details. Specifically, we used metadata such as price and delivery method, along with image labels for product thumbnails, as features. We created and compared models with three patterns: metadata only, image labels only, and a combination of metadata and image labels. We created four types of models in this study: logistic regression, decision tree, gradient boosting, and random forest. We selected the models based on their accuracy evaluations. Our analysis revealed that the model using both metadata and image labels as features, combined with the gradient boosting method, had the highest accuracy. The partial dependence plots of the selected models highlighted the features important for users' purchase decisions.
... Decision trees can be especially beneficial for comprehending the impact of specific features (e.g., function calls or external dependencies) on flaky test behavior. Decision trees are an effective tool for comprehending flaky test behavior when the decision boundaries are straightforward and comprehensible, as per [29]. • Naive Bayes: Assuming feature independence, the probabilistic classifier Naive Bayes, which is based on Bayes' Theorem, performs especially well in high-dimensional spaces. ...
Article
Full-text available
Software development is significantly impeded by flaky tests, which intermittently pass or fail without requiring code modifications, resulting in a decline in confidence in automated testing frameworks. Code smells (i.e., test case or production code) are the primary cause of test flakiness. In order to ascertain the prevalence of test smells, researchers and practitioners have examined numerous programming languages. However, one isolated experiment was conducted, which focused solely on one programming language. Across a variety of programming languages, such as Java, Python, C++, Go, and JavaScript, this study examines the predictive accuracy of a variety of machine learning classifiers in identifying flaky tests. We compare the performance of classifiers such as Random Forest, Decision Tree, Naive Bayes, Support Vector Machine, and Logistic Regression in both single-language and cross-language settings. In order to ascertain the impact of linguistic diversity on the flakiness of test cases, models were trained on a single language and subsequently tested on a variety of languages. The following key findings indicate that Random Forest and Logistic Regression consistently outperform other classifiers in terms of accuracy, adaptability, and generalizability, particularly in cross-language environments. Additionally, the investigation contrasts our findings with those of previous research, exhibiting enhanced precision and accuracy in the identification of flaky tests as a result of meticulous classifier selection. We conducted a thorough statistical analysis, which included t-tests, to assess the importance of classifier performance differences in terms of accuracy and F1-score across a variety of programming languages. This analysis emphasizes the substantial discrepancies between classifiers and their effectiveness in detecting flaky tests. The datasets and experiment code utilized in this study are accessible through an open-source GitHub repository to facilitate reproducibility. Our results emphasize the effectiveness of probabilistic and ensemble classifiers in improving the reliability of automated testing, despite certain constraints, including the potential biases introduced by language-specific structures and dataset variability. This research provides developers and researchers with practical insights that can be applied to the mitigation of flaky tests in a variety of software environments.
... It has proven to be an effective technique in various sectors, providing more reliable forecasts by combining predictions from multiple models (Gupta et al. 2024b). Weak individual learners, such as DT learners, are frequently sensitive to training input and can overfit (Quinlan 1986; Zhou 2012). To address these challenges, merging several weak learners has become a feasible method known as the ensemble method in machine learning (Zhou 2012). ...
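As a hedged sketch of the ensemble idea described above (combining several high-variance tree learners to stabilise predictions), assuming scikit-learn and synthetic data rather than the cited authors' setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; a single deep tree is a high-variance "weak" learner here.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
print("single tree :", cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean())
# Bagging trains many trees on bootstrap resamples and votes, reducing variance.
print("bagged trees:", cross_val_score(BaggingClassifier(n_estimators=100, random_state=0), X, y, cv=5).mean())
```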
Article
Full-text available
Determining the ultimate axial compressive load-carrying capacity (UACLC) of square/rectangular concrete-filled steel tube (S/RCFST) short columns is crucial for maintaining structural integrity in civil engineering projects. This investigation presents an interactive ensemble learning approach utilizing five machine learning algorithms: Decision Trees (DT), Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), Boosted Regression Trees (BRT), and Categorical Gradient Boosting (CatBoost). The study employed a comprehensive dataset comprising 932 experimental samples to train and validate 491 models, with each model undergoing hyperparameter optimization. Among these, the CatBoost model exhibited superior performance, achieving a training R² of 0.999 and a testing R² of 0.984 when configured with optimal hyperparameters (learning rate = 0.1, maximum depth = 5, and 2000 estimators). The model's accuracy was further demonstrated by a testing weighted mean absolute percentage error (WMAPE) of 0.071 and a root mean square error (RMSE) of 0.316 MPa. Monotonicity analysis confirmed the consistency of the model’s predictions, revealing that material properties such as concrete compressive strength and steel yield stress had the most significant impact on UACLC predictions. To facilitate practical application, the researchers developed a graphical user interface (GUI) enabling real-time predictions, which allows engineers to integrate the model into their structural design processes. This innovative framework provides a robust and highly precise tool for predicting UACLC, addressing significant limitations in conventional methods and enhancing the practical design of S/RCFST columns.
... Decision Tree classifier (DTC) is a supervised machine learning approach that can be used to solve regression and classification issues [49]. The DTC algorithm seeks to divide the dataset into more manageable chunks, with a unique class label or value being assigned to each input [50]. ...
Article
Full-text available
High strength concrete (HSC) is undoubtedly among the most advanced building materials available nowadays. Its production involves simple steps with a variety of additives, including cement, water, fine and coarse aggregates, fly ash (FA), and ground granulated blast furnace slag (GGBFS). Although the interactions between these materials do not strictly follow a mathematical formula, the amounts of these ingredients have a major impact on the compressive strength. The most often used mechanical property for quality monitoring in concrete is its compressive strength after 28 days. It is crucial to have a tool that can directly simulate these interactions prior to production and casting the specimen. Machine learning (ML) models have proven to be an effective technique for predicting concrete compressive strength, yielding results that can be more reliable than conventional methods. For the experimental data, the XGBoost Regression (XGB) model is the most dependable, with an R² of 0.92, a mean absolute error (MAE) of 2.92 MPa, and a root mean squared error (RMSE) of 4.45 MPa. In addition, particle swarm optimization was used to improve the relationship between input parameters and concrete compressive strength (CS). The study emphasises that machine learning approaches, specifically XGB, can estimate the CS of building materials more accurately than other models. It further provides researchers a swift and more reliable way to evaluate the effects of materials along with other factors on CS, eliminating the requirement for lengthy and expensive trial experiments.
... Since there may be many such pairs, we first sort by lexicographic order and pick a pair of adjacent points (p_i, p_{i+1}) with different values. There still may be multiple such pairs and we pick the pair that maximizes information gain [21] by looking at the points left and right of the pair, i.e., by looking at the subsets {p_1, ..., p_i} and {p_{i+1}, ...} ...
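A small sketch of the boundary-selection step the excerpt outlines, simplified to scalar points and class-like values and not taken from the preprint (names are illustrative): sort the points, consider only adjacent pairs with differing values, and keep the boundary whose left/right partition maximizes information gain.

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def best_boundary(points, values):
    """Among adjacent sorted points with differing values, pick the boundary
    whose left/right partition gives the largest information gain."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    pts = [points[i] for i in order]
    vals = [values[i] for i in order]
    total, n = entropy(vals), len(vals)
    best, best_gain = None, -1.0
    for i in range(n - 1):
        if vals[i] == vals[i + 1]:          # only boundaries between differing values
            continue
        left, right = vals[:i + 1], vals[i + 1:]
        gain = total - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best, best_gain = (pts[i] + pts[i + 1]) / 2.0, gain
    return best, best_gain

print(best_boundary([1, 2, 3, 10, 11], ["a", "a", "a", "b", "b"]))  # boundary near 6.5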
Preprint
Full-text available
This short paper proposes to learn models of satisfiability modulo theories (SMT) formulas during solving. Specifically, we focus on infinite models for problems in the logic of linear arithmetic with uninterpreted functions (UFLIA). The constructed models are piecewise linear. Such models are useful for satisfiable problems but also provide an alternative driver for model-based quantifier instantiation (MBQI).
... where p_i is the proportion of samples in S belonging to class i, and k is the number of classes [35]. ...
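The excerpt's defining equation did not survive extraction; since the surrounding reference is to the decision-tree literature, the quantity defined over a sample set S with class proportions p_i is presumably the entropy measure used by ID3, reconstructed here as an assumption rather than a quote:

\mathrm{Entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i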
Article
Full-text available
Chatterbots, also known as chatbots, have become essential for improving human-computer interaction in a number of fields, including e-commerce, healthcare, education, and customer support. From rule-based systems like ELIZA to contemporary AI-driven solutions employing modern machine learning (ML) techniques, this review paper examines the development of chatbots. It highlights how ML technologies, such as decision trees (DT), support vector machines (SVM), linear regression, and natural language processing (NLP), can be used to build chatbots that are more context-aware, responsive, and adaptive. The paper highlights important advances including deep learning, multimodal capabilities, and continuous learning mechanisms by looking at recent advancements and the mathematical models that support these techniques. These developments have driven increasing adoption of chatbots by allowing them to provide personalized interactions, enhance accessibility, and reduce repetitive tasks. To open the door for further study and applications, this paper aims to shed light on the challenges and the efficacy of using ML in chatbot building.
... This model is constructed from a root node, branches, and leaf nodes forming a hierarchical structure. The root node represents the tests on the features and the branches represent the outcomes of those tests, while the final classification is represented by the leaf node [19]. The Decision Tree model selects features with the highest information gain to ensure maximal separation between different classes. ...
Article
Full-text available
Classifying medical datasets using machine learning algorithms could help physicians provide accurate diagnoses and suitable treatment. For instance, stroke is one of the serious diseases that attacks many patients annually, and analyzing its symptoms in advance could save patients' lives. The warning signs of stroke can be investigated and used as attributes or predictors for machine learning models. This study evaluates the performance of four machine learning models to classify stroke datasets. Specifically, Decision Tree, Naïve Bayes, K-Nearest Neighbor (KNN) and Linear Discriminant Analysis (LDA) models were trained on 11 attributes collected from 5110 patients to predict stroke risk. The findings showed that KNN outperformed the three other models with an achieved accuracy of 90%. The study also considered balancing the employed data prior to validating the models to provide accurate classification. A cross-validation technique was used to avoid over-fitting and under-fitting during the training phases.
... Chen & Guestrin, 2016). GBDT, which is well known for its training efficiency and accuracy in solving regression problems, is a decision-tree-based algorithm (Quinlan, 1986; Song & Lu, 2015) in which multiple decision trees are trained sequentially to prevent overfitting and improve performance. The intensity model is trained on the input variables at the grid points with active fires (i.e., where the binary digit equals 1 in the binary frames), while the original FRP values are used. Following RAVE, valid fires are represented by valid nonzero FRP values; otherwise the values are zero. ...
Article
Full-text available
Fire activities introduce hazardous impacts on the environment and public health by emitting various chemical species into the atmosphere. Most operational air quality forecast (AQF) models estimate smoke emissions based on the latest available satellite fire products, which may not represent real‐time fire behaviors without considering fire spread. Hence, a novel machine learning (ML) based fire spread forecast model, the Fire Intensity and spRead forecAst (FIRA), is developed for AQF model applications. FIRA aims to improve the performance of AQF models by providing realistic, dynamic fire characteristics including the spatial distribution and intensity of fire radiative power (FRP). In this study, data sets in 2020 over the continental United States (CONUS) and a historical California fire in 2024 are used for model training and evaluation. For application assessment, FIRA FRP predictions are applied to the Unified Forecast System coupled with smoke (UFS‐Smoke) model as inputs, focusing on a California fire case in September 2020. Results show that FIRA captures fire spread with R‐squared (R²) near 0.7 and good spatial similarity (∼95%). The comparison between UFS‐Smoke simulations using near‐real‐time fire products and FIRA FRP predictions show good agreements, indicating that FIRA can accurately represent future fire activities. Although FIRA generally underestimates fire intensity, the uncertainties can be mitigated by applying scaling factors to FRP values. Use of the scaled FIRA largely outperforms the experimental UFS‐Smoke model in predicting aerosol optical depth and the three‐dimensional smoke contents, while also demonstrating the ability to improve surface fine particulate matter (PM2.5) concentrations affected by fires.
... Through iteration, the parameter s approaches the optimum step by step. Six common classification methods, including Random Forest [8], Decision Tree [9], Naïve Bayes [10], K-Nearest Neighbor, AdaBoost [11] and SVM [12], are used to test the performance with the evaluation indices precision and recall. The comparison result is shown in Table 2. From the comparison result, we find that every method's precision and recall have been raised by about 2% and 4%, respectively. ...
Article
Full-text available
Traditionally, image segmentation is conducted with simple image-processing techniques that cannot be operated automatically. In this paper, we present a classification method to find the boundary area used to segment character images. Referring to sample points and sample areas, the essential segmentation information is extracted. By merging different image transformations, including rotation, erosion and dilation, more features are used to train and test the segmentation model. Parameter tuning is also proposed to optimize the model. By means of cross-validation, the basic training model and parameter tuning are integrated in an iterative way. The comparison results show that the best performance reaches 97.84% in precision and 94.09% in recall.
... Historically, financial data analysis has heavily relied on traditional statistical models and machine learning algorithms such as linear regression, decision trees, and support vector machines (SVMs). These methods are valued for their interpretability, where the decision-making process is transparent and straightforward, making them suitable for applications where model understanding is crucial [22][23][24][25]. However, as the complexity and volume of financial data increase, these models often struggle with high-dimensional and temporally dynamic data, requiring extensive feature engineering and manual selection that is time-consuming and susceptible to biases. ...
Article
Full-text available
Convolutional Neural Networks (CNNs) excel in feature extraction and pattern recognition in areas like image classification and speech processing. However, their application in the financial sector has been limited due to the complexity, high dimensionality, and temporal nature of financial data, as well as the need for model interpretability. This study, based on CNN technology, proposes a Reordering-Enhanced Grad-CAM algorithm to improve model interpretability and reliability, offering transparent and dependable tools for financial decision-making. The innovation of this study lies in two key aspects: firstly, it replaces the traditional manual variable selection approach with automatic feature extraction and fusion using CNNs, demonstrating the effectiveness of deep learning in handling large-scale financial data. Secondly, we propose a novel reordering-based iterative algorithm that adapts Grad-CAM, originally designed for image classification, to multi-source financial time series data, treating sliding window data segments as pseudo-images to improve interpretability and identify critical features. Using data from Shanghai and Shenzhen A-shares (1990–2020), the Reordering-Enhanced Grad-CAM technique generated heatmaps that identified key predictive indicators, leading to improved model performance. Robustness analysis demonstrated that over 70% of important variables were consistently identified, with some models reaching up to 100%, confirming the reliability and stability of our method in financial distress prediction.
... We used models Elastic Net [30], Decision Tree [31], k-Nearest Neighbors [32], Support Vector Machine [33], Random Forest [34], and Extreme Gradient Boosting [35]. The following sections detail datasets and modeling. ...
Preprint
Full-text available
This paper investigates the critical role of hyperparameters in predictive multiplicity, where different machine learning models trained on the same dataset yield divergent predictions for identical inputs. These inconsistencies can seriously impact high-stakes decisions such as credit assessments, hiring, and medical diagnoses. Focusing on six widely used models for tabular data - Elastic Net, Decision Tree, k-Nearest Neighbor, Support Vector Machine, Random Forests, and Extreme Gradient Boosting - we explore how hyperparameter tuning influences predictive multiplicity, as expressed by the distribution of prediction discrepancies across benchmark datasets. Key hyperparameters such as lambda in Elastic Net, gamma in Support Vector Machines, and alpha in Extreme Gradient Boosting play a crucial role in shaping predictive multiplicity, often compromising the stability of predictions within specific algorithms. Our experiments on 21 benchmark datasets reveal that tuning these hyperparameters leads to notable performance improvements but also increases prediction discrepancies, with Extreme Gradient Boosting exhibiting the highest discrepancy and substantial prediction instability. This highlights the trade-off between performance optimization and prediction consistency, raising concerns about the risk of arbitrary predictions. These findings provide insight into how hyperparameter optimization leads to predictive multiplicity. While predictive multiplicity allows prioritizing domain-specific objectives such as fairness and reduces reliance on a single model, it also complicates decision-making, potentially leading to arbitrary or unjustified outcomes.
... Early attempts at anomaly detection in this domain primarily relied on statistical methods [4][5][6] and traditional machine learning models [7][8][9]. While these approaches have demonstrated utility in specific scenarios, they often treat blockchain data as isolated instances, failing to fully exploit the inherently relational and heterogeneous structure. ...
Article
Full-text available
Detecting fraudulent activities such as Ponzi schemes within smart contract transactions is a critical challenge in decentralized finance. Existing methods often fail to capture the heterogeneous, multi-faceted nature of blockchain data, and many graph-based models overlook the contextual patterns that are vital for effective anomaly detection. In this paper, we propose MVCG-SPS, a Multi-View Contrastive Graph Neural Network designed to address these limitations. Our approach incorporates three key innovations: (1) Meta-Path-Based View Construction, which constructs multiple views of the data using meta-paths to capture different semantic relationships; (2) Reinforcement-Learning-Driven Multi-View Aggregation, which adaptively combines features from multiple views by optimizing aggregation weights through reinforcement learning; and (3) Multi-Scale Contrastive Learning, which aligns embeddings both within and across views to enhance representation robustness and improve anomaly detection performance. By leveraging a multi-view strategy, MVCG-SPS effectively integrates diverse perspectives to detect complex fraudulent behaviors in blockchain ecosystems. Extensive experiments on real-world Ethereum datasets demonstrated that MVCG-SPS consistently outperformed state-of-the-art baselines across multiple metrics, including F1 Score, AUPRC, and Rec@K. Our work provides a new direction for multi-view graph-based anomaly detection and offers valuable insights for improving security in decentralized financial systems.
... Decision trees (DTs) are a fundamental component of machine learning and were first introduced in 1986 [37]. They have a flowchart-like structure where each internal node represents a test on an attribute, each branch denotes the outcome of the test, and each leaf node holds a class label [38]. ...
Article
Full-text available
Gouty arthritis (GA) and its association with kidney failure present significant challenges in healthcare, necessitating effective detection and management strategies. GA, characterized by the deposition of monosodium urate crystals in joints and other tissues, leads to inflammation and severe joint pain, often accompanied by metabolic comorbidities such as myocardial infarction and diabetes. Although GA has been widely studied in the medical field, limited research has explored the use of machine learning (ML) to identify key biomarkers affecting disease progression. This study aims to bridge this gap by leveraging ML models for predictive analysis. In this study, machine learning models such as decision trees, random forests, logistic regression, and artificial neural networks were used to classify GA using demographic, clinical, and laboratory data, and, most importantly, to identify the factors that affect GA. The analysis yielded promising results, with the decision tree model achieving the highest accuracy of 92.85%. Moreover, key factors such as urea, creatinine, and hemoglobin levels were identified during the initial attack, shedding light on the pathophysiology of GA. This study demonstrates how ML methods help identify key factors affecting GA and assist in disease management. By leveraging machine learning techniques, it is possible to refine the factors affecting GA and inform personalized interventions, ultimately improving patient care and outcomes.
... c) Decision tree classifier: A Decision Tree algorithm was employed for the classification task. Decision Trees are widely used in ML for their interpretability and ability to handle both categorical and continuous data [44]. ...
Article
Full-text available
Text classification consists in attributing a text to its corresponding category. It is a crucial task in natural language processing (NLP), with applications spanning content recommendation, spam detection, sentiment analysis, and topic categorization. While significant advancements have been made in text classification for widely spoken languages, Arabic remains underrepresented despite its large and diverse speaker base. Another challenge is that, unlike flat classification, hierarchical text classification involves categorizing texts into a multi-level taxonomy. This adds layers of complexity, particularly in distinguishing between closely related categories within the same super-class. To tackle these challenges, we propose a novel approach using AraGPT2, a variant of the Generative Pre-trained Transformer 2 (GPT-2) model adapted specifically for Arabic. Fine-tuning AraGPT2 for hierarchical text classification leverages the model's pre-existing linguistic knowledge and adapts it to recognize and classify Arabic text according to hierarchical structures. Fine-tuning, in this context, refers to the process of training a pre-trained model on a specific task or dataset to improve its performance on that task. Our experiments and comparative study demonstrate the efficiency of our solution. The fine-tuned AraGPT2 classifier achieves a hierarchical HF score of 80.64%, outperforming the machine learning-based classifier, which scores 41.90%.
... 2. Random Forest Algorithm. The random forest algorithm is a learning method that uses multiple decision trees. A decision tree is a classification model that partitions data by sequentially applying a series of classification rules (Quinlan, 1986; Breslow & Aha, 1997). Random forest is an algorithm that improves predictive performance by combining multiple decision trees to compensate for the weaknesses of a single decision tree (Breiman, 2001); it performs 'ensemble learning', deriving the final result by averaging the trees' outputs or by majority vote. ...
... The parameters determining the structure of the tree are the predictor variables used to make the decision at each node and the corresponding numeric threshold values. [19], [20] and [21] introduced the most widely cited algorithms for learning decision trees: CART, ID3, and C4.5 (ID3's successor), respectively. These algorithms all proceed greedily, growing a tree in a top-down manner, but differ in the objective functions used to decide on the splits. ...
Preprint
Model trees provide an appealing way to perform interpretable machine learning for both classification and regression problems. In contrast to "classic" decision trees with constant values in their leaves, model trees can use linear combinations of predictor variables in their leaf nodes to form predictions, which can help achieve higher accuracy and smaller trees. Typical algorithms for learning model trees from training data work in a greedy fashion, growing the tree in a top-down manner by recursively splitting the data into smaller and smaller subsets. Crucially, the selected splits are only locally optimal, potentially rendering the tree overly complex and less accurate than a tree whose structure is globally optimal for the training data. In this paper, we empirically investigate the effect of constructing globally optimal model trees for classification and regression with linear support vector machines at the leaf nodes. To this end, we present mixed-integer linear programming formulations to learn optimal trees, compute such trees for a large collection of benchmark data sets, and compare their performance against greedily grown model trees in terms of interpretability and accuracy. We also compare to classic optimal and greedily grown decision trees, random forests, and support vector machines. Our results show that optimal model trees can achieve competitive accuracy with very small trees. We also investigate the effect on accuracy of replacing axis-parallel splits with multivariate ones, forgoing interpretability while potentially obtaining greater accuracy.
... The subsequent emergence of deep learning, specifically with convolutional neural networks (CNNs) and recurrent neural networks (RNNs), further transformed geoscience workflows, allowing models to capture intricate spatial and temporal relationships within geological data (Quinlan, 1986). Over time, interdisciplinary collaboration has gained prominence, combining domain expertise with machine learning capabilities to enhance the accuracy and reliability of predictions in geoscience applications (Liu et al., 2018). ...
... Among these ML algorithms, the light gradient boosting machine (LGBM) [39], which is a ML framework based on the gradient boosting machine [40] algorithm, has demonstrated good performance in various fields owing to its efficiency and accuracy. As supervised learning algorithms, random forest (RF) [41] and decision tree (DT) [42] are advantageous for large-scale datasets with missing values. Therefore, in this study, multiple models are used to capture different aspects of the data, reducing the risk of instability in practical applications that may arise from high bias in a single model. ...
Article
Full-text available
The increasing scale of wind farms demands more efficient approaches to turbine monitoring and maintenance. Here, we present an innovative framework that combines enhanced kernel principal component analysis (KPCA) with ensemble learning to revolutionize normal behavior modeling (NBM) of wind turbines. By integrating random kitchen sinks (RKS) algorithm with KPCA, we achieved a 25.21% reduction in computational time while maintaining model accuracy. Our mixed ensemble approach, synthesizing LightGBM, random forest, and decision tree algorithms, demonstrated exceptional performance across diverse operational conditions, achieving R² values of 0.9995 in primary testing. The framework reduced mean absolute error by 25.1% and mean absolute percentage error by 33.4% compared to conventional methods. Notably, when tested across three distinct operational environments, the model maintained robust performance (R² > 0.97), demonstrating strong generalization capability. The system automatically detects anomalies using a 0.1% threshold, enabling real‐time monitoring of 78 variables across 136,000+ operational records. This scalable approach integrates seamlessly with existing SCADA infrastructure, offering a practical solution for large‐scale wind farm management. Our findings establish a new paradigm for wind turbine monitoring, combining computational efficiency with unprecedented accuracy in normal behavior prediction.
... Support Vector Machine (SVM) uses the principle of maximizing the margin between classes in a high-dimensional feature space, leading to robust performance, especially in high-dimensional or small-sample scenarios [109]. Decision Tree (DT) constructs a hierarchical structure of if-then rules by recursively partitioning the feature space, offering ease of interpretation due to its rule-based nature [110]. Random Forest (RF) is an ensemble method that aggregates multiple decision trees grown on bootstrap samples, enhancing predictive accuracy and reducing overfitting [111]. ...
Article
Full-text available
The argan tree (Argania spinosa) is a rare species native to southwestern Morocco, valued for its fruit, which produces argan oil, a highly prized natural product with nutritional, health, and cosmetic benefits. However, increasing deforestation poses a significant threat to its survival. This study monitors changes in an argan forest near Agadir, Morocco, from 2017 to 2023 using Sentinel-2 satellite imagery and advanced image processing algorithms. Various machine learning models were evaluated for argan tree detection, with LightGBM achieving the highest accuracy when trained on a dataset integrating spectral bands, temporal features, and vegetation indices information. The model achieved 100% accuracy on tabular test data and 85% on image-based test data. The generated deforestation maps estimated an approximate forest loss of 2.86% over six years. This study explores methods to enhance detection accuracy, provides valuable statistical data for deforestation mitigation, and highlights the critical role of remote sensing, advanced image processing, and artificial intelligence in environmental monitoring and conservation, particularly in argan forests.
... First, researchers design traffic features (e.g., the number of packets, minimum/maximum packet size) based on specific classification requirements (e.g., protocol/traffic type). These features are then fed into various classifiers based on machine learning models, including decision trees (DT) 22 , k-nearest neighbors (kNN) 23 , and support vector machines (SVM) 24 , for classification. These methods break down the overall classification task into multiple sub-problems (e.g., feature derivation, machine learning model evaluation) and address them individually. ...
Article
Full-text available
Traffic classification is a crucial technique in network management that aims to identify and manage data packets to optimize network efficiency, ensure quality of service, enhance network security, and implement policy management. Because graph convolutional networks (GCNs) take into account not only the features of the data itself but also the relationships among sets of data during classification, many researchers have proposed their own GCN-based traffic classification methods in recent years. However, most of the current approaches use a two-layer GCN, primarily due to the over-smoothing problem associated with deeper GCNs. In scenarios with small samples, a two-layer GCN may not adequately capture relationships among traffic data, leading to limited classification performance. Additionally, during graph construction, traffic usually needs to be trimmed to a uniform length, and for traffic with insufficient length, zero-padding is typically applied as an extension. This zero-padding strategy poses significant challenges in traffic classification with small samples. In this paper, we propose a method based on autoencoders (AE) and deep graph convolutional networks (ADGCN) for traffic classification on few-shot datasets. ADGCN first utilizes an AE to reconstruct the traffic. The AE enables shorter traffic to learn abstract feature representations from longer traffic of the same class to replace the zeros, mitigating the adverse effects of zero-padding. The reconstructed traffic is then classified using GCNII, a deep GCN model that addresses the challenge of insufficient data samples. ADGCN is an end-to-end traffic classification method applicable to various scenarios. According to experimental results, ADGCN can achieve a classification accuracy improvement of 3.5 to 24% compared to existing state-of-the-art methods. The code is available at https://github.com/han20011019/ADGCN.
Article
This paper aims to create an innovative approach to improving IoT-based smart parking systems by integrating machine learning (ML) and Artificial Intelligence (AI) with mathematical approaches in order to increase the accuracy of the parking availability predictions. Three regression-based ML models, random forest, gradient boosting, and LightGBM, were developed and their predictive capability was compared using data collected from three parking locations in Skopje, North Macedonia from 2019 to 2021. The main novelty of this study is based on the use of autoregressive modeling strategies with lagged features and Z-score normalization to improve the accuracy of regression-based time series forecasts. Bayesian optimization was chosen for its ability to efficiently explore the hyperparameter space while minimizing RMSE. The lagged features were able to capture the temporal dependencies more effectively than the other models, resulting in lower RMSE values. The LightGBM model with lagged data produced an R2 of 0.9742 and an RMSE of 0.1580, making it the best model for time series prediction. Furthermore, an IoT-based system architecture was also developed and deployed which included real-time data collection from sensors placed at the entry and exit of the parking lots and from individual slots. The integration of ML, AI, and IoT technologies improves the efficiency of the parking management system, reduces traffic congestion and, most importantly, offers a scalable approach to the development of urban mobility solutions.
Article
The production of biogas in wastewater treatment plants (WWTPs), often considered critical facilities, is a significant element of energy and environmental security. Given increasing demands to reduce greenhouse gas emissions and the pursuit of energy self-sufficiency, the role of biogas in the energy sector keeps steadily growing, granting it strategic potential. In Poland and other European Union countries, biogas production is supported by policies promoting renewable energy sources, enhancing its importance in the energy transition process. This study analyses biogas yields and their impact on achieving energy and environmental security goals. Additionally, the use of meta-regression methods and machine learning aims to improve biogas yield prediction based on a range of process parameters.
Article
Sustainable seaweed cultivation is crucial for marine environmental protection, ecosystem health, socio-economic development, and carbon sequestration. Accurate and timely information on the distribution, extent, species, and production of cultivated seaweeds is essential for tracking biomass production, monitoring ecosystem health, assessing environmental impacts, optimizing cultivation planning, supporting investment decisions, and quantifying carbon sequestration potential. However, this important information is usually lacking. This study developed a high-precision monitoring approach by integrating Otsu thresholding features with random forest classification, implemented through Google Earth Engine using Sentinel-2 imagery (10-m). The method was applied to analyze spatiotemporal variations of seaweed cultivation across the Korean Peninsula from 2017 to 2023. Results showed that annual cultivation acreage in North Korea remained relatively stable between 1506 and 2033 ha, while it experienced a significant increase of 8209 ha in South Korea. By integrating spectral features, seaweed phenology, and field cultivation practices, we successfully differentiated the predominant species: laver (Pyropia) and kelp (Saccharina and Undaria). During the 2022–2023 cultivation season, South Korea’s farms comprised 78% laver and 22% kelp, while North Korea’s showed an inverse distribution. A strong correlation (r2 = 0.99) between acreage and seaweed production enabled us to estimate annual seaweed production in North Korea, effectively addressing data gaps in regions with limited statistics. Our approach demonstrates the potential for global seaweed cultivation monitoring, while the spatial analysis lays the foundation for identifying potential cultivation zones. Given the relatively low initial investment requirement of seaweed farming and significant economic return, this approach offers valuable insights for promoting economic development and food security, ultimately supporting sustainable aquaculture management.
Article
The Flamelet Generated Manifolds (FGM) method is popular in turbulent combustion simulation because of its low computational cost and ability to combine with detailed chemical reaction mechanisms. However, FGM suffers from its coarse assumption of the joint PDF shape when two controlling variables are used, which limits its fidelity. In order to build a joint presumed PDF generation method with high accuracy for FGM, a data-driven method based on prior knowledge of PDF characteristics was proposed with a random forest model. First, an error analysis for the presumed PDF was conducted based on the data of two representative flames (Sandia Flame D and Sydney SM1). The results show that the β PDF is still a good option for the mixture fraction in the construction of the joint PDF, and that the conditional β PDF proposed in this paper can well represent the conditional distribution of the progress variable. A random forest model was constructed based on the experimental data to identify the parameters of the conditional β PDF. The results show that the new model, obtained from a small amount of data, can decrease the FGM prediction error effectively and shows general applicability to different turbulent flames.
Article
Failure to manage emotional withdrawal symptoms can exacerbate relapse to methamphetamine use. Understanding the neuro-mechanisms underlying methamphetamine overuse and the associated emotional withdrawal symptoms is crucial for developing effective clinical strategies. This study aimed to investigate the distinct functional contributions of fine-scale gyro-sulcal signaling in the psychopathology of patients with methamphetamine use disorder and its associations with emotional symptoms. We recruited 48 male abstinent methamphetamine use disorders and 48 age- and gender-matched healthy controls, obtaining their resting-state functional magnetic resonance imaging data along with scores on anxiety and depressive symptoms. The proposed deep learning model, a spatio-temporal graph convolutional network utilizing gyro-sulcal subdivisions, achieved the highest average classification accuracy in distinguishing resting-state functional magnetic resonance imaging data of methamphetamine use disorders from healthy controls. Within this model, nodes in the lateral orbitofrontal cortex, and the habitual and executive control networks, contributed most significantly to the classification. Additionally, emotional symptom scores were negatively correlated with the sum of negative functional connectivity in the right caudal anterior cingulate sulcus and the functional connectivity between the left putamen and pallidum in methamphetamine use disorders. These findings provide novel insights into the differential functions of gyral and sulcal regions, enhancing our understanding of the neuro-mechanisms underlying methamphetamine use disorders.
Article
Full-text available
Billions of people worldwide are affected by vision impairment, majorly caused by age-related degradation and refractive errors. Diabetic Retinopathy (DR) and Macular Hole (MH) are among the most prevalent senescent retinal diseases. Machine intelligence can assist ophthalmologists and clinicians in fast and accurate disease diagnosis by identifying patterns in disease progression for a better healthcare system. In this paper, the Retinal Fundus Multi-disease Image Dataset (RFMiD) is used to design a machine intelligence system with two chief components, namely a disease risk classifier and a multi-label classifier. The disease risk classifier predicts whether the retinal fundus image is infected or not. Based on the prediction of the disease risk classifier, a multi-label classifier can be applied to obtain probabilities for the susceptibility to DR and MH. Finally, an ensemble is employed with the best of 3 models for each classifier. The disease risk predictor attained a peak F1-score of 88%, while the multi-label classifier achieved an Area Under the Curve (AUC) score of 86%. However, the individual binary classifiers for DR and MH reached maximum F-scores of 91% and 93%, respectively.
Article
In this article, we introduce a novel framework for learning spatial concepts within a human-in-the-loop (HITL) context, highlighting the critical role of explainability in AI systems. By incorporating human feedback, the approach enhances the learning process, making it particularly suitable for applications where user trust and interpretability are essential, such as AiTR. Namely, we introduce a new parametric similarity measure for spatial relations expressed as histograms of forces (HoFs). Next, a spatial concept is represented as a spatially attributed graph and HoF bundles. Last, a process is outlined for utilizing this structure to make decisions and learn from human feedback. The framework’s robustness is demonstrated through examples with diverse user types, showcasing how varying feedback strategies influence learning efficiency, accuracy, and ability to tailor the system to a particular user. Overall, this framework represents a promising step toward human-centered AI systems capable of understanding complex spatial relationships while offering transparent insights into their reasoning processes.
Article
Full-text available
OBJECTIVE: Obesity is a global health problem. The aim is to analyze the effectiveness of machine learning models in predicting obesity classes and to determine which model performs best in obesity classification. METHODS: We used a dataset with 2,111 individuals categorized into seven groups based on their body mass index, ranging from average weight to class III obesity. Our classification models were trained and tested using demographic information like age, gender, and eating habits without including height and weight variables. RESULTS: The study demonstrated that when trained on demographic information, machine learning can classify body mass index. The random forest model provided the highest performance scores among all the classification models tested in this research. CONCLUSION: Machine learning methods have the potential to be used more extensively in the classification of obesity and in more effective efforts to combat obesity.
Article
Full-text available
Retracted paper: Pipeline accidents have been recorded, and they often result in catastrophic consequences for the environment and society, with a great deal of economic loss. Standard methods of evaluating pipeline risk have stressed index-based and condition-based data assessment processes. Data mining represents a shift from verification-driven data analysis approaches to discovery-driven methods in risk assessment.
Article
This work presents a comprehensive and chronologically ordered survey of existing studies and data sources on Electrocardiogram (ECG) based biometric recognition systems. This survey is organized in terms of the two main goals pursued in it: first, a description of the main ECG features and recognition techniques used in the existing literature, including a comprehensive compilation of references; second, a survey of the ECG databases available and used by the referenced studies. The most relevant characteristics of the databases are identified, and a comprehensive compilation of databases is given. To date, no other work has presented such a complete overview of both studies and data sources for ECG-based biometric recognition. Readers interested in the subject can obtain an understanding of the state of the art, easily identifying specific key papers by using different criteria, and become aware of the databases where they can test their novel algorithms.
Article
Full-text available
Lung cancer remains one of the major causes of mortality worldwide, but diagnosis and treatment at an early stage markedly improve the chance of survival. Predictive modeling still challenges the medical community: access to vast amounts of data is a double-edged sword, and while data mining addresses the data retrieval problem, the broader big-data challenge is not yet conquered. In this research, classification algorithms were employed to predict lung cancer incidence in patients. These methods are well suited to primary health care units because they estimate lung cancer probability from age, sex, smoking, dyspnea, wheezing, and chest pain. A decision tree model forecasts referral decisions based on patients' previous examination results, and heuristic and probabilistic algorithms are used to help less experienced physicians make swift, accurate treatment decisions, particularly in urgent situations where initial diagnosis is increasingly supported by information and computing technology.
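The following toy sketch shows how a decision tree over the listed symptoms can be induced and printed as readable referral rules; the handful of records and the referral labels are invented purely for illustration and are not the study's data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

features = ["age", "sex", "smoking", "dyspnea", "wheezing", "chest_pain"]
# Invented records: [age, sex (1=male), smoking, dyspnea, wheezing, chest pain].
X = [[65, 1, 1, 1, 1, 1],
     [40, 0, 0, 0, 0, 0],
     [58, 1, 1, 0, 1, 0],
     [33, 0, 0, 1, 0, 0],
     [72, 1, 1, 1, 0, 1],
     [29, 0, 0, 0, 0, 1]]
y = [1, 0, 1, 0, 1, 0]                        # 1 = refer for further examination

tree = DecisionTreeClassifier(max_depth=3, criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=features))   # human-readable rule set
```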
Article
Full-text available
The Meta-DENDRAL program is described in general terms intended to clarify its similarities to, and differences from, other learning programs. Its approach of model-directed heuristic search through a complex space of possible rules appears well suited to many induction tasks. The use of a strong model of the domain to direct the rule search has been demonstrated for rule formation in two areas of chemistry. The high performance of programs which use the generated rules attests to the success of this learning strategy.
Book
Full-text available
This book reflects the expansion of machine learning research through presentation of recent advances in the field. The book provides an account of current research directions. Major topics covered include the following: learning concepts and rules from examples; cognitive aspects of learning; learning by analogy; learning by observation and discovery; and an exploration of general aspects of learning.
Thesis
A "structured induction" technique was developed and tested using a rules- from -examples generator together with a chess -specific application package. A drawback of past experience with computer induction, reviewed in this thesis, has been the generation of machine -oriented rules opaque to the user. By use of the structured approach humanly understandable rules were synthesized from expert supplied examples. These rules correctly performed chess endgame classifications of sufficient complexity to be regarded as difficult by international master standard players. Using the "Interactive ID3" induction tools developed by the author, chess experts, with a little programming support, were able to generate rules which solve problems considered difficult or impossible by conventional programming techniques. Structured induction and associated programming tools were evaluated using the chess endgames Icing and Pawn vs. King (Black -tomove) and King and Pawn vs. King and Rook (White -to -move, White Pawn on a7) as trial problems of measurable complexity.
Article
Three themes are discussed: (1) The task of acquiring and organizing the knowledge on which to base an expert system is difficult. (2) Inductive inference systems can be used to extract this knowledge from data. (3) The knowledge so obtained is powerful enough to enable systems using it to compete handily with more conventional algorithm-based systems. These themes are explored in the context of attempts to construct high-performance programs relevant to the chess endgame king-rook versus king-knight.
Chapter
This chapter discusses the objective and concept of machine learning. The study and computer modeling of learning processes in their multiple manifestations constitutes the subject matter of machine learning. At present, the field is organized around three primary research foci: (1) task-oriented studies, the development and analysis of learning systems to improve performance in a predetermined set of tasks (also known as the engineering approach); (2) cognitive simulation, the investigation and computer simulation of human learning processes; and (3) theoretical analysis, the theoretical exploration of the space of possible learning methods and algorithms independent of application domain. An equally basic scientific objective of machine learning is the exploration of alternative learning mechanisms, including the discovery of different induction algorithms, the scope and limitations of certain methods, the information that must be available to the learner, the issue of coping with imperfect training data, and the creation of general techniques applicable in many task domains.
Chapter
A series of experiments dealing with the discovery of efficient classification procedures from large numbers of examples is described, with a case study from the chess end game king-rook versus king-knight. After an outline of the inductive inference machinery used, the paper reports on trials leading to correct and very fast attribute-based rules for the relations lost 2-ply and lost 3-ply. On another tack, a model of the performance of an idealized induction system is developed and its somewhat surprising predictions compared with observed results. The paper ends with a description of preliminary work on the automatic specification of relevant attributes.
Article
The determination of pattern recognition rules is viewed as a problem of inductive inference, guided by generalization rules, which control the generalization process, and problem knowledge rules, which represent the underlying semantics relevant to the recognition problem under consideration. The paper formulates the theoretical framework and a method for inferring general and optimal (according to certain criteria) descriptions of object classes from examples of classification or partial descriptions. An experimental computer implementation of the method is briefly described and illustrated by an example.
Article
A new signature-table technique is described together with an improved book-learning procedure which is thought to be much superior to the linear polynomial method. Full use is made of the so-called "alpha-beta" pruning and several forms of forward pruning to restrict the spread of the move tree and to permit the program to look ahead to a much greater depth than it otherwise could do. While still unable to outplay checker masters, the program's playing ability has been greatly improved.
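For readers unfamiliar with the pruning technique mentioned above, the sketch below is a generic alpha-beta search over an abstract game tree. It illustrates the cutoff idea only; it is not Samuel's checkers program, and the children and evaluate callbacks are assumptions supplied by the caller.

```python
def alphabeta(node, depth, alpha, beta, maximizing, children, evaluate):
    """Generic alpha-beta pruning.
    children(node) -> list of successor positions; evaluate(node) -> score."""
    succ = children(node)
    if depth == 0 or not succ:
        return evaluate(node)
    if maximizing:
        value = float("-inf")
        for child in succ:
            value = max(value, alphabeta(child, depth - 1, alpha, beta,
                                         False, children, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:          # beta cutoff: opponent avoids this branch
                break
        return value
    value = float("inf")
    for child in succ:
        value = min(value, alphabeta(child, depth - 1, alpha, beta,
                                     True, children, evaluate))
        beta = min(beta, value)
        if beta <= alpha:              # alpha cutoff
            break
    return value
```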
Article
BACON.4 is a production system that discovers empirical laws. The program represents information at varying levels of description, with higher levels summarizing the levels below them. BACON.4 employs a small set of data-driven heuristics to detect regularities in numeric and nominal data. These heuristics note constancies and trends, causing BACON.4 to formulate hypotheses, to define theoretical terms, and to postulate intrinsic properties. The introduction of intrinsic properties plays an important role in BACON.4’s rediscovery of Ohm’s law for electric circuits and Archimedes’ law of displacement. When augmented with a heuristic for noting common divisors, the system is able to replicate a number of early chemical discoveries, arriving at Proust’s law of definite proportions, Gay-Lussac’s law of combining volumes, Cannizzaro’s determination of the relative atomic weights, and Prout’s hypothesis. The BACON.4 heuristics, including the new technique for finding common divisors, appear to be general mechanisms applicable to discovery in diverse domains.
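The spirit of BACON.4's data-driven heuristics can be conveyed with a toy example: look for a constant, and if none is found, propose a ratio or product of the variables as a new theoretical term. The thresholds, variable names, and Ohm's-law data below are assumptions chosen for illustration, not BACON.4's actual implementation.

```python
import statistics

def nearly_constant(values, tol=0.01):
    """Treat a series as constant if its spread is tiny relative to its mean."""
    return statistics.pstdev(values) <= tol * abs(statistics.mean(values))

def bacon_step(name_x, xs, name_y, ys):
    """One BACON-style pass: note a constancy, else define a ratio or product term."""
    if nearly_constant(ys):
        return f"{name_y} is constant"
    ratios = [y / x for x, y in zip(xs, ys)]
    if nearly_constant(ratios):
        return f"{name_y}/{name_x} is constant (new theoretical term)"
    products = [y * x for x, y in zip(xs, ys)]
    if nearly_constant(products):
        return f"{name_y}*{name_x} is constant (new theoretical term)"
    return "no regularity found at this level"

current = [0.5, 1.0, 1.5, 2.0]
voltage = [1.1, 2.2, 3.3, 4.4]          # data obey V = I * R with R = 2.2
print(bacon_step("I", current, "V", voltage))   # -> V/I is constant (new theoretical term)
```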
Article
An important form of learning from observation is constructing a classification of given objects or situations. Traditional techniques for this purpose, developed in cluster analysis and numerical taxonomy, are often inadequate because they arrange objects into classes solely on the basis of a numerical measure of object similarity. Such a measure is a function only of compared objects and does not take into consideration any global properties or concepts characterizing object classes. Consequently, the obtained classes may have no simple conceptual description and may be difficult to interpret. The above limitation is overcome by an approach called conceptual clustering, in which a configuration of objects forms a class only if it is describable by a concept from a predefined concept class. This chapter gives a tutorial overview of conjunctive conceptual clustering, in which the predefined concept class consists of conjunctive statements involving relations on selected object attributes. The presented method arranges objects into a hierarchy of classes closely circumscribed by such conjunctive descriptions. Descriptions stemming from each node are logically disjoint, satisfy given background knowledge, and optimize a certain global criterion. The method is illustrated by an example in which the conjunctive conceptual clustering program CLUSTER/2 constructed a classification hierarchy of a large collection of Spanish folk songs. The conclusion suggests some extensions of the method and topics for further research.
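To make the distinction concrete, the toy check below tests whether a candidate group of objects is covered by a simple conjunctive description over its attributes; the attribute names and objects are invented, and CLUSTER/2 itself is far more elaborate, optimizing a global criterion over a whole hierarchy of classes.

```python
def conjunctive_cover(objects, description):
    """A candidate class counts as 'conceptual' here only if every member
    satisfies the conjunctive description: attribute -> set of allowed values."""
    return all(obj[attr] in allowed
               for obj in objects
               for attr, allowed in description.items())

# Hypothetical folk-song records described by nominal attributes.
songs = [{"mode": "major", "meter": "3/4"}, {"mode": "major", "meter": "3/4"}]
print(conjunctive_cover(songs, {"mode": {"major"}, "meter": {"3/4"}}))   # True
print(conjunctive_cover(songs, {"mode": {"minor"}}))                     # False
```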
Article
The connection between the simplicity of scientific theories and the credence attributed to their predictions seems to permeate the practice of scientific discovery. When a scientist succeeds in explaining a set of n observations using a model M of complexity c, then it is generally believed that the likelihood of finding another explanatory model with similar complexity but leading to opposite predictions decreases with increasing n and decreasing c. This paper derives formal relationships between n, c and the probability of ambiguous predictions by examining three modeling languages under binary classification tasks: perceptrons, Boolean formulae, and Boolean networks. Bounds are also derived for the probability of error associated with the policy of accepting only models of complexity not exceeding c. Human tendency to regard the simpler as the more trustworthy is given a qualified justification.
Article
This book attempts to treat its topic, concept learning, from a "problem-oriented" point of view. Concept learning is an important part of the organization of knowledge and is therefore worth treating in its own right, not solely as a topic in logic, a type of behavior to be derived from psychological theory, or a possible area of application for electronic computers. An attempt has been made to bring together some of the relevant material from all these fields. To keep the book within a reasonable size, it was necessary to exercise considerable selection in including theoretical points of view and reports of particular research. Inevitably I had to use my own judgment, so I had best state my own biases. I originally became interested in concept learning as a topic in psychology; somewhat later I became interested in the application of digital computer programs to inductive reasoning problems. My knowledge of symbolic logic is largely self-acquired, and I can only hope that I have made an adequate presentation of the role of concepts, as conceived by some philosophers, in formal logic. With these limitations in mind, I hope that this report will be useful in correlating the efforts of many researchers who have approached the same topic in diverse ways.
Article
The determination of pattern recognition rules is viewed as a problem of inductive inference, guided by generalization rules, which control the generalization process, and problem knowledge rules, which represent the underlying semantics relevant to the recognition problem under consideration. The paper formulates the theoretical framework and a method for inferring general and optimal (according to certain criteria) descriptions of object classes from examples of classification or partial descriptions. The language for expressing the class descriptions and the guidance rules is an extension of the first-order predicate calculus, called variable-valued logic calculus VL21. VL21 involves typed variables and contains several new operators especially suited for conducting inductive inference, such as selector, internal disjunction, internal conjunction, exception, and generalization. Important aspects of the theory include: 1) a formulation of several kinds of generalization rules; 2) an ability to uniformly and adequately handle descriptors (i.e., variables, functions, and predicates) of different type (nominal, linear, and structured) and of different arity (i.e., different number of arguments); 3) an ability to generate new descriptors, which are derived from the initial descriptors through a rule-based system (i.e., an ability to conduct the so called constructive induction); 4) an ability to use the semantics underlying the problem under consideration. An experimental computer implementation of the method is briefly described and illustrated by an example.
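As one concrete instance of a generalization rule of the kind discussed, the snippet below applies the classic dropping-condition rule: a conjunctive description is generalized by removing one selector at a time as long as it still covers the positive examples and excludes the negatives. The attribute names and examples are invented for illustration and are not taken from the paper.

```python
def covers(description, example):
    """True if the example satisfies every selector (attribute -> allowed values)."""
    return all(example.get(attr) in allowed for attr, allowed in description.items())

def drop_conditions(description, positives, negatives):
    """Dropping-condition generalization: discard selectors while the description
    still covers all positive examples and no negative examples."""
    desc = dict(description)
    for attr in list(desc):
        trial = {a: v for a, v in desc.items() if a != attr}
        if all(covers(trial, p) for p in positives) and \
           not any(covers(trial, n) for n in negatives):
            desc = trial
    return desc

pos = [{"shape": "round", "size": "small", "color": "red"},
       {"shape": "round", "size": "large", "color": "red"}]
neg = [{"shape": "square", "size": "small", "color": "red"}]
start = {"shape": {"round"}, "size": {"small", "large"}, "color": {"red"}}
print(drop_conditions(start, pos, neg))   # keeps only the 'shape' selector
```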
Conference Paper
This paper proposes an alternative to Quinlan's algorithm for forming classification trees from large sets of examples. My algorithm is guaranteed to terminate; Quinlan's algorithm is usually faster.
Conference Paper
This paper investigates the applicability to a shape-recognition problem of a concept learning algorithm which generates decision rules from examples. A comprehensive analysis of this algorithm applied to an industrial vision problem is described. The problem has no obvious 'best' solution, and much effort has been devoted to a realistic appraisal of the algorithm through a detailed set of comparisons with appropriate alternative classifiers. Results presented show the algorithm to be comparable in performance with the alternative classifiers but superior in terms of both the cost of making a classification and the intelligibility of the solution. 1. Introduction. A nonparametric (or "distribution-free") solution to a pattern recognition problem is one designed only from information contained in representative examples of the pattern classes. This approach is popular in practical problems where detailed statistical information is rarely available. A promising nonparametric solution to classification problems is to produce decision rules which can be expressed in the form of decision trees. Some algorithms for generating such decision trees from examples have been suggested (e.g., Henrichon & Fu 69, Sethi & Sarvarayudu 82) but have concentrated on classification using only numerical measurements of the patterns. A similar limitation
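For context, the sketch below shows the entropy-based attribute selection that underlies this family of rules-from-examples algorithms; the shape attributes, example objects, and class labels are invented for illustration and do not come from the industrial vision study.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain of splitting on one discretized attribute (examples are dicts)."""
    base = entropy(labels)
    by_value = {}
    for ex, lab in zip(examples, labels):
        by_value.setdefault(ex[attribute], []).append(lab)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return base - remainder

examples = [{"elongation": "high", "holes": 0}, {"elongation": "low", "holes": 2},
            {"elongation": "high", "holes": 0}, {"elongation": "low", "holes": 1}]
labels = ["bracket", "washer", "bracket", "washer"]
best = max(["elongation", "holes"], key=lambda a: information_gain(examples, labels, a))
print(best)   # attribute with the highest information gain
```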
Conference Paper
A scheme for allowing a problem-solver to improve its performance with experience is outlined. A more complete definition of the scheme for a particular problem-solving program is given. Some results showing the effectiveness of the scheme are reported.
Article
Machine learning involves the modification or creation by program of stored information structures, so that machine-deliverable information becomes more accurate, larger in amount, or cheaper or faster to deliver. A further desideratum, concerned with intelligibility to the user, is reviewed in the light of recent work on computer induction.
  • Hunt, E.B., Marin, J., & Stone, P.J. (1966). Experiments in induction. New York: Academic Press.
  • Kononenko, I., Bratko, I., & Roskar, E. (1984). Experiments in automatic learning of medical diagnostic rules (Technical report). Jozef Stefan Institute, Ljubljana, Yugoslavia.
  • Langley, P., Bradshaw, G.L., & Simon, H.A. (1983). Rediscovering chemistry with the BACON system. In R.S. Michalski, J.G. Carbonell, & T.M. Mitchell (Eds.), Machine learning: An artificial intelligence approach.
  • Dechter, R., & Michie, D. (1985). Structured induction of plans and programs (Technical report). IBM Scientific Center, Los Angeles, CA.
  • Catlett, J. (1985). Induction using the Shafer representation (Technical report). Basser Department of Computer Science, University of Sydney, Australia.
  • Pearl, J. (1978a). Entropy, information and rational decisions (Technical report). Cognitive Systems Laboratory, University of California, Los Angeles.
  • Feigenbaum, E.A. Expert systems in the 1980s.
  • Quinlan, J.R. The effect of noise on concept learning. In R.S. Michalski, J.G. Carbonell, & T.M. Mitchell (Eds.), Machine learning: An artificial intelligence approach.
  • Quinlan, J.R. Learning from noisy data.
  • Michalski, R.S., & Stepp, R.E. Machine learning: An artificial intelligence approach.
  • Patterson, A., & Niblett, T. ACLS user manual. Glasgow: Intelligent Terminals Ltd.
  • Michie, D. Current developments in Artificial Intelligence and Expert Systems. In International Handbook of Information Technology and Automated Office Systems.