Classification And Regression Trees
Abstract
The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.
... Throughout the 1970s, Breiman [86], Friedman [87], and Quinlan [88] independently proposed similar algorithms for the induction of tree-based models. A decision tree is a type of supervised learning algorithm with advantages such as handling heterogeneous data, robustness to outliers and to noise due to feature selection, and applicability to both classification and regression tasks [89,90,91,92,93]. Classification and Regression Trees (CART), Iterative Dichotomiser 3 (ID3), and C4.5 are examples of decision tree algorithms. ...
... We then need to define a measure i(t) of the node's impurity as a non-negative function of the class proportion p, with q = 1 − p. Let us define an impurity measure i(t) in general terms, using the framework of Breiman in 1984 [89], as a function that assesses the goodness of any node t. In CART, Breiman et al. [89] identify a class of impurity functions φ(·) that must possess the following characteristics [93]: 1. ...
... Let us define an impurity measure i(t) in general terms, using the framework of Breiman in 1984 [89], as a function that assesses the goodness of any node t. In CART, Breiman et al. [89] identify a class of impurity functions φ(·) that must possess the following characteristics [93]: 1. It should reach its highest value when the distribution is uniform, meaning that all class proportions are equal. ...
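For readers who want the concrete form behind these requirements, two standard impurity functions from Breiman et al. (1984) that satisfy them — stated here for reference rather than quoted from the excerpt — are the Gini index and the entropy applied to the class proportions p(j|t) of node t:

```latex
% Node impurity written as i(t) = \phi\bigl(p(1\mid t),\dots,p(J\mid t)\bigr) for J classes.
% Two classical choices of \phi that satisfy the properties listed above:
i_{\text{Gini}}(t)    = \sum_{j=1}^{J} p(j\mid t)\,\bigl(1 - p(j\mid t)\bigr) = 1 - \sum_{j=1}^{J} p(j\mid t)^{2},
\qquad
i_{\text{entropy}}(t) = -\sum_{j=1}^{J} p(j\mid t)\,\log p(j\mid t).
```

Both are maximal at the uniform distribution p(j|t) = 1/J and equal zero whenever a single class has probability one, matching the characteristics listed above.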
Time-of-Flight Secondary Ion Mass Spectrometry (ToF-SIMS) imaging is a potent analytical tool that provides spatially resolved chemical information of surfaces at the microscale. However, the hyperspectral nature of ToF-SIMS datasets can make them challenging to analyze and interpret. Both supervised and unsupervised Machine Learning (ML) approaches are increasingly useful to help analyze ToF-SIMS data. Random Forest (RF) has emerged as a robust and powerful algorithm for processing mass spectrometry data. This machine learning approach offers several advantages, including accommodating non-linear relationships, robustness to outliers in the data, managing high-dimensional feature spaces, and mitigating the risk of overfitting. The application of RF to ToF-SIMS imaging facilitates the classification of complex chemical compositions and the identification of features contributing to these classifications. This tutorial aims to assist non-experts in either machine learning or ToF-SIMS in applying Random Forest to complex ToF-SIMS datasets.
... Unlike traditional classifiers, Random Forest generates several classifiers and aggregates their results [89][90][91]. Each node is split using the best score among a random subset of the predictors, generating an ensemble of decision trees from the training data and features and aggregating the individual trees to make a final classification decision [89,92,93]. ...
... Random Forest aggregates values using boosting and bagging of classification trees [89,92]. The random selection of features, introduced by Ho [90,93] and Amit and Geman [94], constructs decision trees with controlled variation. ...
... The training data, created in Microsoft Excel, comprised 400 rows for each landscape characteristic. Variables representing post-mining landscape characteristics were recorded in columns, and cumulative and percentage ...
Post-mining landscapes are multifaceted, comprising multiple characteristics, more so in big metropolitan regions such as Gauteng, South Africa. This paper evaluates the efficacy of Fuzzy overlay and Random Forest classification for integrating and representing post-mining landscapes and how this influences the perception of these landscapes. To this end, this paper uses GISs, MCDA, Fuzzy overlay, and Random Forest classification models to integrate post-mining landscape characteristics derived from the literature. It assesses the results using an accuracy assessment, area statistics, and correlation analysis. The findings from this study indicate that both Fuzzy overlay and Random Forest classification are applicable for integrating multiple landscape characteristics at varying degrees. The resultant maps show some similarity in highlighting mine waste cutting across the province. However, the Fuzzy overlay map has higher accuracy and extends over a larger footprint owing to the model’s use of a range of 0 to 1. This shows both areas of low and high memberships, as well as partial membership as intermediate values. This model also demonstrates strong relationships with regions characterised by landscape transformation and waste and weak relationships with areas of economic decline and inaccessibility. In contrast, the Random Forest classification model, though also useful for classification purposes, presents a lower accuracy score and smaller footprint. Moreover, it uses discrete values and does not highlight some areas of interaction between landscape characteristics. The Fuzzy overlay model was found to be more favourable for integrating post-mining landscape characteristics in this study as it captures the nuances in the composition of this landscape. These findings highlight the importance of mapping methods such as Fuzzy overlay for an integrated representation and shaping the perception and understanding of the locality and extent of complex landscapes such as post-mining landscapes. Methods such as Fuzzy overlay can support research, planning, and decision-making by providing a nuanced representation of how multiple landscape characteristics are integrated and interact in space and how this influences public perception and policy outcomes.
... The CART was used to evaluate the determinants of UPF consumption. The CART is a method that divides the data into segments that are as homogeneous as possible relative to the outcome variable (percentage of energy participation of UPF in the individual's diet) (15,16) . A homogeneous node is considered one in which all cases have the same value for the outcome, therefore being a terminal node (16) . ...
... The CART is a method that divides the data into segments that are as homogeneous as possible relative to the outcome variable (percentage of energy participation of UPF in the individual's diet) (15,16) . A homogeneous node is considered one in which all cases have the same value for the outcome, therefore being a terminal node (16) . ...
... The algorithms usually used to build trees work from top to bottom by grouping independent variables, which allows complex interactions to be established between variables and the outcome without prior specification. Also, the CART algorithm itself determines the ideal cut-off point for identifying risk or protection groups through interaction with one or more variables (16) . ...
This article aims to evaluate the sociodemographic determinants of ultra-processed foods (UPF) consumption in the Brazilian population ≥ 10 years of age. The study used data from the personal and resident food consumption module of the Family Budget Surveys, grouping foods according to the NOVA classification of food processing. The classification and regression tree (CART) was used to identify the factors determining the lowest to highest percentage participation of UPF in the Brazilian population. UPF accounted for 37·0 % of energy content in 2017–2018. In the end, eight nodes of UPF consumption were identified, with household situation, education in years, age in years and per capita family income being the determining factors identified in the CART. The lowest consumption of UPF occurred among individuals living in rural areas with less than 4 years of education (23·78 %), while the highest consumption occurred among individuals living in urban areas, < 30 years of age and with per capita income ≥ US$257 (46·27 %). The determining factors identified in CART expose the diverse pattern of UPF consumption in the Brazilian population, especially conditions directly associated with access to these products, such as penetration in urban/rural regions. Through the results of this study, it may be possible to identify focal points for action in policies and actions to mitigate UPF consumption.
... Currently, the Gini index [51] is the preferred method for measuring impurity; it represents the probability of a randomly chosen element being incorrectly classified, so that a value of zero means a completely pure partition. The Classification And Regression Trees (CART) algorithm by Leo Breiman also used pruning, a process that reduces the tree size to avoid overfitting [52,53]. It is still largely used in imaging analysis due to its intuitiveness and ease of use, e.g., it could classify tumor histology from image descriptions in MRI [54]. ...
Artificial intelligence (AI), the wide spectrum of technologies aiming to give machines or computers the ability to perform human-like cognitive functions, began in the 1940s with the first abstract models of intelligent machines. Soon after, in the 1950s and 1960s, machine learning algorithms such as neural networks and decision trees ignited significant enthusiasm. More recent advancements include the refinement of learning algorithms, the development of convolutional neural networks to efficiently analyze images, and methods to synthesize new images. This renewed enthusiasm was also due to the increase in computational power with graphical processing units and the availability of large digital databases to be mined by neural networks. AI soon began to be applied in medicine, first through expert systems designed to support the clinician’s decision and later with neural networks for the detection, classification, or segmentation of malignant lesions in medical images. A recent prospective clinical trial demonstrated the non-inferiority of AI alone compared with a double reading by two radiologists on screening mammography. Natural language processing, recurrent neural networks, transformers, and generative models have both improved the capabilities of making an automated reading of medical images and moved AI to new domains, including the text analysis of electronic health records, image self-labeling, and self-reporting. The availability of open-source and free libraries, as well as powerful computing resources, has greatly facilitated the adoption of deep learning by researchers and clinicians. Key concerns surrounding AI in healthcare include the need for clinical trials to demonstrate efficacy, the perception of AI tools as ‘black boxes’ that require greater interpretability and explainability, and ethical issues related to ensuring fairness and trustworthiness in AI systems. Thanks to its versatility and impressive results, AI is one of the most promising resources for frontier research and applications in medicine, in particular for oncological applications.
... Analysis of this plot makes it possible to identify systematic deviations, evaluate prediction accuracy across different value ranges, and detect possible outliers. In addition, the scatter provides insight into the model's robustness in handling non-linear variation in the data, a distinctive characteristic of decision trees (Breiman et al., 1984; Hastie, Tibshirani, and Friedman, 2009). ...
... According to Montgomery, Peck, and Vining (2012), this model is particularly well suited to situations in which there is a clear linear relationship between the variables, providing a simple yet effective approach to forecasting financial outcomes. In addition, Angrist and Pischke (2009) emphasize that linear regression models are fundamental in econometrics, where they serve as computational tools for estimating differences between treated and control groups, with or without covariates. This method is crucial for evaluating interventions and measuring their impacts, offering precise control over the factors that may influence the results. Decision Trees, in turn, are machine learning techniques that stand out for their ability to partition data into homogeneous subsets, creating a hierarchical structure that facilitates decision-making. According to Breiman et al. (1984), decision trees are especially useful when there are complex, non-linear relationships between variables, as is often the case in predicting debt default. This method makes it possible to identify hidden patterns in the data and generate more detailed and accurate predictions. Decision Trees are widely recognized for their ease of interpretation and their applicability in many areas. ...
The study analyzed focuses on the application of predictive models, specifically Linear Regression and Decision Trees, for the management of defaulted debts in the public context of the United States. The main objective of the work is to compare the effectiveness of these models in predicting the compliance of debts with more than 120 days, assisting in directing these debts to the Treasury Offset Program (TOP), an essential initiative for the government's financial recovery. The problem that the study addresses is the need for effective management of defaulted public debts, seeking to ensure compliance with public financial policies that promote compliance and the adequate redirection of financial resources to the government. This is particularly important to ensure fiscal transparency and accountability of federal agencies. The methodology used in the study was quantitative, based on the analysis of data on eligible debts extracted from reports of the US Treasury. Linear Regression and Decision Tree models were applied, with performance metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and Coefficient of Determination (R²). The study addressed financial and temporal variables to analyze the behavior of these debts and their compliance. The main results show that both models presented high accuracy in predictions, with Linear Regression showing a perfect fit (R² = 1) and Decision Trees standing out in capturing non-linear nuances of the data. The variable "Compliance Rate Amount" was identified as the most significant in the Decision Tree model, suggesting that the amount of the compliance rate is one of the most important factors in predicting the compliance of defaulted debts. This study offers valuable contributions to the field of public management, by demonstrating that the use of predictive models can help optimize debt recovery, improve fiscal transparency and contribute to more informed decision-making.
... CART-LC [1] uses a deterministic hill-climbing algorithm, which may lead to a local minimum. DT-SE [7], which uses soft entropy as the loss function, faces a similar issue. ...
... The question to be asked here is whether replacing the subtree T_t (originating from node t) with a terminal node t significantly decreases the training error. For example, CART [1] uses α to measure the strength of a split, which is defined as ...
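The excerpt breaks off before the definition. For reference (this is the standard CART cost-complexity formulation rather than text from the citing paper), the quantity attached to an internal node t during pruning is

```latex
% R(t): resubstitution (training) error of node t treated as a single leaf
% R(T_t): resubstitution error of the subtree T_t rooted at t
% |\tilde{T}_t|: number of terminal nodes of T_t
g(t) = \frac{R(t) - R(T_t)}{\lvert \tilde{T}_t \rvert - 1}.
```

Pruning repeatedly removes the branch whose root attains the smallest g(t) (the "weakest link"); the values of α at which branches disappear are exactly these g(t), yielding a nested sequence of subtrees indexed by α.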
Traditional decision trees are limited by axis-orthogonal splits, which can perform poorly when true decision boundaries are oblique. While oblique decision tree methods address this limitation, they often face high computational costs, difficulties with multi-class classification, and a lack of effective feature selection. In this paper, we introduce LDATree and FoLDTree, two novel frameworks that integrate Uncorrelated Linear Discriminant Analysis (ULDA) and Forward ULDA into a decision tree structure. These methods enable efficient oblique splits, handle missing values, support feature selection, and provide both class labels and probabilities as model outputs. Through evaluations on simulated and real-world datasets, LDATree and FoLDTree consistently outperform axis-orthogonal and other oblique decision tree methods, achieving accuracy levels comparable to the random forest. The results highlight the potential of these frameworks as robust alternatives to traditional single-tree methods.
... We used a classification tree (Breiman et al. 1984) to examine the interaction between classes of family support and communication program participation. Although interactions between explanatory variables are usually examined with regression models, this was infeasible in the present study because of the small number of participants who had appeared in court for an offence (n=31, 6%). ...
... Classification trees are models represented in a branching diagram. In the present study, we used the CART algorithm (Breiman et al. 1984). Classification trees split the data into different profiles, based on the explanatory variables, to predict the dependent variable. ...
In this report, we investigate the effects of the Pathways to Prevention Project on the onset of youth offending. We find persuasive evidence for the impact of an enriched preschool program, the communication program, in reducing by more than 50 percent the number of young people becoming involved in court-adjudicated youth crime by age 17. We find equally strong evidence that comprehensive family support increased the efficacy and sense of empowerment of parents receiving family support. No children offended in the communication program if their parents also received family support, but family support on its own did not reduce youth crime. The rate of youth offending between 2008 and 2016 in the Pathways region was at least 20 percent lower than in other Queensland regions at the same low socio-economic level, consistent with (but not proving) the hypothesis that the Pathways Project reduced youth crime at the aggregate community level.
... the propensity to change lifestyle (the synthetic quantitative variable Y_S), and the independent variables, relating to the state of the respondents' energy economy and the main motives for energy efficiency measures, as well as the control variables characterizing the household, the regression tree method was proposed. Its use in discriminant and regression analysis was presented by Breiman et al. [38]. This method (belonging to the group of non-parametric methods for building discriminant and regression models) is used to predict the value of the dependent variable measured on a ratio or interval scale. ...
... For this purpose, the direct stopping rule FACT (Fast Algorithm for Classification Trees) was used, with the given fraction of objects set at 5% of the tested community. The calculations were conducted using the Statistica software (version 13) implementation of the CART method introduced by Breiman and his colleagues from Berkeley [38]. ...
Background: The implementation of the EU climate and energy policy, along with changes in the legal environment, has led to a significant increase in energy prices in Poland. Consequently, energy expenditures are now a larger part of household budgets. These rising energy costs and the evolving legal landscape are compelling households to invest in energy-saving solutions and modify their energy consumption habits. This article aims to identify the activities of households in Poland regarding the rationalization of energy expenditures. It formulates the following research hypothesis: households invest in energy-saving appliances to rationalize energy expenditures and/or change their behaviors to reduce energy consumption. Methods: The paper is based on primary research conducted using an online questionnaire survey on a sample of 331 respondents in Poland in March and April 2023. Results: A classification tree algorithm was used to identify the level of investment activities and behavioral changes made by households to reduce energy expenditures. The authors found that low-income households and people who fear further energy price increases are the first to change their behaviors to more energy-efficient ones. Medium- and high-income households take investment measures. They replace household appliances with more energy-efficient ones and install heat pumps and photovoltaic panels. These investments are motivated by responsible consumption, environmental protection, cleanliness, and the ease of use of the appliances.
... Machine learning algorithms, especially tree ensembles like RF, have shown good performance for stroke upper extremity function prognosis [11], [12], [13], [17]. Tree ensembles combine multiple decision trees, a type of machine learning algorithm [18], by averaging their outputs to reduce overtraining and improve predictive performance, although their interpretability is complex [19]. Regression tree ensembles could be a feasible technique for the estimation of stroke upper extremity motor function by allowing the combination of physiological information acquired from different sources. ...
... The frequency of variables within the first nodes of the regression tree ensembles was computed to determine variables' contribution to the models' estimation. A higher frequency of predictor variables within the first nodes of regression trees is related to a higher association of the predictor variable with the predicted variable [18]. In addition, the probability that predictor variables were related to a higher or a lower score of the clinical assessments was also calculated. ...
Accurate diagnosis of upper extremity motor function in stroke patients is important for effective rehabilitation. However, the approach to correctly perform clinical assessments is still a matter of discussion and requires both trained personnel and specialized materials, thus limiting the availability of stroke upper extremity diagnosis. Computer-aided methods have been scarcely reported for stroke upper extremity motor function estimation and could support personnel training and clinical decision-making. For these reasons, in the present study linear regression and regression tree ensembles were applied to estimate upper extremity assessments’ scores using neurophysiological measurements, including electroencephalography (EEG) and transcranial magnetic stimulation (TMS). A database was used to evaluate these approaches and comprised measurements of upper extremity sensorimotor and functional performance of stroke patients assessed with the Fugl-Meyer Assessment for the Upper Extremity (FMA-UE) and the Action Research Arm Test (ARAT). Regression tree ensembles outperformed linear models, estimating 66.7% of the FMA-UE scores and 70% of the ARAT scores with errors below the minimal clinically important difference. The median absolute errors were 3.5 points for the FMA-UE and 1.8 points for the ARAT, within clinically acceptable ranges. Variables that were associated with a higher upper extremity function measured with FMA-UE and ARAT were a higher corticospinal integrity in patients’ affected hemisphere, lower interhemispheric functional connectivity in the central region of the cortex during hand motor intention, and higher alpha activation in the central and lower activation in the parietal regions of the cortex during hand motor intention. Limitations of the study considered, the performance of the proposed approach implied that computer-aided estimation of upper extremity motor function is feasible using physiological information and nonlinear models. These models could be used to create expert systems that support clinical personnel training and decision making regarding upper extremity assessment in stroke.
... Fig. 10(a) outlines the permutation-based feature importance for the XGBoost model with 28 features. The permutation feature importance method was initially proposed by Breiman et al. [41,50] and is a model-agnostic explainability method based on decoupling the relationship established between a feature column and its outcome label using a random shuffling of the entire feature column. This process is carried out individually for each feature column by monitoring the variation of a certain control metric with respect to its baseline value without shuffling. ...
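To make the shuffling procedure described above concrete, here is a minimal generic sketch (not the authors' XGBoost pipeline; the fitted model, feature matrix, and the choice of accuracy as the control metric are placeholder assumptions):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def permutation_importance_manual(model, X, y, n_repeats=10, seed=0):
    """Permutation importance: per-column drop in accuracy after shuffling.

    `model` is any fitted classifier with .predict(); X is a 2-D NumPy array.
    """
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))  # control metric without shuffling
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the column-label relationship
            drops.append(baseline - accuracy_score(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)  # average degradation = importance of column j
    return importances
```

scikit-learn ships the same idea as sklearn.inspection.permutation_importance, which additionally reports the spread over repeats.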
Among the key benefits of using structural timber is its potential for reuse after being dismantled from an existing building. Recycling and reuse are central concepts in the circular economy. However, the installation and dismantling of structural elements often leave traces from previous use, such as holes from connectors like dowels or screws, internal piping and cabling. Therefore, it is crucial to develop methods to rigorously quantify the reduced load-bearing capacity of recycled beams due to potential holes using efficient and expedited methods akin to visual grading approaches. This work proposes a visual-based method for classifying recycled timber based on the geometric characteristics of the artificial holes. A stochastic mechanics-based numerical model was developed to predict the bending strength reduction of beams with random hole patterns and thus generate an extensive dataset for calibrating data-driven binary classification models. Machine learning and conditional classification models are used to determine if the reduction in bending strength is greater or less than 20%, being the predefined threshold value for reduced strength grading. An experimental campaign on timber beams with specific hole patterns, determined after experimental design, led to the numerical model validation and the calibration of thresholds for the conditional classification model, which relies on a single feature: the sum of the diameters of the holes in two beam regions. The study shows that with an elementary conditional model, high-performance metrics of the binary classification model comparable to machine learning techniques can be achieved. In other words, with a balanced dataset, accuracies over 80% in classifying the level of capacity reduction, greater or lesser than 20%, can be achieved simply by comparing the sum of diameters to a predetermined threshold. This method currently fills a regulatory and methodological gap in safely reusing structural timber.
... 3. We limit the depth of all but the final iterated decision tree layers to two to enhance interpretability. For the final IDT layer, we don't limit the depth but instead perform minimal cost complexity pruning (Breiman et al., 1984). ...
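For orientation, minimal cost-complexity pruning is available off the shelf in scikit-learn; the sketch below (toy dataset and selection rule are illustrative assumptions, not the paper's setup) shows the usual pattern of computing the pruning path and keeping the α that validates best:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Effective alphas of the minimal cost-complexity pruning path on the training data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# One pruned tree per alpha; keep the one that scores best on held-out data.
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
         for a in path.ccp_alphas]
best = max(trees, key=lambda t: t.score(X_te, y_te))
print(best.get_n_leaves(), best.score(X_te, y_te))
```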
We present a logic based interpretable model for learning on graphs and an algorithm to distill this model from a Graph Neural Network (GNN). Recent results have shown connections between the expressivity of GNNs and the two-variable fragment of first-order logic with counting quantifiers (C2). We introduce a decision-tree based model which leverages an extension of C2 to distill interpretable logical classifiers from GNNs. We test our approach on multiple GNN architectures. The distilled models are interpretable, succinct, and attain similar accuracy to the underlying GNN. Furthermore, when the ground truth is expressible in C2, our approach outperforms the GNN.
... Decision trees can be constructed using a variety of algorithms. Classification trees were first made famous by Breiman et al. (1984) and are the most common solution for binary problems. ...
Credit risk is a crucial component of daily financial services operations; it measures the likelihood that a borrower will default on a loan, incurring an economic loss. By analysing historical data for assessment of the creditworthiness of a borrower, lenders can reduce credit risk. Data are vital at the core of the credit decision-making processes. Decision-making depends heavily on accurate, complete data, and failure to harness high-quality data would impact credit lenders when assessing the loan applicants’ risk profiles. In this paper, an empirical comparison of the robustness of seven machine learning algorithms to credit risk, namely support vector machines (SVMs), naïve Bayes, decision trees (DT), random forest (RF), gradient boosting (GB), K-nearest neighbour (K-NN), and logistic regression (LR), is carried out using the Lending Club credit data from Kaggle. This task uses seven performance measures, including the F1 Score (recall, accuracy, and precision), ROC-AUC, and the HL and MCC metrics. Then, the harnessing of generative adversarial network (GAN) simulation to enhance the robustness of the single machine learning classifiers for predicting credit risk is proposed. The results show that when GANs imputation is incorporated, the decision tree is the best-performing classifier with an accuracy rate of 93.01%, followed by random forest (92.92%), gradient boosting (92.33%), support vector machine (90.83%), logistic regression (90.76%), and naïve Bayes (89.29%), respectively. The worst-performing method is k-NN, with an accuracy rate of 88.68%. Subsequently, when GANs are optimised, the accuracy rate of the naïve Bayes classifier improves significantly to 90%. Additionally, the average error rate for these classifiers is over 9%, which implies that the estimates are not far from the actual values. In summary, most individual classifiers are more robust to missing data when GANs are used as an imputation technique. The differences in performance of all seven machine learning algorithms are significant at the 95% level.
... It is often recommended to grow smaller trees with fewer, larger leaves to prevent overfitting. Moreover, many small leaves result in highly flexible regression tree models [51]. ...
In this study, a Machine Learning (ML)-based approach is proposed to enhance the computational efficiency of a particular method that was previously proposed by the authors for passive localization of radar emitters based on multipath exploitation with a single receiver in Electronic Support Measures (ESM) systems. The idea is to utilize a ML model on a dataset consisting of useful features obtained from the priori-known operational environment. To verify the applicability and computational efficiency of the proposed approach, simulations are performed on the pseudo-realistic scenes to create the datasets. Well-known regression ML models are trained and tested on the created datasets. The performance of the proposed approach is then evaluated in terms of localization accuracy and computational speed. Based on the results, it is verified that the proposed approach is computationally efficient and implementable in radar detection applications on the condition that the operational environment is known prior to implementation.
... We identified the crucial variables for predicting BIC transitions to forecast store sales using decision tree algorithms, such as random forest [29] and CART [30]. Decision tree algorithms highlight the essential variables for generating decision rules by employing Gini coefficients or variable importance. ...
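As background for readers unfamiliar with Gini-based variable importance, the following generic scikit-learn sketch (toy data, not the restaurant sales data of the study) shows how tree ensembles expose it:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based (Gini) importances, normalised to sum to 1 across features.
order = np.argsort(forest.feature_importances_)[::-1]
for j in order:
    print(f"feature {j}: importance {forest.feature_importances_[j]:.3f}")
```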
Multivariate time series data can be collected and employed in various fields to predict future data. However, owing to significant uncertainty and noise, controlling the prediction accuracy during practical applications remains challenging. Therefore, this study examines the Bayesian information criterion (BIC) as an evaluation metric for prediction models and analyzes its changes by varying the explanatory variables, variable pairs, and learning and validation periods. Descriptive statistics and decision tree-based algorithms, such as classification and regression tree, random forest, and dynamic time warping, were employed in the analysis. The experimental evaluations were conducted using two types of restaurant data: sales, weather, number of customers, number of views on gourmet site, and day of the week. Based on the experimental results, we compared and discussed the learning behavior based on various explanatory variable combinations. We discovered that 1. the explanatory variable, the number of customers, exhibited a significantly different trend from other variables when dynamic time warping was applied, particularly in combination with other variables, and 2. variables with seasonality yielded the best performance when used independently; otherwise, the predictive accuracy decreased according to the decision tree results. This comparative investigation revealed that the proposed BIC analysis method can be used to effectively identify the optimal combination of explanatory variables for multivariate time series data that exhibit characteristics such as seasonality.
... The package "rpart" in R was used for this purpose. Regression tree analysis is a nonparametric method that recursively splits data into successively smaller groups with binary subdivisions based on a single continuous predictor variable (Breiman et al., 1984). In response, the regression tree generates a tree diagram with branches determined by the subdivision rules and a set of three terminal nodes containing the average yield (or starch content) and the number of observations contained in each terminal node. ...
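The study itself used the rpart package in R; purely as an illustration of the recursive binary splitting it describes, here is a rough scikit-learn analogue with a synthetic predictor and response standing in for the survey variables:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(300, 1))            # hypothetical predictor, e.g. a fertilisation rate
y = 20 + 0.05 * X[:, 0] + rng.normal(0, 2, 300)   # hypothetical response, e.g. yield

# Shallow tree: each terminal node reports the mean response and size of its group.
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=30).fit(X, y)
print(export_text(tree, feature_names=["predictor"]))
```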
Cassava (Manihot esculenta Crantz) was declared the “crop of the 21st century” by the Food and Agriculture Organization of the United Nations due to its high starch content and low input requirements. The management factors that govern yields and starch content in cassava in Brazil are still unclear. The aim of this study was to identify the main factors that limit the yield and starch content of cassava fields in Brazilian Cerrado. The data were collected as part of a survey covering 300 cassava fields in two growing seasons (2020–2021 and 2021–2022). Throughout the development cycle, management practices, yield, and percentage starch content in the roots were described. The database was divided into high and low yield tertiles. Mean comparison tests, regression tree analyses, and boundary functions were applied. The importance of genetics, environment, and associated crop constraints on cassava production (yields and starch content) was assessed. The yield gap in cassava was 44.6 Mg ha⁻¹. The most important factors leading to yield and starch losses were variety, planting date, and potassium fertilization. By adapting optimal practices, it is possible to produce an additional 1.5 million tons of cassava on the current cultivation area in the western Brazilian Cerrado, which corresponds to 8.3% of total production in Brazil and could increase the production of cassava starch by more than 400,000 Mg.
... DT, RF, and XGB are all tree-based (TB) models. DT is a non-parametric supervised learning method capable of deriving decision rules from a series of data characterized by features and labels (Breiman et al., 1984). These rules are presented in a tree-like graphical structure, utilized to address classification and regression problems. ...
Wetland methane (CH4) emissions have a significant impact on the global climate system. However, the current estimation of wetland CH4 emissions at the global scale still has large uncertainties. Here we developed six distinct bottom‐up machine learning (ML) models using in situ CH4 fluxes from both chamber measurements and the Fluxnet‐CH4 network. To reduce uncertainties, we adopted a multi‐model ensemble (MME) approach to estimate CH4 emissions. Precipitation, air temperature, soil properties, wetland types, and climate types are considered in developing the models. The MME is then extrapolated to the global scale to estimate CH4 emissions from 1979 to 2099. We found that the annual wetland CH4 emissions are 146.6 ± 12.2 Tg CH4 yr⁻¹ (1 Tg = 10¹² g) from 1979 to 2022. Future emissions will reach 165.8 ± 11.6, 185.6 ± 15.0, and 193.6 ± 17.2 Tg CH4 yr⁻¹ in the last two decades of the 21st century under SSP126, SSP370, and SSP585 scenarios, respectively. Northern Europe and near‐equatorial areas are the current emission hotspots. To further constrain the quantification uncertainty, research priorities should be directed to comprehensive CH4 measurements and better characterization of spatial dynamics of wetland areas. Our data‐driven ML‐based global wetland CH4 emission products for both the contemporary and the 21st century shall facilitate future global CH4 cycle studies.
... Classification and Regression Trees (CART) is the common name for tree-based algorithms (Breiman et al., 1984). When the outcome variable is categorical, the primary goal of CART is to classify all units in the study using factors that are thought to be predictors; when the outcome variable is continuous, the goal is point estimation (Orrù et al., 2020). ...
Previous researchers have identified socioeconomic status as a significant predictor of achievement/literacy. However, it is important to recognize that the influence of socioeconomic status on literacy may vary at different levels of socioeconomic status. Thus, this study analyzes the relationship between socioeconomic status and literacy scores for all domains in PISA Türkiye data from 2003 to 2022 through the Classification and Regression Trees and linear regression methods. Upon examining the results, separate investigations carried out for the lower and upper socioeconomic status groups indicate that R² values were found to be equal to or greater than .80 in 37 out of the 42 analyses. From 2003 to 2009, the R² values in both groups were considerably high; however, there has been a notable decline in subsequent periods. The year 2009 demonstrated particularly high R² values by ESCS in all domains for both upper and lower groups. Consequently, socioeconomic status exhibited a greater predictive power on literacy scores across all domains in the lower socioeconomic group than in the upper socioeconomic group.
... The independent variables in the logistic regression model can be coded as 0 and 1, indicating the absence or presence of a risk. The model output ranges between 0 and 1 and represents susceptibility to the risks. 3.5.6 K-Nearest Neighbours (KNN): The KNN algorithm belongs to the class of algorithms that can classify an unknown entity when we have data with specific properties (X) and the value of the relationship (Y) (Breiman et al., 2017). It classifies an instance (landslide, gully erosion, or flood) according to the class most represented among its (k) neighbours. ...
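Purely as a generic illustration of the k-nearest-neighbour idea sketched above (synthetic data, not the thesis's geohazard inventory):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Toy binary problem standing in for "susceptible / not susceptible".
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:3]))        # class most represented among the 5 nearest samples
print(knn.predict_proba(X[:3]))  # share of neighbours belonging to each class
```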
This study aims to examine the effectiveness of different statistical and machine learning methods for mapping and predicting susceptibility in the Tensift watershed and the Haouz plain. The frequency ratio (FR) model and the prediction rate (PR) were successfully applied to assess landslide risk in the Tizi N'Tichka region, along the national road (RN9) linking Marrakech to Ouarzazate. This model performed well, with an AUC of 92.30%. In addition, the use of the Analytic Hierarchy Process (AHP), combined with the Revised Universal Soil Loss Equation (RUSLE) method and models from the sixth phase of the Coupled Model Intercomparison Project (CMIP6) in the Haouz plain, made it possible to estimate current and future soil losses for 2040. The results show increases of 24.90% and 50.40% under the RCP2.6 and RCP8.5 scenarios, respectively.
Finally, seven machine learning algorithms, namely random forest (RF), support vector machine (SVM), k-nearest neighbour (KNN), extreme gradient boosting (XGBoost), artificial neural network (ANN), decision tree (DT), and logistic regression (LR), were used to map and predict susceptibility to the three most frequent geohazards in this area: floods, gully erosion, and landslides. This made it possible to produce a multi-hazard map, revealing results that vary with topographic, geomorphological, geological, climatic, and land-use characteristics. The application of these machine-learning-based models shows that XGBoost performs best, with AUCs of 93.78%, 91.07%, and 93.41% for floods, gully erosion, and landslides, respectively.
This doctoral thesis introduces a more precise methodology for assessing susceptibility to several types of hazards. This approach is of great importance for planners and policymakers, as it makes it possible to identify sensitive areas according to the relative impact of each factor specific to a given region. The multi-hazard susceptibility map for the Tensift watershed and the Haouz plain, Morocco, gives local authorities the opportunity to develop tailored mitigation measures and strategies to address the multiple risks that may arise in the future.
... Additionally, TCNs might be more feasible for systems that make predictions based on shorter data sequences, such as those found in small-cell deployments with high UE velocity, where short but frequent handovers occur. Notably, although not presented in this study, we also explored other machine learning methods, such as Decision Trees [45], which can deliver comparable results to TCNs while offering much faster deployment. When faced with challenges in obtaining sufficient and diverse data, the model's behavior might not be reliable in unique scenarios where data is lacking. ...
The handover (HO) procedure is one of the most critical functions in a cellular network, driven by measurements of the user channel of the serving and neighboring cells. The success rate of the entire HO procedure is significantly affected by the preparation stage. As massive Multiple-Input Multiple-Output (MIMO) systems with large antenna arrays allow resolving finer details of channel behavior, we investigate how machine learning can be applied to time series data of beam measurements in the Fifth Generation (5G) New Radio (NR) system to improve the HO procedure. This paper introduces the Early-Scheduled Handover Preparation scheme designed to enhance the robustness and efficiency of the HO procedure, particularly in scenarios involving high mobility and dense small cell deployments. Early-Scheduled Handover Preparation focuses on optimizing the timing of the HO preparation phase by leveraging machine learning techniques to predict the earliest possible trigger points for HO events. We identify a new early trigger for HO preparation and demonstrate how it can beneficially reduce the time required for HO execution, thereby limiting channel quality degradation. These insights enable a new HO preparation scheme that offers novel, user-aware, and proactive HO decision-making in MIMO scenarios incorporating mobility.
... We used a classification tree (Breiman et al. 1984) to examine the interaction between classes of family support, gender, early behavioural risk and communication program participation. Although interactions between explanatory variables are usually examined with regression models, in the present study, the combination of latent class of family support and communication program participation produced some empty cells for the dependent variable, meaning that there were no participants with an offending outcome. ...
This paper investigates the effects on court-adjudicated offending to age 17 of comprehensive, community-based support offered through the Pathways to Prevention Project to families of preschool and primary age children. The sample is 543 children from a disadvantaged region in Brisbane, 192 of whom, at age four in 2002 or 2003, participated in the standard preschool curriculum plus a program designed to strengthen oral language and communication skills, and who transitioned to a local primary school where family support remained available. Family support (involving 41% of families) was associated overall with a heightened risk of offending, reflecting the high level of need in these families, particularly in the later primary years. However, family support combined with the communication program corresponded to a very low offending rate. This suggests that family support should be combined with both high-quality, early-in-life preventive initiatives and with evidence-based child and parent programs in late primary school.
... As discussed in Section 3, we can use S_l to parameterize our tree-growing algorithm, similar to the minimum bucket parameter in standard CART algorithms [Bre84]. ...
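For readers unfamiliar with the minimum bucket parameter, scikit-learn's CART implementation exposes the analogous constraint as min_samples_leaf; the toy sketch below (illustrative dataset and values, not the paper's experiments) shows how enlarging it shrinks the grown tree:

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
for leaf_size in (1, 5, 20):  # minimum number of samples allowed in any leaf
    tree = DecisionTreeClassifier(min_samples_leaf=leaf_size, random_state=0).fit(X, y)
    print(leaf_size, tree.get_n_leaves(), tree.get_depth())
```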
Decision Trees have remained a popular machine learning method for tabular datasets, mainly due to their interpretability. However, they lack the expressiveness needed to handle highly nonlinear or unstructured datasets. Motivated by recent advances in tree-based machine learning (ML) techniques and first-order optimization methods, we introduce Generalized Soft Trees (GSTs), which extend soft decision trees (STs) and are capable of processing images directly. We demonstrate their advantages with respect to tractability, performance, and interpretability. We develop a tractable approach to growing GSTs, given by the DeepTree algorithm, which, in addition to new regularization terms, produces high-quality models with far fewer nodes and greater interpretability than traditional soft trees. We test the performance of our GSTs on benchmark tabular and image datasets, including MIMIC-IV, MNIST, Fashion MNIST, CIFAR-10 and Celeb-A. We show that our approach outperforms other popular tree methods (CART, Random Forests, XGBoost) in almost all of the datasets, with Convolutional Trees having a significant edge in the hardest CIFAR-10 and Fashion MNIST datasets. Finally, we explore the interpretability of our GSTs and find that even the most complex GSTs are considerably more interpretable than deep neural networks. Overall, our approach of Generalized Soft Trees provides a tractable method that is high-performing on (un)structured datasets and preserves interpretability more than traditional deep learning methods.
... Feature importance scores were also computed to determine the relative contribution of each image feature in the classification of the different land cover types. Some of the most frequent approaches for feature selection include the Gini index [66], gain ratio [67], and the Chi-square test [68]. Feature importance scores for the overall classification were estimated using the RF-based Gini criterion. ...
The Greater Amanzule Peatlands (GAP) in Ghana is an important biodiversity hotspot facing increasing pressure from anthropogenic land-use activities driven by rapid agricultural plantation expansion, urbanisation, and the burgeoning oil and gas industry. Accurate measurement of how these pressures alter land cover over time, along with the projection of future changes, is crucial for sustainable management. This study aims to analyse these changes from 2010 to 2020 and predict future scenarios up to 2040 using multi-source remote sensing and machine learning techniques. Optical, radar, and topographical remote sensing data from Landsat-7, Landsat-8, ALOS/PALSAR, and Shuttle Radar Topography Mission derived digital elevation models (DEMs) were integrated to perform land cover change analysis using Random Forest (RF), while Cellular Automata Artificial Neural Networks (CA-ANNs) were employed for predictive modelling. The classification model achieved overall accuracies of 93% in 2010 and 94% in both 2015 and 2020, with weighted F1 scores of 80.0%, 75.8%, and 75.7%, respectively. Validation of the predictive model yielded a Kappa value of 0.70, with an overall accuracy rate of 80%, ensuring reliable spatial predictions of future land cover dynamics. Findings reveal a 12% expansion in peatland cover, equivalent to approximately 6570 ± 308.59 hectares, despite declines in specific peatland types. Concurrently, anthropogenic land uses have increased, evidenced by an 85% rise in rubber plantations (from 30,530 ± 110.96 hectares to 56,617 ± 220.90 hectares) and a 6% reduction in natural forest cover (5965 ± 353.72 hectares). Sparse vegetation, including smallholder farms, decreased by 35% from 45,064 ± 163.79 hectares to 29,424 ± 114.81 hectares. Projections for 2030 and 2040 indicate minimal changes based on current trends; however, they do not consider potential impacts from climate change, large-scale development projects, and demographic shifts, necessitating cautious interpretation. The results highlight areas of stability and vulnerability within the understudied GAP region, offering critical insights for developing targeted conservation strategies. Additionally, the methodological framework, which combines optical, radar, and topographical data with machine learning, provides a robust approach for accurate and detailed landscape-scale monitoring of tropical peatlands that is applicable to other regions facing similar environmental challenges.
... Carbon dioxide-loaded AMP solutions are a perfect example, where the datasets are extensive but show large variances at some parametric values. Since the CO2-AMP-H2O data is non-linear in nature, various techniques like smoothing algorithms [4,5], spline estimation [5][6][7], partial least squares methods [8][9][10][11][12][13], and kernel vector methods [14,15] can be applied for statistical identification of data outliers. Nevertheless, artificial neural networks [16] offer a superior correlation of data pools with high complexity and a multivariate nature. ...
The presence of a few measured outliers in a non-linear chemical dataset, such as CO2-(2-amino-2-methyl-1-propanol)-H2O, is common and hinders correlation and model development for carbon dioxide capture processes. Therefore, these outliers must be identified and treated accordingly. Most traditional statistical techniques are weak in such correlation and lose information at the extrema of the designated system. Hence, neural-network-based identification is a promising technique for data outliers found in the above system. The proposed approach flexibly adapts to the nonlinear data distribution and identifies the outliers reported in the open literature. The proposed method improves on the shortcomings of previous statistical approaches and can potentially be extended to other nonlinear experimental datasets in chemical process systems.
... Trees (CART) is a decision tree learning technique that can be used for both classification and regression predictive modelling problems. The method involves splitting data into subsets based on the value of input features, leading to a tree-like model of decisions and their possible consequences [26]. The main goal of CART is to develop a model capable of predicting the value of a target variable by deriving straightforward decision rules from the features present in the data. ...
This study presents a framework for predicting hemoglobin (Hb) levels utilizing Bayesian optimization-assisted machine learning models, incorporating both time-domain and frequency-domain features derived from photoplethysmography (PPG) signals. Hemoglobin, a crucial protein for oxygen and carbon dioxide transport in the blood, has levels that indicate various health conditions, including anemia and diseases affecting red blood cell production. Traditional methods for measuring Hb levels are invasive, posing potential risks and discomfort. To address this, a dataset comprising PPG signals, along with demographic data (gender and age), was analyzed to predict Hb levels accurately. Our models employ support vector regression (SVR), artificial neural networks (ANNs), classification and regression trees (CART), and ensembles of trees (EoT) optimized through Bayesian optimization algorithm. The results demonstrated that incorporating age and gender as features significantly improved model performance, highlighting their importance in Hb level prediction. Among the tested models, ANN provided the best results, involving normalized raw signals, feature selection, and reduction methods. The model achieved a mean squared error (MSE) of 1.508, root mean squared error (RMSE) of 1.228, and R-squared (R²) of 0.226. This study's findings contribute to the growing body of research on non-invasive hemoglobin prediction, offering a potential tool for healthcare professionals and patients for convenient and risk-free Hb level monitoring.
... Decision trees [2]–[5] and rule-based (decision rule) systems [6]–[10] are prevalent tools for classification, knowledge representation, and addressing various challenges in combinatorial optimization and fault diagnosis. Among the various models used in data analysis, decision trees and rule-based models stand out for their interpretability [11]. ...
This paper investigates classes of decision tables (DTs) with 0-1-decisions that are closed under the removal of attributes (columns) and changes to the assigned decisions to rows. For tables from any closed class (CC), the authors examine how the minimum complexity of deterministic decision trees (DDTs) depends on the minimum complexity of a strongly nondeterministic decision tree (SNDDT). Let this dependence be described by the function F_{Ψ,A}(n). The paper establishes a condition under which the function F_{Ψ,A}(n) is defined for all values of n. Assuming F_{Ψ,A}(n) is defined everywhere, the paper proves that this function exhibits one of two behaviors: either it is bounded above by a constant, or it is at least n for infinitely many values of n. In particular, the function F_{Ψ,A}(n) can grow as an arbitrary nondecreasing function φ(n) that satisfies φ(n) ≥ n and φ(0) = 0. The paper also provides conditions under which the function F_{Ψ,A}(n) remains bounded from above by a polynomial in n.
Underwriting is one of the important stages in an insurance company's operations. The insurance company uses different factors to classify the policyholders. In this study, we apply several machine learning models, such as nearest neighbour and logistic regression, to the Actuarial Challenge dataset used by Qazvini (2019) to classify liability insurance policies into two groups: 1 - policies with claims and 2 - policies without claims.
National Statistical Organisations spend time and money every year to collect information through surveys. Some of these surveys include follow-up studies, and usually some participants, due to factors such as death, immigration, change of employment, or health, do not participate in future surveys. In this study, we focus on the English Longitudinal Study of Ageing (ELSA) COVID-19 Substudy, which was carried out during the COVID-19 pandemic in two waves. In this substudy, some participants from wave 1 did not participate in wave 2. Our purpose is to predict non-responses using Machine Learning (ML) algorithms such as K-nearest neighbours (KNN), random forest (RF), AdaBoost, logistic regression, neural networks (NN), and support vector classifier (SVC). We find that RF outperforms other models in terms of balanced accuracy, KNN in terms of precision and test accuracy, and logistic regression in terms of the area under the receiver operating characteristic curve (ROC), i.e. AUC.
A flood susceptibility assessment is crucial for identifying areas that are prone to flooding. Such assessments usually rely on models, but prior flood susceptibility models have focused on either the frequency or the duration of floods, not both. Integrating the frequency and duration of floods in susceptibility assessment could provide a more accurate picture of flood susceptibility. This study aimed to utilise and assess a novel integrated model that considers both the frequency and duration of floods to categorise vulnerability/susceptibility zones. The study focuses on the multi-hazard zone between Cuddalore and Sirkazhi on the east coast of Tamil Nadu, India. Sentinel-1A and RISAT-1A Synthetic Aperture Radar (SAR) images were analysed using the Classification and Regression Tree (CART) classifier. Eight SAR images were used to study the persistence and temporal evolution of flooding over 49 days in 2015, along with multi-temporal datasets for 2015, 2018, and 2019. The classification of flood-susceptibility zones based on the frequency and duration of flooding yielded an accuracy of 0.87, whereas the integrated model scored 0.96 across all metrics. The hybrid integrated analysis provided a comprehensive understanding of the area’s flooding system, identifying the southern part of the study area as the most susceptible. The proposed model recommends a frequency-duration-based approach to demarcate flood susceptibility zones and potentially improve flood susceptibility assessments and management strategies.
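To illustrate the CART step in such a workflow (a sketch only; the study's SAR processing chain, bands, and thresholds are not reproduced here, and the values below are placeholders):

```python
# Hedged sketch: CART (decision tree) classification of flood vs. non-flood pixels
# from SAR backscatter features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
# Features per pixel: e.g., VV backscatter (dB), VH backscatter (dB), local incidence angle.
X = rng.normal(loc=[-12.0, -18.0, 35.0], scale=[4.0, 4.0, 5.0], size=(5000, 3))
y = (X[:, 0] < -15).astype(int)  # toy labeling rule: low VV backscatter -> water/flood

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
cart = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=0)
cart.fit(X_tr, y_tr)
print("overall accuracy:", accuracy_score(y_te, cart.predict(X_te)))
```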
MicroRNAs (miRNAs) are a type of non-coding RNA involved in gene regulation and can be associated with diseases such as cancer, cardiovascular disease, and neurological disease. As such, identifying the full complement of miRNAs in a genome can be of great relevance. Since experimental methods for novel precursor miRNA (pre-miRNA) detection are complex and expensive, computational detection using Machine Learning (ML) could be useful. Existing ML methods are often complex black boxes that do not provide an interpretable structural description of pre-miRNA. In this paper, we propose a novel framework that uses generative modeling through Variational Auto-Encoders (VAEs) to uncover the generative factors of pre-miRNA. After training the VAE, the pre-miRNA description is developed using a decision tree on the lower-dimensional latent space. Applying the framework to miRNA classification, we obtain high reconstruction and classification performance while also developing an accurate miRNA description.
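The second stage of such a framework can be sketched as follows, assuming latent codes from an already trained encoder are available (names, dimensions, and labels here are illustrative, not the paper's):

```python
# Hedged sketch: fit an interpretable decision tree on VAE latent codes.
# `latent` stands in for encoder outputs; `labels` for pre-miRNA vs. negative examples.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
latent = rng.normal(size=(1000, 8))                             # placeholder latent vectors
labels = (latent[:, 0] + 0.5 * latent[:, 3] > 0).astype(int)    # toy decision boundary

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(latent, labels)

# The exported rules give a human-readable description in terms of latent factors.
print(export_text(tree, feature_names=[f"z{i}" for i in range(latent.shape[1])]))
```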
High-dimensional data are now generated frequently in many domains. This study presents a novel way of lowering the number of features in datasets with large dimensionality. The analysis suggests a two-stage approach for selecting the most relevant features. First, cosine similarity is used in a pre-processing step to identify and rank the most significant features. Next, a hybrid metaheuristic, a combination of the binary Volleyball Premier League and Antlion optimizers, is employed to re-select the most significant features detected in the initial phase. The extracted features are evaluated on selected Parkinson's Disease datasets, and the results are compared with the scenario in which the hybrid metaheuristic employs all the features. The findings demonstrated notable advantages in terms of decreasing the execution time, with improvements ranging from 40.37% to a maximum of 91.57%. Additionally, there was a reduction in the number of features by 9.28% to 73.85%, while affecting the accuracy by at most 4.47% in approximately 80% of the datasets.
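A minimal sketch of the cosine-similarity pre-filtering stage (the metaheuristic stage is omitted; the data and the choice of k are illustrative):

```python
# Hedged sketch: rank features by the absolute cosine similarity between each
# feature column and the target vector, then keep the top-k as a pre-filter.
import numpy as np

def cosine_rank(X: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k features most aligned (or anti-aligned) with y."""
    sims = X.T @ y / (np.linalg.norm(X, axis=0) * np.linalg.norm(y) + 1e-12)
    return np.argsort(-np.abs(sims))[:k]

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 50))
y = X[:, 3] - 2 * X[:, 17] + rng.normal(scale=0.5, size=300)
print("top features:", cosine_rank(X, y, k=5))
```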
Background/Objectives: The COVID-19 pandemic reduced in-person pediatric visits in the United States by over 50%, while telehealth visits increased significantly. The national use of telehealth for children and the factors influencing its use have rarely been studied. This study aimed to investigate the prevalence of telehealth use during the COVID-19 pandemic and explore the potential factors linked to its use at the state level. Methods: A cross-sectional study of the National Survey of Children’s Health (2021–22) sponsored by the federal Maternal and Child Health Bureau was performed. We used the least absolute shrinkage and selection operator (LASSO) regression to predict telehealth use during the pandemic. A bar map showing the significant factors from the multivariable regression was created. Results: Of the 101,136 children, 15.25% reported using telehealth visits due to COVID-19, and 3.67% reported using telehealth visits due to other health reasons. The Northeast states showed the highest telehealth use due to COVID-19. In the Midwest and Southern states, children had a lower prevalence of telehealth visits due to other health reasons. The LASSO regressions demonstrated that telehealth visits were associated with age, insurance type, household income, usual source of pediatric preventive care, perceived child health, blood disorders, allergy, brain injury, seizure, ADHD, anxiety, depression, and special needs. Conclusions: This study demonstrated significant variability in the use of telehealth among states during the COVID-19 pandemic. Understanding who uses telehealth and why, as well as identifying access barriers, helps maximize telehealth potential and improve healthcare outcomes for all.
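As a sketch of the LASSO step for a binary outcome (placeholder predictors; this is not the survey's actual design, weighting, or variable set):

```python
# Hedged sketch: L1-penalized (LASSO) logistic regression for a binary
# telehealth-use outcome, with cross-validated regularization strength.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=3000, n_features=30, n_informative=8, random_state=0)

lasso_logit = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, cv=5, max_iter=5000),
)
lasso_logit.fit(X, y)
coef = lasso_logit.named_steps["logisticregressioncv"].coef_.ravel()
print("non-zero predictors:", np.flatnonzero(coef))   # features retained by the L1 penalty
```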
Spam detection is a critical cybersecurity and information management task with significant implications for security decision-making processes. Traditional machine learning algorithms such as Logistic Regression (LR), K-Nearest Neighbors (KNN), Decision Trees (DT), and Support Vector Machines (SVM) have been employed to mitigate this challenge. However, these algorithms often suffer from the "black box" dilemma, a lack of transparency that hinders their applicability in security contexts where understanding the reasoning behind classifications is essential for effective risk assessment and mitigation strategies. To address this limitation, the current paper leverages Explainable Artificial Intelligence (XAI) principles to introduce a novel, more transparent approach to spam detection. This paper presents a novel approach to spam detection using a Random Forest (RF) Classifier model enhanced by a meticulously designed methodology. The methodology incorporates data balancing through Hybrid Random Sampling, feature selection using the Gini Index, and two-layer model explainability via Local Interpretable Model-agnostic Explanations (LIME) and Shapley Additive Explanations (SHAP) techniques. The model achieved an impressive accuracy rate of 94.8% and high precision and recall scores, outperforming traditional methods such as LR, KNN, DT, and SVM across all key performance metrics. The results affirm the effectiveness of the proposed methodology, offering a robust and interpretable model for spam detection. This study is a significant advancement in the field, providing a comprehensive and reliable solution to the spam detection problem.
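A hedged sketch of two of the components named above, Gini-importance feature selection with a Random Forest plus SHAP explanations (the `shap` package and toy data are assumptions; the study's preprocessing, balancing, and LIME layer are not reproduced):

```python
# Hedged sketch: Random Forest classifier with Gini-importance feature selection
# and SHAP explanations on a placeholder dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
import shap  # assumes the shap package is installed

X, y = make_classification(n_samples=2000, n_features=40, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Gini importance from an initial forest drives the feature selection.
selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0))
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr_sel, y_tr)
print("test accuracy:", rf.score(X_te_sel, y_te))

# SHAP values for the tree ensemble (one test row shown as an example).
explainer = shap.TreeExplainer(rf)
print("SHAP values for first test sample:", explainer.shap_values(X_te_sel[:1]))
```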
In this chapter, we introduce Metis, a framework designed to convert complex interactive multimedia streaming systems into human-readable control policies. Leveraging decision tree conversion methods, Metis addresses the drawbacks of current decision-making systems, such as their heavyweight nature, incomprehensible structure, and non-adjustable policies. By interpreting deep learning-based adaptive video streaming systems, Metis enables network operators to debug, deploy, and adjust these systems easily. Our approach not only provides interpretability but also reduces runtime overhead, maintaining performance degradation within 2% of the original deep neural networks. We demonstrate Metis’s effectiveness through various use cases in system design, debugging, and deployment.
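The decision-tree conversion idea can be sketched as a simple imitation (distillation) step, assuming a trained neural policy is available; this is an illustration under that assumption, not Metis's actual algorithm or feature set:

```python
# Hedged sketch: distil a black-box streaming policy into a readable decision tree.
# `teacher_predict` stands in for the deep model's action choice on a state.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(3)

def teacher_predict(states: np.ndarray) -> np.ndarray:
    # Placeholder "deep" policy: pick a bitrate level from throughput and buffer level.
    return np.digitize(0.7 * states[:, 0] + 0.3 * states[:, 1], bins=[0.3, 0.6, 0.9])

# Collect (state, teacher action) pairs and fit a small, readable tree.
states = rng.uniform(size=(5000, 2))          # e.g., normalized throughput, buffer level
actions = teacher_predict(states)
policy_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(states, actions)

print("imitation accuracy:", policy_tree.score(states, actions))
print(export_text(policy_tree, feature_names=["throughput", "buffer"]))
```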
Population growth and human actions are worsening the exhaustion of finite land and water resources, prompting worries about sustainability. This study examines the changing landscape dynamics of El-Beheira province in the West Nile Delta region of Egypt. The study aims to quantitatively identify the type, extent, and spatial trends of the major changes from the spatio-temporal analysis of a time series of satellite images spanning the last four decades, and to study the impact of those spatial dynamics on sustainable agricultural management. Remote sensing data and machine learning models such as Random Forest (RF), Gradient Tree Boosting (GTB), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Multi-Layer Perceptron (MLP) were used to examine the changes in land use and land cover (LULC) from 1984 to 2021. The analysis involves multi-temporal satellite images from Landsat and Sentinel-2 satellites, as well as time series of spectral indices such as NDVI, NDWI, and NDBI. The classification results showed that the RF classifier outperformed the others and effectively distinguished between different LULC categories throughout the research area. The results show significant changes in LULC categories, with increases in vegetation and urban areas and decreases in barren terrain. Change detection analysis reveals the temporal dynamics of LULC, emphasizing the effects of agricultural activities and urban growth. During the study period, the percentage of barren land decreased from 61.2% in 1984 to 37.8% in 2021, while vegetated areas increased from 35.8% to 56.4%. Urban areas are expanding rapidly due to population growth and infrastructure development, while aquatic ecosystems remain relatively stable. Most of the landscape remains unchanged over time, with transitions between barren land and vegetated areas accounting for a significant portion of the alterations. It is critical to highlight that urban expansion has encroached on the northern part of the historical soils of the Nile Delta, which are fertile and suitable for growing strategic crops such as rice and cotton. The study indicates that this area is at risk of soil salinization and ecological degradation. These findings can guide targeted interventions to mitigate soil degradation risks. In contrast, agricultural activities have expanded on barren lands in the desert region, which have relatively low levels of productivity. Despite the region’s notable agricultural development, a lack of rice and cotton varieties adapted to clay soils contributes to increased salinization and climate change impacts. The results provide valuable insights into changing land use patterns in El-Beheira, which are essential for informed decision-making and the development of sustainable land management strategies to preserve resources and improve environmental sustainability.
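A compact sketch of the classification core (spectral indices as features, Random Forest as classifier); the index formulas follow standard conventions, while the reflectance arrays and toy labels are placeholders rather than the study's reference data:

```python
# Hedged sketch: pixel-wise LULC classification from spectral indices with a Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(5)
n_pixels = 4000
red, nir, swir, green = rng.uniform(0.01, 0.6, size=(4, n_pixels))

# Standard spectral indices used as features.
ndvi = (nir - red) / (nir + red + 1e-9)
ndwi = (green - nir) / (green + nir + 1e-9)
ndbi = (swir - nir) / (swir + nir + 1e-9)
X = np.column_stack([ndvi, ndwi, ndbi])

# Toy labels: 0 = barren, 1 = vegetation, 2 = water (a real study would use
# reference polygons for training labels).
y = np.where(ndvi > 0.3, 1, np.where(ndwi > 0.1, 2, 0))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, rf.predict(X_te)))
```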
This article proposes a new categorical-data partitioning approach for applying differential privacy to Gradient Boosting Decision Trees. We study improvements in the handling of categorical attributes and in the random selection of split points while providing differential privacy guarantees. Our approach defines a new gain function for these attributes and determines the sensitivity bounds of that function. In addition, we carry out an empirical analysis on six real datasets, showing that the proposed approach achieves error rates lower than or equal to those of the baseline models.
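One standard building block for differentially private split selection is the exponential mechanism over candidate split gains; the sketch below shows that mechanism in isolation (the gain values, sensitivity, and privacy budget are illustrative, not the paper's gain function or bounds):

```python
# Hedged sketch: exponential mechanism for choosing a split among candidates,
# with selection probability proportional to exp(eps * gain / (2 * sensitivity)).
import numpy as np

def exp_mechanism_choice(gains: np.ndarray, eps: float, sensitivity: float,
                         rng: np.random.Generator) -> int:
    """Pick an index with probability proportional to exp(eps * gain / (2 * sensitivity))."""
    scores = eps * gains / (2.0 * sensitivity)
    scores -= scores.max()            # shift for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return int(rng.choice(len(gains), p=probs))

rng = np.random.default_rng(11)
candidate_gains = np.array([0.10, 0.42, 0.38, 0.05])  # toy gains for 4 candidate splits
chosen = exp_mechanism_choice(candidate_gains, eps=1.0, sensitivity=1.0, rng=rng)
print("chosen split index:", chosen)
```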