Book

Classification And Regression Trees

Authors: Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone

Abstract

The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.
... The top-down induction method is the most common approach for learning decision trees and various heuristics have been proposed in this context for finding oblique splits. Breiman et al. (1984) develop the first major algorithm, named CART-LC, to solve this problem. They propose a deterministic hill-climbing approach that cycles over the features and optimizes the corresponding coefficients of the oblique split individually until a local optimum is reached. ...
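As a rough, hedged sketch of the coordinate-wise idea described above (not Breiman et al.'s exact CART-LC procedure, which also includes a dedicated line search and backward feature deletion), the following Python snippet cycles over the features of an oblique split w·x + b <= 0 and keeps any single-coefficient perturbation that lowers the weighted Gini impurity, stopping when a full cycle brings no improvement; the step grid and starting split are arbitrary choices made here for illustration.

import numpy as np

def gini(y):
    """Gini impurity of a vector of non-negative integer class labels."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def split_impurity(X, y, w, b):
    """Weighted Gini impurity of the oblique split w.x + b <= 0."""
    left = (X @ w + b) <= 0
    n = len(y)
    return (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n

def oblique_hill_climb(X, y, n_cycles=10, steps=(-0.5, -0.1, 0.1, 0.5)):
    """Cycle over coefficients, keeping any perturbation that lowers impurity."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    w[0] = 1.0  # start from an axis-aligned split on feature 0
    best = split_impurity(X, y, w, b)
    for _ in range(n_cycles):
        improved = False
        for j in range(d):
            for s in steps:
                w_try = w.copy()
                w_try[j] += s
                score = split_impurity(X, y, w_try, b)
                if score < best - 1e-12:
                    w, best, improved = w_try, score, True
        if not improved:  # a full cycle brought no gain: local optimum reached
            break
    return w, b, best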
... These measure the inhomogeneity of the labels of observations in the two subsets obtained by the split, and the splitting criterion becomes the weighted sum of those impurities. Another well-known splitting criterion for classification is the twoing rule (Breiman et al. 1984). For regression tasks, one usually uses the weighted sum of mean squared errors or mean absolute errors from the mean or the median value of the labels of observations in the two respective halfspaces. ...
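To make these criteria concrete, here is a small self-contained sketch (our own illustration with hypothetical function names, assuming integer class labels in the classification case) of the weighted Gini criterion, the weighted mean-squared-error criterion for regression, and the twoing value as it is usually stated; CART-style algorithms minimize the first two and maximize the last over candidate splits.

import numpy as np

def class_proportions(y, n_classes):
    """Class proportions of an integer label vector, padded to n_classes."""
    return np.bincount(y, minlength=n_classes) / max(len(y), 1)

def weighted_gini(y_left, y_right, n_classes):
    """Weighted sum of Gini impurities of the two child subsets."""
    n = len(y_left) + len(y_right)
    def g(y):
        p = class_proportions(y, n_classes)
        return 1.0 - np.sum(p ** 2)
    return (len(y_left) * g(y_left) + len(y_right) * g(y_right)) / n

def weighted_mse(y_left, y_right):
    """Regression criterion: weighted sum of the children's squared errors about their means."""
    def sse(y):
        return np.sum((y - y.mean()) ** 2) if len(y) else 0.0
    n = len(y_left) + len(y_right)
    return (sse(y_left) + sse(y_right)) / n

def twoing(y_left, y_right, n_classes):
    """Twoing value (to be maximized), as usually stated for CART."""
    n = len(y_left) + len(y_right)
    p_left, p_right = len(y_left) / n, len(y_right) / n
    diff = np.abs(class_proportions(y_left, n_classes)
                  - class_proportions(y_right, n_classes)).sum()
    return p_left * p_right / 4.0 * diff ** 2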
... We assess three different criteria: out-of-sample accuracy, tree size measured in terms of the number of leaf nodes, and the time needed to induce the trees. To evaluate these quality measures, we use 20 popular classification datasets from the UCI machine learning repository, which are summarized in Table 1, and we compare our method to our own implementation of the univariate CART algorithm (Breiman et al. 1984) and to the popular oblique decision tree induction method OC1. All three methods use Gini impurity as the splitting criterion, and the recursive partitioning is stopped either when all the labels of observations associated with a leaf node are equal or when the number of observations to be split is lower than four. ...
Article
Full-text available
We describe a new simulated annealing algorithm to compute near-optimal oblique splits in the context of decision tree induction. The algorithm can be interpreted as a walk on the cells of a hyperplane arrangement defined by the observations in the training set. The cells of this hyperplane arrangement correspond to subsets of oblique splits that divide the feature space in the same manner, and the vertices of this arrangement reveal multiple neighboring solutions. We use a pivoting strategy to iterate over the vertices and to explore this neighborhood. Embedding this neighborhood search in a simulated annealing framework makes it possible to escape local minima and increases the probability of finding globally optimal solutions. To overcome problems related to degeneracy, we rely on a lexicographic pivoting scheme. Our experimental results indicate that our approach is well suited for inducing small and accurate decision trees and capable of outperforming existing univariate and oblique decision tree induction algorithms. Furthermore, oblique decision trees obtained with this method are competitive with other popular prediction models.
... In particular, ensembles of classification or regression trees (CART) have been used in several applications [7]. A CART is a tree of nodes that provides a classification or regression output through a series of decisions based on thresholds on feature values [8]. Its main advantages are low computational cost, low variability, and simple interpretability. ...
... The reason for this is that these first 3 nodes separate a larger number of observations in the training data than subsequent nodes, which each separate fewer observations; therefore, the features used in these 3 nodes have a considerable effect on the tree structure. This is supported by analyzing the effects of the thresholds used to separate observations within nodes, as stated by Breiman et al. [8] using Eq. 2. ...
Article
Full-text available
Tree ensemble algorithms, such as random forest (RF), are some of the most widely applied methods in machine learning. However, an important hyperparameter, the number of classification or regression trees within the ensemble, must be specified in these algorithms. The number of trees within the ensemble can adversely affect bias or computational cost and should ideally be adapted for each task. For this reason, a novel tree ensemble is described, the feature-ranked self-growing forest (FSF), which allows the automatic growth of a tree ensemble based on the structural diversity of the first two levels of the trees' nodes. The algorithm's performance was tested with 30 classification and 30 regression datasets and compared with RF. The computational complexity was also theoretically and experimentally analyzed. FSF had significantly higher performance for 57%, and equivalent performance for 27%, of classification datasets compared to RF. FSF had higher performance for 70% and equivalent performance for 7% of regression datasets compared to RF. The computational complexity of FSF was competitive with that of other tree ensembles, being mainly dependent on the number of observations within the dataset. Therefore, FSF can be considered a suitable out-of-the-box approach with potential as a tool for feature ranking and for analyzing a dataset's complexity using the number of trees computed for a particular task. A MATLAB and Python implementation of the algorithm and a working example for classification and regression are provided for academic use.
... Finally, Section 6 provides the concluding remarks. Breiman, Friedman, Olshen, and Stone (1984) proposed Classification and Regression Trees (CART) in their book Classification and Regression Trees. They defined the regression model as Tree Structured Regression to differentiate it from other regression methods; here, the training set is partitioned by a sequence of binary splits into terminal nodes. ...
... In order to improve the prediction accuracy, Breiman et al. (1984) suggested that CART models can be enhanced by replacing the average of the target variable values with a linear regression, which leads to M5 models (Quinlan, 1992). The rules used in CART models can be applied to M5 models if the standard deviation is used instead of the variance, as in the case of the CART models. ...
... Explainable models allow us to understand the relationship between input features and outputs (e.g. predicted leakage), furthering our understanding of the problem itself and what factors impact leakage the most. For water companies, this means additional factors can be considered, or better understood, when deciding on actions, company policies, and management. ...
... Decision trees [10] are simple models where predictions are made based on a series of if-then-else decisions, similar to a flow chart. Each decision, or node, is based on a threshold on a single feature. ...
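As an illustration of this flow-chart view (a toy sketch with a hypothetical Node structure, not the model used in the cited study), a prediction simply follows one threshold comparison per node until it reaches a leaf:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # index of the feature tested at this node
    threshold: Optional[float] = None  # split threshold on that feature
    left: Optional["Node"] = None      # branch taken when x[feature] <= threshold
    right: Optional["Node"] = None
    prediction: Optional[str] = None   # set only on leaves

def predict(node, x):
    """Follow threshold decisions down the tree until a leaf is reached."""
    while node.prediction is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction

# toy tree: "high" if x[0] > 2.0, otherwise decided by x[1]
tree = Node(feature=0, threshold=2.0,
            left=Node(feature=1, threshold=0.5,
                      left=Node(prediction="low"),
                      right=Node(prediction="medium")),
            right=Node(prediction="high"))
print(predict(tree, [3.1, 0.2]))  # -> "high"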
Article
Full-text available
Understanding the various interrelated effects that result in leakage is vital to the effort to reduce it. This paper aims to understand, at the district metered area (DMA) level, the relationship between leakage and static characteristics of a DMA, i.e. without considering pressure or flow. The characteristics used include the number of pipes and connections, total DMA volume and network density, as well as pipe diameter, length, age, and material statistics. Leakage, especially background and unreported leakage, can be difficult to accurately quantify. Here, the Average Weekly Minimum Night Flow (AWM) over the last 5 years is used as a proxy for leakage. While this may include some legitimate demand, it is generally assumed that minimum night flow, strongly correlates with leakage. A data-driven case study on over 800 real DMAs from UK networks is conducted. Two regression models, a decision tree model and an elastic net linear regression model, are created to predict the AWM of unseen DMAs. Reasonable accuracy was achieved, considering pressure is not an included feature, and the models are investigated for the most prominent features related to leakage.
... Even though we mention the applications of these methods within the context of precision medicine, our review is not a summary of studies analyzing longitudinal data, but rather provides an overview of available methods and their implementations. We limit our attention to RF based on the classification and regression tree (CART, [21]) with a focus on prediction of categorical and quantitative outcomes. In Section 2 we consider the univariate response longitudinal scenario. ...
... The nodes in the final layer of a tree are called leaves and are used for prediction of new observations. More detailed discussions on CART can be found in [21]. ...
Article
Full-text available
In longitudinal studies variables are measured repeatedly over time, leading to clustered and correlated observations. If the goal of the study is to develop prediction models, machine learning approaches such as the powerful random forest (RF) are often promising alternatives to standard statistical methods, especially in the context of high-dimensional data. In this paper, we review extensions of the standard RF method for the purpose of longitudinal data analysis. Extension methods are categorized according to the data structures for which they are designed. We consider both univariate and multivariate response longitudinal data and further categorize the repeated measurements according to whether the time effect is relevant. Even though most extensions are proposed for low-dimensional data, some can be applied to high-dimensional data. Information of available software implementations of the reviewed extensions is also given. We conclude with discussions on the limitations of our review and some future research directions.
... Then we conducted two supervised classifications using the classification and regression trees (CART) and random forest (RF) algorithms; these are two decision tree (DT) algorithms increasingly used for land cover classification because of their ability to identify and reduce meaningful variables from complex datasets (Phiri et al. 2020). To produce discrete classes, a CART algorithm builds a single decision tree using a defined array of predictor and response variables, which makes it particularly sensitive to outliers (Breiman et al. 1984). In contrast, an RF algorithm is an ensemble of decision trees based on random samples of data that yield several predictions that are then combined to define the classes of a classification (Bonaccorso 2018). ...
Article
Full-text available
Archaeology is increasingly employing remote sensing techniques such as airborne lidar (light detection and ranging), terrestrial laser scanning (TLS), and photogrammetry in tropical environments where dense vegetation hinders to a great extent the ability to understand the scope of ancient landscape modification. These technologies have enabled archaeologists to develop sophisticated analyses that overturn traditional misconceptions of tropical ecologies and the human groups that have inhabited them in the long term. This article presents new data on the Sierra Nevada de Santa Marta in northern Colombia that reveals the extent to which its ancient societies transformed this landscape, which is frequently thought of as pristine. By recursively integrating remote sensing and archaeology, this study contributes to interdisciplinary scholarship examining ancient land use and occupation in densely forested contexts.
... Mean Decrease in Impurity (MDI) is a method used in decision tree-based ML algorithms to evaluate the importance of each feature [5]. MDI measures, for each feature, the decrease in impurity (e.g., Gini impurity) produced at each split on that feature and averages this decrease across all such splits. ...
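For readers who want to compute this quantity, scikit-learn's impurity-based feature_importances_ attribute implements exactly this kind of mean decrease in impurity; the snippet below is a generic illustration on synthetic data, not the setup of the cited work.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy data: 6 features, only the first 3 carry signal
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ averages the weighted impurity decrease each feature
# produces over all splits and all trees, i.e. the MDI score described above
for i, importance in enumerate(forest.feature_importances_):
    print(f"feature {i}: MDI = {importance:.3f}")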
Preprint
Full-text available
Ransomware is a rapidly evolving type of malware designed to encrypt user files on a device, making them inaccessible in order to exact a ransom. Ransomware attacks resulted in billions of dollars in damages in recent years and are expected to cause hundreds of billions more in the next decade. With current state-of-the-art process-based detectors being heavily susceptible to evasion attacks, no comprehensive solution to this problem is available today. This paper presents Minerva, a new approach to ransomware detection. Unlike current methods focused on identifying ransomware based on process-level behavioral modeling, Minerva detects ransomware by building behavioral profiles of files based on all the operations they receive in a time window. Minerva addresses some of the critical challenges associated with process-based approaches, specifically their vulnerability to complex evasion attacks. Our evaluation of Minerva demonstrates its effectiveness in detecting ransomware attacks, including those that are able to bypass existing defenses. Our results show that Minerva identifies ransomware activity with an average accuracy of 99.45% and an average recall of 99.66%, with 99.97% of ransomware detected within 1 second.
... The objective of CART is usually to classify a data set into several groups by means of a rule that displays the groups in the form of a binary tree (Breiman et al., 2017); the tree is determined by a procedure known as recursive partitioning, in which each group becomes the node for the next partition. In this study, the CART method was used to classify the primary production indicators considered (NPP and CUE) in relation to the climate and topographic variables. ...
Thesis
Full-text available
Climate change is considered one of the main threats to biodiversity, ecosystems, socioeconomic development, human well-being, and even the future of humanity. In nature, it affects everything from individual species to ecosystems, including the complex interactions among organisms and/or their habitats, compromising the state of ecosystems, their structure and function, and the ecosystem services they provide. Moreover, the extreme sensitivity of the Iberian Peninsula to climate change increases the risk for threatened species and ecosystems such as sweet chestnut and cork oak agroforestry systems and Cantabrian brown bears. Therefore, quantitative measures that represent a key ecosystem function and inform about ecosystem state are necessary. Primary production indicators are ecological indicators that make it possible to quantify carbon assimilation through photosynthesis, thus representing one of the most important functions of the ecosystem. The general objective of this thesis was to analyse the spatial patterns of primary production, its changes and its drivers of change under climate change in the Iberian Peninsula, in order to understand the state of our ecosystems, plant and animal dynamics, and species' adaptive strategies. For this purpose, different data sources were employed to characterise land use, bear faeces were used to locate individuals and determine their diet, and long-term remote sensing data provided primary production. Parametric and non-parametric fitting methods were used to model relationships with climate predictors, predict the risks to ecosystems and construct foraging models. Hotspot analysis was conducted to identify significant spatial clusters of high- and low-efficiency areas. In general, we found that human management positively affects ecosystem productivity, while water availability is more important than temperature. Tree density plays a key role in the adaptation to climate variation, maintaining microclimatic conditions that make ecosystems less dependent on environmental variables. We observed that the state of the sweet chestnut is quite concerning, while the state of the cork oak reflects the ecological traits and adaptive strategies used to survive drought seasons. Finally, regarding Cantabrian brown bears, primary production has been decisive in understanding their nut-foraging patterns and in predicting the spatial distribution related to nut consumption during the hyperphagia season, with our models highlighting areas of high importance or where recent expansion has occurred.
... All analyses were performed in the programming language R (version 4.1.0) using the packages "stats" (Wilkinson and Rogers, 1973), "randomForest" (Breiman, 2001), "rpart" (Breiman et al., 1984); and "e1071" (Rong-En et al., 2005); for the algorithms described above, respectively. Hyperparameters are described in Supplementary Table 1. ...
Article
Full-text available
Predicting sugarcane yield by quality allows stakeholders from research centers to industries to decide on the precise time and place to harvest a product on the field; hence, it can streamline workflow while leveling up the cost-effectiveness of full-scale production. °Brix and Purity can offer significant and reliable indicators of high-quality raw material for industrial processing for food and fuel. However, their analysis in a relevant laboratory can be costly, time-consuming, and not scalable. We, therefore, analyzed whether merging multispectral images and machine learning (ML) algorithms can develop a non-invasive, predictive framework to map canopy reflectance to °Brix and Purity. We acquired multispectral images data of a sugarcane-producing area via unmanned aerial vehicle (UAV) while determining °Brix and analytical Purity from juice in a routine laboratory. We then tested a suite of ML algorithms, namely multiple linear regression (MLR), random forest (RF), decision tree (DT), and support vector machine (SVM) for adequacy and complexity in predicting °Brix and Purity upon single spectral bands, vegetation indices (VIs), and growing degree days (GDD). We obtained evidence for biophysical functions accurately predicting °Brix and Purity. Those can bring at least 80% of adequacy to the modeling. Therefore, our study represents progress in assessing and monitoring sugarcane on an industrial scale. Our insights can offer stakeholders possibilities to develop prescriptive harvesting and resource-effective, high-performance manufacturing lines for by-products.
... These models capture sophisticated relationships amongst the explanatory variables and do not rely on distributional assumptions; Lee, Chiu, Chou, & Lu (2006), using CA/TL and OI/TL as independent variables, found that CART performed best, followed by SVM. A classification tree is one of the most widely used supervised algorithms; it divides the entire population into subgroups and attempts to find homogeneity amongst them in order to estimate the target variable, and it is useful for non-linear relationships (Breiman, Friedman, Stone, & Olshen, 1984). ...
Thesis
Full-text available
Firms default when they fail to pay either the instalment of the principal amount or the interest on a loan raised from commercial banks or financial institutions, or when they fail to meet their obligations to bond or debenture holders. Default is a worldwide problem that holistically impacts the financial credibility of firms, their business operations, and ultimately the economic growth of the country in which the firm is incorporated. The default prediction process has drawn considerable attention from regulators, accounting practitioners, bankers, and scholars across countries. Early warning signals of default are imperative for management, investors, creditors, and bankers so that they can take proactive or remedial measures to overcome the flaws of credit rating agencies, create an internal credit risk system, and decide on an optimal capital structure. Further, credit risk modeling is necessary for pricing riskier bonds and also for sanctioning loans to firms. The present study has attempted to predict the default events of selected Indian corporates from 13 selected sectors. The total sample included in the study comprises 580 firms (320 non-defaulted, 260 defaulted) listed on the Indian stock exchange. The research period commences on 1st April 2004 and ends on 31st March 2019. The study incorporated 5 default prediction methods, namely Multiple Discriminant Analysis (MDA), Altman Original, Calibrated, Logistic Regression, and the Structural Model. The study developed models for each selected sector using MDA and Logistic Regression. The firm-specific sample data were collected from 13 Indian sectors, namely Chemicals, Construction and Engineering, Electronics, Hotels, Infrastructure, Pharmaceuticals, Plastic & Fibre, Realty, Software, Steel, Sugar, Textile and Miscellaneous. Further, the study amalgamated the sample cases of the selected 13 sectors into one group called the Complete Sample. The study developed 28 (14 MDA and 14 Logit) default prediction models. The classification accuracy of the developed MDA model was compared with the Altman Original and Calibrated models for in-sample data. The developed MDA and Logit models were validated on out-of-sample data for each selected sector and the Complete Sample. The Structural Model, which is based upon the option pricing method, was applied to predict the default and non-default cases of each selected firm from the 13 sectors and the Complete Sample. Further, the study compared the in-sample classification results of the developed MDA model, the developed Logit model and the Structural Model. Alongside this, the comparison of the developed MDA and Logit models was also conducted for out-of-sample data. Additionally, the study performed advanced default projection (forward testing), in which potential default events were predicted for the time horizon from 8 years before to 1 year before, or within, the actual default occurrence of the selected defaulted firms from the selected sectors. The analysis deduced from the empirical results of the developed MDA models conveys that the models predicted defaulted cases as non-defaulted, which generates a high rate of Type II Errors and makes the models less effective. Amongst all developed MDA models, the MDA model developed for the Hotels sector showed considerable discriminatory and predictive power.
The developed models classified the sample cases of the Construction and Engineering, Electronics, Hotels, Infrastructure, Pharmaceuticals, Plastic and Fibre and Software sectors fairly well. The MDA and Logit models were developed using 21 and 23 independent variables respectively. The independent variables included in the study belong to accounting, market, economic and qualitative categories. The study found that in the developed MDA model the following accounting variables performed significantly well: NI/TA, WC/TA, EBIT/TA, TBD/TA, and RE/TA. The classification results for in-sample data demonstrated that the MDA model attained satisfactory predictive accuracy for Chemicals, Steel, Pharmaceuticals, Plastic & Fibre, Hotels and Electronics, ranging from 90% to 87%, in conjunction with troublesome values of Type II Errors. The developed MDA models were validated on the out-of-sample data of the selected sectors. The validation accuracy obtained by the MDA model did not provide acceptable results except for the Electronics and Sugar sectors, at 76% and 74% respectively. The developed Logit model provided better results than the developed MDA model in all respects, such as robustness, effectiveness, classification accuracy and the significance of the independent variables. The results of the various tests conducted on the developed Logit model suggested that all the independent variables have a significant impact on the dependent variable; the developed Logit model was found highly competent for the Software sector and the Complete Sample. However, the developed Logit model could explain at most 47% and 45% of the variation in the dependent variables of the Software and Electronics sectors, the highest amongst the selected sectors. In contrast, the independent variables of the Construction and Engineering, Electronics, Hotels and Pharmaceuticals sectors explained the variation in the dependent variables with 71%, 70%, 71%, and 55% accuracy respectively. Further, the results suggest that the model is well specified and best fitting for the selected sample data. The overall classification accuracies for the in-sample data attained by the model are quite satisfactory for the Chemicals, Construction and Engineering, Electronics, Pharmaceuticals, Plastic and Fibre, and Steel sectors, none of which achieved less than 90% accuracy. There was a negligible rate of Type I Errors for all the sectors; however, the Type II Error was very high for a few selected sectors, namely Realty, Textile, Chemicals and the Complete Sample, ranging from 83% to 62%. The validation results highlighted that the Logit model outperformed the MDA model for the out-of-sample data as well. This model attained higher accuracy levels for Pharmaceuticals, Plastic and Fibre, and the Complete Sample, ranging from 92% to 89%. The most predictive independent variables found in the 14 developed Logit models are WC/TA, NI/TA, RE/TA, LOG (TA/GNP), and Y. The Structural Model was applied to all selected sectors and the Complete Sample. The classification results achieved by the Structural Model are contrary to the results attained by the MDA and Logit models. The Structural Model achieved higher accuracy in the prediction of defaulted cases, ranging from 100% to 87% across the sectors and the Complete Sample. However, the accuracy obtained when predicting non-defaulted cases is not satisfactory; rather, it generated elevated values of Type I Error.
The accuracy levels obtained in the prediction of non-defaulted cases lie between 4% and 30% only; due to this, the overall accuracy of the Structural Model deteriorated. Therefore, the overall accuracy rate accomplished by the Structural Model ranged from 18% to 35% only. The results of the advanced default projection (forward testing) study advocate the superiority of the Structural Model over the developed MDA and Logit models. The Structural Model adequately diagnosed potential default events even 8 years or 5 years before the actual default occurrence, and with high accuracy. The highest accuracy levels achieved by the Structural Model are 91%, 89% and 83% for the Realty, Hotels, and Construction and Engineering sectors respectively. The higher accuracy levels accomplished by the MDA model are 60%, 50% and 40% for the Hotels, Construction and Engineering, and Sugar sectors and the Complete Sample. The developed Logit model did not perform well in the advanced default projection study. The developed Logit model achieved only 13%, 10% and 9% accuracy for the Chemicals, Software and Infrastructure sectors in predicting potential default events. Rather, the developed Logit model attained very high prediction accuracy for the “Failed” category of the time horizon.
... Combinatorial background is suppressed using a boosted decision tree (BDT) algorithm [37,38] implemented in the TMVA toolkit [39]. Simulated signal events are used as the signal training sample, while candidates with invariant mass in the upper B± sideband, between 5800 and 7000 MeV/c², form the background training sample. ...
Preprint
Full-text available
The first study of $C\!P$ violation in the decay mode $B^\pm\to[K^+K^-\pi^+\pi^-]_D h^\pm$, with $h=K,\pi$, is presented, exploiting a data sample of proton-proton collisions collected by the LHCb experiment that corresponds to an integrated luminosity of $9$ fb$^{-1}$. The analysis is performed in bins of phase space, which are optimised for sensitivity to local $C\!P$ asymmetries. $C\!P$-violating observables that are sensitive to the angle $\gamma$ of the Unitarity Triangle are determined. The analysis requires external information on charm-decay parameters, which are currently taken from an amplitude analysis of LHCb data, but can be updated in the future when direct measurements become available. Measurements are also performed of phase-space integrated observables for $B^\pm\to[K^+K^-\pi^+\pi^-]_D h^\pm$ and $B^\pm\to[\pi^+\pi^-\pi^+\pi^-]_D h^\pm$ decays.
... The last step in each path is a leaf, which determines the classification of the input, and allows the model to make predictions. Decision trees have been in use for many years [135] but are known to be unstable, and so are referred to as weak learners because they do not perform well as standalone classifiers. However, boosting combines numerous weak learners, resulting in a powerful algorithm. ...
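The effect described here, a boosted ensemble of shallow trees outperforming any single weak learner, can be reproduced with a generic sketch like the one below (scikit-learn on synthetic data; the cited analysis itself uses the TMVA toolkit, not this code):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# a single depth-1 tree (decision stump) is a typical weak learner
stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)
# boosting combines many shallow trees fitted sequentially on the residual errors
bdt = GradientBoostingClassifier(n_estimators=300, max_depth=3,
                                 learning_rate=0.1).fit(X_tr, y_tr)

print("single shallow tree:", stump.score(X_te, y_te))
print("boosted ensemble:   ", bdt.score(X_te, y_te))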
Thesis
Composite Higgs models are a branch of Beyond the Standard Model theories which seek to address the hierarchy problem inherent to the current formulation of the Standard Model. In this thesis, we discuss and implement extensions to the scalar sector of the Standard Model, focusing on models of compositeness and their associated phenomenology. A key area of interest is then the appearance of compositeness at current and future colliders, which can be used to motivate the choice of the new era of experimental programs. To that end, this work details the construction of a number of effective models which are testable at colliders, and which are linked as extensions to the scalar sector. We include a comparison of twelve minimal composite Higgs models featuring an underlying fermionic completion, and discuss ubiquitous features expected across all such models, including a light pseudo-scalar, which is shown to be reachable at future lepton colliders with the help of machine learning techniques. The study also reveals that the contribution of bottom quarks in fermion loops coupling the light scalar to SM states is non-negligible. Phenomenology at future lepton colliders is extended through a study of the most minimal composite Higgs model of this nature, SU(4)/Sp(4), where we outline a potential search for the heavy pseudo-Nambu-Goldstone boson, considering both fermiophobic and fermiophilic couplings. While composite Higgs models seek primarily to address the separation of the electroweak and Planck energy scales, some such models may also produce dark matter candidates. We conclude this work with an investigation into heavy dark matter which couples to the top-sector of the Standard Model via a t-channel interaction, situating it within a composite Higgs model through its interaction with a heavy fermionic mediator and a top partner. Finally, the visibility of the model at colliders and astrophysical experiments is examined.
... Martínez-Sánchez et al. [35] use decision trees to model clients of a Mexican financial institution. Decision trees [36] are flowchart-like models where internal nodes split the feature space into mutually exclusive sub-regions. Final nodes, called leaves, label observations using a voting system. ...
Article
Full-text available
Money laundering is a profound global problem. Nonetheless, there is little scientific literature on statistical and machine learning methods for anti-money laundering. In this paper, we focus on anti-money laundering in banks and provide an introduction and review of the literature. We propose a unifying terminology with two central elements: (i) client risk profiling and (ii) suspicious behavior flagging. We find that client risk profiling is characterized by diagnostics, i.e., efforts to find and explain risk factors. On the other hand, suspicious behavior flagging is characterized by non-disclosed features and hand-crafted risk indices. Finally, we discuss directions for future research. One major challenge is the need for more public data sets. This may potentially be addressed by synthetic data generation. Other possible research directions include semi-supervised and deep learning, interpretability, and fairness of the results.
... A decision tree is an ML algorithm which is used for both classification and regression learning tasks.16 Based on a set of input features, a decision tree generates its corresponding output targets. This is accomplished by a series of single decisions, each representing a node or branching of the tree. ...
... By the simple function approximation theorem, we know that if we allow the gradient boosting algorithm to construct an arbitrary number of trees, and have an arbitrary amount of data, $E(Y \mid I_{1,1}, I_{1,2}, \ldots, I_{J,K}) \to E(Y \mid X)$ pointwise. This means that unregularized gradient boosting trees are unbiased estimators of the true conditional distribution of $Y \mid X$, assuming that the number of trees is allowed to grow arbitrarily (Breiman et al., 2017). This then implies gradient boosted decision trees are universal approximators, because the distribution of $Y \mid X$ was left arbitrary, meaning the trees could learn to approximate any distribution. ...
Preprint
Full-text available
Gradient boosted decision trees are some of the most popular algorithms in applied machine learning. They are a flexible and powerful tool that can robustly fit to any tabular dataset in a scalable and computationally efficient way. One of the most critical parameters to tune when fitting these models are the various penalty terms used to distinguish signal from noise in the current model. These penalties are effective in practice, but are lacking in robust theoretical justifications. In this paper we develop and present a novel theoretically justified hypothesis test of split quality for gradient boosted tree ensembles and demonstrate that using this method instead of the common penalty terms leads to a significant reduction in out of sample loss. Additionally, this method provides a theoretically well-justified stopping condition for the tree growing algorithm. We also present several innovative extensions to the method, opening the door for a wide variety of novel tree pruning algorithms.
... Determining the optimal number of terminal nodes or partitions in a decision tree is similarly challenging. Classification and regression trees (CART) (Breiman et al., 1984) recursively split the data until no additional split improves the homogeneity within the nodes. CART tends to overestimate the number of nodes and often requires ad-hoc pruning methods to prevent over-fitting (Denison et al., 1998). ...
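One standard remedy for the over-growth mentioned above is CART's cost-complexity pruning, which scikit-learn exposes directly; the sketch below (a generic illustration, with the dataset and cross-validation settings chosen arbitrarily) grows a full tree, enumerates the pruning path, and picks the pruning strength with the best cross-validated accuracy.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# candidate pruning strengths from the minimal cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    # larger ccp_alpha prunes more aggressively, yielding a smaller tree
    score = cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=alpha), X, y, cv=5
    ).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"chosen ccp_alpha = {best_alpha:.4f}, CV accuracy = {best_score:.3f}")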
... Random forests are responsible for building a large collection of de-correlated regression trees, which usually have good predictive performance (UC Business Analytics, 2018). Linear regression is a useful tool for predicting a quantitative response, and it is a widely used statistical learning method (UC Business Analytics, 2018). A regression tree (Breiman et al., 1984) is illustrated in Figure 1. This figure explains that all cases go through the regression tree and proceed to the left if the variable Fixed assets for the company is lower than a certain value (5.4e+6, which is €5,400,000) or proceed to the right if it is higher. Next, the left branch is further partitioned by the variable Dates. ...
Article
Administrative fines for GDPR infringements are growing rapidly in number, yet companies are presented with an opaque process for how these fines are issued by the data protection authorities (DPAs). In particular, one principle described within the guidelines issued by the European Data Protection Board (EDPB) requires a case-by-case assessment, which potentially stands in the way of automating administrative fines in the future. This paper challenges this principle through algorithmic arguments. The suggested approach has its benefits in terms of scalability. Yet, it may well receive well-founded criticism due to potential clashes with other principles.
... In addition, for any subhalo in , a binary variable, missed , is defined to indicate whether or not it is missed by the simulation and thus created in the step of satellite-stage completion. (ii) We train a CART tree classifier (Breiman et al. 1984) that maps ...
Preprint
We present an algorithm to extend subhalo merger trees in a low-resolution dark-matter-only simulation by conditionally matching them to those in a high-resolution simulation. The algorithm is general and can be applied to simulation data with different resolutions using different target variables. We instantiate the algorithm by a case in which trees from ELUCID, a constrained simulation of $(500h^{-1}{\rm Mpc})^3$ volume of the local universe, are extended by matching trees from TNGDark, a simulation with much higher resolution. Our tests show that the extended trees are statistically equivalent to the high-resolution trees in the joint distribution of subhalo quantities and in important summary statistics relevant to modeling galaxy formation and evolution in halos. The extended trees preserve certain information of individual systems in the target simulation, including properties of resolved satellite subhalos, and shapes and orientations of their host halos. With the extension, subhalo merger trees in a cosmological scale simulation are extrapolated to a mass resolution comparable to that in a higher-resolution simulation carried out in a smaller volume, which can be used as the input for (sub)halo-based models of galaxy formation. The source code of the algorithm, and halo merger trees extended to a mass resolution of $\sim 2 \times 10^8 h^{-1}M_\odot$ in the entire ELUCID simulation, are available.
... In CinC 2017, Zabihi et al. [32] and Kropf et al. [33] used random forest to train the extracted features to obtain classification results, because random forest is interpretable [34]. So, random forest was employed in this study. ...
Article
Full-text available
Detecting atrial fibrillation (AF) in short single-lead electrocardiograms (ECG) with a low signal-to-noise ratio (SNR) is a key requirement for wearable heart monitoring systems. This study proposed an AF detection method based on feature fusion to identify AF rhythm (A) among the other three categories of ECG recordings, that is, normal rhythm (N), other rhythm (O), and noisy (∼) ECG recordings. The four categories, that is, N, A, O, and ∼, were identified from the database provided by the PhysioNet/CinC Challenge 2017. The proposed method first unified the unbalanced 9-to-60-second ECG recordings into 30 s segments by copying, cutting, and symmetry. Then, 24 artificial features, including waveform features, interval features, frequency-domain features, and a nonlinear feature, were extracted relying on prior knowledge. Meanwhile, a 13-layer one-dimensional convolutional neural network (1-D CNN) was constructed to yield 38 abstract features. Finally, the 24 artificial features and 38 abstract features were fused to yield the feature matrix. Random forest was employed to classify the ECG recordings. In this study, the mean accuracy (Acc) of the four categories reached 0.857. The F1 of N, A, and O reached 0.837. The results showed that the proposed method had relatively satisfactory performance for identifying AF from short single-lead ECG recordings with low SNR.
... For a fixed tree structure with regions $A_1, \ldots, A_{L(j)}$, the optimal weights $c^{\star}$ can easily be found in closed form. Plugging the weights $c^{\star}$ into (4) and removing the term that does not depend on $f_j$ gives an expression that can be used as a scoring function to measure the quality of the tree structure, a role similar to the impurity score in Breiman et al. (1984). ...
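For orientation, in the widely used second-order (XGBoost-style) formulation this closed form and the resulting structure score read as follows, where $g_i$ and $h_i$ denote the first and second derivatives of the loss for observation $i$ and $\lambda$ is an $\ell_2$ penalty; this is shown only as a generic illustration of how plugging the optimal weights back in yields an impurity-like score, not necessarily the exact expressions of the cited paper:

$$c_\ell^{\star} = -\frac{\sum_{i \in A_\ell} g_i}{\sum_{i \in A_\ell} h_i + \lambda}, \qquad \mathrm{score} = -\frac{1}{2} \sum_{\ell} \frac{\left(\sum_{i \in A_\ell} g_i\right)^{2}}{\sum_{i \in A_\ell} h_i + \lambda}.$$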
Article
Full-text available
This paper details the approach of the team Kohrrelation in the 2021 Extreme Value Analysis data challenge, dealing with the prediction of wildfire counts and sizes over the contiguous US. Our approach uses ideas from extreme-value theory in a machine learning context with theoretically justified loss functions for gradient boosting. We devise a spatial cross-validation scheme and show that in our setting it provides a better proxy for test set performance than naive cross-validation. The predictions are benchmarked against boosting approaches with different loss functions, and perform competitively in terms of the score criterion, finally placing second in the competition ranking.
... DT (Breiman et al., 2017) utilizes a flowchart-like structure where decisions are made through internal nodes, and the final or 'leaf' node represents a class labelprintable or not printable. The hyperparameters tuned were the maximum depth of the tree (between 1 and 30), the minimum number of samples required to split an internal node (between 2 and 20), the minimum number of samples required to be at a leaf node (between 1 and 20), and the number of features to consider when choosing the best split (between 1 and 12). ...
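A hedged sketch of how such a hyperparameter search could be set up in scikit-learn, using the ranges quoted above as the search space; the cited study does not state that it used this exact tooling, and the synthetic data here is only a placeholder.

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# placeholder data standing in for the formulation dataset
X, y = make_classification(n_samples=170, n_features=12, random_state=0)

param_distributions = {
    "max_depth": randint(1, 31),          # 1-30
    "min_samples_split": randint(2, 21),  # 2-20
    "min_samples_leaf": randint(1, 21),   # 1-20
    "max_features": randint(1, 13),       # 1-12
}
search = RandomizedSearchCV(DecisionTreeClassifier(random_state=0),
                            param_distributions, n_iter=100, cv=5,
                            scoring="f1", random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))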
Article
Full-text available
Three-dimensional (3D) printing is drastically redefining medicine production, offering digital precision and personalized design opportunities. One emerging 3D printing technology is selective laser sintering (SLS), which is garnering attention for its high precision, and compatibility with a wide range of pharmaceutical materials, including low-solubility compounds. However, the full potential of SLS for medicines is yet to be realized, requiring expertise and considerable time-consuming and resource-intensive trial-and-error research. Machine learning (ML), a subset of artificial intelligence, is an in silico tool that is accomplishing remarkable breakthroughs in several sectors for its ability to make highly accurate predictions. Therefore, the present study harnessed ML to predict the printability of SLS formulations. Using a dataset of 170 formulations from 78 materials, ML models were developed from inputs that included the formulation composition and characterization data retrieved from Fourier-transformed infrared spectroscopy (FT-IR), X-ray powder diffraction (XRPD) and differential scanning calorimetry (DSC). Multiple ML models were explored, including supervised and unsupervised approaches. The results revealed that ML can achieve high accuracies, by using the formulation composition leading to a maximum F1 score of 81.9%. Using the FT-IR, XRPD and DSC data as inputs resulted in an F1 score of 84.2%, 81.3%, and 80.1%, respectively. A subsequent ML pipeline was built to combine the predictions from FT-IR, XRPD and DSC into one consensus model, where the F1 score was found to further increase to 88.9%. Therefore, it was determined for the first time that ML predictions of 3D printability benefit from multi-modal data, combining numeric, spectral, thermogram and diffraction data. The study lays the groundwork for leveraging existing characterization data for developing high-performing computational models to accelerate developments.
... A random forest is an ensemble of decision trees, each built on a bootstrap resample of the training dataset. Bootstrap resampling (Breiman et al., 1999) is used to reduce the variance of a decision tree. The idea is to create several subsets of data from the training dataset, chosen randomly with replacement. ...
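A minimal sketch of this idea (our own illustration; a full random forest additionally samples a random subset of features at each split, which this snippet omits): each tree is fitted on a bootstrap resample drawn with replacement, and the predictions are averaged to reduce variance.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

# bagging: fit each tree on a bootstrap resample (rows sampled with replacement)
trees = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap indices
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

X_grid = np.linspace(-3, 3, 5).reshape(-1, 1)
avg_pred = np.mean([t.predict(X_grid) for t in trees], axis=0)  # variance-reduced prediction
print(avg_pred)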
Article
Full-text available
Previous studies propose that there is a mantle upwelling that generated the Cenozoic basalts in Changbaishan. However, the dominant source and mechanism of the mantle upwelling remains highly debated. Here we apply machine learning algorithms of Random Forest and Deep Neural Network to train models using global island arc and ocean island basalts data. The trained models predict that Changbaishan basalts are highly influenced by slab-derived fluid. More importantly, the fluid effect decreases with no (87Sr/86Sr)0 and εNd(t) changes between 5 Ma and 1 Ma, then enhances with increasing εNd(t) and decreasing (87Sr/86Sr)0 after 1 Ma. We propose that a gap opened at about 5 Ma and the hot sub-slab oceanic asthenosphere rose through the gap after 1 Ma, generating the basalts enriched in fluid mobile elements and with the addition of depleted mantle component derived from the sub-slab oceanic asthenosphere.
... We also assess the change in selection performance with low and high numbers of downsampled balanced datasets. We further consider the effect of choosing the optimally regularized model as compared to the more conventional practice of choosing a parsimonious model corresponding to a tuning parameter (λ) value that satisfies the one-standard-error rule [48][49][50]. Following the terminology used in the glmnet R package, we denote the λ values corresponding to the best and one-standard-error-based models as λ_min and λ_1se in Figs 7 and 8 and Table 7. ...
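To illustrate the λ_min versus λ_1se distinction with made-up numbers (a generic sketch mirroring glmnet's terminology, not the authors' data), the one-standard-error rule picks the most parsimonious, i.e. most heavily penalized, model whose cross-validated error is within one standard error of the minimum:

import numpy as np

# hypothetical cross-validation results: mean error and its standard error per lambda
lambdas = np.logspace(-3, 0, 20)   # candidate penalties, in increasing order
cv_mean = np.array([0.42, 0.40, 0.38, 0.36, 0.35, 0.34, 0.335, 0.33, 0.332, 0.335,
                    0.34, 0.345, 0.35, 0.36, 0.37, 0.39, 0.41, 0.44, 0.47, 0.50])
cv_se = np.full_like(cv_mean, 0.01)

i_min = np.argmin(cv_mean)                 # lambda_min: best mean CV error
threshold = cv_mean[i_min] + cv_se[i_min]  # one standard error above the best
# lambda_1se: the largest (most parsimonious) lambda whose error stays within the threshold
i_1se = max(i for i in range(len(lambdas)) if cv_mean[i] <= threshold)

print("lambda_min =", lambdas[i_min], " lambda_1se =", lambdas[i_1se])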
Article
Full-text available
We develop a novel covariate ranking and selection algorithm for regularized ordinary logistic regression (OLR) models in the presence of severe class-imbalance in high dimensional datasets with correlated signal and noise covariates. Class-imbalance is resolved using response-based subsampling which we also employ to achieve stability in variable selection by creating an ensemble of regularized OLR models fitted to subsampled (and balanced) datasets. The regularization methods considered in our study include Lasso, adaptive Lasso (adaLasso) and ridge regression. Our methodology is versatile in the sense that it works effectively for regularization techniques involving both hard- (e.g. Lasso) and soft-shrinkage (e.g. ridge) of the regression coefficients. We assess selection performance by conducting a detailed simulation experiment involving varying moderate-to-severe class-imbalance ratios and highly correlated continuous and discrete signal and noise covariates. Simulation results show that our algorithm is robust against severe class-imbalance under the presence of highly correlated covariates, and consistently achieves stable and accurate variable selection with very low false discovery rate. We illustrate our methodology using a case study involving a severely imbalanced high-dimensional wildland fire occurrence dataset comprising 13 million instances. The case study and simulation results demonstrate that our framework provides a robust approach to variable selection in severely imbalanced big binary data.
... Tree and forest-based methods are tailored for direct HTE estimation without predicting potential outcomes. Su et al. propose a tree-based method inspired by the CART algorithm for classification [10,68]. They find binary covariate splits based on a t-test of differences between two potential children of a node. ...
Preprint
Full-text available
Estimating how a treatment affects different individuals, known as heterogeneous treatment effect estimation, is an important problem in empirical sciences. In the last few years, there has been a considerable interest in adapting machine learning algorithms to the problem of estimating heterogeneous effects from observational and experimental data. However, these algorithms often make strong assumptions about the observed features in the data and ignore the structure of the underlying causal model, which can lead to biased estimation. At the same time, the underlying causal mechanism is rarely known in real-world datasets, making it hard to take it into consideration. In this work, we provide a survey of state-of-the-art data-driven methods for heterogeneous treatment effect estimation using machine learning, broadly categorizing them as methods that focus on counterfactual prediction and methods that directly estimate the causal effect. We also provide an overview of a third category of methods which rely on structural causal models and learn the model structure from data. Our empirical evaluation under various underlying structural model mechanisms shows the advantages and deficiencies of existing estimators and of the metrics for measuring their performance.
... With this technique, one does not estimate the parameters of an equation or model; rather, algorithms are determined for a classification or grouping, using successive partitions of the data set. Regression and classification trees were proposed by Breiman et al. (1984). In a regression tree the response variable is quantitative, whereas if the variable is qualitative or categorical one is dealing with a classification tree. ...
Article
Full-text available
It has been empirically shown that the average level of information disclosure of Latin American universities is low and very heterogeneous (Abello 2018; Abello et al. 2018). The attributes of university corporate governance may explain the level of disclosure. However, at present only the direction of the effect of these variables on disclosure is understood. To advance the understanding of disclosure processes and the influence of corporate governance attributes, a typification of universities can be developed. Using a quantitative approach, this work aims to typify 219 Latin American universities according to their level of information disclosure, considering the attributes of their university corporate governance (GCU). To achieve this objective, a non-parametric technique called regression trees is introduced, which is used in various disciplines and belongs to the family of machine learning techniques. The results indicate that there are four nodes (or groups) of universities whose disclosure levels differ and are closely related to the attributes of their GCU. These findings make it possible to propose public policies to increase information disclosure by universities in a more targeted manner.
Article
Full-text available
On April 6, 2009, a strong earthquake (6.1 Mw) struck the city of L’Aquila, which was severely damaged, as were many neighboring towns. After this event, a digital model of the region affected by the earthquake was built and a large amount of data was collected and made available. This allowed us to obtain a very detailed dataset that accurately describes a typical historic city in central Italy. Building on this work, we propose a study that employs machine learning (ML) tools to predict damage to buildings after the 2009 earthquake. The dataset used, in its original form, contains 21 features, in addition to the target variable, which is the level of damage. We are able to differentiate between light, moderate and heavy damage with an accuracy of 59% by using the Random Forest (RF) algorithm. The level of accuracy remains almost stable using only the 12 features selected by the Boruta algorithm. In both cases, the RF tool showed an excellent ability to distinguish between moderate-heavy and light damage: around 3% of the buildings classified as seriously damaged were labeled by the algorithm as minor damage.
Article
Full-text available
Sea ice type classification is of great significance for the exploration of waterways, fisheries, and offshore operations in the Arctic. However, to date, there is no multiple remote sensing method to detect sea ice type in the Arctic. This study develops a multiple sea ice type algorithm using the HaiYang-2B Scatterometer (HY-2B SCA). First, the parameters most applicable to classify sea ice type are selected through feature extraction, and a stacking model is established for the first time, which integrates decision tree and image segmentation algorithms. Finally, multiple sea ice types are classified in the Arctic, comprising Nilas, Young Ice, First Year Ice, Old Ice, and Fast Ice. Comparing the results with the Ocean and Sea Ice Satellite Application Facility (OSI-SAF) Sea Ice Type dataset (SIT) indicates that the sea ice type classified by HY-2B SCA (Stacking-HY2B) is similar to OSI-SAF SIT with regard to the changing trends in extent of sea ice. We use the Copernicus Marine Environment Monitoring Service (CMEMS) high-resolution sea ice type data and EM-Bird ice thickness data to validate the result, and accuracies of 87% and 88% are obtained, respectively. This indicates that the algorithm in this work is comparable with the performance of OSI-SAF dataset, while providing information of multiple sea ice types.
Article
Full-text available
Particle size, shape and morphology can be considered the most significant functional parameters; their effects on increasing the performance of oral solid dosage formulations are indisputable. Supercritical carbon dioxide fluid (SCCO2) technology is an effective approach to control the above-mentioned parameters in oral solid dosage formulation. In this study, drug solubility measurement is investigated using an artificial intelligence model, with carbon dioxide as a common supercritical solvent at different pressures and temperatures (120–400 bar, 308–338 K). The results indicate that pressure has a strong effect on drug solubility. In this investigation, Decision Tree (DT), Adaptive Boosted Decision Trees (ADA-DT), and Nu-SVR regression models are used for the first time as novel models on the available data, which have two inputs, pressure, X1 = P (bar), and temperature, X2 = T (K). The output is Y = solubility. In terms of R-squared score, DT, ADA-DT, and Nu-SVR showed results of 0.836, 0.921, and 0.813. In terms of MAE, they showed error rates of 4.30E−06, 1.95E−06, and 3.45E−06. Another metric is RMSE, in which DT, ADA-DT, and Nu-SVR showed error rates of 4.96E−06, 2.34E−06, and 5.26E−06, respectively. Based on the analysis outputs, ADA-DT was selected as the best and novel model, and the optimal outputs can be shown via the vector: (x1 = 309, x2 = 317.39, Y1 = 7.03e−05).
Article
Liver cirrhosis is an important cause of death worldwide. Therefore, early diagnosis of the disease is very important. Machine learning algorithms are frequently used due to their high performance in the field of health, as in many other areas. In this study, Multilayer Perceptron Artificial Neural Networks, Decision Trees, Random Forest, Naïve Bayes, Support Vector Machines, K-Nearest Neighborhood, and Logistic Regression classification algorithms are used to classify the factors affecting liver cirrhosis. The performances of these algorithms are compared according to accuracy rate, F measure, sensitivity, specificity and Kappa score on real data obtained from 2000 liver cirrhosis patients, and the factors affecting the disease are classified with the most appropriate algorithm. In addition, more than 50 articles covering both liver disease and classification methods are reviewed and the latest developments are presented in the study.
Chapter
Full-text available
The online traces that students leave on electronic learning platforms; the improved integration of educational, administrative and online data sources; and the increasing accessibility of hands-on software allow the domain of learning analytics to flourish. Learning analytics, as an interdisciplinary domain borrowing from statistics, computer science and education, exploits the increased accessibility of technology to foster an optimal learning environment that is both transparent and cost-effective. This chapter illustrates the potential of learning analytics to stimulate learning outcomes and to contribute to educational quality management. Moreover, it discusses the increasing emergence of large and accessible data sets in education and compares the cost-effectiveness of learning analytics to that of costly and unreliable retrospective studies and surveys. The chapter showcases the potential of methods that permit savvy users to make insightful predictions about student types, performance and the potential of reforms. The chapter concludes with recommendations and with challenges to the implementation and growth of learning analytics.
Chapter
Overloading is one of the faults that occur very often in the operation of electrical machines. Therefore, continuous monitoring and diagnosis of this fault is necessary in safety-critical applications. This paper presents a sound analysis system used for detecting and classifying induction motor and power transformer overload levels with a microphone. Three acoustic features and six classification models are evaluated. The obtained results show that this is a promising way to monitor electrical machine overload.
Keywords: Sound analysis; Electrical machine overload; Machine learning
Chapter
The Support Vector Machine (SVM) is a widely used algorithm for batch classification, with a run-time- and memory-efficient counterpart given by the Core Vector Machine (CVM). Both algorithms have nice theoretical guarantees, but are not able to handle data streams, which have to be processed instance by instance. We propose a novel approach to handle stream classification problems via an adaptation of the CVM, which is also able to handle multiclass classification problems. Furthermore, we compare our Multiclass Core Vector Machine (MCCVM) approach against another existing Minimum Enclosing Ball (MEB)-based classification approach. Finally, we propose a real-world streaming dataset, which consists of changeover detection data and has only been analyzed in offline settings so far.
Article
Several studies have focused on the genetic ability to taste the bitter compound 6-n-propylthiouracil (PROP) to assess inter-individual taste variability in humans and its effect on food preferences, nutrition, and health. PROP taste sensitivity, and that of other chemical molecules throughout the body, is mediated by the bitter receptor TAS2R38, and its variability is significantly associated with TAS2R38 genetic variants. We recently identified PROP phenotypes automatically and with high precision using Machine Learning (ML). Here we have used Supervised Learning (SL) algorithms to automatically identify TAS2R38 genotypes from the biological features of eighty-four participants. The CatBoost algorithm was the best-suited model for the automatic discrimination of the genotypes. It allowed us to automatically predict the genotypes and to precisely define the effectiveness and impact of each feature. The ratings of perceived intensity for PROP solutions (0.32 and 0.032 mM) and the medium taster (MT) category were the most important features for training the model and understanding the difference between genotypes. Our findings suggest that SL may represent a trustworthy and objective tool for identifying TAS2R38 variants which, by reducing the costs and times of molecular analysis, can find wide application in taste physiology and medicine studies.
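A minimal sketch of the CatBoost workflow mentioned above follows: fit a CatBoostClassifier on a small feature table and inspect feature importances. It assumes the catboost package is installed and uses synthetic stand-ins for the participants' taste-rating features, not the study's data.

```python
# Hedged sketch: CatBoost classification of two genotype classes on synthetic features.
import numpy as np
from catboost import CatBoostClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for features such as PROP intensity ratings and taster category.
X = rng.normal(size=(84, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 84) > 0).astype(int)

model = CatBoostClassifier(iterations=200, depth=4, verbose=False, random_seed=0)
model.fit(X, y)
print(model.get_feature_importance())   # relative impact of each feature
```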
Article
Machine learning models are brittle, and small changes in the training data can result in different predictions. We study the problem of proving that a prediction is robust to data poisoning, where an attacker can inject a number of malicious elements into the training set to influence the learned model. We target decision tree models, a popular and simple class of machine learning models that underlies many complex learning techniques. We present a sound verification technique based on abstract interpretation and implement it in a tool called Antidote. Antidote abstractly trains decision trees for an intractably large space of possible poisoned datasets. Due to the soundness of our abstraction, Antidote can produce proofs that, for a given input, the corresponding prediction would not have changed, whether or not the training set had been tampered with. We demonstrate the effectiveness of Antidote on a number of popular datasets.
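The property being certified can be made concrete with a naive brute-force check, shown below for a related, simpler perturbation (deleting up to k training points) on a tiny dataset. This exhaustive enumeration is emphatically not the paper's abstract-interpretation approach, which is what makes the intractably large space of poisoned sets feasible; it only illustrates what a robustness proof asserts.

```python
# Naive illustration: check that a decision stump's prediction on one test point
# is unchanged under deletion of up to k training points (tiny synthetic data).
from itertools import combinations
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
y = (X[:, 0] > 0).astype(int)
x_test = np.array([[0.5, -0.2]])
k = 1  # perturbation budget (here: number of deletions)

base_pred = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y).predict(x_test)[0]
robust = True
for r in range(1, k + 1):
    for removed in combinations(range(len(X)), r):
        keep = [i for i in range(len(X)) if i not in removed]
        pred = DecisionTreeClassifier(max_depth=1, random_state=0).fit(
            X[keep], y[keep]).predict(x_test)[0]
        if pred != base_pred:
            robust = False
print("prediction robust to deleting up to", k, "points:", robust)
```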
Article
Full-text available
Objective: To design a statistical model that determines the academic factors affecting the results of the Saber Pro tests. Methodology: Multivariate relationship and learning techniques were used to establish the relationship between a set of academic and sociodemographic variables and their influence on the Saber Pro results, through the design and selection of a multivariate statistical model that optimally determines the academic factors affecting the test results. Results: No significant differences were observed between the models and the reality reflected in the validation sample, except for the PCME test, for which the Random Forest model does not pass the validation hypothesis test. The multivariate linear regression model showed no significant differences in any of the tests, whereas the Random Forest model showed differences for certain values of α in ING and FPI, in addition to rejecting the equality hypothesis for the PCME test. Conclusions: Any of the techniques used in the study can support a predictive model that allows the institution to generate strategies aimed at creating policies to improve student performance. However, according to the hypothesis tests, multivariate linear regression is the best-positioned technique in this study.
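The comparison above can be sketched as fitting a multivariate linear regression and a Random Forest to the same predictors and comparing held-out fit, as below. The predictors and the score are synthetic placeholders, not the Saber Pro data.

```python
# Illustrative sketch: multivariate linear regression vs. Random Forest on
# stand-in academic/sociodemographic predictors of a test score.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                            # stand-in predictor variables
score = X @ rng.normal(size=6) + rng.normal(0, 1, 500)   # stand-in test score

X_tr, X_te, y_tr, y_te = train_test_split(X, score, random_state=0)
for name, model in [("linear", LinearRegression()),
                    ("random_forest", RandomForestRegressor(random_state=0))]:
    model.fit(X_tr, y_tr)
    print(name, r2_score(y_te, model.predict(X_te)))
```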
Article
Full-text available
Co-amorphous systems (COAMS) have raised increasing interest in the pharmaceutical industry, since they combine the increased solubility and/or faster dissolution of amorphous forms with the stability of crystalline forms. However, the choice of the co-former is critical for the formation of a COAMS. While some models exist to predict the potential formation of COAMS, they often focus on a limited group of compounds. Here, four classes of combinations of an active pharmaceutical ingredient (API) with (1) another API, (2) an amino acid, (3) an organic acid, or (4) another substance were considered. A model using gradient boosting methods was developed to predict the successful formation of COAMS for all four classes. The model was tested on data not seen during training and predicted 15 out of 19 examples correctly. In addition, the model was used to screen for new COAMS in binary systems of two APIs for inhalation therapy, as diseases such as tuberculosis, asthma, and COPD usually require complex multidrug therapy. Three of these new API-API combinations were selected for experimental testing and co-processed via milling. The experiments confirmed the predictions of the model in all three cases. This data-driven model will facilitate and expedite the screening phase for new binary COAMS.
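The modelling step described above amounts to binary classification with gradient boosting; a hedged sketch follows, using scikit-learn and placeholder molecular descriptors rather than the published feature set or data.

```python
# Hedged sketch, not the published model: gradient boosting to predict whether
# an API/co-former pair forms a co-amorphous system (synthetic descriptors).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))              # stand-in descriptors of the API/co-former pair
y = (X[:, 0] + X[:, 3] > 0).astype(int)    # 1 = COAMS forms (synthetic label)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```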
Preprint
Full-text available
Background: The effect of duodenal exclusion on glycemic regulation has yet to be defined. Individuals with type 2 diabetes mellitus (T2DM) operated on for reasons other than obesity represent an adequate model for analyzing the clinical outcomes of duodenal exclusion. Objective: To analyze the changes in glycemia and pharmacotherapy for T2DM in patients undergoing gastrectomy with Roux-en-Y reconstruction for gastric cancer. Methods: An observational study was conducted in 2018 on patients who underwent surgery between 2001 and 2016. Medical records of a cohort of 129 patients operated on in two public hospitals were analyzed retrospectively before surgery (T0) and one year after (T1). The research protocol was approved by the ethics committee. The final sample was mainly composed of women (50.5%), with a mean age of 65.5 years and a mean body mass index of 26.5 kg/m2 (SD 4.30). Results: One year later, mean glucose levels of the entire sample had decreased (p=0.046), but 70% of patients with glycemia > 100 at T0 remained at that level at T1. Glycated hemoglobin showed no significant change (p=0.988). Regarding the pharmacotherapy for T2DM, 60.7% of the sample had no change; however, 6.7% discontinued medication owing to improvement of T2DM. A multivariate classification and regression tree (CART) model identified age (<62.5 years) and body mass index (>30.2 kg/m2) as predictors of change in T2DM medication, with a predictive value of 71.4%. Conclusion: There was no improvement in glycemia or pharmacotherapy in patients with T2DM and a body mass index below 30 kg/m2 who underwent gastrectomy with Roux-en-Y reconstruction.
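For readers unfamiliar with how CART produces threshold-style predictors such as those reported above, the sketch below fits a shallow decision tree on two variables (age, BMI) and prints the learned rules. The data and labels are synthetic, constructed to mimic the reported cut-points; the actual thresholds (age < 62.5 years, BMI > 30.2 kg/m2) come from the authors' cohort.

```python
# Minimal sketch: a CART-style tree on age and BMI, with the learned split
# thresholds printed as text rules (synthetic data only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
age = rng.uniform(40, 85, 129)
bmi = rng.uniform(18, 40, 129)
changed = ((age < 62.5) & (bmi > 30.2)).astype(int)   # synthetic labels mimicking the reported rule

X = np.column_stack([age, bmi])
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, changed)
print(export_text(tree, feature_names=["age", "bmi"]))
```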
Article
Introduction: The creation of systems based on blockchain technology is a promising area of modern research and development. Blockchain is a reliable and secure way to store transaction data, providing integrity-verification capabilities, and blockchain technology is now widely used around the world in many spheres of life. The article considers the essence and characteristics of post-quantum electronic signature algorithms based on algebraic lattices and provides a comparative analysis of these algorithms by conditional and unconditional criteria. Results: The advantages and disadvantages of existing cryptographic algorithms, the operating principles and specifics of blockchain technology, and the possibility of using post-quantum algorithms for signing blocks are determined.
Article
The goal is to improve the quality of information exchange between objects by increasing the reliability of message delivery and the available bandwidth through the integration of networks that use different signals and physical channels. A search was carried out for new technical solutions for creating integrated (hybrid) communication technologies aimed at expanding the possibilities for building adaptive, self-organizing communication systems that are resistant to destabilizing factors and provide improved quality of service over vast areas. The method of solving these tasks is based on analyzing development trends and forecasting the requirements for integrated mobile communication systems. The novelty lies in the development and evaluation of options for building the structures of integrated communication systems, represented by diverse, complementary subsystems using various methods of channel separation, including in different physical media, as well as algorithms for their operation. Key findings: A set of technologies for the development of coastal integrated communication systems has been developed: a system structure based on a hybrid mesh topology; a basic form of the carrier signal with parameters controlled according to the physical channel; a principle for controlling orthogonal signals in adjacent channels of switching and relay nodes, as well as between network nodes; and a hydroacoustic communication method. Hydroacoustic communication in coastal areas offers the potential to increase the number of simultaneously usable information channels per unit volume of interaction between vital objects and makes it possible to connect satellite, air, surface, underwater, and bottom telecommunication subsystems. The conducted studies have shown the stability of the proposed methods under conditions where interference exceeds the signal.
Article
Introduction: The article deals with issues related to the manifestation of the primary fundamental properties of large systems, such as synergy and emergence, which are potentially inherent in large systems composed of components with the same or similar properties. These properties can manifest over a certain time interval when the values of the signal phases are measured simultaneously. It is shown that the manifestation of the synergistic property in a system of simultaneously and independently operating generators makes it possible to reduce the error in estimating random fluctuations of the generator frequencies, while the manifestation of the emergent property eliminates a constant frequency offset from the nominal value. Based on an analysis of the properties of the obtained estimates, the conditions for obtaining unbiased, efficient estimates of the frequency of each generator are formulated. Result: The manifestation of the properties of synergy and emergence in a system of simultaneously and independently functioning generators is demonstrated. It is noted that, from a physical point of view, the increase in the accuracy of the frequency estimates corresponds to multiple indirect measurements of unequal precision. The simulated frequency deviations of each generator, and the differences between these deviations and the corresponding estimates, are shown. The results of mathematical modeling are presented, confirming the validity of the theoretical results and the main patterns noted.
Article
Introduction: Smart lifecycle services, along with the adoption of other smart manufacturing strategies, show significant potential to increase the productivity and competitiveness of enterprises. There is a clear need for high-quality process models and software representations of physical hardware that reflect the evolution of their physical counterparts in detail. Digital twin technology can provide fertile ground for the development of IoT-based lifecycle applications. This paper proposes a software approach to the process of developing a digital twin. The purpose of this work is to study methods for creating software frameworks in the context of digital twin technology, to study the process of creating and integrating digital twins, and to develop our own tool. Existing methods for creating twins are considered and a self-developed Python library is described. The work uses action research methods and the Agile development philosophy. Methods: The choice of methodology for answering the questions that arise when studying ways to improve digital twins falls mainly into the category of action research. In this paper, action research is used as a qualitative tool, well suited to situations where the researcher seeks to achieve two different goals: to solve a contemporary problem faced by the organization, and to contribute to a pool of knowledge that can later be used by others to solve problems of the same class. Results: The created library supports two modes of data retrieval: periodic database queries and a TCP socket connection. The experiments carried out as part of this work allow us to conclude that the library can already be used as a tool for creating and integrating a digital twin.
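To make the two data-retrieval modes mentioned in the results concrete, the following hypothetical sketch (not the authors' library) implements a periodic database poll and a TCP socket listener with only the Python standard library; the table name, query, host, and port are made-up placeholders.

```python
# Hypothetical sketch of twin data retrieval: DB polling plus a TCP listener.
import sqlite3, socket, threading, time

def poll_database(db_path="twin.db", interval=5.0):
    """Periodically query the latest sensor reading from a local SQLite database."""
    while True:
        with sqlite3.connect(db_path) as conn:
            # Placeholder schema so the sketch also runs against an empty database.
            conn.execute("CREATE TABLE IF NOT EXISTS readings (ts REAL, value REAL)")
            row = conn.execute(
                "SELECT value FROM readings ORDER BY ts DESC LIMIT 1").fetchone()
        print("polled value:", row)
        time.sleep(interval)

def listen_tcp(host="127.0.0.1", port=9000):
    """Accept raw sensor messages pushed over a TCP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((host, port))
        srv.listen()
        conn, _ = srv.accept()
        with conn:
            while data := conn.recv(1024):
                print("received:", data.decode(errors="replace"))

# Run both retrieval modes side by side in background threads for a short demo.
threading.Thread(target=poll_database, daemon=True).start()
threading.Thread(target=listen_tcp, daemon=True).start()
time.sleep(12)
```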