Article

AttributeRank: An Algorithm for Attribute Ranking in Clinical Variable Selection


Abstract

Background: Risk difference is a valuable measure of association in epidemiology and healthcare that has the potential to be used in medical and clinical variable selection.

Objective: In this study, an attribute ranking algorithm, called AttributeRank, was developed to facilitate variable selection from clinical data sets.

Methods: The algorithm computes the risk difference between a predictor and the response variable to determine the level of importance of the predictor. The performance of the algorithm was compared with existing variable selection algorithms using five clinical data sets on neonatal birthweight, bacterial survival after treatment, myocardial infarction, breast cancer, and diabetes.

Results: The variable subsets selected by AttributeRank yielded the highest average classification accuracy across the data sets, compared with Fisher score, Pearson's correlation, the variable importance function, and Chi-Square.

Conclusion: AttributeRank proved more valuable than the existing algorithms for attribute ranking of clinical data sets and should be implemented in a user-friendly application in future research.
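The abstract describes the core ranking step of AttributeRank but does not include the authors' implementation. The sketch below is a minimal base-R illustration under stated assumptions: all predictors and the response are coded 0/1, and the function name and toy data set are hypothetical, not from the paper.

```r
# Minimal sketch of risk-difference ranking (illustrative, not the authors' code).
# Assumes every predictor and the response in `data` are binary 0/1.
risk_difference_rank <- function(data, response) {
  predictors <- setdiff(names(data), response)
  y <- data[[response]]
  rd <- sapply(predictors, function(p) {
    x <- data[[p]]
    mean(y[x == 1]) - mean(y[x == 0])   # P(y = 1 | x = 1) - P(y = 1 | x = 0)
  })
  sort(abs(rd), decreasing = TRUE)      # larger |risk difference| = higher rank
}

# Toy example: x1 is informative, x2 is noise
set.seed(1)
toy <- data.frame(x1 = rbinom(100, 1, 0.5), x2 = rbinom(100, 1, 0.5))
toy$y <- rbinom(100, 1, ifelse(toy$x1 == 1, 0.7, 0.3))
risk_difference_rank(toy, "y")   # x1 should rank above x2
```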

References
Article
Full-text available
This research developed and tested a filter algorithm that serves to reduce the feature space in healthcare datasets. The algorithm binarizes the dataset, then separately evaluates the risk ratio of each predictor with the response and outputs ratios that represent the association between a predictor and the class attribute. The value of the association translates to the importance rank of the corresponding predictor in determining the outcome. Using random forest and logistic regression classification, the performance of the developed algorithm was compared against the regsubsets and varImp functions, which are unsupervised methods of variable selection. Likewise, the proposed algorithm was compared with the supervised Fisher score and Pearson's correlation feature selection methods. Different datasets were used for the experiment, and, in the majority of cases, the predictors selected by the new algorithm outperformed those selected by the existing algorithms. The proposed filter algorithm is therefore a reliable alternative for variable ranking in data mining classification tasks with a dichotomous response.
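The risk-ratio step this abstract describes can be sketched in a few lines; the function below is a hypothetical illustration assuming binary 0/1 vectors, not the paper's own binarization and ranking code.

```r
# Illustrative risk-ratio computation for one binarized predictor x and response y
risk_ratio <- function(x, y) {
  mean(y[x == 1]) / mean(y[x == 0])   # P(y = 1 | x = 1) / P(y = 1 | x = 0)
}
```

Ratios farther from 1 in either direction indicate a stronger association, so a ranking could, for example, sort predictors by abs(log(risk_ratio(x, y))).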
Article
Full-text available
We propose two variable selection methods using signomial classification. We attempt to select, among a set of input variables, the variables that lead to the best performance of the classifier. One method repeatedly removes variables based on backward selection, whereas the second method directly selects a set of variables by solving an optimization problem. The proposed methods conduct variable selection considering nonlinear interactions of variables and obtain a signomial classifier with the selected variables. Computational results show that the proposed methods select desirable variables for predicting the output more effectively and provide classifiers with better or comparable test error rates, compared with existing methods.
Article
Full-text available
Multivariable regression models are often used in transplantation research to identify or to confirm baseline variables which have an independent association, causally or only evidenced by statistical correlation, with transplantation outcome. Although sound theory is lacking, variable selection is a popular statistical method which seemingly reduces the complexity of such models. However, in fact, variable selection often complicates analysis as it invalidates common tools of statistical inference such as p-values and confidence intervals. This is a particular problem in transplantation research, where sample sizes are often only small to moderate. Furthermore, variable selection requires computer-intensive stability investigations and a particularly cautious interpretation of results. We discuss how five common misconceptions often lead to inappropriate application of variable selection. We emphasize that variable selection and all problems related with it can often be avoided by the use of expert knowledge.
Article
Full-text available
The selection of variables in regression problems has occupied the minds of many statisticians. Several Bayesian variable selection methods have been developed, and we concentrate on the following methods: Kuo & Mallick, Gibbs Variable Selection (GVS), Stochastic Search Variable Selection (SSVS), adaptive shrinkage with Jeffreys' prior or a Laplacian prior, and reversible jump MCMC. We review these methods, in the context of their different properties. We then implement the methods in BUGS, using both real and simulated data as examples, and investigate how the different methods perform in practice. Our results suggest that SSVS, reversible jump MCMC and adaptive shrinkage methods can all work well, but the choice of which method is better will depend on the priors that are used, and also on how they are implemented.
Article
Background The birthweight of a newborn is critical to their health, development, and well-being. Previous studies that used maternal characteristics to predict birthweight did not employ a harmonised scale to assess the risk of low birthweight (LBW). Objective The goal of this study was to develop a new instrument that uses items on a uniform scale to assess the risk of an LBW in a pregnant woman. Methods Item response theory was employed to evaluate a similar existing scale, and some weaknesses were identified. Results Based on the observed weaknesses of the existing scale, a new uniform scale was developed, which is a 3-point Likert scale consisting of seven items. Conclusion The scale, termed birthweight questionnaire, is a valuable tool for collecting data that could assist in assessing the risk of an LBW at every stage of pregnancy.
Article
This study developed an algorithm, TerrorClassify, which places terror organizations into hierarchical categories of casualties and consequences. A previous study proposed a method of categorizing terrorists into four classes based on the extent of havoc caused by individual terror groups. The classes include low-casualty, medium-consequence terror groups; medium-casualty, high-consequence terror groups; high-casualty, low-consequence terror groups; and higher-casualty, low-consequence terror groups. In this research, an algorithm was designed to show the procedures to be followed in placing terror groups into the four classes using the records in the global terrorism database. The algorithm can be implemented with any programming language.
Article
This research deployed agglomerative hierarchical clustering to extract clusters from the coronavirus disease 2019 (COVID-19) data based on the morbidity and mortality of the novel virus across 206 countries, territories and areas. As of 2nd April, 2020, a total of 896,475 confirmed cases were reported across the world. Three clusters were extracted from the data on the basis of the morbidity and mortality of COVID-19: low-confirmed-cases, low-new-cases, low-deaths and low-new-deaths countries [Cluster 1]; medium-confirmed-cases, low-new-cases, medium-deaths, and medium-new-deaths countries [Cluster 2]; and high-confirmed-cases, high-new-cases, high-deaths, and high-new-deaths countries [Cluster 3]. It is recommended that, to contain the pandemic, countries within a cluster should cooperate, share information, and learn from the mistakes or strategies (as the case may be) of the countries in other clusters. Among other benefits, this can prevent countries within the low-confirmed-cases cluster from progressing to the high-confirmed-cases cluster.
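Agglomerative hierarchical clustering of this kind is available in base R. The sketch below is a minimal illustration under assumptions: the column names and six rows of toy counts are hypothetical stand-ins for the 206-country data, and complete linkage is one of several reasonable choices, not necessarily the study's.

```r
# Toy stand-in for the COVID-19 morbidity/mortality counts (values hypothetical)
covid <- data.frame(
  confirmed  = c(1200, 900, 85000, 70000, 210000, 190000),
  new_cases  = c(10, 8, 2500, 2100, 7000, 6500),
  deaths     = c(15, 9, 4000, 3500, 9000, 8200),
  new_deaths = c(1, 0, 120, 100, 350, 300)
)
d  <- dist(scale(covid))              # Euclidean distance on standardized counts
hc <- hclust(d, method = "complete")  # agglomerative (bottom-up) clustering
cutree(hc, k = 3)                     # cut the dendrogram into three clusters
```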
Chapter
A filter feature selection algorithm is developed and its performance tested. In the initial step, the algorithm dichotomizes the dataset then separately computes the association between each predictor and the class variable using relative odds (odds ratios). The value of the odds ratios becomes the importance ranking of the corresponding explanatory variable in determining the output. Logistic regression classification is deployed to test the performance of the new algorithm in comparison with three existing feature selection algorithms: the Fisher index, Pearson's correlation, and the varImp function. A number of experimental datasets are employed, and in most cases, the subsets selected by the new algorithm produced models with higher classification accuracy than the subsets suggested by the existing feature selection algorithms. Therefore, the proposed algorithm is a reliable alternative in filter feature selection for binary classification problems.
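The relative-odds computation at the heart of this chapter's algorithm can be illustrated with a 2x2 contingency table; the function below is a hedged sketch assuming binary 0/1 inputs, not the chapter's code.

```r
# Odds ratio from a 2x2 table: (n11 * n00) / (n10 * n01)
odds_ratio <- function(x, y) {
  tab <- table(x, y)   # rows: x = 0/1, columns: y = 0/1
  (tab["1", "1"] * tab["0", "0"]) / (tab["1", "0"] * tab["0", "1"])
}
```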
Book
A guide to using S environments to perform statistical analyses, providing both an introduction to the use of S and a course in modern statistical methods. The emphasis is on presenting practical problems and full analyses of real data sets.
Article
The focus of the present paper is to propose and discuss different procedures for performing variable selection in a multi-block regression context. In particular, the focus is on two multi-block regression methods: Multi-Block Partial Least Squares (MB-PLS) and Sequential and Orthogonalized Partial Least Squares (SO-PLS) regression. A small simulation study for regular PLS regression was conducted in order to select the most promising methods to investigate further in the multi-block context. The combinations of three variable selection methods with MB-PLS and SO-PLS are examined in detail: Variable Importance in Projection (VIP), Selectivity Ratio (SR), and forward selection. In this paper we focus on both prediction ability and interpretation. The different approaches are tested on three types of data: one sensory data set, one spectroscopic (Raman) data set, and a number of simulated multi-block data sets.
Article
Many feature selection methods are available in the literature, driven by the availability of data sets with hundreds of variables and hence very high dimension. Feature selection methods provide a way of reducing computation time, improving prediction performance, and better understanding the data in machine learning or pattern recognition applications. In this paper we provide an overview of some of the methods present in the literature. The objective is to provide a generic introduction to variable elimination that can be applied to a wide array of machine learning problems. We focus on filter, wrapper, and embedded methods. We also apply some of the feature selection techniques to standard datasets to demonstrate their applicability.
Article
With the increasing ease of measuring multiple variables per object, variable selection for data reduction and improved interpretability is gaining importance. There are numerous suggested methods for variable selection in the data analysis and statistics literature, and it is a challenge to stay updated on all the possibilities. We therefore present a review of available methods for variable selection within one of the many modeling approaches for high-throughput data, Partial Least Squares Regression. The aim of this paper is mainly to collect and briefly present the methods in such a way that the reader can easily get an understanding of the characteristics of each method and a basis for selecting an appropriate one for their own use. For each method we also give references to its use in the literature for further reading, as well as to software availability.
Article
We introduce a novel wrapper algorithm for feature selection using support vector machines with kernel functions. Our method is based on sequential backward selection, using the number of errors in a validation subset as the measure for deciding which feature to remove in each iteration. We compare our approach with other algorithms, such as a filter method and Recursive Feature Elimination SVM, to demonstrate its effectiveness and efficiency.
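The sequential backward selection loop this abstract describes can be sketched in base R. To keep the example self-contained, logistic regression (glm) stands in for the SVM, and the stopping rule shown is one plausible choice; none of this is the authors' code.

```r
# Hedged sketch of wrapper-style backward selection on a validation subset.
# Assumes a binary 0/1 response; glm replaces the SVM for a base-R example.
backward_select <- function(train, valid, response) {
  feats <- setdiff(names(train), response)
  val_err <- function(keep) {   # misclassification error on the validation set
    fit  <- glm(reformulate(keep, response), data = train, family = binomial)
    pred <- as.numeric(predict(fit, newdata = valid, type = "response") > 0.5)
    mean(pred != valid[[response]])
  }
  repeat {
    if (length(feats) <= 1) break
    base_err <- val_err(feats)
    errs <- sapply(feats, function(f) val_err(setdiff(feats, f)))
    if (min(errs) > base_err) break                  # no removal helps; stop
    feats <- setdiff(feats, names(which.min(errs)))  # drop the worst feature
  }
  feats
}
```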
Article
I evaluated the predictive ability of statistical models obtained by applying seven methods of variable selection to 12 ecological and environmental data sets. Cross-validation, involving repeated splits of each data set into training and validation subsets, was used to obtain honest estimates of predictive ability that could be fairly compared among methods. There was surprisingly little difference in predictive ability among five methods based on multiple linear regression. Stepwise methods performed similarly to exhaustive algorithms for subset selection, and the choice of criterion for comparing models (Akaike's information criterion, Schwarz's Bayesian information criterion or F statistics) had little effect on predictive ability. For most of the data sets, two methods based on regression trees yielded models with substantially lower predictive ability. I argue that there is no 'best' method of variable selection and that any of the regression-based approaches discussed here is capable of yielding useful predictive models.
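The repeated split-sample validation the abstract relies on is straightforward to sketch. The function below is an illustrative assumption; its name, the linear model, and the RMSE criterion are stand-ins, not the paper's exact setup.

```r
# Repeated random train/validation splits for an "honest" error estimate
cv_rmse <- function(data, response, n_splits = 100, frac = 0.7) {
  errs <- replicate(n_splits, {
    idx  <- sample(nrow(data), floor(frac * nrow(data)))   # training rows
    fit  <- lm(reformulate(".", response), data = data[idx, ])
    pred <- predict(fit, newdata = data[-idx, ])
    sqrt(mean((data[-idx, response] - pred)^2))            # validation RMSE
  })
  mean(errs)   # average over splits
}
```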
Article
A frequently encountered challenge in high-dimensional regression is the detection of relevant variables. Variable selection suffers from instability and the power to detect relevant variables is typically low if predictor variables are highly correlated. When taking the multiplicity of the testing problem into account, the power diminishes even further. To gain power and insight, it can be advantageous to look for influence not at the level of individual variables but rather at the level of clusters of highly correlated variables. We propose a hierarchical approach. Variable importance is first tested at the coarsest level, corresponding to the global null hypothesis. The method then tries to attribute any effect to smaller subclusters or even individual variables. The smallest possible clusters, which still exhibit a significant influence on the response variable, are retained. It is shown that the proposed testing procedure controls the familywise error rate at a prespecified level, simultaneously over all resolution levels. The method has power comparable to the Bonferroni–Holm procedure on the level of individual variables and dramatically larger power for coarser resolution levels. The best resolution level is selected adaptively.
S. Kaushik. "Introduction to Feature Selection Methods With an Example." Analytics Vidhya, accessed.
Rdatasets. "Available Datasets," accessed.
M. Kuhn, J. Wing, and S. Weston. "caret: Classification and Regression Training." R package version 6.0-77, accessed.
S. K. Gajawada. "Chi-Square Test for Feature Selection in Machine Learning," accessed.
J. H. Maindonald and W. J. Braun. "DAAG: Data Analysis and Graphics Data and Functions." R package version 1.22.1, accessed.