Article

A Fast Clustering-Based Feature Subset Selection Algorithm for High Dimensional Data

IEEE Transactions on Knowledge and Data Engineering (Impact Factor: 1.82). 01/2011; 25(99):1 - 1. DOI: 10.1109/TKDE.2011.181
Source: IEEE Xplore

ABSTRACT A fast clustering-based feature selection algorithm, FAST, is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum-spanning tree (MST) clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study: extensive experiments are carried out to compare FAST with several representative feature selection algorithms with respect to four types of well-known classifiers, before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text datasets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
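Since this page does not reproduce the algorithm itself, the following is a minimal sketch of the two-step procedure the abstract describes, assuming features are already discretized and using symmetric uncertainty (SU) as the correlation measure, as in the paper. The helper names (`entropy`, `symmetric_uncertainty`, `fast_select`) are illustrative, networkx is used for the MST, and the paper's preliminary irrelevant-feature filtering step is omitted.

```python
import numpy as np
import networkx as nx
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a discrete sequence."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))      # joint entropy via paired symbols
    mi = hx + hy - hxy                  # mutual information I(X; Y)
    return 2.0 * mi / (hx + hy) if hx + hy > 0 else 0.0

def fast_select(X, y):
    """X: (n_samples, n_features) discrete matrix; y: class labels.
    Note: the all-pairs SU computation below is O(n_features^2); the paper
    first removes irrelevant features, which this sketch skips."""
    n = X.shape[1]
    relevance = [symmetric_uncertainty(X[:, i], y) for i in range(n)]
    # Step 1: complete graph weighted by feature-feature SU, then an MST.
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            G.add_edge(i, j, weight=symmetric_uncertainty(X[:, i], X[:, j]))
    mst = nx.minimum_spanning_tree(G)
    # Remove each MST edge whose weight is below both endpoints' class
    # relevance; the surviving connected components are the clusters.
    for i, j in list(mst.edges()):
        if mst[i][j]["weight"] < min(relevance[i], relevance[j]):
            mst.remove_edge(i, j)
    # Step 2: keep the most class-relevant feature of each cluster.
    return [max(comp, key=lambda k: relevance[k])
            for comp in nx.connected_components(mst)]
```

Removing the low-weight MST edges splits the tree into a forest; each tree is one cluster, and keeping only its most class-relevant representative is what yields the small, mutually independent subsets the abstract claims.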

  • Source
    • "If the class-relevance of a feature is lower than that of another and the correlation between them, it would be identified as a redundant features and thus to be removed. Recently, an extenuation of FCBF was proposed in order to identify redundant features more precisely [39]. All of the above mentioned methods take pairwise correlation as the redundancy index and identify features with high such index to be redundant, while ignoring 1) complementary correlation between features (which we will discuss detailed in section 3.2) and 2) correlation among more than two features, which still remain to be problems that impair the performance of feature selection. "
    ABSTRACT: Feature selection has attracted significant attention in data mining and machine learning over the past decades. Many existing feature selection methods eliminate redundancy by measuring the pairwise inter-correlation of features, whereas the complementariness of features and higher-order inter-correlation among more than two features are ignored. In this study, a modification term concerning the complementariness of features is introduced into the evaluation criterion of features. Additionally, in order to identify the interference effect of already-selected false positives (FPs), the redundancy-complementariness dispersion is also taken into account to adjust the measurement of pairwise inter-correlation of features. To illustrate the effectiveness of the proposed method, classification experiments are conducted with four frequently used classifiers on ten datasets. The classification results verify the superiority of the proposed method compared with five representative feature selection methods.
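The pairwise rule quoted in this entry is essentially FCBF's "approximate Markov blanket" test. Below is a minimal sketch of that test, assuming a symmetric-uncertainty helper `su(a, b)` like the one sketched earlier; the function name and the `threshold` parameter are illustrative, not from the cited paper.

```python
def fcbf_like_filter(X, y, su, threshold=0.0):
    """Drop features made redundant by a more class-relevant feature."""
    n = X.shape[1]
    relevance = {i: su(X[:, i], y) for i in range(n)}
    # consider features in decreasing order of class relevance
    ranked = sorted((i for i in range(n) if relevance[i] > threshold),
                    key=relevance.get, reverse=True)
    selected = []
    for j in ranked:
        # j is redundant if some already-kept (hence at least as relevant)
        # feature correlates with j at least as strongly as j does with
        # the class; keep j only if no such feature exists
        if all(su(X[:, i], X[:, j]) < relevance[j] for i in selected):
            selected.append(j)
    return selected
```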
  • Source
    • "Zheng and Padmanabhan [39] use DEA to construct an ensemble of classifiers in order to get better classification performance, which is a typical application of DEA to model combination problems. In addition, DEA itself is also applied to construct classifiers and clustering methods in prior work [40] [42] [31]. Recently, Zhang et al. [23] focus on the integration of DEA and feature selection, and indicate that DEA can be applied as a feature selection framework due to its nature of multi-index evaluation. "
    ABSTRACT: In this paper, a novel feature selection method based on a Class-Separability (CS) strategy and Data Envelopment Analysis (DEA) is presented. To better capture the relationship between features and the class, class labels are separated into individual variables, and relevance and redundancy are handled explicitly on each class label. Super-efficiency DEA is employed to evaluate and rank features via their conditional dependence scores on all class labels; the feature with the maximum super-efficiency score is then added to the conditioning set for conditional dependence estimation in the next iteration, so that features are selected iteratively until the final subset is obtained. Finally, experiments are conducted to evaluate the effectiveness of the proposed method against four state-of-the-art methods in terms of classification accuracy. The empirical results verify the feasibility and superiority of the proposed feature selection method.
    Neurocomputing (Impact Factor: 2.01). 04/2014; 166. DOI: 10.1016/j.neucom.2015.03.081
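The iterative scheme this abstract describes reduces to a greedy forward-selection loop. The skeleton below is a hedged sketch of that loop only: `score_fn` is a hypothetical stand-in for the paper's super-efficiency DEA scoring (which would be obtained by solving the super-efficiency DEA program over per-class conditional dependence scores, not implemented here).

```python
def greedy_forward_selection(n_features, score_fn, k):
    """Repeatedly add the feature scoring highest given those already selected.

    score_fn(candidate, selected) -> float is a stand-in for the paper's
    super-efficiency DEA score conditioned on the selected feature set.
    """
    selected, remaining = [], set(range(n_features))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score_fn(f, tuple(selected)))
        selected.append(best)
        remaining.remove(best)
    return selected
```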