A Fast Clustering-Based Feature Subset Selection Algorithm for High Dimensional Data

IEEE Transactions on Knowledge and Data Engineering (Impact Factor: 1.89). 01/2011; DOI: 10.1109/TKDE.2011.181
Source: IEEE Xplore

ABSTRACT: A fast clustering-based feature selection algorithm, FAST, is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum-spanning tree (MST) clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study. Extensive experiments are carried out to compare FAST with several representative feature selection algorithms with respect to four types of well-known classifiers before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text data sets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
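
The two steps described in the abstract can be illustrated with a minimal Python sketch. Several details here are assumptions that the abstract does not state: symmetric uncertainty is used as the relatedness measure, the spanning tree is built over the distance 1 - SU between features, and a tree edge is cut when the two features are less related to each other than each is to the target. The names `entropy`, `symmetric_uncertainty`, and `select_features` are hypothetical, and the code is meant for small discrete data only, not as the authors' implementation.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def select_features(columns, target):
    """Return sorted indices of one representative feature per cluster.

    columns: list of discrete feature columns (each a list of values).
    target:  list of class labels, same length as each column.
    """
    n = len(columns)
    relevance = [symmetric_uncertainty(col, target) for col in columns]
    sim = [[symmetric_uncertainty(columns[i], columns[j]) for j in range(n)]
           for i in range(n)]

    # Step 1a: Prim's algorithm builds a minimum spanning tree over the
    # assumed distance 1 - SU(feature_i, feature_j).
    in_tree = {0}
    tree_edges = []
    while len(in_tree) < n:
        i, j = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: 1.0 - sim[e[0]][e[1]],
        )
        tree_edges.append((i, j))
        in_tree.add(j)

    # Step 1b (assumed cutting rule): drop an edge when the two features are
    # less related to each other than each of them is to the target.
    kept = [(i, j) for i, j in tree_edges
            if not (sim[i][j] < relevance[i] and sim[i][j] < relevance[j])]

    # The remaining edges define clusters (connected components, union-find).
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in kept:
        parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    # Step 2: keep the feature most related to the target in each cluster.
    return sorted(max(members, key=lambda i: relevance[i])
                  for members in clusters.values())
```

For instance, given two identical feature columns and a third that is independent of them, the sketch groups the duplicates into one cluster and keeps a single representative of that cluster alongside the independent feature.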

  • ABSTRACT: Feature dimensionality reduction is a critical task in various machine learning applications, including prognostics and health management (PHM) applications. Linear transformations, most popularly principal component analysis (PCA) and linear discriminant analysis (LDA), are the most widely used methods for feature dimensionality reduction. For classification problems, LDA, being a supervised linear transformation that aims at maximally retaining class-discriminant information, is generally considered to be a better method than PCA, an unsupervised method. However, LDA suffers from the singularity, or small sample size, problem. To address this problem, in this paper we propose a cluster-based LDA (cLDA) for feature dimensionality reduction. It first partitions features into distinct clusters and then performs a cluster-wise LDA transformation. We demonstrate the effectiveness of the proposed cLDA in reducing the number of features by using a real-world PHM application: partial discharge diagnosis.
  • ABSTRACT: Feature selection is applied to identify relevant and complementary features from a given high-dimensional feature set. In general, existing filter-based approaches operate on single (scalar) feature components and ignore the relationships among components of multidimensional features. As a result, the generated feature subsets lack interpretability and provide little insight into the underlying data. We propose an unsupervised, filter-based feature selection approach that preserves the natural assignment of feature components to semantically meaningful features. Experiments on different tasks in the audio domain show that the proposed approach outperforms well-established feature selection methods in terms of retrieval performance and runtime. Results achieved on different audio datasets for the same retrieval task indicate that the proposed method is more robust in selecting consistent feature sets across different datasets than the compared approaches.
    SIAM International Conference on Data Mining (SDM); 04/2014
  • ABSTRACT: Feature interaction is an important issue in feature subset selection. However, most existing algorithms focus only on dealing with irrelevant and redundant features. In this paper, a propositional FOIL rule-based algorithm, FRFS, is proposed for selecting feature subsets for high-dimensional data; it not only retains relevant features and excludes irrelevant and redundant ones but also considers feature interaction. FRFS first merges the features appearing in the antecedents of all FOIL rules, yielding a candidate feature subset that excludes redundant features and retains interactive ones. Then, it identifies and removes irrelevant features by evaluating the features in the candidate subset with a new metric, CoverRatio, and obtains the final feature subset. The efficiency and effectiveness of FRFS are extensively tested on both synthetic and real-world data sets, and it is compared with six other representative feature subset selection algorithms (CFS, FCBF, Consistency, Relief-F, INTERACT, and the rule-based FSBAR) in terms of the number of selected features, runtime, and the classification accuracies of four well-known classifiers (Naive Bayes, C4.5, PART, and IB1) before and after feature selection. The results on the five synthetic data sets show that FRFS can effectively identify irrelevant and redundant features while retaining interactive ones. The results on the 35 real-world high-dimensional data sets demonstrate that, compared with the six other feature selection algorithms, FRFS can not only efficiently reduce the feature space but also significantly improve the performance of the four well-known classifiers.
    Pattern Recognition. 01/2013; 46(1):199–214.

