A Fast Clustering-Based Feature Subset Selection Algorithm for High Dimensional Data

IEEE Transactions on Knowledge and Data Engineering (Impact Factor: 1.82). 01/2011; DOI: 10.1109/TKDE.2011.181
Source: IEEE Xplore

ABSTRACT A fast clustering-based feature selection algorithm, FAST, is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form the final subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum-spanning tree (MST) clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study: extensive experiments compare FAST with several representative feature selection algorithms with respect to four types of well-known classifiers, before and after feature selection. The results on 35 publicly available real-world high-dimensional image, microarray, and text datasets demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
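The two-step procedure can be sketched in a few dozen lines, under assumptions the abstract leaves open: features are discrete, feature–feature and feature–class correlation are measured by symmetric uncertainty SU(X, Y) = 2·I(X; Y)/(H(X) + H(Y)), the spanning tree is pruned at edges weaker than both endpoints' class relevance, and each resulting cluster contributes its most class-relevant feature. Function names and the exact pruning rule here are illustrative, not the paper's definitions.

```python
# Hedged sketch of a FAST-style clustering-based feature selector.
import numpy as np
from collections import Counter
from itertools import combinations

def entropy(x):
    """Shannon entropy (bits) of a discrete sequence."""
    p = np.array(list(Counter(x).values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def su(x, y):
    """Symmetric uncertainty: 2*I(X;Y) / (H(X) + H(Y)), in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    mi = hx + hy - entropy(list(zip(x, y)))   # mutual information I(X; Y)
    return 2.0 * mi / (hx + hy) if hx + hy > 0 else 0.0

def fast_select(X, y, threshold=0.0):
    rel = {f: su(X[:, f], y) for f in range(X.shape[1])}
    feats = [f for f in rel if rel[f] > threshold]  # drop irrelevant features

    def components(pairs):
        """Connected components over `feats` given undirected edges `pairs`."""
        parent = {f: f for f in feats}
        def find(f):
            while parent[f] != f:
                parent[f] = parent[parent[f]]
                f = parent[f]
            return f
        for a, b in pairs:
            parent[find(a)] = find(b)
        groups = {}
        for f in feats:
            groups.setdefault(find(f), []).append(f)
        return list(groups.values())

    # Spanning tree over pairwise SU weights (Kruskal, strongest edges first).
    edges = sorted(((su(X[:, a], X[:, b]), a, b)
                    for a, b in combinations(feats, 2)), reverse=True)
    parent = {f: f for f in feats}
    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f
    tree = []
    for w, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((w, a, b))
    # Prune edges weaker than both endpoints' class relevance -> feature clusters.
    kept = [(a, b) for w, a, b in tree if w >= min(rel[a], rel[b])]
    # One representative (most class-relevant feature) per cluster.
    return sorted(max(c, key=lambda f: rel[f]) for c in components(kept))
```

On a toy dataset where feature 0 equals the class, feature 1 is its mirror, and feature 2 is constant, the sketch clusters features 0 and 1 together and keeps a single representative.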

  • ABSTRACT: In this paper, a novel feature selection method is presented, based on a Class-Separability (CS) strategy and Data Envelopment Analysis (DEA). To better capture the relationship between features and the class, class labels are separated into individual variables, and relevance and redundancy are handled explicitly on each class label. Super-efficiency DEA is employed to evaluate and rank features via their conditional-dependence scores on all class labels; the feature with the maximum super-efficiency score is then added to the conditioning set used for conditional-dependence estimation in the next iteration, so that features are selected iteratively until the final subset is obtained. Finally, experiments are conducted to evaluate the effectiveness of the proposed method against four state-of-the-art methods in terms of classification accuracy. Empirical results verify the feasibility and superiority of the proposed feature selection method.
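    The super-efficiency scoring step can be illustrated with a small input-oriented CCR model: each unit (here, a feature candidate) is scored with itself excluded from the reference set, so efficient units can score above 1. This is a generic DEA sketch using scipy's linear-programming solver; how features map to DEA inputs and outputs is an assumption not specified by the abstract.

```python
# Hedged sketch: super-efficiency DEA (input-oriented CCR envelopment form).
import numpy as np
from scipy.optimize import linprog

def super_efficiency(inputs, outputs, o):
    """Score DMU `o` with it excluded from the reference set.

    inputs:  (n_dmu, n_in) array;  outputs: (n_dmu, n_out) array.
    Solves: min theta  s.t.  sum_j lam_j x_j <= theta * x_o,
                             sum_j lam_j y_j >= y_o,   lam >= 0,  j != o.
    """
    n = inputs.shape[0]
    ref = [j for j in range(n) if j != o]          # reference set excludes o
    c = np.r_[1.0, np.zeros(len(ref))]             # minimize theta
    # Input constraints: lam . x - theta * x_o <= 0
    A_in = np.c_[-inputs[o][:, None], inputs[ref].T]
    b_in = np.zeros(inputs.shape[1])
    # Output constraints: -lam . y <= -y_o  (i.e. lam . y >= y_o)
    A_out = np.c_[np.zeros((outputs.shape[1], 1)), -outputs[ref].T]
    b_out = -outputs[o]
    res = linprog(c, A_ub=np.vstack([A_in, A_out]), b_ub=np.r_[b_in, b_out],
                  bounds=[(0, None)] * (1 + len(ref)))
    return res.fun
```

    With one input and one output, a unit producing twice the output of its only peer at the same input cost scores 2.0 (super-efficient), while the peer scores 0.5.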
  • ABSTRACT: Feature selection identifies relevant and complementary features in a given high-dimensional feature set. Existing filter-based approaches generally operate on single (scalar) feature components and ignore the relationships among the components of multidimensional features. As a result, the generated feature subsets lack interpretability and hardly provide insight into the underlying data. We propose an unsupervised, filter-based feature selection approach that preserves the natural assignment of feature components to semantically meaningful features. Experiments on different tasks in the audio domain show that the proposed approach outperforms well-established feature selection methods in terms of retrieval performance and runtime. Results achieved on different audio datasets for the same retrieval task indicate that the proposed method selects consistent feature sets across datasets more robustly than the compared approaches.
    SIAM International Conference on Data Mining (SDM); 04/2014
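    To make the group-preserving idea concrete, here is a hedged sketch of an unsupervised filter that scores and selects whole multidimensional features (e.g. a complete MFCC vector) as units, rather than splitting them into scalar components. The seeding-by-variance and minimum-redundancy criteria below are illustrative assumptions, not the paper's actual method.

```python
# Hedged sketch: unsupervised selection of whole feature groups.
import numpy as np

def select_groups(data, groups, k):
    """data: (n_samples, n_dims); groups: {name: [column indices]}; keep k groups."""
    def group_corr(a_cols, b_cols):
        # Mean absolute correlation between the components of two groups.
        c = np.corrcoef(data[:, a_cols + b_cols], rowvar=False)
        na = len(a_cols)
        return np.abs(c[:na, na:]).mean()

    names = list(groups)
    # Seed with the group of highest total variance (a simple unsupervised proxy
    # for informativeness).
    selected = [max(names, key=lambda g: data[:, groups[g]].var(axis=0).sum())]
    while len(selected) < k:
        rest = [g for g in names if g not in selected]
        # Add the group least redundant with everything selected so far.
        nxt = min(rest, key=lambda g: max(group_corr(groups[g], groups[s])
                                          for s in selected))
        selected.append(nxt)
    return selected
```

    Because whole groups are kept or dropped together, the output stays interpretable: each selected name corresponds to one semantically meaningful feature.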
  • ABSTRACT: This paper presents a pragmatic study of feature subset selection evaluators. In data mining, dimensionality reduction during data preprocessing plays a vital role in improving the performance of machine learning algorithms, and many techniques have been proposed to achieve it. Besides giving a significant improvement in accuracy, feature subset selection reduces the false-prediction ratio and the time complexity of building the learning model, as a result of removing redundant and irrelevant attributes from the original dataset. This study analyzes the performance of the CFS, Consistency, and Filtered attribute subset evaluators for dimensionality reduction on a wide range of test datasets and learning algorithms, namely the probability-based Naive Bayes, tree-based C4.5 (J48), and instance-based IB1.
    Advanced Communication Control and Computing Technologies (ICACCCT), 2012 IEEE International Conference on; 01/2012
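    Of the three evaluators studied, CFS is the easiest to illustrate: a subset's merit rewards feature–class correlation while penalizing feature–feature correlation, Merit_S = k·r̄_cf / √(k + k(k−1)·r̄_ff). The sketch below pairs that merit with a greedy forward search; Pearson |r| stands in for the symmetric uncertainty CFS actually uses, so this is an approximation, not the Weka implementation.

```python
# Hedged sketch of the CFS subset evaluator with greedy forward search.
# Assumption: Pearson |r| approximates CFS's correlation measure.
import numpy as np

def cfs_merit(X, y, subset):
    """Merit_S = k * mean(r_cf) / sqrt(k + k*(k-1) * mean(r_ff))."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward(X, y):
    """Greedily add the feature that most improves merit; stop when none does."""
    selected, best, remaining = [], 0.0, list(range(X.shape[1]))
    while remaining:
        merit, f = max((cfs_merit(X, y, selected + [g]), g) for g in remaining)
        if merit <= best:
            break          # no candidate improves the subset's merit
        selected, best = selected + [f], merit
        remaining.remove(f)
    return selected
```

    The redundancy penalty is what distinguishes CFS from ranking features individually: a near-duplicate of an already-selected feature raises r̄_ff and so fails to improve the merit, which is exactly the behavior that shrinks the subset.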
