Conference Paper

A Comparison between Two Feature Selection Algorithms

... An original implementation of the KS algorithm was developed and used by us in [2]. During the development of this original implementation, optimization opportunities were discovered, which gain efficiency primarily by avoiding redundant calculations and by exploiting the data parallelism inherent to the algorithm. ...
... Such a caching mechanism was proposed by Koller and Sahami themselves, but they did not implement and evaluate it [1]. This optimization has been implemented in [2] and is evaluated in the experiment presented here, as described in Section 5.4. ...
... This experiment uses the same experimental dataset used by the experiment presented in [2], where the KS algorithm was compared with Information Gain Thresholding in the quality of selected features, evaluated by training Naïve Bayes classifiers. ...
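The cited implementation is not reproduced in these excerpts; as a rough illustration of the caching idea mentioned above, the sketch below memoizes the joint counts that repeated subset-conditional computations keep re-deriving. The data layout, function names and toy data are assumptions for illustration, not the code developed and evaluated in [2].

```python
# Illustrative sketch only: memoizing repeated subset-conditional computations,
# the generic pattern behind the caching optimization mentioned above.
# Data layout and function names are assumptions, not the code from [2].
from functools import lru_cache
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 8))   # toy binary feature matrix
y = rng.integers(0, 2, size=1000)        # toy binary class labels


@lru_cache(maxsize=None)
def joint_counts(feature_subset: frozenset) -> Counter:
    """Count joint configurations of a feature subset together with the class.

    Because KS-style selection re-evaluates many overlapping subsets, caching
    these counts avoids rebuilding the same contingency tables repeatedly.
    """
    cols = sorted(feature_subset)
    return Counter(zip(map(tuple, X[:, cols]), y))


def conditional_entropy_of_class(feature_subset: frozenset) -> float:
    """H(class | feature_subset), computed from the cached joint counts."""
    counts = joint_counts(feature_subset)
    n = sum(counts.values())
    marginal = Counter()
    for (x_value, _), c in counts.items():
        marginal[x_value] += c
    h = 0.0
    for (x_value, _), c in counts.items():
        p_xy = c / n
        p_y_given_x = c / marginal[x_value]
        h -= p_xy * np.log2(p_y_given_x)
    return h


# Repeated queries over the same subset hit the cache instead of the data.
print(conditional_entropy_of_class(frozenset({0, 1})))
print(conditional_entropy_of_class(frozenset({0, 1})))  # served from cache
```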
Article
This article describes and evaluates four optimizations for Koller and Sahami's Feature Selection algorithm, significantly reducing the time it requires to complete. The optimizations exploit the Information Theory concepts used by the algorithm, its inherent data parallelism and the fact that many of the calculations it performs are redundant. Each proposed optimization was carefully evaluated, showing significant efficiency gains. In particular, a decomposition of conditional mutual information is shown to reduce the time required to calculate its primary heuristic and can potentially be applied to other algorithms which calculate conditional mutual information.
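For context on the decomposition mentioned in the abstract (the article's exact form is not reproduced here), a standard identity expresses conditional mutual information through joint entropies:

```latex
% A standard identity (not necessarily the article's exact decomposition):
% conditional mutual information written as a sum of joint entropies.
I(X; Y \mid Z) = H(X, Z) + H(Y, Z) - H(Z) - H(X, Y, Z)
```

When many evaluations share the same conditioning set Z, terms such as H(Z) and H(Y, Z) recur across candidates and can be computed once and reused, which is the generic source of the savings a decomposition of this kind enables.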
... The authors experimentally evaluated the KS algorithm by using it to select features from datasets, then feeding these features into a classifier and evaluating its final accuracy. A separate evaluation compares KS to a simple selection by thresholding mutual information [13]. Supplementary efficiency optimizations for the KS algorithm are possible [14], targeting both heuristics of the algorithm. ...
Article
Full-text available
This article proposes the usage of the d-separation criterion in Markov Boundary Discovery algorithms, instead of or alongside the statistical tests of conditional independence these algorithms usually rely on. This is a methodological improvement applicable when designing, studying or improving such algorithms, but it is not applicable for productive use, because computing the d-separation criterion requires complete knowledge of a Bayesian network. Yet Bayesian networks can be made available to the algorithms when studied in controlled conditions. This approach has the effect of removing sources of suboptimal behavior, allowing the algorithms to perform at their theoretical best and providing insights about their properties. The article also discusses an extension of this approach, namely to use d-separation as a complement to the usual statistical tests performed on synthetic datasets in order to ascertain the overall accuracy of the tests chosen by the algorithms, for further insights into their behavior. To exemplify these two approaches, two Markov Boundary Discovery algorithms were used, namely the Incremental Association Markov Blanket algorithm and the Iterative Parent–Child-Based Search of Markov Blanket algorithm. Firstly, these algorithms were configured to use d-separation alone as their conditional independence test, computed on known Bayesian networks. Subsequently, the algorithms were configured to use the statistical G-test complemented by d-separation to evaluate their behavior on synthetic data.
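To make the substitution concrete, the sketch below implements a d-separation oracle over a known DAG using the standard ancestral/moral-graph construction, exposed as a plain function that a Markov boundary discovery loop could call in place of a statistical test such as the G-test. The DAG encoding and function names are assumptions for illustration, not the article's implementation.

```python
# Minimal sketch, assuming a DAG given as {node: set(parents)}: a d-separation
# oracle that could stand in for a statistical CI test (e.g. a G-test) inside
# a Markov boundary discovery algorithm. Not the article's implementation.
from collections import deque


def d_separated(parents, xs, ys, zs):
    """Return True if node sets xs and ys are d-separated by zs in the DAG.

    Uses the classic construction: restrict to the ancestral set of
    xs | ys | zs, moralize, drop the conditioning set, test reachability.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)

    # 1. Ancestral set of xs | ys | zs.
    ancestral, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in ancestral:
            ancestral.add(node)
            stack.extend(parents.get(node, ()))

    # 2. Moral graph on the ancestral set: parent-child edges plus edges
    #    between co-parents, all undirected.
    adj = {n: set() for n in ancestral}
    for child in ancestral:
        ps = [p for p in parents.get(child, ()) if p in ancestral]
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(ps):          # marry co-parents
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)

    # 3. Remove the conditioning set and test reachability from xs to ys.
    blocked = zs
    seen, queue = set(xs - blocked), deque(xs - blocked)
    while queue:
        node = queue.popleft()
        if node in ys:
            return False                    # a connecting path exists
        for nb in adj.get(node, ()):
            if nb not in blocked and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return True


# Toy example: the collider A -> C <- B. A and B are d-separated marginally
# but become d-connected once C enters the conditioning set.
dag = {"A": set(), "B": set(), "C": {"A", "B"}}
print(d_separated(dag, {"A"}, {"B"}, set()))   # True
print(d_separated(dag, {"A"}, {"B"}, {"C"}))   # False
```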
Article
Full-text available
Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which the data was produced. Drawing on interviews with Reuters personnel and access to Reuters documentation, we describe the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data. We refer to the original data as RCV1-v1, and the corrected data as RCV1-v2. We benchmark several widely used supervised learning methods on RCV1-v2, illustrating the collection's properties, suggesting new directions for research, and providing baseline results for future studies. We make available detailed, per-category experimental results, as well as corrected versions of the category assignments and taxonomy structures, via online appendices.
Article
Full-text available
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
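As a minimal, generic illustration of the scikit-learn building blocks relevant to this line of work (a filter-style feature selector feeding a Naïve Bayes classifier), the sketch below pairs SelectKBest with mutual information scoring; the toy data, the value of k and the pipeline are assumptions, not the experimental setup of any of the papers listed here.

```python
# Generic illustration (not the experimental setup of the cited papers):
# filter-style feature selection with mutual information feeding Naive Bayes.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(500, 200))     # toy term-count matrix
y = rng.integers(0, 2, size=500)            # toy binary labels

pipeline = make_pipeline(
    SelectKBest(score_func=mutual_info_classif, k=20),  # keep 20 features
    MultinomialNB(),
)
scores = cross_val_score(pipeline, X, y, cv=5)
print("mean accuracy:", scores.mean())
```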
Article
Most previous work on feature selection emphasized only the reduction of the high dimensionality of the feature space. But in cases where many features are highly redundant with each other, we must utilize other means, for example, more complex dependence models such as Bayesian network classifiers. In this paper, we introduce a new information gain and divergence-based feature selection method for statistical machine learning-based text categorization without relying on more complex dependence models. Our feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results are given on a number of datasets, showing that our feature selection method is more effective than Koller and Sahami's method [Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international conference on machine learning], which is one of the greedy feature selection methods, and conventional information gain, which is commonly used in feature selection for text categorization. Moreover, our feature selection method sometimes allows conventional machine learning algorithms to improve over support vector machines, which are known to give the best classification accuracy.
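The article's exact criterion is not reproduced here; as a hedged stand-in for the general "keep information gain, reduce redundancy" idea it describes, the greedy sketch below scores each candidate feature by its mutual information with the class minus its average mutual information with the features already selected (an mRMR-style rule), on assumed toy data.

```python
# Generic greedy relevance-minus-redundancy sketch (mRMR-style stand-in),
# not the exact criterion of the article above.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(400, 30))   # toy discrete feature matrix
y = rng.integers(0, 2, size=400)         # toy class labels


def greedy_select(X, y, k):
    n_features = X.shape[1]
    # Relevance: mutual information of each feature with the class.
    relevance = np.array([mutual_info_score(y, X[:, j]) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Redundancy: average mutual information with already chosen features.
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected


print(greedy_select(X, y, k=5))
```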
Article
This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ²-test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to 98% of unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to 5...
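As a toy sketch of the two cheapest criteria from that comparison, document-frequency thresholding and χ² scoring can both be expressed directly with scikit-learn; the corpus, labels and min_df threshold below are illustrative assumptions only.

```python
# Toy sketch of document-frequency thresholding and chi-square term scoring,
# two of the criteria compared in the study above; corpus and thresholds are
# illustrative assumptions only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = [
    "wheat prices rise on export news",
    "central bank raises interest rates",
    "wheat harvest hit by drought",
    "interest rates held steady by the bank",
]
labels = np.array([0, 1, 0, 1])  # 0 = grain, 1 = money

# Document-frequency thresholding: drop terms appearing in fewer than 2 docs.
vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(docs)

# Chi-square scoring of the surviving terms against the labels.
scores, _ = chi2(X, labels)
for term, score in sorted(zip(vectorizer.get_feature_names_out(), scores),
                          key=lambda t: -t[1]):
    print(f"{term}: {score:.2f}")
```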
Toward optimal feature selection
  • D Koller
  • M Sahami
D. Koller and M. Sahami, "Toward optimal feature selection," in Proceedings of the 13th International Conference on Machine Learning (ICML), 1996, pp. 284-292.
Contributions to automatic knowledge extraction from unstructured data
  • I D Morariu
I. D. Morariu, "Contributions to automatic knowledge extraction from unstructured data," Ph.D. dissertation, "Lucian Blaga" University of Sibiu (supervisor: Prof. L. Vințan), 2007.
RCV1-v2/LYRL2004: The LYRL2004 Distribution of the RCV1-v2 Text Categorization Test Collection
  • D D Lewis
D. D. Lewis, "RCV1-v2/LYRL2004: The LYRL2004 Distribution of the RCV1-v2 Text Categorization Test Collection," 2015.