Conference Paper

A Comparison between Two Feature Selection Algorithms

... An original implementation of the KS algorithm was developed and used by us in [2]. During the development of this original implementation, optimization opportunities were discovered, which gain efficiency primarily by avoiding redundant calculations and by exploiting the data parallelism inherent to the algorithm. ...
... Such a caching mechanism was proposed by Koller and Sahami themselves, but they did not implement and evaluate it [1]. This optimization has been implemented in [2] and is evaluated in the experiment presented here, as described in Section 5.4. ...
... This experiment uses the same experimental dataset used by the experiment presented in [2], where the KS algorithm was compared with Information Gain Thresholding in the quality of selected features, evaluated by training Naïve Bayes classifiers. ...
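The cited implementation is not reproduced in these excerpts; as a rough illustration of the caching idea mentioned above, the sketch below memoizes the joint counts that repeated subset-conditional computations keep re-deriving. The data layout, function names and toy data are assumptions for illustration, not the code developed and evaluated in [2].

```python
# Illustrative sketch only: memoizing repeated subset-conditional computations,
# the generic pattern behind the caching optimization mentioned above.
# Data layout and function names are assumptions, not the code from [2].
from functools import lru_cache
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 8))   # toy binary feature matrix
y = rng.integers(0, 2, size=1000)        # toy binary class labels


@lru_cache(maxsize=None)
def joint_counts(feature_subset: frozenset) -> Counter:
    """Count joint configurations of a feature subset together with the class.

    Because KS-style selection re-evaluates many overlapping subsets, caching
    these counts avoids rebuilding the same contingency tables repeatedly.
    """
    cols = sorted(feature_subset)
    return Counter(zip(map(tuple, X[:, cols]), y))


def conditional_entropy_of_class(feature_subset: frozenset) -> float:
    """H(class | feature_subset), computed from the cached joint counts."""
    counts = joint_counts(feature_subset)
    n = sum(counts.values())
    marginal = Counter()
    for (x_value, _), c in counts.items():
        marginal[x_value] += c
    h = 0.0
    for (x_value, _), c in counts.items():
        p_xy = c / n
        p_y_given_x = c / marginal[x_value]
        h -= p_xy * np.log2(p_y_given_x)
    return h


# Repeated queries over the same subset hit the cache instead of the data.
print(conditional_entropy_of_class(frozenset({0, 1})))
print(conditional_entropy_of_class(frozenset({0, 1})))  # served from cache
```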
Article
This article describes and evaluates four optimizations for Koller and Sahami's Feature Selection algorithm, significantly reducing the time it requires to complete. The optimizations exploit the Information Theory concepts used by the algorithm, its inherent data parallelism and the fact that many of the calculations it performs are redundant. Each proposed optimization was carefully evaluated, showing significant efficiency gains. In particular, a decomposition of conditional mutual information is shown to reduce the time required to calculate its primary heuristic and can potentially be applied to other algorithms which calculate conditional mutual information.
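For context on the decomposition mentioned in the abstract (the article's exact form is not reproduced here), a standard identity expresses conditional mutual information through joint entropies:

```latex
% A standard identity (not necessarily the article's exact decomposition):
% conditional mutual information written as a sum of joint entropies.
I(X; Y \mid Z) = H(X, Z) + H(Y, Z) - H(Z) - H(X, Y, Z)
```

When many evaluations share the same conditioning set Z, terms such as H(Z) and H(Y, Z) recur across candidates and can be computed once and reused, which is the generic source of the savings a decomposition of this kind enables.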
... The authors experimentally evaluated the KS algorithm by using it to select features from datasets, then feeding these features into a classifier and evaluating its final accuracy. A separate evaluation compares KS to a simple selection by thresholding mutual information [13]. Supplementary efficiency optimizations for the KS algorithm are possible [14], targeting both heuristics of the algorithm. ...
Article
Full-text available
This article proposes the usage of the d-separation criterion in Markov Boundary Discovery algorithms, instead of or alongside the statistical tests of conditional independence these algorithms usually rely on. This is a methodological improvement applicable when designing, studying or improving such algorithms, but it is not applicable for productive use, because computing the d-separation criterion requires complete knowledge of a Bayesian network. Yet Bayesian networks can be made available to the algorithms when studied in controlled conditions. This approach has the effect of removing sources of suboptimal behavior, allowing the algorithms to perform at their theoretical best and providing insights about their properties. The article also discusses an extension of this approach, namely to use d-separation as a complement to the usual statistical tests performed on synthetic datasets in order to ascertain the overall accuracy of the tests chosen by the algorithms, for further insights into their behavior. To exemplify these two approaches, two Markov Boundary Discovery algorithms were used, namely the Incremental Association Markov Blanket algorithm and the Iterative Parent–Child-Based Search of Markov Blanket algorithm. Firstly, these algorithms were configured to use d-separation alone as their conditional independence test, computed on known Bayesian networks. Subsequently, the algorithms were configured to use the statistical G-test complemented by d-separation to evaluate their behavior on synthetic data.
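To make the substitution concrete, the sketch below implements a d-separation oracle over a known DAG using the standard ancestral/moral-graph construction, exposed as a plain function that a Markov boundary discovery loop could call in place of a statistical test such as the G-test. The DAG encoding and function names are assumptions for illustration, not the article's implementation.

```python
# Minimal sketch, assuming a DAG given as {node: set(parents)}: a d-separation
# oracle that could stand in for a statistical CI test (e.g. a G-test) inside
# a Markov boundary discovery algorithm. Not the article's implementation.
from collections import deque


def d_separated(parents, xs, ys, zs):
    """Return True if node sets xs and ys are d-separated by zs in the DAG.

    Uses the classic construction: restrict to the ancestral set of
    xs | ys | zs, moralize, drop the conditioning set, test reachability.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)

    # 1. Ancestral set of xs | ys | zs.
    ancestral, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in ancestral:
            ancestral.add(node)
            stack.extend(parents.get(node, ()))

    # 2. Moral graph on the ancestral set: parent-child edges plus edges
    #    between co-parents, all undirected.
    adj = {n: set() for n in ancestral}
    for child in ancestral:
        ps = [p for p in parents.get(child, ()) if p in ancestral]
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(ps):          # marry co-parents
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)

    # 3. Remove the conditioning set and test reachability from xs to ys.
    blocked = zs
    seen, queue = set(xs - blocked), deque(xs - blocked)
    while queue:
        node = queue.popleft()
        if node in ys:
            return False                    # a connecting path exists
        for nb in adj.get(node, ()):
            if nb not in blocked and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return True


# Toy example: the collider A -> C <- B. A and B are d-separated marginally
# but become d-connected once C enters the conditioning set.
dag = {"A": set(), "B": set(), "C": {"A", "B"}}
print(d_separated(dag, {"A"}, {"B"}, set()))   # True
print(d_separated(dag, {"A"}, {"B"}, {"C"}))   # False
```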
Article
Full-text available
Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which the data was produced. Drawing on interviews with Reuters personnel and access to Reuters documentation, we describe the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data. We refer to the original data as RCV1-v1, and the corrected data as RCV1-v2. We benchmark several widely used supervised learning methods on RCV1-v2, illustrating the collection's properties, suggesting new directions for research, and providing baseline results for future studies. We make available detailed, per-category experimental results, as well as corrected versions of the category assignments and taxonomy structures, via online appendices.
Article
Full-text available
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
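As a minimal, generic illustration of the scikit-learn building blocks relevant to this line of work (a filter-style feature selector feeding a Naïve Bayes classifier), the sketch below pairs SelectKBest with mutual information scoring; the toy data, the value of k and the pipeline are assumptions, not the experimental setup of any of the papers listed here.

```python
# Generic illustration (not the experimental setup of the cited papers):
# filter-style feature selection with mutual information feeding Naive Bayes.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(500, 200))     # toy term-count matrix
y = rng.integers(0, 2, size=500)            # toy binary labels

pipeline = make_pipeline(
    SelectKBest(score_func=mutual_info_classif, k=20),  # keep 20 features
    MultinomialNB(),
)
scores = cross_val_score(pipeline, X, y, cv=5)
print("mean accuracy:", scores.mean())
```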
Article
Most previous work on feature selection emphasized only the reduction of the high dimensionality of the feature space. But in cases where many features are highly redundant with each other, we must utilize other means, for example, more complex dependence models such as Bayesian network classifiers. In this paper, we introduce a new information gain and divergence-based feature selection method for statistical machine learning-based text categorization without relying on more complex dependence models. Our feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results are given on a number of datasets, showing that our feature selection method is more effective than Koller and Sahami's method [Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international conference on machine learning], which is one of the greedy feature selection methods, and conventional information gain, which is commonly used in feature selection for text categorization. Moreover, our feature selection method sometimes allows conventional machine learning algorithms to improve over support vector machines, which are known to give the best classification accuracy.
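The article's exact criterion is not reproduced here; as a hedged stand-in for the general "keep information gain, reduce redundancy" idea it describes, the greedy sketch below scores each candidate feature by its mutual information with the class minus its average mutual information with the features already selected (an mRMR-style rule), on assumed toy data.

```python
# Generic greedy relevance-minus-redundancy sketch (mRMR-style stand-in),
# not the exact criterion of the article above.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(400, 30))   # toy discrete feature matrix
y = rng.integers(0, 2, size=400)         # toy class labels


def greedy_select(X, y, k):
    n_features = X.shape[1]
    # Relevance: mutual information of each feature with the class.
    relevance = np.array([mutual_info_score(y, X[:, j]) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Redundancy: average mutual information with already chosen features.
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected


print(greedy_select(X, y, k=5))
```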
Article
This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ²-test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to 98% of unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to 5...
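As a toy sketch of the two cheapest criteria from that comparison, document-frequency thresholding and χ² scoring can both be expressed directly with scikit-learn; the corpus, labels and min_df threshold below are illustrative assumptions only.

```python
# Toy sketch of document-frequency thresholding and chi-square term scoring,
# two of the criteria compared in the study above; corpus and thresholds are
# illustrative assumptions only.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = [
    "wheat prices rise on export news",
    "central bank raises interest rates",
    "wheat harvest hit by drought",
    "interest rates held steady by the bank",
]
labels = np.array([0, 1, 0, 1])  # 0 = grain, 1 = money

# Document-frequency thresholding: drop terms appearing in fewer than 2 docs.
vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(docs)

# Chi-square scoring of the surviving terms against the labels.
scores, _ = chi2(X, labels)
for term, score in sorted(zip(vectorizer.get_feature_names_out(), scores),
                          key=lambda t: -t[1]):
    print(f"{term}: {score:.2f}")
```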
Toward optimal feature selection
  • D Koller
  • M Sahami
D. Koller and M. Sahami, "Toward optimal feature selection," in Proceedings of the 13th International Conference on Machine Learning (ICML), 1996, pp. 284-292.
Contributions to automatic knowledge extraction from unstructured data
  • I D Morariu
I. D. Morariu, "Contributions to automatic knowledge extraction from unstructured data," Ph.D. dissertation, "Lucian Blaga" University of Sibiu (supervisor: Prof. L. Vințan), 2007.
RCV1-v2/LYRL2004: The LYRL2004 Distribution of the RCV1-v2 Text Categorization Test Collection
  • D D Lewis
D. D. Lewis, "RCV1-v2/LYRL2004: The LYRL2004 Distribution of the RCV1-v2 Text Categorization Test Collection," 2015.