Fig 2 - uploaded by Damien François


# Boxplots for the distributions of MI_K(X4, Y) (top) and of MI_K(X4, π(Y)) (bottom) as a function of K

## Source publication

The selection of features that are relevant for a prediction or classification problem is an important problem in many domains involving high-dimensional data. Selecting features helps fight the curse of dimensionality, improve the performance of prediction or classification methods, and interpret the application. In a nonlinear context, t...

## Context in source publication

**Context 1**

... evaluating the MI between Y and a relevant feature (for example X4), a t_K value is obtained for each value of K, as shown in Figure 1. Those values summarize the differences between the empirical distributions of MI_K(X4, Y) and of MI_K(X4, π(Y)) (an illustration of the behaviour of those distributions is given in Figure 2). The largest t_K value corresponds to the smoothing parameter K that best separates the distributions in the relevant and permuted cases (in this example the optimal K is 10). ...
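A minimal sketch of the resampling-and-permutation procedure described above. It is only illustrative: a simple equal-width-binning MI estimator stands in for the paper's k-nearest-neighbour estimator MI_K, and the function names and the Welch-style t statistic are assumptions, not the authors' exact implementation.

```python
import math
import random
from statistics import mean, stdev

def hist_mi(x, y, bins=8):
    """Plug-in MI estimate (in nats) via equal-width binning.
    Stand-in for the k-NN estimator MI_K used in the paper."""
    n = len(x)
    def digitize(v):
        lo, hi = min(v), max(v)
        w = (hi - lo) / bins or 1.0
        return [min(int((vi - lo) / w), bins - 1) for vi in v]
    bx, by = digitize(x), digitize(y)
    pxy, px, py = {}, {}, {}
    for a, b in zip(bx, by):
        pxy[(a, b)] = pxy.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def permutation_t(x, y, n_resamples=30, seed=0):
    """Welch-style t separating MI(x, y) from MI(x, perm(y)) over
    bootstrap resamples; a large t marks x as relevant for y."""
    rng = random.Random(seed)
    rel, perm = [], []
    n = len(x)
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap resample
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        rel.append(hist_mi(xs, ys))
        yp = ys[:]
        rng.shuffle(yp)                              # permutation breaks dependence
        perm.append(hist_mi(xs, yp))
    pooled = math.sqrt((stdev(rel) ** 2 + stdev(perm) ** 2) / n_resamples)
    return (mean(rel) - mean(perm)) / pooled

# Toy check: a relevant feature yields a much larger t than an irrelevant one.
rng = random.Random(1)
x_rel = [rng.gauss(0, 1) for _ in range(400)]
y = [v + 0.3 * rng.gauss(0, 1) for v in x_rel]
x_irr = [rng.gauss(0, 1) for _ in range(400)]
```

In the paper this statistic is computed per K, and the K with the largest t_K is retained as the smoothing parameter.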

## Similar publications

Lower bounds on mutual information (MI) of long-haul optical fiber systems
for hard-decision and soft-decision decoding are studied. Ready-to-use
expressions to calculate the MI are presented. Extensive numerical simulations
are used to quantify how changes in the optical transmitter, receiver, and
channel affect the achievable transmission rates o...

Over binary input channels, the uniform distribution is a universal prior, in the sense that it maximizes the worst-case mutual information over all binary input channels, ensuring at least 94.2% of the capacity. In this paper, we address a similar question, but with respect to a universal generalized linear decoder. We look for the best colle...

This paper generalizes Wyner's definition of common information of a pair of random variables to that of $N$ random variables. We prove coding theorems that show the same operational meanings for the common information of two random variables generalize to that of $N$ random variables. As a byproduct of our proof, we show that the Gray-Wyner source...

In this paper, we show that there exists an arbitrary number of power
allocation schemes that achieve capacity in systems operating in parallel
channels comprised of single-input multiple-output (SIMO) Nakagami-m fading
subchannels when the number of degrees of freedom L (e.g., the number of
receive antennas) tends to infinity. Statistical waterfil...

We are interested in understanding the neural correlates of attentional processes using first principles. Here we apply a recently developed first principles approach that uses transmitted information in bits per joule to quantify the energy efficiency of information transmission for an inter-spike-interval (ISI) code that can be modulated by means...

## Citations

... In terms of accuracy, similar results are also obtained by using the computationally more efficient Gini impurity estimation [60]-[62]. While there are many variations of information gain- and Gini-based methods [61], [63]-[66], their major advantage is their ability to acknowledge nonlinear relationships between features and class labels [58], [67]. However, they favor features with a large number of distinct values, which can result in overfitting [57], [60], [62], [68]. ...

This article introduces a new iterative approach to explainable feature learning. During each iteration, new features are generated, first by applying arithmetic operations on the input set of features. These are then evaluated in terms of probability distribution agreements between values of samples belonging to different classes. Finally, a graph-based approach for feature selection is proposed, which allows for selecting high-quality and uncorrelated features to be used in feature generation during the next iteration. As shown by the results, the proposed method improved the accuracy of all tested classifiers, with the best accuracies achieved using random forest. In addition, the method turned out to be insensitive to both input parameters, while superior performance compared to the state of the art was demonstrated on nine of 15 test sets, with comparable results on the others. Finally, we demonstrate the explainability of the learned feature representation for knowledge discovery.

... When employing greedy forward methods, several choices need to be specified, such as appropriate parameters like β and the stopping criterion of the greedy procedure [30]. Improper specification of the parameters and stopping criterion may cause a better subset to be overlooked [31]. ...

... Before further analysis, we separate OF-Ratio into three levels: (0, 1] ('small'), (1, 10] ('medium'), and (10, ∞) ('large'). The number of classes is separated into 'binary' and 'multiclass', while the number of selected features is categorized into [5, 25] ('low') and [30, 50] ('high'). The interaction plots from ANOVA are given in Figure 7. ...

Filter feature selection algorithms are commonly used as an effective way to reduce the computational cost of data analysis by selecting and using only a subset of the original features. Mutual information (MI) is a popular measure adopted to quantify the dependence among features. MI-based greedy forward methods (MIGFMs) have been widely applied to avoid the computational complexity of exhaustively searching high-dimensional data. However, most MIGFMs are parametric methods that require properly preset parameters and stopping criteria; improper parameters may cause better results to be overlooked. This paper proposes a novel nonparametric feature selection method based on mutual information and mixed-integer linear programming (MILP). By forming a mutual information network, we transform the feature selection problem into a maximum-flow problem, which can be solved with the Gurobi solver in reasonable time. The proposed method aims to avoid overlooking a superior feature subset while keeping the computational cost in an affordable range. Analytical comparison of the proposed method with six feature selection methods reveals significantly better results than MIGFMs in terms of classification accuracy.
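For context, the greedy forward baseline (MIGFM-style) that such methods are compared against can be sketched as follows. This is an illustrative sketch, not the paper's algorithm: absolute correlation stands in for MI in the subset score, and all names (`mrmr_score`, `greedy_forward`) are assumptions.

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = math.sqrt(sum((x - ma) ** 2 for x in a)) or 1.0
    sb = math.sqrt(sum((x - mb) ** 2 for x in b)) or 1.0
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (sa * sb)

def mrmr_score(subset, X, y):
    """mRMR-flavoured score: mean relevance minus mean pairwise redundancy.
    |corr| stands in for MI to keep the sketch dependency-free."""
    rel = sum(abs(corr(X[j], y)) for j in subset) / len(subset)
    if len(subset) < 2:
        return rel
    pairs = [(i, j) for i in subset for j in subset if i < j]
    red = sum(abs(corr(X[i], X[j])) for i, j in pairs) / len(pairs)
    return rel - red

def greedy_forward(X, y, score, max_features=None):
    """Greedy forward selection: repeatedly add the feature maximizing the
    subset score; stop when the score no longer improves (a common heuristic
    stopping criterion, exactly the kind the abstract above criticizes)."""
    selected, best, remaining = [], -math.inf, list(range(len(X)))
    while remaining and (max_features is None or len(selected) < max_features):
        top, j = max((score(selected + [k], X, y), k) for k in remaining)
        if top <= best:
            break
        best, selected = top, selected + [j]
        remaining.remove(j)
    return selected

# Toy data: two redundant relevant features and one noise feature.
rng = random.Random(0)
y = [rng.gauss(0, 1) for _ in range(200)]
X = [
    [v + 0.1 * rng.gauss(0, 1) for v in y],  # relevant
    [v + 0.1 * rng.gauss(0, 1) for v in y],  # relevant, redundant with X[0]
    [rng.gauss(0, 1) for _ in y],            # irrelevant noise
]
sel = greedy_forward(X, y, mrmr_score)
```

On this toy data the procedure keeps one of the two redundant relevant features and rejects both the duplicate (redundancy penalty) and the noise feature (no relevance).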

... Mutual information between a feature variable and a label variable quantifies how well the feature discriminates the label, whereas mutual information between two feature variables is a measure of their similarity/redundancy. It is therefore a common tool for feature selection procedures in general [143,147,242], as well as for activity recognition applications [108,159]. Based on this notion, the authors in [77,193] define the minimum redundancy maximum relevance (mRMR) criterion, a measure that can be used to compare feature sets of the same size in terms of both discriminant power and non-redundancy, expressed as: ...
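The criterion truncated above is, in its widely cited form (Peng et al.), usually written as the difference between average relevance and average redundancy; the exact formulation quoted in the source snippet may differ slightly:

```latex
\mathrm{mRMR}(S) \;=\;
\underbrace{\frac{1}{|S|}\sum_{X_i \in S} I(X_i; Y)}_{\text{relevance}}
\;-\;
\underbrace{\frac{1}{|S|^2}\sum_{X_i, X_j \in S} I(X_i; X_j)}_{\text{redundancy}}
```

where S is the candidate feature set, Y the label variable, and I(·;·) mutual information.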

Tarkett is a global flooring company that developed a piezoelectric sensor encapsulated in the flooring and an embedded system meant to equip nursing-home patient rooms. The objective of this industrial project is to build reliable machine learning models, based on the piezoelectric signals, able to work in real time in the embedded system and provide useful information for medical staff to monitor their patients' health. We describe how different measurement technologies affect the original physical signal, as well as the different data-gathering environments in which several datasets have been recorded. To monitor the health of the elderly, some important recurrent events like walking and some anomalies like falls need to be recognized from the floor sensor signals. To this end, processing the signals into an adequate data representation for these detection purposes is also a major challenge. We use a wide feature set based on time series from various signal representations such as the Fourier transform, autocorrelation, and spectrograms. Using predictive models based on random forests on different experimental datasets, we show the Tarkett system's ability to achieve various monitoring tasks, as well as the relevance of each signal representation and its associated features for these detection tasks. Nevertheless, for these experimental studies to be deployed industrially in real FIM Care installations, the machine learning models need to fulfill two crucial requirements. First, they have to be confronted with real-environment data, meaning they must adapt to the variability of real installations and to differences in activity signals between people. In this context we deal with the problem of adapting a predictive model initially trained on experimental data to real data with a different empirical distribution. This particular situation in machine learning is known as transfer learning or domain adaptation.
We address it by confronting simulated event data with real data on the fall detection task, which presents the particularity of extreme class imbalance in real conditions. We investigate the drawbacks of this class imbalance for existing transfer learning methods on decision trees and propose adaptations to handle this problem. Our contribution is a robust model-based transfer learning algorithm for random forests that can deal with class imbalance and can also be used to interpret relations between two different domains. Second, most of the prediction tasks for elderly monitoring have to work in real time, embedded in an electronic device with limited computational capabilities. Taking such constraints into account while designing a predictive model belongs to a branch of machine learning known as cost-sensitive or budgeted learning, which has become an increasingly active research topic in recent years. We translate the embedded system's computational resource constraints into a budgeted prediction-time framework compatible with decision-tree-based models, and propose an efficient and scalable genetic algorithm that considers both feature acquisition cost and evaluation cost, allowing an experimental random forest model to be simplified into a new one that fits within the embedded system's resource limits. This algorithm takes advantage of the notion of equivalence between classifiers, meaning models sharing the same decision function but with different structures, to favor reductions in feature acquisition cost by exploiting structural variety in decision trees.

... Similarly, Xiong et al. [23] used the Elastic Net algorithm to reduce the input variables for calculating lower-extremity joint moment. However, mutual information has no theoretically justified stopping criterion in the feature selection procedure and does not consider the interrelationship between variables [24], while the Elastic Net relies on linear regression, which may not be able to optimize a nonlinear system [25]. The BPSO is a typical nonlinear optimization algorithm that can be used to solve this problem. ...

... Therefore, reducing the number of input variables through feature selection is an effective means of making joint moment prediction portable while improving the efficiency of the prediction algorithm [32], [33]. Considering the nonlinearity of the joint moment prediction model and the problems of feature selection methods such as mutual information [24] and Elastic Net [25], we used BPSO to obtain the optimal input subset for joint moment prediction. BPSO is an effective population-based method for solving discrete optimization problems, first introduced by Kennedy and Eberhart in 1997 [26]. ...

Joint moment is an important parameter for quantitative assessment of human motor function. However, most existing joint moment prediction methods lack feature selection of an optimal input subset, which reduces prediction accuracy and output comprehensibility and increases the complexity of the input sensor structure, making portable prediction equipment impossible to achieve. To address this problem, this paper develops a novel method based on binary particle swarm optimization (BPSO), with the variance accounted for (VAF) as the fitness function, to reduce the number of input variables while improving the accuracy of joint moment prediction. The proposed method is tested on experimental data collected from ten healthy subjects running on a treadmill at four different speeds of 2, 3, 4 and 5 m/s. BPSO is used to select an optimal input subset from ten electromyography (EMG) signals and six joint angles, and the selected subset is then used to train an artificial neural network (ANN) to predict the joint moments. Prediction accuracy is evaluated by the VAF between the predicted joint moment and the multi-body dynamics moment. Results show that the proposed method can reduce the number of input variables for five joint moments from 16 to fewer than 11. Furthermore, the proposed method predicts joint moment better (mean VAF: 94.40±0.84%) than the state-of-the-art methods, i.e. Elastic Net (mean VAF: 93.38±0.96%) and mutual information (mean VAF: 86.27±1.41%). In conclusion, the proposed method reduces the number of input variables and improves prediction accuracy, which may allow the future development of a portable, non-invasive system for joint moment prediction and thus facilitate real-time assessment of human motor function.
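The BPSO variant the abstract refers to (Kennedy and Eberhart, 1997) keeps real-valued velocities and maps them through a sigmoid to bit-flip probabilities. A minimal sketch follows; the toy bit-matching fitness stands in for the paper's VAF objective, and all parameter values are illustrative defaults, not the authors' settings.

```python
import math
import random

def bpso(fitness, n_bits, n_particles=20, iters=60,
         w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal binary PSO: real velocities, sigmoid transfer to bit
    probabilities. Maximizes `fitness` over {0,1}^n_bits."""
    rng = random.Random(seed)
    pos = [[rng.randrange(2) for _ in range(n_bits)] for _ in range(n_particles)]
    vel = [[0.0] * n_bits for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # per-particle best position
    pbest_f = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]       # swarm best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_bits):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))  # sigmoid transfer
                pos[i][d] = 1 if rng.random() < prob else 0
            f = fitness(pos[i])
            if f > pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f > gbest_f:
                    gbest, gbest_f = pos[i][:], f
    return gbest, gbest_f

# Toy demo: recover a hidden bit mask by maximizing the number of matching
# bits (in the paper, bits mark selected inputs and VAF is the fitness).
target = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
best, best_f = bpso(lambda b: sum(x == t for x, t in zip(b, target)), n_bits=10)
```

In the feature selection setting, each bit marks whether the corresponding EMG or joint-angle input is included, and the fitness would be the VAF of an ANN trained on that subset.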

... In terms of food, for example, this is also supported by the changing role of women: many women today not only take care of the household but also pursue careers, leaving almost no time to prepare food for their families. It is therefore no wonder that fast food is often chosen as a main alternative, given the taste of the food served and serving times of only minutes. Yet, unwittingly, some types of frequently consumed fast food contain hazardous ingredients [9], [10] that trigger chronic diseases such as heart attack, insulin resistance, diabetes, and several other dangerous conditions [8], [11]. ...

Formalin and borax are additives widely used to make food durable and keep it from going stale. Their use in food is not recommended because they can cause many diseases, especially cancer. To determine whether a food contains formalin or borax, it can be examined using an expert system based on the characteristics of the food; in this case the samples are meatballs, one of the favorite foods in Indonesia. The certainty factor method is one of the expert system methods that can perform checks based on facts about the food and on the certainty values given. The results obtained using the certainty factor method can serve as a reference for consumers in choosing foods that are healthy and do not contain formalin or borax.

... Usually, greedy forward feature selection is stopped when the estimated MI values start to decrease, but this practice lacks justification. (François et al., 2007; Verleysen, Rossi and François, 2009) introduced the use of resampling and permutation to provide a statistically justifiable stopping criterion. However, in both cases, the method was tested on a randomly generated dataset. ...

... Even though randomly generated artificial datasets were used in the previous studies to validate the effectiveness of the algorithm, real-world datasets are required to test its effectiveness. Moreover, despite the progress made in the previous studies to improve greedy methods for feature selection, chances are that the feature subsets they yielded were far from optimal, optimal, or only almost optimal (Pudil, Novovicova and Kittler, 1994; Verleysen, Rossi and François, 2009; Sulaiman and Labadin, 2015b). ...

... So, the main contribution of this study is to establish a methodology for generating an optimal feature subset for the filter method by extending the previous work proposed by Verleysen, Rossi and François (2009 ...

Mutual information (MI) is an information-theoretic concept often used as a criterion for feature selection methods, owing to its ability to capture both linear and non-linear dependency relationships between two variables. In theory, mutual information is formulated based on the probability density functions (pdfs) or entropies of the two variables. In most machine learning applications, mutual information estimation is formulated for classification problems (that is, data with labeled output). This study investigates the use of mutual information estimation as a feature selection criterion for regression tasks and introduces enhancements for selecting an optimal feature subset based on previous works. Specifically, focusing on regression tasks, it builds on previous work in which a scientifically sound stopping criterion for greedy feature selection algorithms was proposed. Four real-world regression datasets were used in this study: three are public datasets obtained from the UCI machine learning repository and the remaining one is a private well-log dataset. Two machine learning models, namely multiple regression and artificial neural networks (ANN), were used to test the performance of IFSMIR. The results obtained prove the effectiveness of the proposed method.

... significantly increases the MI, compared to the situation where an irrelevant feature is added. The method described here is the one presented in [14], which is also based on resampling and a random permutation operation. ...

... Many feature selection algorithms in the filter model rely on certain metrics to rank or eliminate features. For instance, correlation [Yu and Liu, 2003, Cui et al., 2010, Woolrich et al., 2001], t-test [Zhou and Wang, 2007, Tusher et al., 2001] and mutual information (MI) [Peng et al., 2005, Verleysen et al., 2009, Thomas M. Cover, 2006, Vinh et al., 2010, Chou et al., 2012] have been used to rank features or eliminate irrelevant features. The wrapper model requires a learning algorithm to assess classification performance (e.g., prediction accuracy or cardinality) as the evaluation criterion to select features [Chou et al., 2013, Kohavi and John, 1997, Reunanen, 2003]. ...

... In our previous study [Chou et al., 2013], a new feature selection method based on an MI criterion, called maximum informativeness (MaxI), was developed. MI is widely used as a criterion to rank feature relevance [Verleysen et al., 2009, Thomas M. Cover, 2006, Vinh et al., 2010, Chou et al., 2012], starting from calculating the MI between each feature and the class label vector. MaxI prioritizes the voxels to be selected based on the informativeness of individual features with respect to class labels, assessed by the value of MI (called the importance index). ...

Feature selection plays an important role in the successful application of machine learning techniques to large real-world datasets. Avoiding model overfitting, especially when the number of features far exceeds the number of observations, requires selecting informative features and/or eliminating irrelevant ones. Searching for an optimal subset of features can be computationally expensive. Functional magnetic resonance imaging (fMRI) produces datasets with such characteristics, creating challenges for applying machine learning techniques to classify cognitive states based on fMRI data. In this study, we present an embedded feature selection framework that integrates sparse optimization for regularization (or sparse regularization) and classification. This optimization approach attempts to maximize training accuracy while simultaneously enforcing sparsity by penalizing the objective function for the coefficients of the features. This process allows many coefficients to become zero, which effectively eliminates their corresponding features from the classification model. To demonstrate the utility of the approach, we apply our framework to three different real-world fMRI datasets. The results show that regularized classifiers yield better classification accuracy, especially when the number of initial features is large. The results further show that sparse regularization is key to achieving scientifically relevant generalizability and functional localization of classifier features. The approach is thus highly suited for analysis of fMRI data.
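The coefficient-zeroing mechanism the abstract describes can be illustrated with an L1-penalized model fitted by proximal gradient descent (ISTA). This is a generic sketch, not the authors' framework: it uses squared loss rather than a classification loss, plain Python rather than their solver, and all names are illustrative.

```python
import random

def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink toward zero by t,
    setting small coefficients exactly to zero."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]

def lasso_ista(X, y, lam, step=0.01, iters=2000):
    """ISTA for the lasso: gradient step on the squared error, then
    soft-thresholding; irrelevant features' coefficients end up at zero."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(iters):
        resid = [sum(X[i][j] * w[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        w = soft_threshold([w[j] - step * grad[j] for j in range(p)], step * lam)
    return w

# Toy data: only the first two of five features generate the response.
rng = random.Random(0)
n, p = 200, 5
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2 * row[0] - 3 * row[1] + 0.1 * rng.gauss(0, 1) for row in X]
w = lasso_ista(X, y, lam=0.1)
```

The three uninformative coefficients are driven to (essentially) zero, which is exactly the embedded feature elimination effect the abstract relies on, while the two informative coefficients are recovered up to a small shrinkage bias.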

... Unfortunately, there are exponentially many subsets and the computation becomes infeasible, as even basic settings are NP-hard [11]. This gives rise to greedy approaches such as forward search [6]. ...

... Our method is based on MI, which is widely used in the field to select features in supervised as well as unsupervised scenarios. Although the presented approach is unsupervised, it is possible, as shown in [6], to compute the pairwise MI with respect to the auxiliary information, thus turning it into a supervised technique. ...

... The relevance criterion can be based on the performance of a specific predictor (wrapper method), or on some general relevance measure of the variables for the prediction (filter method). Wrapper methods may have two drawbacks [2]: (a) they can be computationally very intensive; (b) their results may vary according to initial conditions or other chosen parameters. In the case of variable selection for regression, several studies have applied different regression algorithms in an attempt to minimize the cost of the search in the variable space [3,4]. ...

This paper presents a supervised variable selection method for regression problems. The method selects variables by applying a hierarchical clustering strategy based on information measures. The proposed technique can be applied to single-output regression datasets and is extendable to multi-output datasets. For single-output datasets, the method is compared against three other variable selection methods for regression on four datasets. In the multi-output case, it is compared against another state-of-the-art method and tested on two regression datasets. Two different figures of merit are used (for the single- and multi-output cases) to analyze and compare the performance of the proposed method.