The Impact Of Data Valuation On Feature
Importance In Classification Models
Malick Ebiele*, Malika Bendechache, Marie Ward, Una Geary, Declan Byrne,
Donnacha Creagh, and Rob Brennan
Malick Ebiele*
University College Dublin, Dublin, Ireland. e-mail: malick.ebiele@adaptcentre.ie
Malika Bendechache
University of Galway, Galway, Ireland. e-mail: malika.bendechache@universityofgalway.ie
Marie Ward
Saint James's Hospital, Dublin, Ireland. e-mail: MaWard@stjames.ie
Una Geary
Saint James's Hospital, Dublin, Ireland. e-mail: UGeary@stjames.ie
Declan Byrne
Saint James's Hospital, Dublin, Ireland. e-mail: DeByrne@stjames.ie
Donnacha Creagh
Saint James's Hospital, Dublin, Ireland. e-mail: DCreagh@stjames.ie
Rob Brennan
University College Dublin, Dublin, Ireland. e-mail: rob.brennan@ucd.ie
Abstract This paper investigates the impact of data valuation metrics (variability
and coefficient of variation) on the feature importance in classification models.
Data valuation is an emerging topic in the fields of data science, accounting, data
quality, and information economics concerned with methods to calculate the value
of data. Feature importance or ranking is important in explaining how black-box
machine learning models make predictions as well as selecting the most predictive
features while training these models. Existing feature importance algorithms are
either computationally expensive (e.g. SHAP values) or biased (e.g. Gini importance
in Tree-based models). No previous investigation of the impact of data valuation
metrics on feature importance has been conducted. Five popular machine learning
models (eXtreme Gradient Boosting (XGB), Random Forest (RF), Logistic Regression
(LR), Multi-Layer Perceptron (MLP), and Naive Bayes (NB)) have been used as
well as six widely implemented feature ranking techniques (Information Gain, Gini
importance, Frequency Importance, Cover Importance, Permutation Importance,
and SHAP values) to investigate the relationship between feature importance and
data valuation metrics for a clinical use case. XGB outperforms the other models
with a weighted F1-score of 79.72%. The findings suggest that features with
variability greater than 0.4 or a coefficient of variation greater than 23.4 have little to
no value; therefore, these features can be filtered out during feature selection. This
result, if generalisable, could simplify feature selection and data preparation.
Keywords: Data Value, Machine Learning, Feature Importance, Feature Selection,
Explainable AI.
1 Introduction
Data valuation is an emerging topic in the fields of data science, accounting, data
quality and information economics concerned with methods to calculate the value
of data. Fleckenstein et al. [1] identified three categories of approaches to data
valuation: market-based valuation, economic models, and dimension-based models;
for more details, please refer to [1].
Data is a core resource for developing machine learning (ML) models. Hence,
ML researchers are starting to investigate the use of data valuation techniques [5].
Scholars aim to identify either the most influential subset of data for ML model
training or the most influential features for feature selection and model explanation
[5–7, 13]. Our approach is focused on identifying the most influential features in
a dataset. This is important for feature selection, model explainability, and has
potential applications in data protection by reducing the amount of data that needs
to be shared. In the case of feature selection, exhaustive search is a very complex
and time-consuming process [21]: for a set of n features, there are 2^n − 1 possible
non-empty feature subsets, so exhaustive search can only be completed for a very
small number of features [21]. Therefore, many heuristic techniques have been
proposed. Most are guided search algorithms, such as forward or backward search,
which add a feature to the feature set only if that feature satisfies the selection
criterion (for example, minimising the error or maximising the accuracy). Although
these approaches are effective and widely used in practice, they do not always
produce the optimal feature set and they require training the models. In this paper,
we introduce a new, training-free method for feature selection based on data valuation.
First, we must investigate how data valuation metrics are related to feature ranking
measures. To do that, we chose variability and the coefficient of variation (CoV)
as data content valuation metrics, as well as five classification models: eXtreme
Gradient Boosting (XGB), Random Forest (RF), Logistic Regression (LR), Multi-Layer
Perceptron (MLP), and Naive Bayes (NB). On the one hand, variability was chosen to
investigate its impact on the models: as shown by [6, 8, 9], Random Forest is biased
toward features with high variability. As for the coefficient of variation, the higher
the spread of the data points, the more difficult it is to cluster them; we therefore
suspect that the spread of a feature may be related to its ranking. On the other hand,
XGB and RF were chosen because they are robust tree-based models, MLP because it is a
(simple) neural network, LR because it is the simplest classifier, and NB because it
is a probabilistic model. Finally, the six feature ranking techniques (Information
Gain, Gini importance, Frequency, Cover, Permutation Importance, and SHAP values)
were used because they are the most popular feature ranking techniques.
Our research questions are:
• To what extent can variability and coefficient of variation predict the value of a
feature?
• To what extent are feature importance and data valuation metrics related?
• To what extent can variability and coefficient of variation be used for pre-training
feature selection?
To answer these research questions, we took a large, complex dataset for a real-world
clinical classification problem, trained the five ML models listed above, calculated
their feature importance using six importance measures (SHAP (SHapley Additive
exPlanation) values, Permutation Importance, Information Gain, Gini Importance,
Frequency, and Cover), computed the variability and the coefficient of variation of
each feature, and finally compared each feature's importance with its value. This
paper makes the following contributions:
• The introduction of variability and coefficient of variation as data value
measures; unlike the number of unique values, which is an absolute measure,
variability is the relative number of unique values
• The introduction of a new training-free feature selection approach
• A comprehensive study of the relationships between feature importance and data
valuation metrics for our clinical use case.
• The introduction of an extended ML process for investigating the relationship
between data valuation metrics and feature importance.
The remainder of this paper is organised as follows: Section 2 defines the
background, Section 3 describes the related work, Section 4 describes our use case,
Section 5 proposes a new data value informed ML process, and Section 6 describes the
experiments, followed by the experimental results (Section 7), the discussion
(Section 8), and finally the conclusion (Section 9).
2 Background
This section defines the key terms as well as the motivation for the choices made.
2.1 Definitions
Data valuation is a set of techniques to assign value (e.g. economic or social value)
to datasets or data assets [27]. It aims to identify which data is important for a given
context with respect to predefined criteria. For example, in training machine learning
(ML) models, an important datum or feature is one that improves the model's
performance by reducing its error or improving its accuracy. In business, the most
important data could be the data most used on a daily basis for business operations,
the data that most business units depend upon, or the data that most increases
business revenue. This demonstrates that the same data can have multiple, variable
values in different contexts.
Several techniques have been used to determine the value of a specific datum or
dataset, ranging from qualitative metrics to quantitative and even ML-based ones
[1, 4, 19]. There is a wide range of data value dimensions and metrics in the
literature [24]; some dimensions are usage, utility, cost, uniqueness, recency, and
recurrency. In this paper, we define two new metrics: variability and coefficient of
variation.
Variability is defined as the relative number of unique values. For example, given
two datasets D1 and D2 with 100 and 1,000 entries, respectively, if F1 from D1 and
F2 from D2 are two features with the same number of unique values, equal to 20, then
they have variability 0.2 and 0.02, respectively. The two features have the same
absolute number of unique values but drastically different relative numbers of unique
values (variability). Variability thus allows the comparison of the number of unique
values of features from different datasets. As for the coefficient of variation [28],
it is defined as the relative spread of the data points; it is a statistical measure
applied here to data valuation. The formulas used to calculate the variability and
the coefficient of variation are given below; a small illustrative implementation
follows the list.
• The variability of the feature F_j:

  RU(F_j) = |F_j| / N,   (1)

  where |F_j| is the number of unique values of F_j and N is the number of entries of
  the dataset containing the feature F_j; RU stands for Relative Uniqueness
  (uniqueness means the number of unique values).
• The coefficient of variation (CoV) of the feature F_j:

  CoV(F_j) = (1 / µ) · sqrt( Σ_i (F_ij − µ)² / dof ),   (2)

  where µ is the mean of F_j, dof (Degrees Of Freedom) is equal to N − 1 (for a
  sample) or N (for the population), and F_ij is the i-th entry of the j-th feature.
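To make these definitions concrete, the following is a minimal sketch (not the
authors' implementation) that computes both metrics for every column of a pandas
DataFrame; the DataFrame name `train` and the use of sample degrees of freedom are
assumptions.

```python
import pandas as pd

def variability(feature: pd.Series) -> float:
    # Relative Uniqueness (Eq. 1): number of unique values divided by number of entries.
    return feature.nunique() / len(feature)

def coefficient_of_variation(feature: pd.Series, sample: bool = True) -> float:
    # CoV (Eq. 2): standard deviation over mean, with dof = N - 1 for a sample, N otherwise.
    return feature.std(ddof=1 if sample else 0) / feature.mean()

# Per the process described below, compute both metrics on the unscaled training
# split only. `train` is an assumed DataFrame of (already encoded) numeric features.
metrics = pd.DataFrame({
    "variability": train.apply(variability),
    "cov": train.apply(coefficient_of_variation),
})
```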
2.2 Motivations
The models can be evaluated in terms of predictive power using the average Precision,
average Recall, and average F1-score metrics. This evaluation is necessary to make
sure that each model represents the feature space as well as possible, because the
better a model performs, the better it captures the data and the better the features
are used. Three averaging techniques will be used (definitions follow the
scikit-learn documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html):
micro average (calculate metrics globally by counting the total true positives, false
negatives, and false positives), macro average (calculate metrics for each label and
find their unweighted mean; this does not take label imbalance into account), and
weighted average (calculate metrics for each label and find their average weighted by
the number of true instances for each label; this alters the macro average to account
for label imbalance). The weighted average F1-score will be used as the main
performance metric because the F1-score is the trade-off between precision and recall
and, as mentioned above, the weighted average takes the imbalance of the data into
account. In the process described below, the variability and coefficient of variation
should be computed on the training data, because computing these metrics before
splitting the data would mislead the comparison performed, as the data valuation
metrics and the feature importance would then be computed on different sets. These
metrics also need to be calculated on the original dataset before applying any
scaling technique. The main reason is that scaling keeps the distribution but alters
statistics such as the standard deviation and the mean (which are used to compute the
coefficient of variation) and sometimes even the variability (because some scaling
techniques are performed row-wise rather than feature-wise).
We selected six feature ranking techniques. Gini importance is based on the Gini
index, which is used to compute the mean impurity index in Random Forests.
Information Gain is the average or total gain across all splits the feature is used
in; Frequency Importance is the number of times a feature is used to split the data
across all trees; and Cover Importance is the average or total relative number of
training instances seen across all splits the feature is used in (these three are
defined as in the XGBoost documentation:
https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.Booster.get_score).
Permutation Importance measures the impact of a given feature by shuffling that
feature and then measuring the performance of the model; the change in performance,
whether positive or negative, is the importance of the feature. If the performance
improves, the feature is ranked low; if the performance worsens, the feature is
ranked high, the rank being the value of the change in performance. SHAP values
importance works quite similarly to Permutation Importance but, instead of shuffling,
is based on cooperative game theory to measure the change in performance. The first
four ranking techniques are computed while training the models and are therefore
model-specific (mostly to tree-based models), while the latter two are only estimated
after the models have been trained. Another reason for this selection is to check
whether these different feature ranking algorithms agree on the ranking of the
features.
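As an illustration of the permutation idea, the sketch below uses scikit-learn's
permutation_importance with the weighted F1 scorer used in this paper; the synthetic
data and the Random Forest model are stand-ins for illustration, not the study's
dataset or tuned models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time and record the drop in weighted F1-score:
# a large drop means the model relied heavily on that feature.
result = permutation_importance(model, X_test, y_test,
                                scoring="f1_weighted", n_repeats=10, random_state=0)
print(result.importances_mean)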
3 Related Work
Feature importance has been widely studied for the purpose of explaining black-box
machine and deep learning algorithms. These studies aim to detect the features or
variables that contribute most to predictive model outputs. Significant feature
ranking algorithms include SHAP (SHapley Additive exPlanation) values, Permutation
Importance, Information Gain (entropy, log-loss), Gini Importance, Frequency
Importance, and Cover Importance. The first two are model-independent, while the
latter four are tree-based techniques. SHAP values are based on cooperative game
theory and were first introduced by Shapley [26]. They are widely used in machine
learning for their robustness and effectiveness in capturing outliers and corruptions
[5]. For that last reason, SHAP values have received increasing attention in the data
valuation community in recent years, especially for detecting the high-value data
points for training predictive models [5]. Previous studies have focused either on
detecting which features are the most influential for machine learning model outputs
[12, 16–18, 20–22] or on choosing the best datum or subset of data to improve model
performance during the training process [5, 13].
There have also been some studies of the bias or shortcomings of feature selection
and model explanation techniques, with some attempts to solve them [6, 8, 9]. In this
paper, we investigate the relationship between a feature's importance and its value
as measured by its variability and coefficient of variation. If the influence of the
feature's value on its ranking can be accurately estimated, then training-free or
pre-training feature selection can be performed; all the existing approaches perform
feature selection during or after the training of the models. To the best of our
knowledge, no such work has been done before. Strobl et al. [8] showed that
tree-based Gini importance and permutation importance in Random Forests are not
reliable when the features have different scales and numbers of unique values
(variability). Strobl [8] and Loecher [6] proposed different approaches to solve this
problem. In [6], Loecher showed that even SHAP values have shortcomings and strongly
depend on the number of unique values, or variability, of a feature.
Our work differs from all these previous contributions in that (i) we are not trying
to identify the shortcomings of feature importance measures or attempting to solve
them, but rather to investigate how they relate to variability and coefficient of
variation; (ii) they focus only on Random Forest, while we use different types of
algorithms (probabilistic: Naive Bayes; tree-based: Random Forest and eXtreme
Gradient Boosting; linear modelling: Logistic Regression; and neural networks:
Multi-Layer Perceptron); and (iii) the data is scaled before training the models
(therefore we believe that the problem related to the scale difference is solved).
4 Use Case
To test our proposed approach on the five selected ML algorithms, we consider a
real dataset and classification problem from a major hospital in Ireland.
The data used in this paper is extracted from several sources in the hospital data
management system. It consists of daily data on patient journeys within the hospital
(e.g. beds and wards used, admission and discharge dates) as well as other data
(e.g. area of residence, gender, discharge code). In total there are 152 features and
938,296 patient records. The classification aim is to predict patients' outcomes at
any given time throughout their journey within the hospital. Given the complexity of
the dataset, exploring the whole feature space with traditional ML techniques is
extremely time- and resource-intensive. This, along with the variety of feature types
and time series behaviours included, makes it a good candidate to investigate the
research questions defined in the Introduction.
4.1 Ethics and Data Sharing
This research was covered as part of DCU research ethics committee application
DCUREC/2021/118 and the data was transferred under a data processing agreement
between the hospital and universities. The dataset was anonymised before it was
shared and so contains no personal data. The research described here was conducted
as part of a broader programme investigating patient infection risk and improving
hospital safety systems. This large clinical classification problem highlighted issues
in model training time and traditional feature selection techniques that led to this
work to optimise the process and more rapidly protect patients.
5 Extended ML Process for Investigating the Impact of Data
Valuation Metrics on Feature Importance
Fig. 1 displays the extended ML process used to investigate the impact of data
valuation metrics on feature importance; it extends the process presented in [29].

Fig. 1: Extended ML Process for Investigating the Impact of Data Valuation Metrics
on Feature Importance

The three added steps are described below; please refer to [29] for a description of
the standard steps.
Data valuation metrics: Calculation of the data valuation metrics on each feature.
In this use case, the variability and coefficient of variation are computed on the
training data using the formulas defined in the Background section above.
Feature Importance extraction or calculation: The main objective here is to
investigate how much each feature influences the predictions of the best model (the
model selected after the model analysis), either by extracting the feature ranking
from the models (tree-based models) or by calculating the ranking of the features
(Permutation Importance, SHAP values).
Investigation of the relationship between feature importance and data valuation
metrics: This step is about visualising, calculating, and exploring every possible
relationship between the feature importance and the data valuation metrics.
Visualisation should always be the first task undertaken in this step, because it may
guide the remaining exploration and can lead to insights that are straightforward to
obtain visually but difficult to derive mathematically. Next, the model life-cycle
can be restarted from the training and tuning step to incorporate the insights gained
from investigating the relationship between feature importance and data valuation
metrics.
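A sketch of this step, assuming per-feature arrays produced by the earlier steps (the
names variability_values, cov_values, and mean_abs_shap are placeholders): visualise
first, as recommended above, then quantify with a rank correlation.

```python
import pandas as pd
from scipy.stats import spearmanr

df = pd.DataFrame({"variability": variability_values,   # placeholder inputs,
                   "cov": cov_values,                    # one value per feature
                   "importance": mean_abs_shap})

df.plot.scatter(x="variability", y="importance")         # visualisation first
rho, p = spearmanr(df["variability"], df["importance"])  # then a rank correlation
print(f"Spearman rho = {rho:.3f} (p = {p:.3g})")
```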
6 Experiments
This section discusses the preprocessing steps, the training dataset details, and the
model training setups. After all these steps, the feature importance is calculated and
plotted against the variability and coefficient of variation.
6.1 Dataset
The model tuning and the SHAP values computation are very time-consuming.
Therefore, we decided to perform a stratified random sampling of the dataset.
Stratified random sampling was chosen over simple random sampling because we wanted
the training data to reflect the per-class distribution of entries in the original
dataset. Table 1 below shows the number of entries per class before and after the
stratified random sampling.
Table 1: The number of entries per class before and after stratified random sampling

Class ID | Entries before sampling (proportion in %) | Entries after sampling (proportion in %)
0        | 75,051 (81.72%)                           | 3,753 (81.72%)
1        |  5,596 (6.09%)                            |   280 (6.09%)
2        |  3,484 (3.79%)                            |   174 (3.78%)
3        |  2,661 (2.89%)                            |   133 (2.89%)
4        |  2,139 (2.32%)                            |   107 (2.33%)
5        |  2,011 (2.18%)                            |   101 (2.19%)
6        |    531 (0.57%)                            |    26 (0.56%)
7        |    363 (0.39%)                            |    18 (0.39%)
Total    | 91,836                                    | 4,592
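The counts in Table 1 correspond to a sampling fraction of roughly 5% (4,592 of
91,836). A minimal sketch of such per-class sampling in pandas, assuming the
preprocessed DataFrame is named df and the class labels live in a column named
class_id (both names are assumptions):

```python
# Draw the same fraction from every class so the per-class proportions
# of the original dataset are preserved in the sample.
sample = df.groupby("class_id", group_keys=False).sample(frac=0.05, random_state=0)
```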
6.2 Data Preprocessing
The original dataset contains 938,296 rows and 152 columns of inpatient discharges
(from 1 January 2018 until 28 February 2022) from a large acute hospital. The first
preprocessing task was to drop the empty columns (columns where all the values are
missing), single-value columns, and duplicated rows from the data: 21 empty columns,
1 single-value column, and 57 duplicated entries were dropped. The resulting dataset
contains 938,239 rows and 130 columns; 117 contain categorical data, 7 contain
numerical data, and 6 contain date-time data. The second step was grouping the data
per patient id, episode id, and discharge description to get a unique discharge
description per entry. If multiple values are encountered for some columns, a set is
returned; for example, if a patient used two or more beds or wards during their
episode, a set of the beds or wards used is returned. Each of these sets is then
encoded afterwards using the ordinal encoder. The resulting dataset has 91,836
entries and 130 columns. Two new columns (episode length in days = discharge date −
admission date; screen length in days = Last Date Of CPE Screen − First Date Of CPE
Screen) were calculated and added to the dataset. In the third step, for 5 of the
datetime columns, the following time features are added to the dataset: year,
quarter, month, week, day, hour, minute, dayofweek, isweekend, and isholiday. The
remaining datetime column is used to compute the number of overnight stays of a
patient during their stay; the datetime columns are then deleted. For each of the
added datetime features, −1 is imputed when the original date is missing, except the
year feature, which is imputed with the minimum year minus one. After this step, the
number of columns increased from 130 to 175 (175 = 130 + 10 × 5 − 5). In the fourth
and last step, the 117 categorical columns were encoded using the ordinal encoder,
with zero imputed for all missing values. The ordinal encoder was chosen over the
one-hot encoder because of the high variability of some of the columns (a separate
figure shows the number of unique values per column). One-hot encoding would have
increased the data dimension drastically, which might have invoked the "curse of
dimensionality", especially for tree-based models. The sets returned by the grouping
are encoded randomly (without taking their size into account).
6.3 Model Training
The models were trained using 5-fold Randomized Grid-Search Cross-Validation
(RGS-CV) with weighted F1-score scoring on an Ubuntu 18.04 LTS machine with 16 GB
RAM and 2 cores. The RGS-CV is run 5 times, once for each scaling technique:
SimpleNormalizer, Normalizer, RobustScaler, MinMaxScaler, and StandardScaler. The
experiments were run using the scikit-learn package; all the scalers were used with
their default parameters. SimpleNormalizer is a self-implementation that divides each
feature in the dataset by its maximum value. The data transformation and parameters
with the highest weighted average F1-score are returned at the end of these runs. The
model is then retrained using the returned configuration and tested on the test data
to report the model's performance.
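A condensed sketch of this search loop (the estimator, parameter grid, and the
X_train/y_train and X_test/y_test splits are placeholders; the authors'
SimpleNormalizer and actual grids are not reproduced):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler

best = None
for scaler in (Normalizer(), RobustScaler(), MinMaxScaler(), StandardScaler()):
    pipe = Pipeline([("scale", scaler), ("clf", RandomForestClassifier(random_state=0))])
    search = RandomizedSearchCV(
        pipe,
        {"clf__n_estimators": [100, 300, 500], "clf__max_depth": [None, 10, 30]},
        n_iter=6, cv=5, scoring="f1_weighted", random_state=0,
    )
    search.fit(X_train, y_train)  # X_train, y_train: assumed training split
    if best is None or search.best_score_ > best.best_score_:
        best = search

# Retraining happens inside the search (refit=True); evaluate on the held-out test set.
print(best.best_params_, best.score(X_test, y_test))
```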
6.4 Model Performance Report
Once the training of each model is completed, the parameters as well as the scaling
technique of the best model in terms of F1-score are returned. Then, the model is
retrained using those parameters and the test data is scaled using the best scaler.
Table 2 below shows the model performance on the test data. It can be seen that XGB
outperforms the other models on 8 out of 9 performance metrics, and most importantly
on the main metric, the weighted average F1-score.
Table 2: Model performance report in percentage (%), plus training and SHAP values
computation times (days-hh:mm:ss).

Model | Precision* (Micro / Macro / Weighted) | Recall* (Micro / Macro / Weighted) | F1-score* (Micro / Macro / Weighted) | Training time** | SHAP values time**
XGB   | 84.23 / 47.10 / 80.26 | 84.23 / 24.27 / 84.23 | 84.23 / 27.74 / 79.72 | 0-07:37:00 | 1-07:00:31
RF    | 77.70 / 29.51 / 77.99 | 77.70 / 29.03 / 77.70 | 77.70 / 27.29 / 77.35 | 0-01:52:01 | 0-07:36:21
MLP   | 81.62 / 29.05 / 74.89 | 81.62 / 19.37 / 81.62 | 81.62 / 20.39 / 76.52 | 0-00:40:41 | 0-06:39:14
NB    | 81.27 / 21.58 / 72.11 | 81.27 / 13.30 / 81.27 | 81.27 / 12.95 / 74.26 | 0-00:00:15 | 0-06:58:52
LR    | 82.14 / 22.30 / 71.57 | 82.14 / 16.72 / 82.14 | 82.14 / 17.67 / 75.78 | 0-00:33:37 | 0-06:17:10

* The higher the better. ** The lower the better.
7 Experimental Results
Fig. 2 below displays the relationship between the variability (on the x-axis) and
the feature ranking techniques (on the y-axis).
The top (first) row plots the variability against the Permutation Importance and SHAP
values. It can be seen that the most influential features have a variability close to
zero and that, as the variability increases, the features' ranking decreases, until
0.4. Above 0.4, the features have very little influence (zero or close-to-zero
Permutation Importance and SHAP values) for most models.
The middle row displays the variability against Gain Importance and Cover Importance.
The most influential features are located to the left and, as the variability
increases, the features' ranking decreases until 0.2, where it becomes roughly
constant, around 1.75 for Gain Importance and 80 for Cover Importance.
The bottom row plots the variability against the Frequency Importance and Gini
Importance. The most influential features have a variability close to zero and, as
the variability increases, some features' rankings increase logarithmically for
Frequency Importance and linearly for Gini Importance. The other features' rankings
lie between the logarithmic or linear line and the vertical line at variability equal
to zero.
We conclude that there is no linear relationship between the variability and the
feature ranking. However, it can be observed that features with variability greater
than or equal to 0.4 have very little to no importance across 4 out of the 6 ranking
techniques. The exception for Frequency Importance makes sense because the higher the
variability, the more difficult it is to group the entries per value. As for Gini
Importance, the exception might be due to the known shortcomings of Random Forest
with respect to features with high variability [6, 8].
Fig. 2: Variability versus feature importance

Fig. 3 below shows the relationship between the coefficient of variation (on the
x-axis) and the feature ranking techniques (on the y-axis).
The first row plots the coefficient of variation against the Permutation Importance
and SHAP values. The most influential features have a coefficient of variation close
to zero and, as the coefficient of variation increases, the Permutation Importance
and SHAP values decrease, until 11.85 (where 11.85 = min CoV + 0.23 × (max CoV − min
CoV)). Between 11.85 and 31.3 (where 31.3 = min CoV + 0.55 × (max CoV − min CoV)),
the Permutation Importance fluctuates around the line y = 0 within a small range.
Above 31.3, the features have zero Permutation Importance for all models. As for the
SHAP values, above 11.85 the features have little influence, and their SHAP values
fluctuate around the line y = 0 within a range smaller than that of features with a
CoV of less than 11.85.
The middle row displays the coefficient of variation against Gain Importance and
Cover Importance. All influential features are located to the left of the vertical
line at 23.4 (where 23.4 = min CoV + 0.42 × (max CoV − min CoV)); above 23.4, the
features have zero influence.
The bottom row plots the coefficient of variation against the Frequency Importance
and Gini Importance. The most influential features have a coefficient of variation of
less than 11.85; above 11.85, the features have no influence.
We conclude that there is no linear relationship between the coefficient of variation
and the feature importance ranking. However, as depicted in the graphs, features with
a coefficient of variation greater than or equal to 23.4 (on average) have little to
no importance. Also, features with a negative CoV have small SHAP values and zero
importance for all other feature ranking techniques.
We found that, overall, as the variability and the coefficient of variation of a
feature increase, that feature's predictive power decreases. The decay of the
predictive power of a feature is noticeable for a variability of 0.4 and above and a
coefficient of variation of 23.4 (on average) and above.
These experiments also show that Permutation Importance and SHAP values, Gain
Importance and Cover Importance, and Frequency Importance and Gini Importance behave
similarly relative to the features' variability and coefficient of variation.

Fig. 3: Coefficient of variation versus feature importance
8 Discussion
The results presented in this paper suggest that features with a variability greater
than or equal to 0.4 have very little to no importance for this use case, contrary to
the results presented by [6, 8, 9]: here, the features with a large number of unique
values were ranked low. This might be the result of intensive work to reduce model
bias towards features with high variability since these shortcomings were exposed by
[8]. However, RF, XGB, and MLP still assigned a fairly high Permutation Importance or
SHAP values importance to such features.
The comparison between the feature importance and the coefficient of variation (CoV)
showed that features with a coefficient of variation greater than or equal to 23.4
have zero Gain, Frequency, Cover, and Gini importance. The same features have zero
and little Permutation Importance for a coefficient of variation greater than 31.3
and between 23.4 and 31.3, respectively. In terms of SHAP values importance, most
models assigned such features little importance, but only NB consistently assigned
zero importance to those with a coefficient of variation greater than 31.3.
In general, features with a variability greater than or equal to 0.4 or a coefficient
of variation greater than or equal to 23.4 (on average) have a lower average
deviation from the line y = 0 (the zero-importance line). The coefficient of
variation seems to be a better predictor of the importance of the features than the
variability. The problem with the coefficient of variation is that it is a real
number and does not have a fixed range. Moreover, it can be misleading as a measure
of the spread of data points when the mean is close to zero. For example, consider
three normally distributed random variables V1 ∼ N(1, 1), V2 ∼ N(0.001, 1), and
V3 ∼ N(0, 1). V1, V2, and V3 have the same standard deviation but dramatically
different coefficients of variation: 1, 1000, and +∞ (infinity), respectively.
Therefore, a more robust relative spread measure needs to be defined.
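A quick numerical check of this instability (a sketch; the sample size and seed are
arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for mu in (1.0, 0.001):
    x = rng.normal(loc=mu, scale=1.0, size=1_000_000)
    # Same standard deviation, but the sample CoV blows up (and becomes unstable
    # in sign and magnitude) as the mean approaches zero.
    print(f"mean = {mu}: CoV = {x.std(ddof=1) / x.mean():.2f}")
```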
In spite of all the shortcomings listed above, the coefficient of variation is found
to be the better predictor of the (analytical) value of a feature. In this use case,
all the top-ranked features have a coefficient of variation within
[min CoV, min CoV + 0.42 × (max CoV − min CoV)].
This interval can be shrunk to [min CoV, min CoV + 0.23 × (max CoV − min CoV)] for
Frequency and Gini Importance.
In general, lower values of variability (less than 0.4) and of coefficient of
variation (less than 42% of the CoV's range above its minimum) have more impact on
the feature importance than higher values.
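Taken together, these observations suggest a simple pre-training filter. A hedged
sketch, reusing the metrics DataFrame from the earlier sketch and the thresholds
observed in this use case (they are not claimed to generalise):

```python
def prefilter_features(metrics, var_max: float = 0.4, cov_frac: float = 0.42):
    # Keep features below the variability threshold and inside the CoV interval
    # [min CoV, min CoV + cov_frac * (max CoV - min CoV)] observed in this use case.
    cov_cut = metrics["cov"].min() + cov_frac * (metrics["cov"].max() - metrics["cov"].min())
    keep = (metrics["variability"] < var_max) & (metrics["cov"] <= cov_cut)
    return list(metrics.index[keep])
```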
9 Conclusion
In this paper, we investigated the relationship between feature importance and two
newly introduced data valuation metrics for the content value dimension: variability
and coefficient of variation. We found that there is no linear relationship between
these two data valuation metrics and feature importance. However, we found that the
most influential features have a variability of less than 0.4 and a coefficient of
variation of less than 23.4 (on average). This suggests, in our use case, that
features with a variability greater than 0.4 or a coefficient of variation greater
than 23.4 (on average) are not relevant. This does not predict the actual value of a
feature, but instead helps to classify the features into non-influential and possibly
impactful feature sets. This finding, if generalised, would allow us to perform
training-free feature selection. It also has potential applications in data
protection or privacy, where non-influential features could be identified prior to
sharing and thus never shared, thereby increasing safety.
In future work, we aim firstly to generalise the findings of this paper by applying
the proposed process to different datasets (preferably from different domains) to
check the reproducibility of the insights presented here, and secondly to design a
more robust relative spread measure that addresses the shortcomings of the
coefficient of variation mentioned in the Discussion section. We also aim to
systematise the setting of the variability and coefficient of variation thresholds,
as they were chosen manually in this use case.
Acknowledgements This research has received funding from the ADAPT Centre for Digital
Content Technology, funded under the SFI Research Centres Programme (Grant 13/RC/2106 P2),
co-funded by the European Regional Development Fund. For the purpose of Open Access, the
author has applied a CC BY public copyright licence to any Author Accepted Manuscript version
arising from this submission.
References
1. Fleckenstein, M., Obaidi, A. & Tryfona, N. A Review of Data Valuation Approaches and
Building and Scoring a Data Valuation Model. Harvard Data Science Review.5(2023,1),
https://hdsr.mitpress.mit.edu/pub/1qxkrnig/release/1
2. Noshad, M., Choi, J., Sun, Y., Hero, A. & Dinov, I. A data value metric for
quantifying information content and utility. Journal Of Big Data.8, 82 (2021,6),
https://doi.org/10.1186/s40537-021-00446-6
3. Tang, S., Ghorbani, A., Yamashita, R., Rehman, S., Dunnmon, J., Zou, J. & Rubin, D.
Data Valuation for Medical Imaging Using Shapley Value: Application on A Large-scale
Chest X-ray Dataset. Scientific Reports.11, 8366 (2021,4), http://arxiv.org/abs/2010.08006,
arXiv:2010.08006 [cs, eess]
4. Yoon, J., Arik, S. & Pfister, T. Data Valuation using Reinforcement Learning. (arXiv,2019,9),
http://arxiv.org/abs/1909.11671, arXiv:1909.11671 [cs, stat]
5. Ghorbani, A. & Zou, J. Data Shapley: Equitable Valuation of Data for Machine Learning.
(arXiv,2019,6), http://arxiv.org/abs/1904.02868, arXiv:1904.02868 [cs, stat]
6. Loecher, M. Unbiased variable importance for random forests. Communications In
Statistics - Theory And Methods.51, 1413-1425 (2022,3), http://arxiv.org/abs/2003.02106,
arXiv:2003.02106 [cs, stat]
7. Lundberg, S. & Lee, S. A Unified Approach to Interpreting Model Predictions.
(arXiv,2017,11), http://arxiv.org/abs/1705.07874, arXiv:1705.07874 [cs, stat]
8. Strobl, C., Boulesteix, A., Zeileis, A. & Hothorn, T. Bias in random forest variable
importance measures: Illustrations, sources and a solution. BMC Bioinformatics.8, 25
(2007,1), https://doi.org/10.1186/1471-2105-8-25
9. Loecher, M. Debiasing MDI Feature Importance and SHAP Values in Tree Ensembles.
Machine Learning And Knowledge Extraction. pp. 114-129 (2022)
10. Baudeu, R., Wright, M. & Loecher, M. Are SHAP Values Biased Towards High-Entropy
Features?. Machine Learning And Principles And Practice Of Knowledge Discovery In
Databases. pp. 418-433 (2023)
11. Antwarg, L., Miller, R., Shapira, B. & Rokach, L. Explaining
anomalies detected by autoencoders using Shapley Additive Explanations.
Expert Systems With Applications.186 pp. 115736 (2021,12),
https://www.sciencedirect.com/science/article/pii/S0957417421011155
12. Maasland, T., Pereira, J., Bastos, D., Goffau, M., Nieuwdorp, M., Zwinderman, A. & Levin, E.
Interpretable Models via Pairwise Permutations Algorithm. Machine Learning And Principles
And Practice Of Knowledge Discovery In Databases. pp. 15-25 (2021)
13. Jia, R., Dao, D., Wang, B., Hubis, F., Hynes, N., Gürel, N., Li, B., Zhang, C., Song, D. &
Spanos, C. Towards Efficient Data Valuation Based on the Shapley Value. Proceedings Of
The Twenty-Second International Conference On Artificial Intelligence And Statistics. pp.
1167-1176 (2019,4), https://proceedings.mlr.press/v89/jia19a.html, ISSN: 2640-3498
14. Kumar, S., Lakshminarayanan, A., Chang, K., Guretno, F., Mien, I., Kalpathy-Cramer,
J., Krishnaswamy, P. & Singh, P. Towards More Efficient Data Valuation in Healthcare
Federated Learning Using Ensembling. Distributed, Collaborative, And Federated Learning,
And Affordable AI And Healthcare For Resource Diverse Global Health. pp. 119-129 (2022)
15. Gul, F. Bargaining Foundations of Shapley Value. Econometrica.57, 81-95 (1989),
https://www.jstor.org/stable/1912573, Publisher: [Wiley, Econometric Society]
16. Datta, A., Sen, S. & Zick, Y. Algorithmic Transparency via Quantitative Input Influence:
Theory and Experiments with Learning Systems. 2016 IEEE Symposium On Security And
Privacy (SP). pp. 598-617 (2016,5), ISSN: 2375-1207
17. Cohen, S., Dror, G. & Ruppin, E. Feature Selection via Coalitional Game Theory. Neural
Computation.19, 1939-1961 (2007,7), Conference Name: Neural Computation
18. Campbell, T., Roder, H., Georgantas III, R. & Roder, J. Exact Shapley values for local and
model-true explanations of decision tree ensembles. Machine Learning With Applications.9
pp. 100345 (2022,9), https://www.sciencedirect.com/science/article/pii/S2666827022000500
19. Wu, Z., Shu, Y. & Low, B. DAVINZ: Data Valuation using Deep Neural Networks at
Initialization. Proceedings Of The 39th International Conference On Machine Learning. pp.
24150-24176 (2022,6), https://proceedings.mlr.press/v162/wu22j.html, ISSN: 2640-3498
20. Altmann, A., Toloşi, L., Sander, O. & Lengauer, T. Permutation importance: a
corrected feature importance measure. Bioinformatics.26, 1340-1347 (2010,5),
https://doi.org/10.1093/bioinformatics/btq134
21. Shardlow, M. An Analysis of Feature Selection Techniques. (2011),
https://www.semanticscholar.org/paper/An-Analysis-of-Feature-Selection-Techniques-
Shardlow/8973a724545bbc2a5cc52bc28f7ffcb5d4aa8dc8
22. Strumbelj, E. & Kononenko, I. An Efficient Explanation of Individual Classifications using
Game Theory. The Journal Of Machine Learning Research.11 pp. 1-18 (2010,3)
23. Brennan, R., Attard, J., Petkov, P., Nagle, T. & Helfert, M. Exploring data value assessment:
a survey method and investigation of the perceived relative importance of data value
dimensions. (SciTePress,2019,5), https://cora.ucc.ie/handle/10468/8166, Accepted: 2019-07-
16T09:18:42Z
24. Brennan, R. & Attard, J. Management of Data Value Chains, a Value Monitoring
Capability Maturity Model. (2018), http://www.tara.tcd.ie/handle/2262/82277, Accepted:
2018-01-25T15:30:03Z Journal Abbreviation: 20th International Conference on Enterprise
Information Systems (ICEIS)
25. Hapke, H. & Nelson, C. Introduction. Building Machine Learning
Pipelines: Automating Model Life Cycles With TensorFlow. (2020,7),
https://www.oreilly.com/library/view/building-machine-learning/9781492053187/
26. Shapley, L. A Value for n-Person Games. Contributions To
The Theory Of Games (AM-28), Volume II. pp. 307-318 (1953,12),
https://www.degruyter.com/document/doi/10.1515/9781400881970-018/html
27. Shobeiri, S. & Aajami, M. Shapley value in convolutional neural networks (CNNs): A
Comparative Study. American Journal Of Science & Engineering.2, 9-14 (2021,12)
28. Brown, C. Coefficient of Variation. Applied Multivariate Statistics In Geohydrology And
Related Sciences. pp. 155-157 (1998), https://doi.org/10.1007/978-3-642-80328-4_13
29. Hapke, H. & Nelson, C. Introduction. Building Machine Learning Pipelines:
Automating Model Life Cycles with TensorFlow. O’Reilly Media, Inc. (2020, 7). ISBN:
9781492053194