
The Impact Of Data Valuation On Feature

Importance In Classiﬁcation Models

Malick Ebiele*, Malika Bendechache, Marie Ward, Una Geary, Declan Byrne,

Donnacha Creagh, and Rob Brennan

Abstract This paper investigates the impact of data valuation metrics (variability

and coefﬁcient of variation) on the feature importance in classiﬁcation models.

Data valuation is an emerging topic in the ﬁelds of data science, accounting, data

quality, and information economics concerned with methods to calculate the value

of data. Feature importance or ranking is important in explaining how black-box

machine learning models make predictions as well as selecting the most predictive

features while training these models. Existing feature importance algorithms are

either computationally expensive (e.g. SHAP values) or biased (e.g. Gini importance

in Tree-based models). No previous investigation of the impact of data valuation

metrics on feature importance has been conducted. Five popular machine learning

models (eXtreme Gradient Boosting (XGB), Random Forest (RF), Logistic Regression

(LR), Multi-Layer Perceptron (MLP), and Naive Bayes (NB)) have been used as

well as six widely implemented feature ranking techniques (Information Gain, Gini

importance, Frequency Importance, Cover Importance, Permutation Importance,

Malick Ebiele*, University College Dublin, Dublin, Ireland. E-mail: malick.ebiele@adaptcentre.ie

Malika Bendechache, University of Galway, Galway, Ireland. E-mail: malika.bendechache@universityofgalway.ie

Marie Ward, Saint James's Hospital, Dublin, Ireland. E-mail: MaWard@stjames.ie

Una Geary, Saint James's Hospital, Dublin, Ireland. E-mail: UGeary@stjames.ie

Declan Byrne, Saint James's Hospital, Dublin, Ireland. E-mail: DeByrne@stjames.ie

Donnacha Creagh, Saint James's Hospital, Dublin, Ireland. E-mail: DCreagh@stjames.ie

Rob Brennan, University College Dublin, Dublin, Ireland. E-mail: rob.brennan@ucd.ie


and SHAP values) to investigate the relationship between feature importance and

data valuation metrics for a clinical use case. XGB outperforms the other models

with a weighted F1-score of 79.72%. The ﬁndings suggest that features with

variability greater than 0.4 or a coefﬁcient of variation greater than 23.4 have little to

no value; therefore, these features can be ﬁltered out during feature selection. This

result, if generalisable, could simplify feature selection and data preparation.

Keywords: Data Value, Machine Learning, Feature Importance, Feature Selection,

Explainable AI.

1 Introduction

Data valuation is an emerging topic in the ﬁelds of data science, accounting, data

quality and information economics concerned with methods to calculate the value

of data. Fleckenstein et al. [1] identiﬁed three categories of approaches to data

valuation: market-based valuation, economic models, and dimension-based models;

for more details, please refer to [1].

Data is a core resource for developing machine learning (ML) models. Hence,

ML researchers are starting to investigate the use of data valuation techniques [5].

Scholars aim to identify either the most inﬂuential subset of data for ML model

training or the most inﬂuential features for feature selection and model explanation

[5–7, 13]. Our approach is focused on identifying the most inﬂuential features in

a dataset. This is important for feature selection, model explainability, and has

potential applications in data protection by reducing the amount of data that needs

to be shared. In the case of feature selection, exhaustive search is a very complex
and time-consuming process [21]; for a set of n features, there are 2^n − 1 possible
non-empty feature combinations. This search can only be completed for a very small
number of features [21]. Therefore, many heuristic techniques have been proposed. Most are

guided search algorithms, such as forward or backward search, which means only

adding a feature to the feature set if that feature satisﬁes the selection criteria (for

example either minimising the error or maximising the accuracy). Although these

approaches are effective and widely used in practice, they do not always produce

the optimal feature set and require training the models. In this paper, we introduce

a new, training-free method for feature selection based on data valuation.
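The combinatorial blow-up behind this argument is easy to see in a few lines of Python (the feature names here are hypothetical, purely for illustration):

```python
from itertools import combinations

n = 152  # number of features in the use case below
print(2**n - 1)  # number of non-empty feature subsets: astronomically large

# For a small n, exhaustive enumeration is still possible:
features = ["age", "ward", "gender"]   # hypothetical feature names
subsets = [c for r in range(1, len(features) + 1)
           for c in combinations(features, r)]
print(len(subsets))   # 7 == 2**3 - 1
```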

First, we must investigate how data valuation metrics are related to feature ranking

measures. To do that we chose the variability and coefficient of variation (CoV)
as data content valuation metrics, as well as five classification models: eXtreme

Gradient Boosting (XGB), Random Forest (RF), Logistic Regression (LR), Multi-

Layer Perceptron (MLP), and Naive Bayes (NB). On one hand, the reason for

choosing the variability is to investigate its impact on the models. As shown by

[6, 8, 9], Random Forest (RF) is biased toward features with high variability. For

the coefﬁcient of variation, the higher the spread of data points is, the more difﬁcult


it would be to cluster the data points. Therefore, we suspect that the spread of a

feature may be related to its ranking. On the other hand, XGB and RF have been

chosen because they are robust tree-based models, MLP because it is a (simple)

neural network, LR because it is the simplest classiﬁer, and NB because it is a

probabilistic model. Finally, the six feature ranking techniques (Information Gain,

Gini importance, Frequency, Cover, Permutation Importance, and SHAP values)

have been used because they are the most popular feature ranking techniques.

Below are our research questions:

• To what extent can variability and coefﬁcient of variation predict the value of a

feature?

• To what extent are feature importance and data valuation metrics related?

• To what extent can variability and coefﬁcient of variation be used for pretraining

feature selection?

To answer these research questions, we took a large, complex dataset for a real-world
clinical classification problem, trained five ML models (eXtreme Gradient
Boosting (XGB), Random Forest (RF), Logistic Regression (LR), Multi-Layer
Perceptron (MLP), and Naive Bayes (NB)), calculated their feature importance
using six importance measures (SHAP {SHapley Additive exPlanation} values,
Permutation Importance, Information Gain, Gini Importance, Frequency, and Cover),
computed the variability and the coefficient of variation of each feature, and finally
compared each feature's importance with its value. This paper makes the following contributions:

• The introduction of variability (the relative number of unique values, as
opposed to the absolute number of unique values) and the
coefficient of variation as data value measures

• The introduction of a new training-free feature selection approach

• A comprehensive study of the relationships between feature importance and data

valuation metrics for our clinical use case.

• Introduction of an extended ML process for investigating the relationship between

Data Valuation Metrics and Feature Importance.

The remainder of this paper is organised as follows: the next section defines the
background, section 3 describes the related work, section 4 describes our
use case, section 5 proposes a new data-value-informed ML process, the experiments
are described in section 6, which is followed by the experimental results section, then
the discussion section, and finally the conclusion.

2 Background

This section deﬁnes the key terms as well as the motivation of the choices made.


2.1 Deﬁnitions

Data valuation is a set of techniques to assign value (e.g. economic or social value)

to datasets or data assets [27]. It aims to identify which data is important for a given

context with respect to predefined criteria. For example, in training machine learning
(ML) models, an important datum or feature is one that improves the
model's performance by reducing the error or improving its accuracy. In business,
the most important data could be the data most used on a daily basis for business
operations, the data that most business units depend upon, or the data that most
increases business revenue. This demonstrates that the same data can have multiple,
variable values in different contexts.

Several techniques have been used to determine the value of a specific datum or
dataset, from qualitative metrics to quantitative and even ML-based ones [1, 4, 19]. There
is a wide range of dimensions of data value and metrics in the literature [24]. Some
dimensions are usage, utility, cost, uniqueness, recency, and recurrency. In this

paper, we have deﬁned two new metrics: Variability and Coefﬁcient of variation.

Variability is defined as the relative number of unique values. For example, given
two datasets D1 and D2 with 100 and 1000 entries, respectively, if F1 from D1 and
F2 from D2 are two features with the same number of unique values, equal to 20,
then they will have variability 0.2 and 0.02, respectively. These two features have the same
absolute number of unique values but drastically different relative numbers of unique
values (variability). Variability allows the comparison of the number of unique
values of features from different datasets. The coefficient of variation [28] is
defined as the relative spread of the data points; it is a statistical measure applied here to data
valuation. Below are the formulas used to calculate the variability and the coefficient
of variation:

• The variability of the feature Fj:

  RU(Fj) = |Fj| / N,    (1)

where |Fj| is the number of unique values of Fj, N is the number of entries of the dataset containing the feature Fj, and RU stands for Relative Uniqueness (uniqueness meaning the number of unique values).

• The coefficient of variation (CoV) of the feature Fj:

  CoV(Fj) = (1/µ) · sqrt( Σi (Fij − µ)² / dof ),    (2)

where µ is the mean of Fj, dof (Degrees Of Freedom) is equal to N − 1 (for a sample)
or N (for the population), and Fij is the ith entry of the jth feature.
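Both metrics are straightforward to compute per feature; a minimal Python sketch using pandas (the function names are ours, not from the paper; the sample CoV uses dof = N − 1):

```python
import numpy as np
import pandas as pd

def variability(col: pd.Series) -> float:
    """Relative Uniqueness RU: number of unique values divided by N (Eq. 1)."""
    return col.nunique() / len(col)

def coefficient_of_variation(col: pd.Series, ddof: int = 1) -> float:
    """Relative spread: standard deviation divided by the mean (Eq. 2)."""
    return col.std(ddof=ddof) / col.mean()

# Toy feature: 100 entries, 20 unique values -> variability 0.2
f1 = pd.Series(np.tile(np.arange(20), 5))
print(variability(f1))                        # 0.2
print(round(coefficient_of_variation(f1), 3))
```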


2.2 Motivations

The models can be evaluated in terms of predictive power using the average
Precision, average Recall, and average F1-score metrics. This evaluation is necessary
to make sure that each model represents the feature space in the best possible
way, because the better a model performs, the better it captures the data and the
better the features are used. Three averaging techniques will be used (definitions from the
scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html):
micro average (metrics are calculated globally by counting the total true positives, false negatives, and
false positives), macro average (metrics are calculated for each label and their
unweighted mean is taken; this does not take label imbalance into account), and weighted
average (metrics are calculated for each label and averaged weighted by
the number of true instances for each label; this alters the macro average to account
for label imbalance). The weighted average F1-score will be used as the
main performance metric because the F1-score is the trade-off between precision and
recall and, as mentioned above, the weighted average takes the imbalance
of the data into account.

In the process described below, the variability and coefficient of variation
should be computed on the training data: computing these metrics before
splitting the data would mislead the comparison, since the data valuation
metrics and the feature importance would then be computed on different sets. These
metrics also need to be calculated on the original dataset, before applying any scaling
technique. The main reason is that scaling preserves the distribution but
alters statistics such as the standard deviation and the mean (which are used to
compute the coefficient of variation) and sometimes even the variability (because
some scaling techniques are performed row-wise rather than feature-wise).
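The three averaging strategies can be illustrated with scikit-learn's f1_score; the toy labels below (not the hospital data) have a dominant class 0, so the weighted average sits well above the macro average:

```python
from sklearn.metrics import f1_score

# Imbalanced toy multi-class problem: class 0 dominates
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2]

for avg in ("micro", "macro", "weighted"):
    print(avg, round(f1_score(y_true, y_pred, average=avg), 3))
```

Here the micro, macro, and weighted F1 come out to roughly 0.78, 0.69, and 0.79 respectively: the macro average is dragged down by the rare classes, while the weighted average tracks the dominant class.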

We selected the following feature ranking techniques: Gini importance (based on the
Gini index, which is used to compute the mean impurity decrease in Random Forests);
Information Gain (the average or total gain across all splits in which the feature is used);
Frequency Importance (the number of times a feature is used to split the data across
all trees); Cover Importance (the average or total relative number of training instances
seen across all splits in which the feature is used); Permutation Importance (which
measures the impact of a given feature by shuffling that feature and then measuring
the performance of the model: the change in performance, positive or negative, is the
importance of the feature, so if performance improves the feature is ranked low, and if
performance worsens the feature is ranked high, the rank being the magnitude of the
change); and SHAP values importance (which works much like Permutation Importance
but, instead of shuffling, uses cooperative game theory to measure the change in
performance). The tree-based definitions follow the XGBoost documentation:
https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.Booster.get_score.
The first four ranking techniques are computed while training the models and are
therefore model-specific (mostly to tree-based models), while the latter two will only be estimated after the models have


been trained. Another reason for selecting these six techniques is to check whether these different feature ranking algorithms
agree on the ranking of the features.
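As a sketch of one of these techniques, permutation importance can be computed with scikit-learn (synthetic data and a Random Forest stand in for the paper's models and clinical dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn and record the drop in weighted F1;
# a large drop means the feature is ranked high
result = permutation_importance(model, X_te, y_te, n_repeats=10,
                                scoring="f1_weighted", random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: {imp:+.3f}")
```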

3 Related Work

Feature importance has been widely studied for the purpose of explaining black-

box machine and deep learning algorithms. These studies aim to detect the features

or variables that contribute most to predictive model outputs. Signiﬁcant feature

ranking algorithms include: SHAP (SHapley Additive exPlanation) values, Permutation

Importance, Information Gain (Entropy, Log-loss), Gini Importance, Frequency

Importance, and Cover Importance. The ﬁrst two are model-independent and the

latter four are tree-based techniques. SHAP values are based on cooperative game

theory and were ﬁrst introduced by Kuhn and Tucker [26]. They are widely used

in machine learning for their robustness and effectiveness in capturing outliers

and corruptions [5]. For that last reason, SHAP values have received increasing

attention and focus in the data valuation community in recent years, especially to

detect the high-value datum in training predictive models [5]. Previous studies have

been focused either on detecting which features are the most inﬂuential for machine

learning models' output [12, 16–18, 20–22] or choosing the best datum or subset of
data to improve these models' performance during the training process [5, 13].

There have also been some studies of the bias or shortcomings of feature selection
and model explanation techniques, with some attempts to solve them [6, 8, 9]. In

this paper, we investigate the relationship between the feature importance and its

value measured by its variability and coefﬁcient of variation. If the inﬂuence of

the feature value on its ranking can be accurately estimated, then training-free or
pre-training feature selection can be performed, since all the existing approaches

perform feature selection during or after the training of the models. To the best

of our knowledge, no such work has been done before. Strobl et al. [8] showed

that tree-based Gini importance and permutation importance in Random Forests are

not reliable when the features have different scales and numbers of unique values

(variability). Strobl [8] and Loecher [6] proposed different approaches to solve this

problem. In [6], Loecher showed that even SHAP values have shortcomings and

strongly depend on the number of unique values or variability of a feature.

Our work differs from all these previous contributions as (i) we are not trying to

identify the shortcomings of feature importance measures or attempting to solve

them but rather to investigate how they are related to variability and coefﬁcient

of variation, (ii) they only focus on Random forest while we used different types

of algorithms (Probabilistic {Naive Bayes}, Tree-based {Random Forest, eXtreme

Gradient Boosting}, Linear modelling {Logistic Regression}, and Neural networks

{Multi-Layer Perceptron}), and (iii) the data is scaled before training the models

(therefore we believe that the problem related to the scale difference is solved).


4 Use Case

To test our proposed approach on the ﬁve selected ML algorithms, we consider a

real dataset and classiﬁcation problem from a major hospital in Ireland.

The data used in this paper is extracted from several sources in the hospital data

management system. It consists of daily data on patient journeys within the hospital

(e.g. beds and wards used, admission and discharge dates) as well as other data

(e.g. area of residence, gender, discharge code). In total there are 152 features and

938,296 patient records. The classiﬁcation aim is to predict patients’ outcomes at

any given time throughout their journey within the hospital. Given the complexity

of the data set, exploring the whole feature space with traditional ML techniques is

extremely time and resource-intensive. This, along with the variety of feature types

and time series behaviours included, makes it a good candidate to investigate the

research questions deﬁned in the Introduction.

4.1 Ethics and Data Sharing

This research was covered as part of DCU research ethics committee application

DCUREC/2021/118 and the data was transferred under a data processing agreement

between the hospital and universities. The dataset was anonymised before it was

shared and so contains no personal data. The research described here was conducted

as part of a broader programme investigating patient infection risk and improving

hospital safety systems. This large clinical classiﬁcation problem highlighted issues

in model training time and traditional feature selection techniques that led to this

work to optimise the process and more rapidly protect patients.

5 Extended ML Process for Investigating the Impact of Data

Valuation Metrics on Feature Importance

Fig. 1 displays the extended ML process used to investigate the impact of data value

metrics on feature importance; it extends the process presented by [29].

The three added steps are described below. Please, refer to [29] for the description

of the standard steps.

Data valuation metrics: Calculation of the data valuation metrics on each feature.

In this use case, the variability and coefﬁcient of variation are computed on the

training data using the formulas deﬁned in the Background section above.

Feature Importance extraction or calculation: The main objective here is to

investigate how much each feature inﬂuences the prediction of the best model (the


Fig. 1: Extended ML Process for Investigating the Impact of Data Valuation Metrics

on Feature Importance

selected model after the model analysis); either by extracting the ranking of the

feature from the models (tree-based models) or calculating the ranking of features

(Permutation importance, SHAP values).
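For the tree-based extraction path, a minimal scikit-learn sketch (a Random Forest's built-in Gini importance on synthetic data; XGBoost's get_score and the SHAP/permutation rankings follow the same pattern but are calculated after training):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

# Tree-based models expose their (Gini) importances directly after training
ranking = sorted(enumerate(rf.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for idx, imp in ranking:
    print(f"feature {idx}: {imp:.3f}")
```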

Investigation of the relationship between feature importance and data valuation

metrics: This step is about visualizing, calculating, and exploring every possible

relationship between the feature importance and the data valuation metrics. Visualization

should always be the first task undertaken in this step because it may guide
the remaining exploration and also lead to insights that are straightforward to obtain
visually but difficult to obtain mathematically. Next, we can restart the model life-cycle from
the training and tuning step to incorporate the insights gained from the investigation of
the relationship between feature importance and data valuation metrics.

6 Experiments

This section discusses the preprocessing steps, the training dataset details, and the

model training setups. After all these steps, the feature importance is calculated and

plotted against the variability and coefﬁcient of variation.


6.1 Dataset

The model tuning and the SHAP values computation are very time-consuming.

Therefore, we decided to perform a stratiﬁed random sampling of the dataset.

Stratiﬁed random sampling is chosen over simple random sampling because we

wanted the training data to reﬂect the distribution of the number of entries per class

of the original dataset. Table 1 below shows the number of entries per class
before and after the stratified random sampling.

Table 1: The number of entries per class before and after a stratiﬁed random

sampling

Class ID | Entries before sampling (proportion in %) | Entries after sampling (proportion in %)
0 | 75,051 (81.72%) | 3,753 (81.72%)
1 | 5,596 (6.09%) | 280 (6.09%)
2 | 3,484 (3.79%) | 174 (3.78%)
3 | 2,661 (2.89%) | 133 (2.89%)
4 | 2,139 (2.32%) | 107 (2.33%)
5 | 2,011 (2.18%) | 101 (2.19%)
6 | 531 (0.57%) | 26 (0.56%)
7 | 363 (0.39%) | 18 (0.39%)
Total | 91,836 | 4,592
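Stratified sampling of this kind can be sketched with scikit-learn's train_test_split (the class counts below are illustrative, not the hospital's):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced labels: 80% / 15% / 5%
y = np.array([0] * 800 + [1] * 150 + [2] * 50)
X = np.arange(len(y)).reshape(-1, 1)

# Keep 5% of the data while preserving the class proportions
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.05, stratify=y, random_state=0)
print(np.bincount(y_small) / len(y_small))   # approximately [0.8, 0.15, 0.05]
```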

6.2 Data Preprocessing

The original dataset contains 938,296 rows and 152 columns of inpatient discharges
(from 1st Jan 2018 until 28th Feb 2022) from a large acute hospital. The first

processing task was to drop the empty columns (columns where all the values

are missing), single-value columns, and duplicated rows from the data. Therefore

21 empty columns, 1 single-value column, and 57 duplicated entries have been

dropped. The resulting dataset contains 938,239 rows and 130 columns; 117 contain

categorical data, 7 of them contain numerical data and 6 date-time data. The second

step was grouping the data per patient id, episode id, and discharge description to

get a unique discharge description per entry. If multiple values are encountered,

for some columns, a set is returned. For example, if a patient used two or more

beds or wards during their episode a set of the beds or wards used is returned.

Each of these sets will then be encoded afterwards using the ordinal encoder.

The resulting dataset has 91,836 entries and 130 columns. Two new columns


(episode length in days = discharge date - admission date, screen length in days =

Last Date Of CPE Screen - First Date Of CPE Screen) were calculated and added
to the dataset. In the third step, for 5 of the datetime columns, the following time features

are added to the dataset: year, quarter, month, week, day, hour, minute, dayofweek,

isweekend, and isholiday. The remaining datetime column is used to compute the

number of overnight stays of a patient during his or her stay. The datetime columns

are then deleted afterwards. For each of the datetime features added, -1 is imputed
when the original date is missing, except for the year feature, which is imputed with the
minimum year minus one. After this step, the number of columns increased from

130 to 175 (175 = 130 + 10×5 − 5). In the fourth and last step, the 117 categorical columns
were encoded using the ordinal encoder and imputed with zero for all missing
values. The ordinal encoder was chosen over the one-hot encoder because of the
high variability of some of the columns (Fig. shows the number of unique values
per column). One-hot encoding would have increased the data dimension drastically,
which may have invoked the "curse of dimensionality", especially for tree-based

models. The sets returned by the grouping are encoded randomly (without taking

into account their size).
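The datetime expansion and ordinal encoding can be sketched as follows (the column names are invented, not the hospital schema; for brevity, -1 is imputed for every missing date part, whereas the paper imputes the year with the minimum year minus one):

```python
import pandas as pd

df = pd.DataFrame({
    "ward": ["A", "B", None, "A"],
    "admission": pd.to_datetime(
        ["2018-01-05", "2018-03-17", None, "2019-07-01"]),
})

# Expand the datetime column into calendar features; -1 marks a missing date
for part in ("year", "quarter", "month", "dayofweek"):
    df[f"admission_{part}"] = (
        getattr(df["admission"].dt, part).fillna(-1).astype(int))
df["admission_isweekend"] = df["admission_dayofweek"].isin([5, 6])
df = df.drop(columns="admission")

# Ordinal-encode the categorical column, reserving 0 for missing values
codes, _ = pd.factorize(df["ward"])   # factorize maps None/NaN to -1
df["ward"] = codes + 1                # so missing -> 0, categories -> 1..k
print(df)
```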

6.3 Model Training

The models were trained using 5-fold Randomized Grid-Search Cross Validation

(RGS-CV) with weighted F1-score scoring on an Ubuntu 18.04 LTS machine with

16GB RAM and 2 cores. The RGS-CV is run 5 times; once for each scaling

technique: SimpleNormalizer, Normalizer, RobustScaler, MinMaxScaler, and StandardScaler.

The experiments were run using the Scikit-learn package; all the scalers have been

used with the default parameters. SimpleNormalizer is a self-implementation and

divides each feature in the dataset by its maximum value. The data transformation

and parameters with the highest weighted average F1-score are returned at the end

of these runs. The model is retrained using the returned information and then tested

on the test data to report the model’s performance.
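The search loop can be sketched with scikit-learn (synthetic data and a single Random Forest stand in for the five models; SimpleNormalizer is omitted since it is a self-implementation, and the parameter grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (MinMaxScaler, Normalizer, RobustScaler,
                                   StandardScaler)

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

best_score, best_setup = -1.0, None
for scaler in (Normalizer(), RobustScaler(), MinMaxScaler(), StandardScaler()):
    pipe = Pipeline([("scale", scaler),
                     ("clf", RandomForestClassifier(random_state=0))])
    search = RandomizedSearchCV(
        pipe,
        {"clf__n_estimators": [25, 50, 100], "clf__max_depth": [3, 5, None]},
        n_iter=3, cv=3, scoring="f1_weighted", random_state=0)
    search.fit(X, y)
    # Keep the scaler/parameter combination with the best weighted F1
    if search.best_score_ > best_score:
        best_score = search.best_score_
        best_setup = (type(scaler).__name__, search.best_params_)
print(best_setup, round(best_score, 3))
```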

6.4 Model Performance Report

Once the training of each model is completed, the parameters as well as the scaling

technique of the best model in terms of F1-score are returned. Then, the model is

retrained using those parameters and the test data is scaled using the best scaler.

Table 2 below shows the model performance on the test data. It can be seen that

XGB outperforms the other models on 8 out of 9 performance metrics and, most
importantly, on the main metric, the weighted average F1-score.


Table 2: Model performance report in percentage (%), with training and SHAP values computation times (days-hh:mm:ss).

Model | Precision (Micro/Macro/Weighted) | Recall (Micro/Macro/Weighted) | F1-score (Micro/Macro/Weighted) | Training time | SHAP values time
XGB | 84.23 / 47.10 / 80.26 | 84.23 / 24.27 / 84.23 | 84.23 / 27.74 / 79.72 | 0-07:37:00 | 1-07:00:31
RF | 77.70 / 29.51 / 77.99 | 77.70 / 29.03 / 77.70 | 77.70 / 27.29 / 77.35 | 0-01:52:01 | 0-07:36:21
MLP | 81.62 / 29.05 / 74.89 | 81.62 / 19.37 / 81.62 | 81.62 / 20.39 / 76.52 | 0-00:40:41 | 0-06:39:14
NB | 81.27 / 21.58 / 72.11 | 81.27 / 13.30 / 81.27 | 81.27 / 12.95 / 74.26 | 0-00:00:15 | 0-06:58:52
LR | 82.14 / 22.30 / 71.57 | 82.14 / 16.72 / 82.14 | 82.14 / 17.67 / 75.78 | 0-00:33:37 | 0-06:17:10

For the precision, recall, and F1 metrics, higher is better; for the times, lower is better.

7 Experimental Results

Fig. 2 below displays the relationship between the variability (on the x-axis) and the

feature ranking techniques (on the y-axis).

On the top (ﬁrst row) is plotted the variability against the Permutation Importance

and SHAP values. It can be seen that the most inﬂuential features have a variability

close to zero and as the variability increases the features’ ranking decreases until 0.4.

Above 0.4, the features have very little inﬂuence (zero or close to zero Permutation

Importance and SHAP values) for most models.

On the middle row is displayed the variability against Gain Importance and Cover

Importance. It can be seen that the most influential features are located to the left,
and as the variability increases, the features' ranking decreases until 0.2, where the

features’ ranking becomes constant around 1.75 for Gain Importance and 80 for

Cover Importance.

On the bottom row is plotted variability against the Frequency Importance and Gini

Importance. The most influential features have a variability close to zero, and as the
variability increases, some features' rankings increase logarithmically for Frequency
Importance and linearly for Gini Importance. The other features' rankings are located
between the logarithmic or linear line and the vertical line at variability equal to
zero.

We conclude that there is no linear relationship between the variability and the
feature ranking. However, it can be observed that features with variability greater
than or equal to 0.4 have very little to no impact or importance across 4 out of the 6

ranking techniques. The exception for Frequency Importance makes sense because

the higher the variability the more difﬁcult it is to group the entries per value. As

for Gini Importance, it might be due to the shortcomings of Random Forest towards
features with high variability [6, 8].

Fig. 3 below shows the relationship between the coefﬁcient of variation (on the x-

axis) and the feature ranking techniques (on the y-axis).


Fig. 2: Variability versus feature importance

On the first row is plotted the coefficient of variation against the Permutation Importance

and SHAP values. It can be seen that the most inﬂuential features have a coefﬁcient

of variation close to zero and as the coefﬁcient of variation increases the Permutation


Importance and SHAP values decrease until 11.85. Between 11.85 and 31.3,

Permutation Importance ﬂuctuates around the line y=0 within a small range. Above

31.3, the features have zero Permutation Importance for all models. As to SHAP

values, above 11.85 the features have little inﬂuence and their SHAP values ﬂuctuate

around the line y=0 within a range smaller than the ones with a CoV less than 11.85.

On the middle row is displayed the coefﬁcient of variation against Gain Importance

and Cover Importance. It can be seen that all influential features are located to the
left of the vertical line at 23.4. Above 23.4, the features have essentially zero influence.

On the bottom row is plotted the coefficient of variation against the Frequency Importance
and Gini Importance. The most influential features have a coefficient of variation
less than 11.85. Above 11.85, the features have no influence.

We conclude that there is no linear relationship between the coefficient of variation
and the feature importance ranking. But, as depicted in the graphs, features
with a coefficient of variation greater than or equal to 23.4 (on average) have little to
no impact or importance. Also, features with a negative CoV have small SHAP values
and zero importance for all other feature ranking techniques.

We found that, overall, as the variability and the coefficient of variation of a feature
increase, that feature's predictive power decreases. The decay of the predictive power
of a feature is noticeable for a variability of 0.4 and above and a coefficient of
variation of 23.4 (on average) and above.
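If this pattern generalises, it suggests a simple training-free pre-filter; a sketch under the thresholds found in this study (the function name and toy data are ours, and the thresholds are use-case specific):

```python
import numpy as np
import pandas as pd

def prefilter_features(df: pd.DataFrame,
                       var_max: float = 0.4, cov_max: float = 23.4) -> list:
    """Keep only features below the variability and CoV thresholds."""
    keep = []
    for col in df.columns:
        var = df[col].nunique() / len(df)          # variability (Eq. 1)
        mu = df[col].mean()
        cov = df[col].std() / mu if mu != 0 else np.inf   # CoV (Eq. 2)
        if var < var_max and cov < cov_max:
            keep.append(col)
    return keep

df = pd.DataFrame({
    "patient_id": np.arange(1000),               # variability 1.0 -> dropped
    "ward_code": np.tile([1, 2, 3, 4, 5], 200),  # low variability -> kept
})
print(prefilter_features(df))   # ['ward_code']
```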

These experiments also show the similarity of the behaviour of Permutation Importance

and SHAP values, Gain Importance and Cover Importance, and Frequency Importance

and Gini Importance relative to the features’ variability and coefﬁcient of variation.

8 Discussion

The results presented in this paper suggest that features with a variability greater

than or equal to 0.4 have very little to no importance for this use case; contrary to

the results presented by [6, 8, 9]. Here, the features with a large number of unique

values have been ranked low. This might be the result of intensive work to reduce

model bias towards features with high variability since these shortcomings have

been exposed by [8]. However, RF, XGB and MLP still assigned a fairly high
permutation importance or SHAP values importance to such features.

The comparison between the feature importance and the coefﬁcient of variation

(CoV) showed that features with a coefﬁcient of variation greater than or equal

to 23.4 have zero Gain, Frequency, Cover, and Gini importance. The same features

(The CoV thresholds are defined relative to the observed CoV range: 11.85 = min CoV + 0.23 × (max CoV − min CoV); 31.3 = min CoV + 0.55 × (max CoV − min CoV); 23.4 = min CoV + 0.42 × (max CoV − min CoV).)


Fig. 3: Coefﬁcient of variation versus feature importance

have zero and little permutation importance for a coefﬁcient of variation greater than

31.3 and between 23.4 and 31.3, respectively. In terms of SHAP values importance,


such features are assigned little importance by most models; only NB consistently
assigned zero importance to those with a coefficient of variation greater than 31.3.

In general, features with variability greater than or equal to 0.4 or a coefﬁcient of

variation greater than or equal to 23.4 (on average) have a lower (average) deviation

from the line y=0 (zero importance line). The coefﬁcient of variation seems to be a

better predictor of the importance of the features than the variability. The problem

with the coefﬁcient of variation is that it is a real number and does not have a ﬁxed

range. Moreover, it can be misleading as a measure of the spread of data points when the mean is close to zero. For example, consider three normally distributed random variables V1 ∼ N(1, 1), V2 ∼ N(0.001, 1), and V3 ∼ N(0, 1). They have the same standard deviation but dramatically different coefficients of variation: 1, 1000, and +∞ (infinity), respectively. A more robust relative spread measure is therefore needed.
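The instability of the coefficient of variation near a zero mean can be seen directly. The following minimal Python sketch (not part of the study's code; values mirror the example above) computes the CoV from the population parameters of V1, V2, and V3:

```python
# Coefficient of variation: CoV = standard deviation / mean.
# All three variables share the same spread (sigma = 1), yet the CoV
# explodes as the mean approaches zero.

def coefficient_of_variation(mean: float, std: float) -> float:
    """Return std / mean, or infinity when the mean is exactly zero."""
    if mean == 0:
        return float("inf")
    return std / mean

# V1 ~ N(1, 1), V2 ~ N(0.001, 1), V3 ~ N(0, 1)
for name, mu in [("V1", 1.0), ("V2", 0.001), ("V3", 0.0)]:
    print(name, coefficient_of_variation(mu, 1.0))
```

Identical standard deviations thus yield CoV values of 1, roughly 1000, and infinity, which is why the CoV alone cannot be compared safely across features whose means differ greatly in magnitude.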

In spite of all the shortcomings listed above, the coefﬁcient of variation is found to

be a better predictor of the (analytical) value of a feature. In this use case, all the

top-ranked features have a coefﬁcient of variation within

[min CoV,min CoV +0.42 ×(max CoV - min CoV)].

This interval can be shrunk to [min CoV,min CoV +0.23 ×(max CoV - min CoV)]

for Frequency and Gini Importance.
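As a sketch of the interval test above, the following Python snippet selects the features whose CoV falls inside [min CoV, min CoV + 0.42 × (max CoV − min CoV)]. The feature names and CoV values here are invented for illustration, not taken from the study's data:

```python
# Hypothetical sketch of the CoV interval test; feature names and CoV
# values below are invented for illustration.

def cov_interval(covs, frac=0.42):
    """Interval [min CoV, min CoV + frac * (max CoV - min CoV)]."""
    lo, hi = min(covs), max(covs)
    return lo, lo + frac * (hi - lo)

feature_cov = {"age": 5.0, "heart_rate": 12.0, "lab_x": 80.0, "lab_y": 150.0}
low, high = cov_interval(feature_cov.values())  # frac = 0.42 overall
top_candidates = sorted(f for f, c in feature_cov.items() if low <= c <= high)
print(top_candidates)  # features whose CoV lies inside the interval
```

With the shrunk interval for Frequency and Gini Importance, the same function would simply be called with `frac=0.23`.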

In general, features with lower variability (less than 0.4) and lower coefficient of variation (within 42% of the CoV range above its minimum) have more impact on feature importance than those with higher values.

9 Conclusion

In this paper, we investigated the relationship between feature importance and two newly introduced data valuation metrics for the content value dimension: variability and coefficient of variation. We found that there is no linear relationship

between these two data valuation metrics and feature importance. However, we

found that the most inﬂuential features have a variability less than 0.4 and a

coefﬁcient of variation less than 23.4 (on average). This suggests, in our use case,

that features with variability greater than 0.4 or a coefficient of variation greater than 23.4 (on average) are not relevant. This does not predict the actual value of a feature but instead helps to classify the features into non-influential and possibly impactful feature sets. This finding, if generalised, will allow us to perform training-free feature selection. It also has potential applications in data protection or privacy

where non-inﬂuential features could be identiﬁed prior to sharing and thus never

shared, thereby increasing safety.
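A minimal sketch of such a training-free screening step, assuming per-feature variability and CoV have already been computed. The feature names and statistics below are invented for illustration; the thresholds (variability 0.4, CoV at 42% of the observed CoV range above its minimum) are the ones reported for this use case:

```python
# Hypothetical training-free feature screen based on the two data valuation
# metrics; the per-feature statistics below are invented for illustration.

def screen_features(stats, var_threshold=0.4, cov_fraction=0.42):
    """Split features into (possibly impactful, non-influential) name lists."""
    covs = [cov for _, cov in stats.values()]
    cov_threshold = min(covs) + cov_fraction * (max(covs) - min(covs))
    keep, drop = [], []
    for name, (variability, cov) in stats.items():
        if variability < var_threshold and cov < cov_threshold:
            keep.append(name)
        else:
            drop.append(name)
    return keep, drop

# (variability, coefficient of variation) per feature -- invented values
stats = {"age": (0.10, 5.0), "diagnosis": (0.35, 20.0), "free_text_id": (0.90, 300.0)}
keep, drop = screen_features(stats)
print(keep, drop)
```

Because the screen uses only descriptive statistics of each feature, it can run before any model is trained, and the `drop` set could be withheld entirely in a privacy-sensitive sharing scenario.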


In future work, we first aim to generalise the findings of this paper by applying the proposed process to different datasets (preferably from different domains) to check the reproducibility of the insights presented here, and second, to design a more robust relative spread measure that addresses the shortcomings of the coefficient of variation mentioned in the Discussion Section. We also aim to systematise the setting of the variability and coefficient of variation thresholds, as they were chosen manually in this use case.

Acknowledgements This research has received funding from the ADAPT Centre for Digital

Content Technology, funded under the SFI Research Centres Programme (Grant 13/RC/2106 P2),

co-funded by the European Regional Development Fund. For the purpose of Open Access, the

author has applied a CC BY public copyright licence to any Author Accepted Manuscript version

arising from this submission.

References

1. Fleckenstein, M., Obaidi, A. & Tryfona, N. A Review of Data Valuation Approaches and

Building and Scoring a Data Valuation Model. Harvard Data Science Review.5(2023,1),

https://hdsr.mitpress.mit.edu/pub/1qxkrnig/release/1

2. Noshad, M., Choi, J., Sun, Y., Hero, A. & Dinov, I. A data value metric for

quantifying information content and utility. Journal Of Big Data.8, 82 (2021,6),

https://doi.org/10.1186/s40537-021-00446-6

3. Tang, S., Ghorbani, A., Yamashita, R., Rehman, S., Dunnmon, J., Zou, J. & Rubin, D.

Data Valuation for Medical Imaging Using Shapley Value: Application on A Large-scale

Chest X-ray Dataset. Scientiﬁc Reports.11, 8366 (2021,4), http://arxiv.org/abs/2010.08006,

arXiv:2010.08006 [cs, eess]

4. Yoon, J., Arik, S. & Pﬁster, T. Data Valuation using Reinforcement Learning. (arXiv,2019,9),

http://arxiv.org/abs/1909.11671, arXiv:1909.11671 [cs, stat]

5. Ghorbani, A. & Zou, J. Data Shapley: Equitable Valuation of Data for Machine Learning.

(arXiv,2019,6), http://arxiv.org/abs/1904.02868, arXiv:1904.02868 [cs, stat]

6. Loecher, M. Unbiased variable importance for random forests. Communications In

Statistics - Theory And Methods.51, 1413-1425 (2022,3), http://arxiv.org/abs/2003.02106,

arXiv:2003.02106 [cs, stat]

7. Lundberg, S. & Lee, S. A Uniﬁed Approach to Interpreting Model Predictions.

(arXiv,2017,11), http://arxiv.org/abs/1705.07874, arXiv:1705.07874 [cs, stat]

8. Strobl, C., Boulesteix, A., Zeileis, A. & Hothorn, T. Bias in random forest variable

importance measures: Illustrations, sources and a solution. BMC Bioinformatics.8, 25

(2007,1), https://doi.org/10.1186/1471-2105-8-25

9. Loecher, M. Debiasing MDI Feature Importance and SHAP Values in Tree Ensembles.

Machine Learning And Knowledge Extraction. pp. 114-129 (2022)

10. Baudeu, R., Wright, M. & Loecher, M. Are SHAP Values Biased Towards High-Entropy

Features?. Machine Learning And Principles And Practice Of Knowledge Discovery In

Databases. pp. 418-433 (2023)

11. Antwarg, L., Miller, R., Shapira, B. & Rokach, L. Explaining

anomalies detected by autoencoders using Shapley Additive Explanations.

Expert Systems With Applications.186 pp. 115736 (2021,12),

https://www.sciencedirect.com/science/article/pii/S0957417421011155

12. Maasland, T., Pereira, J., Bastos, D., Goffau, M., Nieuwdorp, M., Zwinderman, A. & Levin, E.

Interpretable Models via Pairwise Permutations Algorithm. Machine Learning And Principles

And Practice Of Knowledge Discovery In Databases. pp. 15-25 (2021)


13. Jia, R., Dao, D., Wang, B., Hubis, F., Hynes, N., Gürel, N., Li, B., Zhang, C., Song, D. &

Spanos, C. Towards Efﬁcient Data Valuation Based on the Shapley Value. Proceedings Of

The Twenty-Second International Conference On Artiﬁcial Intelligence And Statistics. pp.

1167-1176 (2019,4), https://proceedings.mlr.press/v89/jia19a.html, ISSN: 2640-3498

14. Kumar, S., Lakshminarayanan, A., Chang, K., Guretno, F., Mien, I., Kalpathy-Cramer,

J., Krishnaswamy, P. & Singh, P. Towards More Efﬁcient Data Valuation in Healthcare

Federated Learning Using Ensembling. Distributed, Collaborative, And Federated Learning,

And Affordable AI And Healthcare For Resource Diverse Global Health. pp. 119-129 (2022)

15. Gul, F. Bargaining Foundations of Shapley Value. Econometrica.57, 81-95 (1989),

https://www.jstor.org/stable/1912573, Publisher: [Wiley, Econometric Society]

16. Datta, A., Sen, S. & Zick, Y. Algorithmic Transparency via Quantitative Input Inﬂuence:

Theory and Experiments with Learning Systems. 2016 IEEE Symposium On Security And

Privacy (SP). pp. 598-617 (2016,5), ISSN: 2375-1207

17. Cohen, S., Dror, G. & Ruppin, E. Feature Selection via Coalitional Game Theory. Neural

Computation.19, 1939-1961 (2007,7), Conference Name: Neural Computation

18. Campbell, T., Roder, H., Georgantas III, R. & Roder, J. Exact Shapley values for local and

model-true explanations of decision tree ensembles. Machine Learning With Applications.9

pp. 100345 (2022,9), https://www.sciencedirect.com/science/article/pii/S2666827022000500

19. Wu, Z., Shu, Y. & Low, B. DAVINZ: Data Valuation using Deep Neural Networks at

Initialization. Proceedings Of The 39th International Conference On Machine Learning. pp.

24150-24176 (2022,6), https://proceedings.mlr.press/v162/wu22j.html, ISSN: 2640-3498

20. Altmann, A., Toloşi, L., Sander, O. & Lengauer, T. Permutation importance: a

corrected feature importance measure. Bioinformatics.26, 1340-1347 (2010,5),

https://doi.org/10.1093/bioinformatics/btq134

21. Shardlow, M. An Analysis of Feature Selection Techniques. (2011),

https://www.semanticscholar.org/paper/An-Analysis-of-Feature-Selection-Techniques-

Shardlow/8973a724545bbc2a5cc52bc28f7ffcb5d4aa8dc8

22. Strumbelj, E. & Kononenko, I. An Efﬁcient Explanation of Individual Classiﬁcations using

Game Theory. The Journal Of Machine Learning Research.11 pp. 1-18 (2010,3)

23. Brennan, R., Attard, J., Petkov, P., Nagle, T. & Helfert, M. Exploring data value assessment:

a survey method and investigation of the perceived relative importance of data value

dimensions. (SciTePress,2019,5), https://cora.ucc.ie/handle/10468/8166, Accepted: 2019-07-

16T09:18:42Z

24. Brennan, R. & Attard, J. Management of Data Value Chains, a Value Monitoring

Capability Maturity Model. (2018), http://www.tara.tcd.ie/handle/2262/82277, Accepted:

2018-01-25T15:30:03Z Journal Abbreviation: 20th International Conference on Enterprise

Information Systems (ICEIS)

25. Hapke, H. & Nelson, C. Introduction. Building Machine Learning

Pipelines: Automating Model Life Cycles With TensorFlow.. (2020,7),

https://www.oreilly.com/library/view/building-machine-learning/9781492053187/

26. Shapley, L. 17. A Value for n-Person Games. Contributions To

The Theory Of Games (AM-28), Volume II. pp. 307-318 (1953,12),

https://www.degruyter.com/document/doi/10.1515/9781400881970-018/html

27. Shobeiri, S. & Aajami, M. Shapley value in convolutional neural networks (CNNs): A

Comparative Study. American Journal Of Science & Engineering.2, 9-14 (2021,12)

28. Brown, C. Coefﬁcient of Variation. Applied Multivariate Statistics In Geohydrology And

Related Sciences. pp. 155-157 (1998), https://doi.org/10.1007/978-3-642-80328-4 13

29. Hapke, H. & Nelson, C. Introduction. Building Machine Learning Pipelines: Automating Model Life Cycles with TensorFlow. O'Reilly Media, Inc. (2020,7). ISBN: 9781492053194