Using Imputation Techniques to Help Learn Accurate Classifiers
ABSTRACT It is difficult to learn good classifiers when training data is missing attribute values. Conventional techniques for dealing with such omissions, such as mean imputation, generally do not significantly improve the performance of the resulting classifier. We propose imputation-helped classifiers, which use accurate imputation techniques, such as Bayesian multiple imputation (BMI), predictive mean matching (PMM), and Expectation Maximization (EM), as preprocessors for conventional machine learning algorithms. Our empirical results show that EM-helped and BMI-helped classifiers work effectively when the data is "missing completely at random", generally improving predictive performance over most of the original machine learned classifiers we investigated.
Xiaoyuan Su
Computer Science & Engineering
Florida Atlantic University
Boca Raton, FL 33431, USA
xsu@fau.edu
Taghi M. Khoshgoftaar
Computer Science & Engineering
Florida Atlantic University
Boca Raton, FL 33431, USA
taghi@cse.fau.edu
Russell Greiner
Computing Science
University of Alberta
Edmonton, AB, Canada
greiner@cs.ualberta.ca
1. Introduction
Missing data is unfortunately common in data analysis;
this typically leads to difficulties in estimation and
inference. To deal with missing data, many analytic tools
either ignore the missing values or fill in the missing
values with a value estimated by some process -- this
process is called imputation.
As the predictive performance of many learning
algorithms deteriorates on incomplete training data, some
try using imputation techniques to suggest values for
missing data before training a classifier. However, many
such machine learning systems use a simple imputer.
Some use mean imputation (MEI), which replaces the
missing value with the mean value of the attribute over all
instances or over all instances of the same class, or with
the most frequently observed value of the attribute (e.g.,
this is used by the WEKA [20] implementation of logistic
regression [4] and random forest [3]). Others fill in the
missing value with a global constant, such as the value
“unknown” (e.g., WEKA’s sequential minimal
optimization [14] and one rule classifier [7]). Yet others
simply ignore the missing values (e.g., WEKA’s naïve
Bayes [9] and decision table [11]). However, empirical
results show that these simple imputation methods
typically do not significantly improve the machine
learners’ performance.
We are therefore motivated to propose imputation-
helped classifiers, which use better imputation techniques
for filling in the missing values before training the
classifiers. To our knowledge, no one has empirically
compared various state-of-the-art imputation techniques,
such as Expectation Maximization (EM) [5] (Section 2.4),
predictive mean matching (PMM) [12] (Section 2.5), and
Bayesian multiple imputation (BMI) [16] (Section 2.6), to
see which one(s) best help standard machine learning
algorithms to deal with incomplete data. This work
explores this task.
We considered datasets from the UCI machine
learning repository [2] whose values were artificially
deleted using the MCAR (missing completely at random)
mechanism (defined below), using different pre-
determined missing ratios. After using imputation
techniques to impute missing values, we learn classifiers
on this complete data using the following 10 machine
learning algorithms: decision tree (C4.5), decision table
(dTable), k nearest neighbor (kNN), logistic regression
(LR), naïve Bayes (NB), neural networks (NN), one rule
(OneR), decision list (PART), support vector machine
(SVM), and random forest (RF). We then evaluate the
predictive performance improvement of these various
imputation-helped classifiers over the original classifiers.
Section 2 presents the framework of our work and
reviews relevant literature; then Section 3 provides
experimental design and results.
2. Framework
The imputation-helped classifiers work in the
following steps (see Figure 1): impute values for the
incomplete training data to generate imputed (complete)
training data; learn a classifier on the imputed dataset;
then have the classifier produce classification for the test
dataset.
Figure 1. Framework of imputation-helped classifiers
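The three steps of the framework can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the imputer is plain column-mean imputation and the learner is a nearest-centroid classifier, both chosen for brevity rather than taken from the paper, and the function names and toy data are hypothetical.

```python
from statistics import mean

def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def train_nearest_centroid(rows, labels):
    """Learn one centroid per class from complete (imputed) data."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids

def classify(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], row))

# Hypothetical toy data; None marks a missing attribute value.
train_x = [[1.0, 2.0], [None, 2.2], [8.0, None], [9.0, 9.5]]
train_y = ["a", "a", "b", "b"]

imputed_x = mean_impute(train_x)                    # step 1: impute
model = train_nearest_centroid(imputed_x, train_y)  # step 2: learn
prediction = classify(model, [8.5, 9.0])            # step 3: classify
```

Any imputer with the same input and output shape could be swapped in for `mean_impute` without touching the other two steps, which is the point of the framework.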
2.1 Patterns of Missing Data
Little and Rubin [13] consider three types of missing
data patterns. If the missingness does not depend on the
observed data, then it is regarded as Missing Completely
at Random (MCAR). This includes the situation where a
value is missing with an i.i.d. (independent and identically
distributed) probability (e.g., 0.2) that is
independent of the actual value of this feature, or of any
other feature (think of random corruption while the data is
being transmitted). Simple imputation techniques, such as
mean imputation and linear regression imputation, are
often used to deal with such MCAR missing data.
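As a rough sketch of what MCAR deletion looks like in practice, the snippet below blanks each cell independently with a fixed probability, regardless of the cell's value or of any other attribute. The function name `delete_mcar`, the seed, and the toy data are assumptions made for illustration.

```python
import random

def delete_mcar(rows, missing_prob, seed=0):
    """Blank each cell independently with probability missing_prob (MCAR)."""
    rng = random.Random(seed)
    return [[None if rng.random() < missing_prob else v for v in row]
            for row in rows]

# Hypothetical complete dataset: 1000 instances, 3 numeric attributes.
data = [[float(i), float(i * 2), float(i * 3)] for i in range(1000)]
masked = delete_mcar(data, missing_prob=0.2)

# Fraction of cells that ended up missing; it should be close to 0.2.
missing_ratio = sum(v is None for row in masked for v in row) / 3000
```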
This work will focus on MCAR missingness. To help
explain this, we contrast it with two other types of
missingness. Missing data that depends on the observed
data, but not on the unobserved data is considered
Missing at Random (MAR) [17]. For example, if a
person’s gender is recorded as male, then the attribute
“pregnancy” is typically left blank. Full maximum
likelihood and Bayesian imputation techniques can be
used to deal with missing data that are MAR as well as
MCAR [10].
Both MCAR and MAR missing data patterns are
considered ignorable patterns, as valid inferences and
estimations can be made even if the missing data pattern
is not explicitly considered in the analysis.
If the “missingness” is not MAR, then we say the data
are Not Missing at Random (NMAR, or non-ignorable).
Here, patterns of the unobserved variables can determine
their own missingness. For example, the “years of
education” field of a job application may be more likely
to be omitted if it is “< 1”. Incorrectly assuming
ignorability can result in biased estimates [17]. When the
missing data are non-ignorable, a model of missing value
mechanism must be incorporated into the estimation
process to produce meaningful estimates [19]. This is
generally specific to the problem itself and is difficult in
practice, especially as the missing data mechanism is
rarely known with certainty. The good news is that
multiple imputation [16], which works well for both
MAR and MCAR data, can often reduce the bias of the
estimates when data are NMAR.
2.2 How Machine Learners Handle Missing Data
When we use a machine learning algorithm, we
usually do not know the missing patterns of the data
beforehand. We therefore want a learner that works well
for any pattern of missingness. In this work, we consider
10 well-known machine learning algorithms from WEKA
[20], each of which has its own strategy to deal with
missing data. (Note that none of these approaches
attempts to determine the type of missingness.)
2.2.1 Ignore Missing Values. Decision table (dTable)
classifier includes two components: a schema that is a set
of features, and a body consisting of labeled instances
from the space defined by the features in the schema [11].
Given an unlabelled instance, the dTable classifier
searches for exact matches in the table only using the
features in the schema. If it finds no matching instances, it
then returns the majority class; otherwise, it returns the
majority class of all matching instances. DTable ignores
missing values during the learning and classification
processes.
Naïve Bayes (NB) is an extremely simple Bayesian
network that assumes attribute values are conditionally
independent given the class, and typically assumes that
numeric attributes obey a Gaussian distribution [9]. NB
learns by estimating the conditional distributions of each
attribute given the class; here it simply ignores attribute
values that are missing.
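The idea of ignoring missing values while estimating the per-class conditional distributions can be illustrated as follows. This is a simplified sketch with Gaussian attributes, not WEKA's code; the function name `fit_gaussian_nb` and the toy data are hypothetical.

```python
from statistics import mean, pvariance

def fit_gaussian_nb(rows, labels):
    """Per class and attribute, estimate (mean, variance) from observed values only."""
    params = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        stats = []
        for col in zip(*members):
            # Missing cells (None) are skipped; this raises an error if a
            # class has no observed value at all for an attribute.
            observed = [v for v in col if v is not None]
            stats.append((mean(observed), pvariance(observed)))
        params[label] = stats
    return params

rows = [[1.0, None], [3.0, 4.0], [None, 6.0], [10.0, 1.0]]
labels = ["a", "a", "a", "b"]
params = fit_gaussian_nb(rows, labels)
```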
Decision tree (C4.5) is a member of the ID3 family of
algorithms that grows decision trees from the root
downward, greedily selecting the next attribute for each
new decision branch added to the tree. In the learning
stage, the C4.5 algorithm simply ignores missing values in the gain
and entropy calculations, and in the classification stage, it
assigns a probability to each of the possible values of the
missing value based on the number of training instances
going down that branch, divided by the total number of
training instances with observed values at the node [15].
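The probability assignment used at classification time amounts to a simple ratio. A minimal sketch, assuming the per-branch training counts at a node are already available in a dictionary (the function name and the counts are hypothetical):

```python
def branch_probabilities(branch_counts):
    """Probability of each branch = its training count / total observed at the node."""
    total = sum(branch_counts.values())
    return {branch: count / total for branch, count in branch_counts.items()}

# Hypothetical node: 30 training instances went down "low", 10 down "high".
probs = branch_probabilities({"low": 30, "high": 10})
```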
Decision list (PART) is a decision list classifier based
on partial decision trees. It combines C4.5 and RIPPER to
avoid their respective problems to produce accurate rule
sets. PART deals with missing values using the same
strategy as C4.5 [6].
[Figure 1 labels: incomplete training data, an imputer, imputed data, a machine learner, classifier, a test instance, classification]
2.2.2 Use Mean or Median of Observed Values. The
logistic regression (LR) implementation in WEKA uses a
multinomial logistic regression model with a ridge
estimator, and uses a ReplaceMissingValuesFilter to
replace the missing values with the mean (for numeric
attributes) or the most frequent value (for nominal
attributes) [4].
Random Forest (RF) grows many classification trees.
To classify a new instance, it first asks each tree in the
forest for its classification, and then returns the
classification having the most votes. An RF learner
replaces the missing values with the median value for
numeric attributes or the most frequent value for nominal
attributes [3]. RF fills in missing data in the test set using
the filled-in values from the training set.
2.2.3 Missing Values Replaced by a Default Value.
One rule (OneR) is a rule-based classifier that infers one
rule that predicts the class, based on the most informative
single attribute. The attributes are assumed to be discrete,
otherwise they must be discretized. Missing values are
treated as the new value, “missing” [7].
A kernel-based Support vector machine (SVM)
produces nonlinear boundaries by constructing a linear
boundary in a large, transformed version of the feature
space. WEKA’s SVM uses the sequential minimal
optimization (SMO) algorithm [14] for training a support
vector classifier using polynomial (which we used) or
RBF kernels. This implementation globally replaces all
missing values by a default value, e.g., “unknown”.
2.2.4 Missing Values Used in Distance Measure. A k-
nearest-neighbor (kNN) classifier finds the k neighbors
that are closest to the unknown instance, and returns the
average value of the real-valued labels of the neighbors.
The closeness of the neighbors is defined in terms of
Euclidean distance for continuous attributes and
Hamming distance for discrete ones. KNN handles
missing values by means of a minor change in the
distance measure: when the two instances each miss the
values of the same attribute, the distance on that attribute
is zero, but when only one has a missing value, a maximal
distance is assigned [20].
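The modified distance can be sketched as below for continuous attributes. The sketch assumes every attribute has been scaled to [0, 1], so that the maximal per-attribute distance is 1; that scaling and the name `knn_distance` are assumptions for illustration, not details given above.

```python
import math

def knn_distance(a, b, max_diff=1.0):
    """Euclidean distance over attributes scaled to [0, 1], with missing-value rules."""
    total = 0.0
    for x, y in zip(a, b):
        if x is None and y is None:
            diff = 0.0        # both values missing: zero distance on this attribute
        elif x is None or y is None:
            diff = max_diff   # exactly one value missing: maximal distance
        else:
            diff = x - y
        total += diff ** 2
    return math.sqrt(total)
```

For example, `knn_distance([None, 0.2], [0.3, 0.2])` charges the full `max_diff` for the first attribute even though the second attribute matches exactly.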
2.2.5 Other Methods to Deal with Missing Values. A
Neural network (NN) is composed of interconnected
input/output units, where each connection has an
associated weight, learned using the backpropagation
algorithm. Many neural network models have been
modified to handle missing data. Following [8], we
replace each missing value by an interval that includes all
the possible values of that attribute (e.g., the unit interval
[0,1]), and replace each observed value by a
degenerate interval (e.g., 0.7 becomes [0.7,0.7]),
before applying the backpropagation algorithm. Learning
and classifying from incomplete data thereby reduce to
learning and classifying (complete) interval vectors.
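The interval encoding itself is a one-line transformation. A small sketch, assuming attributes scaled to the unit interval and using the hypothetical name `to_interval_vector`:

```python
def to_interval_vector(row, lower=0.0, upper=1.0):
    """Encode each attribute as an interval: (v, v) if observed, (lower, upper) if missing."""
    return [(lower, upper) if v is None else (v, v) for v in row]

encoded = to_interval_vector([0.7, None])
```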
2.2.6 Summary. Each of the above machine learning
algorithms uses a simple approach to deal with missing
data in its learning process, regardless of the missing
patterns. They may perform poorly on incomplete data,
especially those missing many values. This paper
explores the use of more advanced imputation techniques,
such as BMI, EM, or PMM (introduced below). As some
of our imputers only deal with numerical or ordinal
datasets (but not nominal), we only investigate these
kinds of data.
2.3 Linear Regression Imputation
Linear regression (LinR) imputation attempts to predict
the missing value based on the observed values of other
variables. In general, given a one-dimensional vector of
inputs X=(X1, X2, …, Xp), linear regression predicts the
dependent value Y based on the linear regression model
$$Y = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \varepsilon \qquad (1)$$
where ε is a residual and the coefficients β0 and β=(β1, β2,
…, βp)T are trained on the existing values to minimize the
sum of squared residuals. Here Y is the missing feature
value to be imputed, and each Xj is the value of an
observed feature of the same instance.
We round LinR imputed values to the nearest integers
for integer attributes. We also find the observed value
range [min, max] for each attribute, and replace imputed
values <min with min, and those >max with max for
missing values. This procedure is also applied for other
imputers described below.
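The regression fill-in together with the rounding and clipping step can be sketched as follows. For brevity the sketch assumes a single predictor rather than the p predictors of Equation 1, and the names `fit_simple_regression` and `linr_impute` are hypothetical.

```python
from statistics import mean

def fit_simple_regression(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def linr_impute(xs, ys, integer_valued=False):
    """Fill None entries of ys from xs, then round and clip to the observed range."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    b0, b1 = fit_simple_regression([p[0] for p in pairs], [p[1] for p in pairs])
    lo, hi = min(p[1] for p in pairs), max(p[1] for p in pairs)
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            y = b0 + b1 * x
            if integer_valued:
                y = round(y)         # note: Python rounds exact halves to even
            y = min(max(y, lo), hi)  # clip to the observed [min, max]
        filled.append(y)
    return filled

# The prediction for x=10 is 20, which exceeds the observed maximum and is clipped to 8.
filled = linr_impute([1.0, 2.0, 3.0, 4.0, 10.0, 2.5],
                     [2.0, 4.0, 6.0, 8.0, None, None], integer_valued=True)
```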
2.4 Expectation Maximization Imputation
Expectation-maximization (EM) is a well-known
algorithm that seeks maximum likelihood estimates of
parameters in probabilistic models in the presence of
latent variables [5]. EM imputation requires specifying a
joint probability distribution for the feature value to be
imputed and the other feature values. EM iterates between
performing an expectation E step, which computes an
expected value of the complete data likelihood, given the
observed data and the current parameters; and a
maximization M step, which calculates values of the
parameters that maximize the expected likelihood over
the data, including those estimated in the E step. The
parameters found on the M step are then used to begin
another E step, and the process is repeated until it
converges to a stationary point.
Our implementation of the EM algorithm assumes the data
is drawn from a multivariate Gaussian distribution
(parameterized by the mean and the covariance matrix)
and uses ridge regression [18]. EM first produces an
initial guess of these parameters. In each subsequent
iteration, EM updates its estimates of the mean and the
covariance matrix in three steps [18]: (1) for each
instance with missing values, initialize the distribution
parameters by estimating the mean and the covariance
matrix; (2) the missing values in a sample are replaced
with their conditional expectation values given the
observed values using the estimated mean and covariance
matrix; (3) the mean and the covariance matrix are
re-estimated, using the sample mean of the completed
dataset and the covariance matrix as the sum of the
sample covariance matrix of the completed dataset and
the contributions from the conditional covariance matrix
of the imputation errors. EM iterates these steps until the
imputed values and the estimates of the mean and
covariance stop changing [13].
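To make the E and M steps concrete, here is a deliberately reduced sketch: a bivariate Gaussian in which x is always observed and only y may be missing. It omits the ridge regression of the implementation described above and runs a fixed number of iterations instead of testing convergence; the function name and the toy data are hypothetical.

```python
def em_impute_bivariate(xs, ys, iterations=200):
    """EM for a bivariate Gaussian where x is always observed and y may be None."""
    n = len(xs)
    observed = [y for y in ys if y is not None]
    n_missing = n - len(observed)
    mu_x = sum(xs) / n
    var_x = sum((x - mu_x) ** 2 for x in xs) / n
    # Initial guess: moments of the observed y values, zero covariance.
    mu_y = sum(observed) / len(observed)
    var_y = sum((y - mu_y) ** 2 for y in observed) / len(observed)
    cov = 0.0
    filled = list(ys)
    for _ in range(iterations):
        # E step: conditional expectation of each missing y given its x.
        slope = cov / var_x
        resid_var = var_y - slope * cov  # conditional variance of y given x
        filled = [mu_y + slope * (x - mu_x) if y is None else y
                  for x, y in zip(xs, ys)]
        # M step: re-estimate the moments from the completed data, adding the
        # conditional variance contributed by each imputed value.
        mu_y = sum(filled) / n
        cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, filled)) / n
        var_y = (sum((y - mu_y) ** 2 for y in filled) + n_missing * resid_var) / n
    return filled

# Observed pairs follow y = 2x, so the imputed values approach 10 and 12.
filled = em_impute_bivariate([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                             [2.0, 4.0, 6.0, 8.0, None, None])
```

The extra `n_missing * resid_var` term in the M step plays the role of the "contributions from the conditional covariance matrix of the imputation errors" in step (3).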
2.5 Predictive Mean Matching Imputation
Predictive mean matching (PMM) is a state-of-the-art
imputation technique [12] that imputes the missing values
Ymiss,i of an incomplete instance (recipient) Yi by taking them
from the nearest instance (donor). The distance function
(see Equation 3 below) is computed on the expected values of the
missing variables conditioned on the observed covariates Yobs,i,
instead of directly on the values of the covariates. Our
version of PMM works as follows:
1) Use the EM algorithm [5] to estimate the parameters
θ of a multivariate Gaussian distribution over the attribute
values using all the available data.
2) Compute the conditional expected value for the
missing part Ymiss,i of instance Yi conditioned on the
observed part Yobs,i based on the estimated parameters θ.
$$\hat{\mu}_i = E(Y_{miss,i} \mid Y_{obs,i}, \theta) \qquad (2)$$
3) Match each recipient Yi to another instance
(possible donor) Yj=argminj d(i,j) that has the nearest
predictive mean with respect to the Mahalanobis
distance, defined through the residual covariance matrix
from the regression of the missing items on the observed
ones.
$$d(i,j) = (\hat{\mu}_i - \hat{\mu}_j)^T \, S^{-1}_{Y_{miss,i} \mid Y_{obs,i}} \, (\hat{\mu}_i - \hat{\mu}_j) \qquad (3)$$
where S is the empirical covariance matrix [21].
4) Impute each missing value in the recipient with the
corresponding values from its closest donor.
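The four steps can be illustrated with a reduced sketch for one incomplete variable y and one fully observed predictor x. Two simplifications are assumptions of this sketch, not the method above: the parameters are fitted by least squares on the complete cases instead of by EM, and with a single missing variable the Mahalanobis distance reduces to a scaled absolute difference of predictive means, so the scaling is dropped. The name `pmm_impute` is hypothetical.

```python
def pmm_impute(xs, ys):
    """Predictive mean matching for one incomplete variable y with predictor x."""
    donors = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(donors)
    mx = sum(x for x, _ in donors) / n
    my = sum(y for _, y in donors) / n
    slope = (sum((x - mx) * (y - my) for x, y in donors)
             / sum((x - mx) ** 2 for x, _ in donors))

    def predictive_mean(x):
        """Expected value of y given x under the fitted model."""
        return my + slope * (x - mx)

    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            # Donor = instance whose predictive mean is nearest to the recipient's.
            donor = min(donors,
                        key=lambda d: abs(predictive_mean(d[0]) - predictive_mean(x)))
            y = donor[1]  # copy the donor's observed value, not the prediction
        filled.append(y)
    return filled

filled = pmm_impute([1.0, 2.0, 3.0, 4.0, 2.9], [2.1, 3.9, 6.2, 8.0, None])
```

Note that the imputed value is always one that was actually observed in the data, which is what distinguishes PMM from plain regression imputation.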
2.6 Bayesian Multiple Imputation
Standard single imputation produces a single filled-in
dataset, where each missing value is replaced with a
single value, perhaps drawn from the posterior predictive
distribution. While this approach is simple and can be
applied to virtually every data set, single imputation does
not account for the uncertainty about the predictions of
the imputed values, which can lead to statistically invalid
inferences [16]. By contrast, multiple imputation (MI)
produces many different filled-in datasets. For example,
consider filling in the (3,3) position in the upper-left 4x3
table in Figure 2. Here, we could produce m=4 different
completed datasets, shown as the next 4 tables in that
figure; note the proposed values for this (3, 3) entry are
{2,3,4,3}. BMI uses the average of these four values
(here, 3) as the final prediction – see the bottom right 4x3
table in the figure. In many situations, MI approaches
have proven to be highly effective even for small values
of m – say 3 to 12 [16].
BMI follows a Bayesian framework: it specifies a
parametric model for the complete data, with a prior
distribution over the unknown model parameters θ, then
simulates m independent draws from the conditional
distribution of the missing data given the observed data
by Bayes’ Theorem. We apply Markov chain Monte
Carlo (MCMC) to perform BMI [16].
Figure 2. An example of BMI with m=4. Each value in
the shaded cells is an estimated value from an
imputation. BMI produces different filled-in datasets
(after the “?” sign) and takes the average of the m
predictions as the final imputed dataset (after the
“=>” sign).
While BMI assumes a multivariate normal distribution
when generating the imputations for missing values, it is
robust to non-normally distributed data [17]. BMI imputes
data as follows [16]:
Let P(Ycom|θ) model the complete data, based on the
parameter θ (here, the mean and covariance
matrix that parameterize a normal distribution). If
Y= (Yobs, Ymiss) follows a parametric model P(Y|θ) where θ
has the prior distribution P(θ), then the posterior
predictive distribution for Ymiss is
$$P(Y_{miss} \mid Y_{obs}) = \int P(Y_{miss} \mid Y_{obs}, \theta)\, P(\theta \mid Y_{obs})\, d\theta \qquad (4)$$
Equation 4 suggests that BMI can be drawn by iterating
the following process for j=1, …, m:
(1) generate missing values Ymiss(j+1) from
P(Ymiss|Yobs,θ(j));
(2) draw parameters θ(j+1) from P(θ|Yobs, Ymiss(j+1)).
Repeat these two steps to generate the Markov chain
{Ymiss(1), θ(1), Ymiss(2), θ(2),…, Ymiss(j), θ(j),…}; note that
Ymiss(j+1) depends on θ(j), and θ(j) depends on Ymiss(j).
This entire process is repeated until the distribution
P(Ymiss, θ |Yobs) is stabilized [17].
After producing m sets of filled-in values, we take the
average as the final imputed set of values,
$$\hat{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i \qquad (5)$$
where $\hat{Q}_i$ is the i-th imputed value.
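The final averaging step of Equation 5 is straightforward to sketch. The four small completed datasets below are hypothetical (they are not the tables of Figure 2); only the four proposed values 2, 3, 4, 3 for the single missing cell are taken from the example in the text. The function name `combine_imputations` is likewise an assumption.

```python
def combine_imputations(completed_datasets):
    """Cell-wise average of m completed datasets, as in Equation 5."""
    m = len(completed_datasets)
    n_rows = len(completed_datasets[0])
    n_cols = len(completed_datasets[0][0])
    return [[sum(d[r][c] for d in completed_datasets) / m for c in range(n_cols)]
            for r in range(n_rows)]

# m=4 hypothetical completed datasets that differ only in the imputed cell (1, 1).
completed = [
    [[1, 4], [1, 2]],
    [[1, 4], [1, 3]],
    [[1, 4], [1, 4]],
    [[1, 4], [1, 3]],
]
final = combine_imputations(completed)
```

Observed cells are identical in every completed dataset, so averaging leaves them unchanged and only the imputed cell is affected.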
3. Experimental Design and Results
We used eight datasets with numeric or ordinal
attributes from the UCI machine learning repository [2]
(see Table 1). We ran all 10 classifiers on each dataset; in
general, we used the default parameters given by WEKA.
For each dataset, we used 50 trees for the random forest classifier
(RF50), and used k=5 as the number of neighbors for the
kNN classifier; we found these settings produced optimal
performance in our preliminary experiments. When
implementing our proposed imputation-helped classifiers,
we use 5 iterations for LinR and PMM, while
BMI and EM iterate until they converge. When
preprocessing the incomplete data with the imputers, we
impute the training and test sets together. Note that none of
the imputers uses the class labels of the test sets.
We used the standard training and test splits for each
of these datasets: 2/3 of the instances for training and 1/3
for testing, except the dataset “letter”, which was a 3/4 to
1/4 split. We take the average classification accuracy of 5
runs. In each run, with MCAR missing pattern, we
generate three incomplete datasets from each dataset by
randomly deleting 10%, 30%, and 50% of the observed
values.
Table 1. Description of the UCI datasets we worked on
datasets        # Train  # Test  # attri  # class
australian          460     230       14        2
breast              466     233       10        2
diabetes            512     256        8        2
heart               180      90       13        2
pima                512     256        8        2
shuttle-small      3866    1934        9        7
vehicle             564     282       18        4
letter            15000    5000       16       26
We computed the average classification accuracy over
the 8 datasets for the original classifiers, and MEI-helped,
PMM-helped, LinR-helped, EM-helped, and BMI-helped
classifiers: 10% missing appears in Table 2 and Figure 3;
30% missing in Table 3 and Figure 4; and 50% missing in
Table 4 and Figure 5. We did not count the results on the
dataset “letter” for the classifier kNN as it produces
exceptionally low accuracy (<10%) from the original
classifier when the dataset has missing ratios of 30% and
50%. Our results collectively show that EM-helped
classifiers perform the best: they achieve the highest
average classification accuracy -- significantly higher
than the original classifiers (those without using imputers)
with 1-sided t-test p-value p<0.00037. BMI-helped
classifiers are significantly better than original classifiers
with p<0.0029. MEI-helped classifiers (resp., LinR-helped,
and PMM-helped ones) however, do not
significantly outperform original classifiers, with p-values
of p<0.18 (resp. p<0.40, and p<0.12).
Table 2. Average classification accuracy over eight
datasets with MCAR missing ratio 10%
          original classifier  MEI-helped  LinR-helped  EM-helped  PMM-helped  BMI-helped
OneR                   68.31       67.98        69.18      69.59       68.78       69.85*
NB                     75.01*      74.60        74.02      74.62       74.11       74.74
dTable                 76.29       76.29        78.36      79.16*      76.85       78.38
C4.5                   81.77*      80.01        80.26      81.01       79.51       81.28
PART                   82.51*      80.17        80.98      81.50       79.73       80.91
SVM                    80.68       80.75        81.66      82.50*      80.81       82.20
LR                     80.23       80.14        82.24      83.03*      80.82       82.62
NN                     79.00       80.35        79.72      82.19       80.21       82.21*
RF50                   84.88*      83.97        83.90      84.69       83.71       84.69
kNN                    73.75       81.07        82.13      83.16       81.49       83.38*
ave                    78.24       78.53        79.24      80.15*      78.60       80.03
(* marks the highest accuracy in each row.)
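As a quick arithmetic check on Table 2, the "ave" row can be recomputed from the per-classifier rows. The sketch below does this for the original and EM-helped columns only, with the values copied from Table 2; the variable names are hypothetical, and the win count is a plain comparison, not the t-test reported in the text.

```python
from statistics import mean

# (original, EM-helped) accuracies per classifier, copied from Table 2.
table2 = {
    "OneR": (68.31, 69.59), "NB": (75.01, 74.62), "dTable": (76.29, 79.16),
    "C4.5": (81.77, 81.01), "PART": (82.51, 81.50), "SVM": (80.68, 82.50),
    "LR": (80.23, 83.03), "NN": (79.00, 82.19), "RF50": (84.88, 84.69),
    "kNN": (73.75, 83.16),
}

ave_original = mean(orig for orig, _ in table2.values())
ave_em = mean(em for _, em in table2.values())

# Number of classifiers for which the EM-helped version scores higher.
em_wins = sum(em > orig for orig, em in table2.values())
```

The recomputed averages agree with the "ave" row up to rounding, and the comparison shows that the EM-helped gain at 10% missing is concentrated in some classifiers (notably kNN) rather than spread evenly.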
Table 3. Average classification accuracy over eight
datasets with MCAR missing ratio 30%
          original classifier  MEI-helped  LinR-helped  EM-helped  PMM-helped  BMI-helped
OneR                   64.19       63.68        65.50      69.31*      65.65       68.66
NB                     73.34*      71.93        70.98      72.95       70.48       72.31
dTable                 71.42       72.19        70.50      75.40*      71.33       74.30
C4.5                   76.59       75.22        71.97      77.37*      72.04       76.05
PART                   77.65*      74.66        72.20      77.10       72.44       76.87
SVM                    75.06       75.00        73.79      78.63*      73.95       77.92
LR                     75.58       75.33        72.70      77.85*      73.93       76.91
NN                     72.40       74.65        73.70      78.17*      73.35       78.00
RF50                   80.48       79.68        77.29      80.78*      76.89       80.19
kNN                    62.19       77.43        76.91      80.71*      77.21       80.66
ave                    72.89       73.98        72.55      76.83*      72.71       76.19
(* marks the highest accuracy in each row.)
Table 4. Average classification accuracy over eight
datasets with MCAR missing ratio 50%
          original classifier  MEI-helped  LinR-helped  EM-helped  PMM-helped  BMI-helped
OneR                   61.48       60.86        64.19      67.60*      63.48       66.54
NB                     71.29*      68.80        66.88      70.48       67.38       69.94
dTable                 67.34       67.91        66.91      72.10*      66.71       70.39
C4.5                   71.05       69.60        66.41      73.82*      65.87       71.85
PART                   72.48       70.40        67.31      72.65*      65.79       71.76
SVM                    70.84       71.19        69.71      73.72*      67.92       72.90
LR                     71.63       71.73        68.88      73.65*      67.70       71.48
NN                     68.13       70.40        67.76      73.86*      66.55       72.96
RF50                   76.33       73.82        70.70      76.52*      69.76       75.34
kNN                    58.14       72.23        72.80      77.42*      71.74       77.38
ave                    68.87       69.69        68.15      73.18*      67.29       72.05
(* marks the highest accuracy in each row.)