Int. J. Business Intelligence and Data Mining, Vol. x, No. x, 200x 261
Copyright © 2007 Inderscience Enterprises Ltd.
Missing Data Imputation Techniques
Qinbao Song*
Department of Computer Science and Technology,
Xi’an Jiaotong University, 28 Xian-Ning West Road,
Xi’an, Shaanxi 710049, China
*Corresponding author
Martin Shepperd
School of IS, Computing and Maths,
Brunel University, Uxbridge UB8 3PH, UK
Abstract: Intelligent data analysis techniques are useful for better exploring
real-world data sets. However, real-world data sets are often accompanied by
missing data, which is one major factor affecting data quality. At the same
time, good intelligent data exploration requires quality data. Fortunately,
Missing Data Imputation Techniques (MDITs) can be used to improve data
quality. However, no single MDIT can be used in all conditions; each method
has its own applicable context. In this paper, we introduce MDITs to the
KDD and machine learning communities by presenting the basic idea and
highlighting the advantages and limitations of each method.
Keywords: data quality; KDD; data mining; machine learning; data cleaning;
missing data; data imputation; missingness mechanism; missing data pattern;
single imputation; multiple imputation.
Reference to this paper should be made as follows: Song, Q. and Shepperd, M.
(2007) ‘Missing Data Imputation Techniques’, Int. J. Business Intelligence and
Data Mining, Vol. x, No. x, pp.xxxxxx.
Biographical notes: Qinbao Song received a PhD in Computer Science from
the Xi’an Jiaotong University, China, in 2001. He is a Professor of Software
Technology in the Department of Computer Science and Technology at
Xi’an Jiaotong University, Xi’an, China. He has published more than
60 refereed papers in the area of Data Mining, Machine Learning, and Software
Engineering. He is a board member of the Open Software Engineering Journal.
His research interests include intelligent computing, machine learning for
software engineering, and trustworthy software.
Martin Shepperd received a PhD in Computer Science from the Open
University in 1991. He is a Professor of Software Technology at Brunel
University, London, UK, and director of the Brunel Software Engineering
Research Centre (B-SERC). He has published more than 90 refereed papers and
three books in the area of Empirical Software Engineering, Machine Learning
and Statistics. He is Editor-in-Chief of the Journal of Information and Software
Technology and was Associate Editor of IEEE Transactions on Software
Engineering (2000–2004).
1 Introduction
As more and more data are gathered every day, we are almost swallowed by
the huge data mountains. Knowledge Discovery from Databases (KDD) is a powerful
technique that can help us understand the huge amount of data and extract useful
information from it; it is becoming more and more important, and has been used in
a wide range of areas (Chen and Liu, 2005). But in practice, the presence of missing data
is a general and challenging problem for most data analysis techniques, including KDD
and machine learning. As all data analysis techniques induce knowledge strictly from
data, the quality of the knowledge extracted is largely determined by the quality of the
underlying data. This makes any data analysis technique a potential 'garbage in,
garbage out' system. Therefore, data quality has been a major concern in the data
analysis field, and the quality of data is a precondition for obtaining quality knowledge.
Noisy data, inconsistent data, and missing data are three roadblocks to gaining quality
data (Han and Kamber, 2000). In this paper, however, we just focus on Missing Data
Techniques (MDTs).
Missing data are the absence of data items for a subject; they hide information that
may be important. Missing data exist pervasively in most real-world data sets, and they:
•	result in a loss of information and statistical power (Kim and Curry, 1977)
•	make common data analysis methods inappropriate or difficult to apply
(Rubin, 1987)
•	can introduce bias into estimates derived from a statistical model (Becker and
Walstad, 1990; Rubin, 1987).
Obviously, missing data pose a challenge to most data analysis techniques, including
KDD and machine learning, because a substantial proportion of the data may be missing
and predictions must be made for cases with missing inputs. At the same time, many
KDD and machine learning methods can only work with complete data
sets (e.g., nearly half of the 23 classification algorithms studied in the Statlog project
(Michie et al., 1994) require complete data), and some of them handle missing data in a
rather naive way that can introduce bias into the knowledge induced. Therefore, missing
data are a major concern in KDD and machine learning communities. To obtain quality
knowledge with intelligent data analysis methods, we must first carefully deal with
missing data.
Unfortunately, the KDD and machine learning communities have paid relatively little
attention to missing data problems; hence, only a few papers have been published
(Sarle, 1998; Lakshminarayan et al., 1999; Aggarwal and Parthasarathy, 2001; Kalousis
and Hilario, 2000). In contrast, the missing data problems have been extensively studied
by the statistics community. This makes it possible for the KDD and machine learning
communities to borrow some MDTs from the statistics community to address the missing
data problems in their own fields. Therefore, in this paper, we introduce the MDTs,
especially the MDITs, from the perspective of statistics. We mainly present the technical
aspects of the methods, so researchers in the intelligent data analysis field can decide
which method to use for their specific tasks. At the same time, we also
highlight the advantages and limitations of each method, and provide some comments
from the perspective of intelligent data analysis.
The rest of the paper is organised as follows. We first introduce the basic concepts
of missingness mechanisms and missing data patterns in Section 2. Then, we provide
readers with the big picture of the MDTs in Section 3. After that, we focus on reviewing
the main single imputation and Multiple Imputation (MI) methods in Sections 4 and 5,
respectively. Finally, we
present our recommendations for the intelligent data analysis community in Section 6.
2 Missingness mechanisms and missing data patterns
Both missingness mechanisms and missing data patterns have a great impact on research
results; both are critical issues a researcher must address before choosing an appropriate
method to deal with missing data.
2.1 Missingness mechanisms
Missingness mechanisms are assumptions about the nature and types of missing data.
Little and Rubin (2002) define three unique types of missing data mechanisms: Missing
Completely at Random (MCAR), Missing at Random (MAR), and Non-Ignorable (NI).
In general, the missingness mechanism is concerned with whether the missingness is
related to the study variables or not. This is extremely significant as it determines how
difficult it may be to handle missing values and, at the same time, how risky it is to
ignore them.
Viewing response as a random process (Rubin, 1976), the missingness mechanisms
can be introduced as follows. Suppose Z is a data matrix that includes observed and
missing data, let Zobs be the set of observed values of Z, let Zmis be the set of missing
values of Z, and let R be the missing data indicator matrix, where i indexes the ith case
and j the jth variable:

Rij = 1 if zij is missing, and Rij = 0 if zij is observed.

MCAR indicates that the missingness is unrelated to the values of any variables, whether
missing or observed; thus

p(R|Z) = p(R) for all Z.
This is best illustrated by an example. Suppose there is a data set consisting of two
variables and 16 cases, see Table 1 for details. The first two columns of the table show
the complete data for variables V1 and V2. The other columns show the values of V2 that
remain after imposing three missingness mechanisms. In the third column, the missing
values appear in cases whose values of variable V1 are A, B, and C (all the possible
values of variable V1), and the missing values of variable V2 are distributed from small
to large. This means that the values are missing randomly, with no relation to either
variable V1 or V2 itself; thus it is MCAR.
Table 1 Complete data set and the simulation of three missingness mechanisms
V1 Complete MCAR MAR NI
A 85 85 85 ?
A 94 ? 94 ?
A 111 111 111 111
A 130 130 130 130
B 80 80 ? ?
B 97 97 ? ?
B 117 117 ? 117
B 125 ? ? 125
C 88 ? 88 ?
C 91 91 91 ?
C 123 123 123 123
C 132 ? 132 132
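The three deletion rules behind Table 1 can be sketched in Python. This is an illustrative sketch only: the MAR and NI rules reproduce the table's fourth and fifth columns exactly, while the MCAR rule is random by nature, so only its general form is shown; the cut-off of 100 for NI is taken from the table, the function names are assumptions.

```python
import random

# Complete toy data set from Table 1: (V1, V2) pairs.
data = [("A", 85), ("A", 94), ("A", 111), ("A", 130),
        ("B", 80), ("B", 97), ("B", 117), ("B", 125),
        ("C", 88), ("C", 91), ("C", 123), ("C", 132)]

def impose_mcar(values, p=0.25, seed=0):
    """MCAR: each V2 value is deleted with the same probability,
    independent of V1 and of V2 itself."""
    rng = random.Random(seed)
    return [None if rng.random() < p else v2 for _, v2 in values]

def impose_mar(values):
    """MAR: V2 is deleted whenever V1 == 'B'; missingness depends
    only on the observed variable V1, never on V2 itself."""
    return [None if v1 == "B" else v2 for v1, v2 in values]

def impose_ni(values, threshold=100):
    """NI: V2 is deleted whenever V2 < 100; missingness depends on
    the (unseen) value of V2 itself."""
    return [None if v2 < threshold else v2 for _, v2 in values]

mcar = impose_mcar(data)
mar = impose_mar(data)   # reproduces the MAR column of Table 1
ni = impose_ni(data)     # reproduces the NI column of Table 1
```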
MCAR is an extreme condition and from an analyst’s point of view, it is an ideal
condition. Generally, you can test whether the MCAR condition holds by showing that
there is no difference between the distributions of the observed data for the cases with
complete data and the cases with missing data; this is Little's (1988) and Little et al.'s
(1995) multivariate test, which is implemented in SYSTAT and the SPSS Missing Values
Analysis module. Unfortunately,
this is hard when there are few cases as there can be a problem with Type I errors.
Non-Ignorable (NI) is at the opposite end of the spectrum. It means that the
missingness is non-random, it is related to the missing values, and it is not predictable
from any one variable in the data set. That is
p(R|Z) ≠ p(R) for all Z, and p(R|Z) depends on Zmis.
NI is the worst case since, as the name implies, the problem cannot be avoided by
a deletion technique, nor are imputation techniques in general effective unless the analyst
has some model of the cause of missingness. For example, in the fifth column of Table 1,
all the values smaller than 100 are missing, and the corresponding values of V1 span
A, B, and C. This means that whether a value is missing depends only on the variable
V2 itself; thus it is NI. This can also be illustrated by another example. Suppose
software engineers are less likely to report high defect rates than low rates, perhaps for
reasons of politics. Merely to ignore the incomplete values leads to a biased sample and
an over optimistic view of defects. On the other hand, imputation techniques do not work
well either since they attempt to exploit known values and as we have already observed
this is a biased sample. Unless one has some understanding of the process and can
construct explanatory models, there is little that can effectively be done with NI
missing data.
MAR lies between these two extremes. It requires that the cause of the missing data is
unrelated to the missing values, but may be related to the observed values of other
variables, that is:
p(R|Z) = p(R|Zobs) for all Zmis.
For example, in the fourth column of Table 1, all the missing values appear in cases
whose value of variable V1 is B, and the missing values are distributed from small to
large. This means the missingness depends only on the variable V1 and has no relation
to V2 itself; thus it is MAR.
Most missing data methods assume MAR. Whether the MAR condition holds can be
examined by a simple t-test of mean differences between the groups with complete data
and that with missing data (Kim and Curry, 1977; Tabachnick and Fidell, 2001). MAR is
less restrictive than MCAR because MCAR is a special case of MAR. MAR and MCAR
are both said to be ignorable missing data mechanisms, a term coined in Rubin (1976)
and fully explicated in the context of MI in Rubin (1987).
In practice, it is usually difficult to meet the MCAR assumption. MAR is an
assumption that is more often, but not always, tenable.
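The t-test check described above can be sketched as a hand-computed Welch-style statistic. This is an illustrative toy under assumed names and data: x is a fully observed variable, y a variable with missing values, and we compare x between the cases where y is observed and where it is missing; under MCAR the two groups should not differ systematically.

```python
from statistics import mean, stdev

def mcar_t_statistic(x, y):
    """Welch t-statistic comparing the fully observed variable x
    between cases where y is observed and cases where y is missing.
    A large |t| is evidence against MCAR (the two groups differ)."""
    obs = [xi for xi, yi in zip(x, y) if yi is not None]
    mis = [xi for xi, yi in zip(x, y) if yi is None]
    se = (stdev(obs) ** 2 / len(obs) + stdev(mis) ** 2 / len(mis)) ** 0.5
    return (mean(obs) - mean(mis)) / se

# y is missing exactly where x is large, so the group means differ
# sharply and the statistic is far from zero.
x = [1.0, 2.0, 3.0, 4.0, 50.0, 60.0, 70.0, 80.0]
y = [9.0, 8.0, 7.0, 6.0, None, None, None, None]
t = mcar_t_statistic(x, y)
```

In practice one would compare |t| against the appropriate t distribution; as the text notes, with few cases such tests are fragile.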
2.2 Missing data patterns
The missing data indicator matrix R reveals the missing data pattern. By rearranging the
cases and the variables of a data set, we can reveal the missing data patterns. Generally,
there are two types of missing data patterns: the univariate pattern and the multivariate
pattern. In the univariate missingness pattern, only one variable contains missing values;
in Table 2, for example, only variable x3 contains missing values (three of them).
In the multivariate missingness pattern, more than one variable contains missing data.
We can refine this pattern into two types: the monotone pattern and the arbitrary pattern.
Table 2 Univariate missing pattern
     x1  x2  x3  x4  x5  x6
C1   *   *   *   *   *   *
C2   *   *   *   *   *   *
C3   *   *   ?   *   *   *
C4   *   *   ?   *   *   *
C5   *   *   ?   *   *   *
C6   *   *   *   *   *   *
In the monotone pattern, variables can be arranged so that for a set of variables x1, x2, …, xn,
if xi is missing, then so are xi+1, …, xn. Table 3 is an example.
Table 3 Monotone missing pattern
     x1  x2  x3  x4  x5  x6
C1   *   *   *   *   *   *
C2   *   *   *   *   *   ?
C3   *   *   *   *   ?   ?
C4   *   *   *   ?   ?   ?
C5   *   *   ?   ?   ?   ?
C6   *   ?   ?   ?   ?   ?
In the arbitrary pattern, missing data can occur anywhere and no special structure appears
regardless of how you arrange variables. Table 4 is an example.
Table 4 Arbitrary missing pattern
     x1  x2  x3  x4  x5  x6
C1   *   *   ?   *   *   *
C2   *   *   *   *   *   ?
C3   *   *   *   *   *   ?
C4   ?   ?   *   *   *   *
C5   *   *   *   *   *   *
C6   ?   *   *   *   *   *
The SPSS Missing Value Analysis module has the function of assessing missing data
patterns. The type of missing data pattern may affect the selection of missing data
methods, because some missing data methods are sensitive to the missing data patterns.
Therefore, we will discuss this issue when introducing a specific missing data imputation
method if applicable.
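The indicator matrix R and a check for the monotone pattern can be sketched as follows, using the layouts of Tables 3 and 4 with `None` standing for '?'. The function names are illustrative assumptions.

```python
def indicator_matrix(data):
    """R[i][j] = 1 if value j of case i is missing, else 0."""
    return [[1 if v is None else 0 for v in row] for row in data]

def is_monotone(data):
    """A pattern is monotone if, under the given variable ordering,
    every missing value in a case is followed only by missing values."""
    for row in data:
        seen_missing = False
        for v in row:
            if v is None:
                seen_missing = True
            elif seen_missing:
                return False
    return True

# The monotone pattern of Table 3 versus the arbitrary pattern of
# Table 4; '*' stands for any observed value.
table3 = [["*"] * 6,
          ["*"] * 5 + [None],
          ["*"] * 4 + [None] * 2,
          ["*"] * 3 + [None] * 3,
          ["*"] * 2 + [None] * 4,
          ["*"] * 1 + [None] * 5]
table4 = [["*", "*", None, "*", "*", "*"],
          ["*", "*", "*", "*", "*", None],
          ["*", "*", "*", "*", "*", None],
          [None, None, "*", "*", "*", "*"],
          ["*"] * 6,
          [None, "*", "*", "*", "*", "*"]]
```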
3 The classification of Missing Data Techniques (MDTs)
In this section, we first show readers a big picture of MDTs by presenting the
corresponding overall taxonomy. Then, we draw readers’ attention to a specific part of it:
the taxonomy of the MDITs, which are the focus of the paper.
3.1 Missing Data Techniques (MDTs) taxonomy
There is a great deal of research activity on MDTs, and a wide range of missing data
methods have been proposed. We divide these MDTs into three categories:
missing data ignoring techniques
missing data toleration techniques
Missing Data Imputation Techniques (MDITs).
The missing data ignoring techniques simply delete the cases that contain missing data.
Because of its simplicity, it is widely used, but it does not lead to the most efficient
utilisation of the data and should be used only in situations where the amount of missing
data values is very small. This technique has two forms: Listwise Deletion (LD) and
Pairwise Deletion (PD).
Listwise Deletion (LD) is also referred to as case deletion, casewise deletion and
complete case analysis. This method omits the cases containing missing data,
it only makes use of those cases that do not contain any missing data. It is easy,
fast, commonly accepted, and does not invent data. It is the default of most statistical
packages. The drawback is that its application leads to large loss of observations,
which may result in too small data sets if the fraction of missing values is high
(Myrtveit et al., 2001), and this method incurs a bias in the data unless data are
missing under MCAR.
Pairwise Deletion (PD) is also referred to as available case method. This method
considers each variable separately. For each variable, all recorded values in each
observation are considered (Strike et al., 2001) and missing data are ignored.
This means that different calculations will utilise different cases and will have
different sample sizes; this effect is undesirable. But PD still provides better estimates
than LD (Kim and Curry, 1977). The advantage is that the sample size for each
individual analysis is generally higher than with complete case analysis, but the
results are unbiased only if the data are missing under MCAR. It is necessary when
the overall sample size is small or the number of cases with missing data is large.
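The difference between the two deletion forms is easy to see in code. This is a hypothetical sketch where `None` marks a missing value: LD keeps only fully complete cases, while PD determines, per pair of variables, which cases are usable.

```python
def listwise_cases(data):
    """LD: keep only the cases with no missing value at all."""
    return [row for row in data if None not in row]

def pairwise_n(data, j, k):
    """PD: the sample size available for a statistic involving
    variables j and k, i.e. cases where both are observed."""
    return sum(1 for row in data
               if row[j] is not None and row[k] is not None)

data = [[1.0, 2.0, None],
        [2.0, None, 3.0],
        [3.0, 4.0, 5.0],
        [4.0, 5.0, 6.0]]
```

Here LD retains only two of the four cases, whereas PD gives each pairwise statistic its own (generally larger) sample size, which is exactly the undesirable inconsistency noted above.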
The missing data toleration techniques are internal missing data treatment strategies:
they work directly with data sets containing missing values. If the objective is not to
predict the missing values, these techniques are a better choice. This is because
any missing data prediction method will incur bias, and when biased data are used to
make predictions, the result is doubtful.
CART (Breiman et al., 1984) addresses the missing data problem in the context of
decision tree classifiers. If some cases contain missing values, CART uses the best
surrogate split to assign these cases to branches of a split on a variable whose values
are missing for these cases. C4.5 (Quinlan, 1988) is an alternative to CART. C4.5 uses
a probabilistic approach to handle missing data; missing values can be present in any
variable except the class variable. This method calculates the expected information gain
by assuming that the missing value is distributed according to the observed values in the
subset of the data at that node of the tree. It seems to be one of the best of the simple
methods for treating missing values (Grzymala-Busse and Hu, 2000).
However, some data analysis methods only work with a complete data set; in such
contexts, toleration techniques cannot be used, and we must first fill in the missing
values or delete the cases containing missing values, and then use the resulting data set
to perform the subsequent data analysis. And in cases where the data
set contains large amounts of missing data, or the mechanism causing the missing data is
non-random, imputation techniques might perform better than ignoring techniques
(Haitovsky, 1968).
The MDITs refer to any strategy that fills in missing values of a data set so that
standard data analysis methods can then be applied to analyse the completed data set.
These techniques not only retain data in incomplete cases, but also impute values of
correlated variables (Little and Rubin, 1989).
3.2 Missing Data Imputation Techniques (MDITs) taxonomy
In practice, missing data imputation is one of the most common techniques for handling
missing data (Kalton, 1981; Sedransk, 1985). It is also a rapidly evolving field with many
methods. All these methods can be classified into
ignorable missing data imputation methods, which consist of single imputation
methods and MI methods
NI missing data imputation methods, which consist of likelihood-based methods and
non-likelihood-based methods.
The likelihood-based methods (Hogan and Laird, 1997; Little, 1993, 1995;
Tsiatis et al., 1994; Wu and Bailey, 1988) require a full parametric specification
of the joint distribution of the complete data and the missingness mechanism
(Scharfstein et al., 1999); it is typical to specify a parametric model for NI and
incorporate it into the complete-data log-likelihood. Depending on how the joint
distribution of the primary data Z and missingness indicators R is specified, Little and
Rubin (1987) presented three general classes of likelihood-based NI missing data models:
Selection Models (SMs) assume different parameters for the primary data model
p(Z) and for p(R|Z).
Pattern Mixture Models (PMMs) assume different parameters for p(Z|R) and
for p(R).
Shared Parameter Models (SPMs) instead incorporate common parameters into
models for p(Z) and p(R).
Selection Models (SMs) (Heckman, 1976; Amemiya, 1984; Diggle and Kenward,
1994; Little, 1995; Verbeke and Molenberghs, 2000) require provision
of the distribution of complete data and specification of the manner in which the
probability of missingness depends on the data (Schafer and Graham, 2002).
Pattern Mixture Models (PMMs) (Little, 1993, 1994; Glynn et al., 1986; Hedeker
and Gibbons, 1997; Little and Schenker, 1995; Little, 1995; Little and Wang, 1996;
Verbeke and Molenberghs, 2000) assume the joint distribution of Z and R is a
mixture of the missing patterns; this means that the parameters for the model have
to be calculated separately for each pattern. PMMs require specification of a
distribution of probability for the missing data patterns and a model for the data
within each pattern. The typical feature is that the distribution of the missingness
only depends on the covariates and not on Z.
Shared Parameter Models (SPMs) (Wu and Bailey, 1989; Schluchter, 1992;
DeGruttola and Tu, 1994; Dunson and Perreault, 2001; Roy, 2003) are used when
the missingness may be associated with the true underlying response for
a subject in clustered and longitudinal data settings. Several shared parameter
approaches have been proposed for analysing longitudinal data when the time to
censoring depends on a subject’s underlying rate of change.
The non-likelihood-based methods (Robins, 1997; Robins et al., 1995; Rotnitzky and
Robins, 1997; Rotnitzky et al., 1998; Paik, 1997) require the joint distribution of the
complete data to follow a non-parametric or semi-parametric model and the missingness
mechanism to follow a parametric model (Scharfstein et al., 1999).
Sensitivity analysis is a rational approach to NI non-response, and has been used
by Little (1994), Little and Wang (1996), Rubin (1977) and Scharfstein et al. (1999).
But Scharfstein et al. (1999) argue that sensitivity analysis has some problems. The first
is that, in practice, users favour simplicity and conciseness in the presentation of
results. The second lies in the fact that a sensitivity analysis usually needs to be confined
to a relatively small number of parameters. The third is that many different forms of
sensitivity analysis can be expected and may yield contradictory conclusions.
When compared with the likelihood-based methods, the non-likelihood-based
methods are preferable. The reason is that in most real-world problems, it is not possible
to fully specify the non-response model as a function of reported values. However, the NI
missing data imputation methods require explicit specification of a distribution for the
missingness in addition to the model for the complete data; they are more complex and
not well suited to KDD purposes. Thus, in this paper, we focus mainly on single and
multiple imputation methods, which have the potential of serving KDD and machine
learning.
A single imputation method fills in a single value for each missing value; currently, it is
more common than MI, which replaces each missing value with several plausible values
so as to reflect the sampling variability about the actual values.
4 Single imputation techniques
Single imputation refers to the substitution of a single value for each missing value.
It is flexible: it fills in missing values with imputed data so that complete-data analysis
methods can be applied. Further, it handles missing data only once, which implies
a consequent consistency of answers from identical analyses (Rubin, 1988). This means
that single imputation methods can be applied to KDD and machine learning. Although
single imputation does address the missing data problem to some extent, unfortunately,
it still does not reflect the uncertainty in missing data estimates: the sample size is
overestimated, the variance and standard errors are underestimated, confidence intervals
for estimated parameters are too narrow, and Type I error rates are too high (Little and
Rubin, 2002).
There are different types of single imputation techniques. We focus both on the main
basic imputation techniques, such as mean imputation, regression imputation, and
hot-deck imputation, and on advanced imputation techniques, such as the Expectation
Maximisation (EM) approach and Raw Maximum Likelihood (RML) methods.
The Approximate Bayesian Bootstrap (ABB) and Data Augmentation (DA) methods that
are closely related to MI will be introduced in Section 4.1.
4.1 Mean imputation
Mean imputation, also called unconditional mean imputation, is a widely used imputation
method. It assumes that the mean of a variable is the best estimate for any
case that has missing information on this variable. Thus, it imputes each missing value
with the mean of the known values of the same variable if data are missing under MCAR
(Wilks, 1932). Suppose x3 in Table 3 is a continuous variable; then the missing values of
both C5(x3) and C6(x3) are imputed with the mean of the four observed values of
variable x3, that is

C5(x3) = C6(x3) = (C1(x3) + C2(x3) + C3(x3) + C4(x3)) / 4.

If x3 in Table 3 is a categorical variable, the missing values of both C5(x3) and C6(x3) are
the mode of the four observed values of variable x3, that is

C5(x3) = C6(x3) = arg max_υ |{i ≤ 4 : Ci(x3) = υ}|,

where υ ranges over the known values of variable x3.
A variation of mean imputation is to stratify a data set into sub-groups based on
auxiliary variables and then impute the sub-group mean for missing data within the
sub-group (Kalton and Kasprzyk, 1986; Song et al., 2005; Song and Shepperd, 2007).
This certainly represents an improvement on the overall mean imputation, although it
may not be perfect. Cohen (1996) proposed another way to improve mean imputation.
Instead of imputing the mean for all the missing values, one can impute half of the
missing values with

x̄obs + sobs

and the other half with

x̄obs − sobs,

where nobs is the number of observed values, x̄obs is the mean of the observed values, and

sobs = ( Σi (zi − x̄obs)² / (nobs − 1) )^(1/2)

is the standard deviation of the observed values. This adjustment will retain the first and
second moments as observed.
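Both the basic rule and Cohen's adjustment can be sketched as follows. This is a hypothetical illustration: `None` marks a missing value, the function names are invented, and the sample standard deviation is assumed for s.

```python
from statistics import mean, stdev

def mean_impute(values):
    """Unconditional mean imputation: every missing entry is filled
    with the mean of the observed entries of the same variable."""
    m = mean([v for v in values if v is not None])
    return [m if v is None else v for v in values]

def cohen_impute(values):
    """Cohen's (1996) variant: fill half the missing entries with
    mean + sd and the other half with mean - sd, so the completed
    variable keeps roughly the observed first and second moments."""
    obs = [v for v in values if v is not None]
    m, s = mean(obs), stdev(obs)
    fills = [m + s, m - s]   # alternate between the two values
    out, i = [], 0
    for v in values:
        if v is None:
            out.append(fills[i % 2])
            i += 1
        else:
            out.append(v)
    return out
```

Note how `mean_impute` leaves the completed mean unchanged but shrinks the variance, while `cohen_impute` also keeps the completed mean equal to the observed mean without collapsing all imputed values onto one point.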
Advantages and limitations
Mean imputation can be valid especially when data are missing under MCAR. It is fast,
simple, easy to implement, and no cases are excluded. But even if the missingness
mechanism is MCAR, this method still leads to underestimation of the population
variance (the bias is proportional to (nobs – 1)/(nobs + nmis – 1) (Little, 1992)), and thus
to a small standard error and a possibility of Type I error.
4.2 Regression imputation
Regression imputation, also called conditional mean imputation, replaces each missing
value with a predicted value based on a regression model if data are missing under MAR.
A general regression imputation is a two-phase procedure: first, a regression model is
built using all the available complete observations, and then missing values are estimated
based on the built regression model.
Little (1992) divided the regression imputation into two categories: independent
variables based and both independent and dependent variables based methods.
The former just uses the information in the reported independent variables of a case to
impute the missing independent variables, and Little recommends using Weighted Least
Squares (WLS) regression (Little, 1992). But Gourieroux and Montfort (1981) and Conniffe
(1983) noted that the estimation error in the regression coefficients introduces a correlation
between the incomplete cases and further affects the best choice of weights and the
consistency of standard error estimation; they proposed using Generalised Least Squares
(GLS) to address this problem. GLS uses both the dependent variable and the reported
independent variables to impute missing values if the partial correlation of dependent
variable and a missing independent variable given the reported independent variable is
high. However, Little (1992) noted that whether the missing independent variables are
imputed using only reported independent variables or using both reported independent
variables and dependent variable, estimated standard errors of the regression coefficients
from Ordinary Least Squares (OLS) or WLS on the filled-in data will tend to be too
small. The reason is that imputation error is not taken into account.
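The two-phase procedure can be sketched for a single predictor. This is a minimal OLS sketch under assumed names and toy data; real applications would use several predictors and, per the discussion above, WLS or GLS variants.

```python
def regression_impute(x, y):
    """Conditional mean imputation: fit y = a + b*x on the complete
    cases (phase 1), then fill each missing y with its predicted
    value (phase 2)."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    # Ordinary least squares slope and intercept on complete cases.
    b = (sum((xi - mx) * (yi - my) for xi, yi in pairs)
         / sum((xi - mx) ** 2 for xi, _ in pairs))
    a = my - b * mx
    return [a + b * xi if yi is None else yi for xi, yi in zip(x, y)]

# On complete cases y = 2x exactly, so the missing y is filled with 6.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, None, 8.0]
filled = regression_impute(x, y)
```

Because every imputed value sits exactly on the fitted line, the filled-in data show less residual scatter than real data would, which is the variance underestimation and inflated correlation discussed in this section.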
Advantages and limitations
Regression imputation maintains the sample size by preserving the cases with missing
data; in this respect it is better than the deletion techniques. But as imputed data are
always predicted from the regression
model that needs to be specified, correlations and covariances are inflated. It may require
large samples to produce stable estimates (Donner, 1982), and it still underestimates
the variance.
4.3 Hot-deck imputation
Hot-deck imputation is a procedure where the imputed values come from other cases in
the same data set.
Traditional hot-deck imputation stratifies a data set into classes according to some
auxiliary variables, keeps complete cases within classes on an active file, and selects the
most similar one to fill in missing data (Ford, 1983).
There are two techniques for selecting the most similar case: random (stochastic)
and deterministic methods. The random method is the simplest hot-deck imputation;
it randomly chooses observed values of the same variable from donor cases
according to predetermined auxiliary variables on which donors and donees
match (Rao and Shao, 1992; Reilly, 1993). If a matched class does not have any
observed value, the class is combined with other classes and imputation is performed based
on the combined imputation classes. The deterministic hot-deck imputation methods
include: the Similar Response Pattern Imputation (SRPI) (Joreskog and Sorbom, 1993),
which identifies the most similar case without missing values and copies the value(s) of
this case to fill in the blanks in the cases with missing data; k-NN (k Nearest Neighbours)
imputation (Fix and Hodges, 1952; Batista and Monard, 2003; Song et al., 2005), which
searches for the k nearest neighbours of the case with the missing value and replaces the
missing value by the average of the corresponding variable values of the k nearest
neighbours; and multivariate matching, which matches donors and donees on several
predetermined auxiliary variables.
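The k-NN flavour of deterministic hot-deck imputation can be sketched as follows. This is an illustrative toy: the squared Euclidean matching distance over the other columns, the function name, and the data are all assumptions.

```python
def knn_impute(data, target_col, k=2):
    """Deterministic hot-deck (k-NN): for each case missing
    target_col, find the k donor cases closest on the remaining
    columns and impute the donors' average for target_col."""
    donors = [row for row in data if row[target_col] is not None]
    out = []
    for row in data:
        if row[target_col] is None:
            def dist(d):
                # Squared Euclidean distance over the other columns.
                return sum((a - b) ** 2
                           for i, (a, b) in enumerate(zip(row, d))
                           if i != target_col and a is not None)
            nearest = sorted(donors, key=dist)[:k]
            fill = sum(d[target_col] for d in nearest) / len(nearest)
            out.append(row[:target_col] + [fill] + row[target_col + 1:])
        else:
            out.append(row)
    return out

# The last case matches the first two donors on the auxiliary
# columns, so their target values (10 and 12) are averaged.
data = [[1.0, 1.0, 10.0],
        [1.1, 0.9, 12.0],
        [5.0, 5.0, 50.0],
        [1.0, 1.0, None]]
filled = knn_impute(data, target_col=2, k=2)
```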
Advantages and limitations
Hot-deck preserves the population distribution by substituting different observed values
for each missing value and maintains the proper measurement level of variables
(categorical variables remain categorical and continuous variables remain continuous);
it is better than mean imputation and regression imputation. But
“hot-deck imputations do not explicitly take into account the likely possibility
that the probabilities for respondent and non-respondent subpopulations are
different” (Chiu and Sedransk, 1986),
and they distort correlations and covariances. In addition, the donor data set must be
large enough that suitable donor cases can be found.
Cold-deck imputation is similar to hot-deck imputation, but it copies in a value from
a similar case in a historical data set rather than the current data set; it is useful for
variables that are static.
4.4 Expectation Maximisation (EM)
The EM (Dempster et al., 1977; Ghahramani and Jordan, 1995) approach to missing
data handling was first proposed by Dempster et al. (1977). It is a general method of
finding the maximum likelihood estimates of the parameters of an underlying distribution
from an incomplete data set.
The basic idea of the EM is first to predict the missing values based on assumed
values for the parameters, then use these predictions to update the parameters, and
repeat until the sequence of parameters converges to maximum likelihood estimates,
as Figure 1 shows.
Suppose Zobs and Zmis, respectively, are the observed and missing parts of data set Z,
that is Z = {Zobs, Zmis}, and the data set Z can be described by p(Z|Θ), which is a
probability or density function governed by the set of parameters Θ (e.g., p might
be a set of Gaussians and Θ the means and covariances). With this density
function, we define a likelihood function L(Θ|Z) = p(Z|Θ). L(Θ|Z) is a function of the
parameters Θ for given Z, whereas p(Z|Θ) is a function of Z for given Θ. The method
starts with an initial parameter estimate Θ(0), and at each iteration t + 1, the following
two steps are performed:
The first step of EM is the Expectation step (E-step); its purpose is to compute
Q(Θ|Θ(t)), the expected value of the complete-data log-likelihood
log p(Zobs, Zmis|Θ) with respect to the unknown data Zmis, given the observed data Zobs and the
current parameter estimates. That is

Q(Θ|Θ(t)) = E[log p(Zobs, Zmis|Θ) | Zobs, Θ(t)],

where Θ(t) are the current parameter estimates that were used to evaluate the expectation
and Θ are the new parameters that will ultimately be optimised to maximise the
likelihood Q.
Note that Zobs and Θ(t) are constants, Θ is an ordinary variable that we wish to adjust, and
Zmis is a random variable governed by the distribution f(Zmis|Zobs, Θ(t)). So the expected
likelihood can be rewritten as

Q(Θ|Θ(t)) = ∫_Ω log p(Zobs, Zmis|Θ) f(Zmis|Zobs, Θ(t)) dZmis,

where f(Zmis|Zobs, Θ(t)) is the marginal distribution of the missing data, which depends on
both the observed data and the current parameters, and Ω is the space of values Zmis can
take on.
The second step of EM is the Maximisation step (M-step); its purpose is to maximise the
expectation computed in the first step. Specifically, this step estimates Θ(t+1) from Zobs and
the expected Zmis. That is

Θ(t+1) = arg max_Θ Q(Θ|Θ(t)).
Actually, we do not really need to maximise Q(Θ|Θ(t)) with respect to Θ to increase
the log-likelihood monotonically at each iteration. All that is needed is to find Θ(t+1) such
that Q(Θ(t+1)|Θ(t)) ≥ Q(Θ(t)|Θ(t)). This is called Generalised EM (GEM).
These two steps are repeated until the parameter estimates Θ(1), Θ(2), …, Θ(t+1) converge
to a local maximum of the likelihood function. For the problem of rate of convergence,
see Dempster et al. (1977), Wu (1983), Xu and Jordan (1996) and Jordan and
Xu (1996) for details.
Figure 1 EM procedure
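For a single normally distributed variable, the alternation in Figure 1 can be sketched as follows (a minimal illustration under the MAR assumption; the function and variable names are our own):

```python
import math

def em_normal(z, n_iter=50):
    """EM for the mean and variance of one normal variable with missing
    entries (None).  E-step: expected sufficient statistics of the missing
    values under the current (mu, var); M-step: re-estimate (mu, var)."""
    obs = [v for v in z if v is not None]
    n, n_mis = len(z), len(z) - len(obs)
    mu = sum(obs) / len(obs)                         # initial estimates Θ(0)
    var = sum((v - mu) ** 2 for v in obs) / len(obs)
    for _ in range(n_iter):
        # E-step: E[z_mis] = mu and E[z_mis^2] = mu^2 + var
        s1 = sum(obs) + n_mis * mu
        s2 = sum(v * v for v in obs) + n_mis * (mu * mu + var)
        # M-step: maximum likelihood update from the expected statistics
        mu = s1 / n
        var = s2 / n - mu * mu
    return mu, var

mu, var = em_normal([1.0, 2.0, 3.0, None, None])
```

In this univariate case the iteration converges to the observed-data maximum likelihood estimates; the alternation becomes genuinely useful in multivariate settings, where the E-step fills in conditional expectations given the observed part of each case.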
274 Q. Song and M. Shepperd
Advantages and limitations
The EM method has well-known statistical properties; it generally outperforms popular
ad hoc methods of incomplete data handling such as listwise deletion, pairwise deletion
and mean imputation, because it only assumes that the data are MAR rather than
MCAR. It is guaranteed to converge to a local maximum of the
likelihood function; the rate of convergence depends on the fraction of missing information,
with high missingness producing slow convergence. But EM adds no uncertainty component to
the estimated data and the estimation variability is ignored, which leads to
underestimation of standard errors and confidence intervals. At the same time, EM must
be implemented specially for each type of model. These limitations led to
modifications and extensions of EM (McLachlan and Krishnan, 1997; Chan and
Ledolter, 1995; Meng, 1994; Rubin, 1991; Wei and Tanner, 1990; Celeux and Diebolt,
1985) and to the proposal of the RML and MI methods.
4.5 Raw Maximum Likelihood (RML)
RML, also known as Full Information Maximum Likelihood (FIML), is an
efficient model-based method that uses all available cases in a data set to construct the
best possible first- and second-order moment estimates for a multivariate normally
distributed data set under the MAR assumption.
In this method, the parameters of the given model are estimated from all the available
data, and the missing values are estimated based on the estimated parameters. Hartley and
Hocking (1971) carried out the original work on RML. In their method, they first divide the
incomplete data set into several classes, each corresponding to one missing data
pattern. Then they calculate the likelihood for each class and sum these likelihoods.
After that, they perform parameter estimation based on the summed likelihood.
FIML is very similar to this method, but calculates the likelihood for each case with
observed data. The following section provides a formal introduction of this method.
Let n be the number of cases of data set Z that includes observed data Zobs and
missing data Zmis. As in EM, the data set Z can be described by p(Z|Θ), which is a
probability or density function that is governed by the set of parameters Θ. We assume
that the cases of data set Z are independent and identically distributed (i.i.d.) with
distribution p. Therefore, the resulting density for the cases is
(|) ( | ).
pZ pZ
Θ= Θ
Thinking of Z = (Zobs, Zmis) and letting nobs be the number of complete cases, we can
rewrite the resulting density as

p(Z|Θ) = ∏_{i=1}^{n} p(Zobs,i, Zmis,i|Θ) = ∏_{i=1}^{n} p(Zmis,i|Zobs,i, Θ) p(Zobs,i|Θ).
The log-likelihood of Θ based on Zobs can then be written as

L(Θ|Zobs) = ∑_{i=1}^{nobs} log p(Zobs,i|Θ) = ∑_{i=1}^{nobs} L(Θ|Zobs,i).
The FIML method first calculates the likelihood L(Θ|Zobs,i) for the observed portion of
each case i (1 ≤ i ≤ nobs). Then, it accumulates the likelihoods of all cases to form the
likelihood for all the non-missing data and performs parameter estimation using
maximum likelihood based on the summed likelihood. After that, the FIML estimates the
missing values based on the estimated parameters.
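For a bivariate normal model, the per-case observed-data log-likelihood might be evaluated as in the sketch below (the data and parameter values are invented for illustration; a numerical optimiser would then maximise this function over Θ):

```python
import math

def fiml_loglik(data, mu, cov):
    """Observed-data log-likelihood for a bivariate normal: each case
    contributes the density of its observed margin only (None = missing)."""
    (m1, m2), ((s11, s12), (_, s22)) = mu, cov
    det = s11 * s22 - s12 * s12
    ll = 0.0
    for x, y in data:
        if x is not None and y is not None:        # full bivariate density
            dx, dy = x - m1, y - m2
            q = (s22 * dx * dx - 2 * s12 * dx * dy + s11 * dy * dy) / det
            ll += -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * q
        elif x is not None:                        # marginal N(m1, s11)
            ll += -0.5 * (math.log(2 * math.pi * s11) + (x - m1) ** 2 / s11)
        elif y is not None:                        # marginal N(m2, s22)
            ll += -0.5 * (math.log(2 * math.pi * s22) + (y - m2) ** 2 / s22)
    return ll

data = [(0.1, -0.2), (0.0, 0.3), (None, 0.1), (0.2, None)]
good = fiml_loglik(data, (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))
bad = fiml_loglik(data, (10.0, 10.0), ((1.0, 0.0), (0.0, 1.0)))
```

Parameters close to the data-generating values receive a higher observed-data log-likelihood than far-off ones, which is exactly what the maximisation exploits.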
Advantages and limitations
The FIML method produces unbiased parameter estimates and standard errors under MAR
and MCAR (Arbuckle, 1996; Wothke, 2000). It is robust to data sets that do not comply
completely with the multivariate normal distribution requirement (Boomsma, 1982), and
can offer superior performance to LD and PD methods even under the NI missingness
mechanism (Wothke, 2000). But the FIML requires relatively large data sets and has
limitations in small samples (Little, 1992). Furthermore, the likelihood equations need to
be specifically worked out for a given distribution and estimation problem, and maximum
likelihood can be sensitive to the choice of starting values. One potential problem is that
the covariance matrix may be indefinite, which can lead to significant parameter
estimation difficulties, although these problems are often modest (Wothke, 2000).
4.6 Sequential imputation
The sequential imputation method (Kong et al., 1993, 1994; Irwin et al., 1994), which
imputes the missing values sequentially, was introduced by Kong et al. (1994) in the
context of Bayesian missing data problems, and has been applied to a variety of
problems. Liu (1996) discusses the use of this algorithm to provide approximations to the
quantities of interest in the context of his model for binary data. MacEachern et al. (1999)
propose improved performance of the original version by removing the locations entirely
through integration.
In the sequential imputation method, at each stage, the missing data are sampled from
their posterior and the predictive probability is evaluated. Specifically, the complete
cases are processed first, and the other cases are processed in the order of increasing
missingness so that the missing values are imputed conditioned on as many of Zobs as
possible. The formal introduction of this method is as follows.
Let Θ be parameters of interest and n be the number of cases of complete data set Z
that includes observed data Zobs and missing data Zmis. Suppose the complete data
posterior distribution p(Θ|Z) is simple. The main goal is to find the posterior distribution
p(Θ|Zobs) = ∫ p(Θ|Z) p(Zmis|Zobs) dZmis.

By drawing m independent copies of Zmis from the conditional distribution p(Zmis|Zobs),
we can approximate the posterior distribution p(Θ|Zobs) by

p(Θ|Zobs) ≈ (1/m) ∑_{i=1}^{m} p(Θ|Zobs, Zmis(i)),
where Zmis(i) is the ith imputation for the missing part Zmis. Usually, this is an iterative
approximation procedure, but the method of Kong et al. (1994) avoids iterations by
sequentially imputing Zmis,t, the missing component of the tth case, and using
importance sampling weights.
This method starts by drawing Z*mis,1 from p(Zmis,1|Zobs,1) and computing w1 = p(Zobs,1);
then, for each remaining case t (2 ≤ t ≤ n), the following two steps are performed.

The first step draws Z*mis,t from the conditional distribution

p(Zmis,t | Zmis,1, Zmis,2, …, Zmis,t−1, Zobs,1, Zobs,2, …, Zobs,t−1, Zobs,t).

The second step calculates the predictive probability

p(Zobs,t | Zmis,1, Zmis,2, …, Zmis,t−1, Zobs,1, Zobs,2, …, Zobs,t−1),

and the importance sampling weight wt = wt−1 · p(Zobs,t | Zmis,1, …, Zmis,t−1, Zobs,1, …, Zobs,t−1).
After all the cases have been processed, let w = wn; we have

w = p(Zobs,1) ∏_{t=2}^{n} p(Zobs,t | Zmis,1, Zmis,2, …, Zmis,t−1, Zobs,1, Zobs,2, …, Zobs,t−1).
By independently repeating the procedure m times to draw m sets of imputations
Zmis(i) (i ∈ {1, 2, …, m}) and the corresponding weights w(i), we can obtain the estimated
posterior distribution p(Θ|Zobs),

p(Θ|Zobs) ≈ (1/W) ∑_{i=1}^{m} w(i) p(Θ|Zobs, Zmis(i)), where W = ∑_{i=1}^{m} w(i).
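A toy sketch for Bernoulli data with a Beta prior shows the mechanics (the model, the Beta(1, 1) prior and all names are our own assumptions, not from the paper):

```python
import random

def sequential_impute(z, m=200, a=1.0, b=1.0, seed=0):
    """Sequential imputation for Bernoulli data with a Beta(a, b) prior.
    Missing entries (None) are drawn from the posterior predictive given
    all values processed so far; the predictive probabilities of observed
    entries accumulate into the importance weight of each imputation."""
    rng = random.Random(seed)
    order = sorted(range(len(z)), key=lambda i: z[i] is None)  # observed first
    draws, weights = [], []
    for _ in range(m):
        ones = zeros = 0
        w = 1.0
        filled = list(z)
        for i in order:
            p1 = (a + ones) / (a + b + ones + zeros)   # posterior predictive
            if filled[i] is None:
                filled[i] = 1 if rng.random() < p1 else 0
            else:
                w *= p1 if filled[i] == 1 else (1 - p1)  # importance weight
            ones += filled[i]
            zeros += 1 - filled[i]
        draws.append(filled)
        weights.append(w)
    return draws, weights

z = [1, 0, 1, None, 1, None]
draws, weights = sequential_impute(z)
```

Because all the observed entries here are processed before any missing one, every repetition yields the same weight, which is simply p(Zobs); in realistic multivariate settings the predictive probabilities of later observed values depend on earlier imputations, so the weights differ across the m repetitions.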
Advantages and limitations
The sequential imputation method can directly estimate the model likelihood and cheaply
perform sensitivity analysis and influence analysis. Sequential imputation, which is
essentially also a blocking scheme, handles multiple loci very well, but only with zero or
very few loops. Its limitation is that it requires that p(Z1, Z2, …, Zt−1) and p(Θ|Z) are all
simple to evaluate.
4.7 General Iterative Principal (GIP) Component Analysis (PCA)
Principal Component Analysis (PCA) is one of the most widely used multivariate
techniques; Dear (1959) suggested its use for imputation. Dear’s Principal Component (DPC)
imputation does not require any distributional assumptions, but it works poorly
for data sets with few complete cases because it uses casewise deletion to
calculate the correlation matrix. The General Iterative Principal (GIP) component method
was proposed to avoid this problem and make DPC a general-purpose method.
Suppose Z = {zij} is an n × p data matrix that includes observed and missing data,
X = {xij} is the normalised Z, and let R = {rij} be the missing data indicator matrix
(see Section 2 for details). The GIP component method consists of an initialisation step
and an iteration procedure.
The initialisation step calculates the correlation matrix S, which can be obtained by two
methods. The first method uses all available data to compute S, and modifies it with
Huseby et al.’s (1980) algorithm if it is non-positive definite. The other method applies
mean imputation to fill in all missing values and uses n − m − 1 instead of n − 1 as the
denominator in the variance–covariance calculations to obtain S, where m is the number
of cases with missing values.
The iteration procedure involves three steps that must be iteratively performed until
successive imputed values do not change materially.
The first step calculates the largest eigenvalue λ1 of S and its associated eigenvector
a1 = (a11, a12, …, a1p).

The second step performs the following actions for all cases with missing values.
Let the first principal component score for the ith case be

yi = ∑_{j=1}^{p} a1j xij rij,

so that the points on the first principal component line that are closest to the ith case
replace the missing values:

x̂ij = xij if rij = 1, and x̂ij = a1j yi if rij = 0.

The third step converts X̂ back to Ẑ and recalculates S from the imputed data matrix Ẑ.
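The initialisation and the three-step sweep might be sketched as follows (a pure-Python illustration using power iteration for the leading eigenvector; all names are ours, and the stopping rule is simplified to a fixed number of sweeps):

```python
import math

def gip_impute(Z, n_iter=20):
    """General iterative principal-component imputation (sketch).
    Z is a list of rows; None marks missing.  Missing cells start at the
    column means; each sweep re-standardises, extracts the first principal
    component of the correlation matrix by power iteration, and replaces
    the missing cells with their projections onto that component."""
    n, p = len(Z), len(Z[0])
    R = [[0 if Z[i][j] is None else 1 for j in range(p)] for i in range(n)]
    means = [sum(Z[i][j] for i in range(n) if R[i][j]) /
             sum(R[i][j] for i in range(n)) for j in range(p)]
    X = [[Z[i][j] if R[i][j] else means[j] for j in range(p)] for i in range(n)]
    for _ in range(n_iter):
        mu = [sum(row[j] for row in X) / n for j in range(p)]
        sd = [math.sqrt(sum((row[j] - mu[j]) ** 2 for row in X) / n) or 1.0
              for j in range(p)]
        Xs = [[(X[i][j] - mu[j]) / sd[j] for j in range(p)] for i in range(n)]
        S = [[sum(Xs[i][a] * Xs[i][b] for i in range(n)) / n for b in range(p)]
             for a in range(p)]
        # power iteration for the leading eigenvector a1 of S
        a1 = [1.0 / math.sqrt(p)] * p
        for _ in range(100):
            w = [sum(S[a][b] * a1[b] for b in range(p)) for a in range(p)]
            nrm = math.sqrt(sum(x * x for x in w))
            a1 = [x / nrm for x in w]
        for i in range(n):
            if 0 in R[i]:
                y = sum(a1[j] * Xs[i][j] * R[i][j] for j in range(p))  # PC score
                for j in range(p):
                    if R[i][j] == 0:   # project back, then de-standardise
                        X[i][j] = mu[j] + sd[j] * (a1[j] * y)
    return X

X = gip_impute([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, None]])
```

Observed cells are never altered; only the missing cells are repeatedly re-projected onto the current first principal component line.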
Advantages and limitations
General iterative PCA imputation is a general-purpose imputation method that does not
require any distributional assumptions, but its iterative procedure limits its performance.
4.8 Singular Value Decomposition (SVD)
The Singular Value Decomposition (SVD) method offers an interesting and stable
method for the imputation of missing values. It is easy to compute and can be used in a
simple way to impute missing data (Krzanowski, 1988).
Let Z = {zij} be an n × p data matrix that includes observed and missing data. The SVD
method starts by filling in all the missing values with initial imputed values such as
the mean. Then, for each missing value zij, the following iterative procedure is
performed until stable imputed values are achieved.

The first step excludes the ith case from Z and calculates the SVD of the remaining
(n − 1) × p data matrix, denoted by Z(−i) = Ũ D̃ Ṽ′ with Ũ = {ũst}, Ṽ = {ṽst} and
D̃ = diag{d̃1, d̃2, …, d̃p}, where Ũ and Ṽ are orthonormal matrices.

The second step excludes the jth variable from Z and calculates the SVD of the
remaining n × (p − 1) data matrix, denoted by Z(−j) = Ū D̄ V̄′ with Ū = {ūst}, V̄ = {v̄st} and
D̄ = diag{d̄1, d̄2, …, d̄p−1}.

The third step imputes zij with

ẑij = ∑_{k=1}^{p−1} ūik (d̄k)^{1/2} ṽjk (d̃k)^{1/2},

and updates zij’s current imputed value with ẑij.
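A sketch of the iteration, assuming NumPy is available (the function name and the fixed iteration count are our own choices; in practice one iterates until the imputed values stabilise):

```python
import numpy as np

def svd_impute(Z, n_iter=10):
    """Krzanowski-style SVD imputation (sketch).  Z is an n x p array with
    np.nan marking missing cells, which start at the column means.  For a
    missing z_ij, one SVD leaves out case i and another leaves out
    variable j; the two factorisations are combined to re-estimate z_ij."""
    Z = np.array(Z, dtype=float)
    miss = np.isnan(Z)
    col_means = np.nanmean(Z, axis=0)
    X = np.where(miss, col_means, Z)
    for _ in range(n_iter):
        for i, j in zip(*np.where(miss)):
            # SVD without case i: variable loadings V~ and singular values d~
            _, d1, V1t = np.linalg.svd(np.delete(X, i, axis=0),
                                       full_matrices=False)
            # SVD without variable j: case scores U- and singular values d-
            U2, d2, _ = np.linalg.svd(np.delete(X, j, axis=1),
                                      full_matrices=False)
            k = min(len(d1), len(d2))
            X[i, j] = sum(U2[i, t] * np.sqrt(d2[t]) * V1t[t, j] * np.sqrt(d1[t])
                          for t in range(k))
    return X

X = svd_impute([[1, 1], [2, 2], [3, 3], [4, float("nan")]])
```

Only the missing cells are rewritten on each pass; the observed entries anchor both factorisations.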
Advantages and limitations
In practice, this algorithm converges quite rapidly, typically within five or six iterations.
Run-time can be linear in the number of non-zero elements in the data, but the result
is very sensitive to the choice of regression method and the order of imputations.
The SVD-based method shows sharp deterioration in performance when a non-optimal
fraction of missing values is used (Troyanskaya et al., 2001).
4.9 General comments on single imputation from KDD perspective
Single imputation comprises a set of methods, some of which explicitly require large
samples to produce stable estimates or to obtain proper donor cases, such as regression
imputation, hot-deck imputation and FIML imputation. This characteristic makes
them suitable for processing the large data sets that are used for KDD or
machine learning.
At the same time, regression imputation, hot-deck imputation, sequential imputation,
general iterative principal component analysis and SVD do not assume particular
missingness mechanisms, so they are more easily applied to the preprocessing
tasks of KDD and machine learning. In contrast, some single imputation
techniques require data to be missing under a specific missingness mechanism. In
practice, however, it is usually hard to test the missingness mechanisms precisely. Therefore,
before applying a specific imputation method to KDD or machine learning, empirical
research on the safest default missingness mechanism assumptions for the method should
be done. Song et al. (2005) have done this work for the k-NN and class mean imputation
methods; they found that MAR is the safest default missingness mechanism assumption
for both imputation methods in the context of software project effort prediction.
5 Multiple Imputation (MI)
MI (Rubin, 1977, 1978, 2004) is a valid method for handling missing data under the
MAR and multivariate normality assumptions; it retains all the major advantages of
single imputation and rectifies its major disadvantages (Rubin, 1988). By imputing
missing data m times, it introduces statistical uncertainty into the model and uses this
uncertainty to emulate the sampling variability of a complete data set. MI is also a
general-purpose method that is highly efficient even for small sample sizes (Graham and
Schafer, 1999). But each application of MI produces slightly different imputed results,
so the results cannot be replicated exactly, and the situation worsens as the amount of
missing data increases. MI is also time intensive, imputing 5–10 data sets, and different
types of imputation models need different result combination methods, which limits
model selection.
5.1 General procedure
MI was first proposed by Rubin (1977, 1978, 2004) as a method for handling missing
data in surveys and was later elaborated in his book (Rubin, 1987). It seems to be one of the
most promising methods for general-purpose handling of missing data in multivariate
analysis, because MI has several desirable features:
The fact that MI introduces appropriate random error into the imputation process
makes it possible to get approximately unbiased estimates for all parameters.
No deterministic imputation method can do this in general settings (Allison, 2000).
MI can be used with any kind of data and any kind of analysis without specialised
software. Repeated imputation allows one to get good estimates of the standard
errors (Allison, 2000).
MI can be highly efficient even for small values of m. In many applications, just 3–5
imputations are sufficient to obtain excellent results (Schafer and Olsen, 1998).
Figure 2 illustrates the MI procedure, which consists of three steps:
Figure 2 Multiple imputation procedure
Imputation. The most challenging step, which imputes the missing values m > 1
times. Imputed values are drawn from the distribution of the incomplete data set, and
the drawn values will differ for each missing item. The result of this step is
m complete data sets. A variety of imputation models can be used.
The choice of imputation models depends on the assumptions regarding the
missingness mechanisms and patterns, as well as the data distribution. See next
subsection for details.
Analysis. The repeated analysis step on the imputed data. This step analyses each of
the m complete data sets by using a standard complete data method, which should be
compatible with the imputation model used by the imputation step (Schafer, 1997).
The results of this step are m analysis results.
Combination. This step integrates the m analysis results into a final result for
inference. It consists of computing the mean over the m repeated analyses, with the
corresponding variance and confidence interval or p-value. Simple rules exist for
combining the m analysis results.
Among these three steps, Combination is the simplest; it just calculates a single point
estimate from the m point estimates of the Analysis step according to
Rubin’s rule (1987), which we introduce in Subsection 5.3. Unlike the relatively
independent Combination step, the Imputation and Analysis steps depend on each other to
some extent. In other words, the analysis to be performed on the imputed data sets should
be compatible with the imputation model (Schafer, 1997) that is used to generate the
imputed values, and the imputation model must be correct in some sense. Rubin (1987, 1996)
has discussed this issue in detail.
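The three-step procedure can be sketched as a pipeline with pluggable imputation and analysis models (the stochastic mean-imputation model below is a placeholder of our own, not one of the proper methods discussed next; the Combination step here pools only the point estimates, whereas Rubin's rule also pools variances):

```python
import random
import statistics

def multiple_imputation(z, impute, analyse, m=5, seed=0):
    """Step 1: impute the incomplete data m times; step 2: analyse each
    completed data set; step 3: combine the m results (here, their mean)."""
    rng = random.Random(seed)
    estimates = [analyse(impute(z, rng)) for _ in range(m)]
    return statistics.mean(estimates)

def stochastic_mean_impute(z, rng):
    """Placeholder imputation model: draw each missing value (None) from
    a normal with the observed mean and standard deviation."""
    obs = [v for v in z if v is not None]
    mu, sd = statistics.mean(obs), statistics.stdev(obs)
    return [rng.gauss(mu, sd) if v is None else v for v in z]

est = multiple_imputation([4.0, 6.0, 5.0, None, None],
                          stochastic_mean_impute, statistics.mean)
```

Swapping in a proper imputation model and an analysis model compatible with it is what the rest of this section is about.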
5.2 Imputation model
The imputation model is used to generate the MIs. The MI method assumes that the
imputation model is the same as the analysis model. In practice, however, these two models
may not be identical; but, at the least, the imputation model should be compatible with the
intended analysis.
The imputation model involves two aspects: variables and imputation method.
For variables, in general, the imputation model should contain all the variables that are
used by the analysis model. However, there is no reason to exclude other variables
that are not used for analysis from the imputation model. It is also valuable to include
additional variables that are highly correlated with the variables with missing data, as
well as variables that are highly predictive of the missingness, even if they are not used by the
analysis model.
For the imputation method: to properly reflect sampling variability when creating
repeated imputations under a model, and as a result lead to valid inferences,
MI requires the imputation method to be proper. Imputation methods that incorporate
appropriate variability among the repetitions within a model are called proper, which is
defined precisely in Rubin (1987). In some complex situations, however, it is hard to find
a proper imputation (Fay, 1991, 1993; Rao, 1996). Fay (1992, 1993) explained that the
MI variance is a biased estimator for domains that are not part of the imputer’s model.
For more discussion on the relationship between the imputation model and analysis
model, see Meng’s (1994), Rubin’s (1996) and Schafer’s (1997) papers for details.
Generally, the choice of imputation methods depends on both missingness
mechanisms and missingness patterns. For example, for a monotone missing data pattern,
a parametric regression method is appropriate. For an arbitrary missing data pattern,
a Markov chain Monte Carlo (MCMC) method can be used. Meanwhile, imputation
methods must be proper, that is, they must be compatible with the analysis methods
(Schafer, 1997). Unfortunately, not all imputation methods are proper. But some,
such as the parametric regression imputation method, the Bayesian
Bootstrap (BB), the Approximate Bayesian Bootstrap (ABB), the propensity score method
(Rosenbaum and Rubin, 1983) and MCMC methods, are known to be proper; they are
introduced as follows.
The BB was first introduced by Rubin (1981) as a variation of the bootstrap
(Efron, 1979).
Suppose there is a univariate data set X = {x1, x2, …, xn}, where the first a < n values
are reported and the remaining n − a values are missing. BB imputation first draws
a − 1 uniform random numbers between 0 and 1; let their ordered values be
{r1, r2, …, ra−1}, and let r0 = 0 and ra = 1. It then draws each of the n − a missing values
from x1, x2, …, xa with probabilities (r1 − r0), (r2 − r1), …, (ra − ra−1) as the corresponding
imputed value.
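The draw just described might be coded as follows (a sketch; the inverse-CDF sampling helper is our own):

```python
import random

def bayesian_bootstrap_impute(x, seed=0):
    """Bayesian bootstrap imputation: the n-a missing values (None) are
    drawn from the a observed values with probabilities
    (r1-r0, ..., ra-r(a-1)) built from a-1 sorted uniform draws."""
    rng = random.Random(seed)
    obs = [v for v in x if v is not None]
    a = len(obs)
    r = [0.0] + sorted(rng.random() for _ in range(a - 1)) + [1.0]
    probs = [r[k + 1] - r[k] for k in range(a)]

    def draw():                      # inverse-CDF sampling over the gaps
        u, cum = rng.random(), 0.0
        for v, p in zip(obs, probs):
            cum += p
            if u <= cum:
                return v
        return obs[-1]

    return [draw() if v is None else v for v in x]

filled = bayesian_bootstrap_impute([3, 1, 4, None, None])
```

The random gap widths play the role of a Dirichlet-distributed resampling weight on the observed values, which is what injects the extra between-imputation variability.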
The Approximate Bayesian Bootstrap (ABB) was proposed by Rubin and Schenker
(1986) as a way of generating MIs; it is an approximation of the BB (Rubin, 1981).
This method supposes that the sample can be regarded as independently and identically
distributed, and that the missingness mechanism is ignorable.
Suppose there is a univariate data set X = {x1, x2, …, xn}, where the first a < n values
are reported and the remaining n − a values are missing. We can view ABB as a two-step
bootstrap, described as follows:
with equal probabilities of selection, generate the donor set Xa by sampling a values
with replacement from the reported values {x1, x2, …, xa}.
again with equal probabilities of selection, create a set of imputed data X̂n−a
by sampling n − a values with replacement from the donor set Xa.
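The two bootstrap steps can be sketched directly (the names are ours):

```python
import random

def abb_impute(x, seed=0):
    """Approximate Bayesian bootstrap: resample the observed values with
    replacement to form a donor set, then draw each missing value (None)
    from that donor set, again with replacement."""
    rng = random.Random(seed)
    obs = [v for v in x if v is not None]
    donors = [rng.choice(obs) for _ in range(len(obs))]          # step 1
    return [rng.choice(donors) if v is None else v for v in x]  # step 2

filled = abb_impute([7, 2, 9, None])
```

Drawing the donor set with replacement first is what introduces the additional variation, noted below, that makes the draws approximately proper for MI.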
In the case when the missingness of X is related to covariates Y = {y1, y2, …, ym}, if the
covariates are discrete and a is large enough, the data set X is first partitioned into cells
corresponding to the unique patterns of y1, y2, …, ym, and then ABB is carried out within
each cell. If the covariates are continuous or m is large, the data set X is instead partitioned
into cells defined by coarse grouping of the estimated response propensities, modelled
by logistic regression on the covariates Y. See Lavori et al. (1995) for details.
The ABB method introduces additional variation by drawing imputations from
a resample of the observed data set instead of drawing directly from the observed data set
itself, this makes it proper for MI. However, Allison (2000) and Schafer and Olsen
(1998) have noted that using the ABB method for the calculation of MIs may result in
misleading findings, particularly for regression analysis. Kim (2002) investigated the
finite-sample properties of the ABB variance estimator; the results show that the bias is
not negligible for a moderate sample size. He also proposed a modification of the method for
reducing the bias of the variance estimator.
Both ABB and BB’s distributions have the same means and correlations, but the
variances for the ABB method are (1 + 1/a) times the variances for the BB method.
Propensity score method (Rosenbaum and Rubin, 1983)
The propensity score is the estimated probability that a data item is missing. In the
propensity score method, a missing value is filled in by sampling from the cases that have
a similar propensity score. Specifically, for each variable with missing values, the first
step of the propensity score method is to estimate a propensity score for each case to
indicate the probability of its being missing, the most common method to do this is
logistic regression. The cases are then classified into several clusters based on these
propensity scores. After that, an ABB imputation (Rubin and Schenker, 1986) is applied
to each cluster (Lavori et al., 1995).
This method is effective for inferences about the distributions of individual imputed
variables, but it is not appropriate for analyses involving relationships among variables
(Yuan, 2000; Allison, 2000).
The MCMC (Gilks et al., 1996) originated in physics as a tool for exploring equilibrium
distributions of interacting molecules. However, in statistics, it was used to create a large
number of random draws of parameters from Bayesian posterior distributions under
complicated parametric models for parameter simulation. Recently, it has been used to
create a small number of independent draws of the missing data from a predictive
distribution via Markov chain, these draws are then used for MI inference.
A Markov chain is a sequence of random variables in which the distribution of each
element depends only on the previous one. For a Markov chain {Xt : t = 1, 2, …}, all the
variables must satisfy p(Xt|X0, X1, …, Xt−1) = p(Xt|Xt−1). The Markov chain is fully specified
by the starting distribution p(X0) and the transition rule p(Xt|Xt−1).
Data Augmentation (DA) (Tanner and Wong, 1987; Tanner, 1991, 1993), which was
adapted by Schafer (1997) for the purpose of generating Bayesian proper imputations,
generates a Markov chain by starting with a random imputation of missing values. It is an
MCMC method, which consists of the following two steps:
The first step is the imputation step (I-step), which draws imputations for the missing values
from the predictive distribution of the data using the current parameter estimates. That is,
this step draws Zmis(t+1) from p(Zmis|Zobs, Θ(t)), where t denotes the tth iteration.
The first iteration of the I-step requires a prior as the initial parameter estimates,
which can be generated by using the EM (Dempster et al., 1977) algorithm. After the first
iteration, the parameter estimates are drawn from the P-step, the second step.
The second step is the posterior step (P-step), which is actually a parameter estimation step.
This step draws new parameter estimates from the Bayesian posterior distribution based on
both the observed and imputed data. That is, this step draws Θ(t+1) from
p(Θ|Zobs, Zmis(t+1)), where t denotes the tth iteration. In general, non-informative priors
about the parameters are used for large samples and informative priors may be used for
smaller samples or sparse data.
These two steps are iterated until the resulting Markov chain
(Zmis(1), Θ(1)), (Zmis(2), Θ(2)), … eventually stabilises, or converges in distribution, to
p(Zmis, Θ|Zobs). The distribution of the parameters stabilises to a posterior distribution that
averages over the missing data, whereas the distribution of the missing data stabilises to a
predictive distribution (Schafer and Olsen, 1998).
Just as with EM, the rate of convergence of DA depends on the fraction of missing
information: the larger the fraction of missing information, the slower the convergence.
Unlike EM, where the parameter estimates no longer change from one iteration to
the next once it converges, when DA converges the distribution of the parameters no
longer changes from one iteration to the next, although the random parameter values
themselves do continue to change. For this reason, assessing the convergence of DA is
much more complicated than for EM (Schafer and Olsen, 1998).
There are two methods of using DA to create m MIs for missing values. The first
is to use one Markov chain to draw the MIs: supposing DA converges by the tth iteration,
a single run of length mt is needed, storing the completed data sets
from iterations t, 2t, …, mt. The second is to run m independent Markov chains; clearly
it is preferable.
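A minimal DA sketch for one normal variable with known variance and a flat prior on the mean (a strong simplification of our own; real implementations draw both the mean and the variance in the P-step):

```python
import random
import statistics

def data_augmentation(z, sigma=1.0, n_iter=500, m=5, gap=100, seed=0):
    """DA sketch for one normal variable with known variance sigma**2,
    unknown mean and a flat prior.  I-step: draw the missing values
    (None) from N(mu, sigma^2); P-step: draw mu from its posterior
    N(mean(completed), sigma^2/n).  Every `gap` iterations a completed
    data set is stored, yielding m multiple imputations from one chain."""
    rng = random.Random(seed)
    obs = [v for v in z if v is not None]
    n = len(z)
    mu = statistics.mean(obs)               # crude starting value
    completed_sets = []
    for t in range(1, n_iter + 1):
        # I-step: impute missing values given the current parameter
        completed = [rng.gauss(mu, sigma) if v is None else v for v in z]
        # P-step: draw a new parameter from its posterior
        mu = rng.gauss(statistics.mean(completed), sigma / n ** 0.5)
        if t % gap == 0 and len(completed_sets) < m:
            completed_sets.append(completed)
    return completed_sets

sets = data_augmentation([4.8, 5.2, 5.0, None, None])
```

Spacing the stored data sets `gap` iterations apart approximates independent draws from the predictive distribution, as in the single-chain scheme described above.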
MCMC can be used with both arbitrary and monotone patterns of missing data;
it assumes multivariate normally distributed data under the MAR missingness mechanism.
5.3 The combination of analysis results
Rubin (1987) proposed a method, referred to as Rubin’s rule, to combine the m results
of the Analysis step to obtain a single set of results. Rubin’s rule is described as follows.
Suppose m ≥ 2 imputations and analyses have been performed, and one has
calculated and saved all the m estimates and standard errors. For the ith (i = 1, 2, …, m)
analysis, let Θ̂i be the estimate of a particular parameter Θ of interest and let υ̂i be its
estimated variance. The MI estimate Θ̂ of Θ is

Θ̂ = (1/m) ∑_{i=1}^{m} Θ̂i.

The corresponding estimated variance υ̂ has two components that take into
account the variability within each data set and across data sets. It is the sum of the
within-imputation variance sw and the between-imputation variance sb, with an additional
correction factor to account for the simulation error in Θ̂:

υ̂ = sw + (1 + 1/m) sb,

where the within-imputation variance

sw = (1/m) ∑_{i=1}^{m} υ̂i

is simply the average of the estimated variances, and the between-imputation variance

sb = (1/(m − 1)) ∑_{i=1}^{m} (Θ̂i − Θ̂)²

is the sample variance of the estimates themselves.
Note that if there were no missing data, the Θ̂i (i = 1, 2, …, m) would be identical,
sb would be 0 and υ̂ would simply be equal to sw. The ratio sb/sw indicates how much information
is missing.
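These pooling formulas translate directly into code (a sketch with invented example numbers):

```python
def rubin_pool(estimates, variances):
    """Combine m complete-data results by Rubin's rule: pooled estimate,
    within- and between-imputation variances, and total variance."""
    m = len(estimates)
    theta = sum(estimates) / m
    s_w = sum(variances) / m                                   # within
    s_b = sum((e - theta) ** 2 for e in estimates) / (m - 1)   # between
    total = s_w + (1 + 1 / m) * s_b                            # corrected
    return theta, s_w, s_b, total

theta, s_w, s_b, total = rubin_pool([5.0, 5.2, 4.8], [0.9, 1.1, 1.0])
```

Here the pooled estimate is 5.0 and the total variance exceeds the average within-imputation variance by the corrected between-imputation component.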
Final estimates are then presented with confidence limits. Confidence intervals are
calculated as

Θ̂ ± t(df) √υ̂,

where t(df) is a t distribution with degrees of freedom

df = (m − 1) [1 + sw / ((1 + 1/m) sb)]².

The efficiency of a parameter estimate based on m imputations is

(1 + γ/m)⁻¹,

where γ is the rate of missing information (Rubin, 1987), defined as

γ = (r + 2/(df + 3)) / (r + 1),

where

r = (1 + 1/m) sb / sw

is the relative increase in variance due to missingness.
Table 5 relates the missing data rate γ to the number of imputations m in terms of the
efficiency of recovery of the true parameter. It is clear from Table 5 that for low missing
data rates (30% or less), no more than m = 3 imputations are needed to be at least 91%
efficient. For higher missing data rates, m = 5 or m = 10 imputations are needed.
This table can be used to determine how many imputations are needed for a given
missing data rate.
Table 5 Efficiency (%) of MI for missing data rate γ and m imputations

m      10%   20%   30%   40%   50%   60%   70%   80%   90%
3      97    94    91    88    86    83    81    79    77
5      98    96    94    93    91    89    88    86    85
10     99    98    97    96    95    94    93    93    92
20     100   99    99    98    98    97    97    96    96
5.4 General comments on MI from KDD perspective
Although nothing in the theory of MI requires data to be missing under MAR, almost all
MI analyses have assumed that the missing data are MAR. At the same time, a few NI
applications have been published (Glynn et al., 1993; Verbeke and Molenberghs, 2000)
and new methods for generating MIs under NI will certainly arrive in the future
(Schafer and Graham, 2002). This gives the MI techniques wider applicability
than the single imputation techniques. The fact that MI relies on large-sample
approximations also means that it is able to serve KDD and machine learning, where
data sets are typically large.
However, MI generates more than one data set, which raises another problem when we
use it for KDD or machine learning: how do we integrate the multiple knowledge bases
induced from the m different data sets? For this question we have two solutions:
if the knowledge is presented in the form of a scalar parameter, such as the effort
of a software project, just apply Rubin’s rule to it; otherwise
apply Rubin’s rule to the m imputed complete data sets, so that we obtain a single
data set, and then induce knowledge from it with KDD or machine learning methods.
6 Concluding remarks
Data quality is a major concern in the intelligent data analysis field. MDITs have been
used to improve data quality in the statistics field for many years. Of course, these
techniques can also be applied to the preprocessing tasks of intelligent data analysis.
When MDITs are used to clean data sets for KDD or machine learning purposes,
the following procedure is recommended.
Choosing variables. The selection of variables that need to be imputed depends on
both the specific analysis task and the imputation technique. Not all missing data
need to be imputed. As Rubin (1986) and Griliches (1986) have noted, if the
missing item is unrelated to the dependent variable, one may proceed with the
analysis by ignoring the missing data, in which case we may be satisfied with point
estimates, which may or may not be efficient. If the missing data are contained in
variables that are tightly related to the analysis task, these variables must undoubtedly
be included in the imputation list. Otherwise, if variables can help to impute the
missing data contained in other variables, or can help the analysis task itself, they
should also be included.
Assessing missing data percentage. An imputation method may work properly at a
low missing data rate but not at a high one, and some methods only work well
at specific missing data percentages. Although the performance of a given
imputation method varies with the missing data rate, it is generally accepted
that data sets with more than 40% missing data are not useful for detailed analysis
(Strike et al., 2001). Thus, imputing data sets with more than 40% missing data
using these methods is not recommended.
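The missing-rate assessment can be sketched as a simple per-variable count; the 40% cut-off follows Strike et al. (2001), and the helper name and project records below are hypothetical:

```python
def missing_rates(records, variables):
    """Percentage of missing values (None) per variable (hypothetical helper)."""
    n = len(records)
    return {v: 100.0 * sum(1 for r in records if r.get(v) is None) / n
            for v in variables}

# Hypothetical project records; None marks a missing value
data = [
    {"effort": 120, "size": 4.2},
    {"effort": None, "size": 3.1},
    {"effort": 95, "size": None},
    {"effort": None, "size": 2.7},
]
rates = missing_rates(data, ["effort", "size"])
too_sparse = [v for v, r in rates.items() if r > 40.0]  # beyond the 40% guideline
```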
Testing missing data pattern and missingness mechanism. Both the missing data pattern
and the missingness mechanism have a great impact on the selection of imputation
methods. Some imputation methods only work well under MCAR, while others make no
assumption about the missingness mechanism at all; the latter are the most suitable
for intelligent data analysis. At the same time, different imputation methods have
different missing data pattern requirements: for example, the propensity score method
and the parametric regression method are appropriate for a monotone missing data
pattern.
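Whether a data set has a monotone pattern can be checked mechanically: order the variables from fewest to most missing values and verify that, within every record, no observed value follows a missing one. A sketch with a hypothetical helper and drop-out style records:

```python
def is_monotone(records, variables):
    """Check for a monotone missing data pattern: order variables from
    fewest to most missing values, then verify that in every record no
    observed value follows a missing one (hypothetical helper)."""
    order = sorted(variables,
                   key=lambda v: sum(r.get(v) is None for r in records))
    for r in records:
        seen_missing = False
        for v in order:
            if r.get(v) is None:
                seen_missing = True
            elif seen_missing:     # observed after missing: not monotone
                return False
    return True

# Hypothetical drop-out style records: values vanish and never reappear
rows = [{"t1": 1, "t2": 2, "t3": 3},
        {"t1": 1, "t2": 2, "t3": None},
        {"t1": 1, "t2": None, "t3": None}]
monotone = is_monotone(rows, ["t1", "t2", "t3"])
```

Sorting by missing count suffices here because, in a genuinely monotone pattern, a variable that is "more missing" than another must be missing in a superset of the records; if no such ordering exists, the inner loop finds a counterexample.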
286 Q. Song and M. Shepperd
Deciding on imputation method. Selecting the method that best treats the
missing data depends not only on the type of the variables, the missing data rate, the
missing data pattern, and the missingness mechanism, but also on the context of the
specific analysis. Although there are many techniques for dealing with missing data,
none is absolutely better than the others; different situations require different
solutions. Missing data imputation must be carried out within the context of the
specific analysis: different analysts are concerned with different contexts, and no
single imputation approach can satisfy all interests. For example, some methods work
well for speech recognition, while others are appropriate for image processing.
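As an illustration only, the considerations above might be encoded as a rough shortlisting helper. The mapping below is a simplified reading of the guidelines in this paper (e.g. propensity score and parametric regression for monotone patterns), not a definitive recommendation, and the helper name is hypothetical:

```python
def candidate_methods(mechanism, pattern):
    """Illustrative shortlist of imputation methods keyed on the missingness
    mechanism and missing data pattern; a simplified reading of the text's
    guidelines, not a definitive recommendation (hypothetical helper)."""
    if pattern == "monotone":
        # the text notes these suit a monotone missing data pattern
        return ["propensity score", "parametric regression"]
    if mechanism == "MCAR":
        # simple methods that assume data are missing completely at random
        return ["mean imputation", "hot-deck"]
    # arbitrary patterns under weaker assumptions: model-based approaches
    return ["EM algorithm", "MCMC multiple imputation"]

shortlist = candidate_methods("MAR", "monotone")
```

In practice such a table would be refined with the variable types, the missing data rate, and the demands of the specific analysis task.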
Acknowledgements
This work is supported by the National Natural Science Foundation of China under grant
60673124 and the Hi-Tech Research Development Program of China under grant
2006AA01Z1. The authors thank the anonymous reviewers and the Editor for their
insightful and helpful comments.
References
Aggarwal, C.C. and Parthasarathy, S. (2001) ‘Mining massively incomplete data sets by conceptual
reconstruction’, Proceedings of the Seventh ACM SIGKDD Conference on Knowledge
Discovery and Data Mining, San Francisco, California, USA, pp.227–232.
Allison, P. (2000) ‘Multiple imputation for missing data: a cautionary tale’, Sociological Methods
and Research, Vol. 28, No. 3, pp.301–309.
Amemiya, T. (1984) ‘Tobit models: a survey’, Journal of Econometrics, Vol. 24, pp.3–61.
Arbuckle, J.L. (1996) ‘Full information likelihood estimation in the presence of incomplete data’,
in Marcoulides, G.A. and Schumaker, R.E. (Eds.): Advanced Structural Equation Modeling,
Lawrence Erlbaum, Mahwah, NJ, pp.243–277.
Batista, G.E.A.P.A. and Monard, M.C. (2003) ‘An analysis of four missing data treatment methods
for supervised learning’, Applied Artificial Intelligence, Vol. 17, Nos. 5–6, pp.519–533.
Becker, W.E. and Walstad, W.B. (1990) ‘Data loss from pretest to posttest as a sample selection
problem’, The Review of Economics and Statistics, Vol. 72, No. 1, pp.184–188.
Boomsma, A. (1982) On Robustness of LISREL (Maximum Likelihood Estimation) Against Small
Sample Sizes and Non-Normality, Sociometric Research Foundation, Amsterdam.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and Regression
Trees, Wadsworth International Group, Belmont, CA.
Celeux, G. and Diebolt, J. (1985) ‘The SEM algorithm: a probabilistic teacher algorithm derived
from the EM algorithm for a mixture problem’, Comput. Stat. Q., Vol. 2, pp.73–82.
Chan, S. and Ledilter, J. (1995) ‘Monte Carlo estimation for time series models involving counts’,
Journal of the American Statistical Association, Vol. 90, pp.242–252.
Chen, S.Y. and Liu, X. (2005) ‘Data mining from 1994 to 2004: an application-orientated review’,
International Journal of Business Intelligence and Data Mining, Vol. 1, No. 1, pp.4–21.
Chiu, H.Y. and Sedransk, J. (1986) 'A Bayesian procedure for imputing missing values in sample
surveys', Journal of the American Statistical Association, Vol. 81, No. 395,
Cohen, M.P. (1996) ‘A new approach to imputation’, American Statistical Association Proceedings
of the Section on Survey Research Methods, pp.293–298.
Conniffe, D. (1983) ‘Comments on the weighted regression approach to missing values’, Economic
and Social Review, Vol. 14, pp.259–272.
Dear, R.E. (1959) A Principal-Component Missing Data Method for Multiple Regression Models,
Technical Report SP-86, System Development Corporation, Santa Monica, CA.
DeGruttola, V. and Tu, X.M. (1994) 'Modeling the progression of CD4-lymphocyte count and its
relationship to survival time', Biometrics, Vol. 50, pp.1003–1014.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) ‘Maximum likelihood from incomplete data
via EM algorithm’, Journal of the Royal Statistical Society Series B, Vol. 39, No. 1, pp.1–38.
Diggle, P. and Kenward, M.G. (1994) ‘Informative drop-out in longitudinal data analysis’, Applied
Statistics, Vol. 43, pp.49–93.
Donner, A. (1982) ‘The relative effectiveness of procedures commonly used in multiple regression
analysis for dealing with missing data’, The American Statistician, Vol. 36, pp.378–381.
Dunson, D.B. and Perreault, S.D. (2001) ‘Factor analytic models of clustered multivariate data with
informative censoring’, Biometrics, Vol. 57, pp.302–308.
Efron, B. (1979) ‘Bootstrap methods: another look at the jackknife’, Ann. Statist., Vol. 7, pp.1–26.
Fay, R.E. (1991) ‘A design-based perspective on missing data variance’, Proceedings of Seventh
Annual Research Conference, Bureau of the Census, Washington DC, pp.429–440.
Fay, R.E. (1992) ‘When are inferences from multiple imputation valid?’, Proceedings of the
Section on Survey Research Methods, American Statistical Association, pp.227–232.
Fay, R.E. (1993) ‘Valid inference from imputed survey data’, Proceedings of the Section on Survey
Research Methods, American Statistical Association, pp.41–48.
Fix, E. and Hodges, J.L. (1952) Discriminatory Analysis: Nonparametric Discrimination: Small
Sample Performance, Technical Report Project 21-49-004, Report Number 11, USAF School
of Aviation Medicine, Randolph Field, Texas.
Ford, B. (1983) ‘An overview of hot-deck procedures’, in Madow, W.G., Olkin, I. and Rubin, D.B.
(Eds.): Incomplete Data in Sample Survey, Academic Press, Vol. II, pp.185–207.
Ghahramani, Z. and Jordan, M. (1995) Learning from Incomplete Data, Technical Report AI Lab
Memo No. 1509, CBCL Paper No. 108, MIT AI Lab, August.
Gilks, W.R., Richardson, S. and Spiegelhalter, D.J. (Eds.) (1996) Markov Chain Monte Carlo in
Practice, Chapman & Hall, London.
Glynn, R., Laird, N.M. and Rubin, D.B. (1986) ‘Selection modeling versus mixture modeling with
nonignorable nonresponse’, in Wainer, H. (Ed.): Drawing Inferences from Self-Selected
Samples, Springer-Verlag, New York, pp.119–146.
Glynn, R.J., Laird, N.M. and Rubin, D.B. (1993) ‘Multiple imputation in mixture models for
nonignorable nonresponse with follow-ups’, Journal of the American Statistical Association,
Vol. 88, pp.984–993.
Gourieroux, C. and Montfort, A. (1981) ‘On the problem of missing data in linear models’, Review
of Economic Studies, Vol. XLVIII, pp.579–586.
Graham, J.W. and Schafer, J.L. (1999) ‘On the performance of multiple imputation for multivariate
data with small sample size’, Statistical Strategies for Small-Sample Research, pp.1–29.
Griliches, Z. (1986) 'Economic data issues', in Griliches, Z. and Intriligator, M. (Eds.):
Handbook of Econometrics, Amsterdam, Vol. 3, pp.1465–1514.
Grzymala-Busse, J.W. and Hu, M. (2000) ‘A comparison of several approaches to missing attribute
values in data mining’, RSCTC’2000, pp.340–347.
Haitovsky, Y. (1968) ‘Missing data in regression analysis’, Journal of the Royal Statistical Society,
Vol. B30, pp.67–81.
Han, J. and Kamber, M. (2000) Data Mining: Concepts and Techniques, Morgan Kaufmann
Publishers, San Francisco, USA.
Hartley, H. and Hocking, R. (1971) ‘The analysis of incomplete data’, Biometrics, Vol. 27,
Heckman, J. (1976) ‘The common structure of statistical models of truncation, sample selection and
limited dependent variables, and a simple estimator for such models’, Annals of Economic and
Social Measurement, Vol. 5, pp.475–492.
Hedeker, D. and Gibbons, R.D. (1997) ‘Application of random-effects pattern-mixture models for
missing data in longitudinal studies’, Psychological Methods, Vol. 2, No. 1, pp.64–78.
Hogan, J.W. and Laird, N.M. (1997) ‘Mixture models for the joint distribution of repeated
measures and event times’, Statistics in Medicine, Vol. 16, pp.239–257.
Huseby, J.R., Schwertman, N.C. and Allen, D.M. (1980) ‘Computation of the mean vector and
dispersion matrix for incomplete multivariate data’, Communication in Statistics, Vol. B3,
Irwin, M., Cox, N. and Kong, A. (1994) ‘Sequential imputation for multilocus linkage analysis’,
Proceedings of the National Academy of Sciences of the USA, pp.11684–11688.
Jordan, M. and Xu, L. (1996) 'Convergence results for the EM approach to mixtures of experts
architectures', Neural Networks, Vol. 8, pp.1409–1431.
Joreskog, K.G. and Sorbom, D. (1993) LISREL 8 User’s Reference Guide, Scientific Software Int’l
Inc., Chicago.
Kalousis, A. and Hilario, M. (2000) ‘Supervised knowledge discovery from incomplete data’,
Proceedings of the 2nd International Conference on Data Mining, WIT Press, Cambridge,
Kalton, G. (1981) Compensating for Missing Data, ISR Research Report Series, Survey Research
Center, University of Michigan, Ann Arbor.
Kalton, G. and Kasprszyk, D. (1986) ‘The treatment of missing survey data’, Survey Methodology,
Vol. 12, No. 1, pp.1–16.
Kim, J.K. (2002) ‘A note on approximate Bayesian bootstrap imputation’, Biometrika, Vol. 89,
No. 2, pp.470–477.
Kim, J-O. and Curry, J. (1977) ‘The treatment of missing data in multivariate analysis’,
Sociological Methods and Research, Vol. 6, No. 2, pp.215–240.
Kong, A., Cox, N., Frigge, M. and Irwin, M. (1993) ‘Sequential imputation and multipoint linkage
analysis’, Genetic Epidemiology, Vol. 10, pp.483–488.
Kong, A., Liu, J.S. and Wong, W.H. (1994) ‘Sequential imputations and Bayesian missing data
problems’, Journal of the American Statistical Association, Vol. 89, pp.278–288.
Krzanowski, W.J. (1988) ‘Missing value imputation in multivariate data using the singular value
decomposition of a matrix’, Biometrical Letters, Vol. 25, pp.31–39.
Lakshminarayan, K., Harp, S.A. and Samad, T. (1999) ‘Imputation of missing data in industrial
databases’, Applied Intelligence, Vol. 11, pp.259–275.
Lavori, P.W., Dawson, R. and Shera, D. (1995) ‘A multiple imputation strategy for clinical trials
with truncation of patient data’, Statistics in Medicine, Vol. 14, pp.1913–1925.
Little, R.J.A. (1988) ‘A test of missing completely at random for multivariate data with missing
values’, Journal of the American Statistical Association, Vol. 83, No. 404, pp.1198–1202.
Little, R.J.A. (1988) ‘Missing-data adjustments in large surveys’, Journal of Business and
Economic Statistics, Vol. 6, No. 3, pp.287–296.
Little, R.J.A. (1992) ‘Regression with missing X’s: a review’, Journal of the American Statistical
Association, Vol. 87, pp.1227–1238.
Little, R.J.A. (1993) ‘Pattern-mixture models for multivariate incomplete data’, Journal of the
American Statistical Association, Vol. 88, pp.125–134.
Little, R.J.A. (1994) ‘A class of pattern mixture models for normal incomplete data’, Biometrika,
Vol. 81, pp.471–483.
Little, R.J.A. (1995) 'Modeling the drop-out mechanism in repeated-measures studies', Journal of
the American Statistical Association, Vol. 90, pp.1112–1121.
Little, R.J.A. and Rubin, D.B. (1987) Statistical Analysis with Missing Data, John-Wiley,
New York.
Little, R.J.A. and Rubin, D.B. (1989) ‘Analysis of social science data with missing values’,
Sociological Methods and Research, Vol. 18, pp.292–326.
Little, R.J.A. and Rubin, D.B. (2002) Statistical Analysis with Missing Data, John Wiley & Sons,
New York.
Little, R.J.A. and Schenker, N. (1995) ‘Missing data’, in Arminger, G., Clogg, C. and Sobel, M.
(Eds.): Handbook of Statistical Modeling for the Social and Behavioral Sciences, Plenum,
New York.
Little, R.J.A. and Wang, Y. (1996) ‘Pattern-mixing models for multivariate incomplete data with
covariates’, Biometrics, Vol. 52, pp.98–111.
Liu, J.S. (1996) ‘Nonparametric hierarchical Bayes via sequential imputations’, The Annals of
Statistics, Vol. 24, No. 3, pp.911–930.
MacEachern, S.N., Clyde, M. and Liu, J.S. (1999) 'Sequential importance sampling for
nonparametric Bayes models: the next generation', The Canadian Journal of Statistics/La
Revue Canadienne de Statistique, Vol. 27, No. 2, pp.251–267.
McLachlan, G.J. and Krishnan, T. (1997) The EM Algorithm and Extensions, Wiley, New York.
Meng, X.L. (1994) ‘Multiple imputation with uncongenial sources of input (with discussion)’,
Statistical Science, Vol. 9, pp.538–574.
Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994) Machine Learning, Neural and Statistical
Classification, Ellis Horwood, New Jersey, USA.
Myrtveit, I., Stensrud, E. and Olsson, U.H. (2001) ‘Analyzing data sets with missing data:
An empirical evaluation of imputation methods and likelihood-based methods’, IEEE
Transactions on Software Engineering, Vol. 27, No. 11, pp.999–1013.
Paik, M.C. (1997) ‘The generalized estimating equation approach when data are not missing
completely at random’, Journal of the American Statistical Association, Vol. 92, No. 440,
Quinlan, J.R. (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann, CA.
Rao, J.N.K. (1996) ‘On variance estimation with imputation survey data’, Journal of the American
Statistical Association, Vol. 91, pp.499–506.
Rao, J.N.K. and Shao, J. (1992) ‘Jackknife variance estimation with survey data under hot deck
imputation’, Biometrika, Vol. 79, pp.811–822.
Reilly, M. (1993) ‘Data analysis with hot deck multiple imputation’, The Statistician, Vol. 42,
Robins, J.M. (1997) ‘Non-response models for the analysis of non-monotone non-ignorable
missing data’, Statistics in Medicine, Vol. 16, pp.21–38.
Robins, J.M., Rotnitzky, A. and Zhao, L.P. (1995) ‘Analysis of semiparametric regression models
for repeated outcomes in the presence of missing data’, Journal of the American Statistical
Association, Vol. 90, pp.106–121.
Rosenbaum, P.R. and Rubin, D.B. (1983) ‘The central role of the propensity score in observational
studies for causal effects’, Biometrika, Vol. 70, pp.41–55.
Rotnitzky, A. and Robins, J.M. (1997) ‘Analysis of semiparametric regression models with
non-ignorable non-response’, Statistics in Medicine, Vol. 16, pp.81–102.
Rotnitzky, A., Robins, J.M. and Scharfstein, D.O. (1998) ‘Semiparametric regression for repeated
outcomes with non-ignorable non-response’, Journal of the American Statistical Association,
Vol. 93, pp.1321–1339.
Roy, J. (2003) ‘Modeling longitudinal data with nonignorable dropouts using a latent dropout class
model’, Biometrics, Vol. 59, No. 4, pp.829–836.
Rubin, D.B. (1976) ‘Inference and missing data’, Biometrika, Vol. 63, pp.581–592.
Rubin, D.B. (1977) 'Formalizing subjective notions about the effect of nonrespondents in sample
surveys', Journal of the American Statistical Association, Vol. 72, pp.538–543.
Rubin, D.B. (1978) ‘Multiple imputations in sample surveys – a phenomenological Bayesian
approach to nonresponse’, Proceedings of the Survey Research Methods Section of the
American Statistical Association, pp.20–34.
Rubin, D.B. (1981) ‘The Bayesian bootstrap’, The Annals of Statistics, Vol. 9, No. 1, pp.130–134.
Rubin, D.B. (1986) ‘Basic ideas of multiple imputation on non-response’, Survey Methodology,
Vol. 12, No. 1, pp.37–47.
Rubin, D.B. (1987) Multiple Imputation for Nonresponses in Surveys, John Wiley & Sons,
New York.
Rubin, D.B. (1988) ‘An overview of multiple imputation’, Proceedings of the Survey Research
Section, American Statistical Association, pp.79–84.
Rubin, D.B. (1991) ‘EM and beyond’, Psychometrica, Vol. 56, pp.241–254.
Rubin, D.B. (1996) ‘Multiple imputation after 18+ years (with discussion)’, Journal of the
American Statistical Association, Vol. 91, pp.473–489.
Rubin, D.B. (2004) The Design of a General and Flexible System for Handling Nonresponse in
Sample Surveys, Report prepared for the US Social Security Administration, The American
Statistician, Vol. 58, No. 4, pp.298–302.
Rubin, D.B. and Schenker, N. (1986) ‘Multiple imputation for interval estimation from simple
random samples with ignorable nonresponse’, Journal of the American Statistical Association,
Vol. 81, No. 394, pp.366–374.
Sarle, W.S. (1998) ‘Prediction with missing inputs’, Proceedings of the Fourth Joint Conference on
Information Sciences, Vol. 2, pp.399–402.
Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data, Chapman & Hall, London.
Schafer, J.L. and Graham, J.W. (2002) ‘Missing data: our view of the state of the art’,
Psychological Methods, Vol. 7, No. 2, pp.147–177.
Schafer, J.L. and Olsen, M.K. (1998) ‘Multiple imputation for multivariate missing-data problems:
a data analyst’s perspective’, Multivariate Behavioral Research, Vol. 33, No. 4, pp.545–571.
Scharfstein, D.O., Rotnitzky, A. and Robins, J.M. (1999) ‘Adjusting for non-ignorable drop-out
using semiparametric non-response models (with discussion)’, Journal of the American
Statistical Association, Vol. 94, pp.1096–1146.
Schluchter, M.D. (1992) ‘Methods for the analysis of informatively censored longitudinal data’,
Statistics in Medicine, Vol. 11, pp.1861–1870.
Sedransk, J. (1985) ‘The objective and practice of imputation’, Proceedings of the First Annual
Res. Conf., pp.445–452.
Song, Q. and Shepperd, M. (2007) ‘A new method for imputing small software project data sets’,
Journal of Systems and Software, Vol. 80, No. 1, January, pp.51–62.
Song, Q., Shepperd, M. and Cartwright, M. (2005) ‘A short note on safest default missingness
mechanism assumptions’, Empirical Software Engineering: An International Journal, Vol. 10,
No. 2, pp.235–243.
Strike, K., El Emam, K. and Madhavji, N. (2001) ‘Software cost estimation with incomplete data’,
IEEE Transactions on Software Engineering, Vol. 27, No. 10, pp.890–908.
Tabachnick, B.G. and Fidell, L.S. (2001) Using Multivariate Statistics, 4th ed., Allyn & Bacon,
Needham Heights, MA.
Tanner, M.A. (1991) Tools for Statistical Inference: Observed Data and Data Augmentation
Methods, Springer-Verlag, New York, USA.
Tanner, M.A. (1993) Tools for Statistical Inference: Methods for the Exploration of Posterior
Distributions and Likelihood Functions, 2nd ed., Springer-Verlag, New York.
Tanner, M.A. and Wong, W.H. (1987) ‘The calculation of posterior distributions by data
augmentation (with discussion)’, Journal of the American Statistical Association, Vol. 82,
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and
Altman, R.B. (2001) 'Missing value estimation methods for DNA microarrays',
Bioinformatics, Vol. 17, pp.520–525.
Tsiatis, A.A., DeGruttola, V. and Wulfsohn, M.S. (1994) ‘Modeling the relationship of survival and
longitudinal data measured with error’, Journal of the American Statistical Association,
Vol. 90, pp.27–37.
Verbeke, G. and Molenberghs, G. (2000) Linear Mixed Models for Longitudinal Data,
Springer-Verlag, New York.
Wei, G.C.G. and Tanner, M.A. (1990) 'A Monte Carlo implementation of the EM algorithm and
the poor man's data augmentation algorithms', Journal of the American Statistical
Association, Vol. 85, pp.699–704.
Wilks, S.S. (1932) ‘Moments and distributions of estimates of population parameters from
fragmentary samples’, Annals of Mathematics Statistics, Vol. 3, pp.163–195.
Wothke, W. (2000) ‘Longitudinal and multi-group modeling with missing data’, in Little, T.D.,
Schnabel, K.U. and Baumert, J. (Eds.): Modeling Longitudinal and Multiple Group Data:
Practical Issues, Applied Approaches and Specific Examples, Lawrence Erlbaum Associates,
Mahwah, NJ, pp.219–240.
Wu, C.F.J. (1983) 'On the convergence properties of the EM algorithm', The Annals of Statistics,
Vol. 11, No. 1, pp.95–103.
Wu, M.C. and Bailey, K.R. (1988) ‘Analyzing changes in the presence of informative right
censoring caused by death and withdrawal’, Statistics in Medicine, Vol. 7, pp.337–346.
Wu, M.C. and Bailey, K.R. (1989) ‘Estimation and comparison of changes in the presence of
informative right censoring: conditional linear model’, Biometrics, Vol. 45, pp.939–955.
Xu, L. and Jordan, M.I. (1996) 'On convergence properties of the EM algorithm for Gaussian
mixtures', Neural Computation, Vol. 8, pp.129–151.
Yuan, Y.C. (2000) ‘Multiple imputation for missing data: concepts and new development’,
Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference,
SAS Institute, Cary, NC.
... However, for smaller data sets, or when removing data is impractical (each observation needs to have a prediction), it is common to impute the missing data as we might lose valuable information while deleting sets of observations. For imputation, a variety of different strategies exist that try to replace the missing data with estimates (Song & Shepperd, 2007). While imputation strategies can affect the performance of Machine Learning algorithms, their impact on the fairness of the resulting predictions is unclear. ...
... Missing data is often a problem in many research studies across various disciplines (Myrtveit et al., 2001;Sinharay et al., 2001;Song & Shepperd, 2007;Hardy et al., 2009;Cheema, 2014). According to Soley-Bori (2013), there are multiple methods to handle missing data depending on the type of missing data. ...
Full-text available
Research on Fairness and Bias Mitigation in Machine Learning often uses a set of reference datasets for the design and evaluation of novel approaches or definitions. While these datasets are well structured and useful for the comparison of various approaches, they do not reflect that datasets commonly used in real-world applications can have missing values. When such missing values are encountered, the use of imputation strategies is commonplace. However, as imputation strategies potentially alter the distribution of data they can also affect the performance, and potentially the fairness, of the resulting predictions, a topic not yet well understood in the fairness literature. In this article, we investigate the impact of different imputation strategies on classical performance and fairness in classification settings. We find that the selected imputation strategy, along with other factors including the type of classification algorithm, can significantly affect performance and fairness outcomes. The results of our experiments indicate that the choice of imputation strategy is an important factor when considering fairness in Machine Learning. We also provide some insights and guidance for researchers to help navigate imputation approaches for fairness.
... To lessen bias and escalate precision, missing values are imputed instead of eliminating observations with missing data. Mean imputation is implemented since it is quick, not computationally challenging and it sustains sample size (Song and Shepperd 2007). In this technique, the average of non-missing values for each driver with missing value(s) is calculated. ...
Full-text available
In this study, we design stepwise ordinary least squares regression models using various amalgamations of firm features, loan characteristics and macroeconomic variables to forecast workout recovery rates for defaulted bank loans for private non-financial corporates under downturn conditions in Zimbabwe. Our principal aim is to identify and interpret the determinants of recovery rates for private firm defaulted bank loans. For suitability and efficacy purposes, we adopt a unique real-life data set of defaulted bank loans for private non-financial firms pooled from a major anonymous Zimbabwean commercial bank. Our empirical results show that the firm size, the collateral value, the exposure at default, the earnings before interest and tax/total assets ratio, the length of the workout process, the total debt/total assets ratio, the ratio of (current assets–current liabilities)/total assets, the inflation rate, the interest rate and the real gross domestic product growth rate are the significant determinants of RRs for Zimbabwean private non-financial firm bank loans. We reveal that accounting information is useful in examining recovery rates for defaulted bank loans for private corporations under distressed financial and economic conditions. Moreover, we discover that the prediction results of recovery rate models are augmented by fusing firm features and loan characteristics with macroeconomic factors.
... However, the superior performance of the RRLR method was not universal, and it was more suitable for low missing ratios (e.g. less than 30%). This is because the strategy of RRLR is to build regression models to predict and impute the missing features according to other complete samples in an iterative loop [50]. Although this strategy allows RRLR to utilize as many observations as possible during interpolation, regression typically requires many samples with non-missing data to produce stable results [33]. ...
Full-text available
Background Biological age (BA) has been recognized as a more accurate indicator of aging than chronological age (CA). However, the current limitations include: insufficient attention to the incompleteness of medical data for constructing BA; Lack of machine learning-based BA (ML-BA) on the Chinese population; Neglect of the influence of model overfitting degree on the stability of the association results. Methods and results Based on the medical examination data of the Chinese population (45–90 years), we first evaluated the most suitable missing interpolation method, then constructed 14 ML-BAs based on biomarkers, and finally explored the associations between ML-BAs and health statuses (healthy risk indicators and disease). We found that round-robin linear regression interpolation performed best, while AutoEncoder showed the highest interpolation stability. We further illustrated the potential overfitting problem in ML-BAs, which affected the stability of ML-Bas’ associations with health statuses. We then proposed a composite ML-BA based on the Stacking method with a simple meta-model (STK-BA), which overcame the overfitting problem, and associated more strongly with CA (r = 0.66, P < 0.001), healthy risk indicators, disease counts, and six types of disease. Conclusion We provided an improved aging measurement method for middle-aged and elderly groups in China, which can more stably capture aging characteristics other than CA, supporting the emerging application potential of machine learning in aging research.
... These are handling missing values, removal of inconsistent values, and removal of duplicate values. Missing value can be removed by dropping that particular instance and replacing the value by mean or the value whose probability of occurrence is high (Song and Shepperd 2007). Health data may contain an inconsistent value. ...
The main aim of this chapter is to control home and healthcare appliances using two types of automation which are EEG based brain–computer interfaces and command-based using Telegram Bot. In the brain–computer interface, data are captured using EEG, and the bandpass filter is used for filtering data in the range between 12 to 100 Hz, artifact removal is done using independent component analysis, feature extraction and selection are done by Fast Fourier theorem, and then translation by command recognition. After optimizing all steps command send to the microcontroller, where the circuit is designed using ESP8266 Node MCU and Relay. Another process is to control home automation using Telegram Bot, this process is for physically fit people, they will use the Telegram Bot to control home automation at low cost. The objective of this chapter is to control home applications using EEG and BCI that could help to support old and paralyzed people to be independent in their daily life. So, this system fulfilled the expectations of home automation in two different ways that can hugely impact society in day-to-day life.KeywordsElectroencephalogramBrain–computer interfaceMachine learningInternet-of-ThingsThingSpeakTalkBackTelegram BotHome automation
... These are handling missing values, removal of inconsistent values, and removal of duplicate values. Missing value can be removed by dropping that particular instance and replacing the value by mean or the value whose probability of occurrence is high (Song and Shepperd 2007). Health data may contain an inconsistent value. ...
Internet of Things Based Smart Healthcare (Intelligent and Secure Solutions Applying Machine Learning Techniques)
... Various techniques of handling missing data are already being implemented. In our research we have used mean imputation technique [13] to achieve a proper prepared data with less complexity. ...
Conference Paper
Full-text available
Myocardial Infarction occurs due to thedestruction of heart tissue resulting from the obstruction of theblood supply to the heart muscle. It refers to a heart attack whichis a major heart disease throughout the world. Machine learningtechniques can be engaged as a decision support system forpredicting myocardial infarction from a group of importantpredictive features that may categorize the severe-risk patientsand can provide guidance to minimize the severity. In thisresearch, we have collected myocardial infarction patient’s datato assess the classification performance of two different ensemblebased machine learning methods Bagging and Boosting with fivedifferent base classifiers such as Support Vector Machine, K-Nearest Neighbor, Naïve Bayes, Decision Tree, and RandomForest for predicting myocardial infarction in an earlier stage. Itshould be understood that finding important attributes can helpto increase performance. Experimental result showed that theBagging with Random Forest ensemble method outperformedother methods by achieving higher accuracy of 96.50%.
We evaluate visualization concepts to represent missing values in parallel coordinates. We focus on the trade‐off between the ability to perceive missing values and the concept's impact on common tasks. For this purpose, we identified three missing value representation concepts: removing line segments where values are missing, adding a separate, horizontal axis onto which missing values are projected, and using imputed values as a replacement for missing values. For the missing values axis and imputed values concepts, we additionally add downplay and highlight variations. We performed a crowd‐sourced, quantitative user study with 732 participants comparing the concepts and their variations using five real‐world datasets. Based on our findings, we provide suggestions regarding which visual encoding to employ depending on the task at focus.
With the increasing population and diverse environment, health care data prediction plays an important role. The goal of the work is to build a system that predicts a disease accurately based on basic knowledge. With the advancement of ICT-based Healthcare systems, Wearable smart devices (WSD) play an important role in monitoring body vitals and helps in taking proper decisions regarding one’s health. The role of smart and developed systems as WSD helps tremendously to keep check all vital attributes of a patient and thus controlling their conditions accordingly. WSDs have significant potential to detect any abnormal occurrence of events in a patient such as a sudden drop of blood pressure, decrease oxygen saturation, monitoring blood glucose level, decreasing pulse rate which can help in proper decision making in time thus saving lives. In the last few years, monitoring body vitals has become an intensively researched area and several studies have been done for analyzing these signs so that any further casualty can be detected early. So a real-time health monitoring system with WSDs helps in cost-effective, accurate, and easy to apply secured data analysis. Today WSDs are available in many forms like smartwatches, smart jewelry, smart clothing, different fitness tracker, smart glasses are available to help monitor one’s health vitals by themselves. This work presents a comprehensive review of this emerging field of research.KeywordsWearable smart devicesHealthcare systemBody vitals monitoringMachine learning approachesPrediction
The main focus of this thesis is to introduce a convenient conception of a smart city framework that helps to analyze the state of 65 cities in Europe according to their smartness. The framework consists of the development of an objective and a subjective Smart City Composite Indicator (SCCI), which together offer a detailed view of those 65 cities. It is holistic and at the same time differentiates between several aspects, enabling close observation of the cities' performances. The approach is unique in that it distinguishes between objective elements that surround the cities and the subjective perceptions of their citizens; other researchers mix the two and rely mostly on the former. The separate analyses yield interesting insights because they provide benchmarks for examining whether inhabitants perceive as smart those cities that are smart according to objective criteria. Furthermore, this thesis attempts to identify smart city drivers by means of an econometric analysis. The idea is to estimate the same econometric models for both SCCIs and thereby to investigate whether the same variables explain them. Moreover, the econometric models can set a rough agenda for the long-term development of the cities.
Two algorithms for producing multiple imputations for missing data are evaluated with simulated data. Software using a propensity score classifier with the approximate Bayesian bootstrap produces badly biased estimates of regression coefficients when data on predictor variables are missing at random or missing completely at random. On the other hand, a regression-based method employing the data augmentation algorithm produces estimates with little or no bias.
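The regression-based imputation idea that this study favors can be illustrated with a minimal sketch (the toy data and variable names below are illustrative, not from the study): fit a regression of the incomplete variable on its observed predictors among the complete cases, draw each imputation as the prediction plus residual noise, and pool estimates across the completed data sets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends linearly on x; ~30% of y is missing completely at random.
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)
missing = rng.random(n) < 0.3
y_obs = y.copy()
y_obs[missing] = np.nan

# Fit y ~ x on the complete cases only.
obs = ~np.isnan(y_obs)
X = np.column_stack([np.ones(obs.sum()), x[obs]])
beta, *_ = np.linalg.lstsq(X, y_obs[obs], rcond=None)
resid_sd = np.std(y_obs[obs] - X @ beta)

def impute_once(seed):
    """One imputation: prediction plus a fresh draw of residual noise."""
    r = np.random.default_rng(seed)
    y_imp = y_obs.copy()
    pred = beta[0] + beta[1] * x[missing]
    y_imp[missing] = pred + r.normal(scale=resid_sd, size=missing.sum())
    return y_imp

# Multiple imputation: repeat with different noise draws and pool estimates.
means = [impute_once(s).mean() for s in range(5)]
pooled_mean = float(np.mean(means))
```

Adding residual noise (rather than imputing the bare prediction) is what keeps the imputed variable's variance from being artificially deflated; a fuller multiple-imputation procedure would also propagate the uncertainty in the fitted coefficients themselves.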
This work relates to methods of dealing with missing values of explanatory variables in regression analysis that first 'complete' the data by inserting estimates derived from regressions of the explanatory variables on each other and then employ some form of weighted regression. It is argued that the choices of weights in the published methods are not optimal and that improvements are possible. This is verified for a simple case, and the difficulties of extending the methodology to general cases are discussed.
A common concern when faced with multivariate data with missing values is whether the missing data are missing completely at random (MCAR); that is, whether missingness is unrelated to the variables in the data set. One way of assessing this is to compare the means of the recorded values of each variable between groups defined by whether other variables in the data set are missing or not. Although informative, this procedure yields potentially many correlated statistics for testing MCAR, resulting in multiple-comparison problems. This article proposes a single global test statistic for MCAR that uses all of the available data. The asymptotic null distribution is given, and the small-sample null distribution is derived for multivariate normal data with a monotone pattern of missing data. The test reduces to a standard t test when the data are bivariate with missing data confined to a single variable. A limited simulation study of empirical sizes for the test applied to normal and nonnormal data suggests that the test is conservative for small samples.
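The mean-comparison diagnostic described above can be sketched for the bivariate case, where it reduces to a two-sample t test on the fully observed variable between cases with and without a missing value on the other variable. The simulated data and missingness mechanisms below are illustrative only, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(1)

# Bivariate data: x is fully observed, y is partially missing.
n = 500
x = rng.normal(size=n)
y = x + rng.normal(size=n)

# Scenario A (MCAR): missingness of y is unrelated to x.
miss_mcar = rng.random(n) < 0.3
# Scenario B (not MCAR): y is more likely to be missing when x is large.
miss_mar = rng.random(n) < np.where(x > 0, 0.5, 0.1)

def t_stat(v, miss):
    """Welch two-sample t statistic comparing the mean of v between
    cases where y is missing and cases where y is observed."""
    a, b = v[miss], v[~miss]
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

# Under MCAR the statistic should be near 0; otherwise it should be large.
t_mcar = t_stat(x, miss_mcar)
t_mar = t_stat(x, miss_mar)
```

With many variables, one such statistic arises per (variable, missingness-pattern) pair, which is exactly the multiple-comparison problem the article's single global test is designed to avoid.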
It is sometimes suspected that nonresponse to a sample survey is related to the primary outcome variable. This is the case, for example, in studies of income or of alcohol consumption behaviors. If nonresponse to a survey is related to the level of the outcome variable, then the sample mean of this outcome variable based on the respondents will generally be a biased estimate of the population mean. If this outcome variable has a linear regression on certain predictor variables in the population, then ordinary least squares estimates of the regression coefficients based on the responding units will generally be biased unless nonresponse is a stochastic function of these predictor variables. The purpose of this paper is to discuss the performance of two alternative approaches, the selection model approach and the mixture model approach, for obtaining estimates of means and regression estimates when nonresponse depends on the outcome variable. Both approaches extend readily to the situation where values of the outcome variable are available for a subsample of the nonrespondents, called "follow-ups." The availability of follow-ups is a feature of the example we use to illustrate comparisons.