

Missing Data Imputation Techniques

Qinbao Song*

Department of Computer Science and Technology,

Xi’an Jiaotong University, 28 Xian-Ning West Road,

Xi’an, Shaanxi 710049, China

E-mail: qbsong@mail.xjtu.edu.cn

*Corresponding author

Martin Shepperd

School of IS, Computing and Maths,

Brunel University, Uxbridge UB8 3PH, UK

E-mail: martin.shepperd@brunel.ac.uk

Abstract: Intelligent data analysis techniques are useful for better exploring real-world data sets. However, real-world data sets are almost always accompanied by missing data, a major factor affecting data quality, while good intelligent data exploration requires quality data. Fortunately, Missing Data Imputation Techniques (MDITs) can be used to improve data quality. However, no single MDIT can be used in all conditions; each method has its own context of use. In this paper, we introduce MDITs to the KDD and machine learning communities by presenting the basic idea of each method and highlighting its advantages and limitations.

Keywords: data quality; KDD; data mining; machine learning; data cleaning;

missing data; data imputation; missingness mechanism; missing data pattern;

single imputation; multiple imputation.

Reference to this paper should be made as follows: Song, Q. and Shepperd, M.

(2007) ‘Missing Data Imputation Techniques’, Int. J. Business Intelligence and

Data Mining, Vol. x, No. x, pp.xxx–xxx.

Biographical notes: Qinbao Song received a PhD in Computer Science from

the Xi’an Jiaotong University, China, in 2001. He is a Professor of Software

Technology in the Department of Computer Science and Technology at

Xi’an Jiaotong University, Xi’an, China. He has published more than

60 refereed papers in the areas of Data Mining, Machine Learning, and Software

Engineering. He is a board member of the Open Software Engineering Journal.

His research interests include intelligent computing, machine learning for

software engineering, and trustworthy software.

Martin Shepperd received a PhD in Computer Science from the Open

University in 1991. He is a Professor of Software Technology at Brunel

University, London, UK, and director of the Brunel Software Engineering

Research Centre (B-SERC). He has published more than 90 refereed papers and

three books in the area of Empirical Software Engineering, Machine Learning

and Statistics. He is Editor-in-Chief of the Journal of Information and Software

Technology and was Associate Editor of IEEE Transactions on Software

Engineering (2000–2004).


1 Introduction

As more and more data are gathered every day, we risk being swamped by huge mountains of data. Knowledge Discovery from Databases (KDD) is a powerful technique that can help us understand these huge amounts of data and extract useful information from them; it is becoming increasingly important and has been used in a wide range of areas (Chen and Liu, 2005). In practice, however, the presence of missing data is a general and challenging problem for most data analysis techniques, including KDD and machine learning. As all data analysis techniques induce knowledge strictly from data, the quality of the knowledge extracted is largely determined by the quality of the underlying data. This makes any data analysis technique a potential garbage-in garbage-out system. Therefore, data quality has been a major concern in the data analysis field, and quality data are a precondition for obtaining quality knowledge. Noisy data, inconsistent data, and missing data are three roadblocks to obtaining quality data (Han and Kamber, 2000). In this paper, however, we focus on Missing Data Techniques (MDTs).

Missing data are the absence of data items for a subject; they hide information that may be important. Missing data pervasively exist in real-world data sets, and they

• result in a loss of information and statistical power (Kim and Curry, 1977)

• make common data analysis methods inappropriate or difficult to apply (Rubin, 1987)

• can introduce bias into estimates derived from a statistical model (Becker and Walstad, 1990; Rubin, 1987).

Obviously, missing data pose a challenge to most data analysis techniques, including

KDD and machine learning, because a substantial proportion of the data may be missing

and predictions must be made for cases with missing inputs. At the same time, many KDD and machine learning methods can only work with complete data sets (e.g., nearly half of the 23 classification algorithms studied in the Statlog project (Michie et al., 1994) require complete data), and some of them handle missing data in a rather naive way that can introduce bias into the knowledge induced. Therefore, missing data are a major concern in the KDD and machine learning communities. To obtain quality

knowledge with intelligent data analysis methods, we must first carefully deal with

missing data.

Unfortunately, the KDD and machine learning communities have paid relatively little attention to missing data problems, and hence only a few papers have been published (Sarle, 1998; Lakshminarayan et al., 1999; Aggarwal and Parthasarathy, 2001; Kalousis and Hilario, 2000). In contrast, missing data problems have been extensively studied by the statistics community. This makes it possible for the KDD and machine learning communities to borrow MDTs from the statistics community to address the missing data problems in their own fields. Therefore, in this paper, we introduce the MDTs, especially the MDITs, from the perspective of statistics. We mainly present the technical aspects of the methods, so that researchers in the intelligent data analysis field can decide which method suits their specific tasks. At the same time, we also highlight the advantages and limitations of each method, and provide some comments from the perspective of intelligent data analysis.


The rest of the paper is organised as follows. We first introduce the basic concepts

of missingness mechanisms and missing data patterns in Section 2. Then, we provide

readers with the big picture of the MDTs in Section 3. After that, we focus on reviewing

the main single imputation and Multiple Imputation (MI) methods in Sections 4 and 5, respectively. Finally, we

present our recommendations for the intelligent data analysis community in Section 6.

2 Missingness mechanisms and missing data patterns

Both missingness mechanisms and missing data patterns have a great impact on research results; both are critical issues a researcher must address before choosing an appropriate method to deal with missing data.

2.1 Missingness mechanisms

Missingness mechanisms are assumptions about the nature and types of missing data. Little and Rubin (2002) define three distinct types of missing data mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), and Non-Ignorable (NI).

In general, the missingness mechanism is concerned with whether the missingness is

related to the study variables or not. This is extremely significant as it determines how

difficult it may be to handle missing values and at the same time how risky it is to ignore

them.

Viewing response as a random process (Rubin, 1976), the missingness mechanisms can be introduced as follows. Suppose Z is a data matrix that includes observed and missing data; let $Z_{obs}$ be the set of observed values of Z, $Z_{mis}$ the set of missing values of Z, and R the missing data indicator matrix, where i indexes the ith case and j the jth feature:

$$R_{ij} = \begin{cases} 1 & \text{if } Z_{ij} \text{ is missing,} \\ 0 & \text{if } Z_{ij} \text{ is observed.} \end{cases}$$

MCAR indicates that the missingness is unrelated to the values of any variables, whether

missing or observed, thus

$$p(R\,|\,Z) = p(R) \quad \text{for all } Z.$$

This is best illustrated by an example. Suppose there is a data set consisting of two variables and 12 cases; see Table 1 for details. The first two columns of the table show the complete data for variables V1 and V2. The other columns show the values of V2 that remain after imposing three missingness mechanisms. In the third column, the missing values appear in cases whose values of V1 are A, B, and C, i.e., all possible values of variable V1, and the missing values of V2 are spread from small to large. This means the values are missing randomly, with no relation either to variable V1 or to V2 itself; thus it is MCAR.


Table 1 Complete data set and the simulation of three missingness mechanisms

V1    V2 (Complete)   V2 (MCAR)   V2 (MAR)   V2 (NI)
A     85              85          85         ?
A     94              ?           94         ?
A     111             111         111        111
A     130             130         130        130
B     80              80          ?          ?
B     97              97          ?          ?
B     117             117         ?          117
B     125             ?           ?          125
C     88              ?           88         ?
C     91              91          91         ?
C     123             123         123        123
C     132             ?           132        132

MCAR is an extreme condition and, from an analyst's point of view, an ideal one. Generally, you can test whether the MCAR condition is met by showing that there is no difference between the distributions of the observed data for the observed cases and for the missing cases; this is the multivariate test of Little (1988) and Little et al. (1995), which is implemented in SYSTAT and the SPSS Missing Values Analysis module. Unfortunately, this is hard when there are few cases, as there can be a problem with Type I errors.

Non-Ignorable (NI) is at the opposite end of the spectrum. It means that the missingness is non-random: it is related to the missing values themselves and is not predictable from any other variable in the data set. That is,

$$p(R\,|\,Z) \neq p(R) \quad \text{for all } Z; \qquad p(R\,|\,Z) \text{ depends on } Z_{mis}.$$

NI is the worst case since, as the name implies, the problem cannot be avoided by a deletion technique, nor are imputation techniques in general effective unless the analyst has some model of the cause of missingness. For example, in the fifth column of Table 1, all the values smaller than 100 are missing, and the corresponding values of V1 can be A, B, or C. This means the missing values depend only on variable V2 itself; thus it is NI. This can also be illustrated by another example. Suppose software engineers are less likely to report high defect rates than low rates, perhaps for reasons of politics. Merely ignoring the incomplete values leads to a biased sample and an over-optimistic view of defects. On the other hand, imputation techniques do not work well either, since they attempt to exploit known values and, as we have already observed, these form a biased sample. Unless one has some understanding of the process and can construct explanatory models, there is little that can effectively be done with NI missingness.

MAR lies between these two extremes. It requires that the cause of the missing data is

unrelated to the missing values, but may be related to the observed values of other

variables, that is:

$$p(R\,|\,Z) = p(R\,|\,Z_{obs}) \quad \text{for all } Z_{mis}.$$


For example, in the fourth column of Table 1, all the missing values appear in cases whose value of V1 is B, and the missing values are spread from small to large. This means the missing values depend only on variable V1 and have no relation to V2 itself; thus it is MAR.

Most missing data methods assume MAR. Whether the MAR condition holds can be examined by a simple t-test of mean differences between the group with complete data and that with missing data (Kim and Curry, 1977; Tabachnick and Fidell, 2001). MAR is less restrictive than MCAR because MCAR is a special case of MAR. MAR and MCAR are both said to be ignorable missing data mechanisms, a term coined in Rubin (1976) and fully explicated in the context of MI in Rubin (1987).

In practice, it is usually difficult to meet the MCAR assumption. MAR is an

assumption that is more often, but not always, tenable.
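To make the three mechanisms concrete, the following minimal Python sketch (ours, not from the paper; the 0.25 probability and the cut-off of 100 are illustrative only) imposes MCAR, MAR, and NI missingness on a small V1/V2 data set like that of Table 1.

import numpy as np

rng = np.random.default_rng(0)

v1 = np.repeat(np.array(["A", "B", "C"]), 4)
v2 = np.array([85, 94, 111, 130, 80, 97, 117, 125, 88, 91, 123, 132], float)

mcar = v2.copy()
mcar[rng.random(v2.size) < 0.25] = np.nan   # missingness ignores every value

mar = v2.copy()
mar[v1 == "B"] = np.nan                     # depends only on the observed V1

ni = v2.copy()
ni[v2 < 100] = np.nan                       # depends on the value of V2 itself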

2.2 Missing data patterns

The missing data indicator matrix R reveals the missing data pattern. By rearranging the cases and the variables of a data set, we can see the missing data patterns. Generally, there are two types of missing data patterns: the univariate pattern and the multivariate pattern.

In the univariate missingness pattern, only one variable contains missing values. Table 2 is an example: only variable x3 contains missing values (three of them).

In the multivariate missingness pattern, more than one variable contains missing data. We can refine this pattern into two types: the monotone pattern and the arbitrary pattern.

Table 2 Univariate missing pattern

      x1   x2   x3   x4   x5   x6
C1    *    *    *    *    *    *
C2    *    *    *    *    *    *
C3    *    *    ?    *    *    *
C4    *    *    ?    *    *    *
C5    *    *    ?    *    *    *
C6    *    *    *    *    *    *

In the monotone pattern, variables can be arranged so that for a set of variables x1, x2, …, xn, if xi is missing for a case, then so are xi+1, …, xn. Table 3 is an example.

Table 3 Monotone missing pattern

      x1   x2   x3   x4   x5   x6
C1    *    *    *    *    *    *
C2    *    *    *    *    *    ?
C3    *    *    *    *    ?    ?
C4    *    *    *    ?    ?    ?
C5    *    *    ?    ?    ?    ?
C6    *    ?    ?    ?    ?    ?


In the arbitrary pattern, missing data can occur anywhere, and no special structure appears regardless of how the variables are arranged. Table 4 is an example.

Table 4 Arbitrary missing pattern

      x1   x2   x3   x4   x5   x6
C1    *    *    ?    *    *    *
C2    *    *    *    *    *    ?
C3    *    *    *    *    *    ?
C4    ?    ?    *    *    *    *
C5    *    *    *    *    *    *
C6    ?    *    *    *    *    *

The SPSS Missing Value Analysis module can assess missing data patterns. The type of missing data pattern may affect the selection of missing data methods, because some missing data methods are sensitive to the missing data patterns. Therefore, we discuss this issue when introducing a specific missing data imputation method where applicable.
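As a rough illustration of how the indicator matrix R exposes these patterns, the following Python sketch (ours, not the paper's) classifies R as univariate, monotone, or arbitrary; it assumes the variables are already arranged in the order of interest, so no column rearrangement is attempted.

import numpy as np

def pattern(R):
    # R: indicator matrix, 1 = missing, as defined in Section 2.
    if len(np.where(R.any(axis=0))[0]) <= 1:
        return "univariate"
    # monotone: in every case, once a variable is missing, all later ones are
    monotone = all(R[i, j:].all()
                   for i in range(R.shape[0])
                   for j in np.where(R[i] == 1)[0][:1])
    return "monotone" if monotone else "arbitrary"

R = np.array([[0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1]])
print(pattern(R))    # -> monotone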

3 The classification of Missing Data Techniques (MDTs)

In this section, we first show readers a big picture of MDTs by presenting the

corresponding overall taxonomy. Then, we draw readers’ attention to a specific part of it:

the taxonomy of the MDITs, which are the focus of the paper.

3.1 Missing Data Techniques (MDTs) taxonomy

There is a great deal of research activity on MDTs, and a wide range of missing data methods have been proposed. We divide these MDTs into three categories:

• missing data ignoring techniques

• missing data toleration techniques

• Missing Data Imputation Techniques (MDITs).

The missing data ignoring techniques simply delete the cases that contain missing data. Because of their simplicity, they are widely used, but they do not lead to the most efficient utilisation of the data and should be used only in situations where the amount of missing data is very small. This approach has two forms, Listwise Deletion (LD) and Pairwise Deletion (PD), contrasted in the sketch after the following list.


• Listwise Deletion (LD) is also referred to as case deletion, casewise deletion, and complete case analysis. This method omits the cases containing missing data and makes use only of those cases that do not contain any missing data. It is easy, fast, commonly accepted, and does not invent data; it is the default of most statistical packages. The drawback is that its application leads to a large loss of observations, which may result in data sets that are too small if the fraction of missing values is high (Myrtveit et al., 2001), and this method incurs a bias in the data unless the data are missing under MCAR.

• Pairwise Deletion (PD) is also referred to as the available case method. This method considers each variable separately. For each variable, all recorded values in each observation are considered (Strike et al., 2001) and missing data are ignored. This means that different calculations will utilise different cases and will have different sample sizes, which is undesirable, but it still provides better estimates than LD (Kim and Curry, 1977). The advantage is that the sample size for each individual analysis is generally higher than with complete case analysis, but the results are unbiased only if the data are missing under MCAR. It is necessary when the overall sample size is small or the number of cases with missing data is large.
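The following toy Python sketch (illustrative only; it relies on the fact that pandas computes DataFrame.cov() over pairwise-complete observations) contrasts LD and PD on the same small data set.

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "y": [2.1, np.nan, 2.9, 4.2, 4.8]})

listwise = df.dropna()                   # LD: keep only fully observed cases
print(listwise.cov())                    # every statistic uses the same 3 cases

# PD: each pairwise statistic uses all cases observed for that pair
print(df.cov())                          # pandas computes covariances pairwise
print(df["x"].mean(), df["y"].mean())    # each mean uses its own 4 cases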

The missing data toleration techniques are internal missing data treatment strategies; they work directly with data sets containing missing values. If the objective is not to predict the missing values, these techniques are a better choice. This is because any missing data prediction method will incur bias, and when biased data are used to make predictions, the results are doubtful.

CART (Breiman et al., 1984) addresses the missing data problem in the context of decision tree classifiers. If some cases contain missing values, CART uses the best surrogate split to assign these cases to branches of a split on a variable on which their values are missing. C4.5 (Quinlan, 1988) is an alternative to CART. C4.5 uses a probabilistic approach to handle missing data; missing values can be present in any variable except the class variable. This method calculates the expected information gain by assuming that the missing value is distributed according to the observed values in the subset of the data at that node of the tree. It seems to be one of the best simple methods for treating missing values (Grzymala-Busse and Hu, 2000).

However, in some contexts, for example, when a data analysis method only works with a complete data set, we must first fill in the missing values or delete the cases containing missing values, and then use the resulting data set to perform the subsequent data analysis. In such cases, toleration techniques cannot be used. And in cases where the data set contains large amounts of missing data, or the mechanism causing the missing data is non-random, imputation techniques might perform better than ignoring techniques (Haitovsky, 1968).

MDITs refer to any strategy that fills in the missing values of a data set so that standard data analysis methods can then be applied to analyse the completed data set. These techniques not only retain the data in incomplete cases, but also exploit the values of correlated variables when imputing (Little and Rubin, 1989).


3.2 Missing Data Imputation Techniques (MDITs) taxonomy

In practice, missing data imputation is one of the most common techniques for handling

missing data (Kalton, 1981; Sedransk, 1985). It is also a rapidly evolving field with many

methods. All these methods can be classified into

• ignorable missing data imputation methods, which consist of single imputation

methods and MI methods

• NI missing data imputation methods, which consist of likelihood-based methods and

non-likelihood-based methods.

The likelihood-based methods (Hogan and Laird, 1997; Little, 1993, 1995; Tsiatis et al., 1994; Wu and Bailey, 1988) require a full parametric specification of the joint distribution of the complete data and the missingness mechanism (Scharfstein et al., 1999); it is typical to specify a parametric model for NI and incorporate it into the complete data log-likelihood. Depending on how the joint distribution of the primary data Z and the missingness indicators R is specified, Little and Rubin (1987) presented three general classes of likelihood-based NI missing data methods:

• Selection Models (SMs) assume different parameters for the primary data model

p(Z) and for p(R|Z).

• Pattern Mixture Models (PMMs) assume different parameters for p(Z|R) and

for p(R).

• Shared Parameter Models (SPMs) instead incorporate common parameters into

models for p(Z) and p(R).

• Selection Models (SMs) (Heckman, 1976; Amemiya, 1984; Diggle and Kenward,

1994; Little, 1995; Verbeke and Molenberghs, 2000) require provision

of the distribution of complete data and specification of the manner in which the

probability of missingness depends on the data (Schafer and Graham, 2002).

• Pattern Mixture Models (PMMs) (Little, 1993, 1994; Glynn et al., 1986; Hedeker

and Gibbons, 1997; Little and Schenker, 1995; Little, 1995; Little and Wang, 1996;

Verbeke and Molenberghs, 2000) assume the joint distribution of Z and R is a

mixture of the missing patterns; this means that the parameters for the model have

to be calculated separately for each pattern. PMMs require specification of a

distribution of probability for the missing data patterns and a model for the data

within each pattern. The typical feature is that the distribution of the missingness

only depends on the covariates and not on Z.

• Shared Parameter Models (SPMs) (Wu and Bailey, 1989; Schluchter, 1992; DeGruttola and Tu, 1994; Dunson and Perreault, 2001; Roy, 2003) are used when the missingness may be associated with the true underlying response for a subject in clustered and longitudinal data settings. Several shared parameter approaches have been proposed for analysing longitudinal data when the time to censoring depends on a subject's underlying rate of change.

The non-likelihood-based methods (Robins, 1997; Robins et al., 1995; Rotnitzky and

Robins, 1997; Rotnitzky et al., 1998; Paik, 1997) require the joint distribution of the


complete data to follow a non-parametric or semi-parametric model and the missingness

mechanism to follow a parametric model (Scharfstein et al., 1999).

Sensitivity analysis is a rational approach to NI non-response, and has been used by Little (1994), Little and Wang (1996), Rubin (1977) and Scharfstein et al. (1999). But Scharfstein et al. (1999) argue that sensitivity analysis has some problems. The first is that, in practice, users favour simplicity and conciseness in the presentation of results. The second lies in the fact that a sensitivity analysis usually needs to be confined to a relatively small number of parameters. The third is that many different forms of sensitivity analysis can be conducted, and they may yield contradictory conclusions.

Compared with the likelihood-based methods, the non-likelihood-based methods are preferable. The reason is that in most real-world problems, it is not possible to fully specify the non-response model as a function of reported values. However, the NI missing data imputation methods require explicit specification of a distribution for the missingness in addition to the model for the complete data; they are more complex and not suitable for KDD purposes. Thus, in this paper, we focus further on single and MI methods.

A single imputation method fills in one value for each missing value; currently, it is more common than MI, which replaces each missing value with several plausible values and so more accurately reflects the sampling variability about the actual values. In this paper, we focus attention mainly on single imputation and MI, which have the potential to serve KDD and machine learning.

4 Single imputation techniques

Single imputation refers to the substitution of a single value for each missing value. It is flexible, and because it fills in the missing values, complete-data analysis methods can then be applied. Further, it handles missing data only once, which implies a consequent consistency of answers from identical analyses (Rubin, 1988). This means that single imputation methods can be applied to KDD and machine learning. Although single imputation makes a contribution to the missing data problem to some extent, unfortunately, it still does not reflect the uncertainty in the missing data estimates: the effective sample size is overestimated, the variance and standard errors are underestimated, confidence intervals for estimated parameters are too narrow, and Type I error rates are too high (Little and Rubin, 2002).

There are different types of single imputation techniques. We focus on both the main basic imputation techniques, such as mean imputation, regression imputation, and hot-deck imputation, and advanced imputation techniques, such as the Expectation Maximisation (EM) approach and Raw Maximum Likelihood (RML) methods. The Approximate Bayesian Bootstrap (ABB) and Data Augmentation (DA) methods, which are closely related to MI, are introduced in Section 5.

4.1 Mean imputation

Mean imputation, also called unconditional mean imputation, is a widely used imputation

method.


Method

Mean imputation assumes that the mean of a variable is the best estimate for any case that has missing information on this variable. Thus, it imputes each missing value with the mean of the known values of the same variable if data are missing under MCAR (Wilks, 1932). Suppose x3 in Table 3 is a continuous variable; then the missing values of both C5(x3) and C6(x3) are imputed with the mean of the four observed values of variable x3. That is,

$$C_5(x_3) = C_6(x_3) = \frac{1}{4}\sum_{i=1}^{4} C_i(x_3).$$

If x3 in Table 3 is a categorical variable, the missing values of both C5(x3) and C6(x3) are imputed with the mode of the four observed values of variable x3. That is,

$$C_5(x_3) = C_6(x_3) = \arg\max_{\upsilon}\ \frac{\big|\{\,i : C_i(x_3) = \upsilon,\ 1 \le i \le 4\,\}\big|}{4},$$

where $\upsilon$ ranges over the observed values of variable x3.

A variation of mean imputation is to stratify a data set into sub-groups based on auxiliary variables and then impute the sub-group mean for missing data within each sub-group (Kalton and Kasprszyk, 1986; Song et al., 2005; Song and Shepperd, 2007). This certainly represents an improvement on overall mean imputation, although it may not be perfect. Cohen (1996) proposed another way to improve mean imputation. Instead of imputing the mean for all the missing values, one can impute half of the missing values with

$$\bar{Z}_{obs} + \sqrt{\frac{n_{obs} + n_{mis} - 1}{n_{obs} - 1}}\; D_{obs}$$

and the other half with

$$\bar{Z}_{obs} - \sqrt{\frac{n_{obs} + n_{mis} - 1}{n_{obs} - 1}}\; D_{obs},$$

where $n_{obs}$ is the number of observed values, $n_{mis}$ the number of missing values, $\bar{Z}_{obs}$ the mean of the observed values, and

$$D_{obs}^2 = \frac{1}{n_{obs}}\sum_{i=1}^{n_{obs}} \big(Z_i - \bar{Z}_{obs}\big)^2.$$

This adjustment will retain the first and second moments as observed.
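A minimal Python sketch of mean imputation with Cohen's split adjustment follows; it is ours, not the paper's, and the adjustment factor mirrors the reconstruction above, so it should be checked against Cohen (1996) before serious use.

import numpy as np

z = np.array([85., 94., 111., 130., np.nan, np.nan, np.nan, np.nan])
obs = z[~np.isnan(z)]
n_obs, n_mis = len(obs), int(np.isnan(z).sum())

z_bar = obs.mean()
d_obs = np.sqrt(((obs - z_bar) ** 2).mean())      # D_obs as defined above
shift = np.sqrt((n_obs + n_mis - 1) / (n_obs - 1)) * d_obs

filled = z.copy()
holes = np.where(np.isnan(z))[0]
filled[holes[: len(holes) // 2]] = z_bar + shift  # half imputed above the mean
filled[holes[len(holes) // 2:]] = z_bar - shift   # half imputed below the mean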

Advantages and limitations

Mean imputation can be valid especially when data are missing under MCAR. It is fast, simple, easy to implement, and no cases are excluded. But even if the missingness mechanism is MCAR, this method still leads to underestimation of the population variance (the bias is proportional to $(n_{obs} - 1)/(n_{obs} + n_{mis} - 1)$ (Little, 1992)), and thus to a small standard error and a possibility of Type I error.


4.2 Regression imputation

Regression imputation, also called conditional mean imputation, replaces each missing

value with a predicted value based on a regression model if data are missing under MAR.

Method

A general regression imputation is a two-phase procedure: first, a regression model is built using all the available complete observations, and then missing values are estimated based on the built regression model.

Little (1992) divided regression imputation into two categories: methods based on the independent variables only, and methods based on both the independent and dependent variables. The former just uses the information in the reported independent variables of a case to impute the missing independent variables, and Little recommends using Weighted Least Squares (WLS) regression (Little, 1992). But Gourieroux and Montfort (1981) and Conniffe (1983) noted that the estimation error in the regression coefficients introduces a correlation between the incomplete cases and further affects the best choice of weights and the consistency of the standard error estimates; they proposed using Generalised Least Squares (GLS) to address this problem. GLS uses both the dependent variable and the reported independent variables to impute missing values if the partial correlation of the dependent variable and a missing independent variable given the reported independent variables is high. However, Little (1992) noted that whether the missing independent variables are imputed using only reported independent variables or using both reported independent variables and the dependent variable, estimated standard errors of the regression coefficients from Ordinary Least Squares (OLS) or WLS on the filled-in data will tend to be too small. The reason is that the imputation error is not taken into account.
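A minimal OLS-based sketch of the two-phase procedure (our illustration; the variable names and data are invented) is:

import numpy as np

rng = np.random.default_rng(7)
x1 = rng.normal(100, 15, 80)
x2 = 0.8 * x1 + rng.normal(0, 5, 80)
x2[rng.random(80) < 0.2] = np.nan              # x2 partly missing

obs = ~np.isnan(x2)
A = np.column_stack([np.ones(obs.sum()), x1[obs]])
beta, *_ = np.linalg.lstsq(A, x2[obs], rcond=None)   # fit OLS on complete cases

x2_imp = x2.copy()
x2_imp[~obs] = beta[0] + beta[1] * x1[~obs]          # conditional mean fill-in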

Advantages and limitations

Regression imputation maintains the sample size by preserving the cases with missing data; it is better than mean imputation. But as the imputed data are always predicted from a regression model that needs to be specified, correlations and covariances are inflated. It may require large samples to produce stable estimates (Donner, 1982), and it still underestimates variance.

4.3 Hot-deck imputation

Hot-deck imputation is a procedure where the imputed values come from other cases in

the same data set.

Method

Traditional hot-deck imputation stratifies a data set into classes according to some

auxiliary variables, keeps complete cases within classes on an active file, and selects the

most similar one to fill in missing data (Ford, 1983).

There are two kinds of most-similar-case selection techniques: random (stochastic) and deterministic methods. The random method is the simplest hot-deck imputation; it randomly chooses observed values of the same variable from donor cases according to the predetermined auxiliary variables on which donors and donees must match (Rao and Shao, 1992; Reilly, 1993). If a matched class does not contain any observed value, the class is combined with other classes and imputation is performed based


on the combined classes. The deterministic methods include the Similar Response Pattern Imputation (SRPI) (Joreskog and Sorbom, 1993), which identifies the most similar case without missing values and copies the value(s) of this case to fill in the blanks in the cases with missing data; k-NN (k Nearest Neighbours) imputation (Fix and Hodges, 1952; Batista and Monard, 2003; Song et al., 2005), which searches for the k nearest neighbours of the case with the missing value and replaces the missing value with the average of the corresponding variable values of those k neighbours (see the sketch below); and multivariate matching, which matches donors and donees on several predetermined auxiliary variables. All of these belong to deterministic hot-deck imputation.
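The sketch below illustrates deterministic k-NN hot-deck imputation for one numeric variable; it is our simplification, using Euclidean distance on the remaining columns (assumed complete) and k = 3 by default, none of which the paper prescribes.

import numpy as np

def knn_impute(X, target_col, k=3):
    # X: numeric array with np.nan only in target_col; the other columns
    # serve as matching variables and are assumed fully observed.
    X = X.copy()
    miss = np.isnan(X[:, target_col])
    donors = X[~miss]
    other = [c for c in range(X.shape[1]) if c != target_col]
    for i in np.where(miss)[0]:
        d = np.linalg.norm(donors[:, other] - X[i, other], axis=1)
        nearest = np.argsort(d)[:k]            # indices of the k closest donors
        X[i, target_col] = donors[nearest, target_col].mean()
    return X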

Advantages and limitations

Hot-deck imputation preserves the population distribution by substituting different observed values for each missing value, and it maintains the proper measurement level of variables (categorical variables remain categorical and continuous variables remain continuous); it is better than mean imputation and regression imputation. But

“Hot-deck imputations do not explicitly take into account the likely possibility that probability for respondent and non respondent sub populations are different” (Chiu and Sedransk, 1986),

and they distort correlations and covariances. In addition, the donor data set must be large enough that suitable donor cases can be found.

Cold-deck imputation is similar to hot-deck imputation, but it copies in a value from a similar case in a historical data set rather than the current data set; it is useful for variables that are static.

4.4 Expectation Maximisation (EM)

The EM approach (Dempster et al., 1977; Ghahramani and Jordan, 1995) to missing data handling was first proposed by Dempster et al. (1977). It is a general method of finding maximum likelihood estimates of the parameters of an underlying distribution from an incomplete data set.

Method

The basic idea of EM is first to predict the missing values based on assumed values for the parameters, then use these predictions to update the parameters, and repeat until the sequence of parameters converges to the maximum likelihood estimates, as Figure 1 shows.

Suppose $Z_{obs}$ and $Z_{mis}$, respectively, are the observed and missing parts of a data set Z, that is $Z = \{Z_{obs}, Z_{mis}\}$, and the data set Z can be described by $p(Z|\Theta)$, a probability or density function governed by the set of parameters $\Theta$ (e.g., p might be a set of Gaussians and $\Theta$ could then be the means and covariances). With this density function, we define a likelihood function $L(\Theta|Z) = p(Z|\Theta)$. $L(\Theta|Z)$ is a function of the parameters $\Theta$ for given Z, whereas $p(Z|\Theta)$ is a function of Z for given $\Theta$. The method starts with an initial parameter estimate $\Theta^{(0)}$, and at each iteration $t + 1$, the following two steps are performed:

The first step of EM is the Expectation step (E-step); its purpose is to predict $Z_{mis}^{(t)}$ from $Z_{obs}$ and $\Theta^{(t)}$ by finding the expected value of the complete data log-likelihood $\log p(Z_{obs}, Z_{mis}|\Theta)$ with respect to the unknown data $Z_{mis}$, given the observed data $Z_{obs}$ and the current parameter estimates. That is,

$$Q(\Theta\,|\,\Theta^{(t)}) = E\big[\log p(Z_{obs}, Z_{mis}\,|\,\Theta)\ \big|\ Z_{obs}, \Theta^{(t)}\big],$$

where $\Theta^{(t)}$ are the current parameter estimates used to evaluate the expectation and $\Theta$ are the new parameters that will ultimately be optimised to maximise the likelihood Q.

Note that $Z_{obs}$ and $\Theta^{(t)}$ are constants, $\Theta$ is the variable we wish to adjust, and $Z_{mis}$ is a random variable governed by the distribution $f(Z_{mis}|Z_{obs}, \Theta^{(t)})$. So the expected likelihood can be rewritten as

$$Q(\Theta\,|\,\Theta^{(t)}) = \int_{Z_{mis}\in\gamma} \log p(Z_{obs}, Z_{mis}\,|\,\Theta)\; f(Z_{mis}\,|\,Z_{obs}, \Theta^{(t)})\,\mathrm{d}Z_{mis},$$

where $f(Z_{mis}|Z_{obs}, \Theta^{(t)})$ is the marginal distribution of the missing data, which depends on both the observed data and the current parameters, and $\gamma$ is the space of values $Z_{mis}$ can take on.

The second step of EM is the Maximisation step (M-step); its purpose is to maximise the expectation computed in the first step. Specifically, this step estimates $\Theta^{(t+1)}$ from $Z_{obs}$ and $Z_{mis}^{(t)}$. That is,

$$\Theta^{(t+1)} = \arg\max_{\Theta}\ Q(\Theta\,|\,\Theta^{(t)}).$$

Actually, we do not really need to maximise $Q(\Theta|\Theta^{(t)})$ with respect to $\Theta$ to increase the log-likelihood monotonically at each iteration. All that is needed is to find $\Theta^{(t+1)}$ such that $Q(\Theta^{(t+1)}|\Theta^{(t)}) \ge Q(\Theta^{(t)}|\Theta^{(t)})$. This is called Generalised EM (GEM).

These two steps are repeated until the parameter estimates $\Theta^{(1)}, \Theta^{(2)}, \ldots, \Theta^{(t+1)}$ converge to a local maximum of the likelihood function. For the rate-of-convergence problem, see Dempster et al. (1977), Wu (1983), Xu and Jordan (1996) and Jordan and Xu (1996) for details.

Figure 1 EM procedure
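As a concrete illustration (ours, not the paper's), the following sketch runs EM for a bivariate Gaussian in which the second variable is partly missing under MAR: the E-step computes expected sufficient statistics via the current regression of y on x, and the M-step re-estimates the mean and covariance from them.

import numpy as np

rng = np.random.default_rng(0)

# Invented bivariate data: x fully observed, y partly missing (np.nan).
x = rng.normal(50, 10, 200)
y = 2.0 * x + rng.normal(0, 5, 200)
y[rng.random(200) < 0.3] = np.nan
obs = ~np.isnan(y)

# Initial parameter estimates from the complete cases.
mu = np.array([x.mean(), y[obs].mean()])
cov = np.cov(x[obs], y[obs])

for _ in range(200):
    # E-step: expected sufficient statistics for the missing y's.
    beta = cov[0, 1] / cov[0, 0]             # slope of the regression of y on x
    resid = cov[1, 1] - beta * cov[0, 1]     # residual variance of y given x
    ey = np.where(obs, y, mu[1] + beta * (x - mu[0]))
    ey2 = np.where(obs, y ** 2, ey ** 2 + resid)  # E[y^2] adds residual variance
    # M-step: re-estimate the mean vector and covariance matrix.
    mu_new = np.array([x.mean(), ey.mean()])
    cxy = np.mean(x * ey) - mu_new[0] * mu_new[1]
    cov_new = np.array([[np.mean(x ** 2) - mu_new[0] ** 2, cxy],
                        [cxy, np.mean(ey2) - mu_new[1] ** 2]])
    if np.allclose(mu, mu_new) and np.allclose(cov, cov_new):
        break
    mu, cov = mu_new, cov_new

print(mu, cov, sep="\n")    # maximum likelihood estimates under MAR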


Advantages and limitations

The EM method has well-known statistical properties; it generally outperforms popular ad hoc methods of incomplete data handling such as LD, PD, and mean imputation. The reason is that it assumes that incomplete cases have data missing under MAR rather than MCAR. It is guaranteed to converge to a local maximum of the likelihood function; the rate of convergence depends on the fractions of missing information, with high missingness producing slow convergence. But EM adds no uncertainty component to the estimated data, and the estimation variability is ignored, which leads to underestimation of standard errors and confidence intervals. At the same time, EM must be implemented specially for each type of model. These limitations led to modifications and extensions of EM (McLachlan and Krishnan, 1997; Chan and Ledolter, 1995; Meng, 1994; Rubin, 1991; Wei and Tanner, 1990; Celeux and Diebolt, 1985) and to the RML and MI methods.

4.5 Raw Maximum Likelihood (RML)

RML, also known as Full Information Maximum Likelihood (FIML), is an efficient model-based method that uses all available cases in a data set to construct the best possible first- and second-order moment estimates for a multivariate normally distributed data set, under the MAR assumption.

Method

In this method, the parameters of the given model are estimated from all the available data, and the missing values are estimated based on the estimated parameters. Hartley and Hocking (1971) did the original work on RML. In their method, they first divide the incomplete data set into several classes, each corresponding to one missing data pattern. Then they calculate the likelihood for each class and sum these likelihoods. After that, they perform parameter estimation based on the summed likelihood. FIML is very similar to this method, but calculates the likelihood for each case with observed data. The following provides a formal introduction to the method.

Let n be the number of cases of a data set Z that includes observed data $Z_{obs}$ and missing data $Z_{mis}$. As in EM, the data set Z can be described by $p(Z|\Theta)$, a probability or density function governed by the set of parameters $\Theta$. We assume that the cases of data set Z are independent and identically distributed (i.i.d.) with distribution p. Therefore, the resulting density for the cases is

$$p(Z\,|\,\Theta) = \prod_{i=1}^{n} p(Z_i\,|\,\Theta).$$

Thinking of $Z = (Z_{obs}, Z_{mis})$ and letting $n_{obs}$ be the number of complete cases, we rewrite the resulting density as

$$p(Z\,|\,\Theta) = p(Z_{obs}, Z_{mis}\,|\,\Theta) = p(Z_{mis}\,|\,Z_{obs}, \Theta)\; p(Z_{obs}\,|\,\Theta).$$


The log-likelihood of $\Theta$ based on $Z_{obs}$ can then be written as

$$L(\Theta\,|\,Z_{obs}) = \log \prod_{i=1}^{n_{obs}} p(Z_{obs,i}\,|\,\Theta) = \sum_{i=1}^{n_{obs}} L(\Theta\,|\,Z_{obs,i}).$$

The FIML method first calculates the likelihood $L(\Theta|Z_{obs,i})$ for the observed portion of each case i (1 ≤ i ≤ $n_{obs}$). Then, it accumulates the likelihoods of all cases to form the likelihood for all the non-missing data and performs parameter estimation using maximum likelihood based on the summed likelihood. After that, FIML estimates the missing values based on the estimated parameters.
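A minimal FIML sketch for a bivariate normal model follows (our illustration, assuming scipy is available; the log/tanh reparameterisation is just a convenience to keep the covariance matrix valid). Each case contributes the density of whatever portion of it is observed.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)
Z = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], 300)
Z[rng.random(300) < 0.2, 1] = np.nan          # second variable partly missing

def neg_loglik(theta):
    mx, my, lsx, lsy, a = theta
    sx, sy = np.exp(lsx), np.exp(lsy)         # log-sds keep the sds positive
    r = np.tanh(a)                            # keeps the correlation in (-1, 1)
    cov = [[sx ** 2, r * sx * sy], [r * sx * sy, sy ** 2]]
    ll = 0.0
    for x, y in Z:
        if np.isnan(y):                       # contribute the observed margin
            ll += norm.logpdf(x, mx, sx)
        else:                                 # complete case: the joint density
            ll += multivariate_normal.logpdf([x, y], [mx, my], cov)
    return -ll

res = minimize(neg_loglik, x0=np.zeros(5), method="Nelder-Mead")
mx, my, lsx, lsy, a = res.x
print(mx, my, np.exp(lsx), np.exp(lsy), np.tanh(a))   # back-transformed MLEs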

Advantages and limitations

The FIML method produces unbiased parameter estimates and standard errors under MAR and MCAR (Arbuckle, 1996; Wothke, 2000). It is robust to data sets that do not comply completely with the multivariate normal distribution requirement (Boomsma, 1982), and it can offer performance superior to the LD and PD methods even under the NI missingness mechanism (Wothke, 2000). But FIML requires relatively large data sets and has limitations in small samples (Little, 1992). Furthermore, the likelihood equations need to be specifically worked out for a given distribution and estimation problem, and maximum likelihood can be sensitive to the choice of starting values. One potential problem is that the covariance matrix may be indefinite, which can lead to significant parameter estimation difficulties, although these problems are often modest (Wothke, 2000).

4.6 Sequential imputation

The sequential imputation method (Kong et al., 1993, 1994; Irwin et al., 1994), which imputes the missing values sequentially, was introduced by Kong et al. (1994) in the context of Bayesian missing data problems, and it has been applied to a variety of problems. Liu (1996) discusses the use of this algorithm to provide approximations to the quantities of interest in the context of his model for binary data. MacEachern et al. (1999) improve the performance of the original version by removing the locations entirely through integration.

Method

In the sequential imputation method, at each stage, the missing data are sampled from their posterior and the predictive probability is evaluated. Specifically, the complete cases are processed first, and the other cases are processed in order of increasing missingness, so that the missing values are imputed conditioned on as many of the observed values as possible. A formal introduction of the method follows.

Let $\Theta$ be the parameters of interest and n be the number of cases of the complete data set Z, which includes observed data $Z_{obs}$ and missing data $Z_{mis}$. Suppose the complete data posterior distribution $p(\Theta|Z)$ is simple. The main goal is to find the posterior distribution $p(\Theta|Z_{obs})$,

$$p(\Theta\,|\,Z_{obs}) = \int_{Z_{mis}} p(\Theta\,|\,Z)\, p(Z_{mis}\,|\,Z_{obs})\,\mathrm{d}Z_{mis}.$$


By drawing m independent copies of Zmis’s from the conditional distribution p(Zmis|Zobs),

we can approximate the posterior distribution p(Θ|Zobs) by

obs mis

1

1(| , ()),

m

i

pZZi

m=

Θ

∑

where Zmis(i) is the ith imputation for the missing part Zmis. Usually, this is an iterative

approximation procedure. But the method of Kong et al. (1994) avoids iterations by

sequentially imputing the Zmis,t that is the variable, which contains missing values, in the

tth observation and using importance sampling weights.

This method starts by drawing

mis,1

Z from p(Zmis,1|Zobs,1) and computing w1 = p(Zobs,1),

and for each remaining case t (2 ≤ t ≤ n), the following two steps must be performed

sequentially:

The first step draws $Z^{*}_{mis,t}$ from the conditional distribution

$$p\big(Z_{mis,t}\,\big|\,Z_{obs,1}, Z_{obs,2}, \ldots, Z_{obs,t},\ Z^{*}_{mis,1}, Z^{*}_{mis,2}, \ldots, Z^{*}_{mis,t-1}\big).$$

The second step calculates the predictive probability

$$p\big(Z_{obs,t}\,\big|\,Z_{obs,1}, Z_{obs,2}, \ldots, Z_{obs,t-1},\ Z^{*}_{mis,1}, Z^{*}_{mis,2}, \ldots, Z^{*}_{mis,t-1}\big)$$

and the importance sampling weight $w_t = w_{t-1}\, p(Z_{obs,t}\,|\,\cdots)$.

After all the cases have been processed, letting $w = w_n$, we have

$$w = p(Z_{obs,1}) \prod_{t=2}^{n} p\big(Z_{obs,t}\,\big|\,Z_{obs,1}, \ldots, Z_{obs,t-1},\ Z^{*}_{mis,1}, \ldots, Z^{*}_{mis,t-1}\big).$$

By independently repeating the procedure m times to draw m sets of imputations $Z_{mis}(i)\ (i \in \{1, 2, \ldots, m\})$ and corresponding weights $w(i)$, we can obtain the estimated posterior distribution $p(\Theta|Z_{obs})$,

$$\hat{p}(\Theta\,|\,Z_{obs}) = \frac{1}{\sum_{i=1}^{m} w(i)} \sum_{i=1}^{m} w(i)\; p(\Theta\,|\,Z_{obs}, Z_{mis}(i)).$$
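Sequential imputation is easiest to see on a toy model. The following sketch (ours; a Beta-Bernoulli model, not anything from Kong et al.) processes complete cases first, accumulates the importance weight from the predictive probabilities of observed values, imputes missing x's from their current posterior, and finally forms a weighted posterior estimate across the repeated runs.

import numpy as np

rng = np.random.default_rng(2)

# Invented binary pairs (x, y); x is missing (None) in the last three cases.
# Model: x ~ Bern(p), y | x ~ Bern(q[x]); Beta(1, 1) priors on p, q[0], q[1].
data = [(1, 1), (1, 1), (0, 0), (1, 0), (0, 0), (1, 1),
        (None, 1), (None, 0), (None, 1)]     # complete cases processed first

def sequential_run():
    ax = np.ones(2)          # Beta counts for x: [count of 0s, count of 1s]
    ay = np.ones((2, 2))     # Beta counts for y within x = 0 and x = 1
    w, imputed = 1.0, []
    for x, y in data:
        px1 = ax[1] / ax.sum()               # predictive p(x = 1)
        py1 = ay[:, 1] / ay.sum(axis=1)      # predictive p(y = 1 | x)
        if x is None:
            # weight update: predictive probability of the observed y
            py = px1 * py1[1] + (1 - px1) * py1[0]
            w *= py if y == 1 else 1 - py
            # impute x by drawing from its posterior given y
            p1 = px1 * (py1[1] if y else 1 - py1[1])
            p0 = (1 - px1) * (py1[0] if y else 1 - py1[0])
            x = int(rng.random() < p1 / (p0 + p1))
            imputed.append(x)
        else:
            w *= (px1 if x else 1 - px1) * (py1[x] if y else 1 - py1[x])
        ax[x] += 1
        ay[x, y] += 1
    return w, imputed

runs = [sequential_run() for _ in range(1000)]
wts = np.array([r[0] for r in runs])
# posterior mean of p(x = 1) per run: Beta(1, 1) prior + 4 ones among the
# 6 complete cases + the imputed values; then a weighted mixture across runs
means = np.array([(5 + sum(r[1])) / 11.0 for r in runs])
print(np.average(means, weights=wts))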

Advantages and limitations

The sequential imputation method can directly estimate the model likelihood and cheaply perform sensitivity and influence analyses. Sequential imputation, which is essentially also a blocking scheme, handles multiple loci very well, but only with zero or very few loops. Its limitation is that it requires $p(Z_1, Z_2, \ldots, Z_{t-1})$ and $p(\Theta|Z)$ to be simple.

4.7 General Iterative Principal (GIP) Component Analysis (PCA)

Principal Component Analysis (PCA) is one of the most widely used multivariate techniques; Dear (1959) suggested its use for imputation. Dear's Principal Component (DPC) imputation does not require any distributional assumptions, but it works poorly for data sets with few complete cases because it uses casewise deletion to calculate the correlation matrix. The General Iterative Principal (GIP) component method was proposed to avoid this problem and make DPC a general-purpose method.


Method

Suppose $Z = \{z_{ij}\}$ is an n × p data matrix that includes observed and missing data, $X = \{x_{ij}\}$ is the normalised Z, and let $R = \{r_{ij}\}$ be the missing data indicator matrix (see Section 2 for details). The GIP component method consists of an initialisation step and an iteration procedure.

The initialisation step calculates the correlation matrix S, which can be obtained by two methods. The first method uses all available data to compute S, and modifies it with the Huseby et al. (1980) algorithm if it is non-positive definite. The other method applies mean imputation to fill in all missing values and uses n − m − 1 instead of n − 1 as the denominator in the variance–covariance calculations to obtain S, where m is the number of cases with missing values.

The iteration procedure involves three steps that are performed iteratively until successive imputed values do not change materially.

The first step calculates the largest eigenvalue $\lambda_1$ of S and its associated eigenvector $\eta_1 = (\eta_{11}, \eta_{12}, \ldots, \eta_{1p})$.

The second step performs the following actions for all cases with missing values. Let the first principal component score of the ith case, computed from its observed values, be

$$\gamma_{1i} = \sum_{j=1}^{p} \eta_{1j}\,(1 - r_{ij})\,x_{ij},$$

so that the points on the first principal component line closest to the ith case replace its missing values:

$$\hat{x}_{ij} = \begin{cases} x_{ij} & \text{if } r_{ij} = 0, \\ \eta_{1j}\,\gamma_{1i} & \text{if } r_{ij} = 1. \end{cases}$$

The third step converts $\hat{X}$ back to $\hat{Z}$ and recalculates S from the imputed data matrix $\hat{Z}$.
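A compact numpy sketch of the GIP iteration (ours; it imputes with the rank-1 reconstruction from the leading eigenvector, as described above, on invented data) is:

import numpy as np

rng = np.random.default_rng(3)
Z = rng.multivariate_normal([5, 10, 20], np.eye(3) + 0.8, size=50)
Z[rng.random(50) < 0.15, 2] = np.nan
R = np.isnan(Z).astype(int)                   # 1 = missing, as in Section 2

# Initialisation: mean-impute, then normalise (the mean of a normalised
# column is 0, so the holes start at 0) and iterate.
X = (Z - np.nanmean(Z, axis=0)) / np.nanstd(Z, axis=0)
X[R == 1] = 0.0

for _ in range(100):
    S = np.corrcoef(X, rowvar=False)
    vals, vecs = np.linalg.eigh(S)
    eta = vecs[:, -1]                         # eigenvector of the largest eigenvalue
    gamma = (X * (1 - R)) @ eta               # PC scores from observed entries
    X_new = np.where(R == 1, np.outer(gamma, eta), X)
    if np.allclose(X_new, X, atol=1e-8):
        break
    X = X_new

Z_hat = X * np.nanstd(Z, axis=0) + np.nanmean(Z, axis=0)   # original scale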

Advantages and limitations

General iterative PCA imputation is a general-purpose imputation method; it does not require any distributional assumptions, but its iterative procedure limits its performance.

4.8 Singular Value Decomposition (SVD)

The Singular Value Decomposition (SVD) method offers an interesting and stable

method for the imputation of missing values. It is easy to compute and can be used in a

simple way to impute missing data (Krzanowski, 1988).

Method

Let $Z = \{z_{ij}\}$ be an n × p data matrix that includes observed and missing data. The SVD method starts by filling in all the missing values with initial imputed values, such as the mean. Then, for each missing value $z_{ij}$, the following iterative procedure is performed until stable imputed values are achieved.

The first step excludes the ith case from Z and calculates the SVD of the remaining (n − 1) × p data matrix, denoted by $Z_{-i} = \bar{U}\bar{D}\bar{V}'$, with $\bar{U} = \{\bar{u}_{st}\}$, $\bar{V} = \{\bar{\upsilon}_{st}\}$ and $\bar{D} = \mathrm{diag}\{\bar{d}_1, \bar{d}_2, \ldots, \bar{d}_p\}$, where $\bar{U}$ and $\bar{V}$ are orthonormal matrices.


The second step excludes the jth variable from Z and calculates the SVD of the remaining n × (p − 1) data matrix, denoted by $Z_{-j} = \tilde{U}\tilde{D}\tilde{V}'$, with $\tilde{U} = \{\tilde{u}_{st}\}$, $\tilde{V} = \{\tilde{\upsilon}_{st}\}$ and $\tilde{D} = \mathrm{diag}\{\tilde{d}_1, \tilde{d}_2, \ldots, \tilde{d}_{p-1}\}$.

The third step imputes $z_{ij}$ with

$$\hat{z}_{ij} = \sum_{k=1}^{p-1} \big(\tilde{u}_{ik}\,\tilde{d}_k^{1/2}\big)\big(\bar{\upsilon}_{jk}\,\bar{d}_k^{1/2}\big)$$

and updates $z_{ij}$'s current imputed value with $\hat{z}_{ij}$.
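The following numpy sketch (ours) applies the Krzanowski-style update to a single missing entry z[i, j]; extending it to several holes means looping the update over all of them.

import numpy as np

def svd_impute_once(Z, i, j, iters=20):
    Z = Z.copy()
    Z[i, j] = np.nanmean(Z[:, j])                 # initial imputed value
    for _ in range(iters):
        U_bar, d_bar, Vt_bar = np.linalg.svd(np.delete(Z, i, axis=0),
                                             full_matrices=False)
        U_til, d_til, Vt_til = np.linalg.svd(np.delete(Z, j, axis=1),
                                             full_matrices=False)
        k = min(len(d_bar), len(d_til))           # at most p - 1 components
        z_hat = np.sum(U_til[i, :k] * np.sqrt(d_til[:k])
                       * Vt_bar[:k, j] * np.sqrt(d_bar[:k]))
        if np.isclose(z_hat, Z[i, j]):
            break
        Z[i, j] = z_hat
    return Z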

Advantages and limitations

In practice, this algorithm converges quite rapidly, typically in five or six iterations. Run-time can be linear in the number of non-zero elements in the data, but the result is very sensitive to the choice of regression method and the order of imputations, and the SVD-based method shows a sharp deterioration in performance when a non-optimal fraction of missing values is used (Troyanskaya et al., 2001).

4.9 General comments on single imputation from KDD perspective

Single imputation is a family of methods. Some of them explicitly require large samples to produce stable estimates or to obtain proper donor cases, such as regression imputation, hot-deck imputation, and FIML imputation. This characteristic makes them capable of processing the large data sets that are used for KDD or machine learning.

At the same time, regression imputation, hot-deck imputation, sequential imputation, general iterative principal component analysis, and SVD do not assume particular missingness mechanisms, so they should be more easily applied to the preprocessing tasks of KDD and machine learning. In contrast, some of the single imputation techniques require data to be missing under a specific missingness mechanism. However, in practice it is usually hard to test the missingness mechanisms precisely. Therefore, before applying a specific imputation method to KDD or machine learning, empirical research on the safest default missingness mechanism assumption for the method should be done. Song et al. (2005) have done this work for k-NN and class mean imputation; they found that MAR is the safest default missingness mechanism assumption for both imputation methods in the context of software project effort prediction.

5 Multiple Imputation (MI)

MI (Rubin, 1977, 1978, 2004) is a valid method for handling missing data under the MAR and multivariate normality assumptions; it retains all the major advantages of single imputation and rectifies its major disadvantages (Rubin, 1988). By imputing missing data m times, it introduces statistical uncertainty into the model and uses this uncertainty to emulate the sampling variability of a complete data set. MI is also a general-purpose method, highly efficient even for small sample sizes (Graham and Schafer, 1999). But each application of MI produces slightly different imputed results, so the results cannot be replicated exactly, and the situation becomes worse as the amount of missing data increases. MI is also time intensive, imputing 5–10 data sets, and different types of imputation models need different result combination methods, which limits model selection.

5.1 General procedure

MI was first proposed by Rubin (1977, 1978, 2004) as a method for handling missing data in surveys and was later elaborated in his book (Rubin, 1987). It seems to be one of the most promising methods for general-purpose handling of missing data in multivariate analysis, because MI has several desirable features:

• The fact that MI introduces appropriate random error into the imputation process makes it possible to get approximately unbiased estimates of all parameters. No deterministic imputation method can do this in general settings (Allison, 2000).

• MI can be used with any kind of data and any kind of analysis without specialised software. Repeated imputation allows one to get good estimates of the standard errors (Allison, 2000).

• MI can be highly efficient even for small values of m. In many applications, just 3–5 imputations are sufficient to obtain excellent results (Schafer and Olsen, 1998).

Figure 2 illustrates the MI procedure, which consists of three steps:

• imputation

• analysis

• combination.

Figure 2 Multiple imputation procedure


• Imputation. The most challenging step, which imputes the missing values m > 1 times. Imputed values are drawn from the distribution of the incomplete data set, and each drawn value will be different for the same missing item. The result of this step is m complete data sets. A variety of imputation models can be used; the choice of imputation model depends on the assumptions regarding the missingness mechanisms and patterns, as well as the data distribution. See the next subsection for details.

• Analysis. The repeated analysis step on the imputed data. This step analyses each of the m complete data sets using a standard complete data method, which should be compatible with the imputation model used by the imputation step (Schafer, 1997). The result of this step is m analysis results.

• Combination. This step integrates the m analysis results into a final result for the inference. It consists of computing the mean over the m repeated analyses, the corresponding variance, and the confidence interval or p value. Simple rules exist for combining the m analysis results, as sketched after this list.
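For the Combination step, Rubin's (1987) rule pools the m point estimates and their variances as in this minimal Python sketch (the numbers are invented):

import numpy as np

# Per-data-set results: point estimates q[i] and their squared standard
# errors u[i] from analysing each of the m imputed data sets.
q = np.array([2.31, 2.48, 2.27, 2.55, 2.40])
u = np.array([0.110, 0.098, 0.120, 0.105, 0.101])
m = len(q)

q_bar = q.mean()                     # combined point estimate
u_bar = u.mean()                     # within-imputation variance
b = q.var(ddof=1)                    # between-imputation variance
t = u_bar + (1 + 1 / m) * b          # total variance (Rubin, 1987)
print(f"pooled estimate = {q_bar:.3f}, standard error = {np.sqrt(t):.3f}")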

Among these three steps, Combination is the simplest; it just calculates a single point estimate from the m point estimates of the Analysis step according to Rubin's (1987) rule, sketched above. Unlike the relatively independent Combination step, the Imputation and Analysis steps depend on each other to some extent. In other words, the analysis to be performed on the imputed data sets should be compatible with the imputation model (Schafer, 1997), and the imputation model used to generate the imputed values must be correct in some sense. Rubin (1987, 1996) has described this requirement.

5.2 Imputation model

The imputation model is used to generate the MIs. The MI method assumes that the imputation model is the same as the analysis model. In practice, however, these two models may not be identical, but at the least the imputation model should be compatible with the intended analysis.

The imputation model involves two aspects: the variables and the imputation method. For the variables, in general, the imputation model should contain all the variables that are used by the analysis model. However, there is no reason not to include other variables that are not used in the analysis. It is also valuable to include additional variables that are highly correlated with the variables with missing data, and variables that are highly predictive of the missingness, even if they are not used by the analysis model.

For the imputation method: to properly reflect sampling variability when creating repeated imputations under a model, and as a result lead to valid inferences, MI requires the imputation method to be proper. Imputation methods that incorporate appropriate variability among the repetitions within a model are called proper, which is defined precisely in Rubin (1987). In some complex situations, however, it is hard to find a proper imputation (Fay, 1991, 1993; Rao, 1996). Fay (1992, 1993) explained that the MI variance is a biased estimator for domains that are not part of the imputer's model. For more discussion on the relationship between the imputation model and the analysis model, see Meng (1994), Rubin (1996) and Schafer (1997) for details.


Generally, the choice of imputation method depends on both the missingness mechanism and the missingness pattern. For example, for a monotone missing data pattern, a parametric regression method is appropriate; for an arbitrary missing data pattern, a Markov chain Monte Carlo (MCMC) method can be used. Meanwhile, imputation methods must be proper, that is, they must be compatible with the analysis methods (Schafer, 1997). Unfortunately, not all imputation methods are proper. But some imputation methods, such as the parametric regression imputation method, the Bayesian Bootstrap (BB), the ABB method, the propensity score method (Rosenbaum and Rubin, 1983), and MCMC methods, are known to be proper and are introduced as follows.

The BB was first introduced by Rubin (1981) as a variation of the bootstrap (Efron, 1979).

Suppose there is a univariate data set X = {x1, x2, …, xn}, where the first a < n values are reported and the remaining n − a values are missing. BB imputation first draws a − 1 uniform random numbers between 0 and 1, letting their ordered values be {r1, r2, …, ra−1}, with r0 = 0 and ra = 1. It then draws each of the n − a missing values from x1, x2, …, xa with probabilities (r1 − r0), (r2 − r1), …, (ra − ra−1) as the corresponding imputed value.

The Approximate Bayesian Bootstrap (ABB) was proposed by Rubin and Schenker (1986) as a way of generating MIs; it is an approximation of the BB (Rubin, 1981). This method supposes that the sample can be regarded as independently and identically distributed, and that the missingness mechanism is ignorable.

Suppose there is a univariate data set X = {x1, x2, …, xn}, where the first a < n values are reported and the remaining n − a values are missing. We can view ABB as a two-step bootstrap (see the sketch after the ABB/BB comparison below):

• with equal probabilities of selection, generate the donor set Xa by sampling a values with replacement from the reported values {x1, x2, …, xa}

• again with equal probabilities of selection, create the set of imputed data $\hat{X}_{n-a}$ by sampling n − a values with replacement from the donor set Xa.

In the case when the missingness of X is related to covariates Y = {y1, y2, …, ym}, if the covariates are discrete and a is large enough, the data set X is first partitioned into cells corresponding to unique patterns of y1, y2, …, ym, and then ABB is carried out within each cell. If the covariates are continuous or m is large, the data set X is instead partitioned into cells defined by a coarse grouping of the estimated response propensities, which are modelled by logistic regression on the covariates Y. See Lavori et al. (1995) for details.

The ABB method introduces additional variation by drawing imputations from a resample of the observed data set instead of drawing directly from the observed data set itself; this makes it proper for MI. However, Allison (2000) and Schafer and Olsen (1998) have noted that using the ABB method for the calculation of MIs may produce misleading findings, particularly for regression analysis. Kim (2002) investigated the finite sample properties of the ABB variance estimator; the results show that the bias is not negligible for moderate samples. He also proposed a modification of the method that reduces the bias of the variance estimator.

Both the ABB and BB distributions have the same means and correlations, but the variances for the ABB method are (1 + 1/a) times the variances for the BB method.
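Both resampling schemes fit in a few lines of Python; the sketch below (ours, with invented data) draws n − a imputed values by BB and by ABB.

import numpy as np

rng = np.random.default_rng(4)

x_obs = np.array([3.1, 4.7, 2.9, 5.5, 4.1, 3.8])   # the a reported values
n_mis = 4                                           # number of missing values
a = len(x_obs)

def bb_impute():
    # Bayesian Bootstrap: gaps between sorted uniforms give the probabilities.
    r = np.sort(rng.random(a - 1))
    probs = np.diff(np.concatenate(([0.0], r, [1.0])))
    return rng.choice(x_obs, size=n_mis, p=probs)

def abb_impute():
    # Approximate Bayesian Bootstrap: resample a donor set, then draw from it.
    donors = rng.choice(x_obs, size=a, replace=True)
    return rng.choice(donors, size=n_mis, replace=True)

print("BB draws: ", bb_impute())
print("ABB draws:", abb_impute())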


Propensity score method (Rosenbaum and Rubin, 1983)

The propensity score is the estimated probability that a data item is missing. In the propensity score method, a missing value is filled in by sampling from the cases that have similar propensity scores. Specifically, for each variable with missing values, the first step is to estimate a propensity score for each case, indicating the probability of that value being missing; the most common way to do this is logistic regression. The cases are then classified into several clusters based on these propensity scores, and an ABB imputation (Rubin and Schenker, 1986) is applied within each cluster (Lavori et al., 1995).
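One possible realisation of this procedure is sketched below; the quintile grouping, the scikit-learn logistic regression, and all names are our illustrative choices rather than anything prescribed by the method:

import numpy as np
from sklearn.linear_model import LogisticRegression

def propensity_score_impute(x, covariates, n_groups=5, seed=None):
    # x: 1-D array with np.nan marking missing; covariates: 2-D array.
    rng = np.random.default_rng(seed)
    x = x.copy()
    missing = np.isnan(x)
    # Propensity scores: P(value is missing | covariates).
    model = LogisticRegression().fit(covariates, missing.astype(int))
    scores = model.predict_proba(covariates)[:, 1]
    # Coarse grouping of the scores, here into quantile classes.
    edges = np.quantile(scores, np.linspace(0, 1, n_groups + 1))
    groups = np.clip(np.searchsorted(edges, scores, side='right') - 1,
                     0, n_groups - 1)
    for g in range(n_groups):
        in_g = groups == g
        donors = x[in_g & ~missing]
        n_mis = int((in_g & missing).sum())
        if n_mis == 0 or donors.size == 0:
            continue  # nothing to impute, or no donors in this class
        # ABB within the class: resample donors, then draw imputations.
        donor_set = rng.choice(donors, size=donors.size, replace=True)
        x[in_g & missing] = rng.choice(donor_set, size=n_mis, replace=True)
    return x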

This method is effective for inferences about the distributions of individual imputed variables, but it is not appropriate for analyses involving relationships among variables (Yuan, 2000; Allison, 2000).

MCMC (Gilks et al., 1996) originated in physics as a tool for exploring the equilibrium distributions of interacting molecules. In statistics, it has been used to create large numbers of random draws of parameters from Bayesian posterior distributions under complicated parametric models. More recently, it has been used to create a small number of independent draws of the missing data from a predictive distribution via a Markov chain; these draws are then used for MI inference.

A Markov chain is a sequence of random variables in which the distribution of each element depends only on the previous one: for a Markov chain {Xt : t = 1, 2, …}, all the variables must satisfy p(Xt | X0, X1, …, Xt–1) = p(Xt | Xt–1). The chain is fully specified by the starting distribution p(X0) and the transition rule p(Xt | Xt–1).
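For instance, a toy two-state chain makes the transition rule concrete (the transition matrix and seed below are arbitrary):

import numpy as np

# P[i, j] = p(X_t = j | X_{t-1} = i): the next state depends
# only on the current one.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
rng = np.random.default_rng(0)
x, path = 0, [0]  # starting distribution: all mass on state 0
for _ in range(10_000):
    x = rng.choice(2, p=P[x])
    path.append(x)
# The empirical state frequencies stabilise to the chain's
# stationary distribution, here (0.8, 0.2).
print(np.bincount(path) / len(path))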

Data Augmentation (DA) (Tanner and Wong, 1987; Tanner, 1991, 1993), which was adapted by Schafer (1997) for the purpose of generating Bayesian proper imputations, generates a Markov chain by starting with a random imputation of the missing values. It is an MCMC method consisting of the following two steps.

The first step is the imputation I-step, which draws imputations for the missing values from the predictive distribution of the data given the current parameter estimates. That is, it draws $Z_{\mathrm{mis}}^{(t+1)}$ from $p(Z_{\mathrm{mis}} \mid Z_{\mathrm{obs}}, \Theta^{(t)})$, where t denotes the tth iteration. The first iteration of the I-step requires initial parameter estimates, which can be generated using the EM algorithm (Dempster et al., 1977). After the first iteration, the parameter estimates are drawn in the P-step, the second step.

The second step is the posterior P-step, which is in effect a parameter estimation step: it draws new parameter estimates from the Bayesian posterior distribution based on both the observed and the imputed data. That is, it draws $\Theta^{(t+1)}$ from $p(\Theta \mid Z_{\mathrm{obs}}, Z_{\mathrm{mis}}^{(t+1)})$. In general, non-informative priors on the parameters are used for large samples, while informative priors may be used for smaller or sparse samples.

These two steps are iterated until the resulting Markov chain $(Z_{\mathrm{mis}}^{(1)}, \Theta^{(1)}), (Z_{\mathrm{mis}}^{(2)}, \Theta^{(2)}), \ldots, (Z_{\mathrm{mis}}^{(t+1)}, \Theta^{(t+1)}), \ldots$ eventually stabilises, that is, converges in distribution to $p(Z_{\mathrm{mis}}, \Theta \mid Z_{\mathrm{obs}})$. The distribution of the parameters stabilises to a posterior distribution that averages over the missing data, whereas the distribution of the missing data stabilises to a predictive distribution (Schafer and Olsen, 1998).
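As an illustration, here is a minimal DA sketch for a univariate normal sample with values missing at random, under a non-informative prior; the function name, starting values, and prior are our choices for the sketch, not part of the general method:

import numpy as np

def data_augmentation(x, n_iter=1000, seed=None):
    # Minimal DA for a univariate normal sample with np.nan marking
    # missing values (assumed MAR). Returns the (mu, sigma2) chain
    # and the final completed data set.
    rng = np.random.default_rng(seed)
    x = x.astype(float).copy()
    mis = np.isnan(x)
    n = x.size
    # Starting values; the text suggests EM, and observed-data
    # moments are a simpler stand-in here.
    mu, sigma2 = np.nanmean(x), np.nanvar(x)
    chain = []
    for _ in range(n_iter):
        # I-step: draw Z_mis from p(Z_mis | Z_obs, theta).
        x[mis] = rng.normal(mu, np.sqrt(sigma2), size=mis.sum())
        # P-step: draw theta from its posterior given the completed
        # data (sigma2 ~ scaled inverse chi-square, mu | sigma2 normal).
        xbar, s2 = x.mean(), x.var(ddof=1)
        sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)
        mu = rng.normal(xbar, np.sqrt(sigma2 / n))
        chain.append((mu, sigma2))
    return chain, x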

As with EM, the rate of convergence of DA depends on the fractions of missing information: the larger the fractions of missing information, the slower the convergence. Unlike EM, however, where the parameter estimates no longer change from one iteration to the next at convergence, when DA converges it is the distribution of the parameters that no longer changes from one iteration to the next, although the random parameter values themselves do continue to change. For this reason, assessing the convergence of DA is much more complicated than for EM (Schafer and Olsen, 1998).

There are two ways of using DA to create m MIs for the missing values. The first uses a single Markov chain to draw the MIs: supposing DA converges by the tth iteration, one performs a single run of length mt and stores the completed data sets from iterations t, 2t, …, mt. The second runs m independent Markov chains and takes one completed data set from each; it is clearly preferable.
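For example, with the hypothetical data_augmentation sketch above, the second option is one line per chain:

import numpy as np

x = np.array([2.1, np.nan, 3.4, 1.9, np.nan, 4.2, 3.0, 2.8, np.nan, 3.7])
m = 5
# m independent chains, each contributing one completed data set.
imputed_sets = [data_augmentation(x, n_iter=500, seed=s)[1] for s in range(m)]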

MCMC can be used with both arbitrary and monotone missing data patterns; it assumes multivariate normal data under the MAR missingness mechanism assumption.

5.3 The combination of analysis results

Rubin (1987) proposed a method, referred to as Rubin's rule, for combining the m results of the Analysis step into a single set of results. Rubin's rule is described as follows.

Suppose m ≥ 2 imputations and analyses have been performed, and all m estimates and standard errors have been calculated and saved. For the ith (i = 1, 2, …, m) analysis, let $\hat{\Theta}_i$ be the estimate of a particular parameter of interest $\Theta$ and let $\hat{\upsilon}_i$ be its estimated variance. The MI estimate $\hat{\Theta}$ of $\Theta$ is

$$\hat{\Theta} = \frac{1}{m}\sum_{i=1}^{m}\hat{\Theta}_i.$$

The corresponding estimated variance $\hat{\upsilon}$ has two components that take into account variability within each data set and across data sets: it is the sum of the within-imputation variance $s_w$ and the between-imputation variance $s_b$, with an additional correction factor to account for the simulation error in $\hat{\Theta}$,

$$\hat{\upsilon} = s_w + \left(1 + \frac{1}{m}\right) s_b,$$

where the within-imputation variance

$$s_w = \frac{1}{m}\sum_{i=1}^{m}\hat{\upsilon}_i$$

is simply the average of the estimated variances.

The between-imputation variance

$$s_b = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{\Theta}_i - \hat{\Theta}\right)^2$$

is the sample variance of the estimates themselves.

Note that if there were no missing data, the $\hat{\Theta}_i$ (i = 1, 2, …, m) would be identical, $s_b$ would be 0, and $\hat{\upsilon}$ would simply equal $s_w$; the ratio $s_b/s_w$ indicates how much information is missing.


Final estimates are then presented with confidence limits. Confidence intervals are calculated as

$$\hat{\Theta} \pm t_{df}\sqrt{\hat{\upsilon}},$$

where $t_{df}$ follows a t distribution whose degrees of freedom $df$ are calculated as

$$df = (m-1)\left[1 + \frac{s_w}{\left(1 + m^{-1}\right)s_b}\right]^2.$$

The efficiency of a parameter estimate based on m imputations is

$$\left(1 + \frac{\gamma}{m}\right)^{-1},$$

where $\gamma$ is the rate of missing information (Rubin, 1987), defined as

$$\gamma = \frac{r + 2/(df + 3)}{r + 1},$$

in which

$$r = \frac{\left(1 + m^{-1}\right)s_b}{s_w}$$

is the relative increase in variance due to missingness.
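These combining rules are mechanical enough to state as code; in the sketch below, the function name and example figures are ours:

import numpy as np
from scipy import stats

def pool(estimates, variances, alpha=0.05):
    # Combine m completed-data analyses with Rubin's rule.
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    theta = estimates.mean()              # MI point estimate
    s_w = variances.mean()                # within-imputation variance
    s_b = estimates.var(ddof=1)           # between-imputation variance
    v = s_w + (1 + 1 / m) * s_b           # total variance
    r = (1 + 1 / m) * s_b / s_w           # relative increase in variance
    df = (m - 1) * (1 + s_w / ((1 + 1 / m) * s_b)) ** 2
    gamma = (r + 2 / (df + 3)) / (r + 1)  # rate of missing information
    half = stats.t.ppf(1 - alpha / 2, df) * np.sqrt(v)
    return theta, v, (theta - half, theta + half), gamma

# Example: m = 5 hypothetical estimates and variances.
print(pool([10.1, 9.8, 10.4, 10.0, 9.9],
           [0.25, 0.30, 0.22, 0.28, 0.26]))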

Table 5 relates the missing data rate $\gamma$ to the number of imputations m through the efficiency of recovery of the true parameter. It is clear from Table 5 that for low missing data rates (30% or less), no more than m = 3 imputations are needed to be at least 91% efficient, whereas higher missing data rates call for m = 5 or m = 10 imputations. The table can thus be used to determine how many imputations are needed for a given missing data rate.

Table 5 Efficiency of MI for γ missing data and m imputations

                              γ
  m    10%   20%   30%   40%   50%   60%   70%   80%   90%
  3     97    94    91    88    86    83    81    79    77
  5     98    96    94    93    91    89    88    86    85
 10     99    98    97    96    95    94    93    93    92
 20    100    99    99    98    98    97    97    96    96
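Each entry is simply the efficiency formula evaluated and rounded, so the table can be reproduced in a few lines:

# Table 5 entries are 100 * (1 + gamma/m)^(-1), rounded.
for m in (3, 5, 10, 20):
    row = [round(100 / (1 + g / m)) for g in
           (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)]
    print(m, row)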

5.4 General comments on MI from the KDD perspective

Although nothing in the theory of MI requires the data to be missing under MAR, almost all MI analyses have assumed that the missing data are MAR. At the same time, a few NI applications have been published (Glynn et al., 1993; Verbeke and Molenberghs, 2000), and new methods for generating MIs under NI will certainly arrive in the future (Schafer and Graham, 2002). This gives the MI techniques wider application than the single imputation techniques. MI does rely on large-sample approximations, but since KDD and machine learning typically deal with large data sets, this also makes it well suited to serving them.

However, MI generates more than one data set, which raises another problem when it is used for KDD or machine learning: how should the multiple knowledge bases induced from the m different data sets be integrated? For this question we have two solutions:

• if the knowledge is presented in the form of a scalar parameter, such as the effort of a software project, simply apply Rubin's rule to it; otherwise

• apply Rubin's rule to the m imputed complete data sets to obtain a single data set (a sketch follows the list), and then induce knowledge from it with KDD or machine learning methods.
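For numeric variables, the second solution amounts to applying Rubin's point-estimate rule cell-wise, i.e., averaging the m completed data sets; a sketch with made-up values:

import numpy as np

# imputed_sets would come from the MI step; tiny made-up arrays here.
imputed_sets = [np.array([[2.1, 3.4], [1.9, 4.0]]),
                np.array([[2.3, 3.4], [1.9, 4.4]]),
                np.array([[2.0, 3.4], [1.9, 4.1]])]
single_data_set = np.mean(np.stack(imputed_sets), axis=0)
print(single_data_set)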

6 Concluding remarks

Data quality is a major concern in the intelligent data analysis field. MDITs have been used to improve data quality in statistics for many years, and these techniques can equally be applied to the preprocessing task of intelligent data analysis. When MDITs are used to clean data sets for KDD or machine learning purposes, the following procedure is recommended.

• Choosing variables. The selection of variables to be imputed depends on both the specific analysis task and the imputation technique; not all missing data need to be imputed. As Rubin (1986) and Griliches (1986) have noted, if the missing item is unrelated to the dependent variable, one may proceed by ignoring the missing data, in which case we may be satisfied with point estimates, which may or may not be efficient. Variables whose missing data are tightly related to the analysis task must undoubtedly be included in the imputation list; variables that can help to impute missing data in other variables, or that can help the analysis task itself, should also be imputed.

• Assessing the missing data percentage. An imputation method may work properly at a low missing data rate but not at a high one, and some methods only work well for specific missing data percentages. Although the performance of a given imputation method varies with the missing data rate, it is generally accepted that data sets with more than 40% missing data are not useful for detailed analysis (Strike et al., 2001); imputing data sets with more than 40% missing data using these imputation methods is therefore not recommended.

• Testing the missing data pattern and missingness mechanism. Both missing data patterns and missingness mechanisms have a great impact on the selection of imputation methods to deal with missing data. Some imputation methods only work well under MCAR, while others make no missingness mechanism assumption at all; the latter are the most suitable for intelligent data analysis. At the same time, different imputation methods have different missing data pattern requirements; for example, the propensity score method and the parametric regression method are appropriate for a monotone missing data pattern.


• Deciding on the imputation method. Selecting the proper method to best treat missing data depends not only on the type of the variables, the missing data rate, the missing data pattern, and the missingness mechanism, but also on the context of the specific analysis. Although there are many techniques for dealing with missing data, none is absolutely better than the others: different situations require different solutions. Missing data imputation must be developed within the context of the specific analysis; different analysts are concerned with different contexts, and no single set of imputations can satisfy all interests. For example, some methods work well for speech recognition, while others are appropriate for image processing, and so on.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under grant 60673124 and the Hi-Tech Research and Development Program of China under grant 2006AA01Z1. The authors thank the anonymous reviewers and the Editor for their insightful and helpful comments.

References

Aggarwal, C.C. and Parthasarathy, S. (2001) ‘Mining massively incomplete data sets by conceptual

reconstruction’, Proceedings of the Seventh ACM SIGKDD Conference on Knowledge

Discovery and Data Mining, San Francisco, California, USA, pp.227–232.

Allison, P. (2000) ‘Multiple imputation for missing data: a cautionary tale’, Sociological Methods

and Research, Vol. 28, No. 3, pp.301–309.

Amemiya, T. (1984) ‘Tobit models: a survey’, Journal of Econometrics, Vol. 24, pp.3–61.

Arbuckle, J.L. (1996) ‘Full information likelihood estimation in the presence of incomplete data’,

in Marcoulides, G.A. and Schumaker, R.E. (Eds.): Advanced Structural Equation Modeling,

Lawrence Erlbaum, Mahwah, NJ, pp.243–277.

Batista, G.E.A.P.A. and Monard, M.C. (2003) ‘An analysis of four missing data treatment methods

for supervised learning’, Applied Artificial Intelligence, Vol. 17, Nos. 5–6, pp.519–533.

Becker, W.E. and Walstad, W.B. (1990) ‘Data loss from pretest to posttest as a sample selection

problem’, The Review of Economics and Statistics, Vol. 72, No. 1, pp.184–188.

Boomsma, A. (1982) On Robustness of LISREL (Maximum Likelihood Estimation) Against Small Sample Sizes and Non-Normality, Sociometric Research Foundation, Amsterdam.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and Regression

Trees, Wadsworth International Group, Belmont, CA.

Celeux, G. and Diebolt, J. (1985) 'The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for a mixture problem', Computational Statistics Quarterly, Vol. 2, pp.73–82.

Chan, K.S. and Ledolter, J. (1995) 'Monte Carlo EM estimation for time series models involving counts', Journal of the American Statistical Association, Vol. 90, pp.242–252.

Chen, S.Y. and Liu, X. (2005) ‘Data mining from 1994 to 2004: an application-orientated review’,

International Journal of Business Intelligence and Data Mining, Vol. 1, No. 1, pp.4–21.

Chiu, H.Y. and Sedransk, J. (1986) 'A Bayesian procedure for imputing missing values in sample surveys', Journal of the American Statistical Association, Vol. 81, No. 395, pp.667–676.

Cohen, M.P. (1996) ‘A new approach to imputation’, American Statistical Association Proceedings

of the Section on Survey Research Methods, pp.293–298.


Conniffe, D. (1983) ‘Comments on the weighted regression approach to missing values’, Economic

and Social Review, Vol. 14, pp.259–272.

Dear, R.E. (1959) A Principal-Component Missing Data Method for Multiple Regression Models, Technical Report SP-86, System Development Corporation, Santa Monica, CA.

DeGruttola, V. and Tu, X.M. (1994) 'Modeling the progression of CD4 lymphocyte count and its relationship to survival time', Biometrics, Vol. 50, pp.1003–1014.

Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) ‘Maximum likelihood from incomplete data

via EM algorithm’, Journal of the Royal Statistical Society Series B, Vol. 39, No. 1, pp.1–38.

Diggle, P. and Kenward, M.G. (1994) ‘Informative drop-out in longitudinal data analysis’, Applied

Statistics, Vol. 43, pp.49–93.

Donner, A. (1982) ‘The relative effectiveness of procedures commonly used in multiple regression

analysis for dealing with missing data’, The American Statistician, Vol. 36, pp.378–381.

Dunson, D.B. and Perreault, S.D. (2001) ‘Factor analytic models of clustered multivariate data with

informative censoring’, Biometrics, Vol. 57, pp.302–308.

Efron, B. (1979) ‘Bootstrap methods: another look at the jackknife’, Ann. Statist., Vol. 7, pp.1–26.

Fay, R.E. (1991) ‘A design-based perspective on missing data variance’, Proceedings of Seventh

Annual Research Conference, Bureau of the Census, Washington DC, pp.429–440.

Fay, R.E. (1992) ‘When are inferences from multiple imputation valid?’, Proceedings of the

Section on Survey Research Methods, American Statistical Association, pp.227–232.

Fay, R.E. (1993) ‘Valid inference from imputed survey data’, Proceedings of the Section on Survey

Research Methods, American Statistical Association, pp.41–48.

Fix, E. and Hodges, J.L. (1952) Discriminatory Analysis: Nonparametric Discrimination: Small

Sample Performance, Technical Report Project 21-49-004, Report Number 11, USAF School

of Aviation Medicine, Randolf Field, Texas.

Ford, B. (1983) ‘An overview of hot-deck procedures’, in Madow, W.G., Olkin, I. and Rubin, D.B.

(Eds.): Incomplete Data in Sample Survey, Academic Press, Vol. II, pp.185–207.

Ghahramani, Z. and Jordan, M.I. (1995) Learning from Incomplete Data, Technical Report AI Lab Memo No. 1509, CBCL Paper No. 108, MIT AI Lab, August.

Gilks, W.R., Richardson, S. and Spiegelhalter, D.J. (Eds.) (1996) Markov Chain Monte Carlo in

Practice, Chapman & Hall, London.

Glynn, R., Laird, N.M. and Rubin, D.B. (1986) ‘Selection modeling versus mixture modeling with

nonignorable nonresponse’, in Wainer, H. (Ed.): Drawing Inferences from Self-Selected

Samples, Springer-Verlag, New York, pp.119–146.

Glynn, R.J., Laird, N.M. and Rubin, D.B. (1993) ‘Multiple imputation in mixture models for

nonignorable nonresponse with follow-ups’, Journal of the American Statistical Association,

Vol. 88, pp.984–993.

Gourieroux, C. and Montfort, A. (1981) ‘On the problem of missing data in linear models’, Review

of Economic Studies, Vol. XLVIII, pp.579–586.

Graham, J.W. and Schafer, J.L. (1999) ‘On the performance of multiple imputation for multivariate

data with small sample size’, Statistical Strategies for Small-Sample Research, pp.1–29.

Griliches, Z. (1986) 'Economic data issues', in Griliches, Z. and Intriligator, M.D. (Eds.): Handbook of Econometrics, Amsterdam, Vol. 3, pp.1465–1514.

Grzymala-Busse, J.W. and Hu, M. (2000) ‘A comparison of several approaches to missing attribute

values in data mining’, RSCTC’2000, pp.340–347.

Haitovsky, Y. (1968) ‘Missing data in regression analysis’, Journal of the Royal Statistical Society,

Vol. B30, pp.67–81.

Han, J. and Kamber, M. (2000) Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, USA.

Hartley, H. and Hocking, R. (1971) ‘The analysis of incomplete data’, Biometrics, Vol. 27,

pp.783–808.


Heckman, J. (1976) ‘The common structure of statistical models of truncation, sample selection and

limited dependent variables, and a simple estimator for such models’, Annals of Economic and

Social Measurement, Vol. 5, pp.475–492.

Hedeker, D. and Gibbons, R.D. (1997) ‘Application of random-effects pattern-mixture models for

missing data in longitudinal studies’, Psychological Methods, Vol. 2, No. 1, pp.64–78.

Hogan, J.W. and Laird, N.M. (1997) ‘Mixture models for the joint distribution of repeated

measures and event times’, Statistics in Medicine, Vol. 16, pp.239–257.

Huseby, J.R., Schwertman, N.C. and Allen, D.M. (1980) ‘Computation of the mean vector and

dispersion matrix for incomplete multivariate data’, Communication in Statistics, Vol. B3,

pp.301–309.

Irwin, M., Cox, N. and Kong, A. (1994) ‘Sequential imputation for multilocus linkage analysis’,

Proceedings of the National Academy of Sciences of the USA, pp.11684–11688.

Jordan, M. and Xu, L. (1996) 'Convergence results for the EM approach to mixtures of experts architectures', Neural Networks, Vol. 8, pp.1409–1431.

Joreskog, K.G. and Sorbom, D. (1993) LISREL 8 User’s Reference Guide, Scientific Software Int’l

Inc., Chicago.

Kalousis, A. and Hilario, M. (2000) ‘Supervised knowledge discovery from incomplete data’,

Proceedings of the 2nd International Conference on Data Mining, WIT Press, Cambridge,

UK.

Kalton, G. (1981) Compensating for Missing Data, ISR Research Report Series, Survey Research

Center, University of Michigan, Ann Arbor.

Kalton, G. and Kasprzyk, D. (1986) 'The treatment of missing survey data', Survey Methodology, Vol. 12, No. 1, pp.1–16.

Kim, J.K. (2002) ‘A note on approximate Bayesian bootstrap imputation’, Biometrika, Vol. 89,

No. 2, pp.470–477.

Kim, J-O. and Curry, J. (1977) ‘The treatment of missing data in multivariate analysis’,

Sociological Methods and Research, Vol. 6, No. 2, pp.215–240.

Kong, A., Cox, N., Frigge, M. and Irwin, M. (1993) ‘Sequential imputation and multipoint linkage

analysis’, Genetic Epidemiology, Vol. 10, pp.483–488.

Kong, A., Liu, J.S. and Wong, W.H. (1994) ‘Sequential imputations and Bayesian missing data

problems’, Journal of the American Statistical Association, Vol. 89, pp.278–288.

Krzanowski, W.J. (1988) ‘Missing value imputation in multivariate data using the singular value

decomposition of a matrix’, Biometrical Letters, Vol. 25, pp.31–39.

Lakshminarayan, K., Harp, S.A. and Samad, T. (1999) ‘Imputation of missing data in industrial

databases’, Applied Intelligence, Vol. 11, pp.259–275.

Lavori, P.W., Dawson, R. and Shera, D. (1995) ‘A multiple imputation strategy for clinical trials

with truncation of patient data’, Statistics in Medicine, Vol. 14, pp.1913–1925.


Little, R.J.A. (1988) ‘A test of missing completely at random for multivariate data with missing

values’, Journal of the American Statistical Association, Vol. 83, No. 404, pp.1198–1202.

Little, R.J.A. (1988) ‘Missing-data adjustments in large surveys’, Journal of Business and

Economic Statistics, Vol. 6, No. 3, pp.287–296.

Little, R.J.A. (1992) ‘Regression with missing X’s: a review’, Journal of the American Statistical

Association, Vol. 87, pp.1227–1238.

Little, R.J.A. (1993) ‘Pattern-mixture models for multivariate incomplete data’, Journal of the

American Statistical Association, Vol. 88, pp.125–134.

Little, R.J.A. (1994) ‘A class of pattern mixture models for normal incomplete data’, Biometrika,

Vol. 81, pp.471–483.


Little, R.J.A. (1995) ‘Modeling the drop-out mechanism in repeated- measures studies’, Journal of

the American Statistical Association, Vol. 90, pp.1112–1121.

Little, R.J.A. and Rubin, D.B. (1987) Statistical Analysis with Missing Data, John-Wiley,

New York.

Little, R.J.A. and Rubin, D.B. (1989) ‘Analysis of social science data with missing values’,

Sociological Methods and Research, Vol. 18, pp.292–326.

Little, R.J.A. and Rubin, D.B. (2002) Statistical Analysis with Missing Data, John Wiley & Sons,

New York.

Little, R.J.A. and Schenker, N. (1995) 'Missing data', in Arminger, G., Clogg, C. and Sobel, M. (Eds.): Handbook of Statistical Modeling for the Social and Behavioral Sciences, Plenum, New York, pp.39–75.

Little, R.J.A. and Wang, Y. (1996) ‘Pattern-mixing models for multivariate incomplete data with

covariates’, Biometrics, Vol. 52, pp.98–111.

Liu, J.S. (1996) ‘Nonparametric hierarchical Bayes via sequential imputations’, The Annals of

Statistics, Vol. 24, No. 3, pp.911–930.

MacEachern, S.N., Clyde, M. and Liu, J.S. (1999) 'Sequential importance sampling for nonparametric Bayes models: the next generation', The Canadian Journal of Statistics/La Revue Canadienne de Statistique, Vol. 27, No. 2, pp.251–267.

McLachlan, G.J. and Krishnan, T. (1997) The EM Algorithm and Extensions, Wiley, New York.

Meng, X.L. (1994) ‘Multiple imputation with uncongenial sources of input (with discussion)’,

Statistical Science, Vol. 9, pp.538–574.

Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994) Machine Learning, Neural and Statistical

Classification, Ellis Horwood, New Jersey, USA.

Myrtveit, I., Stensrud, E. and Olsson, U.H. (2001) ‘Analyzing data sets with missing data:

An empirical evaluation of imputation methods and likelihood-based methods’, IEEE

Transactions on Software Engineering, Vol. 27, No. 11, pp.999–1013.

Paik, M.C. (1997) ‘The generalized estimating equation approach when data are not missing

completely at random’, Journal of the American Statistical Association, Vol. 92, No. 440,

pp.1320–1329.

Quinlan, J.R. (1988) C4.5: Programs for Machine Learning, Morgan Kaufmann, CA.

Rao, J.N.K. (1996) ‘On variance estimation with imputation survey data’, Journal of the American

Statistical Association, Vol. 91, pp.499–506.

Rao, J.N.K. and Shao, J. (1992) ‘Jackknife variance estimation with survey data under hot deck

imputation’, Biometrika, Vol. 79, pp.811–822.

Reilly, M. (1993) ‘Data analysis with hot deck multiple imputation’, The Statistician, Vol. 42,

pp.307–313.

Robins, J.M. (1997) ‘Non-response models for the analysis of non-monotone non-ignorable

missing data’, Statistics in Medicine, Vol. 16, pp.21–38.

Robins, J.M., Rotnitzky, A. and Zhao, L.P. (1995) ‘Analysis of semiparametric regression models

for repeated outcomes in the presence of missing data’, Journal of the American Statistical

Association, Vol. 90, pp.106–121.

Rosenbaum, P.R. and Rubin, D.B. (1983) ‘The central role of the propensity score in observational

studies for causal effects’, Biometrika, Vol. 70, pp.41–55.

Rotnitzky, A. and Robins, J.M. (1997) ‘Analysis of semiparametric regression models with

non-ignorable non-response’, Statistics in Medicine, Vol. 16, pp.81–102.

Rotnitzky, A., Robins, J.M. and Scharfstein, D.O. (1998) ‘Semiparametric regression for repeated

outcomes with non-ignorable non-response’, Journal of the American Statistical Association,

Vol. 93, pp.1321–1339.

Roy, J. (2003) ‘Modeling longitudinal data with nonignorable dropouts using a latent dropout class

model’, Biometrics, Vol. 59, No. 4, pp.829–836.


Rubin, D.B. (1976) ‘Inference and missing data’, Biometrika, Vol. 63, pp.581–592.

Rubin, D.B. (1977) 'Formalizing subjective notions about the effect of nonrespondents in sample surveys', Journal of the American Statistical Association, Vol. 72, pp.538–543.

Rubin, D.B. (1978) ‘Multiple imputations in sample surveys – a phenomenological Bayesian

approach to nonresponse’, Proceedings of the Survey Research Methods Section of the

American Statistical Association, pp.20–34.

Rubin, D.B. (1981) ‘The Bayesian bootstrap’, The Annals of Statistics, Vol. 9, No. 1, pp.130–134.

Rubin, D.B. (1986) ‘Basic ideas of multiple imputation on non-response’, Survey Methodology,

Vol. 12, No. 1, pp.37–47.

Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys, John Wiley & Sons, New York.

Rubin, D.B. (1988) ‘An overview of multiple imputation’, Proceedings of the Survey Research

Section, American Statistical Association, pp.79–84.

Rubin, D.B. (1991) 'EM and beyond', Psychometrika, Vol. 56, pp.241–254.

Rubin, D.B. (1996) ‘Multiple imputation after 18+ years (with discussion)’, Journal of the

American Statistical Association, Vol. 91, pp.473–489.

Rubin, D.B. (2004) The Design of a General and Flexible System for Handling Nonresponse in

Sample Surveys, Report prepared for the US Social Security Administration, The American

Statistician, Vol. 58, No. 4, pp.298–302.

Rubin, D.B. and Schenker, N. (1986) ‘Multiple imputation for interval estimation from simple

random samples with ignorable nonresponse’, Journal of the American Statistical Association,

Vol. 81, No. 394, pp.366–374.

Sarle, W.S. (1998) ‘Prediction with missing inputs’, Proceedings of the Fourth Joint Conference on

Information Sciences, Vol. 2, pp.399–402.

Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data, Chapman & Hall, London.

Schafer, J.L. and Graham, J.W. (2002) ‘Missing data: our view of the state of the art’,

Psychological Methods, Vol. 7, No. 2, pp.147–177.

Schafer, J.L. and Olsen, M.K. (1998) ‘Multiple imputation for multivariate missing-data problems:

a data analyst’s perspective’, Multivariate Behavioral Research, Vol. 33, No. 4, pp.545–571.

Scharfstein, D.O., Rotnitzky, A. and Robins, J.M. (1999) ‘Adjusting for non-ignorable drop-out

using semiparametric non-response models (with discussion)’, Journal of the American

Statistical Association, Vol. 94, pp.1096–1146.

Schluchter, M.D. (1992) ‘Methods for the analysis of informatively censored longitudinal data’,

Statistics in Medicine, Vol. 11, pp.1861–1870.

Sedransk, J. (1985) 'The objective and practice of imputation', Proceedings of the First Annual Research Conference, pp.445–452.

Song, Q. and Shepperd, M. (2007) ‘A new method for imputing small software project data sets’,

Journal of Systems and Software, Vol. 80, No. 1, January, pp.51–62.

Song, Q., Shepperd, M. and Cartwright, M. (2005) ‘A short note on safest default missingness

mechanism assumptions’, Empirical Software Engineering: An International Journal, Vol. 10,

No. 2, pp.235–243.

Strike, K., El Emam, K. and Madhavji, N. (2001) ‘Software cost estimation with incomplete data’,

IEEE Transactions on Software Engineering, Vol. 27, No. 10, pp.890–908.

Tabachnick, B.G. and Fidell, L.S. (2001) Using Multivariate Statistics, 4th ed., Allyn & Bacon,

Needham Heights, MA.

Tanner, M.A. (1991) Tools for Statistical Inference: Observed Data and Data Augmentation

Methods, Springer-Verlag, New York, USA.

Tanner, M.A. (1993) Tools for Statistical Inference: Methods for the Exploration of Posterior

Distributions and Likelihood Functions, 2nd ed., Springer-Verlag, New York.


Tanner, M.A. and Wong, W.H. (1987) ‘The calculation of posterior distributions by data

augmentation (with discussion)’, Journal of the American Statistical Association, Vol. 82,

pp.528–550.

Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and Altman, R.B. (2001) 'Missing value estimation methods for DNA microarrays', Bioinformatics, Vol. 17, pp.520–525.

Tsiatis, A.A., DeGruttola, V. and Wulfsohn, M.S. (1994) ‘Modeling the relationship of survival and

longitudinal data measured with error’, Journal of the American Statistical Association,

Vol. 90, pp.27–37.

Verbeke, G. and Molenberghs, G. (2000) Linear Mixed Models for Longitudinal Data,

Springer-Verlag, New York.

Wei, G.C.G. and Tanner, M.A. (1990) 'A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms', Journal of the American Statistical Association, Vol. 85, pp.699–704.

Wilks, S.S. (1932) ‘Moments and distributions of estimates of population parameters from

fragmentary samples’, Annals of Mathematics Statistics, Vol. 3, pp.163–195.

Wothke, W. (2000) ‘Longitudinal and multi-group modeling with missing data’, in Little, T.D.,

Schnabel, K.U. and Baumert, J. (Eds.): Modeling Longitudinal and Multiple Group Data:

Practical Issues, Applied Approaches and Specific Examples, Lawrence Erlbaum Associates,

Mahwah, NJ, pp.219–240.

Wu, C.F.J. (1983) 'On the convergence properties of the EM algorithm', The Annals of Statistics, Vol. 11, No. 1, pp.95–103.

Wu, M.C. and Bailey, K.R. (1988) ‘Analyzing changes in the presence of informative right

censoring caused by death and withdrawal’, Statistics in Medicine, Vol. 7, pp.337–346.

Wu, M.C. and Bailey, K.R. (1989) ‘Estimation and comparison of changes in the presence of

informative right censoring: conditional linear model’, Biometrics, Vol. 45, pp.939–955.

Xu, L. and Jordan, M.I. (1996) 'On convergence properties of the EM algorithm for Gaussian mixtures', Neural Computation, Vol. 8, pp.129–151.

Yuan, Y.C. (2000) ‘Multiple imputation for missing data: concepts and new development’,

Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference,

SAS Institute, Cary, NC.