
RESEARCH ARTICLE Open Access

Multiple imputation of missing covariates with

non-linear effects and interactions: an evaluation

of statistical methods

Shaun R Seaman1*, Jonathan W Bartlett2 and Ian R White1

Abstract

Background: Multiple imputation is often used for missing data. When a model contains as covariates more than

one function of a variable, it is not obvious how best to impute missing values in these covariates. Consider a

regression with outcome Y and covariates X and X². In ‘passive imputation’ a value X* is imputed for X and then X²

is imputed as (X*)². A recent proposal is to treat X² as ‘just another variable’ (JAV) and impute X and X² under

multivariate normality.

Methods: We use simulation to investigate the performance of three methods that can easily be implemented in

standard software: 1) linear regression of X on Y to impute X then passive imputation of X²; 2) the same regression

but with predictive mean matching (PMM); and 3) JAV. We also investigate the performance of analogous methods

when the analysis involves an interaction, and study the theoretical properties of JAV. The application of the

methods when complete or incomplete confounders are also present is illustrated using data from the EPIC Study.

Results: JAV gives consistent estimation when the analysis is linear regression with a quadratic or interaction term

and X is missing completely at random. When X is missing at random, JAV may be biased, but this bias is generally

less than for passive imputation and PMM. Coverage for JAV was usually good when bias was small. However, in

some scenarios with a more pronounced quadratic effect, bias was large and coverage poor. When the analysis

was logistic regression, JAV’s performance was sometimes very poor. PMM generally improved on passive

imputation, in terms of bias and coverage, but did not eliminate the bias.

Conclusions: Given the current state of available software, JAV is the best of a set of imperfect imputation

methods for linear regression with a quadratic or interaction effect, but should not be used for logistic regression.

Background

In most medical and epidemiological studies some of

the data that should have been collected are missing.

This presents problems for the analysis of such data.

One approach is to restrict the analysis to complete

cases, i.e. those subjects for whom none of the variables

in the analysis model are missing. Data are said to be

missing completely at random (MCAR), missing at ran-

dom (MAR) or missing not at random (MNAR) [1].

MCAR means that the probability of the pattern of

missing data being as it is depends on neither the

observed nor the missing data. MAR is the weaker

condition that the probability does not depend on the

missing data given the observed data. MNAR means

that it depends also on the missing data. When data are

MCAR, the complete cases constitute a representative

subsample of the sample, and so the complete-case ana-

lysis is valid. However, when data are MAR, using only

complete cases can yield biased parameter estimators.

Furthermore, even when data are MCAR, this approach

is inefficient, as it ignores information from incomplete

cases.

A method for handling missing data that gives valid

inference under MAR and which is more efficient than

just using complete cases is multiple imputation (MI)

[1]. Here a Bayesian model with non-informative prior

is specified for the joint distribution of the variables in

the analysis model, as well as possibly other (‘auxiliary’)

* Correspondence: shaun.seaman@mrc-bsu.cam.ac.uk

1MRC Biostatistics Unit, Institute of Public Health, Forvie Site, Robinson Way,

Cambridge CB2 0SR, UK

Full list of author information is available at the end of the article

Seaman et al. BMC Medical Research Methodology 2012, 12:46

http://www.biomedcentral.com/1471-2288/12/46

© 2012 Seaman et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons

Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in

any medium, provided the original work is properly cited.


variables. This model is fitted to the observed data

assuming that they are MAR. A single imputed dataset

is now created by sampling the parameters of the impu-

tation model from their posterior distribution, in order

to account for the uncertainty in this model, and then

randomly generating (‘imputing’) values for the missing

data using these sampled parameter values in the speci-

fied model. This procedure is repeated multiple times,

so generating multiple imputed datasets, and then the

analysis model is fitted to each of these in turn. Finally,

the complete-data parameter and variance estimates

from each imputed dataset are combined according to

simple formulae called Rubin’s Rules. Note that when

some of the variables are fully observed, it is unneces-

sary to model their distribution and the imputation

model can be a model for the conditional distribution of

the remaining variables given these.
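The pooling step can be sketched in a few lines. This is a minimal illustration of Rubin's Rules for a single scalar parameter; the function name `pool_rubin` and the five estimates and variances are hypothetical values chosen for the example, not results from this article.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine M complete-data estimates and variances using Rubin's Rules."""
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    w_bar = mean(variances)            # average within-imputation variance
    b = variance(estimates)            # between-imputation variance (divisor M - 1)
    total = w_bar + (1 + 1 / m) * b    # total variance of the pooled estimate
    return q_bar, total

# Hypothetical estimates of one coefficient from M = 5 imputed datasets
est = [1.0, 1.2, 0.9, 1.1, 0.8]
var = [0.04, 0.05, 0.04, 0.05, 0.04]
pooled, total_var = pool_rubin(est, var)
```

The total variance exceeds the average complete-data variance by the term (1 + 1/M)B, which is how the extra uncertainty due to the missing data enters the standard error.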

This article is concerned with the use of MI when the

analysis model includes as covariates more than one

function of the same variable and this variable can be

missing. Such situations arise when the analysis model

includes both linear and higher-order terms of the same

variable or when the model includes an interaction term.

This is the case, for example, when non-linear associa-

tions are explored using fractional polynomials or splines

[2]. In such situations, the imputation is complicated by

the functional relationship between the covariates in the

analysis model. In this article we focus on two particular

simple settings: where the analysis model is 1) linear

regression of an outcome Y on covariates X and X², and

2) linear regression of Y on covariates X, Z and XZ.

These are the two settings considered by Von Hippel

(2009) [3], who also investigated methods for imputing

variables in the presence of higher-order or interaction

effects. Unless stated otherwise, we suppose that Y and Z

are fully observed and X can be missing. We investigate

three methods of MI that can be easily implemented in

standard software.

In ‘passive imputation’, an imputation model is speci-

fied for the distribution of X given Y (or X given Y and

Z). Missing values of X are imputed from this model

and the corresponding values of the function(s) (X² or

XZ) of X calculated. Von Hippel (2009) [3] called this

method ‘impute then transform’. In principle, there is

nothing wrong with this method. However, in practice,

the existence of the higher-order or interaction effects

makes commonly used imputation models misspecified.

The conditional distribution of X given Y (and Z)

depends on the distribution of X (and Z) and the condi-

tional distribution of Y given X (and Z). In the case of a

linear regression analysis model, if X (and Z) are

(jointly) normally distributed and the true coefficient of

the higher-order or interaction term in the analysis

model is zero, the conditional distribution of X given Y

(and Z) is given by the linear regression of X on Y (and

Z). If the coefficient is not zero, this is no longer so.

Nevertheless, such a linear regression model would

commonly be used in practice as an imputation model

for X.
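The reason the linear imputation model fails can be seen by writing out the conditional density for the quadratic setting. The following is a sketch under assumed notation that does not appear in the article: X is taken to be normal with mean $\mu_X$ and variance $\sigma_X^2$, and Y given X is taken to be normal with mean $\beta_0 + \beta_1 X + \beta_2 X^2$ and variance $\phi$.

```latex
f(x \mid y) \;\propto\; f(x)\, f(y \mid x)
\;\propto\; \exp\left\{ -\frac{(x-\mu_X)^2}{2\sigma_X^2}
  - \frac{(y-\beta_0-\beta_1 x-\beta_2 x^2)^2}{2\phi} \right\}
```

When $\beta_2 = 0$ the exponent is quadratic in x, so X given Y is normal with a mean that is linear in y. When $\beta_2 \neq 0$ the exponent contains the terms $-\beta_2^2 x^4/(2\phi)$ and $\beta_2 x^2 y/\phi$, so the conditional distribution is no longer normal and is not described by a linear regression of X on Y.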

It is possible that passive imputation might be

improved by using predictive mean matching (PMM)

[4]. In this approach, rather than using the imputation

model to generate missing X values directly, it is used to

match subjects who have missing X with subjects with

observed X. Each incomplete case’s missing X is then

imputed as the matching subject’s value of X. The moti-

vation for PMM is that it may be more robust to mis-

specification of the imputation model and that grossly

unrealistic imputed values are avoided, since every

imputed value has actually been realised at least once in

the dataset.

Passive imputation and PMM ensure that the imputed

values conform to the known functional relation between

the covariates, e.g. that the imputed value of X² is equal

to the square of the imputed value of X. The third

method of MI that we examine was recently proposed by

Von Hippel (2009) [3]. This ignores the functional rela-

tion between covariates and treats a higher-order or

interaction term as just another variable. Von Hippel

called this approach ‘transform then impute’; following

White et al. (2011) [5], we call it JAV (‘Just Another Variable’).
In this method, missing X and X² (or XZ) are imputed under the
assumption that Y, X and X² (or Y, X, Z and XZ) are jointly
normally distributed. Corresponding imputed values of X and X²
will not, in general, be consistent with one another, e.g. X may
be imputed as 2 while X² is imputed as 5. However, Von Hippel argued

that this does not matter for estimation of the parameters

of the analysis model. We shall examine Von Hippel’s

argument in detail in the Results section.

In the present article we investigate, using simulation,

the performance of three methods easily implemented in

standard software – passive imputation, PMM and JAV –

in the two settings described above. We look at bias of

parameter estimators and coverage of confidence inter-

vals. In addition to considering linear regression analysis

models, we also look at the logistic regression of binary Y

on X and X². Von Hippel justified the use of JAV for a

linear regression analysis model, but suggested that it

might also work well in the setting of logistic regression,

because the logistic link function is fairly linear except in

regions where the fitted probability is near to zero and

one. In the Methods section, we formally describe the

three approaches and the simulations we performed to

assess the performance of these approaches. We also

describe a dataset from the EPIC study on which we illus-

trate the methods. In the Results, we present a theoretical

investigation of the properties of JAV, showing that


although JAV gives consistent estimation for linear

regression under MCAR, it will not, in general, under

MAR. Results from the simulations and from applying

the methods to the EPIC dataset are also described there.

These results are followed by a discussion and

conclusions.

Methods

Three imputation methods

We begin by describing passive imputation, PMM and

JAV for the setting of linear regression of Y on X and

X². We then describe the modifications necessary for

regression of Y on X, Z and XZ.

Let $X_i$ and $Y_i$ denote the values of X and Y, respectively, for subject i (i = 1,...,n). Assume that $(X_1, Y_1),\ldots,(X_n, Y_n)$ are independently identically distributed. Let $R_i = 1$ if $X_i$ is observed (i.e. if subject i is a complete case), with $R_i = 0$ otherwise. Let $n_1$ denote the number of complete cases, and q denote the number of regression parameters in the imputation model. Let $\bar{X} = (X_1,\ldots,X_n)^T$, $W_i = R_i(1, Y_i)^T$ (so $W_i = (0, 0)^T$ whenever $X_i$ is missing), $\bar{W} = (W_1,\ldots,W_n)^T$, and $\psi = (\bar{W}^T \bar{W})^{-1}$.

Passive imputation

In the approach we call ‘linear imputation model with passive imputation of X²’ (or just ‘passive imputation’) the linear regression model $X \sim N(\gamma_0 + \gamma_1 Y, \sigma^2)$ is fitted to the complete cases. So, q = 2. Let $\hat{\gamma} = (\hat{\gamma}_0, \hat{\gamma}_1)^T = \psi \bar{W}^T \bar{X}$ denote the resulting maximum likelihood estimate (MLE) of $\gamma = (\gamma_0, \gamma_1)$, and let

$$\hat{\sigma}^2 = \sum_{i=1}^{n} R_i \left( X_i - \hat{\gamma}^T W_i \right)^2 / (n_1 - q)$$

denote the unbiased estimator of $\sigma^2$. If $\gamma$ and $\sigma^2$ are treated as a priori independent with joint density proportional to $\sigma^{-2}$, then the posterior distribution of $(n_1 - q)\hat{\sigma}^2 / \sigma^2$ is $\chi^2_{n_1 - q}$ and that of $\gamma$ given $\sigma^2$ is $N(\hat{\gamma}, \psi\sigma^2)$ [6]. So, to create a single imputed dataset, $\sigma^{*2}$ is drawn from $(n_1 - q)\hat{\sigma}^2 / \chi^2_{n_1 - q}$ and $\gamma^*$ from $N(\hat{\gamma}, \psi\sigma^{*2})$. Then missing X values are imputed as $X_i = \gamma^{*T} W_i + \sigma^* B_i$, where the $B_i$'s are independently distributed N(0, 1).
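The steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: the function name `passive_impute`, the simulated data and the residual standard deviation are assumptions made for the example, and for subjects with missing X the prediction uses the design vector (1, Y_i).

```python
import numpy as np

def passive_impute(x, y, rng):
    """Return one imputed copy of (X, X^2); x holds np.nan where X is missing."""
    obs = ~np.isnan(x)
    n1, q = int(obs.sum()), 2
    W = np.column_stack([np.ones(n1), y[obs]])      # complete-case design matrix
    psi = np.linalg.inv(W.T @ W)
    gamma_hat = psi @ W.T @ x[obs]                  # MLE of (gamma_0, gamma_1)
    resid = x[obs] - W @ gamma_hat
    sigma2_hat = resid @ resid / (n1 - q)           # unbiased estimator of sigma^2
    # Posterior draws of the imputation-model parameters
    sigma2_star = (n1 - q) * sigma2_hat / rng.chisquare(n1 - q)
    gamma_star = rng.multivariate_normal(gamma_hat, psi * sigma2_star)
    # Impute missing X, then square passively
    n_mis = int((~obs).sum())
    W_mis = np.column_stack([np.ones(n_mis), y[~obs]])
    x_imp = x.copy()
    x_imp[~obs] = W_mis @ gamma_star + np.sqrt(sigma2_star) * rng.standard_normal(n_mis)
    return x_imp, x_imp ** 2

# Hypothetical data: X ~ N(2, 1), Y with mean 2X + X^2, X observed with probability 0.7
rng = np.random.default_rng(1)
x_full = rng.normal(2.0, 1.0, 200)
y = 2 * x_full + x_full ** 2 + rng.normal(0.0, np.sqrt(38.0), 200)
x = np.where(rng.random(200) < 0.7, x_full, np.nan)
x_imp, x2_imp = passive_impute(x, y, rng)
```

Calling the function M times with the same data gives M imputed datasets, each based on a fresh draw of the parameters.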

PMM

The approach we call ‘linear imputation model with predictive mean matching’ (or just ‘PMM’) is the same as passive imputation up to the generation of $\sigma^{*2}$ and $\gamma^*$. Thereafter, instead of generating $\gamma^{*T} W_i + \sigma^* B_i$, a fitted value $\hat{X}^*_i = \gamma^{*T} W_i$ is calculated for each subject. For each subject i with missing X, the K subjects with observed $X_j$ and the closest $\hat{X}^*_j$ values to his or her $\hat{X}^*_i$ value are identified. One of these K subjects is chosen at random and his or her $X_j$ value becomes the imputed value of $X_i$. The square of the imputed value of $X_i$ becomes the imputed value of $X_i^2$. The value of K is chosen to balance bias in parameter and variance estimation. If K is very large, matching is very loose, leading to bias in parameter estimates of the analysis model. If K is very small, uncertainty in the imputed data will not be fully represented, leading to underestimation of standard errors when Rubin’s Rules are applied. For our simulations we used K = 5. Notice that if, as in this case, the imputation model is a simple linear regression of X on Y, finding the subjects with the nearest $\hat{X}^*_j$ values to $\hat{X}^*_i$ is equivalent to finding the subjects with the nearest $Y_j$ values to $Y_i$. If the imputation model contains more than one predictor, PMM may be quite different from matching on the subjects with the nearest $Y_j$.
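The matching step can be sketched as below. This is a minimal illustration rather than the implementation used in any particular package: the function name `pmm_impute` is hypothetical, and `gamma_star` is assumed to be a parameter draw obtained as in passive imputation (a made-up value is used here so that the example stands alone).

```python
import numpy as np

def pmm_impute(x, y, gamma_star, k=5, rng=None):
    """Impute missing X by predictive mean matching on the fitted values."""
    rng = np.random.default_rng() if rng is None else rng
    obs = ~np.isnan(x)
    fitted = gamma_star[0] + gamma_star[1] * y      # fitted value for every subject
    donor_fit, donor_x = fitted[obs], x[obs]
    x_imp = x.copy()
    for i in np.flatnonzero(~obs):
        # indices of the K donors whose fitted values are closest to subject i's
        nearest = np.argsort(np.abs(donor_fit - fitted[i]))[:k]
        x_imp[i] = donor_x[rng.choice(nearest)]     # copy one donor's observed X
    return x_imp, x_imp ** 2

# Hypothetical data and a hypothetical posterior draw gamma_star
rng = np.random.default_rng(2)
x_full = rng.normal(2.0, 1.0, 200)
y = 2 * x_full + x_full ** 2 + rng.normal(0.0, np.sqrt(38.0), 200)
x = np.where(rng.random(200) < 0.7, x_full, np.nan)
gamma_star = np.array([1.0, 0.1])
x_imp, x2_imp = pmm_impute(x, y, gamma_star, k=5, rng=rng)
```

Because every imputed value is copied from a donor, all imputed X values lie within the range of the observed X values.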

JAV

In the JAV approach, (Y, X, X²) is assumed to be jointly normally distributed:

$$\begin{bmatrix} Y \\ X \\ X^2 \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix}, \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_{22} & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_{33} \end{bmatrix} \right). \tag{1}$$

Expression (1) can equivalently be written as

$$Y \sim N(\mu_1, \sigma_{11}) \tag{2}$$

$$\begin{bmatrix} X \\ X^2 \end{bmatrix} \Bigg|\; Y \sim N\left( \begin{bmatrix} \delta_{20} + \delta_{21} Y \\ \delta_{30} + \delta_{31} Y \end{bmatrix}, \begin{bmatrix} \tau_{22} & \tau_{23} \\ \tau_{23} & \tau_{33} \end{bmatrix} \right), \tag{3}$$

where (for k = 2, 3) $\delta_{k0} = \mu_k - \mu_1\sigma_{1k}/\sigma_{11}$, $\delta_{k1} = \sigma_{1k}/\sigma_{11}$, $\tau_{kk} = \sigma_{kk} - \sigma_{1k}^2/\sigma_{11}$ and $\tau_{23} = \sigma_{23} - \sigma_{12}\sigma_{13}/\sigma_{11}$. Having fitted model (1) to the observed data, a perturbation is added to the maximum likelihood estimates, in a similar way to that described above for the passive imputation method. Missing values of X and X² are then generated from distribution (3) using the perturbed values of the parameters. As Y is fully observed, an alternative to fitting model (1) is just to fit model (3) directly.

The methods described above need only minor adaptation for the setting of linear regression of Y on X, Z and XZ. In passive imputation and PMM, the imputation model for X should include Z. Obvious choices are $X \sim N(\gamma_0 + \gamma_1 Y + \gamma_2 Z, \sigma^2)$ (so $W_i = R_i(1, Y_i, Z_i)^T$ and q = 3) or $X \sim N(\gamma_0 + \gamma_1 Y + \gamma_2 Z + \gamma_3 YZ, \sigma^2)$ (so $W_i = R_i(1, Y_i, Z_i, Y_i Z_i)^T$ and q = 4). The imputed value of X multiplied by the imputed individual’s value of Z becomes the imputed value of XZ. In JAV, (Y, X, Z, XZ), rather than (Y, X, X²), is assumed to be multivariate normally distributed.
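For the quadratic setting, fitting model (3) directly and drawing one JAV imputation might look like the sketch below. The article says only that a perturbation is added to the maximum likelihood estimates; the specific choice here (an inverse-Wishart draw for the residual covariance with $n_1 - q$ degrees of freedom, followed by a matrix-normal draw for the coefficients) is an assumption, as are the function name `jav_impute` and the simulated data.

```python
import numpy as np

def jav_impute(x, y, rng):
    """One JAV imputation of (X, X^2) given fully observed Y."""
    obs = ~np.isnan(x)
    n1, q = int(obs.sum()), 2
    W = np.column_stack([np.ones(n1), y[obs]])          # regressors (1, Y)
    V = np.column_stack([x[obs], x[obs] ** 2])          # responses (X, X^2)
    psi = np.linalg.inv(W.T @ W)
    delta_hat = psi @ W.T @ V                           # rows: intercept, slope; columns: X, X^2
    resid = V - W @ delta_hat
    S = resid.T @ resid                                 # residual sums of squares and products
    # Assumed perturbation: tau* from an inverse-Wishart with n1 - q degrees of freedom
    A = rng.multivariate_normal(np.zeros(2), np.linalg.inv(S), size=n1 - q)
    tau_star = np.linalg.inv(A.T @ A)
    # delta* given tau*: matrix-normal draw centred at the estimate
    L_tau = np.linalg.cholesky(tau_star)
    delta_star = delta_hat + np.linalg.cholesky(psi) @ rng.standard_normal((q, 2)) @ L_tau.T
    # Impute X and X^2 jointly, ignoring the relation between them
    n_mis = int((~obs).sum())
    W_mis = np.column_stack([np.ones(n_mis), y[~obs]])
    draws = W_mis @ delta_star + rng.standard_normal((n_mis, 2)) @ L_tau.T
    x_imp, x2_imp = x.copy(), x ** 2
    x_imp[~obs], x2_imp[~obs] = draws[:, 0], draws[:, 1]
    return x_imp, x2_imp

# Hypothetical data: X ~ N(2, 1), Y with mean 2X + X^2, X observed with probability 0.7
rng = np.random.default_rng(3)
x_full = rng.normal(2.0, 1.0, 200)
y = 2 * x_full + x_full ** 2 + rng.normal(0.0, np.sqrt(38.0), 200)
x = np.where(rng.random(200) < 0.7, x_full, np.nan)
x_imp, x2_imp = jav_impute(x, y, rng)
```

Unlike passive imputation and PMM, the imputed value of X² returned here is generally not the square of the imputed value of X, which is the inconsistency discussed in the Background.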

Simulation studies

Linear regression with quadratic term

In all our linear regression simulation studies, a sample size of 200 was assumed and 1000 simulated datasets were created. For each simulated dataset, we generated 200 X values from one of four distributions with mean 2 and variance 1: normal, log normal, (shifted and scaled) beta, and uniform. For the log normal distribution, log X was generated from $N(\log\sqrt{3.2}, \log(5/4))$; X then has a coefficient of skewness of 1.63. For the (shifted and scaled) beta distribution, we generated Z ~ beta(1, 10) and X = 12.05(Z − 1/11) + 2; X then has a skewness of 1.51. The outcome Y was generated from N(2X + X², j), where j was chosen to make the coefficient of determination R² equal to 0.1, 0.5 or 0.8. Although R² values greater than 0.5 are uncommon in medical studies, we wanted also to investigate the performance of methods in extreme situations. The top two rows of Figure 1 show, for normally and log-normally distributed X, a typical set of data generated in this way.

Missingness was then imposed on these data. Let $\mathrm{expit}(x) = \{1 + \exp(-x)\}^{-1}$. Y was fully observed; two missing data mechanisms were assumed for X. For MCAR, each X was observed with probability 0.7, regardless of the values of X and Y. For MAR, the probability X was observed was $\mathrm{expit}(a_0 + a_1 Y)$, where $a_1 = -1/\mathrm{SD}(Y)$ and $a_0$ was chosen to make the marginal probability of observing X equal to 0.7.

For the three methods, passive imputation (‘Passive’), PMM and JAV, we used M = 5 imputations. We also carried out the complete-case analysis (‘CCase’) and the complete-data analysis (‘CData’), i.e. before data deletion.

Finally, we instead generated Y from N((X − 2)², j), with j chosen to make R² = 0.1, 0.5 or 0.8. As the mean of X is 2, the quadratic relation between Y and X is now more obvious in such data (see Figure 1).
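The constants in this design can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, the residual variance is derived from the population relation R² = Var(E[Y|X]) / {Var(E[Y|X]) + j} (the article does not say how j was computed), and $a_0$ is found by bisection on the simulated sample, which is one of several ways it could be chosen. For X ~ N(2, 1), Var(2X + X²) = 38.

```python
import math
import random
import statistics

def lognormal_params(mean, var):
    """Mean and variance of log X for a log-normal X with the given mean and variance."""
    s2 = math.log(1 + var / mean ** 2)
    return math.log(mean) - s2 / 2, s2

def residual_variance(var_of_mean, r2):
    """Residual variance j such that var_of_mean / (var_of_mean + j) equals r2."""
    return var_of_mean * (1 - r2) / r2

def expit(v):
    return 1 / (1 + math.exp(-v))

def calibrate_intercept(y, a1, target=0.7, lo=-20.0, hi=20.0):
    """Bisection for a0 so that the sample mean of expit(a0 + a1 * y) equals target."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if sum(expit(mid + a1 * v) for v in y) / len(y) < target:
            lo = mid          # observation probability too low: raise the intercept
        else:
            hi = mid
    return (lo + hi) / 2

mu, s2 = lognormal_params(2.0, 1.0)                        # log(sqrt(3.2)), log(5/4)
skew = (math.exp(s2) + 2) * math.sqrt(math.exp(s2) - 1)    # log-normal skewness
resid_var = residual_variance(38.0, 0.5)                   # normal X, R^2 = 0.5

random.seed(0)
xs = [random.gauss(2.0, 1.0) for _ in range(200)]
ys = [random.gauss(2 * v + v ** 2, math.sqrt(resid_var)) for v in xs]
a1 = -1 / statistics.stdev(ys)
a0 = calibrate_intercept(ys, a1)
p_obs = sum(expit(a0 + a1 * v) for v in ys) / len(ys)
```

The computed skewness is 1.625, in line with the 1.63 quoted above for the log normal distribution.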

Linear regression with interaction

We focussed on normally and log-normally distributed covariates. Four bivariate distributions were assumed for the two covariates X and Z. In the first, X and Z were both independently distributed N(2, 1). In the second, they were generated from a bivariate normal distribution so that they both had marginal distribution N(2, 1) but Cor(X, Z) = 0.5. In the third, log X and log Z were independently distributed $N(\log\sqrt{3.2}, \log(5/4))$, so that X and Z were independently log-normal each with mean 2 and variance 1. In the fourth, log X and log Z were generated from a bivariate normal distribution so that they both had marginal distribution $N(\log\sqrt{3.2}, \log(5/4))$ but Cor(log X, log Z) = 0.5. Outcome Y was generated from N(X + Z + XZ, j), where j was chosen so that R² = 0.1 or 0.5.

Y and Z were fully observed; the same two missing data mechanisms were assumed for X as in ‘Linear regression with quadratic term’. Two variations of passive imputation were used: in the first (‘Passive1’), the imputation model contained just Y and Z; in the second (‘Passive2’), the imputation model also contained the interaction YZ. For PMM the imputation model also included YZ. Von Hippel [3] considered only Passive1.

Logistic regression with quadratic term

A sample size of 2000 was assumed and 1000 simulated

datasets were created. This larger sample size was used

because binary outcomes provide less information for esti-

mating parameter values than do continuous outcomes.

We used the same normal and log normal distributions for X as in ‘Linear regression with quadratic term’. Binary outcomes Y were generated from the model $P(Y = 1 \mid X) = \mathrm{expit}(b_0 + 2b_2 X + b_2 X^2)$. The value of $b_2$ was chosen to make the log odds ratio of Y for X = 3 versus X = 1 equal to either 1 ($b_2$ = 1/12) or 2 ($b_2$ = 1/6). When X is normally distributed, this is the log odds ratio for the mean of X plus one standard deviation relative to the mean of X minus one standard deviation. The value of $b_0$ was chosen so that the marginal probability of Y = 1 was either p = 0.1 or p = 0.5.

Y was fully observed; X was MCAR or MAR, with probability $\mathrm{expit}(a_0 + a_1 Y)$ of being observed. For MCAR $a_1 = 0$; for MAR $a_1 = -2$. In both cases $a_0$ was chosen to give a marginal probability of observing X of 0.7. For passive imputation and PMM the imputation model was the linear regression of X on Y.
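The two values of $b_2$ follow from the difference in the linear predictor between X = 3 and X = 1. The short calculation below uses only the model stated above.

```latex
\log \mathrm{OR}
  = \left( b_0 + 2 b_2 \cdot 3 + b_2 \cdot 3^2 \right)
  - \left( b_0 + 2 b_2 \cdot 1 + b_2 \cdot 1^2 \right)
  = 4 b_2 + 8 b_2
  = 12\, b_2
```

Setting $12 b_2 = 1$ gives $b_2 = 1/12$, and setting $12 b_2 = 2$ gives $b_2 = 1/6$.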

Analysis of vitamin C data from EPIC Study

EPIC-Norfolk is a cohort of 25,639 men and women

recruited during 1993-97 from the population of indivi-

duals aged 45-75 in Norfolk, UK [7]. Shortly after recruit-

ment, study participants were invited to attend a health

check at which a 7-day diet diary was provided for comple-

tion over the next week. Blood samples were provided and

have been stored. A measure of average daily intake of vita-

min C has been derived from the 7-day diet diary and

plasma vitamin C (μmol/l) was measured within a few days

of the blood sample being provided. The dietary assess-

ment methods have been described in detail elsewhere [8].

There is evidence of a non-linear association between

vitamin C intake and plasma vitamin C [9]. Here, we

look at this association in the EPIC-Norfolk data: in par-

ticular, whether this relation is linear or has a quadratic


element. Plasma vitamin C is also affected by sex, age,

smoking status, and body size [9-12], so these possible

confounders are adjusted for in our analysis. The analy-

sis presented in this article illustrates the methods

described here and is not intended as a definitive analy-

sis of the EPIC data.

Of the 25639 subjects, 10224 had incomplete data:

3165 had missing plasma vitamin C; 8100 missing

[Figure 1: nine scatter plots of y against x. Panel titles: X~normal, E[Y|X]=2X+X^2, with R2=0.1, 0.5 and 0.8; X~log normal, E[Y|X]=2X+X^2, with R2=0.1, 0.5 and 0.8; X~normal, E[Y|X]=(X−2)^2, with R2=0.1, 0.5 and 0.8.]

Figure 1 Typical datasets for normally or log-normally distributed X (each with mean 2 and variance 1), normally distributed Y with mean 2X + X² or (X − 2)², and R² = 0.1, 0.5 or 0.8. Dotted line shows expected value of Y given X.
