MULTIVARIATE BEHAVIORAL RESEARCH 505

Multivariate Behavioral Research, 38 (4), 505-528

Copyright © 2003, Lawrence Erlbaum Associates, Inc.

Investigation and Treatment of Missing Item Scores

in Test and Questionnaire Data

Klaas Sijtsma and L. Andries van der Ark

Tilburg University

This article first discusses a statistical test for investigating whether or not the pattern of

missing scores in a respondent-by-item data matrix is random. Since this is an asymptotic

test, we investigate whether it is useful in small but realistic sample sizes. Then, we discuss

two known simple imputation methods, person mean (PM) and two-way (TW)

imputation, and we propose two new imputation methods, response-function (RF) and

mean response-function (MRF) imputation. These methods are based on few assumptions

about the data structure. An empirical data example with simulated missing item scores

shows that the new method RF was superior to the methods PM, TW, and MRF in

recovering from incomplete data several statistical properties of the original complete data.

Methods TW and RF are useful both when item score missingness is ignorable and

when it is nonignorable.

Introduction

A well-known problem in data collection using tests and questionnaires

is that several item scores may be missing from the n respondents by J items

data matrix, X. This may occur for several reasons, often unknown to the

researcher. For example, the respondent may have missed a particular item,

missed a whole page of items, saved the item for later and then forgotten about

it, not known the answer and left it open, become bored while taking

the test or questionnaire and skipped a few items, felt the item was

embarrassing (e.g., questions about one’s sexual habits), threatening

(questions about the relationship with one’s children), or intrusive to privacy

(questions about one’s income and consumer habits), or felt otherwise

uneasy and reluctant to answer.

The literature abounds with methods for handling missing data. For

example, Little and Schenker (1995) and Smits, Mellenbergh, and Vorst

(2002) discuss and compare a large number of simple and more advanced

methods. Several methods are rather involved and, as a result, sometimes

perhaps beyond the reach of individual psychological and educational

researchers who are not trained statisticians or psychometricians. One

Correspondence concerning this article should be addressed to Klaas Sijtsma, Department

of Methodology and Statistics, FSW, Tilburg University, P.O. Box 90153, 5000 LE Tilburg,

The Netherlands; e-mail: k.sijtsma@uvt.nl

example is the EM method (Dempster, Laird, & Rubin, 1977; Rubin, 1991)

that alternately estimates the missing data, then updates the parameter

estimates of interest, uses these to re-estimate the missing data, and so on,

until the algorithm converges to, for example, maximum likelihood estimates.

Another example is multiple imputation (e.g., Little & Rubin, 1987). Here, w

complete data matrices are estimated by imputing scores for each respondent with

missing data, for example, the scores of other respondents whose complete

data are similar to the respondent’s available data. Then, statistics based

on the w (usually a surprisingly small number; see Rubin, 1991) complete data

matrices are averaged to obtain parameter estimates and standard errors.

Data augmentation (Schafer, 1997; Tanner & Wong, 1987) is an iterative

Bayesian procedure that resembles the EM method and also incorporates

features of multiple imputation (Little & Schenker, 1995).

Our starting point was that many researchers do not have a statistician or a

psychometrician in their vicinity who is available to help them implement these

superior but complex and involved missing data handling methods. Those

researchers may be better off using simpler methods that are easy to implement

and lead to results approaching the quality of EM and multiple imputation. A

circumstance favorable for these simpler methods to succeed is that the items

in a test measure the same underlying ability or trait and, thus, the observed item

scores contain much information about the missing item scores. This helps to

obtain reasonable estimates of missing item scores, even with simple methods.

However, first we investigated whether an asymptotic statistical test

(Huisman, 1999) for the hypothesis that the pattern of missing item scores

in a data matrix X is random (to be explained later on) is useful in small but

realistic sample sizes. This test may be seen as a useful precursor for item

score imputation: When its conclusion is that item score missingness is

random, the researcher can safely use a sensible item score imputation

method to produce a complete data matrix. When item score missingness is

not random, imputation methods must be robust so as to produce a data

matrix that is not heavily biased. We investigated this robustness issue in a

real data example for four imputation methods. Two simple methods were

known (e.g., Bernaards & Sijtsma, 2000), and two others were new

proposals based on concepts from item response theory (IRT), but without

using strong assumptions about the data structure.

Before we continue, it may be noted that a purely statistical approach to

the missing data problem may be too simple in some cases. For example, when

one item produces most of the missing scores then, depending on the research

context, the item may simply be deleted from further research (e.g., it was

printed on the back of the page and therefore missed by many), it may be

reformulated (e.g., positively worded instead of negatively, which caused

confusion) in future research, or it may be replaced (e.g., respondents did not

understand what was asked of them). Thus, the statistical treatment of missing

item scores should be considered in combination with other courses of action.

Types of Missing Item Scores

The next example item was taken from a questionnaire that measures

people’s tendency to cry (Vingerhoets & Cornelius, 2001):

I cry when I experience opposition from someone else

Never □ □ □ □ □ □ □ Always

In general, for a particular respondent or group of respondents nonresponse

may depend on:

1. The missing value on that item. For example, belonging to the right-most

“Always” group may imply a stronger nonresponse tendency than belonging

to the left-most “Never” group. Consequently, any missing data method based

on available item scores would underestimate the missing value.

2. Values of the other observed items or covariates. For example, for

men it may be more difficult to give a rating in the three boxes to the right

(showing endorsement or partial endorsement) than for women. Thus,

gender has a relation with item score missingness and this can be used for

estimating the missing item scores.

3. Values of variables that were not part of the investigation. For example,

nonresponse may depend on the unobserved verbal comprehension level of the

respondents or on their general intelligence. This kind of missingness is

relevant only if the unobserved variables are related to the observed variables,

and have an impact on the answers to the items in the test.

Item scores are missing completely at random (MCAR; see Little &

Rubin, 1987, pp. 14-17) if the cause of missingness is unrelated to the missing

values themselves, the scores on the other observed items and the observed

covariates, and the scores on unobserved variables. Thus, item score

missingness is ignorable because the observed data are a random sample

from the complete data. After listwise deletion, statistical analysis of the

resulting smaller data set results in less statistical accuracy and less power

when testing hypotheses, but unbiased parameter estimates.
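
The contrast between MCAR and nonresponse that depends on the missing value itself can be illustrated with a small simulation. The following Python sketch is not part of the original study: the seven-point item, the sample size, the seed, and both missingness probabilities are assumptions chosen only for illustration.

```python
import random

def simulate(n=20000, seed=1):
    """Compare the mean of observed scores under two missingness mechanisms."""
    rng = random.Random(seed)
    # Hypothetical complete scores on a seven-point item (1 = Never, 7 = Always).
    scores = [rng.randint(1, 7) for _ in range(n)]

    # MCAR: every score has the same probability (0.3) of being missing.
    mcar_observed = [x for x in scores if rng.random() >= 0.3]

    # Value-dependent nonresponse: the probability of a missing score
    # grows with the score itself, from 0.0 (score 1) to 0.6 (score 7).
    dependent_observed = [x for x in scores if rng.random() >= 0.1 * (x - 1)]

    def mean(values):
        return sum(values) / len(values)

    return mean(scores), mean(mcar_observed), mean(dependent_observed)

full_mean, mcar_mean, dependent_mean = simulate()
print(full_mean, mcar_mean, dependent_mean)
```

Under these assumed probabilities the mean of the MCAR-observed scores should stay close to the complete-data mean of about 4, whereas the mean of the scores observed under value-dependent nonresponse has expectation 16.8/4.9, or about 3.4, so that statistics based on the available scores underestimate the complete-data mean.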

When nonresponse depends on another variable from the data set, but

not on values of the item itself or on unobserved variables, item scores are

missing at random (MAR; see Little & Rubin, 1987, pp. 14-17). For example,

men may find it more difficult to answer “always” to the example item than

women, resulting in more missing item scores for men. The distributions of

item scores are different between men and women, but the distributions are

the same for respondents and nonrespondents in both groups. Note that

within the groups of men and women we have MCAR (given that no other

variables relate to item score missingness). This means that if, for example,

a regression analysis contains gender as a dummy variable the estimates of

the regression coefficients for both groups are unbiased. Thus, when

missingness is of the MAR type it is also ignorable.

When missingness is not MCAR or MAR, the observed data are not a

random sample from the original sample or from subsamples. Thus, the

missingness is nonignorable. In practice, a researcher can only observe that

item scores are missing. To decide whether item score missingness is

ignorable or nonignorable, he/she has to rely on the pattern of item score

missingness in the data matrix, X. When he/she finds no relationships to other

observed variables, he/she may decide that the missingness is of the MCAR

type. When a relationship to other observed variables is found, he/she may

use these variables as covariates in multivariate analyses or to impute

scores. When a more complex pattern of relationships is found, item score

missingness may be considered nonignorable. A reasonable solution is to

impute scores when the imputation method is backed up by robustness

studies (e.g., Bernaards & Sijtsma, 2000, for factor analysis of rating scale

data; and Huisman & Molenaar, 2001, in the context of test construction).

Missing Item Score Analysis

Theory for Analysis of the Whole Data Matrix

The scores on the J items are collected in J random variables $X_j$, $j = 1, \ldots, J$.

For respondent $i$ ($i = 1, \ldots, n$), the J item scores, $X_{ij}$, have realizations $x_{ij}$. Let

$M_{ij}$ be an indicator of a missing score with realization $m_{ij}$; $m_{ij} = 0$ if $X_{ij}$ is

observed and $m_{ij} = 1$ if $X_{ij}$ is missing. These missingness indicators are

collected in an $n \times J$ matrix M.
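
As a minimal sketch of this definition, the missingness indicator matrix can be derived from an incomplete data matrix as follows. The data values, the use of `None` to mark a missing score, and the function name `missingness_indicators` are assumptions made for illustration; they do not come from the article.

```python
# Hypothetical 4-respondent-by-3-item data matrix; None marks a missing score.
X = [
    [3, None, 2],
    [1, 4, None],
    [None, None, 5],
    [2, 3, 1],
]

def missingness_indicators(data):
    """Return M with m_ij = 1 if x_ij is missing and m_ij = 0 if it is observed."""
    return [[1 if score is None else 0 for score in row] for row in data]

M = missingness_indicators(X)
for row in M:
    print(row)
```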

Huisman (1999; Kim & Curry, 1978) investigated whether or not the

pattern of missingness in the data matrix X is unrelated among items. This

is called random missingness and is defined as follows. Frequency counts

of observed missing scores and expected missing scores are compared,

given statistical independence of the missingness between the items. Thus,

whether a respondent misses the score on item j is unrelated to whether he

(or she) misses the score on item k. Items j and k may have different

proportions of missing scores. A more restricted assumption, to be used

later on, is that the proportions for all J items are equal, as is typical of

MCAR. It may be noted that MCAR implies random missingness.

Huisman (1999) classifies each respondent in the sample into one of J + 2

classes: (a) NM (No Missing): none of the item scores in a pattern are

missing; (b) $M_j$ (Missing on item j): a score is missing only on item j; and (c)

MM (Multiple Missings): scores are missing on at least two items.

Let $q_j = \sum_i M_{ij}/n$ be the proportion of missing values on item j in the

sample and let $p_j = 1 - q_j$ be the proportion of observed values on item j. Then,

under the assumption of random missingness (as defined above), the

expected values for NM, $M_j$, and MM are

$$
\begin{aligned}
E(NM) &= n \prod_{j=1}^{J} p_j; \\
E(M_j) &= \frac{q_j}{p_j}\, E(NM); \text{ and} \\
E(MM) &= n - E(NM) - \sum_{j=1}^{J} E(M_j).
\end{aligned}
$$
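
The classification into J + 2 classes and the expected frequencies under random missingness can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names and the small indicator matrix are hypothetical, and the example data are not those of Table 1.

```python
from math import prod

def observed_frequencies(M):
    """Count respondents in the classes NM, M_1, ..., M_J, and MM."""
    J = len(M[0])
    nm, mm = 0, 0
    single = [0] * J  # single[j] counts patterns with a missing score only on item j
    for row in M:
        n_missing = sum(row)
        if n_missing == 0:
            nm += 1
        elif n_missing == 1:
            single[row.index(1)] += 1
        else:
            mm += 1
    return [nm] + single + [mm]

def expected_frequencies(M):
    """Expected class frequencies, in the same order, under random missingness."""
    n, J = len(M), len(M[0])
    q = [sum(row[j] for row in M) / n for j in range(J)]  # proportions missing
    p = [1 - q_j for q_j in q]                            # proportions observed
    e_nm = n * prod(p)
    # n * q_j * prod_{k != j} p_k equals (q_j / p_j) * E(NM) whenever p_j > 0,
    # and it avoids a division by zero if an item has no observed scores.
    e_single = [n * q[j] * prod(p[k] for k in range(J) if k != j)
                for j in range(J)]
    e_mm = n - e_nm - sum(e_single)
    return [e_nm] + e_single + [e_mm]

# Hypothetical indicator matrix with n = 4 respondents and J = 2 items.
M = [[0, 0], [1, 0], [0, 1], [0, 0]]
observed = observed_frequencies(M)
expected = expected_frequencies(M)
print(observed, expected)
```

For this hypothetical matrix, $q_1 = q_2 = .25$, so that E(NM) = 4 × .75 × .75 = 2.25, E($M_1$) = E($M_2$) = .75, and E(MM) = .25.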

The observed frequencies in these J + 2 classes are denoted by O(NM),

$O(M_j)$, and O(MM). Under the assumption of random missingness

Pearson’s chi-squared statistic,

$$
X^2 = \frac{[O(NM) - E(NM)]^2}{E(NM)} + \sum_{j=1}^{J} \frac{[O(M_j) - E(M_j)]^2}{E(M_j)} + \frac{[O(MM) - E(MM)]^2}{E(MM)}, \tag{1}
$$

has a $\chi^2$ distribution with J + 1 degrees of freedom as $n \to \infty$ (see, e.g.,

Agresti, 1990, pp. 44-45). For n = 8, Table 1 shows an incomplete data matrix

X and the corresponding missingness indicator matrix, M. This example is

used to calculate the $X^2$ statistic (Equation 1). Because $p_2 = 1$, we have that

$E(M_2) = 0$; this is a structural zero, which is ignored in the computation of $X^2$

at the cost of one degree of freedom. Table 2 shows the observed and the

expected frequencies that result in $X^2 = 1.65$ (df = 5). Given the small sample

size, it makes no sense to draw any inferences on the basis of the outcome.
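
The computation of the statistic in Equation 1, including the treatment of structural zeros, can be sketched as follows. The function name `x2_random_missingness` and the frequencies are hypothetical; because the data of Tables 1 and 2 are not reproduced here, the sketch uses a different, smaller example and does not reproduce the reported value of 1.65.

```python
def x2_random_missingness(observed, expected):
    """Pearson X^2 over the classes [NM, M_1, ..., M_J, MM].

    Classes with expected frequency zero are treated as structural zeros:
    they are skipped, and each one costs a degree of freedom.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    J = len(observed) - 2
    x2 = 0.0
    structural_zeros = 0
    for o, e in zip(observed, expected):
        if e == 0:
            structural_zeros += 1
            continue
        x2 += (o - e) ** 2 / e
    df = (J + 1) - structural_zeros
    return x2, df

# Hypothetical frequencies for n = 4 and J = 3; item 2 has no missing scores,
# so E(M_2) = 0 is a structural zero.
observed = [2, 1, 0, 1, 0]
expected = [2.25, 0.75, 0.0, 0.75, 0.25]
x2, df = x2_random_missingness(observed, expected)
print(x2, df)
```

In this example the statistic equals 1/36 + 1/12 + 1/12 + 1/4 = 4/9 with J + 1 − 1 = 3 degrees of freedom. A p value would additionally require the chi-squared distribution function from a statistical library, which is left out of this sketch.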

Robustness of $X^2$ Statistic for Small Samples

Problem Definition. The robustness of Huisman’s (1999) asymptotic

test for small (realistic) samples is important. For similar expected

frequencies in each of the J + 1 classes, Koehler and Larntz (1980) found that