Multivariate Behavioral Research, 38 (4), 505-528
Copyright © 2003, Lawrence Erlbaum Associates, Inc.
Investigation and Treatment of Missing Item Scores
in Test and Questionnaire Data
Klaas Sijtsma and L. Andries van der Ark
This article first discusses a statistical test for investigating whether or not the pattern of
missing scores in a respondent-by-item data matrix is random. Since this is an asymptotic
test, we investigate whether it is useful in small but realistic sample sizes. Then, we discuss
two known simple imputation methods, person mean (PM) and two-way (TW)
imputation, and we propose two new imputation methods, response-function (RF) and
mean response-function (MRF) imputation. These methods are based on few assumptions
about the data structure. An empirical data example with simulated missing item scores
shows that the new method RF was superior to the methods PM, TW, and MRF in
recovering from incomplete data several statistical properties of the original complete data.
Methods TW and RF are useful both when item score missingness is ignorable and
when it is nonignorable.
A well-known problem in data collection using tests and questionnaires
is that several item scores may be missing from the n respondents by J items
data matrix, X. This may occur for several reasons, often unknown to the
researcher. For example, the respondent may have missed a particular item,
missed a whole page of items, saved the item for later and then forgot about
it, did not know the answer and then left it open, became bored while making
the test or questionnaire and skipped a few items, felt the item was
embarrassing (e.g., questions about one’s sexual habits), threatening
(questions about the relationship with one’s children), or intrusive to privacy
(questions about one’s income and consumer habits), or felt otherwise
uneasy and reluctant to answer.
The literature abounds with methods for handling missing data. For
example, Little and Schenker (1995) and Smits, Mellenbergh, and Vorst
(2002) discuss and compare a large number of simple and more advanced
methods. Several methods are rather involved and, as a result, sometimes
perhaps beyond the reach of individual psychological and educational
researchers who are not trained statisticians or psychometricians. One
Correspondence concerning this article should be addressed to Klaas Sijtsma, Department
of Methodology and Statistics, FSW, Tilburg University, P.O. Box 90153, 5000 LE Tilburg,
The Netherlands; e-mail: firstname.lastname@example.org
example is the EM method (Dempster, Laird, & Rubin, 1977; Rubin, 1991)
that alternately estimates the missing data, then updates the parameter
estimates of interest, uses these to re-estimate the missing data, and so on,
until the algorithm converges to, for example, maximum likelihood estimates.
Another example is multiple imputation (e.g., Little & Rubin, 1987). Here, w
complete data matrices are estimated by imputing for a respondent having
missing data, for example, scores of sets of other respondents with complete
data that are similar to the respondent’s available data. Then, statistics based
on the w (usually a surprisingly small number; see Rubin, 1991) complete data
matrices, are averaged to obtain parameter estimates and standard errors.
Data augmentation (Schafer, 1997; Tanner & Wong, 1987) is an iterative
Bayesian procedure that resembles the EM method and also incorporates
features of multiple imputation (Little & Schenker, 1995).
Our starting point was that many researchers do not have a statistician or a
psychometrician in their vicinity who is available to help them implement these
superior but complex and involved missing data handling methods. Those
researchers may be better off using simpler methods that are easy to implement
and lead to results approaching the quality of EM and multiple imputation. A
circumstance favorable for these simpler methods to succeed is that the items
in a test measure the same underlying ability or trait and, thus, the observed item
scores contain much information about the missing item scores. This helps to
obtain reasonable estimates of missing item scores, even with simple methods.
However, first we investigated whether an asymptotic statistical test
(Huisman, 1999) for the hypothesis that the pattern of missing item scores
in a data matrix X is random (to be explained later on), is useful in small but
realistic sample sizes. This test may be seen as a useful precursor for item
score imputation: When its conclusion is that item score missingness is
random, the researcher can safely use a sensible item score imputation
method to produce a complete data matrix. When item score missingness is
not random, imputation methods must be robust so as to produce a data
matrix that is not heavily biased. We investigated this robustness issue in a
real data example for four imputation methods. Two simple methods were
known (e.g., Bernaards & Sijtsma, 2000), and two others were new
proposals based on concepts from item response theory (IRT), but without
using strong assumptions about the data structure.
Before we continue, it may be noted that a purely statistical approach to
the missing data problem may be too simple in some cases. For example, when
one item produces most of the missing scores then, depending on the research
context, the item may simply be deleted from further research (e.g., it was
printed on the back of the page and therefore missed by many), it may be
reformulated (e.g., positively worded instead of negatively, which caused
confusion) in future research, or it may be replaced (e.g., respondents did not
understand what was asked of them). Thus, the statistical treatment of missing
item scores should be considered in combination with other courses of action.
Types of Missing Item Scores
The next example item was taken from a questionnaire that measures
people’s tendency to cry (Vingerhoets & Cornelius, 2001):
I cry when I experience opposition from someone else
Never □ □ □ □ □ □ □ Always
In general, for a particular respondent or group of respondents nonresponse
may depend on:
1. The missing value on that item. For example, belonging to the right-most
“Always” group may imply a stronger nonresponse tendency than belonging
to the left-most “Never” group. Consequently, any missing data method based
on available item scores would underestimate the missing value.
2. Values of the other observed items or covariates. For example, for
men it may be more difficult to give a rating in the three boxes to the right
(showing endorsement or partial endorsement) than for women. Thus,
gender has a relation with item score missingness and this can be used for
estimating the missing item scores.
3. Values of variables that were not part of the investigation. For example,
nonresponse may depend on the unobserved verbal comprehension level of the
respondents or on their general intelligence. This kind of missingness is
relevant only if the unobserved variables are related to the observed variables,
and have an impact on the answers to the items in the test.
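To make point 1 above concrete, a small simulation can show how value-dependent nonresponse biases any estimate based on the available scores alone; the missingness probabilities below are invented for illustration:

```python
import random

# Invented illustration of point 1: the probability of skipping the item
# rises with the (unobserved) true score, so the observed mean is biased.
random.seed(1)
true_scores = [random.randint(1, 7) for _ in range(100000)]  # 7-point item

observed = []
for x in true_scores:
    p_missing = 0.05 + 0.10 * (x - 1)  # assumed: higher score, more nonresponse
    if random.random() >= p_missing:
        observed.append(x)

true_mean = sum(true_scores) / len(true_scores)
observed_mean = sum(observed) / len(observed)
# observed_mean falls below true_mean: any method that uses only the
# observed scores underestimates the missing values in this situation.
```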
Item scores are missing completely at random (MCAR; see Little &
Rubin, 1987, pp. 14-17) if the cause of missingness is unrelated to the missing
values themselves, the scores on the other observed items and the observed
covariates, and the scores on unobserved variables. Thus, item score
missingness is ignorable because the observed data are a random sample
from the complete data. After listwise deletion, statistical analysis of the
resulting smaller data set yields less statistical accuracy and less power
when testing hypotheses, but unbiased parameter estimates.
When nonresponse depends on another variable from the data set, but
not on values of the item itself or on unobserved variables, item scores are
missing at random (MAR; see Little & Rubin, 1987, pp. 14-17). For example,
men may find it more difficult to answer “always” to the example item than
women, resulting in more missing item scores for men. The distributions of
item scores are different between men and women, but the distributions are
the same for respondents and nonrespondents in both groups. Note that
within the groups of men and women we have MCAR (given that no other
variables relate to item score missingness). This means that if, for example,
a regression analysis contains gender as a dummy variable, the estimates of
the regression coefficients for both groups are unbiased. Thus, when
missingness is of the MAR type it is also ignorable.
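The MCAR/MAR distinction can be illustrated with a small simulation (all distributions and missingness rates below are invented): under MAR, men skip the item more often than women, so the overall observed mean is biased, yet within each gender the observed scores remain a random sample.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

# Invented illustration of MAR: scores differ by gender, and men skip the
# item more often than women, but within each gender missingness is random.
random.seed(2)
data = []  # (gender, score) pairs
for _ in range(100000):
    gender = random.choice(["m", "f"])
    score = random.randint(2, 6) if gender == "m" else random.randint(1, 5)
    data.append((gender, score))

# Missingness depends on gender only (men skip 40%, women 5%)
observed = [(g, x) for g, x in data
            if random.random() >= (0.40 if g == "m" else 0.05)]

men_all = [x for g, x in data if g == "m"]
men_obs = [x for g, x in observed if g == "m"]
# Within the male group the observed scores are a random sample, so
# mean(men_obs) is close to mean(men_all); the overall observed mean,
# however, is pulled toward the female distribution.
```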
When missingness is not MCAR or MAR, the observed data are not a
random sample from the original sample or from subsamples. Thus, the
missingness is nonignorable. In practice, a researcher can only observe that
item scores are missing. To decide whether item score missingness is
ignorable or nonignorable, he/she has to rely on the pattern of item score
missingness in the data matrix, X. When he/she finds no relationships to other
observed variables, he/she may decide that the missingness is of the MCAR
type. When a relationship to other observed variables is found, he/she may
use these variables as covariates in multivariate analyses or to impute
scores. When a more complex pattern of relationships is found, item score
missingness may be considered nonignorable. A reasonable solution is to
impute scores when the imputation method is backed up by robustness
studies (e.g., Bernaards & Sijtsma, 2000, for factor analysis of rating scale
data; and Huisman & Molenaar, 2001, in the context of test construction).
Missing Item Score Analysis
Theory for Analysis of the Whole Data Matrix
The scores on the J items are collected in J random variables Xj, j = 1, ...,
J. For respondent i (i = 1, ..., n), the J item scores, Xij, have realizations xij. Let
Mij be an indicator of a missing score with realization mij; mij = 0 if Xij is
observed and mij = 1 if Xij is missing. These missingness indicators are
collected in an n × J matrix M.
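As a minimal sketch, the indicator matrix M defined above, and the per-item missingness proportions derived from it, can be computed as follows (missing scores coded here as None):

```python
# Minimal sketch: build the indicator matrix M from a small data matrix X
# (missing scores coded as None), plus per-item missingness proportions.
X = [
    [3, None, 5, 2],
    [1, 4, None, None],
    [2, 2, 3, 4],
]

M = [[1 if x is None else 0 for x in row] for row in X]

n, J = len(M), len(M[0])
p = [sum(M[i][j] for i in range(n)) / n for j in range(J)]  # per-item proportions
# M = [[0, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]; p = [0.0, 1/3, 1/3, 1/3]
```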
Huisman (1999; Kim & Curry, 1978) investigated whether or not the
pattern of missingness in the data matrix X is unrelated among items. This
is called random missingness and is defined as follows. Frequency counts
of observed missing scores and expected missing scores are compared,
given statistical independence of the missingness between the items. Thus,
whether a respondent misses the score on item j is unrelated to whether he
(or she) misses the score on item k. Items j and k may have different
proportions of missing scores. A more restricted assumption, to be used
later on, is that the proportions for all J items are equal, as is typical of
MCAR. It may be noted that MCAR implies random missingness.
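A rough sketch in the spirit of this comparison (not Huisman's exact statistic) is the following: under independence, the number of missing scores per respondent follows a Poisson-binomial distribution with the per-item missingness proportions as parameters, and observed and expected frequency counts can then be compared with a chi-square-type statistic.

```python
import random

def random_missingness_statistic(M):
    """Chi-square-type statistic comparing observed and expected counts of
    the number of missing items per respondent, assuming that missingness
    is independent between items (a sketch, not Huisman's test)."""
    n, J = len(M), len(M[0])
    p = [sum(row[j] for row in M) / n for j in range(J)]  # per-item proportions
    # Poisson-binomial pmf of the per-respondent number of missing items
    pmf = [1.0] + [0.0] * J
    for pj in p:
        for k in range(J, 0, -1):
            pmf[k] = pmf[k] * (1 - pj) + pmf[k - 1] * pj
        pmf[0] *= 1 - pj
    expected = [n * q for q in pmf]
    counts = [0] * (J + 1)
    for row in M:
        counts[sum(row)] += 1
    # Pool cells with small expected counts into one tail cell before
    # computing the chi-square statistic.
    stat, obs_tail, exp_tail = 0.0, 0, 0.0
    for o, e in zip(counts, expected):
        if e >= 5:
            stat += (o - e) ** 2 / e
        else:
            obs_tail += o
            exp_tail += e
    if exp_tail > 0:
        stat += (obs_tail - exp_tail) ** 2 / exp_tail
    return stat

random.seed(3)
M = [[1 if random.random() < 0.10 else 0 for _ in range(10)] for _ in range(500)]
stat = random_missingness_statistic(M)  # modest when missingness is independent
```

Comparing the statistic to a chi-square distribution requires care with the degrees of freedom, because the J proportions are estimated from the data.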
In our one-data set example, Huisman’s (1999) overall test statistic
correctly detected both simulated ignorable and nonignorable item score
missingness, given an appropriate sample size. When ignorable
item score missingness is found, we may have confidence that single
imputation or another method probably will not greatly invalidate the data.
Alternative classifications of missingness patterns than those used for
Huisman’s method may provide additional ways to test for MCAR or MAR.
Under MCAR any classification of the respondents or the items should fit.
Possibly useful classifications are those based on meaningful covariates,
such as gender, social-economic status and age.
Imputation methods PM and TW are so simple that they can be explained
easily to researchers who are not statistically trained. Also, they are easy to
compute using major software packages such as SPSS and SAS. Methods
RF and MRF use the response function, estimated nonparametrically from
the fully observed respondents, thus ignoring the common and more
restrictive assumptions typical of IRT models. These methods are also rather
easy to explain, but their computation can be cumbersome. This is true
especially for method RF when the restscore groups are small and have to
be joined.

[Table: Rasch Analysis Bias Results for Q2, for Ignorable (MCAR) and
Nonignorable Missingness Mechanisms, q = .01, .05, and .10, and Imputation
Methods PM, TW, RF, and MRF; Q2 = 2112 (df = 1150) for Complete Data.
Note. J = 23; due to Rasch model estimation properties n varies from 620
to 643 across cells.]

A simple computer program called impute.exe with the four
imputation methods implemented for both dichotomous and polytomous items
can be obtained from the authors at
http://www.uvt.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html.
The software was
written in Borland Pascal 7.0. The maximum order of data matrix X for which
the program works has not yet been explored.
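As a point of reference, methods PM and TW can be sketched in a few lines, using the common definitions (person mean for PM; person mean plus item mean minus overall mean for TW, cf. Bernaards & Sijtsma, 2000). This is an independent sketch, not the impute.exe implementation, and it assumes every respondent and item has at least one observed score:

```python
# Sketch of person mean (PM) and two-way (TW) imputation; missing scores
# are coded as None. TW imputes PM_i + IM_j - OM (person mean plus item
# mean minus overall mean); for discrete items the result may afterwards
# be rounded to the nearest admissible score.
def impute(X, method="TW"):
    n, J = len(X), len(X[0])
    pm = []
    for row in X:
        obs = [x for x in row if x is not None]
        pm.append(sum(obs) / len(obs))      # person mean over observed scores
    im = []
    for j in range(J):
        col = [X[i][j] for i in range(n) if X[i][j] is not None]
        im.append(sum(col) / len(col))      # item mean over observed scores
    obs_all = [x for row in X for x in row if x is not None]
    om = sum(obs_all) / len(obs_all)        # overall mean
    Y = [row[:] for row in X]
    for i in range(n):
        for j in range(J):
            if Y[i][j] is None:
                Y[i][j] = pm[i] if method == "PM" else pm[i] + im[j] - om
    return Y

X = [[1, None], [3, 4]]
# PM imputes pm_0 = 1 for the missing cell; TW imputes 1 + 4 - 8/3 = 7/3.
```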
Method RF was superior to methods PM, TW, and MRF in estimating the
alpha and H coefficients, and the Rasch model statistics R1c and Q2. Method
TW produced higher percentages of hits than the other methods, but this
resulted sometimes in estimates of alpha and H that were too high. Method
RF may produce unstable results for small numbers of fully observed
respondents. Consequently, the estimates of the response probabilities may
be inaccurate. Method TW may be more stable, and may be preferred for
smaller sample sizes. Methods RF and TW may also be useful when item
score missingness is nonignorable. A reviewer suggested that deleting cases
from the analysis with more than, say, half of the item scores missing may
further improve results. This is a possible topic for future research. Finally,
each of the methods probably works best when the data are unidimensional.
Multidimensionality is addressed by Van der Ark and Sijtsma (in press).
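Purely as an illustration of the restscore idea, an RF-style imputation might be sketched as follows; this is a hypothetical reconstruction, not the authors' algorithm. Fully observed respondents are grouped by restscore, and a missing score is drawn from the empirical score distribution of the item in the matching group:

```python
import random

def rf_style_impute(X, num_groups=4, seed=0):
    """Hypothetical restscore-based imputation sketch (not the RF algorithm
    of this article). Missing scores are coded as None; it is assumed that
    there are enough fully observed respondents and that every respondent
    has at least one observed score besides the missing item."""
    rng = random.Random(seed)
    n, J = len(X), len(X[0])
    complete = [row for row in X if None not in row]
    Y = [row[:] for row in X]
    for i in range(n):
        for j in range(J):
            if Y[i][j] is not None:
                continue
            # Restscore of respondent i: mean observed score on the other
            # items, rescaled to the range of a (J - 1)-item restscore.
            obs = [x for k, x in enumerate(X[i]) if k != j and x is not None]
            rest_i = (J - 1) * sum(obs) / len(obs)
            # Group the complete respondents by their restscore on items != j
            scored = sorted(complete, key=lambda r: sum(r) - r[j])
            size = max(1, len(scored) // num_groups)
            g = min(sum(1 for r in scored if sum(r) - r[j] < rest_i) // size,
                    num_groups - 1, len(scored) - 1)
            lo = g * size
            hi = len(scored) if g == num_groups - 1 else (g + 1) * size
            # Draw from the empirical distribution of item j in the group
            Y[i][j] = rng.choice(scored[lo:hi])[j]
    return Y
```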
Single imputation may introduce too little error into the data, resulting
in standard errors that are too small (Little & Rubin, 1987, p. 256).
The analysis of test data usually is more involved, however, calculating large
numbers of statistics, testing many hypotheses, and selecting items based on
such calculations. Moreover, test construction has a cyclic character,
leaving out items in one cycle, re-analyzing the data for remaining items,
leaving out another item as well or re-selecting a previously rejected item in
another cycle, and so on. It would be interesting to see how multiple
imputation (e.g., Rubin, 1991) can help to obtain more stable conclusions for
item analysis. This is a topic for future research.
References
Agresti, A. (1990). Categorical data analysis. New York: Wiley.
Baker, F. B. (1992). Item response theory. Parameter estimation techniques. New York:
Marcel Dekker.
Bernaards, C. A. & Sijtsma, K. (1999). Factor analysis of multidimensional polytomous
item response data suffering from ignorable item nonresponse. Multivariate
Behavioral Research, 34, 277-313.
Bernaards, C. A. & Sijtsma, K. (2000). Influence of imputation and EM methods on factor
analysis when item nonresponse in questionnaire data is nonignorable. Multivariate
Behavioral Research, 35, 321-364.
Cressie, N. & Read, T. R. C. (1984). Multinomial goodness-of-fit tests. Journal of the
Royal Statistical Society, Series B, 46, 440-464.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests.
Psychometrika, 16, 297-334.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series
B, 39, 1-38.
Fischer, G. H. & Molenaar, I. W. (Eds.). (1995). Rasch models. Foundations, recent
developments, and applications. New York: Springer.
Glas, C. A. W. & Verhelst, N. D. (1995). Testing the Rasch model. In G. H. Fischer & I.
W. Molenaar (Eds.), Rasch models. Foundations, recent developments, and
applications (pp. 69-95). New York: Springer.
Hemker, B. T., Sijtsma, K., Molenaar, I. W., & Junker, B. W. (1997). Stochastic ordering
using the latent trait and the sum score in polytomous IRT models. Psychometrika, 62,
331-347.
Huisman, J. M. E. (1999). Item nonresponse: Occurrence, causes, and imputation of
missing answers to test items. Leiden, The Netherlands: DSWO Press.
Huisman, J. M. E. & Molenaar, I. W. (2001). Imputation of missing scale data with item
response models. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.),
Essays on item response theory (pp. 221-244). New York: Springer.
Junker, B. W. (1993). Conditional association, essential independence, and monotone
unidimensional item response models. The Annals of Statistics, 21, 1359-1378.
Junker, B. W. & Sijtsma, K. (2000). Latent and manifest monotonicity in item response
models. Applied Psychological Measurement, 24, 65-81.
Kim, J. O. & Curry, J. (1978). The treatment of missing data in multivariate analysis. In
D. F. Alwin (Ed.), Survey design and analysis (pp. 91-116). London: Sage.
Koehler, K. & Larntz, K. (1980). An empirical investigation of goodness-of-fit statistics
for sparse multinomials. Journal of the American Statistical Association, 75, 336-344.
Little, R. J. A. & Rubin, D. B. (1987). Statistical analysis with missing data. New York:
Wiley.
Little, R. J. A. & Schenker, N. (1995). Missing data. In G. Arminger, C. C. Clogg, & M.
E. Sobel (Eds.), Handbook of statistical modeling for the social and behavioral
sciences (pp. 39-75). New York: Plenum.
Lord, F. M. & Novick, M. R. (1968). Statistical theories of mental test scores. Reading,
MA: Addison-Wesley.
Mokken, R. J. (1971). A theory and procedure of scale analysis. The Hague: Mouton/
Berlin: De Gruyter.
Mokken, R. J. & Lewis, C. (1982). A nonparametric approach to the analysis of
dichotomous item responses. Applied Psychological Measurement, 6, 417-430.
Molenaar, I. W. & Sijtsma, K. (2000). User’s manual MSP5 for Windows. Groningen,
The Netherlands: iec ProGAMMA.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests.
Copenhagen: Nielsen & Lydiche.
Rubin, D. B. (1991). EM and beyond. Psychometrika, 56, 241-254.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.
Sijtsma, K. & Molenaar, I. W. (2002). Introduction to nonparametric item response theory.
Thousand Oaks, CA: Sage.
Smits, N., Mellenbergh, G. J., & Vorst, H. C. M. (2002). Alternative missing data
techniques to grade point average: Imputing unavailable grades. Journal of
Educational Measurement, 39, 187-206.
Tanner, M. A. & Wong, W. H. (1987). The calculation of posterior distributions by data
augmentation. Journal of the American Statistical Association, 82, 528-550.
Van den Wollenberg, A. L. (1982). Two new test statistics for the Rasch model.
Psychometrika, 47, 123-140.
Van der Ark, L. A. & Sijtsma, K. (in press). The effect of missing data imputation on
Mokken scale analysis. In L. A. Van der Ark, M. A. Croon, & K. Sijtsma (Eds.), New
developments in categorical data analysis for the social and behavioral sciences.
Mahwah, NJ: Erlbaum.
Vingerhoets, A. J. J. M. & Cornelius, R. R. (Eds.) (2001). Adult crying. A biopsychosocial
approach. Hove, UK: Brunner-Routledge.
Von Davier, M. (1997). Bootstrapping goodness-of-fit statistics for sparse categorical
data — Results of a Monte Carlo Study. Methods of Psychological Research Online.
Retrieved January 3, 2002, from the World Wide Web: http://www.mpr-online.de.
Accepted April, 2003.