Download by: [University of Miami] Date: 31 December 2015, At: 11:29
Multivariate Behavioral Research
ISSN: 0027-3171 (Print) 1532-7906 (Online) Journal homepage: http://www.tandfonline.com/loi/hmbr20
Evaluating the Sampling Performance of Exploratory and Cross-Validated DETECT Procedure with Imperfect Models
Cengiz Zopluoglu
To cite this article: Cengiz Zopluoglu (2015) Evaluating the Sampling Performance of Exploratory and Cross-Validated DETECT Procedure with Imperfect Models, Multivariate Behavioral Research, 50:6, 632–644, DOI: 10.1080/00273171.2015.1070708
To link to this article: http://dx.doi.org/10.1080/00273171.2015.1070708
Accepted author version posted online: 15 Jul 2015. Published online: 05 Nov 2015.
Multivariate Behavioral Research, 50:632–644, 2015
Copyright © Taylor & Francis Group, LLC
ISSN: 0027-3171 print / 1532-7906 online
DOI: 10.1080/00273171.2015.1070708
Evaluating the Sampling Performance of Exploratory and Cross-Validated DETECT Procedure with Imperfect Models
Cengiz Zopluoglu
University of Miami
Among the methods proposed for identifying the number of latent traits in multidimensional IRT models, DETECT has attracted the attention of both methodologists and applied researchers as a nonparametric counterpart to other procedures. The current study investigated the overall performance of the DETECT procedure and its outcomes using a real-data sampling design recommended by MacCallum (2003) and compared the results with those from a purely simulated data set that was generated with a well-specified "perfect" model. The comparison revealed that the sampling behavior of the maximized DETECT value and R-ratio statistics was quite robust to minor factors and other model misspecifications that potentially exist in the real data set, as there were negligible differences between the results of the real and simulated data sets. Item classification accuracy was also nearly identical for the real and simulated data sets. The accuracy of the identified number of dimensions reported by DETECT was the only outcome with an obvious difference between the purely simulated data set and the real data set. While the difference was small for smaller sample sizes, the identified number of dimensions was more accurate for larger sample sizes when the population data set was purely simulated. In many instances, the exploratory DETECT analysis outperformed the cross-validated DETECT analysis in terms of overall accuracy.
KEYWORDS: DETECT; dimensionality; dimensionality assessment; IRT; item response theory; number of factors.
Dichotomous items (e.g., true/false items, multiple-choice items) are typical in educational and psychological assessments, and different statistical models that link observed dichotomous outcomes to latent theoretical constructs have been developed. While these models are extensively used in modeling dichotomous response data, a challenging early step is to determine the number of latent traits in the model. Multiple latent traits can occur in educational and psychological testing due to either intended or unintended sources.
Correspondence concerning this article should be addressed to Cengiz Zopluoglu, Department of Educational and Psychological Studies, University of Miami, Max Orovitz Building, 333A 1570 Levante Ave., Coral Gables, FL 33146. Email: c.zopluoglu@miami.edu
While intended sources of multiple latent traits may be the planned content structure (e.g., subcomponents of a test such as algebra, geometry, and probability) or different item formats within a test, unintended sources of multiple latent traits may be construct-irrelevant abilities (e.g., a reading component in a math problem), speed of the test's administration, testing day, motivation, or dependencies among a set of items (Tate, 2002).
Although the number of underlying latent traits can be hypothesized a priori in a confirmatory approach, researchers' judgments about the number of latent traits may not always fit well to the item response data due to unintended sources of variability. An exploratory analysis may be helpful in identifying unintended sources of variability in item response data, and several standards in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) have been established to encourage such analysis (e.g., Standards 1.11, 1.12, and 3.17). Dimensionality assessment is a critical process that requires extra attention from both test developers
and test users; therefore, it is recommended as "part of a standard set of analyses conducted after each test administration" (Ackerman, 2005, p. 24).
Among the methods proposed for determining the number of latent traits in multidimensional IRT models, Dimensionality Evaluation to Enumerate Contributing Traits (DETECT) has attracted the attention of both methodologists and applied researchers as a nonparametric counterpart to other alternatives such as a chi-square test of fit (e.g., Gessaroli & De Champlain, 1996; Gessaroli, De Champlain, & Folske, 1997; Maydeu-Olivares, 2001; Schilling & Bock, 2005), model fit indices (e.g., Hu & Bentler, 1999; Yu & Muthen, 2002; Akaike, 1987), and a Bayesian posterior predictive model check (Levy & Svetina, 2011). Many methodological research studies have examined the statistical properties of the DETECT procedure under different conditions (e.g., Gierl, Leighton, & Tan, 2006; Monahan, Stump, Finch, & Hambleton, 2007; Roussos & Ozbek, 2006; Tan & Gierl, 2006), and DETECT seems to be increasingly used and reported in applied research (e.g., Cheng, 2011; Froelich & Jensen, 2002; Jang & Roussos, 2007; Puranik, Petscher, & Lonigan, 2013; Stewart, Batty, & Bovee, 2012). Although these methodological studies have addressed the performance and effectiveness of the DETECT procedure from several perspectives, most of them have been simulation studies that have shared a limitation: they assumed that the data-generating model that the item responses followed at the population level was a well-specified "perfect" model. More research is needed to understand the statistical properties of the DETECT indices (e.g., maximum DETECT value, R-ratio) and the overall performance of the DETECT procedure in determining the number of latent traits and classification accuracy under misspecified models ("imperfect models," as termed by MacCallum, 2003). As MacCallum (2003) stated, a more relevant and informative study would show "how our methods perform when the model in question is not correct in the population." Therefore, the primary purpose of the current study is to contribute to the literature by investigating the statistical properties of several outcomes of the DETECT procedure using a real data set with likely nuisance factors and comparing the results with those from a purely simulated data set with no model error.
DIMENSIONALITY ASSESSMENT AND
DETECT PROCEDURE
While dimensionality assessment has a long history in the exploratory factor analysis literature (under the heading of "factor retention criteria"), the issue has been primarily addressed in the IRT literature with a focus on assessing the assumption of unidimensionality (De Champlain & Gessaroli, 1998; Drasgow & Lissak, 1983; Finch & Habing, 2007; Finch & Monahan, 2008; Froelich, 2000; Hambleton & Rovinelli, 1986; Hattie, 1984; Hattie, Krakowski, Rogers, & Swaminathan, 1996; Nandakumar & Stout, 1993; Nandakumar & Yu, 1996; Seraphine, 1994, 2000; Stout, 1987; Tran & Formann, 2009; Weng & Cheng, 2005) because fitting a unidimensional model to multidimensional data may result in unwarranted inferences about individuals. Until efficient estimation algorithms and computer software became available to practitioners, unidimensional IRT models were commonly used with the acknowledgment that educational and psychological data in most instances did not meet the assumption of unidimensionality. Therefore, studies investigating the direct effects of a multidimensional data structure on unidimensional IRT item and person parameter estimates (Ackerman, 1989; Ansley & Forsyth, 1985; Drasgow & Parsons, 1983; Harrison, 1986; Kirisci & Hsu, 1995; Reckase, 1979; Wang, 1986; Way, Ansley, & Forsyth, 1988) and the indirect effects on unidimensional IRT applications (e.g., Ackerman, 1988; Bolt, 1999; Camilli, Wang, & Fesq, 1995; Cook, Dorans, Eignor, & Petersen, 1983; Cook, Eignor, & Taft, 1988; De Ayala, 1992; De Champlain, 1996; Dorans & Kingston, 1985; Folk & Green, 1989; Lau, 1996; Linn & Harnisch, 1981; Stocking & Eignor, 1986) appeared in the IRT research literature in the late 1970s and continued until the mid-1990s.
Direct effects are observed at the item and person parameter levels. The unidimensional estimates of the model parameters in the presence of multidimensionality are a weighted composite of the underlying traits, and these weights are primarily a function of the discrimination and difficulty parameters and the correlations between the latent traits. When multiple dimensions with major influences exist, a unidimensional analysis is expected to produce an estimate of ability that is a weighted average of abilities on multiple latent traits. Therefore, it becomes difficult to interpret the unidimensional item and person parameter estimates without any reference to the latent factor structure, and any interpretation should be made with extreme caution. Indirect effects are observed for many IRT applications, such as test equating, differential item functioning analysis, and computerized adaptive testing, through inaccurate unidimensional item and person parameter estimates in the presence of multidimensionality. For instance, Cook et al. (1983) examined and reported issues of scale drift in test equating when a unidimensional model was used for multidimensional data, and Dorans and Kingston (1985) reported that the presence of multidimensionality worsened the symmetry property of test equating under a unidimensional model.
When the assumption of unidimensionality is not plausible, DETECT is a conditional covariance-based nonparametric method that is proposed to assess the number of latent traits underlying item response data (Kim, 1994; Zhang & Stout, 1999). DETECT is based on the optimal partitioning of a set of items such that the items with positive conditional covariances are grouped in the same clusters and the items with negative conditional covariances are grouped in different clusters. The goal is to find the partition that maximizes the DETECT value, and the number of clusters in the optimum partition gives the identified number of major traits underlying the data. Kim (1994) proposed the following quantity for a prespecified partitioning of a set of items (P):

D(P) = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \upsilon_{ij} C(i, j \mid \theta),   (1)

where θ is a weighted composite of multiple latent abilities, and υ_ij equals 1 if the ith and jth items are in the same cluster and −1 otherwise. D(P) is a weighted sum of the conditional covariances across all item pairs, in which the sign of the weight υ_ij depends on the partition P. The value of D(P) drops if pairs of items with negative conditional covariances are assigned to the same cluster or pairs of items with positive conditional covariances are assigned to different clusters. By contrast, the value of D(P) increases if pairs of items with negative conditional covariances are assigned to different clusters or pairs of items with positive conditional covariances are assigned to the same cluster. C(i, j | θ) is the conditional covariance estimate between the ith and jth items, defined as

C_1 = \sum_{k=0}^{n} \frac{J_k}{N} C(i, j \mid S = k),   (2)

C_2 = \sum_{k=0}^{n} \frac{J_k}{N} C(i, j \mid S_{i,j} = k),   (3)

C(i, j \mid \theta) = \frac{C_1 + C_2}{2},   (4)

where S is the total sum score obtained from all items, S_{i,j} is the rest score excluding items i and j, and J_k is the number of students with a score of k. The sum score and rest score serve as proxies for the weighted composite of multiple latent abilities θ. D(P) is an aggregate measure of pairwise local dependence for the entire test, and the conditional covariance terms should all be zero at the population level if the test is indeed unidimensional. Based on theory, D(P) is maximized for the true partitioning of items when the data are multidimensional.
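To make the estimator concrete, Equations 1 through 4 can be sketched in a few lines of code. The sketch below is illustrative only and is not the DETECT program itself: the function name is hypothetical, and refinements in the released software (such as the MINCELL screening of sparse score groups and bias corrections) are only crudely approximated.

```python
import numpy as np

def detect_value(responses, clusters):
    """Sketch of D(P) from Equations 1-4 for a 0/1 response matrix.

    responses: (N, n) array of dichotomous item scores.
    clusters:  length-n array of cluster labels defining partition P.
    """
    N, n = responses.shape
    total = responses.sum(axis=1)  # S: total sum score

    def cond_cov(i, j, scores):
        # Weighted average of within-score-group covariances (Eqs. 2-3).
        acc = 0.0
        for k in np.unique(scores):
            grp = scores == k
            J_k = grp.sum()
            if J_k < 2:  # crude stand-in for MINCELL screening
                continue
            c = np.cov(responses[grp, i], responses[grp, j], bias=True)[0, 1]
            acc += (J_k / N) * c
        return acc

    D = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            rest = total - responses[:, i] - responses[:, j]  # S_{i,j}
            cij = 0.5 * (cond_cov(i, j, total) + cond_cov(i, j, rest))  # Eq. 4
            sign = 1.0 if clusters[i] == clusters[j] else -1.0  # v_ij
            D += sign * cij
    return 2.0 / (n * (n - 1)) * D  # Eq. 1
```

For two-dimensional data with well-separated item clusters, D(P) evaluated at the true partition should exceed its value at a mismatched partition; the DETECT software additionally searches over partitions with a genetic algorithm and reports the index up to a scaling convention.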
Several cutoff criteria have been proposed to evaluate the magnitude of the DETECT index. For instance, Kim (1994) classified DETECT indices from 0 to 0.19 as indicators of unidimensionality, 0.20 to 0.39 as indicators of weak multidimensionality, 0.40 to 0.79 as indicators of moderate multidimensionality, and above 0.80 as indicators of strong multidimensionality. Stout, Nandakumar, and Habing (1996) proposed a slightly different classification by assigning the intervals (0, 0.10), (0.10, 0.50), (0.50, 1), (1, 1.50), and above 1.50, respectively, to unidimensionality and weak, moderate, strong, and very strong multidimensionality. Based on a simulation study, Roussos and Ozbek (2006) recommended using the intervals (0, 0.20), (0.20, 0.40), (0.40, 1), and above 1 for very weak, weak, moderate, and strong multidimensionality, respectively. Although these suggestions are helpful for estimating the amount of multidimensionality in the data, researchers are primarily interested in finding the number of traits or the correct partitioning of the items based on the latent traits.
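Interval rules like these are straightforward to encode. As a small illustration (the function name and label strings are hypothetical), the Roussos and Ozbek (2006) cutoffs quoted above could be applied as:

```python
def classify_detect(d):
    """Label a DETECT estimate using the Roussos and Ozbek (2006)
    cutoffs quoted in the text; illustrative, not part of DETECT."""
    if d < 0.20:
        return "very weak multidimensionality"
    if d < 0.40:
        return "weak multidimensionality"
    if d < 1.00:
        return "moderate multidimensionality"
    return "strong multidimensionality"
```

Under this rule, the population estimate of 0.186 reported later in this article would fall in the "very weak" band.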
In the exploratory framework, the total number of partitions for n elements is equal to a Bell number in mathematics and increases exponentially as the number of elements increases. For instance, the number of possible partitions reaches 115,975 for 10 items, and finding the number of traits underlying the data becomes an optimization problem of finding the partitioning of the n items with the highest DETECT value. Kim (1994) originally proposed using some prior judgments with the help of cluster analysis to begin, but no solution was given for finding the maximum DETECT value until a scientifically sound solution was developed (Zhang & Stout, 1999). Zhang and Stout (1999) first developed the theoretical justification for DETECT and then transferred the idea of the genetic algorithm from biostatistics to an optimization search for the maximum DETECT value among all possible partitions of a set of items. In this optimization process, an informed choice of a partition is specified by the user (e.g., based on cluster analysis) to start; then, the genetic algorithm is used to find the optimum partitioning that maximizes the DETECT value. The number of clusters in this partitioning is taken to be the number of major dimensions underlying the data. In the cross-validation framework, the data are first divided into training and validation subsets with user-defined sample sizes. The exploratory analysis is run on the training data set, and the optimal partitioning of the items is obtained. Then, the DETECT value is computed from the validation data set using the optimal partitioning of the items previously obtained from the training data set.
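The count of possible partitions quoted above can be verified with the Bell triangle recurrence; the sketch below (function name illustrative) reproduces the figure of 115,975 partitions for 10 items.

```python
def bell_number(n):
    """Number of ways to partition a set of n items (the Bell number),
    computed with the Bell triangle recurrence."""
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]          # each row starts with the previous row's last entry
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[-1]

print(bell_number(10))  # 115975
```

The exponential growth of this count is what makes an exhaustive search infeasible and motivates the genetic algorithm of Zhang and Stout (1999).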
One of the assumptions made when deriving the theoretical justification for the DETECT index is that the items have an approximate simple structure. In an approximate simple structure, items are expected to load primarily on one of the dimensions and to load relatively less on the other dimensions. Zhang and Stout (1999) showed that the ratio of the maximum DETECT value to the observed DETECT value can be used to assess whether the assumption of approximate simple structure holds. The observed DETECT value is computed by assuming that the set of items is unidimensional. This ratio ranges from 0 to 1; higher values indicate a simpler structure, and 0.8 is recommended as the cutoff value for approximate simple structure (Zhang & Stout, 1999). However, Tan and Gierl (2006) recommended a more relaxed threshold (0.55 and above). A simulation study found that the ratio index is not very effective at differentiating between simpler and more complex structures and that it is difficult to find a cutoff point that applies to all conditions (Finch, Stage, & Monahan, 2008).
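Given pairwise conditional covariance estimates and the maximizing partition, the ratio can be sketched as follows. This is only an illustration of the definition used later in this article (the maximized DETECT sum divided by the same sum with absolute conditional covariances, its upper bound); the function name is hypothetical, and the DETECT program computes this internally.

```python
import numpy as np

def r_ratio(cond_cov, labels):
    """R-ratio sketch: signed DETECT sum for the maximizing partition
    divided by the sum of absolute conditional covariances.

    cond_cov: symmetric (n, n) matrix of conditional covariance estimates.
    labels:   cluster labels of the maximizing partition.
    """
    n = cond_cov.shape[0]
    num = den = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            sign = 1.0 if labels[i] == labels[j] else -1.0  # v_ij
            num += sign * cond_cov[i, j]
            den += abs(cond_cov[i, j])
    return num / den
```

When every within-cluster covariance is positive and every between-cluster covariance is negative (perfect simple structure), the ratio equals 1; covariances whose signs contradict the partition pull it toward 0.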
RESEARCH PROBLEM
In his presidential address to the Society of Multivariate
Experimental Psychology, MacCallum (2003) criticized the
dominating approach in simulation studies as follows:
“Although studies based on this general approach may
provide some interesting information, I would argue that
they are of limited value. Although most Monte Carlo stud
ies can be criticized for some lack of realism, the approach
just described is especially problematic for one major reason:
It ignores the fact that our models are always wrong to some
degree, even in the population. This approach addresses the
question: How do our methods behave and perform when
the model in question is exactly correct in the population?
Although answers to this question might be of interest for
theorists, they are of only limited value to users of the meth
ods. A more realistic and relevant question is: How do our
methods behave and perform when the model in question is
not correct in the population? Answers to this question could
be more relevant and informative regarding the performance
of methods in practice” (p. 135)
Although the argument was mainly discussed from the factor analytic perspective, the same argument applies to IRT models. Researchers who generate data using IRT models with a known dimensional structure, either unidimensional or multidimensional, always implicitly assume that the model perfectly holds at the population level. As highly encouraged by MacCallum (2003), a more relevant and informative study should incorporate both model error and sampling error into the simulation process to mimic a more realistic scenario. According to MacCallum, there are two ways to design such a study. The first design involves finding a large, real data set and treating this data set as a population to conduct a sampling study by drawing samples of the desired size from that population. It is expected that the real data set, with a large sample of observations, contains some degree of model error and very little sampling error. The second design recommended by MacCallum involves using the common factor model proposed by Tucker, Koopman, and Linn (1969), which includes a small number of latent traits with major influence (e.g., 1, 2, 3, 4) and a large number of latent traits with minor influence (e.g., 150), when simulating data. While the common factor model of Tucker et al. was proposed in the factor analytic framework, it can easily be adopted to generate data based on a compensatory multidimensional 2PL or 3PL model with major and minor latent traits.
Thus far, the methodological studies (except for that by Svetina, 2013, who used a noncompensatory model for data generation) have shown that the DETECT procedure provides useful information in certain conditions (e.g., approximate simple structure, not too high a correlation among the latent dimensions) when the data-generating structure aligns well with the principles on which the DETECT procedure was theorized. However, Svetina (2013) showed that the DETECT procedure may not perform at an acceptable level in many instances when the data follow a noncompensatory model and have a complex factor structure. In addition, Roussos and Ozbek (2006) conducted the only study that examined the statistical properties of the DETECT estimator for multidimensional models, and this study was limited because the generating model was a well-specified compensatory MIRT model with a relatively clean structure and the sample size was extremely large (N = 120,000). They noted the need to study the statistical properties of the DETECT estimator using real data sets with less clean structures and smaller sample sizes. Therefore, the aim of the current study is to investigate the outcomes of the DETECT procedure using a real data set with likely nuisance factors and to compare the results with those from an identical design using a simulated data set generated from a well-specified perfect model identified from the real data set.
METHOD
Sample and Data Sets
The real data set came from the administration of Booklet 3 to 27,203 students who completed the international PISA assessment in 2012 (OECD, 2013). The booklet included 57 items in total: 25 items in the mathematics domain, 14 items in the reading domain, and 18 items in the science domain. For the current analysis, the 55 items with only dichotomous scores were included, and the 2 items with partial credit scores were excluded. The analysis sample included 25,263 students who attempted all 55 items (24 items in mathematics, 14 items in reading, and 17 items in science). Of these 55 items, 45 items were also embedded in 16 different testlets (5 in mathematics, 6 in reading, and 5 in science). The item means ranged from 0.04 to 0.87, with an average of 0.48, for the mathematics domain; from 0.17 to 0.98, with an average of 0.68, for the reading domain; and from 0.16 to 0.87, with an average of 0.50, for the science domain. This data set was selected because it had three major dimensions with potential interfactor correlations, and an approximate simple structure was expected, although some cross-loadings were likely to occur due to a potential reading component for the mathematics and science items and a potential mathematics component for the science items.
To examine the dimensionality of the PISA data set, an exploratory DETECT analysis was run by setting the MINCELL option to 2 and the MUTATIONS option to 11. The MINCELL option indicates the minimum number of examinees required to be present in any one cell when calculating the conditional covariances, and the MUTATIONS option indicates the number of vectors mutated in the genetic algorithm when maximizing the DETECT value to find the optimal cluster solution. The maximum number of dimensions to be found was set to 12. The maximized DETECT value was estimated as 0.186, and the R-ratio was estimated as
Downloaded by [University of Miami] at 11:29 31 December 2015
636 ZOPLUOGLU
TABLE 1
Item Clustering From DETECT Analysis for the PISA 2012 Data Set (Booklet 3)
DETECT Analysis
Dimension 1 Dimension 2 Dimension 3
Mathematics M00FQ01, M273Q01, M408Q01,
M420Q01, M446Q01, M446Q02,
M447Q01, M464Q01, M559Q01,
M800Q01, M828Q01, M828Q02,
M828Q03, M903Q03, M923Q01,
M923Q03, M923Q04, M924Q02,
M995Q01, M995Q02, M995Q03
M918Q01, M918Q02, M918Q05
Reading R220Q01, R220Q02B, R220Q04,
R432Q01, R432Q05, R432Q06,
R446Q03, R446Q06, R456Q01,
R456Q02, R456Q06, R466Q02,
R466Q03, R466Q06
Science S466Q05 S519Q03 S269Q01, S269Q03, S269Q04,
S408Q01, S408Q03, S408Q04,
S408Q05, S466Q01,
S466Q07, S519Q02, S521Q02,
S521Q06, S527Q01, S527Q03,
S527Q04
DETECT: Dimensionality Evaluation to Enumerate Contributing Traits. PISA: Programme for International Student Assessment.
Note. The letters M, S, and R at the beginning of each item label refer to proprietary items on mathematics, science, and reading sections of the PISA 2012
data set (Booklet 3), respectively. Dimensions 1, 2, and 3 identiﬁed as a result of DETECT analysis were interpreted as the mathematics, reading, and science
domains in the test, respectively.
0.776. The exploratory DETECT analysis suggested a three-dimensional solution, as shown in Table 1. Twenty-one out of 24 items in the mathematics domain, 15 out of 17 items in the science domain, and all reading items were clustered together in three separate dimensions. Three mathematics items and 1 science item were grouped with the reading items, suggesting a likely reading component for these items, and 1 science item was grouped with the mathematics items, suggesting a likely mathematical component for this item. In addition, a cross-validated DETECT analysis with a 50-50 split was run. Identical item clustering was obtained, with a maximized DETECT value of 0.175 and an R-ratio estimate of 0.657.
The PISA data set was further analyzed using NOHARM (Fraser & McDonald, 1988), implementing an unweighted least squares estimation to fit a polynomial factor model up to the third degree using Hermite-Chebyshev polynomials to approximate a multidimensional IRT model (Maydeu-Olivares, 2001; McDonald, 1967). The Promax-rotated factor loading estimates from the NOHARM analysis are reported in Table 2. The estimated factor structure was well aligned with the results of the DETECT analysis. The 3 mathematics items and 1 science item that were grouped with the reading items in the DETECT analysis loaded highest on the reading domain. Similarly, the 1 science item that was grouped with the mathematics items loaded highest on the mathematics domain. All of the other items loaded highest on their respective domains. The correlations among the achievement domains ranged from 0.697 to 0.769 (Table 3).
In addition to the real data set, a simulated data set was generated using the common factor model, following the same structure reported in Table 2 and Table 3 with the same sample size and item means as those in the real data set. The simulated data set was analyzed using the exploratory and cross-validated DETECT procedures with a 50-50 split. In both the exploratory and cross-validated analyses, identical item clusters were obtained, with a maximized DETECT value of 0.181 and an R-ratio of 0.868 for the exploratory analysis and a maximized DETECT value of 0.179 and an R-ratio of 0.836 for the cross-validated analysis. The data characteristics and major factor structure of the simulated data set were nearly exactly the same as those of the real data. The only difference was that the simulated data set lacked the minor factors present in the real data set, such as testlets, and any other potential model misspecifications that may arise from a real test administration. In fact, this was reflected in the R-ratio estimate, which was slightly higher for the simulated data set than for the real data set, likely an indication of a simpler structure due to the lack of minor factors and other model misspecifications.
STUDY DESIGN
The two large data sets, the real data set and the simulated data set, were treated as the populations for the sampling study. To study the performance of the exploratory and cross-validated DETECT procedures, samples of students with six different sample sizes (N = 100, 250, 500, 1,000, 2,500, 5,000)
TABLE 2
Three-Dimensional Solution of the PISA 2012 Data Set (Booklet 3) Using Nonlinear Factor Analysis

Item Label | Item Mean | Math | Reading | Science (Standardized Factor Loadings)
M00FQ01  0.43   0.476   0.188   0.067
M273Q01  0.49   0.497   0.013   0.034
M408Q01  0.36   0.295   0.209   0.137
M420Q01  0.48   0.381   0.192   0.174
M446Q01  0.62   0.343   0.185   0.177
M446Q02  0.07   0.770  −0.203   0.186
M447Q01  0.66   0.399   0.144   0.150
M464Q01  0.24   0.734  −0.065   0.094
M559Q01  0.60   0.278   0.122   0.228
M800Q01  0.86   0.284   0.102   0.032
M828Q01  0.25   0.443   0.223   0.106
M828Q02  0.51   0.343   0.213   0.099
M828Q03  0.24   0.417   0.097   0.151
M903Q03  0.29   0.664   0.061   0.014
M918Q01  0.87   0.038   0.527  −0.149
M918Q02  0.77   0.342   0.390  −0.023
M918Q05  0.76   0.327   0.365  −0.045
M923Q01  0.57   0.563  −0.044   0.172
M923Q03  0.51   0.639  −0.095  −0.001
M923Q04  0.17   0.615  −0.113   0.247
M924Q02  0.62   0.654   0.093  −0.016
M995Q01  0.58   0.672   0.061   0.072
M995Q02  0.04   0.691  −0.206   0.094
M995Q03  0.46   0.485   0.179   0.052
R220Q01  0.43   0.123   0.327   0.292
R220Q02  0.68   0.048   0.417   0.233
R220Q04  0.61   0.050   0.379   0.160
R432Q01  0.90  −0.118   0.781   0.118
R432Q05  0.78  −0.009   0.657   0.084
R432Q06  0.17   0.076   0.347   0.163
R446Q03  0.95  −0.152   0.792  −0.001
R446Q06  0.82   0.006   0.649  −0.019
R456Q01  0.98  −0.167   0.628   0.023
R456Q02  0.83  −0.103   0.575   0.060
R456Q06  0.83  −0.057   0.622   0.039
R466Q02  0.48   0.170   0.425   0.064
R466Q03  0.17   0.106   0.264   0.185
R466Q06  0.85   0.013   0.759  −0.066
S269Q01  0.58   0.085   0.170   0.487
S269Q03  0.45   0.190   0.041   0.539
S269Q04  0.35   0.142  −0.240   0.638
S408Q01  0.61   0.070   0.028   0.475
S408Q03  0.29   0.020   0.124   0.376
S408Q04  0.53  −0.083   0.093   0.368
S408Q05  0.41   0.097  −0.023   0.462
S466Q01  0.71  −0.104   0.232   0.488
S466Q05  0.52   0.261  −0.021   0.236
S466Q07  0.70  −0.100   0.187   0.379
S519Q02  0.54   0.088  −0.122   0.389
S519Q03  0.26   0.015   0.240   0.238
S521Q02  0.53   0.071  −0.143   0.448
S521Q06  0.87   0.002   0.305   0.458
S527Q01  0.16  −0.025   0.101   0.507
S527Q03  0.56  −0.025   0.010   0.438
S527Q04  0.53  −0.095   0.022   0.558

Note. The highest factor loading in a row is bolded in the original table. The letters M, S, and R at the beginning of each item label refer to proprietary items on the mathematics, science, and reading sections of the PISA 2012 data set (Booklet 3), respectively.
were repeatedly drawn 10,000 times from each large data set. After drawing the random samples, the exploratory and cross-validated DETECT analyses were run for each of the 120,000 sample data sets in the same way that the population data sets were analyzed, and the results from the sample data sets were treated as sample estimates for the DETECT outcomes. There were 24 different conditions in a 2 × 6 × 2 factorial design, where type of data set (2; real vs. simulated) and sample size (6; 100, 250, 500, 1,000, 2,500, or 5,000) served as between-subjects factors, and type of DETECT analysis
TABLE 3
Correlation Matrix Between Three Dimensions After
Promax Rotation
Mathematics Reading Science
Mathematics 1
Reading 0.697 1
Science 0.769 0.736 1
(2; exploratory analysis and cross-validated analysis with a 50-50 split) served as a within-subjects factor. Each cell had 10,000 replications.
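The replication scheme amounts to repeatedly subsampling rows from each population data set. A minimal sketch, assuming the population is held as a row-per-student array (function name and the small replication count are illustrative; the study used 10,000 replications per condition):

```python
import numpy as np

def draw_samples(population, sizes, reps, seed=0):
    """Yield (n, rep, sample) tuples mimicking the sampling design:
    for each sample size, draw `reps` samples of n rows without
    replacement from the population data set."""
    rng = np.random.default_rng(seed)
    for n in sizes:
        for rep in range(reps):
            idx = rng.choice(population.shape[0], size=n, replace=False)
            yield n, rep, population[idx]
```

Each yielded sample would then be passed through the exploratory and cross-validated DETECT analyses, and the resulting statistics stored as sample estimates.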
OUTCOMES OF INTEREST
Several outcomes of the DETECT analyses included in previous research were also of interest in the current study. These outcomes were the maximized DETECT value, the R-ratio, the identified number of dimensions, and the item cluster assignments. The statistical properties of the maximized DETECT value have been examined in only two studies (Monahan et al., 2007; Roussos & Ozbek, 2006). The study by Roussos and Ozbek (2006) was the only one to examine the statistical properties of the DETECT estimator for compensatory multidimensional models with an extremely large sample size (N = 120,000), and it found that there was bias for short tests with fewer than 20 items. The bias was relatively small and
negligible for tests with more than 20 items. However, the
researchers noted the need for similar studies with real data
and realistic sample sizes. The second outcome of interest
was the Rratio, the ratio of the maximum DETECT value
to the average of the absolute conditional covariances across
all item pairs. Although previous research has included the
Rratio as an outcome of interest, no work has investigated
the sampling distribution of the Rratio. The third outcome
of interest was the number of identiﬁed latent dimensions by
the DETECT analysis. Conceptually, DETECT is expected
to identify major dimensions; however, in some cases, it puts
only one or two items in a cluster and counts the cluster
as a separate dimension, which does not necessarily make
sense in practice. Thus, in the current study, the “number
of dimensions” was considered the number of clusters in
cluding at least 3 items. The ﬁnal outcome of interest was
the accuracy of the item cluster assignment, measured by the
matching similarity coefﬁcient (MS coefﬁcient) described by
Mroch and Bolt (2006). The MS coefﬁcient was computed
usinga2× 2 table for each replication. Let a, b, c, and d be
the elements of the 2 × 2 table, where
• a is the proportion of item pairs correctly classiﬁed in the
same cluster,
• b is the proportion of item pairs incorrectly classiﬁed in
the same cluster,
• c is the proportion of item pairs incorrectly classiﬁed in
separate clusters, and
• d is the proportion of item pairs correctly classiﬁed in
separate clusters,
while the item cluster assignment as a result of the DETECT
analysis for the population data sets (in Table 1) was treated
as a reference. Then, the MS coefﬁcient is equal to (a + d),
the proportion of correctly classiﬁed item pairs.
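The MS coefficient can be sketched as follows (illustrative Python; the function name and the list-of-labels representation of cluster assignments are assumptions, not the authors' code):

```python
from itertools import combinations

def ms_coefficient(sample_clusters, population_clusters):
    """Matching similarity (MS) coefficient: the proportion of item pairs
    classified the same way (same cluster, or separate clusters) in both
    the sample and the reference (population) solutions, i.e. a + d.
    For a 55-item test there are 55 * 54 / 2 = 1,485 item pairs."""
    items = range(len(sample_clusters))
    agree = total = 0
    for i, j in combinations(items, 2):
        same_in_sample = sample_clusters[i] == sample_clusters[j]
        same_in_population = population_clusters[i] == population_clusters[j]
        agree += (same_in_sample == same_in_population)  # counts a + d pairs
        total += 1
    return agree / total
```

For example, if a four-item solution puts item 3 in the wrong cluster relative to the reference, three of the six item pairs change classification and the MS coefficient drops to 0.5.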
ANALYSIS
For the outcome measures described above, I treated the values obtained from the original data sets with a large sample size (N = 25,263) as the population parameters, and the values obtained from the sample data sets with varying sample sizes as the corresponding sample estimates. An exploratory DETECT analysis was run by setting the MINCELL option to 2 and the MUTATIONS option to 11 (20% of the total number of items). The maximum number of dimensions to be found was set to 12. For the cross-validated DETECT analysis, the same settings were in place, and 50% of the data were used for training and 50% for validation. The maximized DETECT value, R-ratio, number of identified dimensions, and item clusters from each replication were stored for further analysis.
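The 50-50 split in the cross-validated analysis can be sketched as follows (illustrative Python; `split_half` is a hypothetical helper, not part of the DETECT software — the training half identifies the clustering, the validation half evaluates it):

```python
import random

def split_half(responses, seed=0):
    """Randomly split examinees into a training half (used to identify the
    number of dimensions and the item clustering) and a validation half
    (used to compute the DETECT value and R-ratio for that clustering),
    mirroring a 50-50 cross-validated DETECT analysis."""
    rng = random.Random(seed)
    idx = list(range(len(responses)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train = [responses[i] for i in idx[:half]]
    valid = [responses[i] for i in idx[half:]]
    return train, valid
```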
RESULTS
Maximized DETECT Value
For each condition, the average maximized DETECT value was computed across 10,000 replications and compared to the corresponding population parameter. The bias across all conditions is presented in Figure 1. The exploratory DETECT analysis yielded maximized DETECT values with larger positive bias for sample sizes of less than 1,000, but the bias was negligible when the sample size was 1,000 or larger. The bias had a maximal value of approximately 0.2, 0.1, and 0.05 when the sample sizes were 100, 250, and 500, respectively. The cross-validated DETECT analysis with the 50-50 split showed a relatively smaller bias, in the opposite direction of the exploratory DETECT analysis. The bias was around −0.05 for sample sizes of 1,000 or less and very close to zero for larger sample sizes. The difference in bias between the purely simulated data sets and the real data sets was negligible.

FIGURE 1 Bias in maximized DETECT value for different levels of sample size, data set, and analysis.

In addition to bias, the standard deviation of the empirical sampling distributions for the maximized DETECT value was computed and examined across all conditions, and the empirical standard errors are presented in Figure 2. The exploratory DETECT analysis showed smaller sampling variability across 10,000 replications in all conditions; however, the differences between the exploratory and cross-validated DETECT analyses diminished as the sample size increased. Similarly, there was a negligible difference between the purely simulated data sets and the real data sets.

FIGURE 2 Standard error of maximized DETECT value for different levels of sample size, data set, and analysis.

The root mean squared error (RMSE), equal to √(Bias² + SE²), is a measure of overall accuracy that accounts for both bias and standard error. Figure 3 presents the results for RMSE across all conditions. The figure reveals that the maximized DETECT value from the cross-validated analysis with the 50-50 split provided the most accurate estimates for sample sizes smaller than 250, and the maximized DETECT value from the exploratory analysis provided the most accurate estimates for larger sample sizes. The results indicate a trade-off between bias and standard error for the maximized DETECT value when the exploratory or cross-validated analysis with the 50-50 split was used. The cross-validated analysis with the 50-50 split provided a nearly unbiased maximized DETECT value in most conditions, whereas the exploratory DETECT analysis provided a more positively biased maximized DETECT value, particularly for small sample sizes. However, the standard error of the maximized DETECT value was smaller in the exploratory analysis than in the cross-validated analysis. When bias and standard error were combined, the cross-validated analysis with the 50-50 split provided more accurate results for sample sizes of 100 and 250, but the exploratory analysis outperformed the cross-validated analysis in terms of overall accuracy for larger sample sizes.
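The accuracy measures used throughout the results can be sketched as follows (illustrative Python; the function name is an assumption, and the empirical SE is taken as the population-style standard deviation of the replications):

```python
import math

def bias_se_rmse(estimates, population_value):
    """Summarize an empirical sampling distribution of estimates:
    bias = mean estimate minus the population parameter,
    SE   = standard deviation of the estimates across replications,
    RMSE = sqrt(bias^2 + SE^2), an overall accuracy measure."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - population_value
    se = math.sqrt(sum((x - mean) ** 2 for x in estimates) / n)
    rmse = math.sqrt(bias ** 2 + se ** 2)
    return bias, se, rmse
```

The decomposition makes the reported trade-off concrete: an estimator with near-zero bias but a large SE can have a larger RMSE than a slightly biased estimator with a small SE.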
FIGURE 3 Root mean square error of maximized DETECT value for different levels of sample size, data set, and analysis.

R-ratio

Similar to the maximized DETECT value, the average R-ratio, the empirical standard error, and the RMSE across 10,000 replications for all conditions are presented in Figures 4, 5, and 6. The results revealed a negative bias in the R-ratio statistics obtained from both the exploratory and cross-validated DETECT analyses. The exploratory DETECT analysis provided less biased results than the cross-validated DETECT analysis, and the difference between the simulated data sets and the real data sets was negligible. Although the negative bias decreased as the sample size increased, it was still approximately 0.1 to 0.2 in magnitude even for sample sizes larger than 2,500. In terms of overall accuracy measured by the RMSE, the exploratory DETECT analysis provided the most accurate R-ratio estimates, but there was still a substantial amount of error: approximately 0.4 for sample sizes of 500 or less and approximately 0.1 for a sample size of 5,000. The consequence of negative bias would be a misinterpretation such that a data set would appear to display a more complex structure than it truly has. Therefore, if the R-ratio statistic is used in applied research, small R-ratio statistics should be interpreted cautiously, particularly for small sample sizes.

FIGURE 4 Bias in R-ratio estimate for different levels of sample size, data set, and analysis.

FIGURE 5 Standard error of R-ratio estimate for different levels of sample size, data set, and analysis.

FIGURE 6 Root mean square error of R-ratio estimate for different levels of sample size, data set, and analysis.
Number of Identified Dimensions

Recall that both the exploratory and cross-validated DETECT analyses identified three major dimensions for both the real and simulated large data sets treated as populations. The percentage of replications out of 10,000 for the identified number of dimensions under repeated sampling, along with the overall mean and standard deviation, is reported in Table 4. When the analysis was exploratory, the number of major dimensions was more accurately identified for the simulated data sets. The difference between the real and simulated data sets was small when the sample sizes were 100 and 250, where the percentage of correct decisions was approximately 40%, but noticeable for larger sample sizes. For both the real and simulated data sets, the exploratory DETECT analysis tended to identify four dimensions in most replications when an incorrect decision was made. When the analysis was cross-validated, there was a similar pattern. The difference between the real and simulated data sets was small for sample sizes smaller than 1,000, but the results were noticeably more accurate for the simulated data sets at larger sample sizes. Overall, the accuracy in identifying the number of major dimensions was slightly better, and the variability in the identified number of dimensions smaller, when the exploratory analysis was used.
Item Cluster Assignment

The MS coefficient indicates the proportion of correctly classified item pairs with respect to the item clustering based on the DETECT analysis for the population data sets, which was reported in Table 1. Note that both the exploratory and cross-validated DETECT analyses indicated the same item clustering for the real and simulated population data sets. There were 1,485 item pairs for the 55-item test. The number of pairs correctly classified as in the population data sets was
TABLE 4
The Percentage of Replications for the Number of Identified Dimensions Across Conditions

                                                Number of Identified Dimensions
Type of Data  Sample Size  Type of Analysis   1     2     3     4     5     6     7     8   Mean    SD
Real                100           E                 5.4  40.6  37.8  13.5   2.4   0.3        3.68  0.88
Real                250           E                 1.7  38.3  44.2  14.3   1.5              3.76  0.78
Real                500           E                 0.5  39.7  46.9  12.0   0.9              3.73  0.71
Real              1,000           E                      55.3  38.7   5.7   0.3              3.51  0.62
Real              2,500           E                      85.2  14.5   0.3                    3.15  0.37
Real              5,000           E                      97.3   2.7                          3.03  0.16
Simulated           100           E                 5.9  41.0  37.8  13.0   2.1   0.2        3.65  0.87
Simulated           250           E                 1.7  40.7  43.6  12.5   1.5              3.71  0.76
Simulated           500           E                 0.8  52.4  39.8   6.7   0.3              3.53  0.65
Simulated         1,000           E                 0.1  76.8  22.3   0.8                    3.24  0.45
Simulated         2,500           E                      97.3   2.7                          3.03  0.16
Simulated         5,000           E                      99.8   0.2                          3.00  0.05
Real                100           CV                0.2  18.6  47.6  26.5   6.2   0.8   0.1  4.23  0.86
Real                250           CV                7.3  44.2  35.8  10.6   1.9   0.2        3.56  0.87
Real                500           CV                2.9  42.1  41.4  12.1   1.4   0.1        3.67  0.78
Real              1,000           CV                1.0  39.3  45.6  12.7   1.3   0.1        3.74  0.74
Real              2,500           CV                0.1  49.3  43.0   7.3   0.3              3.58  0.64
Real              5,000           CV                     69.5  28.7   1.8                    3.32  0.51
Simulated           100           CV                0.1  18.1  47.3  27.4   6.2   0.9   0.1  3.24  0.86
Simulated           250           CV                7.5  44.7  35.2  10.8   1.6   0.2        3.55  0.86
Simulated           500           CV                3.4  43.0  40.3  11.7   1.6   0.1        3.65  0.80
Simulated         1,000           CV                1.4  44.9  42.6  10.3   0.7              3.64  0.71
Simulated         2,500           CV                0.3  67.8  29.6   2.2   0.1              3.34  0.53
Simulated         5,000           CV                0.1  89.5  10.3   0.1                    3.10  0.31

Note. E = exploratory DETECT analysis; CV = cross-validated DETECT analysis.
counted for each replication, and the proportion was computed. Therefore, the MS coefficient in the current study is an indicator of how well the item clustering at the sample level agreed with the item clustering at the population level. The average agreement across 10,000 replications within each condition is presented in Figure 7. The exploratory DETECT analysis performed better than the cross-validated analysis, with the average agreement gradually increasing from approximately 60% to approximately 95% as the sample size increased from 100 to 5,000. A similar increase in average agreement, from approximately 55% to approximately 90%, was observed for the cross-validated DETECT analysis as the sample size increased from 100 to 5,000. The difference between the real and simulated data sets in terms of item clustering accuracy was negligible.

FIGURE 7 Accuracy of item cluster assignment for different levels of sample size, data set, and analysis.
SUMMARY AND DISCUSSION

The current study extends the findings of previous research regarding several outcomes obtained from the DETECT procedure by implementing a sampling study with realistic sample sizes using real and simulated data sets with a similar major multidimensional latent structure. The well-known motto that "all models are wrong, but some of them are useful" (Box & Draper, 1987, p. 424) is often invoked to acknowledge model misspecification or model error in practice, but it is rarely integrated into the research design of simulation studies, particularly those on dimensionality assessment. The real data set used in the current study had a multidimensional structure due to its three major underlying domains (math, science, and reading) and other minor factors that may potentially arise in a real-world application. The purely simulated data set, based on a similar major factor structure with no minor factors, allowed for a comparison of the results and an observation of the potential implications of using perfect models in simulation studies, a practice that has been criticized (MacCallum, 2003). The comparison revealed that the sampling behavior of the maximized DETECT value and R-ratio statistics was quite robust to the minor factors and other model misspecifications that potentially exist in the real data set, as there were negligible differences between the results for the real and simulated data sets. The item classification accuracy was also nearly identical for the real and simulated data sets. The accuracy of the number of dimensions reported by DETECT was the only outcome for which the difference between using a purely simulated data set and a real data set became obvious. Although the difference was small for smaller sample sizes, the decisions were more accurate for larger sample sizes when the population data set was purely simulated.
One of the interesting findings of the current study is the negative bias observed in the R-ratio statistics. The statistical properties of the R-ratio index have never been systematically examined, although the issues of using the R-ratio to interpret the simplicity of the factor structure have been previously addressed (Finch & Monahan, 2008; Tan & Gierl, 2006). Further analysis of this bias revealed that the R-ratio is a ratio of two biased estimators, with the estimator in the denominator having a larger positive bias in magnitude than the one in the numerator. Future research should attempt to eliminate the bias from the components of the R-ratio statistic before its use is emphasized in practice. A recent study by Nandakumar, Yu, and Zhang (2011) made such an attempt but was not very successful.
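The mechanism can be illustrated with a toy simulation (illustrative Python, not the DETECT R-ratio itself; all numbers below are invented for illustration): when the denominator of a ratio carries a larger positive bias than the numerator, the ratio is pulled below its true value on average.

```python
import random

random.seed(1)
true_num, true_den = 1.0, 2.0           # true ratio = 0.5
ratios = []
for _ in range(100_000):
    num = true_num + random.gauss(0.02, 0.1)  # small positive bias in numerator
    den = true_den + random.gauss(0.20, 0.1)  # larger positive bias in denominator
    ratios.append(num / den)

mean_ratio = sum(ratios) / len(ratios)
# The average ratio falls noticeably below the true ratio of 0.5,
# mirroring the negative bias observed for the R-ratio.
```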
In many instances, the exploratory DETECT analysis outperformed the cross-validated DETECT analysis in terms of overall accuracy. This finding may seem counterintuitive to some readers because researchers are commonly advised to use cross-validation when possible. It can be speculated that this counterintuitive finding is due to the reduced amount of information used in the cross-validated DETECT analysis. A cross-validated DETECT analysis with a 50-50 split, as implemented in the current study, uses one half of the sample to identify the number of dimensions and the item clustering and the other half to compute the maximized DETECT value and R-ratio statistics based on the previously identified item clustering. For the maximized DETECT value, this eliminates bias at the expense of increased estimation error due to the reduced amount of information. When the gain from the smaller bias does not overcome the loss from the increased estimation error, an exploratory DETECT analysis yields more accurate results overall. Similarly, the cross-validated analysis with the 50-50 split uses half of the sample to identify the number of dimensions and the item clustering, whereas the exploratory analysis uses the whole sample. Therefore, it should not be surprising that the exploratory analysis provided more accurate results in terms of finding the major dimensions and item clustering.
While the current study increased our understanding of possible issues in interpreting DETECT output by extending the findings to more realistic data sets with possibly complex, imperfect underlying "true" models, the findings are also limited to the specific data sets used in the study. Future research using similar real-data sampling designs with different types of data sets would enhance and extend our understanding of potential issues related to DETECT output and provide practitioners with greater insights.
ARTICLE INFORMATION
Conﬂict of Interest Disclosures: The author signed a form
for disclosure of potential conﬂicts of interest. No author
reported any ﬁnancial or other conﬂicts of interest in relation
to the work described.
Ethical Principles: The author affirms having followed professional ethical guidelines in preparing this work. These guidelines include obtaining informed consent from human participants, maintaining ethical treatment of and respect for the rights of human or animal participants, and ensuring the privacy of participants and their data, such as ensuring that individual participants cannot be identified in reported results or from publicly available original or archival data.
Funding: There are no funders to report for this submission.
Role of the Funders/Sponsors: None of the funders or sponsors of this research had any role in the design and conduct of the study; collection, management, analysis, and interpretation of data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.
Acknowledgements: The author would like to thank the
editors, Keith F. Widaman and Stephen G. West, and three
anonymous reviewers for their comments on prior versions of
this manuscript. The ideas and opinions expressed herein are
those of the author alone, and endorsement by the author’s
institution is not intended and should not be inferred.
REFERENCES

Ackerman, T. A. (1988). An explanation of differential item functioning from a multidimensional perspective. Retrieved from ERIC database. (ED306281)

Ackerman, T. A. (1989). Unidimensional IRT calibration of compensatory and noncompensatory multidimensional items. Applied Psychological Measurement, 13(2), 113–127. http://dx.doi.org/10.1177/014662168901300201

Ackerman, T. A. (2005). Multidimensional item response theory modeling. In A. Maydeu-Olivares & J. J. McArdle (Eds.), Contemporary psychometrics (pp. 3–25). Mahwah, NJ: Erlbaum.

Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52(3), 317–332. http://dx.doi.org/10.1007/BF02294359

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, Joint Committee on Standards for Educational and Psychological Testing. (1999). Standards for educational and psychological testing. Washington, DC: APA.

Ansley, T., & Forsyth, R. (1985). An examination of the characteristics of unidimensional IRT parameter estimates derived from two-dimensional data. Applied Psychological Measurement, 9(1), 37–48. http://dx.doi.org/10.1177/014662168500900104

Bolt, D. M. (1999). Evaluating the effects of multidimensionality on IRT true score equating. Applied Measurement in Education, 12(4), 383–407. http://dx.doi.org/10.1207/S15324818AME1204_4

Box, G. E., & Draper, N. R. (1987). Empirical model-building and response surfaces. New York: Wiley.

Camilli, G., Wang, M. M., & Fesq, J. (1995). The effects of dimensionality on equating the Law School Admission Test. Journal of Educational Measurement, 32(1), 79–96. http://dx.doi.org/10.1111/j.1745-3984.1995.tb00457.x
Cheng, W. (2011). Examining the dimensionality of early numeracy skill measures (Unpublished doctoral dissertation). Pennsylvania State University, State College, PA.

Cook, L. L., Dorans, N. J., Eignor, D. R., & Petersen, N. S. (1983). An assessment of the relationship between the assumption of unidimensionality and the quality of IRT true-score equating. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.

Cook, L. L., Eignor, D. R., & Taft, H. L. (1988). A comparative study of the effects of recency of instruction on the stability of IRT and conventional item parameter estimates. Journal of Educational Measurement, 25(1), 31–45. http://dx.doi.org/10.1111/j.1745-3984.1988.tb00289.x

De Ayala, R. J. (1992). The influence of dimensionality on CAT ability estimation. Educational and Psychological Measurement, 52(3), 513–527. http://dx.doi.org/10.1177/0013164492052003002

De Champlain, A. F. (1996). The effect of multidimensionality on IRT true score equating for subgroups of examinees. Journal of Educational Measurement, 33(2), 181–201. http://dx.doi.org/10.1111/j.1745-3984.1996.tb00488.x

De Champlain, A., & Gessaroli, M. (1998). Assessing the dimensionality of item response matrices with small sample sizes and short test lengths. Applied Measurement in Education, 11(3), 231–253. http://dx.doi.org/10.1207/s15324818ame1103_2

Dorans, N. J., & Kingston, N. M. (1985). The effects of violations of unidimensionality on the estimation of item and ability parameters and on item response theory equating of the GRE verbal scale. Journal of Educational Measurement, 22(4), 249–262. http://dx.doi.org/10.1111/j.1745-3984.1985.tb01062.x

Drasgow, F., & Lissak, R. I. (1983). Modified parallel analysis: A procedure for examining the latent dimensionality of dichotomously scored item responses. Journal of Applied Psychology, 68(3), 363–373. http://dx.doi.org/10.1037/0021-9010.68.3.363

Drasgow, F., & Parsons, C. K. (1983). Application of unidimensional item response theory models to multidimensional data. Applied Psychological Measurement, 7, 189–199. http://dx.doi.org/10.1177/014662168300700207
Finch, H., & Habing, B. (2007). Performance of DIMTEST- and NOHARM-based statistics for testing unidimensionality. Applied Psychological Measurement, 31(4), 292–307. http://dx.doi.org/10.1177/0146621606294490

Finch, H., & Monahan, P. (2008). A bootstrap generalization of modified parallel analysis for IRT dimensionality assessment. Applied Measurement in Education, 21(2), 119–140. http://dx.doi.org/10.1080/08957340801926102

Finch, H., Stage, A. K., & Monahan, P. (2008). Comparison of factor simplicity indices for dichotomous data: DETECT R, Bentler's simplicity index, and the loading simplicity index. Applied Measurement in Education, 21, 41–64. http://dx.doi.org/10.1080/08957340701796365

Fraser, C., & McDonald, R. P. (1988). NOHARM: Least squares item factor analysis. Multivariate Behavioral Research, 23, 267–269. http://dx.doi.org/10.1207/s15327906mbr2302_9

Folk, V. G., & Green, B. F. (1989). Adaptive estimation when the unidimensionality assumption of IRT is violated. Applied Psychological Measurement, 13(4), 373–389. http://dx.doi.org/10.1177/014662168901300404

Froelich, A. G. (2000). Assessing unidimensionality of test items and some asymptotics of parametric item response theory (Doctoral dissertation). University of Illinois at Urbana-Champaign, Champaign, IL.

Froelich, A. G., & Jensen, H. H. (2002). Dimensionality of the USDA food security index. Retrieved from http://www.public.iastate.edu/~amyf/technicalreports/usdadimensionality.pdf

Gessaroli, M. E., De Champlain, A. F., & Folske, J. C. (1997, March). Assessing dimensionality using a likelihood-ratio chi-square test based on a nonlinear factor analysis of item response data. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Gierl, M. J., Leighton, J. P., & Tan, X. (2006). Evaluating DETECT classification accuracy and consistency when data display complex structure. Journal of Educational Measurement, 43(3), 265–289. http://dx.doi.org/10.1111/j.1745-3984.2006.00016.x

Hambleton, R., & Rovinelli, R. (1986). Assessing the dimensionality of a set of test items. Applied Psychological Measurement, 10(3), 287–302. http://dx.doi.org/10.1177/014662168601000307

Harrison, D. (1986). Robustness of IRT parameter estimation to violations of the unidimensionality assumption. Journal of Educational Statistics, 11(2), 91–115. http://dx.doi.org/10.2307/1164972

Hattie, J. (1984). An empirical study of various indices for determining unidimensionality. Multivariate Behavioral Research, 19(1), 49–78. http://dx.doi.org/10.1207/s15327906mbr1901_3

Hattie, J., Krakowski, K., Rogers, J., & Swaminathan, H. (1996). An assessment of Stout's index of essential unidimensionality. Applied Psychological Measurement, 20(1), 1–14. http://dx.doi.org/10.1177/014662169602000101

Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55. http://dx.doi.org/10.1080/10705519909540118

Jang, E. E., & Roussos, L. (2007). An investigation into the dimensionality of TOEFL using conditional covariance-based nonparametric approach. Journal of Educational Measurement, 44(1), 1–21. http://dx.doi.org/10.1111/j.1745-3984.2007.00024.x

Kim, H. R. (1994). New techniques for the dimensionality assessment of standardized test data (Doctoral dissertation). University of Illinois at Urbana-Champaign, Urbana, IL.

Kirisci, L., & Hsu, T. (1995). The robustness of BILOG to violations of the assumption of unidimensionality of test items and normality of ability distributions. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.
Lau, C. M. A. (1997). Robustness of a unidimensional computerized mastery testing procedure with multidimensional testing data (Unpublished doctoral dissertation). University of Iowa, Iowa City, IA.

Levy, R., & Svetina, D. (2011). A generalized dimensionality discrepancy measure for dimensionality assessment in multidimensional item response theory. British Journal of Mathematical and Statistical Psychology, 64, 208–232. http://dx.doi.org/10.1348/000711010X500483

Linn, R. L., & Harnisch, D. L. (1981). Interactions between item content and group membership on achievement test items. Journal of Educational Measurement, 18(2), 109–118. http://dx.doi.org/10.1111/j.1745-3984.1981.tb00846.x

MacCallum, R. C. (2003). Working with imperfect models. Multivariate Behavioral Research, 38(1), 113–139. http://dx.doi.org/10.1207/S15327906MBR3801_5

McDonald, R. P. (1967). Nonlinear factor analysis (Psychometric Monographs No. 15). Richmond, VA: Psychometric Corporation. Retrieved from http://www.psychometrika.org/journal/online/MN15.pdf

Maydeu-Olivares, A. (2001). Multidimensional item response theory modeling of binary data: Large sample properties of NOHARM estimates. Journal of Educational and Behavioral Statistics, 26(1), 51–71. http://dx.doi.org/10.3102/10769986026001051

Monahan, P. O., Stump, T. E., Finch, H., & Hambleton, R. K. (2007). Bias of exploratory and cross-validated DETECT index under unidimensionality. Applied Psychological Measurement, 31(6), 483–503. http://dx.doi.org/10.1177/0146621606292216

Mroch, A. A., & Bolt, D. M. (2006). A simulation comparison of parametric and nonparametric dimensionality detection procedures. Applied Measurement in Education, 19(1), 67–91. http://dx.doi.org/10.1207/s15324818ame1901_4
Nandakumar, R., & Stout, W. (1993). Refinements of Stout's procedure for assessing latent trait unidimensionality. Journal of Educational Statistics, 18(1), 41–68. http://dx.doi.org/10.2307/1165182

Nandakumar, R., & Yu, F. (1996). Empirical validation of DIMTEST on nonnormal ability distributions. Journal of Educational Measurement, 33(3), 355–368. http://dx.doi.org/10.1111/j.1745-3984.1996.tb00497.x

Nandakumar, R., Yu, F., & Zhang, Y. (2011). A comparison of bias correction adjustments for the DETECT procedure. Applied Psychological Measurement, 35(2), 127–144. http://dx.doi.org/10.1177/0146621610376767

OECD. (2013). PISA 2012 assessment and analytical framework: Mathematics, reading, science, problem solving and financial literacy. Paris: OECD Publishing. http://dx.doi.org/10.1787/9789264190511-en

Puranik, S. P., Petscher, Y., & Lonigan, C. J. (2013). Dimensionality and reliability of letter writing in 3- to 5-year-old preschool children. Learning and Individual Differences, 28, 133–141. http://dx.doi.org/10.1016/j.lindif.2012.06.011

Reckase, M. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207–230. http://dx.doi.org/10.2307/1164671

Roussos, L. A., & Ozbek, O. Y. (2006). Formulation of the DETECT population parameter and evaluation of DETECT estimator bias. Journal of Educational Measurement, 43(3), 215–243. http://dx.doi.org/10.1111/j.1745-3984.2006.00014.x

Schilling, S., & Bock, R. D. (2005). High-dimensional maximum marginal likelihood item factor analysis by adaptive quadrature. Psychometrika, 70, 533–555. http://dx.doi.org/10.1007/s11336-003-1141-x

Seraphine, A. E. (1994). A power study of three procedures for the assessment of unidimensionality (Doctoral dissertation). University of Illinois at Urbana-Champaign, Champaign, IL.

Seraphine, A. E. (2000). The performance of DIMTEST when latent trait and item difficulty distributions differ. Applied Psychological Measurement, 24(1), 82–94. http://dx.doi.org/10.1177/01466216000241005
Stewart, J., Batty, O. A., & Bovee, N. (2012). Comparing multidimensional continuum models of vocabulary acquisition: An examination of the vocabulary knowledge scale. TESOL Quarterly, 46(4), 695–721. http://dx.doi.org/10.1002/tesq.35

Stocking, M. L., & Eignor, D. R. (1986). The impact of different ability distributions on IRT pre-equating. Retrieved from ERIC database. (ED281864)

Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52(4), 589–617. http://dx.doi.org/10.1007/BF02294821

Stout, W., Nandakumar, R., & Habing, B. (1996). Analysis of latent dimensionality of dichotomously and polytomously scored test data. Behaviormetrika, 23, 37–66. http://dx.doi.org/10.2333/bhmk.23.37

Svetina, D. (2013). Assessing dimensionality of noncompensatory multidimensional item response theory with complex structures. Educational and Psychological Measurement, 73(2), 312–338. http://dx.doi.org/10.1177/0013164412461353

Tan, X., & Gierl, M. J. (2006, April). Evaluating the consistency of DETECT indices and item clusters using simulated and real data that display both simple and complex structure. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Tate, R. (2002). Test dimensionality. In G. Tindal & T. M. Haladyna (Eds.), Large scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 181–211). Mahwah, NJ: Erlbaum.

Tran, U., & Formann, A. (2009). Performance of parallel analysis in retrieving unidimensionality in the presence of binary data. Educational and Psychological Measurement, 69(1), 50–61. http://dx.doi.org/10.1177/0013164408318761

Tucker, L. R., Koopman, R. F., & Linn, R. L. (1969). Evaluation of factor analytic research procedures by means of simulated correlation matrices. Psychometrika, 34(4), 421–459. http://dx.doi.org/10.1007/BF02290601

Wang, M. (1986, April). Fitting a unidimensional model to multidimensional item response data. Paper presented at the ONR Contractors Conference, Gatlinburg, TN.

Way, W. D., Ansley, T. N., & Forsyth, R. A. (1988). The comparative effects of compensatory and noncompensatory two-dimensional data on unidimensional IRT estimates. Applied Psychological Measurement, 12(3), 239–252. http://dx.doi.org/10.1177/014662168801200303

Weng, L., & Cheng, C. (2005). Parallel analysis with unidimensional binary data. Educational and Psychological Measurement, 65(5), 697–716. http://dx.doi.org/10.1177/0013164404273941

Yu, C., & Muthén, B. O. (2002, April). Evaluation of model fit indices for latent variable models with categorical and continuous outcomes. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Zhang, J., & Stout, W. (1999). The theoretical DETECT index of dimensionality and its application to approximate simple structure. Psychometrika, 64(2), 213–249. http://dx.doi.org/10.1007/BF02294536