Multivariate Behavioral Research, 50:632–644, 2015
Copyright © Taylor & Francis Group, LLC
ISSN: 0027-3171 print / 1532-7906 online
DOI: 10.1080/00273171.2015.1070708
Evaluating the Sampling Performance of Exploratory
and Cross-Validated DETECT Procedure with
Imperfect Models
Cengiz Zopluoglu
University of Miami
Among the methods proposed for identifying the number of latent traits in multidimensional
IRT models, DETECT has attracted the attention of both methodologists and applied re-
searchers as a nonparametric counterpart to other procedures. The current study investigated
the overall performance of the DETECT procedure and its outcomes using a real-data sampling
design recommended by MacCallum (2003) and compared the results from a purely simulated
data set that was generated with a well-specified “perfect” model. The comparison revealed
that the sampling behavior of the maximized DETECT value and R-ratio statistics was quite
robust to minor factors and other model misspecifications that potentially exist in the real data
set, as there were negligible differences between the results of the real and simulated data sets.
Item classification accuracy was also nearly identical for the real and simulated data sets. The
accuracy of the identified number of dimensions reported by DETECT was the only outcome
with an obvious difference between the purely simulated data set and the real data set. While
the difference was small for smaller sample sizes, the identified number of dimensions was
more accurate for larger sample sizes when the population data set was purely simulated. In
many instances, exploratory DETECT analysis outperformed the cross-validated DETECT
analysis in terms of overall accuracy.
KEYWORDS DETECT; dimensionality; dimensionality assessment; IRT; item response
theory; number of factors.
Dichotomous items (e.g., true/false items, multiple-choice
items) are typical in educational and psychological assess-
ments, and different statistical models that link observed di-
chotomous outcomes to latent theoretical constructs have
been developed. While these models are extensively used
in modeling dichotomous response data, a challenging early
step is to determine the number of latent traits in the model.
Multiple latent traits can occur in educational and psycho-
logical testing due to either intended or unintended sources.
Correspondence concerning this article should be addressed to Cengiz
Zopluoglu, Department of Educational and Psychological Studies, Univer-
sity of Miami, Max Orovitz Building, 333-A 1570 Levante Ave., Coral
Gables, FL 33146. E-mail: c.zopluoglu@miami.edu
While intended sources of multiple latent traits may be the
planned content structure (e.g., subcomponents of a test such
as algebra, geometry, and probability) or different item for-
mats within a test, unintended sources of multiple latent traits
may be construct-irrelevant abilities (e.g., a reading compo-
nent in a math problem), speed of the test’s administration,
testing day, motivation, or dependencies among a set of items
(Tate, 2002).
Although the number of underlying latent traits can be hy-
pothesized a priori in a confirmatory approach, researchers’
judgments about the number of latent traits may not always
fit well to the item response data due to unintended sources of
variability. An exploratory analysis may be helpful in identi-
fying unintended sources of variability in item response data,
and several standards in Standards for Educational and Psy-
chological Testing (AERA, APA, & NCME, 1999) have been
established to encourage such analysis (e.g., Standards 1.11,
1.12, and 3.17). Dimensionality assessment is a critical pro-
cess that requires extra attention from both test developers
and test users; therefore, it is recommended as “part of a stan-
dard set of analyses conducted after each test administration”
(Ackerman, 2005, p. 24).
Among the methods proposed for determining the num-
ber of latent traits in multidimensional IRT models, Di-
mensionality Evaluation to Enumerate Contributing Traits
(DETECT) has attracted the attention of both methodolo-
gists and applied researchers as a nonparametric counter-
part to other alternatives such as a chi-square test of fit
(e.g., Gessaroli & De Champlain, 1996; Gessaroli, De Cham-
plain, & Folske, 1997; Maydeu-Olivares, 2001; Schilling &
Bock, 2005), model fit indices (e.g., Hu & Bentler, 1999;
Yu & Muthen, 2002; Akaike, 1987), and a Bayesian poste-
rior predictive model check (Levy & Svetina, 2011). Many
methodological research studies have examined the statis-
tical properties of the DETECT procedure under different
conditions (e.g., Gierl, Leighton, & Tan, 2006; Monahan,
Stump, Finch, & Hambleton, 2007; Roussos & Ozbek, 2006;
Tan & Gierl, 2006), and DETECT seems to be increasingly
used and reported in applied research (e.g., Cheng, 2011;
Froelich & Jensen, 2002; Jang & Roussos, 2007; Puranik,
Petscher, & Lonigan, 2013; Stewart, Batty, & Bovee, 2012).
Although these methodological studies have addressed the
performance and effectiveness of the DETECT procedure
from several perspectives, most of them have been simu-
lation studies that have shared a limitation. The previous
simulation studies assumed that the data-generating model
that the item responses followed at the population level was
a well-specified “perfect” model. More research is needed
to understand the statistical properties of the DETECT in-
dices (e.g., maximum DETECT value, R-ratio) and the over-
all performance of the DETECT procedure in determining
the number of latent traits and classification accuracy un-
der misspecified models (“imperfect models”, as termed by
MacCallum, 2003). As MacCallum (2003) stated, a more
relevant and informative study would show “how our meth-
ods perform when the model in question is not correct in
the population.” Therefore, the primary purpose of the cur-
rent study is to contribute to the literature by investigating
the statistical properties of several outcomes of the DETECT
procedure using a real data set with likely nuisance factors
and comparing the results from a purely simulated data set
with no model error.
DIMENSIONALITY ASSESSMENT AND
DETECT PROCEDURE
While dimensionality assessment has a long history in the
exploratory factor analysis literature (under the heading of
“factor retention criteria”), the issue has been primarily ad-
dressed in the IRT literature with a focus on assessing the as-
sumption of unidimensionality (De Champlain & Gessaroli,
1998; Drasgow & Lissak, 1983; Finch & Habing, 2007; Finch
& Monahan, 2008; Froelich, 2000; Hambleton & Rovinelli,
1986; Hattie, 1984; Hattie, Krakowski, Rogers, & Swami-
nathan, 1996; Nandakumar & Stout, 1993; Nandakumar &
Yu, 1996; Seraphine, 1994, 2000; Stout, 1987; Tran & For-
mann, 2009; Weng & Cheng, 2005) because fitting a uni-
dimensional model to multidimensional data may result in
unwarranted inferences about individuals. Until efficient es-
timation algorithms and computer software became available
to practitioners, unidimensional IRT models were commonly
used with the acknowledgment that the educational and psy-
chological data in most instances did not meet the assumption
of unidimensionality. Therefore, the trend of investigating
the direct effects of the multidimensional data structure on
the unidimensional IRT item and person parameter estimates
(Ackerman, 1989; Ansley & Forsyth, 1985; Drasgow & Par-
sons, 1983; Harrison, 1986; Kirisci & Hsu, 1995; Reckase,
1979; Wang, 1986; Way, Ansley, & Forsyth, 1988) and the
indirect effects on the unidimensional IRT applications (e.g.,
Ackerman, 1988; Bolt, 1999; Camilli, Wang, & Fesq, 1995;
Cook, Dorans, Eignor, & Petersen, 1983; Cook, Eignor, &
Taft, 1988; De Ayala, 1992; De Champlain, 1996; Dorans
& Kingston, 1985; Folk & Green, 1989; Lau, 1996; Linn &
Harnisch, 1981; Stocking & Eignor, 1986) appeared in the
IRT research literature in the late 1970s and continued until
the mid-1990s.
Direct effects are observed at the item and person parame-
ter levels. The unidimensional estimates of the model param-
eters in the presence of multidimensionality are a weighted
composite of the underlying traits, and these weights are
primarily a function of the discrimination and difficulty pa-
rameters and the correlations between the latent traits. When
multiple dimensions with major influences exist, a unidimen-
sional analysis is expected to produce an estimate of ability
that is a weighted average of abilities on multiple latent traits.
Therefore, it becomes difficult to interpret the unidimensional
item and person parameter estimates without any reference
to the latent factor structure, and any interpretation should
be made with extreme caution. Indirect effects are observed
for many IRT applications such as test equating, differen-
tial item functioning analysis, and computerized adaptive
testing through inaccurate unidimensional item and person
parameter estimates in the presence of multidimensionality.
For instance, Cook et al. (1983) examined and reported is-
sues of scale drift in test equating when a unidimensional
model was used for multidimensional data, and Dorans and
Kingston (1985) reported that the presence of multidimen-
sionality worsened the symmetry property of test equating
under a unidimensional model.
When the assumption of unidimensionality is not plau-
sible, DETECT is a conditional covariance-based nonpara-
metric method that is proposed to assess the number of latent
traits underlying item response data (Kim, 1994; Zhang &
Stout, 1999). DETECT is based on the optimal partitioning
of a set of items such that the items with positive conditional
covariances are grouped in the same clusters and the items
with negative conditional covariances are grouped in differ-
ent clusters. The goal is to find the partition that maximizes
the DETECT value. The number of clusters in the optimum
partition gives the identified number of major traits underly-
ing the data. Kim (1994) proposed the following quantity for
a prespecified partitioning of a set of items (P):
$$D(P) = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \upsilon_{ij}\, C(i, j \mid \theta), \qquad (1)$$

where θ is a weighted composite of multiple latent abilities, and υ_ij equals 1 if the ith and jth items are in the same cluster and −1 otherwise. D(P) is a weighted sum of the conditional covariances across all item pairs, in which the sign of the weight υ_ij depends on the partition P. The value of D(P) drops if pairs of items with negative conditional covariances are assigned to the same cluster or pairs of items with positive conditional covariances are assigned to different clusters. By contrast, the value of D(P) increases if pairs of items with negative conditional covariances are assigned to different clusters or pairs of items with positive conditional covariances are assigned to the same cluster. C(i, j | θ) is the conditional covariance estimate between the ith and jth items,
defined as
$$C_1 = \sum_{k=0}^{n} \frac{J_k}{N}\, C(i, j \mid S = k), \qquad (2)$$

$$C_2 = \sum_{k=0}^{n} \frac{J_k}{N}\, C(i, j \mid S_{i,j} = k), \quad \text{and} \qquad (3)$$

$$C(i, j \mid \theta) = \frac{C_1 + C_2}{2}, \qquad (4)$$

where S is the total sum score obtained from all items, S_{i,j} is the rest score excluding items i and j, and J_k is the number of students with a score of k. The sum score and rest score serve as a proxy for the weighted composite of multiple latent abilities θ. D(P) is an aggregate measure of pairwise local dependence for the entire test, and the conditional covariance terms should all be zero at the population level if the test is indeed unidimensional. Based on theory, D(P) is maximized for the true partitioning of items when the data are multidimensional.
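As an illustration of Equations (1)–(4), the following Python sketch computes D(P) for a given partition of a dichotomous response matrix. The function names, the MINCELL-style screening of sparse score groups, and the omission of the ×100 scaling applied by some DETECT implementations are assumptions of this sketch rather than features of the original DETECT software.

```python
import numpy as np

def conditional_covariance(x_i, x_j, score, min_cell=2):
    """Weighted average of the within-group covariance of items i and j,
    conditioning on a score; score groups smaller than min_cell are skipped
    (a MINCELL-style screen, assumed here)."""
    total_cov, n_used = 0.0, 0
    for k in np.unique(score):
        mask = score == k
        n_k = int(mask.sum())
        if n_k < min_cell:
            continue
        total_cov += n_k * np.cov(x_i[mask], x_j[mask], bias=True)[0, 1]
        n_used += n_k
    return total_cov / n_used if n_used else 0.0

def cc_estimate(responses, i, j, min_cell=2):
    """C(i, j | theta) as in Equations (2)-(4): average of the covariance
    conditioned on the total score S and on the rest score excluding i and j."""
    total = responses.sum(axis=1)
    rest = total - responses[:, i] - responses[:, j]
    c1 = conditional_covariance(responses[:, i], responses[:, j], total, min_cell)
    c2 = conditional_covariance(responses[:, i], responses[:, j], rest, min_cell)
    return (c1 + c2) / 2.0

def detect_value(responses, partition, min_cell=2):
    """D(P) as in Equation (1) for a partition given as a vector of cluster labels."""
    n_items = responses.shape[1]
    signed_sum = 0.0
    for i in range(n_items - 1):
        for j in range(i + 1, n_items):
            sign = 1.0 if partition[i] == partition[j] else -1.0
            signed_sum += sign * cc_estimate(responses, i, j, min_cell)
    return 2.0 * signed_sum / (n_items * (n_items - 1))
```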
Several cutoff criteria are proposed to evaluate the mag-
nitude of the DETECT index. For instance, Kim (1994)
classified the DETECT indices from 0 to 0.19 as indica-
tors of unidimensionality, 0.20 to 0.39 as indicators of weak
multidimensionality, 0.40 to 0.79 as indicators of moderate
multidimensionality, and above 0.80 as indicators of strong
multidimensionality. Stout, Nandakumar, and Habing (1996)
proposed a slightly different classification by assigning the
intervals (0, 0.10), (0.10, 0.50), (0.50, 1), (1, 1.50), and
above 1.50, respectively, to unidimensionality, weak, moder-
ate, strong, and very strong multidimensionality. Based on a
simulation study, Roussos and Ozbek (2006) recommended
using the intervals (0, 0.20), (0.20, 0.40), (0.40, 1) and above
1 for very weak, weak, moderate, and strong multidimension-
ality, respectively. Although these suggestions are helpful for
estimating the amount of multidimensionality in the data, re-
searchers are primarily interested in finding the number of
traits or correct partitioning of the items based on the latent
traits.
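As a small illustration only, the snippet below maps a maximized DETECT value onto the verbal labels proposed by Roussos and Ozbek (2006); the function name and the handling of interval boundaries are choices of this sketch.

```python
def classify_multidimensionality(d_max):
    """Map a maximized DETECT value to the Roussos and Ozbek (2006) labels."""
    if d_max < 0.20:
        return "very weak multidimensionality (essentially unidimensional)"
    if d_max < 0.40:
        return "weak multidimensionality"
    if d_max < 1.00:
        return "moderate multidimensionality"
    return "strong multidimensionality"

print(classify_multidimensionality(0.186))  # e.g., the estimate reported below for the PISA booklet
```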
In the exploratory framework, the total number of partitions for n elements is equal to a Bell number in mathemat-
ics and increases exponentially as the number of elements
increases. For instance, the number of possible partitions
reaches 115,975 for 10 items, and finding the number of
traits underlying the data becomes an optimization problem
by finding the correct partitioning of n items with the high-
est DETECT value. Kim (1994) originally proposed using
some prior judgments with the help of cluster analysis to
begin, but no solution was given for finding the maximum
DETECT value until a scientifically sound solution was de-
veloped (Zhang & Stout, 1999). Zhang and Stout (1999)
first developed the theoretical justification for DETECT and
then transferred the idea of genetic algorithm from biostatis-
tics to an optimization search of the maximum DETECT
value among all possible partitions of a set of items. In this
optimization process, an informed choice of a partition is
specified by the user (e.g., based on cluster analysis) to start;
then, the genetic algorithm is used to find the optimum par-
titioning that maximizes the DETECT value. The number
of partitions is predicted to be the number of major dimen-
sions underlying the data. In the cross-validation framework,
the data are first divided into training and validation subsets
with user-defined sample sizes. The exploratory analysis is
run on the training data set and optimal partitioning of the
items is obtained. Then, the DETECT value is computed
from the validation data set using the optimal partitioning
of the items that were previously obtained from the training
data set.
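The following sketch mimics the exploratory and cross-validated workflows just described, reusing the hypothetical detect_value helper from the earlier sketch. For clarity it replaces the genetic algorithm with exhaustive enumeration of all partitions, which is feasible only for small item sets.

```python
import numpy as np

def all_partitions(n_items):
    """Yield every partition of items 0..n_items-1 as a vector of cluster
    labels (restricted-growth strings). The count is a Bell number, so this
    brute force stands in for the genetic algorithm only for small tests."""
    def grow(i, labels, n_labels):
        if i == n_items:
            yield labels.copy()
            return
        for lab in range(n_labels + 1):
            labels[i] = lab
            yield from grow(i + 1, labels, max(n_labels, lab + 1))
    yield from grow(0, [0] * n_items, 0)

def exploratory_detect(responses, detect_value):
    """Return the partition with the largest DETECT value and that value."""
    best_p, best_d = None, -np.inf
    for p in all_partitions(responses.shape[1]):
        d = detect_value(responses, p)
        if d > best_d:
            best_p, best_d = p, d
    return best_p, best_d

def cross_validated_detect(responses, detect_value, seed=0):
    """50-50 split: search for the best partition on the training half, then
    recompute the DETECT value for that fixed partition on the validation half."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(responses.shape[0])
    half = len(idx) // 2
    train, valid = responses[idx[:half]], responses[idx[half:]]
    partition, _ = exploratory_detect(train, detect_value)
    return partition, detect_value(valid, partition)
```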
One of the assumptions when deriving the theoretical
justification for the DETECT index is that the items have
an approximate simple structure. In the approximate simple
structure, items are expected to load primarily on one of the
dimensions and to load relatively less on the other dimen-
sions. Zhang and Stout (1999) showed that the ratio of the
maximum DETECT value to the observed DETECT value
can be used to assess whether the assumption of approxi-
mate simple structure holds. The observed DETECT value is
computed by assuming that a set of items is unidimensional.
This ratio ranges from 0 to 1; higher values indicate a simpler
structure, and 0.8 is recommended as the cutoff value for ap-
proximate simple structure (Zhang & Stout, 1999). However,
Tan and Gierl (2006) recommended a more relaxed thresh-
old (0.55 and above). A simulation study found that the ratio
index is not very effective at differentiating between sim-
pler and more complex structures, and it is difficult to find
a cutoff point that applies to all conditions (Finch, Stage, &
Monahan, 2008).
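A rough sketch of the R-ratio, following the description used later in this article (the maximized DETECT value divided by the average absolute conditional covariance across all item pairs); cc_estimate and detect_value are the hypothetical helpers from the earlier sketches, and the exact scaling used by DETECT software may differ.

```python
def r_ratio(responses, best_partition, detect_value, cc_estimate):
    """Maximized DETECT value divided by the 'all pairs positive' benchmark
    based on the average absolute conditional covariance."""
    n_items = responses.shape[1]
    d_max = detect_value(responses, best_partition)
    abs_sum = sum(abs(cc_estimate(responses, i, j))
                  for i in range(n_items - 1)
                  for j in range(i + 1, n_items))
    d_abs = 2.0 * abs_sum / (n_items * (n_items - 1))
    return d_max / d_abs
```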
RESEARCH PROBLEM
In his presidential address to the Society of Multivariate
Experimental Psychology, MacCallum (2003) criticized the
dominating approach in simulation studies as follows:
“Although studies based on this general approach may
provide some interesting information, I would argue that
they are of limited value. Although most Monte Carlo stud-
ies can be criticized for some lack of realism, the approach
just described is especially problematic for one major reason:
It ignores the fact that our models are always wrong to some
degree, even in the population. This approach addresses the
question: How do our methods behave and perform when
the model in question is exactly correct in the population?
Although answers to this question might be of interest for
theorists, they are of only limited value to users of the meth-
ods. A more realistic and relevant question is: How do our
methods behave and perform when the model in question is
not correct in the population? Answers to this question could
be more relevant and informative regarding the performance
of methods in practice” (p. 135)
Although the argument is mainly discussed from the fac-
tor analytic theory perspective, the same argument applies
to IRT models. Researchers who generate data using IRT
models with a known dimensional structure, either unidi-
mensional or multidimensional, always implicitly assume
that the model perfectly holds at the population level. As
highly encouraged by MacCallum (2003), a more relevant
and informative study should incorporate both model error
and sampling error into the simulation process to mimic a
more realistic scenario. According to MacCallum, there are
two ways to design such a study. The first design involves
finding a large, real data set and treating this data set as a pop-
ulation to conduct a sampling study by drawing samples of
the desired sample size from that population. It is expected
that the real data set, with a large sample of observations,
contains some degree of model error and very little sampling
error. The second design recommended by MacCallum in-
volves using the common factor model proposed by Tucker,
Koopman, and Linn (1969), which includes a smaller num-
ber of latent traits with major influence (e.g., 1, 2, 3, 4) and
a large number of latent traits with minor influence (e.g.,
150), when simulating data. While the common factor model
proposed by Tucker et al. was developed in the factor analytic framework, it can easily be adapted to generate data based on a compensatory multidimensional 2PL or 3PL model with
major and minor latent traits.
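A minimal sketch of such a generating model is given below: a compensatory multidimensional 2PL with a few correlated major traits and many weakly loading minor traits. The logistic link, the normal distribution of minor loadings, and all parameter values are illustrative assumptions, not the design used in this study.

```python
import numpy as np

def simulate_mirt_2pl(a_major, d, n_persons, n_minor=150, minor_sd=0.1,
                      trait_corr=None, seed=1):
    """Dichotomous responses from a compensatory multidimensional 2PL with a
    few correlated major traits plus many weakly loading minor traits."""
    rng = np.random.default_rng(seed)
    n_items, n_major = a_major.shape
    if trait_corr is None:
        trait_corr = np.eye(n_major)
    theta_major = rng.multivariate_normal(np.zeros(n_major), trait_corr, n_persons)
    theta_minor = rng.standard_normal((n_persons, n_minor))
    a_minor = rng.normal(0.0, minor_sd, (n_items, n_minor))  # small "model error" loadings
    logit = theta_major @ a_major.T + theta_minor @ a_minor.T + d
    prob = 1.0 / (1.0 + np.exp(-logit))
    return (rng.random((n_persons, n_items)) < prob).astype(int)
```

Setting minor_sd to zero reproduces the "perfect model" case; a nonzero value injects the kind of minor-dimension misspecification that MacCallum's second design calls for.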
Thus far, the methodological studies—except for that by
Svetina (2013), who used a non-compensatory model for
data generation—have shown that the DETECT procedure
provides useful information in certain conditions (e.g., ap-
proximate simple structure, not too high of a correlation
among latent dimensions) when the data-generating struc-
ture is aligned well with the principles on which the DE-
TECT procedure was theorized. However, Svetina (2013)
showed that the DETECT procedure may not perform at an
acceptable level in many instances when the data follows a
non-compensatory model and has a complex factor structure.
In addition, Roussos & Ozbek (2006) conducted the only
study that examined the statistical properties of the DETECT
estimator for multidimensional models, and this study was
limited because the generating model was a well-specified
compensatory MIRT model with a relatively clean structure
and the sample size was extremely large (N = 120,000). They
noted the need to study the statistical properties of the DE-
TECT estimator using real data sets with less clean structures
and smaller sample sizes. Therefore, the aim of the current
study is to investigate the outcomes of the DETECT proce-
dure using a real data set with likely nuisance factors and to
compare the results from an identical design using a simu-
lated data set from a well-specified perfect model identified
from the real data set.
METHOD
Sample and Data Sets
The real data set came from the administration of Booklet
3 to 27,203 students who completed the international PISA
assessment in 2012 (OECD, 2013). The booklet included 57
items in total, with 25 items in the mathematics domain, 14
items in the reading domain, and 18 items in the science do-
main. For the current analysis, 55 items with only dichoto-
mous scores were included and 2 items with partial credit
scores were excluded. The analysis sample included 25,263
students who attempted all 55 items (24 items in mathemat-
ics, 14 items in reading, and 17 items in science). Of these 55
items, 45 items were also embedded in 16 different testlets
(5 in mathematics, 6 in reading, and 5 in science). The item
means ranged from 0.04 to 0.87, with an average of 0.48, for
the mathematics domain; from 0.17 to 0.98, with an average
of .68, for the reading domain; and from 0.16 to 0.87, with
an average of 0.50, for the science domain. This data set
was selected because it had three major dimensions with po-
tential inter-factor correlations, and an approximate simple
structure was expected, although some cross-loadings were
likely to occur due to a potential reading component for the
mathematics and science items and a potential mathematics
component for the science items.
To examine the dimensionality of the PISA data set, an
exploratory DETECT analysis was run by setting the MIN-
CELL option to 2 and the MUTATIONS option to 11. The
MINCELL option indicates the minimum number of exami-
nees required to be present in any one cell when calculating
the conditional covariances, and the MUTATIONS option
indicates the number of vectors mutated in the genetic algo-
rithm when maximizing the DETECT value to find the op-
timal cluster solution. The maximum number of dimensions
to be found was set to 12. The maximized DETECT value
was estimated as 0.186, and the R-ratio was estimated as
TABLE 1
Item Clustering From DETECT Analysis for the PISA 2012 Data Set (Booklet 3)

Mathematics items, Dimension 1: M00FQ01, M273Q01, M408Q01, M420Q01, M446Q01, M446Q02, M447Q01, M464Q01, M559Q01, M800Q01, M828Q01, M828Q02, M828Q03, M903Q03, M923Q01, M923Q03, M923Q04, M924Q02, M995Q01, M995Q02, M995Q03
Mathematics items, Dimension 2: M918Q01, M918Q02, M918Q05
Reading items, Dimension 2: R220Q01, R220Q02B, R220Q04, R432Q01, R432Q05, R432Q06, R446Q03, R446Q06, R456Q01, R456Q02, R456Q06, R466Q02, R466Q03, R466Q06
Science items, Dimension 1: S466Q05
Science items, Dimension 2: S519Q03
Science items, Dimension 3: S269Q01, S269Q03, S269Q04, S408Q01, S408Q03, S408Q04, S408Q05, S466Q01, S466Q07, S519Q02, S521Q02, S521Q06, S527Q01, S527Q03, S527Q04
DETECT: Dimensionality Evaluation to Enumerate Contributing Traits. PISA: Programme for International Student Assessment.
Note. The letters M, S, and R at the beginning of each item label refer to proprietary items on mathematics, science, and reading sections of the PISA 2012
data set (Booklet 3), respectively. Dimensions 1, 2, and 3 identified as a result of DETECT analysis were interpreted as the mathematics, reading, and science
domains in the test, respectively.
0.776. The exploratory DETECT analysis suggested a three-
dimensional solution, as shown in Table 1. Twenty-one out of
24 items in the mathematics domain, 15 out of 17 items in the
science domain, and all reading items were clustered together
in three separate dimensions. Three mathematics items and 1
science item were grouped with the reading items, suggest-
ing a likely reading component for these items, and 1 science
item was grouped with the mathematics items, suggesting a
likely mathematical component for this item. In addition, a
cross-validated DETECT analysis with a 50-50 split was run.
The identical item clustering was obtained with the maxi-
mized DETECT value of 0.175 and the R-ratio estimate of
0.657.
The PISA data set was further analyzed using NOHARM
(Fraser & McDonald, 1988) implementing an unweighted
least squares estimation to fit a polynomial factor model
up to a third degree using Hermite-Chebyshev polynomi-
als to approximate a multidimensional IRT model (Maydeu-
Olivares, 2001; McDonald, 1967). The Promax rotated fac-
tor loading estimates from the NOHARM analysis are re-
ported in Table 2. The estimated factor structure was well
aligned with the results of the DETECT analysis. The 3
mathematics items and 1 science item that were grouped
with the reading items in the DETECT analysis loaded
highest on the reading domain. Similarly, the 1 science
item that was grouped with the mathematics items loaded
highest on the mathematics domain. All of the other items
loaded highest on their respective domains. The correlations
among the achievement domains ranged from 0.697 to 0.769
(Table 3).
In addition to the real data set, a simulated data set was
generated using the common factor model following the same
structure reported in Table 2 and Table 3 with the same
sample size and item means as those in the real data set. The
simulated data set was analyzed using an exploratory and
cross-validated DETECT procedure with a 50-50 split. In
both the exploratory and cross-validated analyses, identical
item clusters were obtained with the maximized DETECT
value of 0.181 and the R-ratio of 0.868 for the exploratory
analysis and the maximized DETECT value of 0.179 and the
R-ratio of 0.836 for the cross-validated analysis. The data
characteristics and major factor structure of the simulated
data set were nearly exactly the same as those of the real
data. The only difference was the lack of minor factors in
the simulated data set such as testlets that existed in the
real data set and other potential model misspecifications that
may arise from a real test administration. In fact, this was
reflected in the R-ratio estimate, which was slightly higher
for the simulated data set compared to the real data set. This
is likely an indication of a simpler structure due to the lack
of minor factors and other model misspecifications.
STUDY DESIGN
The two large data sets, the real data set and simulated data
set, were treated as the population for the sampling study. To
study the performance of the exploratory and cross-validated
DETECT procedures, samples of students with six differ-
ent sample sizes (N = 100, 250, 500, 1,000, 2,500, 5,000)
TABLE 2
Three-Dimensional Solution of the PISA 2012 Data Set (Booklet 3) Using Nonlinear Factor Analysis

Item Label   Item Mean   Math    Reading   Science   (Standardized Factor Loadings)
M00FQ01      0.43        0.476   0.188     0.067
M273Q01      0.49        0.497   0.013     0.034
M408Q01      0.36        0.295   0.209     0.137
M420Q01      0.48        0.381   0.192     0.174
M446Q01      0.62        0.343   0.185     0.177
M446Q02      0.07        0.770   0.203     0.186
M447Q01      0.66        0.399   0.144     0.150
M464Q01      0.24        0.734   0.065     0.094
M559Q01      0.60        0.278   0.122     0.228
M800Q01      0.86        0.284   0.102     0.032
M828Q01      0.25        0.443   0.223     0.106
M828Q02      0.51        0.343   0.213     0.099
M828Q03      0.24        0.417   0.097     0.151
M903Q03      0.29        0.664   0.061     0.014
M918Q01      0.87        0.038   0.527     0.149
M918Q02      0.77        0.342   0.390     0.023
M918Q05      0.76        0.327   0.365     0.045
M923Q01      0.57        0.563   0.044     0.172
M923Q03      0.51        0.639   0.095     0.001
M923Q04      0.17        0.615   0.113     0.247
M924Q02      0.62        0.654   0.093     0.016
M995Q01      0.58        0.672   0.061     0.072
M995Q02      0.04        0.691   0.206     0.094
M995Q03      0.46        0.485   0.179     0.052
R220Q01      0.43        0.123   0.327     0.292
R220Q02      0.68        0.048   0.417     0.233
R220Q04      0.61        0.050   0.379     0.160
R432Q01      0.90        0.118   0.781     0.118
R432Q05      0.78        0.009   0.657     0.084
R432Q06      0.17        0.076   0.347     0.163
R446Q03      0.95        0.152   0.792     0.001
R446Q06      0.82        0.006   0.649     0.019
R456Q01      0.98        0.167   0.628     0.023
R456Q02      0.83        0.103   0.575     0.060
R456Q06      0.83        0.057   0.622     0.039
R466Q02      0.48        0.170   0.425     0.064
R466Q03      0.17        0.106   0.264     0.185
R466Q06      0.85        0.013   0.759     0.066
S269Q01      0.58        0.085   0.170     0.487
S269Q03      0.45        0.190   0.041     0.539
S269Q04      0.35        0.142   0.240     0.638
S408Q01      0.61        0.070   0.028     0.475
S408Q03      0.29        0.020   0.124     0.376
S408Q04      0.53        0.083   0.093     0.368
S408Q05      0.41        0.097   0.023     0.462
S466Q01      0.71        0.104   0.232     0.488
S466Q05      0.52        0.261   0.021     0.236
S466Q07      0.70        0.100   0.187     0.379
S519Q02      0.54        0.088   0.122     0.389
S519Q03      0.26        0.015   0.240     0.238
S521Q02      0.53        0.071   0.143     0.448
S521Q06      0.87        0.002   0.305     0.458
S527Q01      0.16        0.025   0.101     0.507
S527Q03      0.56        0.025   0.010     0.438
S527Q04      0.53        0.095   0.022     0.558
Note. The highest factor loading in a row is bolded. The letters M, S, and R at the beginning of each item label refer to proprietary items on mathematics,
science, and reading sections of the PISA 2012 data set (Booklet 3), respectively.
were repeatedly drawn 10,000 times from each large data set.
After drawing the random samples, the exploratory and cross-
validated DETECT analyses were run for each of the 120,000
sample data sets in the same way that the population data
sets were analyzed, and the results from the sample data sets
were treated as sample estimates for the DETECT outcomes.
There were 24 different conditions in a 2 × 6 × 2 factorial
design, where type of data set (2; real vs. simulated) and
sample size (6; 100, 250, 500, 1,000, 2,500, or 5,000) served
as between-subjects factors, and type of DETECT analysis
TABLE 3
Correlation Matrix Between Three Dimensions After
Promax Rotation
Mathematics Reading Science
Mathematics 1
Reading 0.697 1
Science 0.769 0.736 1
(2; exploratory analysis, and cross-validated analysis with a
50-50 split) served as a within-subjects factor. Each cell had
10,000 replications.
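Schematically, the sampling design can be expressed as the loop below; run_exploratory and run_cross_validated are placeholders for whatever DETECT interface is available, and drawing samples without replacement is an assumption of this sketch.

```python
import numpy as np

def sampling_study(population, sample_sizes, n_reps, run_exploratory,
                   run_cross_validated, seed=2):
    """Draw repeated samples of each size from the population response matrix
    and collect both DETECT analyses for every replication."""
    rng = np.random.default_rng(seed)
    results = []
    for n in sample_sizes:
        for rep in range(n_reps):
            rows = rng.choice(population.shape[0], size=n, replace=False)
            sample = population[rows]
            results.append(("E", n, rep, run_exploratory(sample)))
            results.append(("CV", n, rep, run_cross_validated(sample)))
    return results
```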
OUTCOMES OF INTEREST
Several outcomes of the DETECT analyses included in previ-
ous research were also of interest in the current study. These
outcomes were the maximized DETECT value, R-ratio, iden-
tified number of dimensions, and item cluster assignments.
The statistical properties of the maximized DETECT value
have been examined in only two studies (Monahan et al.,
2007; Roussos & Ozbek, 2006). The study by Roussos and
Ozbek (2006) was the only study to examine the statistical
properties of the DETECT estimator for compensatory mul-
tidimensional models with an extremely large sample size
(N = 120,000) and found that there was bias for short tests
with fewer than 20 items. The bias was relatively small and
negligible for tests with more than 20 items. However, the
researchers noted the need for similar studies with real data
and realistic sample sizes. The second outcome of interest
was the R-ratio, the ratio of the maximum DETECT value
to the average of the absolute conditional covariances across
all item pairs. Although previous research has included the
R-ratio as an outcome of interest, no work has investigated
the sampling distribution of the R-ratio. The third outcome
of interest was the number of identified latent dimensions by
the DETECT analysis. Conceptually, DETECT is expected
to identify major dimensions; however, in some cases, it puts
only one or two items in a cluster and counts the cluster
as a separate dimension, which does not necessarily make
sense in practice. Thus, in the current study, the “number
of dimensions” was considered the number of clusters in-
cluding at least 3 items. The final outcome of interest was
the accuracy of the item cluster assignment, measured by the
matching similarity coefficient (MS coefficient) described by
Mroch and Bolt (2006). The MS coefficient was computed
using a 2 × 2 table for each replication. Let a, b, c, and d be
the elements of the 2 × 2 table, where
a is the proportion of item pairs correctly classified in the
same cluster,
b is the proportion of item pairs incorrectly classified in
the same cluster,
c is the proportion of item pairs incorrectly classified in
separate clusters, and
d is the proportion of item pairs correctly classified in
separate clusters,
while the item cluster assignment as a result of the DETECT
analysis for the population data sets (in Table 1) was treated
as a reference. Then, the MS coefficient is equal to (a + d),
the proportion of correctly classified item pairs.
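A sketch of these two outcomes in Python, assuming partitions are coded as vectors of cluster labels; the three-item threshold follows the rule stated above, and the MS coefficient is computed directly as the proportion of item pairs classified the same way as in the reference partition.

```python
from collections import Counter

def n_major_dimensions(partition, min_items=3):
    """Number of clusters containing at least min_items items."""
    return sum(1 for size in Counter(partition).values() if size >= min_items)

def matching_similarity(partition, reference):
    """Proportion of item pairs classified the same way (same vs. separate
    clusters) as in the reference (population-level) partition: a + d."""
    n = len(partition)
    agree = total = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            same_est = partition[i] == partition[j]
            same_ref = reference[i] == reference[j]
            agree += int(same_est == same_ref)
            total += 1
    return agree / total
```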
ANALYSIS
For the outcome measures described above, I treated the val-
ues obtained from the original data sets with a large sample
size (N = 25,263) as the population parameters, and the val-
ues obtained from the sample data sets with varying sample
sizes were treated as the corresponding sample estimates. An
exploratory DETECT analysis was run by setting the MIN-
CELL option to two and the MUTATIONS option to 11 (20%
of the total number of items). The maximum number of di-
mensions to be found was set to 12. For the cross-validated
DETECT analysis, the same settings were in place, and 50%
of the data were used for training and 50% of the data were
used for validation. The maximum DETECT value, R-ratio,
number of identified dimensions, and item clusters from each
replication were stored for further analysis.
RESULTS
Maximized DETECT Value
For each condition, the average maximized DETECT value
was computed across 10,000 replications and compared to
the corresponding population parameter. The bias across all
conditions is presented in Figure 1. The exploratory DE-
TECT analysis yielded maximized DETECT values with
larger positive bias for sample sizes of less than 1,000, but
the bias was negligible when the sample size was 1,000 or
larger. The bias had a maximal value of approximately 0.2,
0.1, and 0.05 when the sample sizes were 100, 250, and 500,
respectively. The cross-validated DETECT analysis with the
50-50 split showed a relatively smaller bias, and it was in
the opposite direction of the exploratory DETECT analysis.
The bias was around 0.05 for sample sizes of 1,000 or less
and very close to zero for larger sample sizes. The differ-
ence in bias between the purely simulated data sets and the
real data sets was negligible. In addition to bias, the stan-
dard deviation of the empirical sampling distributions for
the maximized DETECT value was computed and examined
across all conditions, and the empirical standard errors are
presented in Figure 2. The exploratory DETECT analysis
showed smaller sampling variability across 10,000 replica-
tions in all conditions; however, the differences between the
exploratory and cross-validated DETECT analyses dimin-
ished as the sample size increased. Similarly, there was a
negligible difference between the purely simulated data sets
and real data sets. The root mean squared error (RMSE) is
equal to $\sqrt{\mathrm{Bias}^2 + \mathrm{SE}^2}$ and is a measure of overall accu-
racy that accounts for both bias and standard error. Figure
3 presents the results for RMSE across all conditions. The
figure reveals that the maximized DETECT value from the
cross-validated analysis with the 50-50 split provided the
most accurate estimates for sample sizes smaller than 250,
and the maximized DETECT value from the exploratory
FIGURE 1 Bias in maximized DETECT value for different levels of
sample size, data set, and analysis.
FIGURE 2 Standard error of maximized DETECT value for different
levels of sample size, data set, and analysis.
analysis provided the most accurate estimates for larger sam-
ple sizes. The results indicate a trade-off between bias and
standard error for the maximized DETECT value when the
exploratory or cross-validated analysis with the 50-50 split
was used. The cross-validated analysis with the 50-50 split
provided a nearly unbiased maximized DETECT value in
most conditions, whereas the exploratory DETECT analy-
sis provided a more positively biased maximized DETECT
value, particularly for small sample sizes. However, the stan-
dard error of the maximized DETECT value was smaller
in the exploratory analysis compared to the cross-validated
analysis. When bias and standard error were combined, the
cross-validated analysis with the 50-50 split provided more
accurate results for sample sizes of 100 and 250, but the ex-
ploratory analysis outperformed the cross-validated analysis
in terms of overall accuracy for larger sample sizes.
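For reference, the three summary measures used here can be computed from a vector of replicate estimates as in the sketch below (an illustration, not the study's code).

```python
import numpy as np

def bias_se_rmse(estimates, population_value):
    """Bias = mean(estimate) - parameter; SE = SD of the estimates across
    replications; RMSE = sqrt(Bias^2 + SE^2)."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - population_value
    se = estimates.std(ddof=1)
    return bias, se, float(np.sqrt(bias ** 2 + se ** 2))
```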
R-ratio
Similar to the maximized DETECT value, the average R-
ratio, the empirical standard error, and the RMSE across
FIGURE 3 Root mean square error of maximized DETECT value for
different levels of sample size, data set, and analysis.
FIGURE 4 Bias in R-ratio estimate for different levels of sample size,
data set, and analysis.
10,000 replications for all conditions are presented in Fig-
ures 4, 5, and 6. The results revealed a negative bias in R-ratio
statistics obtained from the exploratory and cross-validated
DETECT analyses. The exploratory DETECT analysis pro-
vided less biased results than the cross-validated DETECT
analysis, and the difference between the simulated data sets
and real data sets was negligible. Although the negative
bias decreased as the sample size increased, it was approxi-
mately 0.1 and 0.2 even for sample sizes larger than 2,500.
In terms of overall accuracy measured by the RMSE, the
exploratory DETECT analysis provided the most accurate
R-ratio estimates, but there was still a significant amount
of error: approximately 0.4 for the sample sizes of 500
or less and approximately 0.1 for a sample size of 5,000.
The consequence of negative bias would be a misinterpre-
tation such that a data set would display a more complex
structure than it truly had. Therefore, if the R-ratio statistic
is used in applied research, small R-ratio statistics should
be cautiously interpreted, particularly for small sample
sizes.
FIGURE 5 Standard error of R-ratio estimate for different levels of sample
size, data set, and analysis.
FIGURE 6 Root mean square error of R-ratio estimate for different levels
of sample size, data set, and analysis.
Number of Identified Dimensions
Recall that both the exploratory and cross-validated DE-
TECT analyses identified three major dimensions for both
the real and simulated large data sets treated as popula-
tions. The percentage of replications out of 10,000 for the
identified number of dimensions under repeated sampling
and the overall mean and standard deviation are reported in
Table 4. When the analysis was exploratory, the number of
major dimensions was more accurately identified for the sim-
ulated data sets. The difference between the real and simu-
lated data sets was small when the sample size was 100
and 250, where the percentage of correct decisions was ap-
proximately 40%, but noticeable for larger sample sizes. For
both the real and simulated data sets, the exploratory DE-
TECT analysis tended to identify four dimensions for most
replications when an incorrect decision was made. When
the analysis was cross-validated, there was a similar pattern.
The difference between the real and simulated data sets was
small for sample sizes smaller than 1,000, but the results were
noticeably more accurate for simulated data sets for larger
sample sizes. Overall, the accuracy in identifying the number
of major dimensions was slightly better and the variability in
the identified number of dimensions was smaller when the
exploratory analysis was used.
Item Cluster Assignment
The MS coefficient indicates the proportion of correctly clas-
sified item pairs with respect to the item clustering based on
the DETECT analysis for the population data sets, which
was reported in Table 1. Note that both the exploratory and
cross-validated DETECT analyses indicated the same item
clustering for the real and simulated population data sets.
There were 1,485 item pairs for the 55-item test. The number
of pairs correctly classified as in the population data sets was
TABLE 4
The Percentage of Replications for the Number of Identified Dimensions Across Conditions

                                              Number of Identified Dimensions
Type of Data   Sample Size   Analysis    1     2      3      4      5      6     7     8     Mean   SD
Real           100           E           –     5.4    40.6   37.8   13.5   2.4   0.3   –     3.68   0.88
Real           250           E           –     1.7    38.3   44.2   14.3   1.5   –     –     3.76   0.78
Real           500           E           –     0.5    39.7   46.9   12.0   0.9   –     –     3.73   0.71
Real           1,000         E           –     –      55.3   38.7   5.7    0.3   –     –     3.51   0.62
Real           2,500         E           –     –      85.2   14.5   0.3    –     –     –     3.15   0.37
Real           5,000         E           –     –      97.3   2.7    –      –     –     –     3.03   0.16
Simulated      100           E           –     5.9    41.0   37.8   13.0   2.1   0.2   –     3.65   0.87
Simulated      250           E           –     1.7    40.7   43.6   12.5   1.5   –     –     3.71   0.76
Simulated      500           E           –     0.8    52.4   39.8   6.7    0.3   –     –     3.53   0.65
Simulated      1,000         E           –     0.1    76.8   22.3   0.8    –     –     –     3.24   0.45
Simulated      2,500         E           –     –      97.3   2.7    –      –     –     –     3.03   0.16
Simulated      5,000         E           –     –      99.8   0.2    –      –     –     –     3.00   0.05
Real           100           CV          –     0.2    18.6   47.6   26.5   6.2   0.8   0.1   4.23   0.86
Real           250           CV          –     7.3    44.2   35.8   10.6   1.9   0.2   –     3.56   0.87
Real           500           CV          –     2.9    42.1   41.4   12.1   1.4   0.1   –     3.67   0.78
Real           1,000         CV          –     1.0    39.3   45.6   12.7   1.3   0.1   –     3.74   0.74
Real           2,500         CV          –     0.1    49.3   43.0   7.3    0.3   –     –     3.58   0.64
Real           5,000         CV          –     –      69.5   28.7   1.8    –     –     –     3.32   0.51
Simulated      100           CV          0.1   18.1   47.3   27.4   6.2    0.9   0.1   –     3.24   0.86
Simulated      250           CV          –     7.5    44.7   35.2   10.8   1.6   0.2   –     3.55   0.86
Simulated      500           CV          –     3.4    43.0   40.3   11.7   1.6   0.1   –     3.65   0.80
Simulated      1,000         CV          –     1.4    44.9   42.6   10.3   0.7   –     –     3.64   0.71
Simulated      2,500         CV          –     0.3    67.8   29.6   2.2    0.1   –     –     3.34   0.53
Simulated      5,000         CV          –     0.1    89.5   10.3   0.1    –     –     –     3.10   0.31
Note. E: exploratory DETECT analysis; CV: cross-validated DETECT analysis.
FIGURE 7 Accuracy of item cluster assignment for different levels of
sample size, data set, and analysis.
counted for each replication, and the proportion was com-
puted. Therefore, the MS coefficient in the current study is an
indicator of how well the item clustering at the sample level
agreed with the item clustering at the population level. The
average agreement across 10,000 replications within each
condition is presented in Figure 7. The exploratory DETECT
analysis performed better than the cross-validated analysis
with a gradually increasing average agreement ranging from
approximately 60% to approximately 95% as the sample size
increased from 100 to 5,000. A similar pattern in the increase
in average agreement from approximately 55% to approxi-
mately 90% was observed as the sample size increased from
100 to 5,000 for the cross-validated DETECT analysis. The
difference between the real and simulated data sets in terms
of item clustering accuracy was negligible.
SUMMARY AND DISCUSSION
The current study extends the findings of previous research
regarding the several outcomes obtained from the DETECT
procedure by implementing a sampling study with realistic
sample sizes using real and simulated data sets with a sim-
ilar major multidimensional latent structure. The well-known motto that “all models are wrong, but some of them are useful” (Box & Draper, 1987, p. 424) is often invoked to
acknowledge model misspecification or model error in prac-
tice, but it is rarely integrated into the research design in
most simulation studies, particularly on dimensionality as-
sessment. The real data set used in the current study had a
multidimensional structure due to its three major underlying
domains (math, science, and reading) and other minor fac-
tors that may potentially arise from a real-world application.
The purely simulated data set based on a similar major factor
structure with no minor factors allowed for a comparison of
the results and an observation of the potential implications of
using perfect models in simulation studies, which has been
criticized (MacCallum, 2003). The comparison revealed that
the sampling behavior of the maximized DETECT value and
R-ratio statistics was quite robust to minor factors and other
model misspecifications that potentially exist in the real data
set, as there were negligible differences between the results
of the real and simulated data sets. The item classification
accuracy was also nearly identical for the real and simulated
data sets. The accuracy of the identified number of dimen-
sions reported by DETECT was the only outcome for which
the difference between using a purely simulated data set and
real data set became obvious. Although the difference was
small for smaller sample sizes, the decisions were more ac-
curate for larger sample sizes when the population data set
was purely simulated.
One of the interesting findings of the current study is the
negative bias observed in the R-ratio statistics. The statistical
properties of the R-ratio index have never been systemati-
cally examined, although the issues of using the R-ratio to
interpret the simplicity of the factor structure have been pre-
viously addressed (Finch & Monahan, 2008; Tan & Gierl,
2006). Further analysis of this bias revealed that the R-ratio
is a ratio of two biased estimators, with the estimator in
the denominator having a larger positive bias in magnitude
than that in the numerator. Future research should attempt to
eliminate the bias from the components of R-ratio statistics
before its use is emphasized in practice. A recent study by
Nandakumar, Yu, and Zhang (2011) made such an attempt
but was not very successful.
In many instances, the exploratory DETECT analysis out-
performed the cross-validated DETECT analysis in terms
of overall accuracy. This finding may seem a bit counter-
intuitive to some readers because many researchers have
been advised to use cross-validation when possible. It can
be speculated that this counter-intuitive finding is due to the
reduced amount of information used for the cross-validated
DETECT analysis. A cross-validated DETECT analysis with
a 50-50 split, as implemented in the current study, uses
one half of the sample to identify the number of dimen-
sions and item clustering and the other half to compute
the maximized DETECT value and R-ratio statistics based
on the previously identified item clustering. For the maxi-
mized DETECT value, this eliminates bias at the expense
of increasing estimation error due to the reduced amount
of information. When the gain due to the smaller bias does
not overcome the loss due to the increased estimation error,
an exploratory DETECT analysis yields more accurate re-
sults overall. Similarly, the cross-validated analysis with the
50-50 split uses half of the sample to identify the number
of dimensions and item clustering, whereas the exploratory
analysis uses the whole sample. Therefore, it should not be
surprising that the exploratory analysis provided more accu-
rate results in terms of finding the major dimensions and item
clustering.
While the current study increased our understanding of
possible issues in interpreting DETECT output by extending
the findings to more realistic data sets with possibly un-
derlying complex imperfect “true” models, the findings are
also limited to the specific data sets used in the study. Fu-
ture research using similar real-data sampling designs with
different types of data sets would enhance and extend our
understanding of potential issues related to DETECT output
and provide practitioners with greater insights.
ARTICLE INFORMATION
Conflict of Interest Disclosures: The author signed a form
for disclosure of potential conflicts of interest. No author
reported any financial or other conflicts of interest in relation
to the work described.
Ethical Principles: The author affirms having followed pro-
fessional ethical guidelines in preparing this work. These
guidelines include obtaining informed consent from human
participants, maintaining ethical treatment and respect for
the rights of human or animal participants, and ensuring the
privacy of participants and their data, such as ensuring that
individual participants cannot be identified in reported results
or from publicly available original or archival data.
Funding: There are no funders to report for this submission.
Role of the Funders/Sponsors: None of the funders or spon-
sors of this research had any role in the design and conduct
of the study; collection, management, analysis, and inter-
pretation of data; preparation, review, or approval of the
manuscript; or decision to submit the manuscript for pub-
lication.
Acknowledgements: The author would like to thank the
editors, Keith F. Widaman and Stephen G. West, and three
anonymous reviewers for their comments on prior versions of
this manuscript. The ideas and opinions expressed herein are
those of the author alone, and endorsement by the author’s
institution is not intended and should not be inferred.
REFERENCES
Ackerman, T. A. (1988). An explanation of differential item functioning
from a multidimensional perspective. Retrieved from ERIC database.
(ED306281)
Ackerman, T. A. (1989). Unidimensional IRT calibration of compensatory
and noncompensatory multidimensional items. Applied Psychological
Measurement, 13 (2), 113–127. http://dx.doi.org/10.1177/014662168901
300201
Ackerman, T. A. (2005). Multidimensional item response theory modeling.
In A. Maydeu-Olivares & J. J. McArdle (Eds.), Contemporary Psycho-
metrics (pp. 3–25). Mahwah, NJ: Erlbaum.
Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52(3), 317–332.
http://dx.doi.org/10.1007/BF02294359
American Educational Research Association, American Psychological As-
sociation, National Council on Measurement in Education, Joint Com-
mittee on Standards for Educational, & Psychological Testing. (1999).
Standards for educational and psychological testing. Washington, DC:
APA.
Ansley, T., & Forsyth, R. (1985). An examination of the character-
istics of unidimensional IRT parameter estimates derived from two-
dimensional data. Applied Psychological Measurement, 9(1), 37–48.
http://dx.doi.org/10.1177/014662168500900104
Bolt, D. M. (1999). Evaluating the effects of multidimensionality on IRT
true score equating. Applied Measurement in Education, 12(4), 383–407.
http://dx.doi.org/10.1207/S15324818AME1204_4
Box, G. E., & Draper, N. R. (1987). Empirical model-building and response
surfaces. New York: Wiley.
Camilli, G., Wang, M. M, & Fesq, J. (1995). The effects of dimension-
ality on equating the Law School Admission Test. Journal of Edu-
cational Measurement, 32(1), 79–96. http://dx.doi.org/10.1111/j.1745-
3984.1995.tb00457.x
Cheng, W. (2011). Examining the dimensionality of early numeracy skill
measures (Unpublished doctoral dissertation). Pennsylvania State Uni-
versity, State College, PA.
Cook, L. L., Dorans, N. J., Eignor, D. R., & Petersen, N. S. (1983). An as-
sessment of the relationship between the assumption of unidimensionality
and the quality of IRT true-score equating. Paper presented at the annual
meeting of the American Educational Research Association, Montreal,
Canada.
Cook, L. L., Eignor, D. R., & Taft, H. L. (1988). A comparative study of the
effects of recency of instruction on the stability of IRT and conventional
item parameter estimates. Journal of Educational Measurement, 25(1),
31–45. http://dx.doi.org/10.1111/j.1745-3984.1988.tb00289.x
De Ayala, R. J. (1992). The influence of dimensionality on CAT ability es-
timation. Educational and Psychological Measurement, 52 (3), 513–527.
http://dx.doi.org/10.1177/0013164492052003002
De Champlain, A. F. (1996). The effect of multidimensionality on IRT
true score equating for subgroups of examinees. Journal of Educa-
tional Measurement, 33(2), 181–201. http://dx.doi.org/10.1111/j.1745-
3984.1996.tb00488.x
De Champlain, A., & Gessaroli, M. (1998). Assessing the dimension-
ality of item response matrices with small sample sizes and short
test lengths. Applied Measurement in Education, 11(3), 231–253.
http://dx.doi.org/10.1207/s15324818ame1103_2
Dorans, N. J., & Kingston, N. M. (1985). The effects of violations of uni-
dimensionality on the estimation of item and ability parameters and on
item response theory equating of the GRE verbal scale. Journal of Educa-
tional Measurement, 22 (4), 249–262. http://dx.doi.org/10.1111/j.1745-
3984.1985.tb01062.x
Drasgow, F., & Lissak, R. I. (1983). Modified parallel analysis: A procedure
for examining the latent dimensionality of dichotomously scored item
responses. Journal of Applied Psychology, 68 (3), 363–373. http://
dx.doi.org/10.1037/0021-9010.68.3.363
Drasgow, F., & Parsons, C. K. (1983). Application of unidimensional
item response theory models to multidimensional data. Applied
Psychological Measurement, 7, 189–199. http://dx.doi.org/10.1177/
014662168300700207
Finch, H., & Habing, B. (2007). Performance of DIMTEST-and NOHARM-
based statistics for testing unidimensionality. Applied Psychological Mea-
surement, 31(4), 292–307. http://dx.doi.org/10.1177/0146621606294490
Finch, H., & Monahan, P. (2008). A bootstrap generalization of modified
parallel analysis for IRT dimensionality asssessment. Applied
Measurement in Education, 21(2), 119–140. http://dx.doi.org/10.1080/
08957340801926102
Finch, H., Stage, A. K., & Monahan, P. (2008). Comparison of factor sim-
plicity indices for dichotomous data: DETECT R, Bentler’s simplicity
index, and the loading simplicity index. Applied Measurement in Educa-
tion, 21, 41–64. http://dx.doi.org/10.1080/08957340701796365
Fraser, C., & McDonald, R. P. (1988). NOHARM: Least squares
item factor analysis. Multivariate Behavioral Research, 23, 267–269.
http://dx.doi.org/10.1207/s15327906mbr2302_9
Folk, V. G., & Green, B. F. (1989). Adaptive estimation when the unidimen-
sionality assumption of IRT is violated. Applied Psychological Measure-
ment, 4, 373–389. http://dx.doi.org/10.1177/014662168901300404
Froelich, A. G. (2000). Assessing unidimensionality of test items and some
asymptotics of parametric item response theory. (Doctoral dissertation).
University of Illinois at Urbana-Champaign, Champaign, IL.
Froelich, A. G., & Jensen, H. H. (2002). Dimensionality of the
USDA food security index. Retrieved from http://www.public.
iastate.edu/amyf/technicalreports/usdadimensionality.pdf
Gessaroli, M. E., De Champlain, A. F., & Folske, J. C. (1997, March).
Assessing dimensionality using a likelihood-ratio chi-square test based
on a nonlinear factor analysis of item response data. Paper presented at
the annual meeting of the National Council on Measurement in Education,
Chicago, IL.
Gierl, M. J., Leighton, J. P., & Tan, X. (2006). Evaluating DE-
TECT classification accuracy and consistency when data display com-
plex structure. Journal of Educational Measurement, 43(3), 265–289.
http://dx.doi.org/10.1111/j.1745-3984.2006.00016.x
Hambleton, R., & Rovinelli, R. (1986). Assessing the dimensionality of a
set of test items. Applied Psychological Measurement, 10(3), 287–302.
http://dx.doi.org/10.1177/014662168601000307
Harrison, D. (1986). Robustness of IRT parameter estimation to violations
of the unidimensionality assumption. Journal of Educational Statistics,
11(2), 91–115. http://dx.doi.org/10.2307/1164972
Hattie, J. (1984). An empirical study of various indices for determin-
ing unidimensionality. Multivariate Behavioral Research, 19(1), 49–78.
http://dx.doi.org/10.1207/s15327906mbr1901_3
Hattie, J., Krakowski, K., Rogers, J., & Swaminathan, H. (1996). An
assessment of Stout’s index of essential unidimensionality. Applied
Psychological Measurement, 20(1), 1–14. http://dx.doi.org/10.1177/
014662169602000101
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in
covariance structure analysis: Conventional criteria versus new
alternatives. Structural Equation Modeling, 6, 1–55. http://dx.doi.org/
10.1080/10705519909540118
Jang, E. E., & Roussos, L. (2007). An investigation into the dimen-
sionality of TOEFL using conditional covariance-based nonparamet-
ric approach. Journal of Educational Measurement, 44(1), 1–21.
http://dx.doi.org/10.1111/j.1745-3984.2007.00024.x
Kim, H. R. (1994). New techniques for the dimensionality assessment of
standardized test data. (Doctoral dissertation). University of Illinois at
Urbana-Champaign, Urbana, IL.
Kirisci, L., & Hsu, T. (1995). The robustness of BILOG to violations of the
assumption of unidimensionality of test items and normality of ability dis-
tributions. Paper presented at the annual meeting of the National Council
on Measurement in Education, San Francisco, CA.
Lau, C. M. A. (1997). Robustness of a unidimensional computerized mas-
tery testing procedure with multidimensional testing data. (Unpublished
doctoral dissertation). University of Iowa, Iowa City, IA.
Levy, R., & Svetina, D. (2011). A generalized dimensionality dis-
crepancy measure for dimensionality assessment in multidimensional
item response theory. British Journal of Mathematical and Statisti-
cal Psychology, 64, 208–232. http://dx.doi.org/10.1348/000711010X500483
Linn, R. L., & Harnisch, D. L. (1981). Interactions between item content
and group membership on achievement test items. Journal of Educa-
tional Measurement, 18(2), 109–118. http://dx.doi.org/10.1111/j.1745-
3984.1981.tb00846.x
MacCallum, R. C. (2003). Working with imperfect models. Multivariate
Behavioral Research, 38(1), 113–139. http://dx.doi.org/10.1207/
S15327906MBR3801_5
Maydeu-Olivares, A. (2001). Multidimensional item response theory
modeling of binary data: Large sample properties of NOHARM
estimates. Journal of Educational and Behavioral Statistics, 26(1), 51–71.
http://dx.doi.org/10.3102/10769986026001051
McDonald, R. P. (1967). Nonlinear factor analysis (Psychometric Mono-
graphs No. 15). Richmond, VA: Psychometric Corporation. Retrieved
from http://www.psychometrika.org/journal/online/MN15.pdf
Monahan, P. O., Stump, T. E., Finch, H., & Hambleton, R. K. (2007).
Bias of exploratory and cross-validated DETECT index under uni-
dimensionality. Applied Psychological Measurement, 31(6), 483–503.
http://dx.doi.org/10.1177/0146621606292216
Mroch, A. A., & Bolt, D. M. (2006). A simulation comparison of
parametric and nonparametric dimensionality detection procedures.
Applied Measurement in Education, 19(1), 67–91. http://dx.doi.org/
10.1207/s15324818ame1901_4
Nandakumar, R., & Stout, W. (1993). Refinements of Stout’s procedure for
assessing latent trait unidimensionality. Journal of Educational Statistics,
18(1), 41–68. http://dx.doi.org/10.2307/1165182
Nandakumar, R., & Yu, F. (1996). Empirical validation of DIMTEST on
nonnormal ability distributions. Journal of Educational Measurement,
33(3), 355–368. http://dx.doi.org/10.1111/j.1745-3984.1996.tb00497.x
Nandakumar, R., Yu, F., & Zhang, Y. (2011). A comparison of bias correction
adjustments for the DETECT procedure. Applied Psychological Measure-
ment, 35(2), 127–144. http://dx.doi.org/10.1177/0146621610376767
OECD (2013). PISA 2012 assessment and analytical framework: Mathe-
matics, reading, science, problem solving and financial literacy. Paris,
France: OECD Publishing. http://dx.doi.org/10.1787/9789264190511-en
Puranik, C. S., Petscher, Y., & Lonigan, C. J. (2013). Dimen-
sionality and reliability of letter writing in 3- to 5-year-old
preschool children. Learning and Individual Differences, 28, 133–141.
http://dx.doi.org/10.1016/j.lindif.2012.06.011
Reckase, M. (1979). Unifactor latent trait models applied to multifactor tests:
Results and implications. Journal of Educational Statistics, 4, 207–230.
http://dx.doi.org/10.2307/1164671
Roussos, L. A., & Ozbek, O. Y. (2006). Formulation of the DETECT popula-
tion parameter and evaluation of DETECT estimator bias. Journal of Edu-
cational Measurement, 43(3), 215–243. http://dx.doi.org/10.1111/j.1745-
3984.2006.00014.x
Schilling, S., & Bock, R. D. (2005). High-dimensional maximum
marginal likelihood item factor analysis by adaptive quadra-
ture. Psychometrika, 70, 533–555. http://dx.doi.org/10.1007/s11336-003-
1141-x
Seraphine, A. E. (1994). A power study of three procedures for the assess-
ment of unidimensionality. (Doctoral dissertation). University of Illinois
at Urbana-Champaign, Champaign, IL.
Seraphine, A. E. (2000). The performance of DIMTEST when latent
trait and item difficulty distributions differ. Applied Psychological
Measurement, 24(1), 82–94. http://dx.doi.org/10.1177/01466216000241005
Stewart, J., Batty, O. A., & Bovee, N. (2012). Comparing multidimen-
sional continuum models of vocabulary acquisition: An examination
of the vocabulary knowledge scale. TESOL Quarterly, 46(4), 695–721.
http://dx.doi.org/10.1002/tesq.35
Stocking, M. L., & Eignor, D. R. (1986). The impact of different ability distri-
butions on IRT preequating. Retrieved from ERIC database. (ED281864).
Stout, W. (1987). A nonparametric approach for assessing latent trait
unidimensionality. Psychometrika, 52(4), 589–617. http://dx.doi.org/
10.1007/BF02294821
Stout, W., Nandakumar, R., & Habing, B. (1996). Analysis of la-
tent dimensionality of dichotomously and polytomously scored test
data. Behaviormetrika, 23, 37–66. http://dx.doi.org/10.2333/bhmk.
23.37
Svetina, D. (2013). Assessing dimensionality of noncompensatory multidi-
mensional item response theory with complex structures. Educational
and Psychological Measurement, 73(2), 312–338. http://dx.doi.org/
10.1177/0013164412461353
Tan, X., & Gierl, M. J. (2006, April). Evaluating the consistency of DE-
TECT indices and item clusters using simulated and real data that display
both simple and complex structure. Paper presented at the annual meet-
ing of the American Educational Research Association, San Francisco,
CA.
Tate, R. (2002). Test dimensionality. In G. Tindal & T. M. Haladyna (Eds.),
Large-scale assessment programs for all students: Validity, technical ad-
equacy, and implementation (pp. 181–211). Mahwah, NJ: Erlbaum.
Tran, U., & Formann, A. (2009). Performance of parallel anal-
ysis in retrieving unidimensionality in the presence of binary
data. Educational and Psychological Measurement, 69(1), 50–61.
http://dx.doi.org/10.1177/0013164408318761
Tucker, L. R., Koopman, R. F., & Linn, R. L. (1969). Evaluation of
factor analytic research procedures by means of simulated correlation
matrices. Psychometrika, 34(4), 421–459. http://dx.doi.org/10.1007/
BF02290601
Wang, M. (1986, April). Fitting a unidimensional model to multidimensional
item response data. Paper presented at the ONR Contractors Conference,
Gatlinburg, TN.
Way, W. D., Ansley, T. N., & Forsyth, R. A. (1988). The comparative
effects of compensatory and noncompensatory two-dimensional data
on unidimensional IRT estimates. Applied Psychological Measurement,
12(3), 239–252. http://dx.doi.org/10.1177/014662168801200303
Weng, L., & Cheng, C. (2005). Parallel analysis with unidimensional bi-
nary data. Educational and Psychological Measurement, 65(5), 697–716.
http://dx.doi.org/10.1177/0013164404273941
Yu, C., & Muthén, B. O. (2002, April). Evaluation of model fit indices for
latent variable models with categorical and continuous outcomes. Paper
presented at the annual meeting of the American Educational Research
Association, New Orleans, LA.
Zhang, J., & Stout, W. (1999). The theoretical DETECT index of dimension-
ality and its application to approximate simple structure. Psychometrika,
64(2), 213–249. http://dx.doi.org/10.1007/BF02294536