A Comparison of Anchor-Item Designs for the Concurrent Calibration of Large Banks of Likert-Type Items
Miguel A. García-Pérez1, Rocío Alcalá-Quintana1, and Eduardo García-Cueto2
Abstract
Current interest in measuring quality of life is generating interest in the construction of com-
puterized adaptive tests (CATs) with Likert-type items. Calibration of an item bank for use
in CAT requires collecting responses to a large number of candidate items. However, the num-
ber is usually too large to administer to each subject in the calibration sample. The concurrent
anchor-item design solves this problem by splitting the items into separate subtests, with some
common items across subtests; then administering each subtest to a different sample; and finally
running estimation algorithms once on the aggregated data array, from which a substantial num-
ber of responses are then missing. Although the use of anchor-item designs is widespread, the
consequences of several configuration decisions on the accuracy of parameter estimates have
never been studied in the polytomous case. The present study addresses this question by sim-
ulation, comparing the outcomes of several alternatives on the configuration of the anchor-item
design. The factors defining variants of the anchor-item design are (a) subtest size, (b) balance of
common and unique items per subtest, (c) characteristics of the common items, and (d) criteria
for the distribution of unique items across subtests. The results of this study indicate that max-
imizing accuracy in item parameter recovery requires subtests of the largest possible number of
items and the smallest possible number of common items; the characteristics of the common
items and the criterion for distribution of unique items do not affect accuracy.
Keywords
computerized adaptive testing, item calibration, graded response model, linking, questionnaires,
simulation, anchor-item designs, health status, attitudes
1 Universidad Complutense, Madrid, Spain
2 Universidad de Oviedo, Oviedo, Spain

Corresponding Author:
Miguel A. García-Pérez, Departamento de Metodología, Facultad de Psicología, Universidad Complutense, Campus de Somosaguas, 28223 Madrid, Spain
Email: miguel@psi.ucm.es
Applied Psychological Measurement XX(X) 1–20
© The Author(s) 2010
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0146621609351259
http://apm.sagepub.com
Numerous inventories consisting of Likert-type items have been developed for marketing sur-
veys or to assess health status, physical functioning, or quality of life, and some of these inven-
tories are administered adaptively (Bjorner, Kosinski, & Ware, 2003; Cella & Chang, 2000;
Fletcher & Hattie, 2004; Hart, Wang, Stratford, & Mioduski, 2008a, 2008b; Lai, Cella, Chang,
Bode, & Heinemann, 2003; Singh, Howell, & Rhoads, 1990; Singh, Rhoads, & Howell, 1992;
Uttaro & Lehman, 1999; Ware, Bjorner, & Kosinski, 2000; Ware et al., 2003). The validity of
these adaptive inventories depends on the relevance of the substantive content of the items
but also on the accuracy with which the item parameters are estimated.
Computerized adaptive tests (CATs) rely on large item banks whose calibration is somewhat
problematic. In principle, the optimal choice for item calibration is the single-group design, in
which a large number of subjects respond to each of the items in the pool. However, the size
of the initial item pool for a CAT is generally too large to permit this approach. On other occa-
sions, an existing item bank is gradually updated with the addition of new items, which makes it
impossible to calibrate all items concurrently. As a result, items need to be calibrated in separate
sets. A number of linking techniques have been proposed that are intended to bring items thus
calibrated to a common scale that permits the use of arbitrary subsets of the item bank in com-
bination (see Vale, 1986).
Linking techniques for use in updating and maintenance of existing item banks use an anchor-
ing design intended to collect item responses appropriately, along with a transformation method
that brings the item parameters from separate calibrations to the common scale.
However, when a large item bank is constructed from scratch, the transformation method
becomes unnecessary because the data gathered with the anchoring design can be calibrated
in a single pass of the item parameter estimation algorithm, which guarantees the common scale
for all items. This article considers item calibration under these conditions. Vale (1986)
described a number of anchoring designs, but this analysis will be restricted to the ‘anchor-
item’ design, in which the item pool is divided into small subtests, each administered to a differ-
ent group of respondents but with the characteristic that all subtests share a few common items.
This design does not reduce the total number of subjects required for the calibration sample, but
it certainly reduces the burden on each subject. The resultant data array is sparse because all sub-
jects in the calibration sample respond only to the common items, whereas only small groups of
subjects respond to the remaining subsets of (unique) items.
The choice of an anchor-item design for the calibration of a large pool of items requires deci-
sions about (a) the number of items per subtest, (b) the relative numbers of common and unique
items in each subtest, (c) the choice of common items, and (d) the distribution of unique items
across subtests. The choice of common items and the distribution of unique items across subtests
seems critical: The accuracy of parameter estimates may depend, for instance, on whether the
common items have particular characteristics such as high or low discrimination or whether
the unique items are distributed so as to make the different subtests roughly parallel or hetero-
geneous. The goal of this article is to compare the accuracy with which the parameters of Likert-
type items can be recovered by anchor-item designs varying as to the above-mentioned factors.
The intended context of application is the development of a CAT item bank from scratch and,
therefore, the calibration of a large item pool, with no prior knowledge of the item parameters
or the trait levels of the subjects in the (large) calibration sample. The characteristics of this con-
text preclude the use of approaches based on the theory of optimal designs (e.g., Holman &
Berger, 2001).
Various studies have addressed some of these questions previously (e.g., Cohen & Kim, 1998;
de Gruijter, 1988; Hanson & Béguin, 2002; Kim & Cohen, 2002; Vale, 1986; Wingersky & Lord,
1984), but mostly in the context of bank maintenance (which involves a subsequent linking step)
and for dichotomous items. No detailed simulation study appears to have been published that
addresses all of these questions systematically in the context of concurrent calibration of a large
pool of Likert-type items. This research aimed at filling this gap.
Method
The present study used Samejima’s (1969, 1997) graded response model for Likert-type items
with K = 5 ordered response categories. To introduce the present notation, recall that the probability p_{jk} that the ith subject, whose trait level is θ_i, responds in category k (with 1 ≤ k ≤ K) on item j is given by

$$
p_{jk} =
\begin{cases}
1 - \dfrac{1}{1 + \exp[-a_j(\theta_i - b_{j,1})]} & \text{if } k = 1 \\
\dfrac{1}{1 + \exp[-a_j(\theta_i - b_{j,k-1})]} - \dfrac{1}{1 + \exp[-a_j(\theta_i - b_{j,k})]} & \text{if } 1 < k < K \\
\dfrac{1}{1 + \exp[-a_j(\theta_i - b_{j,K-1})]} & \text{if } k = K
\end{cases}
\qquad (1)
$$

where a_j is the item discrimination parameter and b_{j,k} is the boundary between categories k and k + 1 (for 1 ≤ k ≤ K − 1).
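Equation 1 is straightforward to evaluate numerically. The following Python sketch computes the K category probabilities for one subject and one item as differences of adjacent cumulative logistic terms; the function name and the example parameter values are illustrative only, as the study itself used custom software based on NAG subroutines.

import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities under the graded response model of Equation 1.

    theta : trait level of the respondent
    a     : item discrimination parameter
    b     : sequence of the K-1 ordered boundary parameters b_{j,1} < ... < b_{j,K-1}
    Returns an array of K probabilities, one per ordered response category.
    """
    b = np.asarray(b, dtype=float)
    # Cumulative probability of responding above each boundary.
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    # Differencing adjacent cumulative terms (padded with 1 and 0) yields
    # the three cases of Equation 1 in one step.
    upper = np.concatenate(([1.0], p_star))
    lower = np.concatenate((p_star, [0.0]))
    return upper - lower

# Example with K = 5 (illustrative parameter values); probabilities sum to 1.
probs = grm_category_probs(theta=0.3, a=1.8, b=[-1.5, -0.4, 0.6, 1.7])
print(probs, probs.sum())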
Item and Subject Parameters
The question that this research addresses was empirically motivated by the development of an
adaptive inventory for the assessment of perceived health, for which the initial pool consisted
of 155 Likert-type items with K = 5 response categories, and where a calibration sample of 1,000+ subjects was available. Therefore, for the present simulation study, parameters were generated for a pool of n = 155 items and a total sample of N = 1,120 respondents. The size of the
initial pool of items may be regarded as sufficiently small in some contexts for single-group
designs to be feasible (e.g., with cognitive items administered to student populations, or with
noncognitive items administered to healthy and motivated respondents with no time limitations).
However, the present application targets elderly and ill respondents who cannot reasonably be
confronted with 155 items. Furthermore, the subjects in this calibration sample were patients
at the offices of general-practice physicians and they were to take the subtest while they waited
for their appointment, which further limited the time available for responding (and, therefore, the
number of items that could be administered). Nevertheless, there seems to be no reason why the
present results would not generalize to cases in which the initial pool of items is substantially
larger.
Trait levels θ were drawn from a unit normal distribution using NAG subroutine G05DDF
(Numerical Algorithms Group, 1999), yielding the actual distribution shown in Figure 1a. As
for item parameters, the literature does not provide clear indications as to how the parameters
of Likert-type items with K = 5 are empirically distributed. Parameters have occasionally
been reported for sets of between 5 and 36 items (see Cohen & Kim, 1998; Emons, 2008;
Hol, Vorst, & Mellenbergh, 2007; Kim & Cohen, 2002; Koch, 1983; Reise, Widaman, &
Pugh, 1993; Singh et al., 1990), but these data do not provide sufficient information for the arti-
ficial generation of realistic parameters for 155 items. Nevertheless, analyses of all of these
reported sets of empirical item parameters invariably indicate that (a) the separation between
consecutive boundary parameters varies within and across items and (b) there is a negative cor-
relation between the discrimination parameter a and the separation between outer boundary
parameters (i.e., the difference b_{j,K−1} − b_{j,1}). In other words, items with high discrimination
tend to have their boundary parameters closely spaced, whereas these boundary parameters
are more widely spread out in items with low discrimination.
In contrast, earlier simulation studies generated item parameters lacking one or both of these
characteristics. For instance, Dodd, Koch, and De Ayala (1989) used item discriminations that
varied systematically from 0.90 to 2.15 in steps of 0.05. These discrimination levels were ran-
domly assigned to items whose boundary parameter characteristics were only described to reflect
‘those typically obtained from calibration of real graded response ability test data.’ Baker
(1992) generated boundary parameters for each item from a unit-normal distribution, and their
outer boundary separation was uncorrelated with a discrimination parameter generated from
a uniform distribution from 1.34 to 2.65, a strategy guaranteeing only that the separation between
consecutive boundary parameters is not constant. Woods (2007) used a slightly different
approach that also generated discrimination parameters that were uncorrelated with the separa-
tion between outer boundary parameters, although the distance between consecutive boundary
parameters was still random within and across items. Finally, Meade, Lautenschlager, and John-
son (2007) generated discrimination parameters to be normally distributed (M = 1.25 and SD = 0.07). These were uncorrelated with boundary parameters generated such that b_{j,1} for each item was drawn from a normal distribution (M = −1.7 and SD = 0.45), but b_{j,k+1} = b_{j,k} + 1.2 for all 1 ≤ k ≤ K − 2 in each item, unrealistically implying that the distance between consecutive boundary parameters in each item was a constant 1.2 for all items.
Figure 1. Distribution of trait levels in the simulated sample (a) and plot of the separation b_{j,4} − b_{j,1} against a_j for each of the 155 items in the pool (b).

To reproduce the empirical characteristics discussed above (i.e., a negative relationship between discrimination and separation between outer boundary parameters, coupled with unequal distances between consecutive boundary parameters within and across items), the present study used a different strategy. In particular, item discrimination parameters were generated through NAG subroutine G05DAF to be uniformly distributed on the interval [1, 3], and boundary parameters were also generated to be uniformly distributed on an interval whose range varied with the particular boundary to be considered and also with the discrimination parameter for the item of concern. Specifically, b_{j,1} was unconstrained and ranged between −3 and 1, but b_{j,k} (for 2 ≤ k ≤ 4) ranged between b_{j,k−1} + (6 − 0.7a_j)/12 and b_{j,k−1} + (6 − 0.7a_j)/3. Thus, any given boundary parameter beyond the first one for an item was randomly located above the preceding boundary parameter within a range whose breadth was a decreasing function of the item discrimination parameter a. The constants in the expressions defining the limits of these ranges were chosen arbitrarily so as to produce reasonable boundary locations and separations between the outer boundaries for each item. This approach produced the desired negative correlation between discrimination and separation of outer boundaries. This is evident from the outcomes when this approach was used for the generation of item parameters for the 155 items (see Figure 1b); note also that it guarantees compliance with the order restriction b_{j,k} < b_{j,k+1} for all 1 ≤ k ≤ 3.
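As an illustration, this generation scheme can be sketched in a few lines of Python; numpy's uniform generator stands in for NAG subroutine G05DAF, and the seed and variable names are illustrative rather than those of the study's custom software.

import numpy as np

rng = np.random.default_rng(seed=1)  # illustrative seed
n_items = 155

# Discrimination parameters uniform on [1, 3].
a = rng.uniform(1.0, 3.0, size=n_items)

# Boundary parameters: b_{j,1} uniform on [-3, 1]; each later boundary lies above
# the previous one by a uniform step between (6 - 0.7*a_j)/12 and (6 - 0.7*a_j)/3,
# so more discriminating items get more closely spaced boundaries.
b = np.empty((n_items, 4))
b[:, 0] = rng.uniform(-3.0, 1.0, size=n_items)
for k in range(1, 4):
    b[:, k] = b[:, k - 1] + rng.uniform((6.0 - 0.7 * a) / 12.0, (6.0 - 0.7 * a) / 3.0)

# The construction enforces b_{j,k} < b_{j,k+1} and induces a negative correlation
# between a_j and the outer separation b_{j,4} - b_{j,1} (compare Figure 1b).
print(np.corrcoef(a, b[:, 3] - b[:, 0])[0, 1])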
Calibration Designs to Be Compared
Twenty-four anchor-item designs were considered, varying according to the four factors mentioned
in the introduction. The first factor, number of items per subtest, had two levels: 15 and 20 items.
The second factor, the relative numbers of common and unique items, had three levels per size of
subtest. Thus, for 15-item subtests, the number of common items was 5, 8, or 10, respectively,
requiring 10, 7, and 5 additional unique items per subtest; for 20-item subtests, the number of com-
mon items was 5, 11, or 15, respectively, requiring 15, 9, and 5 additional unique items per subtest.
Each of these designs, in turn, required a different number of subtests and, hence, groups of sub-
jects of different size (subject to the constraint that the total calibration sample consists of 1,120
respondents). Table 1 gives a summary of these requirements. At the same time, the resultant
data array was generally sparse, as illustrated in the upper panel of Figure 2, showing the case
of 20-item subtests with five common items: Of the 1,120 × 155 = 173,600 cells in the data array, only 1,120 × 20 = 22,400 contain data, arising from 1,120 responses to each of the five common items, plus 112 responses to each of 10 sets of 15 unique items.
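The number of subtests and the group sizes in Table 1 follow directly from the pool size, the subtest length, and the number of common items; a minimal Python sketch of this bookkeeping (the constant names are illustrative) is given below.

import math

N_POOL, N_SUBJECTS = 155, 1120  # item pool and total calibration sample

for n_items, n_common in [(20, 5), (20, 11), (20, 15), (15, 5), (15, 8), (15, 10)]:
    n_unique = n_items - n_common                          # unique items per subtest
    n_groups = math.ceil((N_POOL - n_common) / n_unique)   # subtests needed to cover the pool
    per_group = N_SUBJECTS / n_groups                      # respondents available per subtest
    print(n_items, n_common, n_unique, n_groups, round(per_group, 1))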
The criterion for the choice of common items and the distribution of unique items acted as
a single factor with four levels. In the absence of any prior knowledge of item parameters, com-
mon items can only be chosen at random (or perhaps because of their content, but certainly not
because of their parameters), and unique items can only be distributed at random. This is one of
the levels for this factor, and is indeed the only empirical option available where no preliminary
study has been carried out to obtain item parameter estimates. In that case, the resultant subtests
will be heterogeneous. However, the choice of common items and the distribution of unique
items according to their parameters might be expected to render more accurate estimates. The
true item parameters are not available in the practical application that is being considered
(although they will be roughly known in other practical applications where the items have
been pretested) but they are actually known in the present simulation study. Capitalizing on
this knowledge, three additional strategies were defined for the choice of common items, and
a single additional strategy for the distribution of unique items. The three selection criteria for
the choice of common items were (a) items with the highest discrimination and the largest sep-
aration between their outer boundary parameters, (b) items with the lowest discrimination and
the smallest separation between their outer boundary parameters, and (c) items that were heterogeneous in both respects.

Table 1. Features of Each of the Six Major Anchor-Item Configurations

NI    NC    NU    NG    NS
20     5    15    10    112
20    11     9    16    70
20    15     5    28    40
15     5    10    15    74-75
15     8     7    21    53-54
15    10     5    29    38-39

Note: NI = number of items per subtest; NC = number of common items; NU = number of unique items; NG = number of resultant subtests (and groups of subjects); NS = number of subjects per group.

The only criterion for the distribution of unique items in these
three cases was to maximize the homogeneity of the resultant subtests by placing in them items
that are comparable as to discrimination and separation between their outer boundaries. Figure 3
illustrates what the four levels of this factor represent for the assembly of 20-item subtests with
15 common items. Solid circles in each panel indicate the set of 15 common items that were
selected with the applicable criterion, and each of the five remaining groups of 28 identical sym-
bols (open circles, open squares, open diamonds, open inverted triangles, and upright gray trian-
gles) represents one of the bins from which unique items were selected for inclusion in the 28
subtests required by this design. The five unique items for a given subtest were selected by pick-
ing at random (and without replacement) one item from each of these five bins.
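For concreteness, the purely random assembly option (the only one available in practice when item parameters are unknown) can be sketched in Python as follows; the parameter-based criteria of Figure 3 would additionally sort the unique items into bins by discrimination and outer-boundary separation before drawing. The function name and seed are illustrative.

import numpy as np

rng = np.random.default_rng(seed=2)  # illustrative seed

def assemble_subtests_randomly(n_pool, n_common, n_unique, n_groups):
    """Pick common items at random and spread the remaining (unique) items
    at random across subtests; every subtest includes all common items."""
    order = rng.permutation(n_pool)           # item indices 0..n_pool-1, shuffled
    common = order[:n_common]
    unique = order[n_common:]
    subtests = [np.concatenate((common, unique[g * n_unique:(g + 1) * n_unique]))
                for g in range(n_groups)]
    return common, subtests

# 20-item subtests with 5 common items: 10 subtests covering the 155-item pool.
common, subtests = assemble_subtests_randomly(n_pool=155, n_common=5,
                                              n_unique=15, n_groups=10)
assert all(len(s) == 20 for s in subtests)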
To determine how the accuracy of these item parameter estimates compared with those result-
ing from single-group designs involving similar numbers of respondents, the present study also
included single-group designs in which all subjects responded to all 155 items. Because the pres-
ent set of anchor-item design configurations involved a different number of respondents per
unique item (given that all subjects respond to the common items; see Table 1), in this study sin-
gle-group designs with 39, 40, 53, 70, 75, 112, and 1,120 subjects were included. The lower panel
in Figure 2 illustrates the data array for the single-group design matched to the anchor-item
design in the upper panel. In this case, the array had no missing data and consisted of 112 ×
155 ¼ 17,360 cells with the same number of responses to Items 6 to 155 in the single-group
1120 subjects
112
112
112
112
112
112
112
112
112
112
155 items
5 15 15 15 15 15 15 15 15 15 15
112
Figure 2. Sketch of the data array for a 20-item anchor-item design with five common items (upper panel)
and sketch of the data array for a comparison single-group design in which the total number of subjects
equals the number of subjects per group in the anchor-item design (lower panel)
6 Applied Psychological Measurement XX(X)
at Univ de Oviedo-Bib Univ on May 7, 2015apm.sagepub.comDownloaded from
and the anchor-item designs, although the latter collected 1,120 responses to the five common
items compared to only 112 responses in the matched single-group design.
Simulation Approach
The simulation approach for each of the 24 anchor-item designs and the seven single-group
designs was identical, except that the number of subjects (or the items to which the subjects
responded) varied as dictated by the particular design under consideration. Custom software
was written for this purpose. To simulate the response of subject i to item j, numerical values
for the probabilities under the multinomial distribution in Equation 1 were determined by insert-
ing into the expressions the particular trait level of the subject and the parameters of the item.
Then, the resultant multinomial experiment was simulated through NAG subroutine G05DAF,
and the outcome (ranging from 1 to 5 and representing the ordered category of the response)
was recorded in the applicable data array. The NAG subroutine that simulates the multinomial
experiment does only what could have been programmed from scratch: The set of five probabil-
ities arising from Equation 1 for item j and subject i partitioned the segment [0, 1] into five adja-
cent and nonoverlapping regions labeled from 1 to 5, each of which had length p_{jk} (for 1 ≤ k ≤ 5); next, the simulated response was given directly by the label of the region in which a uniformly
distributed random number fell. Missing responses to items not administered in the anchor-item
design under consideration were differently coded in the data file so as to be identifiable by the calibration software.

Figure 3. Illustration of several criteria for the choice of common items (solid circles) and the grouping of unique items (subsets of symbols of different shapes) for distribution across subtests: (a) high discrimination and large separation; (b) low discrimination and small separation; (c) heterogeneous; (d) random.
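The response-simulation step just described (partitioning [0, 1] according to the Equation 1 probabilities and reading off the region into which a uniform random number falls) can be sketched as below; the missing-data code, the seed, and the function name are illustrative, and numpy's uniform generator stands in for the NAG routine.

import numpy as np

rng = np.random.default_rng(seed=3)  # illustrative seed

def simulate_response(theta, a, b):
    """Draw one graded response (1..K) for a subject with trait level theta,
    using the category probabilities of Equation 1."""
    b = np.asarray(b, dtype=float)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))                    # cumulative terms
    probs = np.concatenate(([1.0], p_star)) - np.concatenate((p_star, [0.0]))
    cuts = np.cumsum(probs)                                            # partition of [0, 1]
    u = rng.uniform()                                                  # uniform draw
    return int(min(np.searchsorted(cuts, u), len(probs) - 1)) + 1      # region label

# Responses to items not administered under a given design are stored with a
# distinct missing-data code (the value 9 here is purely illustrative).
MISSING = 9
data = np.full((1120, 155), MISSING, dtype=int)
data[0, 0] = simulate_response(theta=0.3, a=1.8, b=[-1.5, -0.4, 0.6, 1.7])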
Parameter Estimation
The data array representing the simulation of each design was input to MULTILOG 7.03 (du Toit,
2003) to obtain estimates of the item parameters using marginal maximum likelihood methods.
Default values were used for the remaining options. Estimates $\hat{a}$, $\hat{b}_1$, $\hat{b}_2$, $\hat{b}_3$, and $\hat{b}_4$ of the discrimination and boundary parameters of each item were thus obtained. Each individual MULTILOG run
was checked for convergence, and no failure was observed. Also, evidence of prior dominance
was sought by checking for unduly concentrated parameter estimates around their prior (vs. true)
locations but, again, no signs of this were observed.
Criteria for the Comparison
Because the main goal of this study was to compare the accuracy with which each design recov-
ered the item parameters, four complementary criteria were used. The first criterion involves
a scatterplot and the accompanying product–moment correlation between true and estimated
parameters, separately computed for each of the five item parameters. The second criterion,
also separately computed for each parameter, is the root mean square error (RMSE), defined as
$$
\mathrm{RMSE}_{\omega} = \sqrt{\frac{1}{n}\sum_{j=1}^{n} \left(\hat{\omega}_j - \omega_j\right)^2} \qquad (2)
$$

for each ω ∈ {a, b_1, b_2, b_3, b_4}. The third criterion is a global measure for each item, defined as the mean Euclidean distance (MED) between the point in five-dimensional space at the location of the true item parameters and the point at the location of their estimates, that is,

$$
\mathrm{MED} = \frac{1}{n}\sum_{j=1}^{n} \sqrt{\left(\hat{a}_j - a_j\right)^2 + \sum_{k=1}^{4} \left(\hat{b}_{j,k} - b_{j,k}\right)^2}. \qquad (3)
$$
Finally, and because
MULTILOG occasionally seems unable to estimate the parameters of some
items, the fourth criterion was a mere count of the items whose parameters could not be
estimated. This behavior of
MULTILOG has been described earlier (Baker, 1997) and shows in that estimates of the boundary parameters do not satisfy the order restriction $\hat{b}_{j,k} < \hat{b}_{j,k+1}$ for all 1 ≤ k ≤ 3. Naturally, when the parameters of some of the items could not be estimated,1 Equations 2 and 3 for the computation of RMSE and MED did not use n = 155 but only the number of items actually involved. The same was true for the computation of product–moment correlations.
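The two summary measures are easily computed once true and estimated parameters are arranged in arrays. The following Python sketch implements Equations 2 and 3, skipping items whose parameters could not be estimated (flagged here as NaN, an illustrative convention).

import numpy as np

def rmse(true_vals, est_vals):
    """Equation 2 for one parameter (a or one of the b's); items whose
    parameters could not be estimated are flagged as NaN and skipped."""
    ok = ~np.isnan(est_vals)
    return float(np.sqrt(np.mean((est_vals[ok] - true_vals[ok]) ** 2)))

def med(true_a, est_a, true_b, est_b):
    """Equation 3: mean Euclidean distance between the true and estimated
    (a, b_1, ..., b_4) points, again skipping items with missing estimates.
    true_a and est_a have shape (n,); true_b and est_b have shape (n, 4)."""
    ok = (~np.isnan(est_a)) & (~np.isnan(est_b).any(axis=1))
    dist = np.sqrt((est_a[ok] - true_a[ok]) ** 2 +
                   ((est_b[ok] - true_b[ok]) ** 2).sum(axis=1))
    return float(dist.mean())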
Results
Single-Group Designs
Figure 4 shows scatterplots of true parameters and their estimates in four representative single-
group designs involving the number of subjects indicated on the right of each row. Left to right,
the columns correspond to a, b_1, b_2, b_3, and b_4 (see the labels at the bottom). Also given in each
panel are the product–moment correlation and the value of RMSE for the corresponding param-
eter. The global value of MED is given on the right of each row, where the number of items
whose parameters could not be estimated is also given in parentheses.
A quick visual comparison of the four rows in Figure 4 corroborates the well-known fact that
accuracy in parameter estimation decreases as the number of respondents decreases (Reise & Yu,
1990): The scatter of data increases and the correlation decreases.

Figure 4. Scatterplot of true and estimated parameters from single-group designs involving different numbers of respondents.

Figure 5. Summary RMSE (bars; left to right, the bars pertain to parameters a, b_1, b_2, b_3, and b_4) and MED measures (horizontal lines and numerals) for single-group designs involving different numbers of respondents; MED was 1.216, 1.281, 1.289, and 1.468 for the designs with 1,120, 112, 70, and 40 respondents, which left 0, 0, 4, and 9 items with inestimable parameters, respectively. RMSE = root mean square error; MED = mean Euclidean distance.

The summary measures RMSE and MED printed in Figure 4 are better compared graphically in Figure 5. Each of the four single-
group designs is represented in Figure 5 as a separate block, with the total number of respondents
decreasing left to right (see the bottom labels, where the number of items whose parameters
could not be estimated is also given in parentheses). The height of the black bar in each block
indicates the value of RMSE for the discrimination parameter a, whereas the height of the white
bars indicates, left to right, the value of RMSE for the boundary parameters b_1, b_2, b_3, and b_4. The
height of the horizontal segment above each block gives the global value of MED. The deteri-
oration also shows in that both RMSE and MED generally increase as the number of respondents
decreases, and the number of items whose parameters cannot be estimated also increases as the
number of respondents decreases.
It is worth noting that in Figure 5 MED and, especially, RMSE do not seem to capture very
accurately the actual deterioration that the scatterplots and correlations in Figure 4 reveal. For
instance, the scatter of data for the discrimination parameter is clearly seen in the left column
of Figure 4 to increase downwards (i.e., as the size of the sample of respondents decreases),
and the correlation decreases accordingly. The increase in scatter is substantial from the top
panel (for 1,120 respondents) to the panel immediately underneath (for 112 respondents), and
the correlation also decreases from .989 to .875. Yet, against all reasonable expectations,
RMSE actually drops slightly from .478 to .424 (instead of increasing sizably) when sample
size decreases from 1,120 to 112 respondents. In other words, the deterioration is actually there
as revealed by the scatterplots and correlations, but RMSE and MED do not seem to capture it
properly.
Another characteristic that is worth commenting on is that data points in the panels of Figure 4
do not meander around the diagonal identity line. This is particularly evident in the top row,
where the scatter of data is minimal but regression lines would not have a unit slope: The slope
is clearly less than unity for the discrimination parameter (left panel) and it is greater than unity
for all boundary parameters. These characteristics reveal that
MULTILOG recovers parameters
accurately given a sufficiently large number of respondents (as indicated by the tightness of
data and the high correlations in the top row of Figure 4) but it does so under a metric that differs
from that of the true parameters (which produces the non-unit slope of the elongated cloud of
data). This is reminiscent of the need for linking methods in separate-group calibration designs,
but these methods are not applicable in practice because the anchor points could only be provided
by a few items whose true parameters were known. In practice, this is not a problem because the
metric of item parameters is immaterial as long as it is common to all items, but in the present
situation, it explains the failure of RMSE and MED to capture the accuracy with which param-
eters can be estimated (i.e., the amount of scatter in plots like those in Figure 4).
RMSE measures the discrepancy between true and estimated parameters by way of the alge-
braic difference between them, and the same is true for MED. In graphs like those in Figure 4,
RMSE thus increases as the vertical distance between data points and the diagonal identity line
increases. For the discrimination parameter in the leftmost panel of the top row in Figure 4,
where the scatter of data is minimal, these distances are large as a result of the different metric
of true and estimated parameters, yielding an RMSE of .478; for the first boundary parameter
(see the second panel in the first row of Figure 4), the scatter of data is similarly small. However,
the data points lie closer to the identity line, yielding an RMSE of only .227; for the last boundary
parameter (see the rightmost panel in the top row of Figure 4), the scatter is again similar but the
data points lie farther from the diagonal line, yielding a much larger RMSE valued at .881. Con-
sider now the second row in Figure 4, where the scatter of data is substantially larger, with the
consequence that some of the data points lie closer to the identity line. This produces spurious
RMSE values that are generally similar to those in the first row (with the exception of the first
boundary parameter, for which data were around the diagonal line in the top row), despite the
clearly inferior estimation accuracy.
To confirm that the RMSE and MED measures reported in Figure 5 are indeed affected by this
problem, 50 independent replications of these single-group designs were run and the results of
each individual replication were plotted in the form of Figures 4 and 5. The additional replica-
tions rendered thoroughly analogous results: Differences in metric and variations in estimation
accuracy with sample size paralleled those observed in Figure 4, and RMSE and MED measures
continued to display the contaminated trends shown in Figure 5. To illustrate, Figure 6 shows box
plots of the distribution of MED and RMSE (for each parameter) across replications, using the
same graphical format used in Figure 5 to report the results of a single replicate. Note in the upper
panel of Figure 6 that the distribution of MED values across replications increases as the number
of respondents in the sample decreases, in much the same form as was reported in Figure 5 for
a single replicate; at the same time, distributions of RMSE for each of the five item parameters
(arranged within each block in Figure 6 in the same order as in the blocks of Figure 5) also show
that the distributions of RMSE values are highly similar for all parameters except b_1 when the
sample consisted of either 1120 or 112 respondents, paralleling the results reported in Figure
5 for a single replicate.
Figure 6. Box plots (minimum, first quartile, second quartile, third quartile, maximum) of the distribution of MED (top panel) and RMSE (bottom panel; left to right, the box plots pertain to parameters a, b_1, b_2, b_3, and b_4) across replications of single-group designs involving different numbers of respondents. RMSE = root mean square error; MED = mean Euclidean distance.

In sum, RMSE and MED do not faithfully portray the accuracy with which parameters can be estimated when the metric of true and estimated parameters differs. In these situations,
scatterplots and correlations clearly have the last word. It should nevertheless be stressed that
these discrepancies may occur only in extreme conditions such as those depicted in the top
row of Figure 4, where differences in metric overwhelm accuracy as determined by the scatter
of data; indeed, RMSE and MED measures seem to provide valid summaries for the comparison
of accuracy across the three bottom rows of Figure 4. The present study will continue to use
RMSE and MED despite their occasional misbehavior because these are typical summary meas-
ures in simulation studies, but it must be emphasized that all of the conclusions are primarily
based on an analysis of the raw results (scatterplots and correlations) and not only on the sum-
maries provided by RMSE and MED.
It is useful to recall that these single-group designs were included in the present study to set
a reference against which the outcomes of anchor-item designs would be compared. In practice,
it will often be infeasible to have all subjects (regardless of their number) respond to all of the
items in the initial pool. Nevertheless, it is interesting to have some sense of how the outcomes of
a given anchor-item design compare to the (infeasible) single-group design involving the same
number of respondents as in each of the groups participating in the anchor-item design. It is inter-
esting to note that the accuracy with which the parameters of Likert-type items can be estimated
varies greatly with the number of respondents (Reise & Yu, 1990); however, further research
should be conducted to establish whether the potentially inferior accuracy provided by an
anchor-item design is caused by the fact that calibration involves a single run on a very sparse
data array, or the fact that different groups of subjects respond to different sets of unique items, or
is only a consequence of the small size of the subgroups into which the total calibration sample is
divided.
Twenty-Item Anchor-Item Designs
The outcomes of anchor-item designs were analyzed as described above for single-group
designs. However, scatterplots and correlations did not differ meaningfully across variations
in the criteria used for the selection of common items and the distribution of unique items,
and these plots are thus omitted. Nevertheless, they are available from the corresponding author
on request. In any case, the study confirmed that the location of data points relative to the identity
line in the scatterplots were similar in all compared conditions. The summary RMSE and MED
measures, therefore, were not contaminated by the differences in metric discussed in the preced-
ing section. Figure 7 shows summary plots of RMSE and MED for each of the 12 anchor-item
designs involving 20-item subtests, and for the three comparison single-group designs involving
a total number of subjects identical to that in each of the subgroups of the corresponding anchor-
item design. Thus, Figure 7a shows results for the case of 5 common and 15 unique items per
subtest (10 groups of 112 subjects), Figure 7b shows results for the case of 11 common and 9
unique items (16 groups of 70 subjects), and Figure 7c shows results for the case of 15 common
and 5 unique items (28 groups of 40 subjects). The leftmost block in each part of Figure 7 gives
results for the comparison single-group design (containing responses to all 155 items by single
groups of sizes 112, 70, and 40, respectively, for Figures 7a, 7b, and 7c; these blocks are merely
replotted from Figure 5), and the remaining blocks give results under the four different criteria
for selection of common items and distribution of unique items.
Three characteristics of the patterns of RMSE and MED displayed in Figure 7 are worth not-
ing. First, accuracy deteriorates as the number of common items increases and, consequently, the
number of unique items decreases. This is perhaps because the size of the subgroups of respond-
ents also decreases and, thus, the number of responses to the unique items decreases. Second,
results for matched single-group designs (leftmost block in each panel of Figure 7) and alterna-
tive anchor-item designs (four blocks on the right of each panel of Figure 7) reveal that the
matched single-group design renders slightly less accurate parameter estimates than any of the
anchor-item designs with which it can be compared (this was further confirmed by inspection of
scatterplots and correlations). Third, no meaningful differences in accuracy can be observed
across anchor-item designs varying only as to criteria for the selection of common items and
the distribution of unique items. This has the important practical implication that common items
can be chosen randomly and unique items can be distributed randomly across subtests with no
consequence to the accuracy with which item parameters can be estimated, despite the hetero-
geneity of the resultant subtests. Finally, all anchor-item designs varying only as to criteria
for the selection of common items and distribution of unique items yield similar numbers of
items whose parameters cannot be estimated; and these numbers are generally no larger than
those resulting from the matched single-group design. In all the scatterplots, the data points per-
taining to common items were always right on or very close to the putative regression line (which
was not the diagonal line because of the metric differences discussed above), suggesting that the
large number of responses collected for them actually contributed to the accuracy with which their parameters could be estimated.

Figure 7. Summary RMSE (bars; left to right, the bars pertain to parameters a, b_1, b_2, b_3, and b_4) and MED measures (horizontal lines and numerals) for the set of 12 anchor-item designs involving 20-item subtests and the comparison single-group designs: (a) 5 common and 15 unique items (10 groups of 112 subjects); (b) 11 common and 9 unique items (16 groups of 70 subjects); (c) 15 common and 5 unique items (28 groups of 40 subjects). RMSE = root mean square error; MED = mean Euclidean distance.
It seems clear, then, that anchor-item designs with five common items are optimal among
these 20-item designs, and that these designs outperform their matched single-group design.
Nevertheless, they should not outperform a single-group design involving 1,120 respondents.
Yet these 20-item anchor-item designs appear to provide more accurate parameter estimates
in terms of RMSE and MED than the single-group design in which 1,120 subjects respond to
all 155 items (compare with the leftmost block in Figure 5). This conclusion defies logic, but
it only reflects the inability of RMSE and MED to portray the actual accuracy with which param-
eters can be estimated when there are differences in the metric of true and estimated parameters,
something that severely inflates RMSE and MED measures in the case of the present single-
group design with 1,120 subjects as discussed above. Contrary to what RMSE and MED indicate,
scatterplots and correlations reveal that the single-group design involving 1,120 respondents
indeed provides more accurate estimates (despite the different metric) than 20-item anchor-item designs with five common items.

Figure 8. Summary RMSE (bars; left to right, the bars pertain to parameters a, b_1, b_2, b_3, and b_4) and MED measures (horizontal lines and numerals) for the set of 12 anchor-item designs involving 15-item subtests and the comparison single-group designs: (a) 5 common and 10 unique items (15 groups of 74-75 subjects); (b) 8 common and 7 unique items (21 groups of 53-54 subjects); (c) 10 common and 5 unique items (29 groups of 38-39 subjects). RMSE = root mean square error; MED = mean Euclidean distance.
Fifteen-Item Anchor-Item Designs
The outcomes of anchor-item designs involving 15-item subtests are displayed in Figure 8. The
three characteristics that were described for 20-item designs hold also for these 15-item designs,
leading to the practical conclusions that items can be distributed at random across subtests and
that maximal accuracy is obtained when the number of common items is smallest and the number
of unique items is largest. The anchor-item design having five common items and 10 unique items
is optimal among the present 15-item designs, but a comparison of Figures 7a and 8a reveals that
20-item subtests consisting of five common and 15 unique items yield more accurate estimates
than 15-item subtests with five common items. Ultimately, a small number of common items com-
bined with a large number of unique items serve the purpose of reducing the number of subtests
into which the item pool must be partitioned and, when the size of the calibration sample is fixed,
allows larger groups of subjects to respond to each subtest, which is what seems to determine the
accuracy with which item parameters are estimated. To confirm this latter point, data arrays from
the 20-item designs with five common items were reduced by randomly eliminating 37 subjects
from each group of 112 respondents. This left the same number of respondents per group as in
the comparable 15-item designs and reduced the number of responses to the five common items
from 1,120 to only 750. Compared to the initial 20-item designs with 112 respondents per group,
the accuracy with which item parameters were estimated from the trimmed 20-item designs was
similar to that described in Figure 8a for 15-item designs also involving five common items.
Clearly, the accuracy of item parameter estimates from anchor-item designs of the type considered
in this article is determined by the number of respondents per group.
Additional Anchor-Item Designs
Although subtests should be assembled with a minimum number of common items and a maxi-
mum number of unique items, the smallest number of common items in the preceding designs
was five. But what is the minimum number of common items that provides sufficiently accurate
parameter estimates? To investigate this question, two further anchor-item designs were consid-
ered: one used three common items and 19 unique items (yielding eight 22-item subtests) and
896 subjects; the other used a single common item and 22 unique items (yielding seven 23-item
subtests), and 784 subjects. For these additional analyses, common items were selected at ran-
dom, and unique items were distributed at random among subtests. In both cases, there were
112 subjects per group so that the number of common items and the number of subjects per group
were not confounded. The results of these two additional designs are presented in Figure 9 along
with those from the random condition (five common items and the same number of subjects per
group, replotted from the rightmost block in Figure 7a). These results reveal that a single com-
mon item was sufficient to provide accurate parameter estimates. Inspection of scatterplots and
correlations confirmed the picture provided by the RMSE and MED measures. Accuracy
increases minimally as the number of common items increases, but this appears to be merely
a result of the increasing overall number of subjects in the calibration sample, which in turn increases the number of responses to the common items.

Figure 9. Summary RMSE (bars; left to right, the bars pertain to parameters a, b_1, b_2, b_3, and b_4) and MED measures (horizontal lines and numerals) for anchor-item designs involving 23-item subtests with one common item (784 subjects; MED = 0.846), 22-item subtests with three common items (896 subjects; MED = 0.818), and 20-item subtests with five common items (1,120 subjects; MED = 0.801). RMSE = root mean square error; MED = mean Euclidean distance.
Alternative Item Parameters
After this simulation study was completed, a subset of n = 125 of the actual items was pretested with a sample of N = 185 general-population respondents under a single-group design. The
resultant parameter estimates turned out to have characteristics quite different from those used
in the simulations just described, in that the negative relation between discrimination and sepa-
ration between outer boundaries was much stronger (see Figure 10). The question was whether
the relative performance of the anchor-item designs in the present study would change when item
parameters were so drastically different, and so the simulation study for 125 items were repeated
with the true parameters depicted in Figure 10 instead of the 155 items whose parameters were
depicted in Figure 1b. The simulation was carried out along the same lines and with the same
factors, but it was adapted to accommodate the case of 125 items.
Figure 10. Separation b_{j,4} − b_{j,1} against a_j for each of the 125 items for the second simulation study.
Individually, the results did change slightly but the overall pattern remained the same, leading
to the same conclusion: Parameter estimates were comparatively more accurate when the num-
ber of common items and subtests were smallest, and the size of each group of respondents was
largest. These converging results based on two different sets of true item parameters (one derived
from actual administration of real items) support the generalizability of these conclusions.
Discussion and Conclusion
This study compared the relative accuracy with which the parameters of Likert-type items were
recovered through anchor-item calibration designs of various configurations. When the applica-
tion was subject to the empirical constraints that the sample of respondents (however large) and
the maximal number of items in a subtest are limited, results showed that item parameters could
be most accurately recovered using the smallest possible number of common items (a single
common item was sufficient) and a number of unique items as large as the maximum feasible length of the subtests permits. The resulting number of subtests was small and, therefore, the avail-
able sample of respondents was split into the minimum number of groups of the largest possible
size. This provided a large number of responses for each unique item, which also contributed to
the accuracy with which item parameters could be estimated.
Study results also show that random selection of common items and random distribution of
unique items across subtests yielded the same estimation accuracy as did selection of common
items and distribution of unique items according to their characteristics. This conclusion arises
from the finding that the various alternative options that were used to assemble subtests produced
comparable outcomes, which seems to indicate that the true parameter values of common items are
actually immaterial. The practical importance of this result is that the subtests involved in the appli-
cation of an anchor-item design can be safely assembled with no prior information about the items
whose parameters will be estimated. The results also showed that the optimal anchor-item design
under the empirical constraints of the size of the sample of respondents and maximum length of the
subtests, provided item parameter estimates that were more accurate than those that could be
obtained through a matched single-group design with the same number of respondents as in
each of the subgroups involved in the anchor-item design. This is perhaps surprising because
the anchor-item design renders a very sparse data array, but that sparse data array contains exactly
as many responses to unique items as there are in the comparable single-group design, and also
includes more responses to the common items than would be collected in the matched single-group
design (see Figure 2). This seems to suggest that the additional responses to the common items in
the anchor-item design played a substantial role in increasing estimation accuracy.
The two simulation studies considered only Likert-type items with K = 5 response categories. The numbers of items in each pool were n = 155 and n = 125, and the item pools had true
parameters with different characteristics (compare Figures 1b and 10). In view of the similarity
of the results, it is very likely that the study’s conclusions will generalize to other K, n, and N.
It is worth noting that all the results are based on simulations using unidimensional items that
measure the same dimension, where there are no items showing differential item functioning (DIF), and where trait distributions are unit
normal and equal (within sampling error) across groups of respondents. The question then arises
whether these conclusions would be valid in empirical situations in which one or more of these
conditions are violated.
First, note that a failure of unidimensionality and the presence of DIF items is not more of
a problem in anchor-item designs than it is in single-group designs. The same solution can be
applied in both cases. Under a single-group design, responses to any new set of items are
used to test for unidimensionality and DIF, and DIF or otherwise deviant items are then elimi-
nated, leaving a reduced pool of homogeneous and non-DIF items that are then calibrated. The
same approach can be taken with responses collected under an anchor-item design, although the
number of common items must not be too small or there will be some risk that all of the common
items would have to be eliminated. Interestingly, the results of this study show that all surviving
items will be properly calibrated provided at least one common item remains in the reduced pool.
On the other hand, a mismatch between the trait distribution of the respondents and the prior dis-
tribution assumed during calibration does not seem to affect estimation accuracy in dichotomous
items (García-Pérez, 1999), and there is no reason to think that such mismatches should be rel-
evant with Likert-type items.
Violations of the assumptions of the present study may occur in a myriad of ways. Such vio-
lations would result in a reduction in estimation accuracy, and the amount of absolute reduction
would probably vary with the particular form in which the assumptions were violated. In any
case, there is no reason to think that a particular type of anchor-item design is intrinsically
more robust to violations of any kind. A thorough study that systematically addresses these issues
is beyond the scope of this article, but this is certainly an area where further research would be
useful. For the time being, and under the practical constraints that motivated this study (i.e., item
banks consisting of new items whose characteristics are unknown and whose size is too large to
permit single-group approaches), the results yield guidelines that may actually alleviate the
potential problems that might be caused by violation of the study assumptions.
The use of the smallest possible number of subtests of the largest possible size allows for
larger groups of respondents that are less likely to be widely heterogeneous in trait distribution.
On the other hand, the potentially detrimental effects of DIF items, or of otherwise low-quality items,
being accidentally selected as common items (such items would be eliminated during a screening
process prior to calibration, as discussed above) can be reduced by not pushing the design to the
extreme of using a single common item. Interestingly, the finding that the characteristics of
anchor items are immaterial under these assumptions (and for the purpose of accurate parameter
estimation) is not in conflict with recommendations arising from results reported by Lopez
Rivas, Stark, and Chernyshenko (2009) on how the power of DIF detection varies as a function
of the discrimination level of common items in anchor-item designs: Lopez Rivas et al. (see their
Table 7) found that the proportion of correctly identified DIF items improved slightly if a single
common item of high discrimination was used, as compared to the results obtained when the sin-
gle common item had low discrimination. When there were three or five common items, it turned
out that a mixture of discrimination levels across common items yielded similar or better DIF
detection than homogeneity of discrimination levels across common items, whether they were
high or low. These differences in the power of DIF detection as a function of the characteristics
of common items also vanished as the number of respondents per group increased. It should be
noted that Lopez Rivas et al. did not assess the accuracy with which item parameters could be
estimated (which was the goal of the present article). However, their results do not contradict
the guidelines of the present study for the configuration of anchor-item designs: When DIF
detection and accurate parameter estimation are both an issue, the use of a single common
item should perhaps be avoided. At the same time, the random selection of common items
when their discrimination levels are unknown will likely diversify their discrimination levels
so as to yield better DIF detection than could be obtained by choosing common items of homo-
geneously high or homogeneously low discrimination.
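As a practical illustration of these guidelines, the following sketch (Python; the pool size, number of subtests, and number of common items are merely illustrative) assembles an anchor-item design with a small number of large subtests and a few randomly chosen common items rather than a single one, distributing the remaining unique items arbitrarily, since their characteristics were found to be immaterial for parameter recovery.

```python
import numpy as np

rng = np.random.default_rng(7)

def build_anchor_design(n_items, n_subtests, n_common):
    """Split an item pool into subtests sharing a randomly chosen anchor set.

    Returns the anchor (common) items and one array of item indices per
    subtest (the anchors plus that subtest's unique items).
    """
    items = rng.permutation(n_items)
    common = items[:n_common]                            # anchors chosen at random
    unique_blocks = np.array_split(items[n_common:], n_subtests)
    return common, [np.concatenate([common, block]) for block in unique_blocks]

# Few large subtests, a few common items (more than one, for DIF screening)
common, subtests = build_anchor_design(n_items=155, n_subtests=5, n_common=3)
print(f"{len(common)} common items; subtest sizes: {[len(s) for s in subtests]}")
```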
Declaration of Conflicting Interests
The authors declared no conflicts of interest with respect to the authorship and/or publication of this
article.
Funding
The authors disclosed receipt of the following financial support for the research and/or authorship of this
article:
A grant SEJ2005-00485 from Ministerio de Educación y Ciencia (Spain).
Note
1. Baker (1997) discusses the conditions that trigger this behavior but not the ultimate reason for its occurrence or the reasons that the parameter estimation algorithm in MULTILOG accepts disordinal boundaries as a valid solution. In any case, in the present authors' own experience, these problems occur generally when the number of respondents is small, and disappear as the number of respondents increases. (The reader can easily reproduce this feature by running MULTILOG on a data file as it increases by addition of data from further respondents.) Thus, this behavior actually reflects an idiosyncrasy of MULTILOG (Baker, 1997) and is not the manifestation of some intrinsic problem with the item itself. When these problems occurred, the entire calibration results were checked and it was confirmed that SEs for the remaining items (i.e., those not affected by this problem) were similar to those observed in calibration runs in which no item turned up with disordinal boundaries.
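A post-calibration check of the kind described in this note can be scripted. The sketch below (Python; the parameter table is hypothetical and is not parsed MULTILOG output) simply flags items whose estimated boundary parameters come out disordered.

```python
import numpy as np

# Hypothetical table of estimated GRM boundary parameters (b1..b4) per item
boundary_estimates = {
    "i001": [-1.8, -0.4, 0.6, 1.9],
    "i002": [-0.9, -1.1, 0.3, 1.5],   # disordinal: second boundary below the first
}

disordinal = [item for item, b in boundary_estimates.items()
              if np.any(np.diff(b) <= 0)]
print("Items with disordinal boundary estimates:", disordinal)
```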
References
Baker, F. B. (1992). Equating tests under the graded response model. Applied Psychological Measurement,
16, 87-96.
Baker, F. B. (1997). Estimation of graded response model parameters using MULTILOG. Applied Psychological Measurement, 21, 89-90.
Bjorner, J. B., Kosinski, M., & Ware, J. E., Jr. (2003). Calibration of an item pool for assessing the burden of headaches: An application of item response theory to the Headache Impact Test (HIT™). Quality of Life Research, 12, 913-933.
Cella, D., & Chang, C.-H. (2000). A discussion of item response theory and its applications in health status
assessment. Medical Care, 38, II-66-II-72.
Cohen, A. S., & Kim, S.-H. (1998). An investigation of linking methods under the graded response model.
Applied Psychological Measurement, 22, 116-130.
de Gruijter, D. N. M. (1988). Standard errors of item parameter estimates in incomplete designs. Applied
Psychological Measurement, 12, 109-116.
Dodd, B. G., Koch, W. R., & De Ayala, R. J. (1989). Operational characteristics of adaptive testing proce-
dures using the graded response model. Applied Psychological Measurement, 13, 129-143.
du Toit, M. (Ed.). (2003). IRT from SSI: BILOG-MG, MULTILOG, PARSCALE, TESTFACT. Lincolnwood, IL: Scientific Software International.
Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psycho-
logical Measurement, 32, 224-247.
Fletcher, R. B., & Hattie, J. A. (2004). An examination of the psychometric properties of the physical self-
description questionnaire using a polytomous item response model. Psychology of Sport and Exercise, 5,
423-446.
García-Pérez, M. A. (1999). Fitting logistic IRT models: Small wonder. Spanish Journal of Psychology, 2, 74-94. Available from http://www.ucm.es/sjp
Hanson, B. A., & Béguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26, 3-24.
Hart, D. L., Wang, Y.-C., Stratford, P. W., & Mioduski, J. E. (2008a). A computerized adaptive test for
patients with hip impairments produced valid and responsive measures of function. Archives of Physical
Medicine and Rehabilitation, 89, 2129-2139.
Hart, D. L., Wang, Y.-C., Stratford, P. W., & Mioduski, J. E. (2008b). Computerized adaptive test for
patients with knee impairments produced valid and responsive measures of function. Journal of Clinical
Epidemiology, 61, 1113-1124.
Hol, A. M., Vorst, H. C. M., & Mellenbergh, G. J. (2007). Computerized adaptive testing for polytomous
motivation items: Administration mode effects and a comparison with short forms. Applied Psycholog-
ical Measurement, 31, 412-429.
Holman, R., & Berger, M. P. F. (2001). Optimal calibration designs for tests of polytomously scored items
described by item response theory models. Journal of Educational and Behavioral Statistics, 26, 361-
380.
Kim, S.-H., & Cohen, A. S. (2002). A comparison of linking and concurrent calibration under the graded
response model. Applied Psychological Measurement, 26, 25-41.
Koch, W. R. (1983). Likert scaling using the graded response latent trait model. Applied Psychological
Measurement, 7, 15-32.
Lai, J.-S., Cella, D., Chang, C.-H., Bode, R. K., & Heinemann, A. W. (2003). Item banking to improve,
shorten and computerize self-reported fatigue: An illustration of steps to create a core item bank
from the FACIT-Fatigue Scale. Quality of Life Research, 12, 485-501.
Lopez Rivas, G. E., Stark, S., & Chernyshenko, O. S. (2009). The effects of referent item parameters on differential item functioning detection using the free baseline likelihood ratio test. Applied Psychological Measurement, 33, 251-265.
Meade, A. W., Lautenschlager, G. J., & Johnson, E. C. (2007). A Monte Carlo examination of the sensitivity
of the differential functioning of items and tests framework for tests of measurement invariance with
Likert data. Applied Psychological Measurement, 31, 430-455.
Numerical Algorithms Group. (1999). NAG Fortran library manual, Mark 19. Oxford, UK: Author.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory:
Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566.
Reise, S. P., & Yu, J. (1990). Parameter recovery in the graded response model using MULTILOG. Journal of Educational Measurement, 27, 133-144.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika
Monograph Supplement, 34(4, Pt. 2, No. 17), 100-114.
Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Hand-
book of modern item response theory (pp. 85-100). New York, NY: Springer.
Singh, J., Howell, R. D., & Rhoads, G. K. (1990). Adaptive designs for Likert-type data: An approach for
implementing marketing surveys. Journal of Marketing Research, 27, 304-321.
Singh, J., Rhoads, G. K., & Howell, R. D. (1992). Adapting marketing surveys to individual respondents:
An approach using item information functions. Journal of the Market Research Society, 34, 125-147.
Uttaro, T., & Lehman, A. (1999). Graded response modeling of the Quality of Life Interview. Evaluation
and Program Planning, 22, 41-52.
Vale, C. D. (1986). Linking item parameters onto a common scale. Applied Psychological Measurement,
10, 333-344.
Ware, J. E., Jr., Bjorner, J. B., & Kosinski, M. (2000). Practical implications of item response theory and
computerized adaptive testing: A brief summary of ongoing studies of widely used headache impact
scales. Medical Care, 38, II-73-II-82.
Ware, J. E., Jr., Kosinski, M., Bjorner, J. B., Bayliss, M. S., Batenhorst, A., Dahlöf, C. G. H., . . . Dowson, A. (2003). Applications of computerized adaptive testing (CAT) to the assessment of headache impact. Quality of Life Research, 12, 935-952.
Wingersky, M. S., & Lord, F. M. (1984). An investigation of methods for reducing sampling error in certain
IRT procedures. Applied Psychological Measurement, 8, 347-364.
Woods, C. M. (2007). Ramsay curve IRT for Likert-type data. Applied Psychological Measurement, 31,
195-212.