A Comparison of Anchor-Item Designs for the Concurrent Calibration of Large Banks of Likert-Type Items

Miguel A. García-Pérez (1), Rocío Alcalá-Quintana (1), and Eduardo García-Cueto (2)
Abstract
Current interest in measuring quality of life is generating interest in the construction of com-
puterized adaptive tests (CATs) with Likert-type items. Calibration of an item bank for use
in CAT requires collecting responses to a large number of candidate items. However, the num-
ber is usually too large to administer to each subject in the calibration sample. The concurrent
anchor-item design solves this problem by splitting the items into separate subtests, with some
common items across subtests; then administering each subtest to a different sample; and finally
running estimation algorithms once on the aggregated data array, from which a substantial num-
ber of responses are then missing. Although the use of anchor-item designs is widespread, the
consequences of several configuration decisions on the accuracy of parameter estimates have
never been studied in the polytomous case. The present study addresses this question by sim-
ulation, comparing the outcomes of several alternatives on the configuration of the anchor-item
design. The factors defining variants of the anchor-item design are (a) subtest size, (b) balance of
common and unique items per subtest, (c) characteristics of the common items, and (d) criteria
for the distribution of unique items across subtests. The results of this study indicate that max-
imizing accuracy in item parameter recovery requires subtests of the largest possible number of
items and the smallest possible number of common items; the characteristics of the common
items and the criterion for distribution of unique items do not affect accuracy.
Keywords
computerized adaptive testing, item calibration, graded response model, linking, questionnaires,
simulation, anchor-item designs, health status, attitudes
(1) Universidad Complutense, Madrid, Spain
(2) Universidad de Oviedo, Oviedo, Spain

Corresponding Author:
Miguel A. García-Pérez, Departamento de Metodología, Facultad de Psicología, Universidad Complutense, Campus de Somosaguas, 28223 Madrid, Spain
Email: miguel@psi.ucm.es
Applied Psychological Measurement XX(X) 1–20
© The Author(s) 2010
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0146621609351259
http://apm.sagepub.com
Numerous inventories consisting of Likert-type items have been developed for marketing sur-
veys or to assess health status, physical functioning, or quality of life, and some of these inven-
tories are administered adaptively (Bjorner, Kosinski, & Ware, 2003; Cella & Chang, 2000;
Fletcher & Hattie, 2004; Hart, Wang, Stratford, & Mioduski, 2008a, 2008b; Lai, Cella, Chang,
Bode, & Heinemann, 2003; Singh, Howell, & Rhoads, 1990; Singh, Rhoads, & Howell, 1992;
Uttaro & Lehman, 1999; Ware, Bjorner, & Kosinski, 2000; Ware et al., 2003). The validity of
these adaptive inventories depends on the relevance of the substantive content of the items
but also on the accuracy with which the item parameters are estimated.
Computerized adaptive tests (CATs) rely on large item banks whose calibration is somewhat
problematic. In principle, the optimal choice for item calibration is the single-group design, in
which a large number of subjects respond to each of the items in the pool. However, the size
of the initial item pool for a CAT is generally too large to permit this approach. On other occa-
sions, an existing item bank is gradually updated with the addition of new items, which makes it
impossible to calibrate all items concurrently. As a result, items need to be calibrated in separate
sets. A number of linking techniques have been proposed that are intended to bring items thus
calibrated to a common scale that permits the use of arbitrary subsets of the item bank in com-
bination (see Vale, 1986).
Linking techniques for use in updating and maintenance of existing item banks use an anchor-
ing design intended to collect item responses appropriately, along with a transformation method
that brings the item parameters from separate calibrations to the common scale.
However, when a large item bank is constructed from scratch, the transformation method
becomes unnecessary because the data gathered with the anchoring design can be calibrated
in a single pass of the item parameter estimation algorithm, which guarantees the common scale
for all items. This article considers item calibration under these conditions. Vale (1986)
described a number of anchoring designs, but this analysis will be restricted to the ‘‘anchor-
item’’ design, in which the item pool is divided into small subtests, each administered to a differ-
ent group of respondents but with the characteristic that all subtests share a few common items.
This design does not reduce the total number of subjects required for the calibration sample, but
it certainly reduces the burden on each subject. The resultant data array is sparse because all sub-
jects in the calibration sample respond only to the common items, whereas only small groups of
subjects respond to the remaining subsets of (unique) items.
The choice of an anchor-item design for the calibration of a large pool of items requires deci-
sions about (a) the number of items per subtest, (b) the relative numbers of common and unique
items in each subtest, (c) the choice of common items, and (d) the distribution of unique items
across subtests. The choice of common items and the distribution of unique items across subtests
seems critical: The accuracy of parameter estimates may depend, for instance, on whether the
common items have particular characteristics such as high or low discrimination or whether
the unique items are distributed so as to make the different subtests roughly parallel or hetero-
geneous. The goal of this article is to compare the accuracy with which the parameters of Likert-
type items can be recovered by anchor-item designs varying as to the above-mentioned factors.
The intended context of application is the development of a CAT item bank from scratch and,
therefore, the calibration of a large item pool, with no prior knowledge of the item parameters
or the trait levels of the subjects in the (large) calibration sample. The characteristics of this con-
text preclude the use of approaches based on the theory of optimal designs (e.g., Holman &
Berger, 2001).
Various studies have addressed some of these questions previously (e.g., Cohen & Kim, 1998;
de Gruijter, 1988; Hanson & Béguin, 2002; Kim & Cohen, 2002; Vale, 1986; Wingersky & Lord,
1984), but mostly in the context of bank maintenance (which involves a subsequent linking step)
and for dichotomous items. No detailed simulation study appears to have been published that
addresses all of these questions systematically in the context of concurrent calibration of a large
pool of Likert-type items. This research aimed at filling this gap.
Method
The present study used Samejima’s (1969, 1997) graded response model for Likert-type items
with K = 5 ordered response categories. To introduce the present notation, recall that the probability p_jk that the ith subject, whose trait level is θ_i, responds in category k (with 1 ≤ k ≤ K) on item j is given by

p_{jk} = \begin{cases}
1 - \dfrac{1}{1 + \exp[-a_j(\theta_i - b_{j,1})]} & \text{if } k = 1 \\[1.5ex]
\dfrac{1}{1 + \exp[-a_j(\theta_i - b_{j,k-1})]} - \dfrac{1}{1 + \exp[-a_j(\theta_i - b_{j,k})]} & \text{if } 1 < k < K \\[1.5ex]
\dfrac{1}{1 + \exp[-a_j(\theta_i - b_{j,K-1})]} & \text{if } k = K
\end{cases}    (1)

where a_j is the item discrimination parameter and b_{j,k} is the boundary between categories k and k + 1 (for 1 ≤ k ≤ K − 1).
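For concreteness, the category probabilities in Equation 1 can be evaluated directly from the cumulative logistic form of the model. The short Python sketch below is only an illustration of Equation 1 (the study itself relied on NAG routines and MULTILOG for generation and calibration); the function name and the example values are hypothetical.

    import numpy as np

    def grm_category_probs(theta, a, b):
        # theta: trait level of one respondent; a: item discrimination;
        # b: the K-1 ordered boundary parameters b_{j,1} < ... < b_{j,K-1}.
        b = np.asarray(b, dtype=float)
        # Cumulative probabilities P(X >= k) for k = 2, ..., K (Equation 1 terms)
        p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        # Prepend P(X >= 1) = 1 and append P(X >= K+1) = 0, then difference
        upper = np.concatenate(([1.0], p_star, [0.0]))
        return upper[:-1] - upper[1:]          # p_1, ..., p_K; sums to 1

    # Example: a K = 5 item with a = 1.8 and boundaries (-2.0, -0.7, 0.5, 1.9)
    print(grm_category_probs(theta=0.3, a=1.8, b=[-2.0, -0.7, 0.5, 1.9]))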
Item and Subject Parameters
The question that this research addresses was empirically motivated by the development of an
adaptive inventory for the assessment of perceived health, for which the initial pool consisted
of 155 Likert-type items with K = 5 response categories, and where a calibration sample of 1,000+ subjects was available. Therefore, for the present simulation study, parameters were generated for a pool of n = 155 items and a total sample of N = 1,120 respondents. The size of the
initial pool of items may be regarded as sufficiently small in some contexts for single-group
designs to be feasible (e.g., with cognitive items administered to student populations, or with
noncognitive items administered to healthy and motivated respondents with no time limitations).
However, the present application targets elderly and ill respondents who cannot reasonably be
confronted with 155 items. Furthermore, the subjects in this calibration sample were patients
at the offices of general-practice physicians and they were to take the subtest while they waited
for their appointment, which further limited the time available for responding (and, therefore, the
number of items that could be administered). Nevertheless, there seems to be no reason why the
present results would not generalize to cases in which the initial pool of items is substantially
larger.
Trait levels θ were drawn from a unit normal distribution using NAG subroutine G05DDF
(Numerical Algorithms Group, 1999), yielding the actual distribution shown in Figure 1a. As
for item parameters, the literature does not provide clear indications as to how the parameters
of Likert-type items with K = 5 are empirically distributed. Parameters have occasionally
been reported for sets of between 5 and 36 items (see Cohen & Kim, 1998; Emons, 2008;
Hol, Vorst, & Mellenbergh, 2007; Kim & Cohen, 2002; Koch, 1983; Reise, Widaman, &
Pugh, 1993; Singh et al., 1990), but these data do not provide sufficient information for the arti-
ficial generation of realistic parameters for 155 items. Nevertheless, analyses of all of these
reported sets of empirical item parameters invariably indicate that (a) the separation between
consecutive boundary parameters varies within and across items and (b) there is a negative cor-
relation between the discrimination parameter a and the separation between outer boundary
parameters (i.e., the difference b_{j,K−1} − b_{j,1}). In other words, items with high discrimination
tend to have their boundary parameters closely spaced, whereas these boundary parameters
are more widely spread out in items with low discrimination.
In contrast, earlier simulation studies generated item parameters lacking one or both of these
characteristics. For instance, Dodd, Koch, and De Ayala (1989) used item discriminations that
varied systematically from 0.90 to 2.15 in steps of 0.05. These discrimination levels were ran-
domly assigned to items whose boundary parameter characteristics were only described to reflect
‘‘those typically obtained from calibration of real graded response ability test data.’’ Baker
(1992) generated boundary parameters for each item from a unit-normal distribution, and their
outer boundary separation was uncorrelated with a discrimination parameter generated from
a uniform distribution from 1.34 to 2.65, a strategy guaranteeing only that the separation between
consecutive boundary parameters is not constant. Woods (2007) used a slightly different
approach that also generated discrimination parameters that were uncorrelated with the separa-
tion between outer boundary parameters, although the distance between consecutive boundary
parameters was still random within and across items. Finally, Meade, Lautenschlager, and John-
son (2007) generated discrimination parameters to be normally distributed (M = 1.25 and SD = 0.07). These were uncorrelated with boundary parameters generated such that b_{j,1} for each item was drawn from a normal distribution (M = −1.7 and SD = 0.45), but b_{j,k+1} = b_{j,k} + 1.2 for all 1 ≤ k ≤ K − 2 in each item, unrealistically implying that the distance between consecutive boundary parameters in each item was a constant 1.2 for all items.
To reproduce the empirical characteristics discussed above (i.e., a negative relationship
between discrimination and separation between outer boundary parameters, coupled with
unequal distances between consecutive boundary parameters within and across items), the pres-
ent study uses a different strategy. In particular, item discrimination parameters were generated
through NAG subroutine G05DAF to be uniformly distributed on the interval [1, 3], and bound-
ary parameters were also generated to be uniformly distributed on an interval whose range varied
with the particular boundary to be considered and also with the discrimination parameter for the
item of concern. Specifically, b_{j,1} was unconstrained and ranged between −3 and −1, but b_{j,k} (for 2 ≤ k ≤ 4) ranged between b_{j,k−1} + (6 − 0.7a_j)/12 and b_{j,k−1} + (6 − 0.7a_j)/3. Thus,
any given boundary parameter beyond the first one for an item was randomly located above
the preceding boundary parameter within a range whose breadth was a decreasing function of
the item discrimination parameter a. The constants in the expressions defining the limits of these
ranges were chosen arbitrarily so as to produce reasonable boundary locations and separations
between the outer boundaries for each item. This approach produced the desired negative
correlation between discrimination and separation of outer boundaries. This is evident from the outcomes when this approach was used for the generation of item parameters for the 155 items (see Figure 1b); note also that it guarantees compliance with the order restriction b_{j,k} < b_{j,k+1} for all 1 ≤ k ≤ 3.

Figure 1. Distribution of trait levels in the simulated sample (a) and plot of the separation b_{j,4} − b_{j,1} against a_j for each of the 155 items in the pool (b).
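A minimal sketch of this generation scheme, under the constants stated above (trait levels unit normal; a_j uniform on [1, 3]; b_{j,1} uniform on [−3, −1]; each further boundary placed above the preceding one by an amount uniform on [(6 − 0.7a_j)/12, (6 − 0.7a_j)/3]). NumPy's generator stands in for the NAG subroutines G05DDF and G05DAF used in the study, and the function name is illustrative.

    import numpy as np

    rng = np.random.default_rng(2010)   # stand-in for the NAG generators

    def generate_pool(n_items=155, n_subjects=1120, n_cats=5):
        theta = rng.standard_normal(n_subjects)           # unit-normal trait levels
        a = rng.uniform(1.0, 3.0, size=n_items)           # discriminations on [1, 3]
        b = np.empty((n_items, n_cats - 1))
        b[:, 0] = rng.uniform(-3.0, -1.0, size=n_items)   # first boundary on [-3, -1]
        for k in range(1, n_cats - 1):
            # Gap above the previous boundary shrinks as a_j grows, which yields
            # the negative relation between a and the outer-boundary separation
            b[:, k] = b[:, k - 1] + rng.uniform((6.0 - 0.7 * a) / 12.0,
                                                (6.0 - 0.7 * a) / 3.0)
        return theta, a, b

    theta, a, b = generate_pool()
    print(np.corrcoef(a, b[:, -1] - b[:, 0])[0, 1])   # negative, as in Figure 1b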
Calibration Designs to Be Compared
Twenty-four anchor-item designs were considered, varying according to the four factors mentioned
in the introduction. The first factor, number of items per subtest, had two levels: 15 and 20 items.
The second factor, the relative numbers of common and unique items, had three levels per size of
subtest. Thus, for 15-item subtests, the number of common items was 5, 8, or 10, respectively,
requiring 10, 7, and 5 additional unique items per subtest; for 20-item subtests, the number of com-
mon items was 5, 11, or 15, respectively, requiring 15, 9, and 5 additional unique items per subtest.
Each of these designs, in turn, required a different number of subtests and, hence, groups of sub-
jects of different size (subject to the constraint that the total calibration sample consists of 1,120
respondents). Table 1 gives a summary of these requirements. At the same time, the resultant
data array was generally sparse, as illustrated in the upper panel of Figure 2, showing the case
of 20-item subtests with five common items: Of the 1,120 × 155 = 173,600 cells in the data array, only 1,120 × 20 = 22,400 contain data, arising from 1,120 responses to each of the
five common items, plus 112 responses to each of 10 sets of 15 unique items.
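The bookkeeping behind Table 1 and Figure 2 follows directly from the pool size, the total sample, and the chosen subtest composition. A small sketch of that arithmetic (function name illustrative; group sizes are rounded down, so some groups have one extra respondent when the division is not exact):

    def anchor_design(n_items=155, n_total=1120, subtest_size=20, n_common=5):
        n_unique = subtest_size - n_common                 # unique items per subtest
        n_groups = -(-(n_items - n_common) // n_unique)    # ceiling division
        group_size = n_total // n_groups                   # respondents per group
        filled = n_total * subtest_size                    # cells containing data
        total = n_total * n_items                          # cells in the data array
        return n_groups, group_size, filled, total

    # 20-item subtests with 5 common items: 10 groups of 112 respondents, and only
    # 22,400 of the 173,600 cells contain responses (cf. Figure 2, upper panel)
    print(anchor_design())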
Table 1. Features of Each of the Six Major Anchor-Item Configurations

NI    NC    NU    NG    NS
20     5    15    10    112
20    11     9    16    70
20    15     5    28    40
15     5    10    15    74-75
15     8     7    21    53-54
15    10     5    29    38-39

Note: NI = number of items per subtest; NC = number of common items; NU = number of unique items; NG = number of resultant subtests (and groups of subjects); NS = number of subjects per group.

The criterion for the choice of common items and the distribution of unique items acted as a single factor with four levels. In the absence of any prior knowledge of item parameters, common items can only be chosen at random (or perhaps because of their content, but certainly not because of their parameters), and unique items can only be distributed at random. This is one of the levels of this factor, and it is the only empirical option available when no preliminary study has been carried out to obtain item parameter estimates. In that case, the resultant subtests will be heterogeneous. However, choosing common items and distributing unique items according to their parameters might be expected to yield more accurate estimates. The true item parameters are not available in the practical application considered here (although they will be roughly known in other applications where the items have been pretested), but they are known in the present simulation study. Capitalizing on this knowledge, three additional strategies were defined for the choice of common items, and a single additional strategy for the distribution of unique items. The three selection criteria for the choice of common items were (a) items with the highest discrimination and the largest separation between their outer boundary parameters, (b) items with the lowest discrimination and the smallest separation between their outer boundary parameters, and (c) items that were heterogeneous in both respects. The only criterion for the distribution of unique items in these three cases was to maximize the homogeneity of the resultant subtests by placing in them items that are comparable as to discrimination and separation between their outer boundaries. Figure 3 illustrates what the four levels of this factor represent for the assembly of 20-item subtests with 15 common items. Solid circles in each panel indicate the set of 15 common items selected with the applicable criterion, and each of the five remaining groups of 28 identical symbols (open circles, open squares, open diamonds, open inverted triangles, and upright gray triangles) represents the bin from which unique items were selected for inclusion in the 28 subtests required by this design. The five unique items for a given subtest were selected by picking at random (and without replacement) one item from each of these five bins.
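Under the random level of this factor (the only option available in practice when nothing is known about the items), assembling the subtests amounts to sampling the common items and partitioning the remaining items at random. A sketch, with illustrative names:

    import numpy as np

    rng = np.random.default_rng(0)

    def assemble_subtests(n_items=155, n_common=5, n_unique=15):
        order = rng.permutation(n_items)               # random order of item indices
        common = np.sort(order[:n_common])             # items shared by every subtest
        remaining = order[n_common:]
        n_groups = int(np.ceil(remaining.size / n_unique))
        return common, [np.concatenate((common, chunk))
                        for chunk in np.array_split(remaining, n_groups)]

    common, subtests = assemble_subtests()
    print(len(subtests), [len(s) for s in subtests])   # 10 subtests of 20 items each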
To determine how the accuracy of these item parameter estimates compared with those result-
ing from single-group designs involving similar numbers of respondents, the present study also
included single-group designs in which all subjects responded to all 155 items. Because the pres-
ent set of anchor-item design configurations involved a different number of respondents per
unique item (given that all subjects respond to the common items; see Table 1), in this study sin-
gle-group designs with 39, 40, 53, 70, 75, 112, and 1,120 subjects were included. The lower panel
in Figure 2 illustrates the data array for the single-group design matched to the anchor-item
design in the upper panel. In this case, the array had no missing data and consisted of 112 ×
155 = 17,360 cells with the same number of responses to Items 6 to 155 in the single-group
and the anchor-item designs, although the latter collected 1,120 responses to the five common items compared to only 112 responses in the matched single-group design.

Figure 2. Sketch of the data array for a 20-item anchor-item design with five common items (upper panel) and sketch of the data array for a comparison single-group design in which the total number of subjects equals the number of subjects per group in the anchor-item design (lower panel).
Simulation Approach
The simulation approach for each of the 24 anchor-item designs and the seven single-group
designs was identical, except that the number of subjects (or the items to which the subjects
responded) varied as dictated by the particular design under consideration. Custom software
was written for this purpose. To simulate the response of subject i to item j, numerical values
for the probabilities under the multinomial distribution in Equation 1 were determined by insert-
ing into the expressions the particular trait level of the subject and the parameters of the item.
Then, the resultant multinomial experiment was simulated through NAG subroutine G05DAF,
and the outcome (ranging from 1 to 5 and representing the ordered category of the response)
was recorded in the applicable data array. The NAG subroutine that simulates the multinomial
experiment does only what could have been programmed from scratch: The set of five probabil-
ities arising from Equation 1 for item j and subject i partitioned the segment [0, 1] into five adja-
cent and nonoverlapping regions labeled from 1 to 5, each of which had length p_jk (for 1 ≤ k ≤ 5); next, the simulated response was given directly by the label of the region in which a uniformly
distributed random number fell. Missing responses to items not administered in the anchor-item
design under consideration were differently coded in the data file so as to be identifiable by the calibration software.

Figure 3. Illustration of several criteria for the choice of common items (solid circles) and the grouping of unique items (subsets of symbols of different shapes) for distribution across subtests. Panels: (a) high discrimination and large separation, (b) low discrimination and small separation, (c) heterogeneous, (d) random.
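The response-generation step just described — partitioning [0, 1] into five adjacent regions of lengths p_j1, ..., p_jK and reading off the region into which a uniform random number falls — can be sketched as follows (a stand-in for the NAG subroutine used in the study; names are illustrative):

    import numpy as np

    rng = np.random.default_rng(7)

    def simulate_response(theta_i, a_j, b_j):
        # Category probabilities of Equation 1 for subject i on item j
        b_j = np.asarray(b_j, dtype=float)
        p_star = 1.0 / (1.0 + np.exp(-a_j * (theta_i - b_j)))      # P(X >= 2..K)
        probs = -np.diff(np.concatenate(([1.0], p_star, [0.0])))   # p_1, ..., p_K
        # Partition [0, 1] into adjacent regions of these lengths and locate
        # a uniformly distributed random number within them
        cuts = np.cumsum(probs)
        k = int(np.searchsorted(cuts, rng.uniform()))
        return min(k, probs.size - 1) + 1        # ordered category, 1 to K

    # Items not administered in a given design would be stored with a distinct
    # missing-data code so that the calibration software can identify them.
    print(simulate_response(theta_i=0.3, a_j=1.8, b_j=[-2.0, -0.7, 0.5, 1.9]))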
Parameter Estimation
The data array representing the simulation of each design was input to MULTILOG 7.03 (du Toit,
2003) to obtain estimates of the item parameters using marginal maximum likelihood methods.
Default values were used for the remaining options. Estimates â, b̂_1, b̂_2, b̂_3, and b̂_4 of the discrimination and boundary parameters of each item were thus obtained. Each individual MULTILOG run was checked for convergence, and no failure was observed. Also, evidence of prior dominance was sought by checking for unduly concentrated parameter estimates around their prior (vs. true) locations but, again, no signs of this were observed.
Criteria for the Comparison
Because the main goal of this study was to compare the accuracy with which each design recov-
ered the item parameters, four complementary criteria were used. The first criterion involves
a scatterplot and the accompanying product–moment correlation between true and estimated
parameters, separately computed for each of the five item parameters. The second criterion,
also separately computed for each parameter, is the root mean square error (RMSE), defined as
RMSE_{\omega} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (\hat{\omega}_j - \omega_j)^2}    (2)

for each ω ∈ {a, b_1, b_2, b_3, b_4}. The third criterion is a global measure for each item, defined as
the mean Euclidean distance (MED) between the point in five-dimensional space at the location
of the true item parameters and the point at the location of their estimates, that is,
MED = \frac{1}{n} \sum_{j=1}^{n} \sqrt{(\hat{a}_j - a_j)^2 + \sum_{k=1}^{4} (\hat{b}_{j,k} - b_{j,k})^2}    (3)
Finally, and because MULTILOG occasionally seems unable to estimate the parameters of some items, the fourth criterion was a mere count of the items whose parameters could not be estimated. This behavior of MULTILOG has been described earlier (Baker, 1997) and manifests itself in estimates of the boundary parameters that do not satisfy the order restriction b̂_{j,k} < b̂_{j,k+1} for all 1 ≤ k ≤ 3. Naturally, when the parameters of some of the items could not be estimated,¹ Equations 2 and 3 for the computation of RMSE and MED did not use n = 155 but only the number of items actually involved. The same was true for the computation of product–moment correlations.
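Once true and estimated parameters are collected into arrays, Equations 2 and 3 reduce to a few lines, and a count of items with disordinal boundary estimates covers the fourth criterion. A sketch with illustrative names (items whose parameters could not be estimated would be dropped from the arrays beforehand, as described above):

    import numpy as np

    def rmse(est, true):
        # Equation 2, for one parameter (a or one of the b_k) across items
        est, true = np.asarray(est), np.asarray(true)
        return np.sqrt(np.mean((est - true) ** 2))

    def med(a_est, b_est, a_true, b_true):
        # Equation 3: mean Euclidean distance in the five-dimensional space
        # spanned by (a, b_1, b_2, b_3, b_4); b arrays have shape (n_items, 4)
        d2 = (np.asarray(a_est) - np.asarray(a_true)) ** 2 + \
             np.sum((np.asarray(b_est) - np.asarray(b_true)) ** 2, axis=1)
        return np.mean(np.sqrt(d2))

    def n_disordinal(b_est):
        # Fourth criterion: items whose estimated boundaries violate
        # b_hat_{j,k} < b_hat_{j,k+1}
        return int(np.sum(np.any(np.diff(np.asarray(b_est), axis=1) <= 0, axis=1)))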
Results
Single-Group Designs
Figure 4 shows scatterplots of true parameters and their estimates in four representative single-
group designs involving the number of subjects indicated on the right of each row. Left to right,
the columns correspond to a, b_1, b_2, b_3, and b_4 (see the labels at the bottom). Also given in each
panel are the product–moment correlation and the value of RMSE for the corresponding param-
eter. The global value of MED is given on the right of each row, where the number of items
whose parameters could not be estimated is also given in parentheses.
A quick visual comparison of the four rows in Figure 4 corroborates the well-known fact that
accuracy in parameter estimation decreases as the number of respondents decreases (Reise & Yu,
1990): The scatter of data increases and the correlation decreases. The summary measures RMSE
and MED printed in Figure 4 are better compared graphically in Figure 5.

Figure 4. Scatterplots of true and estimated parameters from single-group designs involving different numbers of respondents. Per-panel correlations and RMSEs, with MED and (in parentheses) the number of items whose parameters could not be estimated for each row:
1,120 subjects: a, r = .989, RMSE = .478; b_1, r = .985, RMSE = .227; b_2, r = .997, RMSE = .366; b_3, r = .998, RMSE = .613; b_4, r = .997, RMSE = .881; MED = 1.216 (0).
112 subjects: a, r = .875, RMSE = .424; b_1, r = .828, RMSE = .598; b_2, r = .974, RMSE = .444; b_3, r = .979, RMSE = .642; b_4, r = .973, RMSE = .856; MED = 1.281 (0).
70 subjects: a, r = .817, RMSE = .404; b_1, r = .745, RMSE = .760; b_2, r = .941, RMSE = .464; b_3, r = .968, RMSE = .543; b_4, r = .808, RMSE = 1.082; MED = 1.289 (4).
40 subjects: a, r = .697, RMSE = .603; b_1, r = .676, RMSE = 1.028; b_2, r = .898, RMSE = .528; b_3, r = .955, RMSE = .534; b_4, r = .747, RMSE = 1.197; MED = 1.468 (9).

Figure 5. Summary RMSE (bars; left to right, the bars pertain to parameters a, b_1, b_2, b_3, and b_4) and MED measures (horizontal lines and numerals) for single-group designs involving different numbers of respondents. RMSE = root mean square error; MED = mean Euclidean distance.

Each of the four single-
group designs is represented in Figure 5 as a separate block, with the total number of respondents
decreasing left to right (see the bottom labels, where the number of items whose parameters
could not be estimated is also given in parentheses). The height of the black bar in each block
indicates the value of RMSE for the discrimination parameter a, whereas the height of the white
bars indicates, left to right, the value of RMSE for the boundary parameters b_1, b_2, b_3, and b_4. The
height of the horizontal segment above each block gives the global value of MED. The deteri-
oration also shows in that both RMSE and MED generally increase as the number of respondents
decreases, and the number of items whose parameters cannot be estimated also increases as the
number of respondents decreases.
It is worth noting that in Figure 5 MED and, especially, RMSE do not seem to capture very
accurately the actual deterioration that the scatterplots and correlations in Figure 4 reveal. For
instance, the scatter of data for the discrimination parameter is clearly seen in the left column
of Figure 4 to increase downwards (i.e., as the size of the sample of respondents decreases),
and the correlation decreases accordingly. The increase in scatter is substantial from the top
panel (for 1,120 respondents) to the panel immediately underneath (for 112 respondents), and
the correlation also decreases from .989 to .875. Yet, against all reasonable expectations,
RMSE actually drops slightly from .478 to .424 (instead of increasing sizably) when sample
size decreases from 1,120 to 112 respondents. In other words, the deterioration is actually there
as revealed by the scatterplots and correlations, but RMSE and MED do not seem to capture it
properly.
Another characteristic that is worth commenting on is that data points in the panels of Figure 4
do not meander around the diagonal identity line. This is particularly evident in the top row,
where the scatter of data is minimal but regression lines would not have a unit slope: The slope
is clearly less than unity for the discrimination parameter (left panel) and it is greater than unity
for all boundary parameters. These characteristics reveal that
MULTILOG recovers parameters
accurately given a sufficiently large number of respondents (as indicated by the tightness of
data and the high correlations in the top row of Figure 4) but it does so under a metric that differs
from that of the true parameters (which produces the non-unit slope of the elongated cloud of
data). This is reminiscent of the need for linking methods in separate-group calibration designs,
but these methods are not applicable in practice because the anchor points could only be provided
by a few items whose true parameters were known. In practice, this is not a problem because the
metric of item parameters is immaterial as long as it is common to all items, but in the present
situation, it explains the failure of RMSE and MED to capture the accuracy with which param-
eters can be estimated (i.e., the amount of scatter in plots like those in Figure 4).
RMSE measures the discrepancy between true and estimated parameters by way of the alge-
braic difference between them, and the same is true for MED. In graphs like those in Figure 4,
RMSE thus increases as the vertical distance between data points and the diagonal identity line
increases. For the discrimination parameter in the leftmost panel of the top row in Figure 4,
where the scatter of data is minimal, these distances are large as a result of the different metric
of true and estimated parameters, yielding an RMSE of .478; for the first boundary parameter
(see the second panel in the first row of Figure 4), the scatter of data is similarly small. However,
the data points lie closer to the identity line, yielding an RMSE of only .227; for the last boundary
parameter (see the rightmost panel in the top row of Figure 4), the scatter is again similar but the
data points lie farther from the diagonal line, yielding a much larger RMSE valued at .881. Con-
sider now the second row in Figure 4, where the scatter of data is substantially larger, with the
consequence that some of the data points lie closer to the identity line. This produces spurious
RMSE values that are generally similar to those in the first row (with the exception of the first
boundary parameter, for which data were around the diagonal line in the top row), despite the
clearly inferior estimation accuracy.
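The point is easy to reproduce with synthetic numbers (not the study's data): nearly noise-free estimates that sit on a different metric can yield a larger RMSE than noisier estimates that happen to straddle the identity line.

    import numpy as np

    rng = np.random.default_rng(1)
    true = rng.uniform(1.0, 3.0, size=155)   # true discriminations of a 155-item pool

    # Nearly noise-free estimates expressed on a compressed metric ...
    rescaled = 0.7 * true + 0.1 + rng.normal(0.0, 0.02, size=155)
    # ... versus unbiased but much noisier estimates on the true metric
    noisy = true + rng.normal(0.0, 0.30, size=155)

    for label, est in (("rescaled", rescaled), ("noisy", noisy)):
        r = np.corrcoef(true, est)[0, 1]
        value = np.sqrt(np.mean((est - true) ** 2))
        print(f"{label}: r = {r:.3f}, RMSE = {value:.3f}")
    # The rescaled estimates give the higher correlation yet also the higher RMSE,
    # mirroring the pattern seen across the top two rows of Figure 4.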
To confirm that the RMSE and MED measures reported in Figure 5 are indeed affected by this
problem, 50 independent replications of these single-group designs were run and the results of
each individual replication were plotted in the form of Figures 4 and 5. The additional replica-
tions rendered thoroughly analogous results: Differences in metric and variations in estimation
accuracy with sample size paralleled those observed in Figure 4, and RMSE and MED measures
continued to display the contaminated trends shown in Figure 5. To illustrate, Figure 6 shows box
plots of the distribution of MED and RMSE (for each parameter) across replications, using the
same graphical format used in Figure 5 to report the results of a single replicate. Note in the upper
panel of Figure 6 that the distribution of MED values across replications increases as the number
of respondents in the sample decreases, in much the same form as was reported in Figure 5 for
a single replicate; at the same time, distributions of RMSE for each of the five item parameters
(arranged within each block in Figure 6 in the same order as in the blocks of Figure 5) also show
that the distributions of RMSE values are highly similar for all parameters except b_1 when the sample consisted of either 1,120 or 112 respondents, paralleling the results reported in Figure
5 for a single replicate.
In sum, RMSE and MED do not faithfully portray the accuracy with which parameters can be
estimated when the metric of true and estimated parameters differs. In these situations,
scatterplots and correlations clearly have the last word.

Figure 6. Box plots (minimum, first quartile, second quartile, third quartile, maximum) of the distribution of MED (top panel) and RMSE (bottom panel; left to right, the box plots pertain to parameters a, b_1, b_2, b_3, and b_4) across replications of single-group designs involving different numbers of respondents. RMSE = root mean square error; MED = mean Euclidean distance.

It should nevertheless be stressed that
these discrepancies may occur only in extreme conditions such as those depicted in the top
row of Figure 4, where differences in metric overwhelm accuracy as determined by the scatter
of data; indeed, RMSE and MED measures seem to provide valid summaries for the comparison
of accuracy across the three bottom rows of Figure 4. The present study will continue to use
RMSE and MED despite their occasional misbehavior because these are typical summary meas-
ures in simulation studies, but it must be emphasized that all of the conclusions are primarily
based on an analysis of the raw results (scatterplots and correlations) and not only on the sum-
maries provided by RMSE and MED.
It is useful to recall that these single-group designs were included in the present study to set
a reference against which the outcomes of anchor-item designs would be compared. In practice,
it will often be infeasible to have all subjects (regardless of their number) respond to all of the
items in the initial pool. Nevertheless, it is interesting to have some sense of how the outcomes of
a given anchor-item design compare to the (infeasible) single-group design involving the same
number of respondents as in each of the groups participating in the anchor-item design. It is inter-
esting to note that the accuracy with which the parameters of Likert-type items can be estimated
varies greatly with the number of respondents (Reise & Yu, 1990); however, further research
should be conducted to establish whether the potentially inferior accuracy provided by an
anchor-item design is caused by the fact that calibration involves a single run on a very sparse
data array, or the fact that different groups of subjects respond to different sets of unique items, or
is only a consequence of the small size of the subgroups into which the total calibration sample is
divided.
Twenty-Item Anchor-Item Designs
The outcomes of anchor-item designs were analyzed as described above for single-group
designs. However, scatterplots and correlations did not differ meaningfully across variations
in the criteria used for the selection of common items and the distribution of unique items,
and these plots are thus omitted. Nevertheless, they are available from the corresponding author
on request. In any case, the study confirmed that the location of data points relative to the identity
line in the scatterplots were similar in all compared conditions. The summary RMSE and MED
measures, therefore, were not contaminated by the differences in metric discussed in the preced-
ing section. Figure 7 shows summary plots of RMSE and MED for each of the 12 anchor-item
designs involving 20-item subtests, and for the three comparison single-group designs involving
a total number of subjects identical to that in each of the subgroups of the corresponding anchor-
item design. Thus, Figure 7a shows results for the case of 5 common and 15 unique items per
subtest (10 groups of 112 subjects), Figure 7b shows results for the case of 11 common and 9
unique items (16 groups of 70 subjects), and Figure 7c shows results for the case of 15 common
and 5 unique items (28 groups of 40 subjects). The leftmost block in each part of Figure 7 gives
results for the comparison single-group design (containing responses to all 155 items by single
groups of sizes 112, 70, and 40, respectively, for Figures 7a, 7b, and 7c; these blocks are merely
replotted from Figure 5), and the remaining blocks give results under the four different criteria
for selection of common items and distribution of unique items.
Three characteristics of the patterns of RMSE and MED displayed in Figure 7 are worth not-
ing. First, accuracy deteriorates as the number of common items increases and, consequently, the
number of unique items decreases. This is perhaps because the size of the subgroups of respond-
ents also decreases and, thus, the number of responses to the unique items decreases. Second,
results for matched single-group designs (leftmost block in each panel of Figure 7) and alterna-
tive anchor-item designs (four blocks on the right of each panel of Figure 7) reveal that the
matched single-group design renders slightly less accurate parameter estimates than any of the
anchor-item designs with which it can be compared (this was further confirmed by inspection of
scatterplots and correlations). Third, no meaningful differences in accuracy can be observed
across anchor-item designs varying only as to criteria for the selection of common items and
the distribution of unique items. This has the important practical implication that common items
can be chosen randomly and unique items can be distributed randomly across subtests with no
consequence to the accuracy with which item parameters can be estimated, despite the hetero-
geneity of the resultant subtests. Finally, all anchor-item designs varying only as to criteria
for the selection of common items and distribution of unique items yield similar numbers of
items whose parameters cannot be estimated; and these numbers are generally no larger than
those resulting from the matched single-group design. In all the scatterplots, the data points per-
taining to common items were always right on or very close to the putative regression line (which
was not the diagonal line because of the metric differences discussed above), suggesting that the
large number of responses collected for them actually contributed to the accuracy with which their parameters could be estimated.

Figure 7. Summary RMSE (bars; left to right, the bars pertain to parameters a, b_1, b_2, b_3, and b_4) and MED measures (horizontal lines and numerals) for the set of 12 anchor-item designs involving 20-item subtests and the comparison single-group designs. MED values, with the number of items whose parameters could not be estimated in parentheses (HD-LS = common items with the highest discrimination and largest separation; LD-SS = common items with the lowest discrimination and smallest separation):
(a) 5 common and 15 unique items, 10 groups of 112 subjects: 155 items 1.281 (0); HD-LS 0.825 (0); LD-SS 0.825 (1); heterogeneous 0.758 (0); random 0.801 (0).
(b) 11 common and 9 unique items, 16 groups of 70 subjects: 155 items 1.289 (4); HD-LS 0.904 (4); LD-SS 0.878 (3); heterogeneous 0.792 (3); random 0.866 (3).
(c) 15 common and 5 unique items, 28 groups of 40 subjects: 155 items 1.468 (9); HD-LS 1.082 (4); LD-SS 1.208 (5); heterogeneous 1.147 (4); random 1.112 (6).
RMSE = root mean square error; MED = mean Euclidean distance.
It seems clear, then, that anchor-item designs with five common items are optimal among
these 20-item designs, and that these designs outperform their matched single-group design.
Nevertheless, they should not outperform a single-group design involving 1,120 respondents.
Yet these 20-item anchor-item designs appear to provide more accurate parameter estimates
in terms of RMSE and MED than the single-group design in which 1,120 subjects respond to
all 155 items (compare with the leftmost block in Figure 5). This conclusion defies logic, but
it only reflects the inability of RMSE and MED to portray the actual accuracy with which param-
eters can be estimated when there are differences in the metric of true and estimated parameters,
something that severely inflates RMSE and MED measures in the case of the present single-
group design with 1,120 subjects as discussed above. Contrary to what RMSE and MED indicate,
scatterplots and correlations reveal that the single-group design involving 1,120 respondents
indeed provides more accurate estimates (despite the different metric) than 20-item anchor-item designs with five common items.

Figure 8. Summary RMSE (bars; left to right, the bars pertain to parameters a, b_1, b_2, b_3, and b_4) and MED measures (horizontal lines and numerals) for the set of 12 anchor-item designs involving 15-item subtests and the comparison single-group designs. MED values, with the number of items whose parameters could not be estimated in parentheses:
(a) 5 common and 10 unique items, 15 groups of 74-75 subjects: 155 items 1.179 (2); HD-LS 0.768 (2); LD-SS 0.861 (1); heterogeneous 0.799 (2); random 0.935 (1).
(b) 8 common and 7 unique items, 21 groups of 53-54 subjects: 155 items 1.371 (6); HD-LS 1.025 (0); LD-SS 1.147 (2); heterogeneous 0.947 (4); random 1.040 (3).
(c) 10 common and 5 unique items, 29 groups of 38-39 subjects: 155 items 1.454 (11); HD-LS 1.075 (5); LD-SS 1.276 (9); heterogeneous 1.201 (10); random 1.326 (12).
RMSE = root mean square error; MED = mean Euclidean distance.
Fifteen-Item Anchor-Item Designs
The outcomes of anchor-item designs involving 15-item subtests are displayed in Figure 8. The
three characteristics that were described for 20-item designs hold also for these 15-item designs,
leading to the practical conclusions that items can be distributed at random across subtests and
that maximal accuracy is obtained when the number of common items is smallest and the number
of unique items is largest. The anchor-item design having five common items and 10 unique items
is optimal among the present 15-item designs, but a comparison of Figures 7a and 8a reveals that
20-item subtests consisting of five common and 15 unique items yield more accurate estimates
than 15-item subtests with five common items. Ultimately, a small number of common items com-
bined with a large number of unique items serve the purpose of reducing the number of subtests
into which the item pool must be partitioned and, when the size of the calibration sample is fixed,
allows larger groups of subjects to respond to each subtest, which is what seems to determine the
accuracy with which item parameters are estimated. To confirm this latter point, data arrays from
the 20-item designs with five common items were reduced by randomly eliminating 37 subjects
from each group of 112 respondents. This left the same number of respondents per group as in
the comparable 15-item designs and reduced the number of responses to the five common items
from 1,120 to only 750. Compared to the initial 20-item designs with 112 respondents per group,
the accuracy with which item parameters were estimated from the trimmed 20-item designs was
similar to that described in Figure 8a for 15-item designs also involving five common items.
Clearly, the accuracy of item parameter estimates from anchor-item designs of the type considered
in this article is determined by the number of respondents per group.
Additional Anchor-Item Designs
Although subtests should be assembled with a minimum number of common items and a maxi-
mum number of unique items, the smallest number of common items in the preceding designs
was five. But what is the minimum number of common items that provides sufficiently accurate parameter estimates? To investigate this question, two further anchor-item designs were considered, one using three common items and 19 unique items (yielding eight 22-item subtests), and 896 subjects; the other used a single common item and 22 unique items (yielding seven 23-item subtests), and 784 subjects. For these additional analyses, common items were selected at random, and unique items were distributed at random among subtests. In both cases, there were 112 subjects per group so that the number of common items and the number of subjects per group were not confounded. The results of these two additional designs are presented in Figure 9 along with those from the random condition (five common items and the same number of subjects per group, replotted from the rightmost block in Figure 7a). These results reveal that a single common item was sufficient to provide accurate parameter estimates. Inspection of scatterplots and correlations confirmed the picture provided by the RMSE and MED measures. Accuracy increases minimally as the number of common items increases, but this appears to be merely a result of the increasing overall number of subjects in the calibration sample that in turn increases the number of responses to the common items.

Figure 9. Summary RMSE (bars; left to right, the bars pertain to parameters a, b_1, b_2, b_3, and b_4) and MED measures (horizontal lines and numerals) for anchor-item designs involving 23-item subtests with one common item (784 subjects; MED = 0.846, 0 items not estimated), 22-item subtests with three common items (896 subjects; MED = 0.818, 0), and 20-item subtests with five common items (1,120 subjects; MED = 0.801, 0). RMSE = root mean square error; MED = mean Euclidean distance.
Alternative Item Parameters
After this simulation study was completed, a subset of n = 125 of the actual items was pretested with a sample of N = 185 general-population respondents under a single-group design. The
resultant parameter estimates turned out to have characteristics quite different from those used
in the simulations just described, in that the negative relation between discrimination and sepa-
ration between outer boundaries was much stronger (see Figure 10). The question was whether
the relative performance of the anchor-item designs in the present study would change when item
parameters were so drastically different, and so the simulation study for 125 items were repeated
with the true parameters depicted in Figure 10 instead of the 155 items whose parameters were
depicted in Figure 1b. The simulation was carried out along the same lines and with the same
factors, but it was adapted to accommodate the case of 125 items.
Figure 10. Separation b_{j,4} − b_{j,1} against a_j for each of the 125 items for the second simulation study.
Individually, the results did change slightly but the overall pattern remained the same, leading
to the same conclusion: Parameter estimates were comparatively more accurate when the numbers of common items and subtests were smallest, and the size of each group of respondents was
largest. These converging results based on two different sets of true item parameters (one derived
from actual administration of real items) support the generalizability of these conclusions.
Discussion and Conclusion
This study compared the relative accuracy with which the parameters of Likert-type items were
recovered through anchor-item calibration designs of various configurations. When the applica-
tion was subject to the empirical constraints that the sample of respondents (however large) and
the maximal number of items in a subtest are limited, results showed that item parameters could be most accurately recovered using the smallest possible number of common items (a single common item was sufficient) and a number of unique items as large as the maximum feasible subtest length allows. The resulting number of subtests was small and, therefore, the avail-
able sample of respondents was split into the minimum number of groups of the largest possible
size. This provided a large number of responses for each unique item, which also contributed to
the accuracy with which item parameters could be estimated.
Study results also show that random selection of common items and random distribution of
unique items across subtests yielded the same estimation accuracy as did selection of common
items and distribution of unique items according to their characteristics. This conclusion arises
from the finding that the various alternative options that were used to assemble subtests produced
comparable outcomes, which seems to indicate that the true parameter values of common items are
actually immaterial. The practical importance of this result is that the subtests involved in the appli-
cation of an anchor-item design can be safely assembled with no prior information about the items
whose parameters will be estimated. The results also showed that the optimal anchor-item design
under the empirical constraints of the size of the sample of respondents and maximum length of the
subtests provided item parameter estimates that were more accurate than those that could be
obtained through a matched single-group design with the same number of respondents as in
each of the subgroups involved in the anchor-item design. This is perhaps surprising because
the anchor-item design renders a very sparse data array, but that sparse data array contains exactly
as many responses to unique items as there are in the comparable single-group design, and also
includes more responses to the common items than would be collected in the matched single-group
design (see Figure 2). This seems to suggest that the additional responses to the common items in
the anchor-item design played a substantial role in increasing estimation accuracy.
The two simulation studies considered only Likert-type items with K = 5 response categories. The numbers of items in each pool were n = 155 and n = 125, and the item pools had true
parameters with different characteristics (compare Figures 1b and 10). In view of the similarity
of the results, it is very likely that the study’s conclusions will generalize to other K, n, and N.
It is worth noting that all the results are based on simulations using unidimensional items that
measure the same dimension, where there are no items with differential item functioning (DIF), and where trait distributions are unit
normal and equal (within sampling error) across groups of respondents. The question then arises
whether these conclusions would be valid in empirical situations in which one or more of these
conditions are violated.
First, note that a failure of unidimensionality and the presence of DIF items is not more of
a problem in anchor-item designs than it is in single-group designs. The same solution can be
applied in both cases. Under a single-group design, responses to any new set of items are
used to test for unidimensionality and DIF, and DIF or otherwise deviant items are then elimi-
nated, leaving a reduced pool of homogeneous and non-DIF items that are then calibrated. The
same approach can be taken with responses collected under an anchor-item design, although the
number of common items must not be too small or there will be some risk that all of the common
items would have to be eliminated. Interestingly, the results of this study show that all surviving
items will be properly calibrated provided at least one common item remains in the reduced pool.
On the other hand, a mismatch between the trait distribution of the respondents and the prior dis-
tribution assumed during calibration does not seem to affect estimation accuracy in dichotomous
items (García-Pérez, 1999), and there is no reason to think that such mismatches should be rel-
evant with Likert-type items.
Violations of the assumptions of the present study may occur in a myriad of ways. Such vio-
lations would result in a reduction in estimation accuracy, and the amount of absolute reduction
would probably vary with the particular form in which the assumptions were violated. In any
case, there is no reason to think that a particular type of anchor-item design is intrinsically
more robust to violations of any kind. A thorough study that systematically addresses these issues
is beyond the scope of this article, but this is certainly an area where further research would be
useful. For the time being, and under the practical constraints that motivated this study (i.e., item
banks consisting of new items whose characteristics are unknown and whose size is too large to
permit single-group approaches), the results yield guidelines that may actually alleviate the
potential problems that might be caused by violation of the study assumptions.
The use of the smallest possible number of subtests of the largest possible size allows for
larger groups of respondents that are less likely to be widely heterogeneous in trait distribution.
On the other hand, the potentially detrimental effects of DIF, or of otherwise low-quality items
being accidentally selected as common items (and which will be eliminated during a screening
process prior to calibration, as discussed above) can be reduced by not pushing the design to the
extreme of using a single common item. Interestingly, the finding that the characteristics of
anchor items are immaterial under these assumptions (and for the purpose of accurate parameter
estimation) is not in conflict with recommendations arising from results reported by Lopez
Rivas, Stark, and Chernyshenko (2009) on how the power of DIF detection varies as a function
of the discrimination level of common items in anchor-item designs: Lopez Rivas et al. (see their
Table 7) found that the proportion of correctly identified DIF items improved slightly if a single
common item of high discrimination was used, as compared to the results obtained when the sin-
gle common item had low discrimination. When there were three or five common items, it turned
out that a mixture of discrimination levels across common items yielded similar or better DIF
detection than homogeneity of discrimination levels across common items, whether they were
high or low. These differences in the power of DIF detection as a function of the characteristics
of common items also vanished as the number of respondents per group increased. It should be
noted that Lopez Rivas et al. did not assess the accuracy with which item parameters could be
estimated (which was the goal of the present article). However, their results do not contradict
the guidelines of the present study for the configuration of anchor-item designs: When DIF
detection and accurate parameter estimation are both an issue, the use of a single common
item should perhaps be avoided. At the same time, the random selection of common items
when their discrimination levels are unknown will likely diversify their discrimination levels
so as to yield better DIF detection than could be obtained by choosing common items of homo-
geneously high or homogeneously low discrimination.
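By way of illustration of these guidelines, the following Python sketch assembles an anchor-item design with a few large subtests that share a small number of randomly selected common items, distributing the unique items at random. The function and variable names are hypothetical and the code is not part of the simulation software used in this study.

```python
import random

def build_anchor_design(n_items, n_subtests, n_common, seed=None):
    """Hypothetical helper: split an item bank into subtests that share a small
    set of randomly chosen common (anchor) items; the remaining unique items
    are distributed at random, a criterion that the present results suggest
    works as well as any other."""
    rng = random.Random(seed)
    items = list(range(n_items))
    rng.shuffle(items)
    common = items[:n_common]        # randomly selected anchors
    unique = items[n_common:]        # unique items, in random order
    # Deal the unique items out to the subtests as evenly as possible.
    subtests = [common + unique[k::n_subtests] for k in range(n_subtests)]
    return common, subtests

# Example: a 300-item bank split into 3 large subtests sharing 5 anchors, so
# each subtest contains roughly 103-104 items and goes to its own sample.
common, subtests = build_anchor_design(n_items=300, n_subtests=3, n_common=5, seed=1)
print(len(common), [len(s) for s in subtests])  # 5 [104, 103, 103]
```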
Declaration of Conflicting Interests
The authors declared no conflicts of interest with respect to the authorship and/or publication of this
article.
Funding
The authors disclosed receipt of the following financial support for the research and/or authorship of this
article:
A grant SEJ2005-00485 from Ministerio de Educación y Ciencia (Spain).
Note
1. Baker (1997) discusses the conditions that trigger this behavior but not the ultimate reason for its occur-
rence or the reasons that the parameter estimation algorithm in
MULTILOG accepts disordinal boundaries as
a valid solution. In any case, in the present authors’ own experience, these problems generally occur
when the number of respondents is small and disappear as the number of respondents increases. (The
reader can easily reproduce this feature by running MULTILOG on a data file that is progressively
enlarged by adding data from further respondents.) Thus, this behavior actually reflects an idiosyncrasy of
MULTILOG
(Baker, 1997) and is not the manifestation of some intrinsic problem with the item itself. When these
problems occurred, the entire calibration results were checked and it was confirmed that SEs for the
remaining items (i.e., those not affected by this problem) were similar to those observed in calibration
runs in which no item turned up with disordinal boundaries.
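To make the screening described in this note concrete, a minimal Python sketch of a check for disordinal category boundaries is given below. The item labels and parameter values are invented for illustration; in practice the boundary estimates would be read from the calibration output rather than typed by hand.

```python
def flag_disordinal_items(boundary_estimates):
    """Hypothetical post-processing check: boundary_estimates maps each item
    label to its estimated graded-response-model boundary locations
    (b_1, ..., b_{k-1}); an item is flagged when these are not strictly increasing."""
    flagged = []
    for item, bounds in boundary_estimates.items():
        if any(b2 <= b1 for b1, b2 in zip(bounds, bounds[1:])):
            flagged.append(item)
    return flagged

# Example with made-up estimates for three 5-category items.
estimates = {
    "item_01": [-1.8, -0.4, 0.7, 1.9],   # properly ordered boundaries
    "item_02": [-0.9, 0.3, 0.1, 1.5],    # disordinal: third boundary below the second
    "item_03": [-2.1, -1.0, 0.2, 1.1],   # properly ordered boundaries
}
print(flag_disordinal_items(estimates))  # ['item_02']
```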
References
Baker, F. B. (1992). Equating tests under the graded response model. Applied Psychological Measurement,
16, 87-96.
Baker, F. B. (1997). Estimation of graded response model parameters using
MULTILOG. Applied Psycholog-
ical Measurement, 21, 89-90.
Bjorner, J. B., Kosinski, M., & Ware, J. E., Jr. (2003). Calibration of an item pool for assessing the burden of
headaches: An application of item response theory to the Headache Impact Test (HIT™). Quality of Life
Research, 12, 913-933.
Cella, D., & Chang, C.-H. (2000). A discussion of item response theory and its applications in health status
assessment. Medical Care, 38, II-66-II-72.
Cohen, A. S., & Kim, S.-H. (1998). An investigation of linking methods under the graded response model.
Applied Psychological Measurement, 22, 116-130.
de Gruijter, D. N. M. (1988). Standard errors of item parameter estimates in incomplete designs. Applied
Psychological Measurement, 12, 109-116.
Dodd, B. G., Koch, W. R., & De Ayala, R. J. (1989). Operational characteristics of adaptive testing proce-
dures using the graded response model. Applied Psychological Measurement, 13, 129-143.
du Toit, M. (Ed.). (2003). IRT from SSI: BILOG-MG, MULTILOG, PARSCALE, TESTFACT. Lincolnwood, IL: Scientific
Software International.
Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psycho-
logical Measurement, 32, 224-247.
Fletcher, R. B., & Hattie, J. A. (2004). An examination of the psychometric properties of the physical self-
description questionnaire using a polytomous item response model. Psychology of Sport and Exercise, 5,
423-446.
García-Pérez, M. A. (1999). Fitting logistic IRT models: Small wonder. Spanish Journal of Psychology, 2,
74-94. Available from http://www.ucm.es/sjp
Hanson, B. A., & Béguin, A. A. (2002). Obtaining a common scale for item response theory item param-
eters using separate versus concurrent estimation in the common-item equating design. Applied Psycho-
logical Measurement, 26, 3-24.
Hart, D. L., Wang, Y.-C., Stratford, P. W., & Mioduski, J. E. (2008a). A computerized adaptive test for
patients with hip impairments produced valid and responsive measures of function. Archives of Physical
Medicine and Rehabilitation, 89, 2129-2139.
Hart, D. L., Wang, Y.-C., Stratford, P. W., & Mioduski, J. E. (2008b). Computerized adaptive test for
patients with knee impairments produced valid and responsive measures of function. Journal of Clinical
Epidemiology, 61, 1113-1124.
Hol, A. M., Vorst, H. C. M., & Mellenbergh, G. J. (2007). Computerized adaptive testing for polytomous
motivation items: Administration mode effects and a comparison with short forms. Applied Psycholog-
ical Measurement, 31, 412-429.
Holman, R., & Berger, M. P. F. (2001). Optimal calibration designs for tests of polytomously scored items
described by item response theory models. Journal of Educational and Behavioral Statistics, 26, 361-
380.
Kim, S.-H., & Cohen, A. S. (2002). A comparison of linking and concurrent calibration under the graded
response model. Applied Psychological Measurement, 26, 25-41.
Koch, W. R. (1983). Likert scaling using the graded response latent trait model. Applied Psychological
Measurement, 7, 15-32.
Lai, J.-S., Cella, D., Chang, C.-H., Bode, R. K., & Heinemann, A. W. (2003). Item banking to improve,
shorten and computerize self-reported fatigue: An illustration of steps to create a core item bank
from the FACIT-Fatigue Scale. Quality of Life Research, 12, 485-501.
Lopez Rivas, G. E., Stark, S., & Chernyshenko, O. S. (2009). The effects of referent item parameters on
differential item functioning detection using the free baseline likelihood ratio test. Applied Psycholog-
ical Measurement, 33, 251-265.
Meade, A. W., Lautenschlager, G. J., & Johnson, E. C. (2007). A Monte Carlo examination of the sensitivity
of the differential functioning of items and tests framework for tests of measurement invariance with
Likert data. Applied Psychological Measurement, 31, 430-455.
Numerical Algorithms Group. (1999). NAG Fortran library manual, Mark 19. Oxford, UK: Author.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory:
Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566.
Reise, S. P., & Yu, J. (1990). Parameter recovery in the graded response model using
MULTILOG. Journal of
Educational Measurement, 27, 133-144.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika
Monograph Supplement, 34(4, Pt. 2, No. 17), 100-114.
Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Hand-
book of modern item response theory (pp. 85-100). New York, NY: Springer.
Singh, J., Howell, R. D., & Rhoads, G. K. (1990). Adaptive designs for Likert-type data: An approach for
implementing marketing surveys. Journal of Marketing Research, 27, 304-321.
Singh, J., Rhoads, G. K., & Howell, R. D. (1992). Adapting marketing surveys to individual respondents:
An approach using item information functions. Journal of the Market Research Society, 34, 125-147.
Uttaro, T., & Lehman, A. (1999). Graded response modeling of the Quality of Life Interview. Evaluation
and Program Planning, 22, 41-52.
Vale, C. D. (1986). Linking item parameters onto a common scale. Applied Psychological Measurement,
10, 333-344.
Ware, J. E., Jr., Bjorner, J. B., & Kosinski, M. (2000). Practical implications of item response theory and
computerized adaptive testing: A brief summary of ongoing studies of widely used headache impact
scales. Medical Care, 38, II-73-II-82.
Ware, J. E., Jr., Kosinski, M., Bjorner, J. B., Bayliss, M. S., Batenhorst, A., Dahlöf, C. G. H., . . . Dowson, A.
(2003). Applications of computerized adaptive testing (CAT) to the assessment of headache impact.
Quality of Life Research, 12, 935-952.
Wingersky, M. S., & Lord, F. M. (1984). An investigation of methods for reducing sampling error in certain
IRT procedures. Applied Psychological Measurement, 8, 347-364.
Woods, C. M. (2007). Ramsay curve IRT for Likert-type data. Applied Psychological Measurement, 31,
195-212.