To cite this article: Hung Tan Ha, Duyen Thi Bich Nguyen & Tim Stoeckel (2025) A Comparison
of Yen’s Q3 Coefficient and Rasch Testlet Modeling for Identifying Local Item Dependence:
Evidence from Two Vocabulary Matching Tests, Language Assessment Quarterly, 22:1, 56-76,
DOI: 10.1080/15434303.2025.2456953
To link to this article: https://doi.org/10.1080/15434303.2025.2456953
ARTICLE
A Comparison of Yen's Q3 Coefficient and Rasch Testlet Modeling for Identifying Local Item Dependence: Evidence from Two Vocabulary Matching Tests
Hung Tan Ha (a), Duyen Thi Bich Nguyen (b), and Tim Stoeckel (c)
(a) Victoria University of Wellington, Wellington, New Zealand; (b) University of Economics Ho Chi Minh City (UEH), Ho Chi Minh City, Vietnam; (c) University of Niigata Prefecture, Niigata, Japan
ABSTRACT
This article compares two methods for detecting local item dependence (LID): residual correlation examination and Rasch testlet modeling (RTM), in a commonly used 3:6 matching format and an extended matching test (EMT) format. The two formats are hypothesized to facilitate different levels of item dependency due to differences in the number of options and instructions regarding option recycling. The findings indicate that (1) RTM allows deeper LID inspection compared to residual correlation examination in matching tests, and (2) the EMT format has good resistance to LID while the traditional 3:6 format does not.
INTRODUCTION
Given the importance of receptive vocabulary knowledge in second language (L2) learning, it is important to consider how lexis can be reliably assessed. Problems in
vocabulary assessment and design modifications that could address such problems have
been a topic of discussion (Schmitt et al., 2020) and debate (Stoeckel et al., 2021; Webb,
2021). In addition to potential sources of inaccuracy in vocabulary tests such as testwiseness
(Stoeckel et al., 2019), the use of overly-inclusive word counting units (McLean, 2018;
Stoeckel et al., 2020), and unrepresentative item sampling (Gyllstad et al., 2020), questions
have been raised regarding whether some vocabulary tests meet the conditional indepen-
dence (CI) assumption of item response theory (IRT) models (Ha, 2022; Kamimoto, 2014).
Conditional independence between items assumes that the only source of correlation
between items is the latent trait that the test measures (Lord & Novick, 1968), and that when
this latent trait has been conditioned out, correlations between items should be zero (Lord &
Novick, 1968; Yen & Fitzpatrick, 2006). Ignoring violations of this assumption might lead to
problems in evaluating a test’s psychometric qualities which, in turn, can result in negative
consequences in score interpretation and use (Yen & Fitzpatrick, 2006).
Amongst vocabulary tests, ones that employ matching formats are the most likely to
violate this assumption, as items are put into clusters and share the same stimuli. For
example, the well-known Vocabulary Levels Test (VLT, Schmitt et al., 2001) and its updated
variant (Webb et al., 2017) each have clusters that contain three target words and six
response options (Figure 1).
This makes each cluster a potential super-item where the correct or incorrect answer to
one item may be dependent on the correct or incorrect response to another item (Baghaei,
2007). Vocabulary test creators tend to examine the correlations between item residuals
(Yen, 1984, 1993) to check if the items are conditionally independent (Ha, 2021; Nguyen
et al., 2024; Webb et al., 2017). However, although this method is informative and easy to use, its exploratory nature offers no information beyond signaling the potential presence of LID between test items.
The present study extends previous examinations of LID in vocabulary tests by
comparing (a) residual correlation check and (b) Rasch testlet modeling (Wang &
Wilson, 2005), a confirmatory method that allows researchers to model both knowledge-
and method-specific dimensions and estimate the unique variance accounted for by each
of them. This comparison is made to examine item dependency in two matching
vocabulary test formats. The first is a traditional matching format employed by the
Updated Vocabulary Levels Test (UVLT; Webb et al., 2017), in which target items
appear in small clusters and option recycling (i.e., using the same option multiple
times) is neither encouraged nor prohibited. The second is an extended matching test
format (EMT; Stoeckel et al., 2024) in which target words are placed in very large
clusters, and learners are instructed that the same response option may be used more
than once. It is hypothesized that the traditional matching test is more likely to
demonstrate item dependency because of its use of relatively small clusters and lack of
instructions regarding option recycling. The use of two formats with different hypothe-
sized levels of item dependency facilitates an in-depth comparison of two methods for
examining item dependency.
LITERATURE REVIEW
Local Item Dependence in the Rasch Model
The most parsimonious, effective, and arguably easiest way to understand the Rasch model
is through the relationship between a person’s ability (θ) and task difficulty (b; Bond et al.,
2021):
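$$P_{ni} = f(\theta_n - b_i) + \varepsilon_{ni} \tag{1}$$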
Equation (1) expresses the probability (P) of person n with ability θ being successful on task
i with difficulty level b. There is an inverse relationship between person ability and task
difficulty such that the success rate will be higher when either person n is more able or task
i is less difficult. f is a mathematical function of the difference between ability θ and
Figure 1. An example of an Updated Vocabulary Levels Test (UVLT) item cluster.
difficulty b. For the Rasch dichotomous model, the above equation is expressed as below
when the logit link function is applied, with exp() representing the exponential function:
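$$P(x_{ni} = 1) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)} + \varepsilon_{ni} \tag{2}$$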
Note: epsilon (ε) is specific to item–person interaction.
ε in equations (1) and (2) is the residual, or the difference between observed scores from
test takers’ actual performance and the expected scores from the Rasch model. It is the
“unexplained variance” in the Rasch model. In other words, it is any performance that can
be explained by neither the person’s ability nor the task’s difficulty. It is worth noting that ε
is not explicitly mentioned in the original Rasch model, but we added it here for clarity, as it
relates to the methodology of the present study. In fact, residuals exist in all idealized
mathematical models as “observed test data collected from real life can never attain that
mathematical ideal” (Fan & Bond, 2019, p. 86). In an actual test, examinees may unexpect-
edly succeed or fail at a difficult or easy test item for many reasons. This means
a considerable amount of information, or variance, is often accounted for by these residuals
(Linacre, 2017). In a unidimensional test, these residuals are expected to be random noise,
or not to follow a specific pattern (Bond et al., 2021; Fan & Bond, 2019). In other words, test
takers’ unexpected responses should not correlate. When the residuals correlate, it means
that the answers of test takers are being predicted by something other than their ability or
task difficulty. This is a violation of local item independence (LII), a basic assumption of
Rasch and other IRT models. The violation itself is denoted as local item dependence
(LID; Baghaei, 2016; Fan & Bond, 2019).
Item Bundles and Local Item Dependence
LID can occur for various reasons including external assistance, rushed responses due to
lack of time, mental fatigue, and item design (Yen, 1993). Researchers tend to focus on
sources of LID that come from the test itself as they pose critical issues to the validity of
score interpretations. Tests that include testlets are believed to be the most vulnerable to
LID (Brandt, 2017). The term testlet, or item bundle, refers to a set of items that share
a common stimulus (Wang & Wilson, 2005). Typical examples of testlets in L2 assessment
include comprehension questions that share the same reading passage or items that belong
to the same cluster in a matching vocabulary test (Schmitt et al., 2001; Webb et al., 2017).
Such matching formats are popular because they are efficient. That is, by allowing several
items to share a common stimulus, test creators need less time than when separate stimuli
are required for each target word. For example, whereas three 4-option, multiple-choice
questions require a total of 12 options, only six are needed when the three target words are
placed in a single cluster and allowed to “load” on the same set of options. Moreover, the
target-to-distractor ratio would also be 1:3 for the last item that is answered, which is
identical to a traditional four-option multiple-choice format. So, theoretically, a matching
format saves time while maintaining test quality. However, there is a critical drawback.
When several items share the same stimulus, the probability of test takers answering one
item correctly is influenced by their answers to other items in the bundle (Baghaei, 2007;
Brandt, 2017). In other words, the items exhibit LID.
Research on LID
The effects of LID in IRT models have been extensively investigated. Research has suggested
that ignoring LID in unidimensional analysis may result in (1) biased estimation of item
difficulty and discrimination, (2) misleading estimates of item variance, and (3) inflation of
reliability (Bond et al., 2021; Marais & Andrich, 2008; Monseur et al., 2011; Tuerlinckx & De
Boeck, 2001; Wainer et al., 2007; Wang & Wilson, 2005; Yen, 1984). In language testing, the
effect of LID has also received the attention of researchers, and similar conclusions have
been drawn (Baghaei, 2010, 2016; Baghaei & Aryadoust, 2015; Baghaei & Christensen, 2023;
Baghaei & Ravand, 2016). However, LID remains the least reported aspect of Rasch
measurement in language assessment (Aryadoust et al., 2021; Fan & Bond, 2019). The
existence of LID can be detected using various statistical techniques including Rasch log-
linear modeling (Kelderman, 1984; Kreiner & Christensen, 2004, 2007), residual correlation
examination (van den Wollenberg, 1982; Yen, 1984, 1993), and Rasch testlet modeling
(Wang & Wilson, 2005). The present study focuses on residual correlation examination and
testlet modeling.
Residual Correlation Inspection
In the field of vocabulary assessment, LID is usually diagnosed through the inspection of
raw score or standardized residual correlations using Rasch unidimensional analysis
(Aryadoust et al., 2021; Ha, 2022). Standardized residuals can be calculated by dividing
the raw residual by its standard deviation. The raw score residual correlation, or Yen's (1984, 1993) Q3 coefficient, is the most frequently reported index in Rasch analysis (Christensen et al.,
2017). The technique is as straightforward as its name suggests. It is based on the basic
assumption of most IRT models that when the variance of the target ability dimension has
been conditioned out or explained by the model, the variance that remains should be
random noise (Baghaei & Aryadoust, 2015). This technique inspects this variance to see if
it is truly random. Several critical values of Q3 statistics have been used in psychometric
studies, ranging from .1 to .7 (see Christensen et al., 2017; Fan & Bond, 2019 for a full
discussion).
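To make the computation concrete, the minimal Python sketch below shows how Q3 can be obtained for dichotomous Rasch data. It illustrates the logic only and is not the Winsteps routine used later in this study; the function name, the simulated data, and the use of the .30 flagging threshold are illustrative.

```python
import numpy as np

def q3_matrix(responses, theta, b):
    """Yen's Q3: inter-item correlations of Rasch raw-score residuals.

    responses : (N, I) array of dichotomous item scores (0/1)
    theta     : (N,) person ability estimates in logits
    b         : (I,) item difficulty estimates in logits
    """
    # Model-expected probability of success for every person-item pair
    expected = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    # Raw residuals: observed minus expected
    # (standardized residuals would divide these by sqrt(expected * (1 - expected)))
    residuals = responses - expected
    # Q3 is the correlation matrix of the residual columns (one column per item)
    return np.corrcoef(residuals, rowvar=False)

# Illustrative use with simulated data; in practice theta and b would be
# the person and item estimates from the Rasch calibration.
rng = np.random.default_rng(0)
theta = rng.normal(size=300)
b = rng.normal(size=30)
prob = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
x = (rng.random((300, 30)) < prob).astype(float)

q3 = q3_matrix(x, theta, b)
flagged = [(i, j, round(q3[i, j], 2))
           for i in range(q3.shape[0]) for j in range(i + 1, q3.shape[1])
           if abs(q3[i, j]) >= 0.30]   # pairs at or above the moderate-LID threshold
```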
This technique of LID inspection has both pros and cons. On the upside, the analysis is
easy to conduct, the theory is accessible, and the results are straightforward. On the down-
side, however, the information offered is limited to the magnitude of correlations between
the residuals of individual items. While these correlations may signal that test-takers’
performance is influenced by something other than the ability of interest, the source and
degree of interference are often hard to identify. Test takers usually need at least two types of
knowledge to answer test questions: knowledge of the tested subject matter, and knowledge
of how to deal with the test format. Effects of the method-related knowledge of different
items can be strongly correlated, especially when the test employs only one format. Thus,
whereas residual correlations may indicate the poor design of a multiple-choice test when
the content of one item signals the answer to another, such correlations can also signify test
takers’ familiarity with the format itself.
The residual correlations that are due to method-related ability, when not taken into
account, can result in misleading conclusions. That is, if analysts examine only the residual
correlations within item clusters and ignore those between items of different clusters, they
may overlook potential method-related dimensions in a test. The existence of dimensions
other than the general ability dimension can be a serious violation of the LII assumption in
unidimensional analysis and evidence against a test’s validity. In short, due to the limited
information residual correlations offer, this method is only good for signaling the existence
of LID at the whole test level, and not suitable for identifying LID caused by cluster-related
factors. For this, an appropriate method for testlet modeling should be employed. The
present study aims to introduce a method of testlet modeling.
Rasch Testlet Modeling
The existence of LID and its effects can also be investigated using Rasch testlet modeling. In
2005, Wang and Wilson proposed the Rasch Testlet Model (RTM) for testlet-based tests.
They hypothesize that the probability of person n giving a correct answer to item i depends
not only on person ability (θ) and item difficulty (b), but also on the person’s ability on
testlet d to which item i belongs. Imagine a 9-item vocabulary test that assigns items into
3-item clusters; the probability of test takers (n) correctly answering the questions (i) will be
influenced by their vocabulary knowledge (θ), the difficulty of the tested words (b), and
examinees’ familiarity with the format (d). The RTM may be expressed as follows:
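$$P(x_{nid} = 1) = \frac{\exp(\theta_n - b_i + \gamma_{nd(i)})}{1 + \exp(\theta_n - b_i + \gamma_{nd(i)})}$$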
or:
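$$\ln\left[\frac{P(x_{nid} = 1)}{P(x_{nid} = 0)}\right] = \theta_n - b_i + \gamma_{nd(i)}$$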
γ_nd(i) is the effect of testlet d, which contains item i, on person n, or the strength or weakness of person
n on subtest d. According to this model, person n with low ability θ can still manage to
correctly answer item i if they have strong ability γ on the specific testlet d (Wang & Wilson,
2005). With the addition of the testlet-specific subdimensions γ, the testlet model represents
a within-item, multidimensional, bi-factor model where each item loads on one general
ability dimension and one testlet-specific sub-dimension. Depending on the design of the
test, γ could be denoted as the “method” dimension (e.g., test formats), or an area-specific
ability sub-dimension (e.g., different areas in a mathematical test). Figures 2 and 3 depict
how the Rasch unidimensional model and the RTM differ. Figure 2 displays the unidimen-
sional model where all items are hypothesized to load on a single ability dimension. In
contrast, Figure 3 illustrates an RTM in which each item is loaded on not only a single
general ability dimension but also on one of several different testlet dimensions.
The RTM, through model comparison, not only offers another method for the examina-
tion of LID, but also allows researchers to see the reliability indices and the amount of
variance accounted for by each subtest dimension. These metrics provide insights into
whether estimates of general ability are disturbed by the unique effects of testlets.
To effectively extract the amount of variance that the general ability and testlet dimensions
account for, the RTM assumes complete orthogonality. This means that all the dimensions are
assumed to be uncorrelated. In other words, the RTM requires that the covariance between all
subtests and the main dimension be constrained at zero. In this sense, the RTM is nested
within a bi-factor model without orthogonal constraints (Li et al., 2006; Rijmen, 2010). In the
case of a vocabulary test, this assumption of total orthogonality allows researchers to isolate
vocabulary knowledge in the main dimension from method-related variance, which results in
more precise data interpretation.
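Formally, for a test with D testlets, the person parameters (θ, γ_1, . . ., γ_D) can be thought of as following a multivariate (typically normal) distribution whose covariance matrix is constrained to be diagonal, so that every covariance between dimensions is fixed at zero:

$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma^2_{\theta} & 0 & \cdots & 0 \\ 0 & \sigma^2_{\gamma_1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2_{\gamma_D} \end{pmatrix}$$

Only the diagonal variance terms are estimated, and it is these dimension variances that are reported in the Results sections below.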
The effect of LID on unidimensional analysis can be examined in RTM by
comparing (a) the amount of variance explained by the general knowledge dimen-
sion in RTM with (b) the variance of a unidimensional model. If test data is truly
unidimensional, then the amount of variance explained by the general knowledge
dimension in RTM and the unidimensional model should be similar (Wang &
Wilson, 2005). However, if method-related dimensions are present, the disturbance
caused by these dimensions would result in a difference in the amount of explained
Figure 2. Unidimensional model.
Figure 3. The Rasch testlet model.
variance between the unidimensional model and the general knowledge dimension of
an RTM, with the variance of the latter usually being higher (Baghaei &
Aryadoust, 2015). Hence, evidence for the existence of testlet dimensions is clearer
in RTM. Due to its orthogonal constraints, the variance explained by a particular
dimension is unique to itself (Wang & Wilson, 2005). For unidimensional test data,
the variance explained by sub-test dimensions should be minimal, or at least too
small to disturb person ability estimates. To the best of our knowledge, no research
has employed Rasch testlet modeling to examine LID and its effects in L2 vocabulary
tests.
The Extended Matching Format
LID occurs in matching formats due to the interrelation between items within clusters. That
is, when there is no option recycling, the number of response options decreases as test-
takers match them to target words, and when initial responses are correct, the probability of
subsequent responses being correct increases. This is not the case for traditional multiple-
choice formats. Additionally, if a person wrongly selects the correct option of one item
(item A) as the answer to another item (item B), the chances of correctly answering the first
(item A) become 0%.
This interrelation of items in matching formats drew the attention of David Budescu in
1988. Budescu (1988) compared (a) different configurations of multiple-matching vocabu-
lary tests, with item-to-option ratios ranging from 4:4 to 8:32, to (b) a 4-option, multiple-
choice test. He found that an increase in the number of items per cluster led to fewer
successful guesses and increases in internal consistency, item discrimination, and correla-
tions with a test of language proficiency. While LID was not explicitly examined in
Budescu’s study, his findings suggest that the interdependence between items decreases as
the number of options increases, which implies that LID is less likely to occur in multiple-
matching formats with a large number of options.
In a recent effort to extend Budescu’s findings, Stoeckel et al. (2024) developed an
extended-matching test (EMT) of L2 vocabulary knowledge. This paper-and-pencil test
had three sections, one each for verbs, nouns, and adjectives. Each section contained a single
cluster with 130 response options and 30 target words, 6 from each of the five frequency-
based levels of Form B of the UVLT. Target words appeared in short, lexically and
syntactically simple prompt sentences, and the options were listed in alphabetical order
across five columns (Figure 4). The EMT’s instructions stated that the same option could be
used for multiple target words, and in fact, the test contained three pairs of target words,
each of which shared the same key. For example, weed and grass shared the key cỏ in the
noun section of the test. In other words, option recycling was an explicit aspect of the test
format.
Theoretically, the design of Stoeckel et al.'s (2024) EMT should reduce the inter-
dependence of items by increasing the number of options per item and encouraging
option recycling. When compared to meaning-recall criterion measures, the EMT did
not significantly differ in terms of internal consistency, mean scores (indicating very few
successful guesses), and correlations with reading comprehension (Stoeckel et al., 2024), suggesting a very strong degree of item independence. In the present paper, LID in the
EMT is examined.
PURPOSE
The present research seeks to compare the use of two LID inspection techniques, residual
correlation examination and RTM, in two matching vocabulary tests. As residual correlations
offer information on neither the effect of LID on unidimensional analysis nor the amount of
variance accounted for by individual testlets, we hypothesize that tests with similar residual
correlation examination results could exhibit different LID effects. The paper aims to provide
additional validity evidence for the EMT, while introducing a technique for inspecting LID at
the testlet level. To this end, the paper reports the results of two studies where the LII
assumption is investigated in two matching test formats that employ different target-to-
distractor ratios and different instructions regarding option recycling.
In each study, the research questions (RQ) were:
(1) How strongly do the item residuals correlate?
(2) Do the data fit better under a unidimensional model or a testlet model?
(3) Are there any substantial differences in the amount of variance accounted for by the
unidimensional model and the purified general knowledge dimension of the RTM?
(4) What are the variances accounted for by the testlet-dimensions in the RTM?
Figure 4. Verb section of the EMT.
STUDY 1
In Study 1, LID was investigated in a traditional matching vocabulary test. Ha’s (2022)
UVLT dataset was used. In his study, Ha examined the residual correlations between 150
items of the UVLT. While problematic residual correlations were found, most of them
belonged to the 1K level. The present study separated the UVLT into five 30-item tests
according to their frequency levels and investigated LID in each level. In addition, RTM was
applied as a confirmatory method for LID detection.
Method
Participants
Participants included 311 students from a highly ranked university in Vietnam. All the
participants were second-year non-English majors who had completed at least 7 years of
compulsory English education from grades 6 to 12 (ages 12–18; Vu & Peters, 2021) and
two Business English courses at the tertiary level.
Instruments
Form B of the UVLT (Webb et al., 2017) was used. This test has 150 target words. The
UVLT employs a matching format where each 3-item cluster shares 6 options (Figure 1).
Option recycling is not mentioned in the test instructions. However, as matching formats do not typically involve recycling, test takers may, in the absence of explicit instruction allowing it, implicitly understand that they can choose an option only once. Because of this, and because of the large difference in cluster size between the two tests examined, the UVLT is hypothesized to have the higher item dependency within clusters.
Data Analysis
For RQ 1, raw score residual correlations, or Yen's (1984, 1993) Q3 coefficients, were
examined using Winsteps 5.6.3.0 (Linacre, 2023a). Shared variance between item residuals
was calculated by squaring the residual correlations. Correlations at ± 0.3, where two items
share approximately 10% of their residual variance in common, and ± 0.7, where common
residual variance is approximately 50%, were employed as the thresholds for moderate and
serious LID, respectively (Aryadoust et al., 2021; Linacre, 2023b).
For RQ 2, test data were subjected to Rasch unidimensional and testlet modeling using
ACER Conquest 5.34.2 (Adams et al., 2020). Model fit statistics were then compared. As the
unidimensional model is nested within the RTM, direct comparison was conducted using
a likelihood ratio test, with χ² equal to the difference in −2 log-likelihoods, or deviance, between the two models. Similarly, the associated degrees of freedom were calculated as the difference in the number of estimated parameters between the two compared models (Baghaei
& Aryadoust, 2015). For RQs 3 and 4, the variance and reliability of the dimensions in the
unidimensional and testlet models were estimated using ACER ConQuest. The expected a posteriori method of estimation based on plausible values (EAP/PV) was applied
because it was designed to “obtain consistent estimates of population characteristics in
assessment situations where individuals are administered too few items to allow precise
estimates of their ability” (American Institute for Research, cited in Wu, 2004, p. 976). For
Rasch unidimensional analysis in ConQuest, the Gauss-Hermite quadrature method of
estimation was applied. According to Adams et al. (2022), this method of estimation is most
suitable for models with fewer than three dimensions. For Rasch testlet modeling, Monte
Carlo simulation with 2,000 nodes was applied, following the suggestion of Adams et al.
(2022). The mean of the item parameters on each dimension was constrained to zero for
identification purposes. Orthogonal constraints between the dimensions were applied using
the covariance matrix anchoring function in Conquest.
Data and command scripts for Winsteps and ConQuest are openly available on our Open Science Framework project page (Ha et al., 2023) at https://doi.org/10.17605/OSF.IO/AKRCG.
Results
Rasch Separation and Reliability
Table 1 offers Rasch reliability and separation indices for the five 30-item test levels of the
UVLT. Except for the 1K level, which was constrained by a ceiling effect, each level shows
acceptable person separation and reliability, suggesting that in this regard the levels each
perform well as individual tests.
Residual Correlation
Table 2 presents the highest Q3 coefficients for the five levels in the UVLT. The results from
the 1K level showed severe LII violations with items in the same clusters highly correlating
at .90 and 1.00. This could be because these students were encountering the UVLT
and its format for the first time, which possibly led to their initial responses reflecting not
only vocabulary knowledge, but also relative incompetence in handling the test format. This
interpretation is supported by the fact that only three item pairs correlated at greater than
.30 in the remaining levels. This may signal that test takers were more familiar with the test
format after the 1K level, and therefore tended to use more of their vocabulary knowledge to
answer questions, which led to more independent responses. However, the high item
correlations in the 1K level might also simply be a result of the ceiling effect observed in
that level. That is, when nearly all of the responses are correct for an item pair, just a few
unexpected responses can produce a high residual correlation.
Model Comparison
Table 3 presents the fit statistics for model comparison of the five 30-item levels of the
UVLT. Likelihood ratio tests indicated that the data fit the RTM significantly better than the
unidimensional model at p < .001 for all five levels of the UVLT.
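As a check, the likelihood ratio test for any level can be reproduced directly from the deviance values and parameter counts reported in Table 3; the brief Python sketch below (assuming scipy is available) does so for the 1K level.

```python
from scipy.stats import chi2

# Likelihood ratio test: unidimensional model vs. RTM, UVLT 1K values from Table 3
deviance_uni, params_uni = 2951.51, 31   # unidimensional model
deviance_rtm, params_rtm = 2881.92, 41   # Rasch testlet model

lr_statistic = deviance_uni - deviance_rtm   # difference in deviances (-2 log-likelihoods)
df = params_rtm - params_uni                 # difference in estimated parameters
p_value = chi2.sf(lr_statistic, df)          # upper-tail chi-square probability

print(f"chi2({df}) = {lr_statistic:.2f}, p = {p_value:.1e}")   # chi2(10) = 69.59, p < .001
```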
Table 1. UVLT Rasch reliability and separation.
Level    Item reliability    Item separation    Person reliability    Person separation
UVLT 1k .92 3.29 .17 0.45
UVLT 2k .98 7.46 .82 2.14
UVLT 3k .99 9.47 .89 2.85
UVLT 4k .99 9.17 .88 2.67
UVLT 5k .98 7.65 .85 2.38
Variance Estimate and EAP/PV Reliability of the Dimensions
Table 4 shows the variance of the overall vocabulary knowledge dimension and the 10
method-specific dimensions (Testlet 1–10) based on the unidimensional and testlet models.
Except for the 3K level, reliability inflation was observable for all cases.
Table 2. Highest residual correlations for the UVLT.
UVLT 1k UVLT 2k UVLT 3k UVLT 4k UVLT 5k
Items Q3 Items Q3 Items Q3 Items Q3 Items Q3
5–6* 1.00 40–41* .31 73–75* .38 94–95* .27 142–143* .31
26–27* .90 34–36* .26 69–75 .28 95–96* .21 148–150* .27
1–3* .65 34–35* .23 86–87* .21 94–96* .20 125–126* .25
6–7 .48 58–59* .22 79–80* .17 107–119 .17 149–150* .22
5–7 .48 55–56* .20 76–77* .17 97–98* .16 146–147* .22
16–18* .45 52–54* .20 74–78 −.21 110–115 .15 131–146 −.23
4–7 .43 38–55 .19 61–79 −.19 99–106 −.21 127–145 −.23
14–15* .41 43–44* .18 66–73 −.19 96–120 −.21 131–150 −.23
2–3* .38 31–33* .18 72–83 −.18 94–102 −.20 125–141 −.23
12–18 .38 46–47* .17 72–79 −.18 115–118 −.20 122–140 −.22
4–6* .38 36–50 −.25 66–75 −.18 91–108 −.20 121–136 −.21
4–5* .38 44–52 −.21 67–84 −.17 95–102 −.19 124–150 −.21
2–19 .29 51–59 −.21 68–84 −.17 91–104 −.19 123–138 −.20
7–9* .28 31–46 −.20 79–84 −.17 91–102 −.18 124–145 −.19
5–11 .27 32–56 −.19 63–87 −.17 97–107 −.17 121–138 −.19
3–23 −.35 50–54 −.18 76–89 −.17 96–107 −.17 124–143 −.19
9–28 −.30 36–55 −.18 75–83 −.17 93–111 −.17 127–143 −.19
8–17 −.28 33–56 −.18 72–81 −.16 108–112 −.17 130–133 −.19
1–23 −.28 40–55 −.17 81–85 −.16 103–114 −.17 121–137 −.19
12–17 −.27 43–50 −.17 66–76 −.16 103–113 −.16 124–142 −.19
*denotes item pairs from the same clusters.
Table 3. Fit statistics for the UVLT.
Test Statistics Unidimensional model Testlet model
UVLT 1k Deviance 2951.51 2881.92
AIC 3013.51 2963.92
AIC corrected 3020.63 2976.72
BIC 3129.45 3117.25
Estimated parameters 31 41
UVLT 2k Deviance 7479.97 7358.68
AIC 7541.97 7440.68
AIC corrected 7549.08 7453.48
BIC 7657.90 7594.01
Estimated parameters 31 41
UVLT 3k Deviance 8796.96 8726.40
AIC 8858.96 8808.40
AIC corrected 8866.07 8821.21
BIC 8974.89 8961.73
Estimated parameters 31 41
UVLT 4k Deviance 9073.48 9005.15
AIC 9135.48 9087.15
AIC corrected 9142.59 9099.95
BIC 9251.42 9240.48
Estimated parameters 31 41
UVLT 5k Deviance 8735.34 8618.37
AIC 8797.34 8700.37
AIC corrected 8804.45 8713.18
BIC 8913.28 8853.71
Estimated parameters 31 41
Table 4. Variance estimates and reliability values of the dimensions.
UVLT 1K
Unidimensional model Testlet model
Variance estimate EAP/PV reliability Variance estimate EAP/PV reliability
Vocabulary Knowledge 3.626 .732 3.792 .641
Testlet 1 – – 0.417 .015
Testlet 2 – – 0.580 .017
Testlet 3 – – 0.569 .043
Testlet 4 – – 0.491 .035
Testlet 5 – – 1.072 .165
Testlet 6 – – 1.677 .070
Testlet 7 – – 0.404 .056
Testlet 8 – – 1.138 .146
Testlet 9 – – 4.285 .175
Testlet 10 – – 2.638 .274
UVLT 2K
Unidimensional model Testlet model
Variance estimate EAP/PV reliability Variance estimate EAP/PV reliability
Vocabulary Knowledge 2.824 .873 3.470 .869
Testlet 1 – – 0.658 .175
Testlet 2 – – 0.869 .091
Testlet 3 – – 0.483 .083
Testlet 4 – – 2.242 .388
Testlet 5 – – 1.420 .203
Testlet 6 – – 1.674 .321
Testlet 7 – – 1.118 .190
Testlet 8 – – 0.864 .183
Testlet 9 – – 0.885 .202
Testlet 10 – – 0.809 .214
UVLT 3K
Unidimensional model Testlet model
Variance estimate EAP/PV reliability Variance estimate EAP/PV reliability
Vocabulary Knowledge 2.618 .898 2.951 .921
Testlet 1 – – 1.278 .224
Testlet 2 – – 0.144 .048
Testlet 3 – – 1.431 .260
Testlet 4 – – 1.238 .290
Testlet 5 – – 1.035 .216
Testlet 6 – – 0.358 .125
Testlet 7 – – 0.564 .174
Testlet 8 – – 0.801 .210
Testlet 9 – – 0.262 .094
Testlet 10 – – 0.190 .062
UVLT 4K
Unidimensional model Testlet model
Variance estimate EAP/PV reliability Variance estimate EAP/PV reliability
Vocabulary Knowledge 2.291 .903 2.717 .882
Testlet 1 – – 0.883 .229
Testlet 2 – – 2.600 .294
Testlet 3 – – 1.528 .242
Testlet 4 – – 0.882 .202
Testlet 5 – – 0.743 .209
Testlet 6 – – 0.293 .095
Testlet 7 – – 0.428 .150
Testlet 8 – – 0.325 .122
Testlet 9 – – 0.342 .099
(Continued)
Variance examination showed that the method-specific subdimensions in the UVLT accounted for a sizable amount of variance, peaking at 113% of the variance of the general knowledge dimension for the UVLT 1K. That is, the variance explained by dimension 9 (4.285) was 1.13 times that of the general knowledge dimension (3.792). While the other four levels did not show such a spike, the largest amounts of variance explained by the testlet effects were 64.6%, 48.5%, 95.7%, and 66.2% of the variance of the main dimension for the 2K, 3K, 4K, and 5K levels, respectively. The amount of variance the testlet effects accounted for clearly demonstrated a substantial disturbance to estimates of test takers' ability. Because of this disturbance, the unidimensional models were unable to explain as much variance as the purified vocabulary knowledge dimensions in the RTMs; this could be observed in all five levels of the UVLT. The practical implication is that person measures from Rasch unidimensional analysis are disturbed; that is, they are influenced by method-specific effects and therefore might not represent true ability.
STUDY 2
In Study 2, Q3 coefficients were calculated and RTM was applied to examine LID in a novel
extended-matching test format. The data were previously used by the authors in a separate
study.
Method
Participants
The participants were 275 students at a top-tier university in southern Vietnam. They included
both English majors and non-English majors of various disciplines. Convenience sampling was
applied; the participants were students from classes taught by one of the authors.
Instruments
Vocabulary was assessed with a 90-item EMT that was introduced in a criterion validation
study by Stoeckel et al. (2024).
Table 4. (Continued).
Testlet 10 – – 0.717 .218
UVLT 5K
Unidimensional model Testlet model
Variance estimate EAP/PV reliability Variance estimate EAP/PV reliability
Vocabulary Knowledge 3.017 .904 3.879 .898
Testlet 1 – – 0.385 .122
Testlet 2 – – 0.464 .131
Testlet 3 – – 1.265 .260
Testlet 4 – – 0.801 .187
Testlet 5 – – 1.135 .202
Testlet 6 – – 0.511 .145
Testlet 7 – – 0.959 .200
Testlet 8 – – 2.569 .348
Testlet 9 – – 1.059 .238
Testlet 10 – – 2.230 .329
Data Analysis
The data were analyzed as described for Study 1, except that the EMT was analyzed as
a whole.
Results
Rasch Separation and Reliability
Table 5 presents Rasch unidimensional reliability and separation values for persons and
items for the EMT. The reliability estimates were greater than .80, and separation indices
were greater than 2, suggesting a reliable test (Linacre, 2023b).
Residual Correlation
Table 6 lists the highest raw score residual correlations (Yen’s Q3 coefficients) of the EMT.
Three item pairs were shown to have correlations of greater than .30, the benchmark for
moderate LID. Among these, items 61, 62 and 63 were in the same 30-item cluster and
located close to each other. No items were found to correlate at .70 or above, our threshold
for serious LID.
Table 5. Rasch reliability and separation for the EMT.
Test    Item reliability    Item separation    Person reliability    Person separation
EMT .99 8.92 .96 4.95
Table 6. Highest residual correlations for the EMT.
Items Q3
61–63* .34
62–63* .33
35–41* .31
1–35 .29
46–49* .28
79–90* .28
10–43 .27
39–47 .26
5–6* .26
5–46 .25
13–17* .25
19–50 .25
15–18* .25
16–22* .25
31–45* .25
9–22* .24
1–5* .24
20–64 −.29
50–68 −.27
5–36 −.25
*denotes item pairs from the same clusters.
Model Comparison
Table 7 offers fit statistics for the EMT. Because the Rasch unidimensional model was nested
within the RTM, model comparison could be conducted directly using the likelihood ratio
test. The results showed that testlet modeling significantly improved fit over the unidimensional model, χ²(3) = 150.59, p < .001. This suggested that the data are best modeled as an RTM.
Variance Estimate and EAP/PV Reliability of the Dimensions
Table 8 presents the variances and reliability values of the dimensions. The purified
vocabulary knowledge dimension had the highest variance in the testlet model (2.761),
and this was marginally smaller than the variance explained by the unidimensional model
(2.805). The variances of the method-specific effects ranged from 0.097 to 0.207, accounting
for 3.5–7.5% of the variance of the main dimension, which was very small. EAP reliability
values of the unidimensional model and the general dimension of the RTM were .947 and
.949, respectively, which, again, were very similar. This indicated that inflation of reliability
and reduced variance estimates in the unidimensional analysis were not evidenced in the
case of the EMT.
DISCUSSION
This paper demonstrates the use of two LID detection methods on two L2 vocabulary tests that
employ very different ratios of options to target words and that have different assumptions
regarding option recycling. By giving no explicit instructions about whether learners can
use an option more than once, Webb et al.'s (2017) UVLT could be said to implicitly
discourage option recycling. On the other hand, Stoeckel et al.’s (2024) EMT not only
offers written instructions supporting the use of the same option for multiple target words,
but also includes target items that are semantically similar and share a response option.
Hence, the EMT encourages option recycling.
Table 7. Fit statistics of the EMT.
Test Statistics Unidimensional model Testlet model
EMT Deviance 19023.25 18872.66
AIC 19205.25 19060.66
AIC corrected 19296.75 19159.88
BIC 19534.38 19400.63
Estimated parameters 91 94
Table 8. Variance estimates and reliability values of the dimensions.
EMT
Unidimensional model Testlet model
Variance estimate EAP/PV reliability Variance estimate EAP/PV reliability
Vocabulary Knowledge 2.805 .947 2.761 .949
Testlet 1 - - 0.207 .302
Testlet 2 - - 0.097 .150
Testlet 3 - - 0.198 .293
To answer research questions 1 and 2, the LII assumption of the two tests was checked
through both residual correlations and Rasch testlet modeling. Except for the 1K level of the
UVLT, which exhibited a severe violation of LII, the UVLT and the EMT showed compar-
able Q3 coefficients between items at around 0.30, which might signal LID (Aryadoust et al.,
2021). The results from model comparisons also suggested that the tests should be best
modeled as bi-factor multidimensional models, with one general knowledge dimension and several
testlet sub-dimensions. This means that the Rasch multidimensional model detected the
existence of underlying secondary dimensions other than the main ability dimension.
Research questions 3 and 4 sought to examine the variance accounted for by the Rasch
unidimensional model, the purified general knowledge dimension of the RTM, and the
testlet-related dimensions in the RTM. In-depth comparisons revealed that similar values of residual correlations did not always lead to the same findings in RTM. Although each test level had at most one item pair with a correlation greater than .30, a large
amount of variance in the 2K through 5K levels of the UVLT (Study 1) was attributed to the
cluster dimensions, denoting a large amount of construct-irrelevant variance. This indi-
cated that a considerable proportion of the variance produced by UVLT items was due not
to test takers’ vocabulary knowledge but to their familiarity with the format. This serious
interference with the general vocabulary knowledge dimension caused observable contractions of the variance explained by the unidimensional model. In other words, the general vocabulary knowledge dimension explained more variance when it was purified of the disturbance caused by the method effects, which, in turn, demonstrated the effect of LID on
unidimensional analysis. This implies that analyzing UVLT data using Rasch unidimen-
sional analysis might lead to distorted person estimates. These findings extend the conclu-
sions presented in Ha (2022) concerning the issue of LID in the UVLT. That is, LID issues
occur at all levels of the UVLT to a considerable extent. This raises yet another red flag for
the UVLT in particular and the 3:6 matching format in general.
The EMT (Study 2) had three item pairs with Q3 coefficients of greater than .30.
Nevertheless, the unidimensional model and the purified knowledge dimension in the
RTM explained a similar amount of variance, and testlet effects of the clusters were
minimal. This suggests that the disturbance caused by testlet effects was too small to pose
any threat to unidimensional analysis, and so EMT test data can be analyzed using Rasch
unidimensional modeling without concern for LID.
Empirical evidence from past research has suggested that more familiar formats tend to
yield smaller method-related variance (Baghaei & Aryadoust, 2015, pp. 82–83). Considering
the unconventionality of the EMT and the fact that the participants had not encountered it
before, our results seem to indicate the opposite. That is, the complexity of the EMT’s
design, particularly its large cluster size, partialed out more of test takers’ format-related
competence than the simpler 3:6 matching format. When sitting the EMT, because the list
of options is so long, test takers typically have to recall the meaning of a word first, and then
look for an appropriate response from the list of options. This differs from how learners
sometimes approach the 3:6 matching format of the UVLT, scanning the options in hopes
of recognizing the target word meaning (Martin, 2022). For the EMT, then, responses rely
mostly on test takers’ vocabulary knowledge rather than their competence in handling the
format. In this sense, the EMT captures a different strength of word knowledge, as described
in Nation and Coxhead (2021, pp. 102–110), compared to commonly-used meaning-
recognition formats, such as the 4-option, multiple-choice and the 3:6 matching formats.
Such strength of knowledge is close to that of a written meaning-recall test (Stoeckel et al.,
2024).
The results from this study offer supportive evidence for the validity of the EMT format
for vocabulary assessment. More broadly, the findings confirm Budescu’s (1988) assump-
tion concerning the interrelation of items and the number of options per item. That is, increasing the number of options in a matching format reduces the interdependence between items and therefore constrains construct-irrelevant variance caused by guessing.
The findings of the present study also offer practical implications for the detection and
interpretation of LID. While RTM offers an in-depth examination of LID at the cluster level,
the benefits of inspecting residual correlations for item-level LID are undeniable. However, as
an exploratory method, the examination of residual correlations does not always portray the
whole picture of LID, and overreliance on this sole metric might lead to incorrect conclu-
sions. On the other hand, RTM, as a confirmatory method, excels at both detecting multi-
dimensionality and dimension inspection. This is due to the nature of the two methods:
exploratory versus confirmatory. When residual correlation examination is conducted,
statisticians draw conclusions regarding LID based only on the correlation of item residuals
that meet a specified threshold. These conclusions could sometimes be misleading as the
approach generalizes the definition of LID in that it does not separate out the impact of test
format familiarity on LID. That is, to take any test, test takers need at least two types of
knowledge: knowledge of the subject matter being tested and knowledge of how to handle the
test format (and knowledge of the same test format across multiple items is, needless to say,
correlated). When residual correlations are inspected, this kind of format familiarity is very
likely to be captured. Item pairs that have sizable residual correlations are flagged, and
researchers cherry-pick the correlations that they deem meaningful and decide for themselves
if the test items violate the LII assumption. As all items are allowed to freely co-vary in this approach, method-related and item-related effects are mixed, which muddles the examination. In contrast, Wang and Wilson's (2005) RTM isolates the testlets according to the
conceptual design of the test, and examines the amount of variance they account for. This
offers direct information about the unique effect of each sub-test and therefore provides
more precise conclusions on the existence of LID and its effects on unidimensional analysis.
The investigation of LID should be conducted with clear theoretical backing (DeMars, 2012). In particular, we suggest that, for testlet-based instruments, the inspection of residual correlations be used only for exploratory analyses, and that Rasch testlet modeling be used as the confirmatory method. We also encourage test developers to consider employing RTM
for deeper inspection when item residuals correlate at .3 and higher. The present study
showed that tests with residual correlations around .3 can have substantial testlet-related
variances.
For tests that do not include testlets, residual correlations can be examined to detect
potential flaws in item design that may violate the conditional independence assumption.
Attention should be paid to the location of the correlated items and qualitative examination
should be conducted to see whether these items accidentally share any common stimuli
(Baghaei & Christensen, 2023). If two correlated items do not share any common stimuli
and are located far apart, then the cause of the correlation could be test
takers’ format familiarity, test-taking methods, or raters’ judgments.
CONCLUSION
By comparing two LID detection techniques under different conditions, the present
research offers additional validity evidence for vocabulary assessment under the new EMT format and provides practical guidance for how LID should be examined
in vocabulary test validation studies. That is, the inspection of residual correlations
alone can be incomplete in that the potential impact of item format on test
performance is unaccounted for. Rasch testlet modeling can separate out the impact
of item format and therefore provides more accurate LID diagnosis. Additionally,
the research found that the use of an extended matching format together with
explicit instructions permitting option recycling may be effective in preventing
LID. Although subject to further research, the findings are promising for extended-
matching formats in vocabulary testing.
Despite being informative, the current study has certain limitations. First, the
use of different participants for the two matching formats inhibits direct compar-
isons, as we cannot be sure whether differences in LID were due to dissimilarities in
test format or differences in test takers’ behaviors. Second, because the two matching
test formats differed in both cluster size and instructions for recycling, we were
unable to isolate the effects of these two variables on LID. For better understanding,
future research should examine test variants that differ along only one of these
dimensions.
Acknowledgments
We would like to express our gratitude to the Editors and the anonymous reviewers whose comments
greatly improved the manuscript.
ORCID
Hung Tan Ha http://orcid.org/0000-0002-5901-7718
Duyen Thi Bich Nguyen http://orcid.org/0000-0002-9105-9950
Tim Stoeckel http://orcid.org/0000-0002-1447-5002
DISCLOSURE STATEMENT
No potential conflict of interest was reported by the author(s).
DATA AVAILABILITY STATEMENT
Data and command scripts for Winsteps and ConQuest are openly available on our Open Science
Framework project page (Ha et al., 2023) at https://doi.org/10.17605/OSF.IO/AKRCG
AUTHORS’ CONTRIBUTION
All authors listed have made significant, direct and intellectual contributions to the paper. All the
authors read and approved the manuscript for publication.
CONSENT TO PARTICIPATE
The participants provided their written informed consent to participate in this study.
ETHICS APPROVAL
The studies involving human participants were reviewed and approved by University of Economics
Ho Chi Minh City (UEH).
REFERENCES
Adams, R. J., Cloney, D., Wu, M., Osses, A., Schwantner, V., & Vista, A. (2022). ACER ConQuest
manual. Australian Council for Educational Research. https://research.acer.edu.au/measurement/5
Adams, R. J., Wu, M. L., Cloney, D., Berezner, A., & Wilson, M. (2020). ACER ConQuest: Generalised
item response modelling software (version 5.29). [Computer software]. Australian Council for
Educational Research. https://www.acer.org/au/conquest
Aryadoust, V., Ng, L. Y., & Sayama, H. (2021). A comprehensive review of Rasch measurement in
language assessment: Recommendations and guidelines for research. Language Testing, 38(1),
6–40. https://doi.org/10.1177/0265532220927487
Baghaei, P. (2007). Local dependency and Rasch measures. Rasch Measurement Transactions, 21(3),
1105–1106. https://www.rasch.org/rmt/rmt213b.htm
Baghaei, P. (2010). A comparison of three polychotomous Rasch models for super-item analysis.
Psychological Test and Assessment Modelling, 52(3), 313–323. https://www.psychologie-aktuell.
com/fileadmin/download/ptam/3-2010_20100928/06_Baghaei.pdf
Baghaei, P. (2016). Modeling multidimensionality in foreign language comprehension tests: An
Iranian example. In V. Aryadoust & J. Fox (Eds.), Trends in language assessment research and
practice: The view from the Middle East and the Pacific Rim (pp. 47–66). Cambridge Scholars.
Baghaei, P., & Aryadoust, V. (2015). Modeling local item dependence due to common test format
with a multidimensional Rasch model. International Journal of Testing, 15(1), 71–87. https://doi.
org/10.1080/15305058.2014.941108
Baghaei, P., & Christensen, K. B. (2023). Modeling local item dependence in C-tests with the loglinear
Rasch model. Language Testing, 40(3), 820–827. https://doi.org/10.1177/02655322231155109
Baghaei, P., & Ravand, H. (2016). Modeling local item dependence in cloze and reading comprehen-
sion test items using testlet response theory. Psicológica, 37(1), 85–104. https://www.redalyc.org/
pdf/169/16943586005.pdf
Bond, T., Yan, Z., & Heene, M. (2021). Applying the Rasch model: Fundamental measurement in the
human sciences (4th ed.). Routledge. https://doi.org/10.4324/9780429030499
Brandt, S. (2017). Concurrent unidimensional and multidimensional calibration within item
response theory. Pensamiento Educativo Revista de Investigación Educacional Latinoamericana,
54(2), 1–18. https://doi.org/10.7764/PEL.54.2.2017.4
Budescu, D. V. (1988). On the feasibility of multiple matching tests—variations on a theme by Gulliksen. Applied Psychological Measurement, 12(1), 5–14. https://doi.org/10.1177/
014662168801200102
Christensen, K. B., Makransky, G., & Horton, M. (2017). Critical values for Yen’s Q3: Identification of
local dependence in the Rasch model using residual correlations. Applied Psychological
Measurement, 41(3), 178–194. https://doi.org/10.1177/0146621616677520
DeMars, C. E. (2012). Confirming testlet effects. Applied Psychological Measurement, 36(2), 104–121.
https://doi.org/10.1177/0146621612437403
Fan, J., & Bond, T. (2019). Applying Rasch measurement in language assessment. In V. Aryadoust &
M. Raquel (Eds.), Quantitative data analysis for language assessment volume I: Fundamental
techniques (pp. 83–102). Routledge.
Gyllstad, H., McLean, S., & Stewart, J. (2020). Using confidence intervals to determine adequate item
sample sizes for vocabulary tests: An essential but overlooked practice. Language Testing, 38(4),
558–579. https://doi.org/10.1177/0265532220979562
Ha, H. T. (2021). A Rasch-based validation of the Vietnamese version of the listening vocabulary
levels test. Language Testing in Asia, 11(1), 1–19. https://doi.org/10.1186/s40468-021-00132-7
Ha, H. T. (2022). Test format and local dependence of items revisited: A case of two vocabulary levels
tests. Frontiers in Psychology, 12(1), 1–6. https://doi.org/10.3389/fpsyg.2021.805450
Ha, H. T., Stoeckel, T., & Nguyen, D. T. B. (2023, November 7). Examining local dependence in
vocabulary tests using Yen’s Q3 coefficient and Rasch Testlet Model. https://doi.org/10.17605/
OSF.IO/AKRCG
Kamimoto, T. (2014). Local item dependence on the vocabulary levels test revisited. Vocabulary
Learning and Instruction, 3(2), 56–68. https://doi.org/10.7820/vli.v03.2.kamimoto
Kelderman, H. (1984). Loglinear Rasch model tests. Psychometrika, 49(2), 223–245. https://doi.org/
10.1007/BF02294174
Kreiner, S., & Christensen, K. B. (2004). Analysis of local dependence and multidimensionality in
graphical loglinear Rasch models. Communications in Statistics - Theory and Methods, 33(6),
1239–1276. https://doi.org/10.1081/STA-120030148
Kreiner, S., & Christensen, K. B. (2007). Validity and objectivity in health-related scales: Analysis by
graphical loglinear Rasch models. In M. von Davier & C. H. Carstensen (Eds.), Multivariate and
mixture distribution Rasch models: Extensions and applications (pp. 329–346). Springer-Verlag.
https://doi.org/10.1007/978-0-387-49839-3_21
Li, Y., Bolt, D. M., & Fu, J. (2006). A comparison of alternative models for testlets. Applied
Psychological Measurement, 30(1), 3–21. https://doi.org/10.1177/0146621605275414
Linacre, J. M. (2017). Teaching Rasch measurement. Rasch Measurement Transactions, 31(2),
1630–1631. www.rasch.org/rmt/rmt312.pdf
Linacre, J. M. (2023a). Winsteps® Rasch measurement computer program (version 5.6.0) [Computer
software]. Winsteps.com.
Linacre, J. M. (2023b). A user's guide to WINSTEPS® MINISTEP Rasch-model computer programs.
Program manual 5.6.0. https://www.winsteps.com/winman/copyright.htm
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.
Marais, I., & Andrich, D. (2008). Effects of varying magnitude and patterns of response dependence
in the unidimensional Rasch model. Journal of Applied Measurement, 9(2), 105–124. http://
publicifsv.sund.ku.dk/~kach/PsyLab2018/Marais,%20Andrich,%202008.pdf
Martin, J. (2022). A proposed taxonomy of test-taking action and item format in written receptive
vocabulary testing. Vocabulary Learning and Instruction, 11(1), 1–16. https://doi.org/10.7820/vli.
v11.1.martin
McLean, S. (2018). Evidence for the adoption of the flemma as an appropriate word counting unit.
Applied Linguistics, 39(6), 823–845. https://doi.org/10.1093/applin/amw050
Monseur, C., Baye, A., Lafontaine, D., & Quittre, V. (2011). PISA test format assessment and the local
independence assumption. IERI Monograph Series: Issues and Methodologies in Large Scale
Assessments, 4, 131–158. https://orbi.uliege.be/bitstream/2268/103137/1/IERI_Monograph_
Volume04_Chapter_6.pdf
Nation, I. S. P., & Coxhead, A. (2021). Measuring native-speaker vocabulary size. John Benjamins.
https://doi.org/10.1075/z.233
Nguyen, T. M. H., Gu, P., & Coxhead, A. (2024). Argument-based validation of academic collocation
tests. Language Testing, 41(3), 459–505. https://doi.org/10.1177/02655322231198499
Rijmen, F. (2010). Formal relations and an empirical comparison among the bi-factor, the testlet, and
a second-order multidimensional IRT model. Journal of Educational Measurement, 47(3),
361–372. https://doi.org/10.1111/j.1745-3984.2010.00118.x
Schmitt, N., Nation, P., & Kremmel, B. (2020). Moving the field of vocabulary assessment forward:
The need for more rigorous test development and validation. Language Teaching, 53(1), 109–120.
https://doi.org/10.1017/S0261444819000326
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring the behaviour of two new
versions of the vocabulary levels test. Language Testing, 18(1), 55–88. https://doi.org/10.1177/
026553220101800103
Stoeckel, T., Ha, H. T., Nguyen, D. T. B., & Nicklin, C. (2024). Can an extended-matching
second-language vocabulary test format bridge the gap between meaning-recognition and
meaning-recall? Research Methods in Applied Linguistics, 3(2), 1–17. https://doi.org/10.1016/j.
rmal.2024.100109
Stoeckel, T., Ishii, T., & Bennett, P. (2020). Is the lemma more appropriate than the flemma as a word
counting unit? Applied Linguistics, 41(4), 601–606. https://doi.org/10.1093/applin/amy059
Stoeckel, T., McLean, S., & Nation, P. (2021). Limitations of size and levels tests of written receptive
vocabulary knowledge. Studies in Second Language Acquisition, 43(1), 181–203. https://doi.org/10.
1017/S027226312000025X
Stoeckel, T., Stewart, J., McLean, S., Ishii, T., Kramer, B., & Matsumoto, Y. (2019). The relationship of
four variants of the vocabulary size test to a criterion measure of meaning recall vocabulary
knowledge. System, 87, 1–14. https://doi.org/10.1016/j.system.2019.102161
Tuerlinckx, F., & De Boeck, P. (2001). The effect of ignoring item interactions on the estimated
discrimination parameters in item response theory. Psychological Methods, 6(2), 181–195. https://
doi.org/10.1037/1082-989X.6.2.181
van den Wollenberg, A. L. (1982). Two new test statistics for the Rasch model. Psychometrika, 47(2),
123–140. https://doi.org/10.1007/BF02296270
Vu, D. V., & Peters, E. (2021). Vocabulary in English language learning, teaching, and testing in
Vietnam: A review. Education Science, 11(9), 563. https://doi.org/10.3390/educsci11090563
Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications. Cambridge
University Press. https://doi.org/10.1017/CBO9780511618765
Wang, W. C., & Wilson, M. (2005). The Rasch testlet model. Applied Psychological Measurement, 29
(2), 126–149. https://doi.org/10.1177/0146621604271053
Webb, S. (2021). A different perspective on the limitations of size and levels tests of written receptive
vocabulary knowledge. Studies in Second Language Acquisition, 43(2), 454–461. https://doi.org/10.
1017/S0272263121000449
Webb, S., Sasao, Y., & Ballance, O. (2017). The updated vocabulary levels test. ITL - International
Journal of Applied Linguistics, 168(1), 33–69. https://doi.org/10.1075/itl.168.1.02web
Wu, M. (2004). Plausible values. Rasch Measurement Transactions, 18(2), 976–978. https://www.
rasch.org/rmt/rmt182c.htm
Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the
three-parameter logistic model. Applied Psychological Measurement, 8(2), 125–145. https://doi.org/
10.1177/014662168400800201
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence.
Journal of Educational Measurement, 30(3), 187–213. https://doi.org/10.1111/j.1745-3984.1993.
tb00423.x
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational
measurement (4th ed. pp. 111–153). American Council on Education/Praeger.