ARTICLE
A Comparison of Yen’s Q3 Coecient and Rasch Testlet
Modeling for Identifying Local Item Dependence: Evidence
from Two Vocabulary Matching Tests
Hung Tan Ha (a), Duyen Thi Bich Nguyen (b), and Tim Stoeckel (c)
(a) Victoria University of Wellington, Wellington, New Zealand; (b) University of Economics Ho Chi Minh City (UEH), Ho Chi Minh City, Vietnam; (c) University of Niigata Prefecture, Niigata, Japan
ABSTRACT
This article compares two methods for detecting local item dependence (LID): residual correlation examination and Rasch testlet modeling (RTM), in a commonly used 3:6 matching format and an extended matching test (EMT) format. The two formats are hypothesized to facilitate different levels of item dependency due to differences in the number of options and instructions regarding option recycling. The findings indicate that (1) RTM allows deeper LID inspection compared to residual correlation examination in matching tests, and (2) the EMT format has good resistance to LID while the traditional 3:6 format does not.
INTRODUCTION
Because of the importance of receptive vocabulary knowledge in second language (L2)
learning, it is also important to consider how lexis can be reliably assessed. Problems in
vocabulary assessment and design modifications that could address such problems have
been a topic of discussion (Schmitt et al., 2020) and debate (Stoeckel et al., 2021; Webb,
2021). In addition to potential sources of inaccuracy in vocabulary tests such as testwiseness
(Stoeckel et al., 2019), the use of overly-inclusive word counting units (McLean, 2018;
Stoeckel et al., 2020), and unrepresentative item sampling (Gyllstad et al., 2020), questions
have been raised regarding whether some vocabulary tests meet the conditional indepen-
dence (CI) assumption of item response theory (IRT) models (Ha, 2022; Kamimoto, 2014).
Conditional independence between items assumes that the only source of correlation
between items is the latent trait that the test measures (Lord & Novick, 1968), and that when
this latent trait has been conditioned out, correlations between items should be zero (Lord &
Novick, 1968; Yen & Fitzpatrick, 2006). Ignoring violations of this assumption might lead to
problems in evaluating a test’s psychometric qualities which, in turn, can result in negative
consequences in score interpretation and use (Yen & Fitzpatrick, 2006).
Amongst vocabulary tests, ones that employ matching formats are the most likely to
violate this assumption, as items are put into clusters and share the same stimuli. For
example, the well-known Vocabulary Levels Test (VLT, Schmitt et al., 2001) and its updated
CONTACT Hung Tan Ha hung.ha@vuw.ac.nz School of Linguistics and Applied Language Studies, Victoria University
of Wellington, von Zedlitz Building, Kelburn Parade, Wellington 6012, New Zealand
This article has been republished with minor changes. These changes do not impact the academic content of the article.
LANGUAGE ASSESSMENT QUARTERLY
2025, VOL. 22, NO. 1, 56–76
https://doi.org/10.1080/15434303.2025.2456953
© 2025 The Author(s). Published with license by Taylor & Francis Group, LLC.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/
by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The terms
on which this article has been published allow the posting of the Accepted Manuscript in a repository by the author(s) or with their consent.
variant (Webb et al., 2017) each have clusters that contain three target words and six
response options (Figure 1).
This makes each cluster a potential super-item where the correct or incorrect answer to
one item may be dependent on the correct or incorrect response to another item (Baghaei,
2007). Vocabulary test creators tend to examine the correlations between item residuals
(Yen, 1984, 1993) to check if the items are conditionally independent (Ha, 2021; Nguyen
et al., 2024; Webb et al., 2017). However, despite being informative and easy to use, the
exploratory nature of this method offers no information beyond the potential presence of LID between test items.
The present study extends previous examinations of LID in vocabulary tests by
comparing (a) residual correlation check and (b) Rasch testlet modeling (Wang &
Wilson, 2005), a confirmatory method that allows researchers to model both knowledge-
and method-specific dimensions and estimate the unique variance accounted for by each
of them. This comparison is made to examine item dependency in two matching
vocabulary test formats. The first is a traditional matching format employed by the
Updated Vocabulary Levels Test (UVLT; Webb et al., 2017), in which target items
appear in small clusters and option recycling (i.e., using the same option multiple
times) is neither encouraged nor prohibited. The second is an extended matching test
format (EMT; Stoeckel et al., 2024) in which target words are placed in very large
clusters, and learners are instructed that the same response option may be used more
than once. It is hypothesized that the traditional matching test is more likely to
demonstrate item dependency because of its use of relatively small clusters and lack of
instructions regarding option recycling. The use of two formats with different hypothe-
sized levels of item dependency facilitates an in-depth comparison of two methods for
examining item dependency.
LITERATURE REVIEW
Local Item Dependence in the Rasch Model
The most parsimonious, effective, and arguably easiest way to understand the Rasch model
is through the relationship between a person’s ability (θ) and task difficulty (b; Bond et al.,
2021):

P_ni = f(θ_n − b_i) + ε_ni        (1)
Equation (1) expresses the probability (P) of person n with ability θ being successful on task
i with difficulty level b. There is an inverse relationship between person ability and task
difficulty such that the success rate will be higher when either person n is more able or task
i is less difficult. f is a mathematical function of the difference between ability θ and difficulty b.

Figure 1. An example of an Updated Vocabulary Levels Test (UVLT) item cluster.

For the Rasch dichotomous model, the above equation is expressed as follows when the logit link function is applied, with exp() representing the exponential function:

P(X_ni = 1) = exp(θ_n − b_i) / (1 + exp(θ_n − b_i)) + ε_ni        (2)

Note: epsilon (ε) is specific to the item–person interaction.
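As a quick numerical illustration of Equation (2) (a sketch added here, not part of the original article; the ability and difficulty values are hypothetical), the probability of a correct response can be computed directly from the ability–difficulty difference:

```python
import math

def rasch_prob(theta: float, b: float) -> float:
    """Rasch dichotomous model: P(X = 1) = exp(theta - b) / (1 + exp(theta - b))."""
    return math.exp(theta - b) / (1 + math.exp(theta - b))

# Hypothetical values: a person of average ability (theta = 0) attempting
# an easier item (b = -1) and a harder item (b = +1).
print(rasch_prob(0.0, -1.0))  # ~0.73: success is more likely on the easier item
print(rasch_prob(0.0, 1.0))   # ~0.27: success is less likely on the harder item

# The residual for an observed score x is then epsilon = x - P; for example,
# an unexpected failure on the easier item gives 0 - 0.73 = -0.73.
```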
ε in equations (1) and (2) is the residual, or the difference between observed scores from
test takers’ actual performance and the expected scores from the Rasch model. It is the
“unexplained variance” in the Rasch model. In other words, it is any performance that can
be explained by neither the person’s ability nor the task’s difficulty. It is worth noting that ε
is not explicitly mentioned in the original Rasch model, but we added it here for clarity, as it
relates to the methodology of the present study. In fact, residuals exist in all idealized
mathematical models as “observed test data collected from real life can never attain that
mathematical ideal” (Fan & Bond, 2019, p. 86). In an actual test, examinees may unexpect-
edly succeed or fail at a difficult or easy test item for many reasons. This means
a considerable amount of information, or variance, is often accounted for by these residuals
(Linacre, 2017). In a unidimensional test, these residuals are expected to be random noise,
or not to follow a specific pattern (Bond et al., 2021; Fan & Bond, 2019). In other words, test
takers’ unexpected responses should not correlate. When the residuals correlate, it means
that the answers of test takers are being predicted by something other than their ability or
task difficulty. This is a violation of local item independence (LII), a basic assumption of
Rasch and other (IRT) models. The violation itself is denoted as local item dependence
(LID; Baghaei, 2016; Fan & Bond, 2019).
Item Bundles and Local Item Dependence
LID can occur for various reasons including external assistance, rushed responses due to
lack of time, mental fatigue, and item design (Yen, 1993). Researchers tend to focus on
sources of LID that come from the test itself as they pose critical issues to the validity of
score interpretations. Tests that include testlets are believed to be the most vulnerable to
LID (Brandt, 2017). The term testlet, or item bundle, refers to a set of items that share
a common stimulus (Wang & Wilson, 2005). Typical examples of testlets in L2 assessment
include comprehension questions that share the same reading passage or items that belong
to the same cluster in a matching vocabulary test (Schmitt et al., 2001; Webb et al., 2017).
Such matching formats are popular because they are efficient. That is, by allowing several
items to share a common stimulus, test creators need less time than when separate stimuli
are required for each target word. For example, whereas three 4-option, multiple-choice
questions require a total of 12 options, only six are needed when the three target words are
placed in a single cluster and allowed to “load” on the same set of options. Moreover, the
target-to-distractor ratio would also be 1:3 for the last item that is answered, which is
identical to a traditional four-option multiple-choice format. So, theoretically, a matching
format saves time while maintaining test quality. However, there is a critical drawback.
When several items share the same stimulus, the probability of test takers answering one
item correctly is influenced by their answers to other items in the bundle (Baghaei, 2007;
Brandt, 2017). In other words, the items exhibit LID.
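To make this interdependence concrete, consider blind guessing in a 3:6 cluster when no option is recycled (an illustrative sketch added here; real test takers of course combine knowledge with guessing): each response removes an option from the shared pool, so the guessing odds for later items depend on earlier responses.

```python
# Blind-guessing odds in a 3-item, 6-option matching cluster without option
# recycling (illustrative only).
options_remaining = 6
for item in range(1, 4):
    print(f"Item {item}: 1/{options_remaining} = {1 / options_remaining:.2f}")
    options_remaining -= 1  # a chosen option cannot be selected again
# Item 1: 1/6 = 0.17, Item 2: 1/5 = 0.20, Item 3: 1/4 = 0.25
# The last item answered faces a 1:3 target-to-distractor ratio, as noted above,
# and the success probability of each item depends on the preceding responses.
```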
Research on LID
The effects of LID in IRT models have been extensively investigated. Research has suggested
that ignoring LID in unidimensional analysis may result in (1) biased estimation of item
difficulty and discrimination, (2) misleading estimate of item variance, and (3) inflation of
reliability (Bond et al., 2021; Marais & Andrich, 2008; Monseur et al., 2011; Tuerlinckx & De
Boeck, 2001; Wainer et al., 2007; Wang & Wilson, 2005; Yen, 1984). In language testing, the
effect of LID has also received the attention of researchers, and similar conclusions have
been drawn (Baghaei, 2010, 2016; Baghaei & Aryadoust, 2015; Baghaei & Christensen, 2023;
Baghaei & Ravand, 2016). However, LID remains the least reported aspect of Rasch
measurement in language assessment (Aryadoust et al., 2021; Fan & Bond, 2019). The
existence of LID can be detected using various statistical techniques including Rasch log-
linear modeling (Kelderman, 1984; Kreiner & Christensen, 2004, 2007), residual correlation
examination (van den Wollenberg, 1982; Yen, 1984, 1993), and Rasch testlet modeling
(Wang & Wilson, 2005). The present study focuses on residual correlation examination and
testlet modeling.
Residual Correlation Inspection
In the field of vocabulary assessment, LID is usually diagnosed through the inspection of
raw score or standardized residual correlations using Rasch unidimensional analysis
(Aryadoust et al., 2021; Ha, 2022). Standardized residuals can be calculated by dividing
the raw residual by its standard deviation. Raw score residual correlation, or Yen’s (1984, 1993) Q3 coefficient, is the statistic most frequently reported in Rasch analysis (Christensen et al.,
2017). The technique is as straightforward as its name suggests. It is based on the basic
assumption of most IRT models that when the variance of the target ability dimension has
been conditioned out or explained by the model, the variance that remains should be
random noise (Baghaei & Aryadoust, 2015). This technique inspects this variance to see if
it is truly random. Several critical values of Q3 statistics have been used in psychometric
studies, ranging from .1 to .7 (see Christensen et al., 2017; Fan & Bond, 2019 for a full
discussion).
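For readers who wish to reproduce this check outside Winsteps, the computation can be sketched as follows (an illustrative Python sketch added here; the function and variable names are hypothetical, and Winsteps reports these correlations directly):

```python
import numpy as np

def yen_q3(responses: np.ndarray, theta: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Yen's Q3: correlate item residuals after removing the Rasch-expected scores.
    `responses` is a persons-by-items matrix of 0/1 scores; `theta` and `b` are
    Rasch person and item estimates (e.g., exported from a Rasch analysis)."""
    expected = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))  # P(X = 1)
    residuals = responses - expected                              # raw residuals
    return np.corrcoef(residuals, rowvar=False)                   # item-by-item Q3 matrix

# Flag item pairs whose residual correlation exceeds a chosen critical value:
# q3 = yen_q3(data, theta_hat, b_hat)
# flagged = np.argwhere(np.triu(np.abs(q3), k=1) > 0.30)  # e.g., .30 for moderate LID
```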
This technique of LID inspection has both pros and cons. On the upside, the analysis is
easy to conduct, the theory is accessible, and the results are straightforward. On the down-
side, however, the information offered is limited to the magnitude of correlations between
the residuals of individual items. While these correlations may signal that test-takers’
performance is influenced by something other than the ability of interest, the source and
degree of interference are often hard to identify. Test takers usually need at least two types of
knowledge to answer test questions: knowledge of the tested subject matter, and knowledge
of how to deal with the test format. Effects of the method-related knowledge of different
items can be strongly correlated, especially when the test employs only one format. Thus,
whereas residual correlations may indicate the poor design of a multiple-choice test when
the content of one item signals the answer to another, such correlations can also signify test
takers’ familiarity with the format itself.
The residual correlations that are due to method-related ability, when not taken into
account, can result in misleading conclusions. That is, if analysts examine only the residual
correlations within item clusters and ignore those between items of different clusters, they
may overlook potential method-related dimensions in a test. The existence of dimensions
other than the general ability dimension can be a serious violation of the LII assumption in
unidimensional analysis and evidence against a test’s validity. In short, due to the limited
information residual correlations offer, this method is only good for signaling the existence
of LID at the whole test level, and not suitable for identifying LID caused by cluster-related
factors. For this, an appropriate testlet modeling method should be employed; the present study introduces one such method.
Rasch Testlet Modeling
The existence of LID and its effects can also be investigated using Rasch testlet modeling. In
2005, Wang and Wilson proposed the Rasch Testlet Model (RTM) for testlet-based tests.
They hypothesize that the probability of person n giving a correct answer to item i depends
not only on person ability (θ) and item difficulty (b), but also on the person’s ability on
the testlet d(i) to which item i belongs. Imagine a 9-item vocabulary test which assigns items into
3-item clusters; the probability of test takers (n) correctly answering the questions (i) will be
influenced by their vocabulary knowledge (θ), the difficulty of the tested words (b), and
examinees’ familiarity with the format (d). The RTM may be expressed as follows:

P(X_ni = 1) = exp(θ_n − b_i + γ_nd(i)) / (1 + exp(θ_n − b_i + γ_nd(i)))

or:

ln[P_ni / (1 − P_ni)] = θ_n − b_i + γ_nd(i)
γ_nd(i) is the effect of the testlet d that contains item i on person n, or the strength or weakness of person
n on subtest d. According to this model, person n with low ability θ can still manage to
correctly answer item i if they have strong ability γ on the specific testlet d (Wang & Wilson,
2005). With the addition of the testlet-specific subdimensions γ, the testlet model represents
a within-item, multidimensional, bi-factor model where each item loads on one general
ability dimension and one testlet-specific sub-dimension. Depending on the design of the
test, γ could be denoted as the “method” dimension (e.g., test formats), or an area-specific
ability sub-dimension (e.g., different areas in a mathematical test). Figures 2 and 3 depict
how the Rasch unidimensional model and the RTM differ. Figure 2 displays the unidimen-
sional model where all items are hypothesized to load on a single ability dimension. In
contrast, Figure 3 illustrates an RTM in which each item is loaded on not only a single
general ability dimension but also on one of several different testlet dimensions.
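A minimal sketch of the response probability under the RTM, using the notation above (added here for illustration; the numeric values are hypothetical):

```python
import math

def rtm_prob(theta: float, b: float, gamma: float) -> float:
    """Rasch testlet model: P(X = 1) = exp(theta - b + gamma) / (1 + exp(theta - b + gamma)),
    where gamma is person n's effect for the testlet containing item i."""
    z = theta - b + gamma
    return math.exp(z) / (1 + math.exp(z))

# A lower-ability person (theta = -1) on an average-difficulty item (b = 0) can
# still answer correctly fairly often if they are strong on the testlet-specific
# dimension (gamma = 1.5), as described above.
print(rtm_prob(-1.0, 0.0, 0.0))  # ~0.27 with no testlet effect
print(rtm_prob(-1.0, 0.0, 1.5))  # ~0.62 with a strong testlet effect
```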
The RTM, through model comparison, not only offers another method for the examina-
tion of LID, but also allows researchers to see the reliability indices and the amount of
variance accounted for by each subtest dimension. These metrics provide insights into
whether estimates of general ability are disturbed by the unique effects of testlets.
To effectively extract the amount of variance that the general ability and testlet dimensions
account for, the RTM assumes complete orthogonality. This means that all the dimensions are
assumed to be uncorrelated. In other words, the RTM requires that the covariance between all
subtests and the main dimension be constrained at zero. In this sense, the RTM is nested
within a bi-factor model without orthogonal constraints (Li et al., 2006; Rijmen, 2010). In the
case of a vocabulary test, this assumption of total orthogonality allows researchers to isolate
vocabulary knowledge in the main dimension from method-related variance, which results in
more precise data interpretation.
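Formally, this orthogonality constraint amounts to a diagonal covariance matrix for the person dimensions (a notational restatement added here for clarity, with D denoting the number of testlets):

```latex
\Sigma_{\text{person}} = \operatorname{diag}\left(\sigma^{2}_{\theta},\, \sigma^{2}_{\gamma_{1}},\, \ldots,\, \sigma^{2}_{\gamma_{D}}\right),
\qquad \operatorname{Cov}(\theta, \gamma_{d}) = \operatorname{Cov}(\gamma_{d}, \gamma_{d'}) = 0 \ \text{for all } d \neq d'.
```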
The effect of LID on unidimensional analysis can be examined in RTM by
comparing (a) the amount of variance explained by the general knowledge dimen-
sion in RTM with (b) the variance of a unidimensional model. If test data is truly
unidimensional, then the amount of variance explained by the general knowledge
dimension in RTM and the unidimensional model should be similar (Wang &
Wilson, 2005). However, if method-related dimensions are present, the disturbance
caused by these dimensions would result in a difference in the amount of explained
Figure 2. Unidimensional model.
Figure 3. The Rasch testlet model.
variance between the unidimensional model and the general knowledge dimension of
an RTM, with the variance of the latter usually being higher (Baghaei &
Aryadoust, 2015). Hence, evidence for the existence of testlet dimensions is clearer
in RTM. Due to its orthogonal constraints, the variance explained by a particular
dimension is unique to itself (Wang & Wilson, 2005). For unidimensional test data,
the variance explained by sub-test dimensions should be minimal, or at least too
small to disturb person ability estimates. To the best of our knowledge, no research
has employed Rasch testlet modeling to examine LID and its effects in L2 vocabulary
tests.
The Extended Matching Format
LID occurs in matching formats due to the interrelation between items within clusters. That
is, when there is no option recycling, the number of response options decreases as test-
takers match them to target words, and when initial responses are correct, the probability of
subsequent responses being correct increases. This is not the case for traditional multiple-
choice formats. Additionally, if a person wrongly selects the correct option of one item
(item A) as the answer to another item (item B), the chances of correctly answering the first
(item A) become 0%.
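A small simulation can make this dependence visible (a sketch added here; it models pure blind guessing in a single 3:6 cluster without option recycling, which is of course a simplification of real test-taking):

```python
import random

def conditional_guess_rates(n_options: int = 6, trials: int = 200_000):
    """Estimate P(item 2 guessed correctly) conditional on whether item 1 was
    guessed correctly, for blind guessing in a matching cluster where a chosen
    option cannot be reused (no recycling). Illustrative simulation only."""
    hits = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for _ in range(trials):
        options = list(range(n_options))
        key1, key2 = random.sample(options, 2)    # distinct keys for items 1 and 2
        choice1 = random.choice(options)
        options.remove(choice1)                   # no recycling: the option is used up
        choice2 = random.choice(options)
        item1_correct = choice1 == key1
        counts[item1_correct] += 1
        hits[item1_correct] += (choice2 == key2)
    return hits[True] / counts[True], hits[False] / counts[False]

p_given_correct, p_given_wrong = conditional_guess_rates()
print(p_given_correct, p_given_wrong)  # roughly 0.20 vs 0.16: the outcomes are not independent
```

With a very large option pool and explicit recycling, as in the EMT described below, the two conditional rates converge and responses become effectively independent.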
This interrelation of items in matching formats drew the attention of David Budescu in
1988. Budescu (1988) compared (a) different configurations of multiple-matching vocabu-
lary tests, with item-to-option ratios ranging from 4:4 to 8:32, to (b) a 4-option, multiple-
choice test. He found that an increase in the number of items per cluster led to fewer
successful guesses and increases in internal consistency, item discrimination, and correla-
tions with a test of language proficiency. While LID was not explicitly examined in
Budescu’s study, his findings suggest that the interdependence between items decreases as
the number of options increases, which implies that LID is less likely to occur in multiple-
matching formats with a large number of options.
In a recent effort to extend Budescu’s findings, Stoeckel et al. (2024) developed an
extended-matching test (EMT) of L2 vocabulary knowledge. This paper-and-pencil test
had three sections, one each for verbs, nouns, and adjectives. Each section contained a single
cluster with 130 response options and 30 target words, 6 from each of the five frequency-
based levels of Form B of the UVLT. Target words appeared in short, lexically and
syntactically simple prompt sentences, and the options were listed in alphabetical order
across five columns (Figure 4). The EMT’s instructions stated that the same option could be
used for multiple target words, and in fact, the test contained three pairs of target words,
each of which shared the same key. For example, weed and grass shared the key cỏ in the
noun section of the test. In other words, option recycling was an explicit aspect of the test
format.
Theoretically, the design of Stoeckel et al.’s (2024) EMT should reduce the inter-
dependence of items by increasing the number of options per item and encouraging
option recycling. When compared to meaning-recall criterion measures, the EMT did
not significantly differ in terms of internal consistency, mean scores (indicating very few
successful guesses), and correlations with reading comprehension (Stoeckel et al., 2024),
suggesting a very strong degree of item independency. In the present paper, LID in the
EMT is examined.
PURPOSE
The present research seeks to compare the use of two LID inspection techniques, residual
correlation examination and RTM, in two matching vocabulary tests. As residual correlations
offer information on neither the effect of LID on unidimensional analysis nor the amount of
variance accounted for by individual testlets, we hypothesize that tests with similar residual
correlation examination results could exhibit different LID effects. The paper aims to provide
additional validity evidence for the EMT, while introducing a technique for inspecting LID at
the testlet level. To this end, the paper reports the results of two studies where the LII
assumption is investigated in two matching test formats that employ different target-to-
distractor ratios and different instructions regarding option recycling.
In each study, the research questions (RQ) were:
(1) How strongly do the item residuals correlate?
(2) Do the data fit better under a unidimensional model or a testlet model?
(3) Are there any substantial differences in the amount of variance accounted for by the
unidimensional model and the purified general knowledge dimension of the RTM?
(4) What are the variances accounted for by the testlet-dimensions in the RTM?
Figure 4. Verb section of the EMT.
STUDY 1
In Study 1, LID was investigated in a traditional matching vocabulary test. Ha’s (2022)
UVLT dataset was used. In his study, Ha examined the residual correlations between 150
items of the UVLT. While problematic residual correlations were found, most of them
belonged to the 1K level. The present study separated the UVLT into five 30-item tests
according to their frequency levels and investigated LID in each level. In addition, RTM was
applied as a confirmatory method for LID detection.
Method
Participants
Participants included 311 students from a highly ranked university in Vietnam. All the
participants were second-year non-English majors who had completed at least 7 years of
compulsory English education from grades six to 12 (ages 12–18; Vu & Peters, 2021) and
two Business English courses at the tertiary level.
Instruments
Form B of the UVLT (Webb et al., 2017) was used. This test has 150 target words. The
UVLT employs a matching format where each 3-item cluster shares 6 options (Figure 1).
Option recycling is not mentioned in test instructions. However, as the nature of matching
formats does not typically involve recycling, without explicit instruction allowing for option
recycling, test takers may implicitly understand that they can choose an option only once.
Because of this, and because of the large difference in cluster size in the two tests examined,
the UVLT is hypothesized to have the higher item dependency within clusters.
Data Analysis
For RQ 1, raw score residual correlations, or Yen’s (1984, 1993) Q3 coefficients, were
examined using Winsteps 5.6.3.0 (Linacre, 2023a). Shared variance between item residuals
was calculated by squaring the residual correlations. Correlations at ± 0.3, where two items
share approximately 10% of their residual variance in common, and ± 0.7, where common
residual variance is approximately 50%, were employed as the thresholds for moderate and
serious LID, respectively (Aryadoust et al., 2021; Linacre, 2023b).
For RQ 2, test data were subjected to Rasch unidimensional and testlet modeling using
ACER Conquest 5.34.2 (Adams et al., 2020). Model fit statistics were then compared. As the
unidimensional model is nested within the RTM, direct comparison was conducted using
a likelihood ratio test, with χ² equaling the difference in −2 log-likelihoods, or deviance, between the two models. Similarly, the associated degrees of freedom were calculated as the difference in the number of estimated parameters between the two compared models (Baghaei
& Aryadoust, 2015). For RQs 3 and 4, the variance and reliability of the dimensions in the
unidimensional and testlet models were estimated using ACER Conquest. Expected
a posteriori method of estimation based on plausible values (EAP/PV) was applied
because it was designed to “obtain consistent estimates of population characteristics in
assessment situations where individuals are administered too few items to allow precise
estimates of their ability” (American Institute for Research, cited in Wu, 2004, p. 976). For
Rasch unidimensional analysis in Conquest, Gauss-Hermite Quadrature method of
64 H. T. HA ET AL.
estimation was applied. According to Adams et al. (2022) this method of estimation is most
suitable for models with fewer than three dimensions. For Rasch testlet modeling, Monte
Carlo simulation with 2,000 nodes was applied, following the suggestion of Adams et al.
(2022). The mean of the item parameters on each dimension was constrained to zero for
identification purposes. Orthogonal constraints between the dimensions were applied using
the covariance matrix anchoring function in Conquest.
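The likelihood ratio test described above can be reproduced with a few lines of code (a sketch added here using SciPy; the original analyses were run in ConQuest):

```python
from scipy.stats import chi2

def likelihood_ratio_test(deviance_uni: float, deviance_testlet: float,
                          params_uni: int, params_testlet: int):
    """Chi-square equals the difference in deviances (-2 log-likelihoods) of the
    nested models; degrees of freedom equal the difference in parameter counts."""
    chi_sq = deviance_uni - deviance_testlet
    df = params_testlet - params_uni
    return chi_sq, df, chi2.sf(chi_sq, df)

# Worked example with the UVLT 1K deviances reported in Table 3 (31 vs 41 parameters):
# chi-square = 69.59, df = 10, p < .001.
print(likelihood_ratio_test(2951.51, 2881.92, 31, 41))
```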
Data and command scripts for Winsteps and ConQuest are openly available on our Open Science Framework project page (Ha et al., 2023) at https://doi.org/10.17605/OSF.IO/AKRCG.
Results
Rasch Separation and Reliability
Table 1 offers Rasch reliability and separation indices for the five 30-item test levels of the
UVLT. Except for the 1K level, which was constrained by a ceiling effect, each level shows
acceptable person separation and reliability, suggesting that in this regard the levels each
perform well as individual tests.
Residual Correlation
Table 2 presents the highest Q3 coefficients for the five levels in the UVLT. The results from
the 1K level showed severe LII violations with items in the same clusters highly correlating
at .90 and 1.00. This could be due to the fact that these students were exposed to the UVLT
and its format for the first time, which possibly led to their initial responses reflecting not
only vocabulary knowledge, but also relative incompetence in handling the test format. This
interpretation is supported by the fact that only three item pairs correlated at greater than
.30 in the remaining levels. This may signal that test takers were more familiar with the test
format after the 1K level, and therefore tended to use more of their vocabulary knowledge to
answer questions, which led to more independent responses. However, the high item
correlations in the 1K level might also simply be a result of the ceiling effect observed in
that level. That is, when nearly all of the responses are correct for an item pair, just a few
unexpected responses can produce a high residual correlation.
Model Comparison
Table 3 presents the fit statistics for model comparison of the five 30-item levels of the
UVLT. Likelihood ratio tests indicated that the data fit the RTM significantly better than the
unidimensional model at p < .001 for all five levels of the UVLT.
Table 1. UVLT rasch reliability and separation.
Item Person
Reliability Separation Reliability Separation
UVLT 1k .92 3.29 .17 0.45
UVLT 2k .98 7.46 .82 2.14
UVLT 3k .99 9.47 .89 2.85
UVLT 4k .99 9.17 .88 2.67
UVLT 5k .98 7.65 .85 2.38
Variance Estimate and EAP/PV Reliability of the Dimensions
Table 4 shows the variance of the overall vocabulary knowledge dimension and the 10
method-specific dimensions (Testlet 1–10) based on the unidimensional and testlet models.
Except for the 3K level, reliability inflation was observable for all cases. Variance examina-
tion showed that the method-specific subdimensions in the UVLT accounted for a sizable
Table 2. Highest residual correlations for the UVLT.
UVLT 1k UVLT 2k UVLT 3k UVLT 4k UVLT 5k
Items Q3 Items Q3 Items Q3 Items Q3 Items Q3
5–6* 1.00 40–41* .31 73–75* .38 94–95* .27 142–143* .31
26–27* .90 34–36* .26 69–75 .28 95–96* .21 148–150* .27
1–3* .65 34–35* .23 86–87* .21 94–96* .20 125–126* .25
6–7 .48 58–59* .22 79–80* .17 107–119 .17 149–150* .22
5–7 .48 55–56* .20 76–77* .17 97–98* .16 146–147* .22
16–18* .45 52–54* .20 74–78 −.21 110–115 .15 131–146 −.23
4–7 .43 38–55 .19 61–79 −.19 99–106 −.21 127–145 −.23
14–15* .41 43–44* .18 66–73 −.19 96–120 −.21 131–150 −.23
2–3* .38 31–33* .18 72–83 −.18 94–102 −.20 125–141 −.23
12–18 .38 46–47* .17 72–79 −.18 115–118 −.20 122–140 −.22
4–6* .38 36–50 −.25 66–75 −.18 91–108 −.20 121–136 −.21
4–5* .38 44–52 −.21 67–84 −.17 95–102 −.19 124–150 −.21
2–19 .29 51–59 −.21 68–84 −.17 91–104 −.19 123–138 −.20
7–9* .28 31–46 −.20 79–84 −.17 91–102 −.18 124–145 −.19
5–11 .27 32–56 −.19 63–87 −.17 97–107 −.17 121–138 −.19
3–23 −.35 50–54 −.18 76–89 −.17 96–107 −.17 124–143 −.19
9–28 −.30 36–55 −.18 75–83 −.17 93–111 −.17 127–143 −.19
8–17 −.28 33–56 −.18 72–81 −.16 108–112 −.17 130–133 −.19
1–23 −.28 40–55 −.17 81–85 −.16 103–114 −.17 121–137 −.19
12–17 −.27 43–50 −.17 66–76 −.16 103–113 −.16 124–142 −.19
*denotes item pairs from the same clusters.
Table 3. Fit statistics for the UVLT.
Test Statistics Unidimensional model Testlet model
UVLT 1k Deviance 2951.51 2881.92
AIC 3013.51 2963.92
AIC corrected 3020.63 2976.72
BIC 3129.45 3117.25
Estimated parameters 31 41
UVLT 2k Deviance 7479.97 7358.68
AIC 7541.97 7440.68
AIC corrected 7549.08 7453.48
BIC 7657.90 7594.01
Estimated parameters 31 41
UVLT 3k Deviance 8796.96 8726.40
AIC 8858.96 8808.40
AIC corrected 8866.07 8821.21
BIC 8974.89 8961.73
Estimated parameters 31 41
UVLT 4k Deviance 9073.48 9005.15
AIC 9135.48 9087.15
AIC corrected 9142.59 9099.95
BIC 9251.42 9240.48
Estimated parameters 31 41
UVLT 5k Deviance 8735.34 8618.37
AIC 8797.34 8700.37
AIC corrected 8804.45 8713.18
BIC 8913.28 8853.71
Estimated parameters 31 41
Table 4. Variance estimates and reliability values of the dimensions.
UVLT 1K
Unidimensional model Testlet model
Variance estimate EAP/PV reliability Variance estimate EAP/PV reliability
Vocabulary Knowledge 3.626 .732 3.792 .641
Testlet 1 0.417 .015
Testlet 2 0.580 .017
Testlet 3 0.569 .043
Testlet 4 0.491 .035
Testlet 5 1.072 .165
Testlet 6 1.677 .070
Testlet 7 0.404 .056
Testlet 8 1.138 .146
Testlet 9 4.285 .175
Testlet 10 2.638 .274
UVLT 2K
Unidimensional model Testlet model
Variance estimate EAP/PV reliability Variance estimate EAP/PV reliability
Vocabulary Knowledge 2.824 .873 3.470 .869
Testlet 1 0.658 .175
Testlet 2 0.869 .091
Testlet 3 0.483 .083
Testlet 4 2.242 .388
Testlet 5 1.420 .203
Testlet 6 1.674 .321
Testlet 7 1.118 .190
Testlet 8 0.864 .183
Testlet 9 0.885 .202
Testlet 10 0.809 .214
UVLT 3K
Unidimensional model Testlet model
Variance estimate EAP/PV reliability Variance estimate EAP/PV reliability
Vocabulary Knowledge 2.618 .898 2.951 .921
Testlet 1 1.278 .224
Testlet 2 0.144 .048
Testlet 3 1.431 .260
Testlet 4 1.238 .290
Testlet 5 1.035 .216
Testlet 6 0.358 .125
Testlet 7 0.564 .174
Testlet 8 0.801 .210
Testlet 9 0.262 .094
Testlet 10 0.190 .062
UVLT 4K
Unidimensional model Testlet model
Variance estimate EAP/PV reliability Variance estimate EAP/PV reliability
Vocabulary Knowledge 2.291 .903 2.717 .882
Testlet 1 0.883 .229
Testlet 2 2.600 .294
Testlet 3 1.528 .242
Testlet 4 0.882 .202
Testlet 5 0.743 .209
Testlet 6 0.293 .095
Testlet 7 0.428 .150
Testlet 8 0.325 .122
Testlet 9 0.342 .099
Testlet 10 0.717 .218
UVLT 5K
Unidimensional model Testlet model
Variance estimate EAP/PV reliability Variance estimate EAP/PV reliability
Vocabulary Knowledge 3.017 .904 3.879 .898
Testlet 1 0.385 .122
Testlet 2 0.464 .131
Testlet 3 1.265 .260
Testlet 4 0.801 .187
Testlet 5 1.135 .202
Testlet 6 0.511 .145
Testlet 7 0.959 .200
Testlet 8 2.569 .348
Testlet 9 1.059 .238
Testlet 10 2.230 .329
amount of variance, peaking at 113% of the variance of the general knowledge dimension
for the UVLT 1K. That is, the variance explained by dimension 9 (4.285) was 1.13 times that
of the general knowledge dimension (3.792). While the other four levels did not show such a spike, the largest amounts of variance explained by the testlet effects were as large as 64.6%, 48.5%, 95.7%, and 66.2% of the variance of the main dimension for the 2K, 3K, 4K, and 5K levels, respectively. The amount of variance the testlet effects accounted for clearly demonstrated a substantial disturbance to estimates of test takers’ ability. Due to this disturbance, the unidimensional models were unable to explain as much variance as the purified vocabulary knowledge dimensions in the RTMs; this could be observed in all five levels of the UVLT. The practical implication is that person measures from Rasch unidimensional analysis are disturbed; that is, they are influenced by method-specific effects and therefore might not represent test takers’ true ability.
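The percentages above can be reproduced directly from the Table 4 estimates (a small sketch added here for transparency):

```python
# Largest testlet variance relative to the general-dimension variance, per level
# (variance estimates taken from Table 4).
largest_testlet = {"1K": 4.285, "2K": 2.242, "3K": 1.431, "4K": 2.600, "5K": 2.569}
general_dim =     {"1K": 3.792, "2K": 3.470, "3K": 2.951, "4K": 2.717, "5K": 3.879}
for level, testlet_var in largest_testlet.items():
    print(f"UVLT {level}: {testlet_var / general_dim[level]:.1%}")
# 1K: 113.0%, 2K: 64.6%, 3K: 48.5%, 4K: 95.7%, 5K: 66.2%
```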
STUDY 2
In Study 2, Q3 coefficients were calculated and RTM was applied to examine LID in a novel
extended-matching test format. The data were previously used by the authors in a separate
study.
Method
Participants
The participants were 275 students at a top-tier university in southern Vietnam. They included
both English majors and non-English majors of various disciplines. Convenience sampling was
applied; the participants were students from classes taught by one of the authors.
Instruments
Vocabulary was assessed with a 90-item EMT that was introduced in a criterion validation
study by Stoeckel et al. (2024).
Data Analysis
The data were analyzed as described for Study 1, except that the EMT was analyzed as
a whole.
Results
Rasch Separation and Reliability
Table 5 presents Rasch unidimensional reliability and separation values for persons and
items for the EMT. The reliability estimates were greater than .80, and separation indices
were greater than 2, suggesting a reliable test (Linacre, 2023b).
Residual Correlation
Table 6 lists the highest raw score residual correlations (Yen’s Q3 coefficients) of the EMT.
Three item pairs were shown to have correlations of greater than .30, the benchmark for
moderate LID. Among these, items 61, 62 and 63 were in the same 30-item cluster and
located close to each other. No items were found to correlate at .70 or above, our threshold
for serious LID.
Table 5. Rasch reliability and separation for the EMT.
Item Person
Reliability Separation Reliability Separation
EMT .99 8.92 .96 4.95
Table 6. Highest residual correlations for the EMT.
EMT
Items Q3
61–63* .34
62–63* .33
35–41* .31
1–35 .29
46–49* .28
79–90* .28
10–43 .27
39–47 .26
5–6* .26
5–46 .25
13–17* .25
19–50 .25
15–18* .25
16–22* .25
31–45* .25
9–22* .24
1–5* .24
20–64 −.29
50–68 −.27
5–36 −.25
*denotes item pairs from
the same clusters.
Model Comparison
Table 7 offers fit statistics for the EMT. Because the Rasch unidimensional model was nested
within the RTM, model comparison could be conducted directly using the likelihood ratio
test. The results showed that testlet modeling significantly improved fit over the unidimensional model, χ²(3) = 150.59, p < .001. This suggested that the data are best modeled as an RTM.
Variance Estimate and EAP/PV Reliability of the Dimensions
Table 8 presents the variances and reliability values of the dimensions. The purified
vocabulary knowledge dimension had the highest variance in the testlet model (2.761),
and this was marginally smaller than the variance explained by the unidimensional model
(2.805). The variances of the method-specific effects ranged from 0.097 to 0.207, accounting
for 3.5–7.5% of the variance of the main dimension, which was very small. EAP reliability
values of the unidimensional model and the general dimension of the RTM were .947 and
.949, respectively, which, again, were very similar. This indicated that inflation of reliability
and reduced variance estimates in the unidimensional analysis were not evidenced in the
case of the EMT.
DISCUSSION
This paper demonstrates the use of two LID-detection methods on two L2 vocabulary tests that
employ very different ratios of options to target words and that have different assumptions
regarding option recycling. By giving no explicit instructions about whether learners can
use an option more than once, Webb et al.’s (2017) UVLT could be said to implicitly
discourage option recycling. On the other hand, Stoeckel et al.’s (2024) EMT not only
offers written instructions supporting the use of the same option for multiple target words,
but also includes target items that are semantically similar and share a response option.
Hence, the EMT encourages option recycling.
Table 7. Fit statistics of the EMT.
Test Statistics Unidimensional model Testlet model
EMT Deviance 19023.25 18872.66
AIC 19205.25 19060.66
AIC corrected 19296.75 19159.88
BIC 19534.38 19400.63
Estimated parameters 91 94
Table 8. Variance estimates and reliability values of the dimensions.
EMT
Unidimensional model Testlet model
Variance estimate EAP/PV reliability Variance estimate EAP/PV reliability
Vocabulary Knowledge 2.805 .947 2.761 .949
Testlet 1 - - 0.207 .302
Testlet 2 - - 0.097 .150
Testlet 3 - - 0.198 .293
To answer research questions 1 and 2, the LII assumption of the two tests was checked
through both residual correlations and Rasch testlet modeling. Except for the 1K level of the
UVLT, which exhibited a severe violation of LII, the UVLT and the EMT showed compar-
able Q3 coefficients between items at around 0.30, which might signal LID (Aryadoust et al.,
2021). The results from model comparisons also suggested that the tests are best modeled as bi-factor, multidimensional models, with one general knowledge dimension and several
testlet sub-dimensions. This means that the Rasch multidimensional model detected the
existence of underlying secondary dimensions other than the main ability dimension.
Research questions 3 and 4 sought to examine the variance accounted for by the Rasch
unidimensional model, the purified general knowledge dimension of the RTM, and the
testlet- related dimensions in the RTM. In-depth comparisons revealed that similar values
of residual correlations did not always lead to the same findings in RTM. Despite having
only one item pair that displayed a correlation of greater than .30 per test level, a large
amount of variance in the 2K through 5K levels of the UVLT (Study 1) was attributed to the
cluster dimensions, denoting a large amount of construct-irrelevant variance. This indi-
cated that a considerable proportion of the variance produced by UVLT items was due not
to test takers’ vocabulary knowledge but to their familiarity with the format. This serious
interference to the general vocabulary knowledge dimension caused observable contrac-
tions of the variance explained by the unidimensional model. In other words, the general
vocabulary knowledge dimension explained more variance when it was purified from the
annoyance caused by the method effects, which, in turn, demonstrated the effect of LID on
unidimensional analysis. This implies that analyzing UVLT data using Rasch unidimen-
sional analysis might lead to distorted person estimates. These findings extend the conclu-
sions presented in Ha (2022) concerning the issue of LID in the UVLT. That is, LID issues
occur at all levels of the UVLT to a considerable extent. This raises yet another red flag for
the UVLT in particular and the 3:6 matching format in general.
The EMT (Study 2) had three item pairs with Q3 coefficients of greater than .30.
Nevertheless, the unidimensional model and the purified knowledge dimension in the
RTM explained a similar amount of variance, and testlet effects of the clusters were
minimal. This suggests that the disturbance caused by testlet effects was too small to pose
any threat to unidimensional analysis, and so EMT test data can be analyzed using Rasch
unidimensional modeling without concern for LID.
Empirical evidence from past research has suggested that more familiar formats tend to
yield smaller method-related variance (Baghaei & Aryadoust, 2015, pp. 82–83). Considering
the unconventionality of the EMT and the fact that the participants had not encountered it
before, our results seem to indicate the opposite. That is, the complexity of the EMT’s
design, particularly its large cluster size, partialed out more of test takers’ format-related
competence than the simpler 3:6 matching format. When sitting the EMT, because the list
of options is so long, test takers typically have to recall the meaning of a word first, and then
look for an appropriate response from the list of options. This differs from how learners
sometimes approach the 3:6 matching format of the UVLT, scanning the options in hopes
of recognizing the target word meaning (Martin, 2022). For the EMT, then, responses rely
mostly on test takers’ vocabulary knowledge rather than their competence in handling the
format. In this sense, the EMT captures a different strength of word knowledge, as described
in Nation and Coxhead (2021, pp. 102–110), compared to commonly-used meaning-
recognition formats, such as the 4-option, multiple-choice and the 3:6 matching formats.
Such strength of knowledge is close to that of a written meaning-recall test (Stoeckel et al.,
2024).
The results from this study offer supportive evidence for the validity of the EMT format
for vocabulary assessment. More broadly, the findings confirm Budescu’s (1988) assump-
tion concerning the interrelation of items and the number of options per item. That is, as
the number of options is increased in a matching format, it reduces the interdependence
between items and therefore constrains irrelevant variance caused by guessing.
The findings of the present study also offer practical implications for the detection and
interpretation of LID. While RTM offers an in-depth examination of LID at the cluster level,
the benefits of inspecting residual correlations for item-level LID are undeniable. However, as
an exploratory method, the examination of residual correlations does not always portray the
whole picture of LID, and overreliance on this sole metric might lead to incorrect conclu-
sions. On the other hand, RTM, as a confirmatory method, excels at both detecting multi-
dimensionality and dimension inspection. This is due to the nature of the two methods:
exploratory versus confirmatory. When residual correlation examination is conducted,
statisticians draw conclusions regarding LID based only on the correlation of item residuals
that meet a specified threshold. These conclusions could sometimes be misleading as the
approach generalizes the definition of LID in that it does not separate out the impact of test
format familiarity on LID. That is, to take any test, test takers need at least two types of
knowledge: knowledge of the subject matter being tested and knowledge of how to handle the
test format (and knowledge of the same test format across multiple items is, needless to say,
correlated). When residual correlations are inspected, this kind of format familiarity is very
likely to be captured. Item pairs that have sizable residual correlations are flagged, and
researchers cherry-pick the correlations that they deem meaningful and decide for themselves
if the test items violate LII assumption. As all items are allowed to freely co-vary in this
approach, method-related and item-related effects are mixed, which muddles the examina-
tion. On the contrary, Wang and Wilson’s (2005) RTM isolates the testlets according to the
conceptual design of the test, and examines the amount of variance they account for. This
offers direct information about the unique effect of each sub-test and therefore provides
more precise conclusions on the existence of LID and its effects on unidimensional analysis.
The investigation of LID should be conducted with clear theoretical backings (DeMars,
2012). Particularly, we suggest that, for testlet-based instruments, the inspection of residual
correlations should be used only for exploratory analyses, and that Rasch testlet modeling be used as the confirmatory method. We also encourage test developers to consider employing RTM
for deeper inspection when item residuals correlate at .3 and higher. The present study
showed that tests with residual correlations around .3 can have substantial testlet-related
variances.
For tests that do not include testlets, residual correlations can be examined to detect
potential flaws in item design that may violate the conditional independence assumption.
Attention should be paid to the location of the correlated items and qualitative examination
should be conducted to see whether these items accidentally share any common stimuli
(Baghaei & Christensen, 2023). If two correlated items do not share any common stimuli
and are located distant from each other, then what causes the correlation could be test
takers’ format familiarity, test-taking methods, or raters’ judgments.
CONCLUSION
By comparing two LID detection techniques under different conditions, the present
research offers additional validity evidence for the vocabulary assessment under the
new EMT format, and provides practical guidance for how LID should be examined
in vocabulary test validation studies. That is, the inspection of residual correlations
alone can be incomplete in that the potential impact of item format on test
performance is unaccounted for. Rasch testlet modeling can separate out the impact
of item format and therefore provides more accurate LID diagnosis. Additionally,
the research found that the use of an extended matching format together with
explicit instructions permitting option recycling may be effective in preventing
LID. Although subject to further research, the findings are promising for extended-
matching formats in vocabulary testing.
Despite being informative, the current study bears certain limitations. First, the
use of different participants for the two matching formats inhibits direct compar-
isons, as we cannot be sure whether differences in LID were due to dissimilarities in
test format or differences in test takers’ behaviors. Second, because the two matching
test formats differed in both cluster size and instructions for recycling, we were
unable to isolate the effects of these two variables on LID. For better understanding,
future research should examine test variants that differ along only one of these
dimensions.
Acknowledgments
We would like to express our gratitude to the Editors and the anonymous reviewers whose comments
greatly improved the manuscript.
ORCID
Hung Tan Ha http://orcid.org/0000-0002-5901-7718
Duyen Thi Bich Nguyen http://orcid.org/0000-0002-9105-9950
Tim Stoeckel http://orcid.org/0000-0002-1447-5002
DISCLOSURE STATEMENT
No potential conflict of interest was reported by the author(s).
DATA AVAILABILITY STATEMENT
Data and command scripts for Winsteps and ConQuest are openly available on our Open Science
Framework project page (Ha et al., 2023) at https://doi.org/10.17605/OSF.IO/AKRCG
AUTHORS’ CONTRIBUTION
All authors listed have made significant, direct and intellectual contributions to the paper. All the
authors read and approved the manuscript for publication.
CONSENT TO PARTICIPATE
The participants provided their written informed consent to participate in this study.
ETHICS APPROVAL
The studies involving human participants were reviewed and approved by University of Economics
Ho Chi Minh City (UEH).
REFERENCES
Adams, R. J., Cloney, D., Wu, M., Osses, A., Schwantner, V., & Vista, A. (2022). ACER ConQuest
manual. Australian Council for Educational Research. https://research.acer.edu.au/measurement/5
Adams, R. J., Wu, M. L., Cloney, D., Berezner, A., & Wilson, M. (2020). ACER ConQuest: Generalised
item response modelling software (version 5.29). [Computer software]. Australian Council for
Educational Research. https://www.acer.org/au/conquest
Aryadoust, V., Ng, L. Y., & Sayama, H. (2021). A comprehensive review of Rasch measurement in
language assessment: Recommendations and guidelines for research. Language Testing, 38(1),
6–40. https://doi.org/10.1177/0265532220927487
Baghaei, P. (2007). Local dependency and Rasch measures. Rasch Measurement Transactions, 21(3),
1105–1106. https://www.rasch.org/rmt/rmt213b.htm
Baghaei, P. (2010). A comparison of three polychotomous rasch models for super-item analysis.
Psychological Test and Assessment Modelling, 52(3), 313–323. https://www.psychologie-aktuell.
com/fileadmin/download/ptam/3-2010_20100928/06_Baghaei.pdf
Baghaei, P. (2016). Modeling multidimensionality in foreign language comprehension tests: An
Iranian example. In V. Aryadoust & J. Fox (Eds.), Trends in language assessment research and
practice: The view from the Middle East and the Pacific Rim (pp. 47–66). Cambridge Scholars.
Baghaei, P., & Aryadoust, V. (2015). Modeling local item dependence due to common test format
with a multidimensional Rasch model. International Journal of Testing, 15(1), 71–87. https://doi.
org/10.1080/15305058.2014.941108
Baghaei, P., & Christensen, K. B. (2023). Modeling local item dependence in C-tests with the loglinear
Rasch model. Language Testing, 40(3), 820–827. https://doi.org/10.1177/02655322231155109
Baghaei, P., & Ravand, H. (2016). Modeling local item dependence in cloze and reading comprehen-
sion test items using testlet response theory. Psicológica, 37(1), 85–104. https://www.redalyc.org/
pdf/169/16943586005.pdf
Bond, T., Yan, Z., & Heene, M. (2021). Applying the Rasch model: Fundamental measurement in the
human sciences (4th ed.). Routledge. https://doi.org/10.4324/9780429030499
Brandt, S. (2017). Concurrent unidimensional and multidimensional calibration within item
response theory. Pensamiento Educativo Revista de Investigación Educacional Latinoamericana,
54(2), 1–18. https://doi.org/10.7764/PEL.54.2.2017.4
Budescu, D. V. (1988). On the feasibility of multiple matching tests— variations on a theme by
Gulliksen. Applied Psychological Measurement, 12(1), 5–14. https://doi.org/10.1177/
014662168801200102
Christensen, K. B., Makransky, G., & Horton, M. (2017). Critical values for Yen’s Q3: Identification of
local dependence in the Rasch model using residual correlations. Applied Psychological
Measurement, 41(3), 178–194. https://doi.org/10.1177/0146621616677520
DeMars, C. E. (2012). Confirming testlet effects. Applied Psychological Measurement, 36(2), 104–121.
https://doi.org/10.1177/0146621612437403
Fan, J., & Bond, T. (2019). Applying Rasch measurement in language assessment. In V. Aryadoust &
M. Raquel (Eds.), Quantitative data analysis for language assessment volume I: Fundamental
techniques (pp. 83–102). Routledge.
Gyllstad, H., McLean, S., & Stewart, J. (2020). Using confidence intervals to determine adequate item
sample sizes for vocabulary tests: An essential but overlooked practice. Language Testing, 38(4),
558–579. https://doi.org/10.1177/0265532220979562
Ha, H. T. (2021). A Rasch-based validation of the Vietnamese version of the listening vocabulary
levels test. Language Testing in Asia, 11(1), 1–19. https://doi.org/10.1186/s40468-021-00132-7
Ha, H. T. (2022). Test format and local dependence of items revisited: A case of two vocabulary levels
tests. Frontiers in Psychology, 12(1), 1–6. https://doi.org/10.3389/fpsyg.2021.805450
Ha, H. T., Stoeckel, T., & Nguyen, D. T. B. (2023, November 7). Examining local dependence in
vocabulary tests using Yen’s Q3 coefficient and Rasch Testlet Model. https://doi.org/10.17605/
OSF.IO/AKRCG
Kamimoto, T. (2014). Local item dependence on the vocabulary levels test revisited. Vocabulary
Learning and Instruction, 3(2), 56–68. https://doi.org/10.7820/vli.v03.2.kamimoto
Kelderman, H. (1984). Loglinear Rasch model tests. Psychometrika, 49(2), 223–245. https://doi.org/
10.1007/BF02294174
Kreiner, S., & Christensen, K. B. (2004). Analysis of local dependence and multidimensionality in
graphical loglinear Rasch models. Communications in Statistics - Theory and Methods, 33(6),
1239–1276. https://doi.org/10.1081/STA-120030148
Kreiner, S., & Christensen, K. B. (2007). Validity and objectivity in health-related scales: Analysis by
graphical loglinear Rasch models. In M. von Davier & C. H. Carstensen (Eds.), Multivariate and
mixture distribution Rasch models: Extensions and applications (pp. 329–346). Springer-Verlag.
https://doi.org/10.1007/978-0-387-49839-3_21
Li, Y., Bolt, D. M., & Fu, J. (2006). A comparison of alternative models for testlets. Applied
Psychological Measurement, 30(1), 3–21. https://doi.org/10.1177/0146621605275414
Linacre, J. M. (2017). Teaching Rasch measurement. Rasch Measurement Transactions, 31(2),
1630–1631. www.rasch.org/rmt/rmt312.pdf
Linacre, J. M. (2023a). Winsteps® Rasch measurement computer program (version 5.6.0) [Computer
software]. Winsteps.com.
Linacre, J. M. (2023b). A user’s Guide to WINSTEPS® MINISTEP rasch-model computer programs.
Program manual 5.6.0. https://www.winsteps.com/winman/copyright.htm
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.
Marais, I., & Andrich, D. (2008). Effects of varying magnitude and patterns of response dependence
in the unidimensional Rasch model. Journal of Applied Measurement, 9(2), 105–124. http://
publicifsv.sund.ku.dk/~kach/PsyLab2018/Marais,%20Andrich,%202008.pdf
Martin, J. (2022). A proposed taxonomy of test-taking action and item format in written receptive
vocabulary testing. Vocabulary Learning and Instruction, 11(1), 1–16. https://doi.org/10.7820/vli.
v11.1.martin
McLean, S. (2018). Evidence for the adoption of the flemma as an appropriate word counting unit.
Applied Linguistics, 39(6), 823–845. https://doi.org/10.1093/applin/amw050
Monseur, C., Baye, A., Lafontaine, D., & Quittre, V. (2011). PISA test format assessment and the local
independence assumption. IERI Monograph Series: Issues and Methodologies in Large Scale
Assessments, 4, 131–158. https://orbi.uliege.be/bitstream/2268/103137/1/IERI_Monograph_
Volume04_Chapter_6.pdf
Nation, I. S. P., & Coxhead, A. (2021). Measuring native-speaker vocabulary size. John Benjamins.
https://doi.org/10.1075/z.233
Nguyen, T. M. H., Gu, P., & Coxhead, A. (2024). Argument-based validation of academic collocation
tests. Language Testing, 41(3), 459–505. https://doi.org/10.1177/02655322231198499
Rijmen, F. (2010). Formal relations and an empirical comparison among the bi-factor, the testlet, and
a second-order multidimensional IRT model. Journal of Educational Measurement, 47(3),
361–372. https://doi.org/10.1111/j.1745-3984.2010.00118.x
Schmitt, N., Nation, P., & Kremmel, B. (2020). Moving the field of vocabulary assessment forward:
The need for more rigorous test development and validation. Language Teaching, 53(1), 109–120.
https://doi.org/10.1017/S0261444819000326
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring the behaviour of two new
versions of the vocabulary levels test. Language Testing, 18(1), 55–88. https://doi.org/10.1177/
026553220101800103
Stoeckel, T., Ha, H. T., Nguyen, D. T. B., & Nicklin, C. (2024). Can an extended-matching
second-language vocabulary test format bridge the gap between meaning-recognition and
meaning-recall? Research Methods in Applied Linguistics, 3(2), 1–17. https://doi.org/10.1016/j.
rmal.2024.100109
Stoeckel, T., Ishii, T., & Bennett, P. (2020). Is the lemma more appropriate than the flemma as a word
counting unit? Applied Linguistics, 41(4), 601–606. https://doi.org/10.1093/applin/amy059
Stoeckel, T., McLean, S., & Nation, P. (2021). Limitations of size and levels tests of written receptive
vocabulary knowledge. Studies in Second Language Acquisition, 43(1), 181–203. https://doi.org/10.
1017/S027226312000025X
Stoeckel, T., Stewart, J., McLean, S., Ishii, T., Kramer, B., & Matsumoto, Y. (2019). The relationship of
four variants of the vocabulary size test to a criterion measure of meaning recall vocabulary
knowledge. System, 87, 1–14. https://doi.org/10.1016/j.system.2019.102161
Tuerlinckx, F., & De Boeck, P. (2001). The effect of ignoring item interactions on the estimated
discrimination parameters in item response theory. Psychological Methods, 6(2), 181–195. https://
doi.org/10.1037/1082-989X.6.2.181
van den Wollenberg, A. L. (1982). Two new test statistics for the Rasch model. Psychometrika, 47(2),
123–140. https://doi.org/10.1007/BF02296270
Vu, D. V., & Peters, E. (2021). Vocabulary in English language learning, teaching, and testing in Vietnam: A review. Education Sciences, 11(9), 563. https://doi.org/10.3390/educsci11090563
Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications. Cambridge
University Press. https://doi.org/10.1017/CBO9780511618765
Wang, W. C., & Wilson, M. (2005). The Rasch testlet model. Applied Psychological Measurement, 29
(2), 126–149. https://doi.org/10.1177/0146621604271053
Webb, S. (2021). A different perspective on the limitations of size and levels tests of written receptive
vocabulary knowledge. Studies in Second Language Acquisition, 43(2), 454–461. https://doi.org/10.
1017/S0272263121000449
Webb, S., Sasao, Y., & Ballance, O. (2017). The updated vocabulary levels test. ITL - International
Journal of Applied Linguistics, 168(1), 33–69. https://doi.org/10.1075/itl.168.1.02web
Wu, M. (2004). Plausible values. Rasch Measurement Transactions, 18(2), 976–978. https://www.
rasch.org/rmt/rmt182c.htm
Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the
three-parameter logistic model. Applied Psychological Measurement, 8(2), 125–145. https://doi.org/
10.1177/014662168400800201
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence.
Journal of Educational Measurement, 30(3), 187–213. https://doi.org/10.1111/j.1745-3984.1993.
tb00423.x
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 111–153). American Council on Education/Praeger.