Abstract

Hashimoto (2021) reported a correlation of −.50 (r² = .25) between word frequency rank and difficulty, concluding the construct of modern vocabulary size tests is questionable. In this response we show that the relationship between frequency and difficulty is clear albeit non-linear and demonstrate that if a wider range of frequencies is tested and log transformations are applied, the correlation can approach .80. Finally, while we acknowledge the great promise of knowledge-based word lists, we note that a strong correlation between difficulty and frequency is not, in fact, the primary reason size tests are organized by frequency.
This is an Accepted Manuscript of an article published by Taylor & Francis in
Language Assessment Quarterly on October 21st, 2021, available
online: https://www.tandfonline.com/doi/abs/10.1080/15434303.2021.1992629
The Relationship Between Word Difficulty and Frequency: A
Response to Hashimoto (2021)
Jeffrey Stewartᵃ*, Stuart McLeanᵇ, Joseph P. Vittaᶜ, Christopher Nicklinᵈ,
Geoffrey G. Pinchbeckᵉ and Brandon Kramerᶠ
ᵃInstitute of Arts and Sciences, Tokyo University of Science, Tokyo, Japan; ᵇBusiness
Administration, Momoyamagakuin University, Osaka, Japan; ᶜFaculty of Languages
and Cultures, Kyushu University, Fukuoka, Japan; ᵈCenter for Foreign Language
Research and Education, Rikkyo University, Tokyo, Japan; ᵉSchool of Linguistics and
Language Studies, Carleton University, Ottawa, Canada; ᶠSchool of Education,
Kwansei Gakuin University, Nishinomiya, Japan.
*Corresponding author: Jeffrey Stewart, Tokyo University of Science, Institute of Arts
and Sciences, Building 1, 1 Chome-3 Kagurazaka, Shinjuku City, Tokyo 162-8601.
jeffjrstewart@gmail.com
Keywords: vocabulary testing; frequency; yes/no tests; Zipf’s law; log
transformation
Introduction
We read Hashimoto’s (2021) paper in Language Assessment Quarterly with interest.
Hashimoto stated that L2 vocabulary size tests are based on the notion that the more
frequent a word is in a language, the more likely it is that a learner will know that word.
To examine this, he correlated Rasch logit item difficulty estimates for words tested
on his Vocabulary of American-English Size Test (VAST) (Hashimoto, 2016;
Hashimoto & Egbert, 2019) with the frequency ranks of those same words in the
Corpus of Contemporary American English (COCA) (Davies, 2008-). The correlation was only
moderate (r = -.50; r² = .25), and from this he concluded that the construct of modern
vocabulary size tests was thrown into question.
We would like to raise the following points in response. First, while the
relationship between word frequency and difficulty is not entirely linear, a strong
relationship is nevertheless clearly present: after replicating Hashimoto’s results using a
data set based on COCA frequency and responses to target words from the Vocabulary
Size Test (VST) (Nation & Beglar, 2007; Beglar, 2010), we found that although we
attained a rank/difficulty correlation similar to Hashimoto’s (r = -.53, accounting for
about 28% [r²] of the variance), we then observed a considerably improved r of -.78 (r²
= .61)¹ for a 5-parameter (5P) logistic curve, indicating a strong but non-linear
relationship. Second, after accounting for the non-linear nature of the data by log-
transforming COCA frequency counts, we attained a final linear model that indicated
the existence of a stronger relationship, r = .78 (r² = .61). While still not perfect, this
coefficient demonstrates that the relationship between word frequency and difficulty
can be considerably stronger than Hashimoto believes, depending on the choice of
model used and the range of word frequencies that is examined. Third, it should be
noted that a strong correlation between word frequency and difficulty is not, in fact, the
primary reason that L2 vocabulary tests are organized by word frequency, and the
reasons for sectioning such tests by frequency band should be considered when
designing future tests.
In the response that follows we will discuss these findings and observations in
detail. Our aim is to demonstrate that the strength of the relationship between word
frequency and difficulty can change considerably depending on how it is modelled.
First, we investigate the nonlinear relationship between word difficulty and frequency.
Second, we investigate the relationship between word difficulty and log-transformed
frequency counts. Finally, we conclude by discussing the rationale for organizing
vocabulary tests by frequency and add our own thoughts on how vocabulary size tests
can be improved. We suggest how word lists based on learner difficulty rather than
frequency can be used for diagnostic purposes, and how such lists might be used in
conjunction with traditional frequency-based lists in optimizing designs of language
pedagogy.
Methods
For this response, we will examine a data set of word frequencies and self-reported
difficulties obtained from a Yes/No vocabulary test for illustrative purposes. We drew
from unpublished data from a Yes/No checklist test which used the same item stems as
what is perhaps the best-known vocabulary size test in adult SLA literature, the VST
(Beglar, 2010; Nation & Beglar, 2007). Although we typically prefer testing word
difficulty with meaning-recall tests due to their often superior reading proficiency
predictive power (McLean et al., 2020; Zhang & Zhang, 2020), for this response learner
self-reports were used in order to keep our results more analogous to Hashimoto’s.²
Similarly, although 30 pseudowords were tested alongside the real items, as in
Hashimoto’s study this information was not incorporated into difficulty estimates for
the real items. Our data set contained responses to 5 items each from the first 14 1,000-
word band levels on the VST. These items were not selected for the test based on prior
knowledge of their difficulty (Nation, personal communication). The 70 items were
completed by 397 Japanese university students, none of whom were English majors.
The Yes/No test was completed online in the classroom, and the collected data had a
Cronbach’s alpha of 0.88. Students were presented with item stems used in the
VST with the target word underlined and in bold (e.g., poor). Two of the items from
low-frequency bands could not be found on COCA’s online resource (Davies, 2020),
meaning it was not possible to compare their frequency and difficulty using
Hashimoto’s measures. They were therefore cut from the analysis, leaving a total of 68
items.³ Pearson correlation coefficients and a 5P logistic curve analysis were used to
investigate the relationship between item difficulty, as expressed by the Yes/No scores,
and COCA frequency and rank.
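As a minimal sketch of the first of these analyses, the Pearson coefficient and the shared variance (r²) reported throughout this response can be computed as follows. The five data points and the variable names are hypothetical and serve only to show the calculation; they are not values from our data set.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: COCA rank of five words and the proportion of
# learners who answered "Yes" to each word.
ranks = [150, 1200, 4800, 9500, 13000]
prop_yes = [0.98, 0.90, 0.55, 0.30, 0.25]

r = pearson_r(ranks, prop_yes)
r_squared = r ** 2  # proportion of variance shared by the two variables
print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
```

With data of this shape the coefficient is negative, since a larger rank number corresponds to a less frequent word.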
The Non-Linear Relationship Between Word Difficulty and COCA
Frequency
Figure 1 displays the correlation we attained, r = 0.34 (r² = 0.12), between word
difficulty and COCA frequency, which was larger than Hashimoto’s, r = -.15 (r² = .02;
the conceptual direction is the same given his use of logits). However, an assumptions
check of the general linear model, which underpins all parametric testing, revealed
problems with the correlation, with violations ranging from nonlinearity to non-
independence and heteroskedasticity of the residuals (see Appendix A for assumptions
checks of correlations). As highlighted by Schützenmeister et al. (2012) and Osborne
and Waters (2002), such violations can bias the model’s parameters and point
estimates, and in this case the number of violations made it imprudent to place any
confidence in the ability of the model (i.e., the correlation) to support inferences about the
population. Perhaps due to such issues, Hashimoto then correlated word difficulty with
COCA frequency rank. For illustrative purposes we do the same with our own data,
adding a line of fit from a linear regression, as can be seen in Figure 2.
Figure 1
Scatter Plot of Self-Reported Word Difficulty by COCA Frequency
Note. One outlier, the very high-frequency word see, has been excluded to retain
perspective on the remaining data points.
Figure 2
Scatter Plot of Self-Reported Word Difficulty by COCA Rank with a Linear Regression
Line
The scatter plot in Figure 2 represents an r coefficient of -.53 (r² = .28), near
Hashimoto’s r value of -.50 (r² = .25). However, closer inspection indicates that while
more workable, our data is still not quite linear in nature (see Appendix A). The
majority of highly ranked (i.e., high-frequency) words in the COCA list are very easy for
learners, yet these points are treated as outliers from the model’s slope rather than as
confirmation of the frequency-difficulty relationship. Initially, there is a sharp descent
in difficulty for COCA ranks between approximately 1 and 10,000, followed by a weaker
relationship thereafter for lower-frequency words. The data appears to be curved,
indicating a non-linear relationship between word frequency rank and difficulty. We
therefore attempted a non-linear regression with a 5P logistic curve (Burnham, 2020) for
exploratory purposes and attained a substantially better fit (r = -.78, r² = .61), as can be
seen in Figure 3.
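For readers unfamiliar with this family of curves, the sketch below shows one common parameterization of a 5P logistic function and how the variance it explains can be summarized. The functional form, the parameter names, the parameter values (chosen by eye rather than estimated) and the data points are all assumptions made for illustration; they are not the model or estimates reported above, and curve-fitting software may define the five parameters differently.

```python
import math

def logistic_5p(x, bottom, top, slope, midpoint, asymmetry):
    """One common 5-parameter logistic form (an assumed parameterization).

    bottom and top are the two asymptotes, slope controls steepness,
    midpoint shifts the curve along x, and asymmetry skews it.
    """
    return bottom + (top - bottom) / (1.0 + math.exp(-slope * (x - midpoint))) ** asymmetry

def r_squared(observed, predicted):
    """Proportion of variance explained: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical ranks (in thousands) and proportions of "Yes" responses.
ranks_k = [0.2, 1.0, 3.0, 6.0, 9.0, 12.0, 14.0]
prop_yes = [0.97, 0.93, 0.70, 0.40, 0.28, 0.24, 0.22]

# Hypothetical parameter values chosen by eye, not estimated from any data.
params = dict(bottom=0.2, top=1.0, slope=-0.6, midpoint=4.5, asymmetry=1.0)
predicted = [logistic_5p(x, **params) for x in ranks_k]
fit = r_squared(prop_yes, predicted)
print(f"R^2 of the illustrative curve: {fit:.2f}")
```

In practice the five parameters would be estimated by a non-linear least-squares routine rather than set by hand; the point here is only that a curve with two asymptotes can follow a steep early change and a flat tail that a straight line cannot.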
Figure 3
Scatter Plot of Self-Reported Word Difficulty by COCA Rank Fit with a 5P Logistic
Curve
The Relationship Between Word Difficulty and Log-Transformed Frequency
Counts
Although non-linear models of this nature may hold potential for capturing the
relationship between word difficulty and numerous measures of word frequency, a
drawback of this approach is that researchers often use word frequency as one of several
explanatory variables in various linear models. It is possible to accommodate word
frequency within common linear models such as multiple regression by using a standard
procedure in L2 vocabulary research, log transformation of raw frequency counts (e.g.,
Kyle & Crossley, 2015; 2016).
There is a firm theoretical basis for conducting log transformations on frequency
counts. According to Zipf’s law (1935), there is an inversely proportional relationship
between frequency and rank, in that human language comprises a small number of highly
frequent words, a larger number of words with medium frequency, and multitudes of
low frequency words. When plotted, this proportional relationship follows a Zipfian
distribution, which is illustrated in Figure 4a with data from the most frequent 50 words
in Alice in Wonderland⁴ and can be found in all forms of organized language genres,
spoken or written. Zipf interpreted the law in terms of speaker and interlocutor
requirement, whereby a limited set of highly frequent vocabulary items conserves effort
and increases the efficiency of spoken communication (Manning & Schütze, 1999).
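The effect of the transformation on a Zipfian distribution can be sketched with idealized counts. The example assumes frequencies that follow frequency = C / rank exactly, with an arbitrary constant C; real corpus counts only approximate this, and none of the values below come from the Alice in Wonderland data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Idealized Zipfian counts for the 50 most frequent words:
# frequency = C / rank, where C is an arbitrary (hypothetical) constant.
C = 10000.0
ranks = list(range(1, 51))
freqs = [C / rank for rank in ranks]

# Linear correlation on the raw scale, where the curve drops steeply and
# then flattens.
r_raw = pearson_r(ranks, freqs)

# Linear correlation after taking base-10 logs of both variables.
log_ranks = [math.log10(rank) for rank in ranks]
log_freqs = [math.log10(f) for f in freqs]
r_log = pearson_r(log_ranks, log_freqs)

print(f"raw: r = {r_raw:.2f}; log-log: r = {r_log:.2f}")
```

Under this idealization the logged values fall on a straight line, whereas a straight line describes the raw values poorly, which is the motivation for working with log-transformed counts in linear models.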
Figure 4
The Frequency Distribution of the 50 Most Frequent Words in Alice in Wonderland
Plotted as (a) Raw and (b) Log-Transformed Frequency Counts
Despite the ubiquitous nature of Zipfian distributions in natural language, such
distributions do not lend themselves to the typical statistical analyses employed in
linguistic and corpus-based research, which more often than not employ parametric
testing driven by the general linear model. As the name implies, this analysis assumes
that the observed and predicted case values within the model have a linear relationship,
which Osborne and Waters (2002) demonstrated by modeling regression residual plots.
To address this issue, researchers often log-transform frequency counts, which
improves such linearity, as can be seen in Figure 4b. This practice has been employed in
the operationalization of frequency as a component of lexical sophistication (e.g., Edwards
& Collins, 2011; Kyle & Crossley, 2015; 2016; Kim & Crossley, 2018; Durrant et al.,
2019). An added benefit of log transformations over rank is that the coefficients of log-
transformed frequency predictors can easily be converted back to meaningful
associations with the dependent variable, aiding interpretability of model predictions. It
should be noted that log transformations cannot be viewed as a cure-all for linearity and
other testing assumptions in all cases; in other situations the transformation may create a
variable which differs from the construct on which the initial theoretical underpinnings
were hypothesized (Field, 2018; Lo & Andrews, 2015). However, due to the close
resemblance of the Zipfian curve to a log-normal distribution, the transformation is
justifiable in this context. Hashimoto (2021) acknowledged log-transformed frequency
as a viable alternative to his chosen methodology and recommended further research to
investigate its case.
Given the non-linear shape of the data, we next applied a log transformation to
COCA frequency counts. We employed the log-transformed raw frequency measures
computed automatically by TAALES (Kyle et al., 2018). An important distinction to note
is that TAALES COCA
variables are derived from orthographic (and/or phonographic) form data in lieu of
lemma frequency counts, which may differ somewhat from lemma-based data (see
Appendix B for details). We observed a correlation of .78 (r² = .61) in a model that met
all assumptions of the general linear model (see Appendix A). A scatter plot of the
relationship can be seen in Figure 5. The transformation resulted in a substantially more
linear relationship between the two variables and indicated
that the relationship between word frequency and difficulty can be considerably
stronger than Hashimoto’s results suggest.
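The interpretability of a log-transformed predictor can be illustrated with a simple least-squares line. The counts and proportions below are hypothetical, and the helper name fit_line is introduced only for this sketch; it does not reproduce the TAALES-based model described above.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical raw corpus counts and proportions of learners reporting
# knowledge of each word.
raw_counts = [500000, 50000, 5000, 500, 50]
prop_yes = [0.95, 0.80, 0.60, 0.45, 0.25]

log_counts = [math.log10(c) for c in raw_counts]
intercept, slope = fit_line(log_counts, prop_yes)

# Because the predictor is log10(frequency), the slope is the expected
# change in the outcome for each tenfold increase in raw frequency.
print(f"Each tenfold increase in frequency: {slope:+.3f} in proportion known")
```

A coefficient on rank has no comparably direct reading, because a difference of one rank position corresponds to very different frequency differences at the top and bottom of a word list.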
Figure 5
Scatter Plot of Self-Reported Word Difficulty by Log of COCA Frequency
Discussion
Our illustrative examples come with some important caveats. While our analyses may
suggest future directions for the modelling of word frequency and difficulty, we would
like to emphasize that our example data set samples only 68 words from 14,000 word
families, and that these illustrative analyses should not be seen as conclusive until a
larger study can be conducted. Further research with more robust samples is needed to
confirm the implied trends. The goal of our analyses has simply been to demonstrate
that the strength of the relationship between word frequency and difficulty can change
considerably depending on how the relationship is modelled and how frequency is
operationalized.
An additional caveat should also be considered when comparing our illustrative
data to Hashimoto’s findings. Unlike the VST, which samples words from the first
14,000 word families, Hashimoto’s test, VAST, only draws from the first 5,000 most
common words in the COCA. Therefore, non-linearity will likely not be as important an
issue for Hashimoto’s data set if rank is used as the predictor variable, as the curve we
observed in our own data using rank was more apparent when items were sampled
beyond the 10,000 rank. The range of a predictor variable plays an important role in the
strength of a correlation. Our correlation was likely higher in large part due to the wider
variety of word frequencies sampled. Had Hashimoto tested word difficulty up to the
10,000th frequency rank, he likely would have seen a higher correlation, even before
applying a log transformation to the data. Conversely, although we do not have
sufficient numbers of items per test section to demonstrate this principle to a statistically
significant degree, the correlation between frequency and difficulty within a single
1,000-word band would likely be much lower.
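The role of predictor range can be sketched with constructed data in which the underlying relationship is identical across the whole range and only the sampled portion differs. All values are hypothetical; a fixed alternating error term is used in place of random noise so that the result is reproducible.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical predictor covering a wide range (0.0 to 13.9 in steps of 0.1).
x_all = [0.1 * i for i in range(140)]

# Outcome = predictor plus an error of fixed size that alternates in sign,
# so the error spread is the same everywhere along the range.
y_all = [x + (2.0 if i % 2 == 0 else -2.0) for i, x in enumerate(x_all)]

# Correlation over the full range versus over the first 50 points only
# (0.0 to 4.9), mimicking a test that samples a narrower span.
r_full = pearson_r(x_all, y_all)
r_restricted = pearson_r(x_all[:50], y_all[:50])

print(f"full range: r = {r_full:.2f}; restricted range: r = {r_restricted:.2f}")
```

The restricted sample yields the lower coefficient even though the slope and error are unchanged, because the predictor accounts for less of the total variance when its own spread is small.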
Finally, choice of corpora should be considered when drawing comparisons
between frequency and difficulty. Although we do not have a sufficient number of items
to demonstrate statistically significant differences, it is possible that a large general
corpus of English, such as the web-based corpus Ententen12 (Kilgarriff et al., 2014),
would work well for modelling a wide range of vocabulary difficulty, and that corpora of
informal speech (e.g., the BNC-Spoken demographic sub-corpus; BNC Consortium, 2001) or of
TV/movie closed captions (e.g., CBBC and SubtlexUK, van Heuven et al., 2014;
SubtlexUS, Brysbaert & New, 2009) would work well with a range of vocabulary
limited to high and moderately frequent words, such as that used in Hashimoto (xx)
(Pinchbeck et al., unpublished). These possibilities warrant further exploration in
future studies.
However, even if the precise degree of the strength of the relationship between
word difficulty and frequency remains in dispute, we believe there are other reasons to
organize vocabulary size tests by frequency. There is a sound rationale for organizing
tests by word frequency, and frequency will likely remain an important consideration
for test design even if learner word difficulty is accounted for in future tests. As has
been repeatedly noted in L2 vocabulary acquisition literature, the 2,000 most common
word families in English comprise approximately 80% of spoken and written English,
and the first 5,000 likely cover over 90% (Laufer & Ravenhorst-Kalovski, 2010). As
vocabulary coverage follows a power law, learning less-frequent words results in
diminishing returns, wherein more and more additional, less frequent words must be
learned to attain comparable gains in text coverage. Therefore, learning a relatively low
frequency word known only by a given fraction of a population of learners will be of
less utility than learning a more frequent word known by the same fraction of learners,
even if the difficulty of the two is identical. To illustrate this, suppose two learners each
have English vocabulary sizes of 3,000 words. One learner knows 500 words from each of
the first six 1,000-word levels. Another learner knows all of the 3,000 most frequent words
(i.e., has a 3,000-word lexical mastery level), but none of the next 3,000. When reading a
text, the second learner will know a considerably higher proportion of the words on a
given page, and in practice is likely to be judged as having stronger fluency in the
language. Therefore, an advantage of L2 vocabulary tests based on word frequency is
that since they are organized by word frequency level, it is easy for educators and
researchers to focus on subsets of words that, if studied, are more likely to markedly
improve learner text coverage. In short, frequency-based vocabulary tests provide a
measure of knowledge of vocabulary that is the most useful within whatever corpus-
register is being used to represent the target language of instruction.
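The two-learner comparison can be made concrete under a simplifying assumption. The sketch below treats token frequency as exactly proportional to 1/rank over a hypothetical list of 20,000 word types, and assumes the first learner knows the 500 most frequent words within each band. Both assumptions are ours for illustration, and the sketch works with individual ranks rather than word families, so the percentages it prints should not be read as empirical coverage figures.

```python
def zipf_coverage(known_ranks, total_types):
    """Share of running words covered, assuming frequency proportional to 1/rank."""
    total_mass = sum(1.0 / rank for rank in range(1, total_types + 1))
    known_mass = sum(1.0 / rank for rank in known_ranks)
    return known_mass / total_mass

TOTAL_TYPES = 20000  # hypothetical size of the full word list

# Learner A: the 500 most frequent words in each of the first six
# 1,000-word bands (ranks 1-500, 1001-1500, ..., 5001-5500).
learner_a = [band * 1000 + offset for band in range(6) for offset in range(1, 501)]

# Learner B: all of the 3,000 most frequent words and none beyond.
learner_b = list(range(1, 3001))

coverage_a = zipf_coverage(learner_a, TOTAL_TYPES)
coverage_b = zipf_coverage(learner_b, TOTAL_TYPES)
print(f"A: {coverage_a:.1%}  B: {coverage_b:.1%}")
```

Both learners know 3,000 words, yet the second covers a larger share of running text, because the words in the second and third bands occur more often than the words the first learner knows in the fourth to sixth bands.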
We acknowledge the advantages of collecting learner difficulty data for
assessment, research and pedagogy, and we are keenly interested in the potential uses
for this variable in the future. However, we believe that vocabulary tests will continue
to reference word frequency information despite this coming shift in the field. We will
conclude our response with an attempt to reconcile further debate about the validity of
frequency-based vs. knowledge-based approaches to vocabulary testing. Rather than
argue the value of one over the other, we propose that these two approaches offer
complementary information to practitioners and to those who design and develop
language programs.
First, we propose that tests that sample vocabulary items based on frequency
and/or dispersion in a corpus should be used to measure the extent to which learners are
proficient in the vocabulary of a given target language. We appreciate that Hashimoto
intended his comments for size rather than levels tests. However, the primary rationale
for measuring learner L2 vocabulary size in the first place is because vocabulary
knowledge is an important predictor of broader forms of L2 proficiency. Vocabulary
size estimates that can be analyzed by frequency bands likely have greater utility for
such predictions. Frequency-based word lists, from which test items are sampled,
provide us with an operationalization of words that learners should know, and provide
rankings of word usefulness. Furthermore, frequency-based pedagogical word lists
provide a way to prioritize target word forms in course syllabi that include vocabulary
as an explicit program outcome; this is the primary reason that the AWL (and analogous
lists, such as the AVL) is commonly used in EAP programs, since it was compiled
using frequency and dispersion information from academic texts.
Second, we propose that knowledge-based tests be used to understand what
learners already know, irrespective of the usefulness of those words in the target language.
Knowledge-based tests and knowledge-based scales of vocabulary would allow teachers
and language program materials developers to know more precisely which words in
candidate course text-passages learners are likely to know (and not know), and this
would provide better estimates of readability for individual text-passages than would
frequency-based lexical indexes. This approach would allow individual learner factors
to be incorporated into models of readability, which varies somewhat according to the
learner’s first language and other demographic factors. For example, Japanese learners
of English are much more likely to know English loan words that are commonly used in
Japanese. Similarly, speakers of Romance languages have advantages in learning
cognates shared by other Romance languages, in addition to those cognates that are
shared with English. In this way, text-passages could be optimally chosen for specific
course levels and even for specific groups of learners within classes.
Taken together, information from each of these types of tests provides the two
things that teachers need to know: 1) what the learners need to learn (corpus-based scale
tests) and 2) what the learners know now (knowledge-based item-scaled tests). The
nexus between where the learners are and where they need to go is where teaching and
learning ideally take place. Diagnostic vocabulary tests based on scales of knowledge
could assist with matching learners to texts and would provide teachers with an estimate
of what target vocabulary learners already know. Frequency-based word lists could then
be used to identify target words from within the words that learners do not likely
already know for explicit instruction (e.g., AWL for an EAP class).
1 As with Hashimoto (2021) and as suggested by Plonsky (2013), r² is used as the effect size of
consequence as it describes the amount of variance shared between the variables (i.e., the
strength of association).
2 Meaning recall responses were also available for the data set used in this paper. The correlation
to COCA rank was -0.542 compared to -0.533 for Yes/No responses, and the difference
was not statistically significant (Steiger’s z = 0.245, p > 0.8). While it is possible a larger set
of items could establish a significant difference, this result suggests that relative to
correlations of two proficiency levels for the same learner, test item format does not make
as large a difference with correlations to frequency data.
3 A sensitivity power analysis assuming a 5% alpha and 20% beta threshold revealed that our
sample size was powered to detect a minimal effect size of r = .32 (r2 = .10) which is
reasonable vis-a-vis Hashimoto’s value of most interest (r = .50).
4 The data was created by Parr (n.d.) and was retrieved from
https://codepen.io/adrianparr/pen/jwmjmv?js-preprocessor=babel
References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood
principle. In B. N. Petrov & F. Caski (Eds.), Proceedings for the 2nd
International Symposium on Information Theory (pp. 267-281). Akademiai
Kiado.
Beglar, D. (2010). A Rasch-based validation of the Vocabulary Size Test. Language
Testing, 27(1), 101-118. https://doi.org/10.1177/0265532209340194
BNC Consortium. (2001). The British national corpus (Version 2) (BNC World).
Oxford University, Computing Services [Distributor].
http://www.natcorp.ox.ac.uk
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical
evaluation of current word frequency norms and the introduction of a new and
improved word frequency measure for American English. Behavior Research
Methods, 41(4), 977-990. https://doi.org/10.3758/BRM.41.4.977
Burnham, D. (2020, May 13). Curve fitting. JMP Ahead. http://www.pega-
analytics.co.uk/blog/curve-fitting/
Burnham, K. & Anderson, D. (2002). Model selection and multimodel inference: A
practical information‐theoretic approach (2nd ed.). Springer.
Davies, M. (2008-). The Corpus of Contemporary American English (COCA).
Available online at https://www.english-corpora.org/coca/
Davies, M. (2020). Corpus of contemporary American English: Top 60,000 words
(lemmas) in the corpus. Available at: https://www.english-corpora.org/coca
(accessed February 2021).
Durrant, P., Moxley, J., & McCallum, L. (2019). Vocabulary sophistication in first-year
composition assignments. International Journal of Corpus Linguistics, 24(1),
33-66. https://doi.org/10.1075/ijcl.17052.dur
Edwards, R., & Collins, L. (2011). Lexical frequency profiles and Zipf’s law. Language
Learning, 61(1), 1-30. https://doi.org/10.1111/j.1467-9922.2010.00616.x
Field, A. (2018). Discovering statistics using IBM SPSS Statistics (5th Ed.). Sage.
Hashimoto, B. J. (2016). Rethinking vocabulary size tests: Frequency versus item
difficulty (Unpublished master’s thesis). Brigham Young University. Retrieved
from
https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=6957&context=etd
Hashimoto, B. J., & Egbert, J. (2019). More than frequency: Exploring predictors of
word difficulty for second language learners. Language Learning, 69(4), 839-
872. https://doi.org/10.1111/lang.12353
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P.
& Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography, 1(1), 7-
36.
Kim, H. (2013). Statistical notes for clinical researchers: Assessing normal distribution
using skewness and kurtosis. Restorative Dentistry & Endodontics, 38(1), 52-54.
https://doi.org/10.5395/rde.2013.38.1.52
Kyle, K., & Crossley, S. A. (2015). Automatically assessing lexical sophistication:
Indices, tools, findings, and application. TESOL Quarterly, 49(4), 757-786.
https://doi.org/10.1002/tesq.194
Kyle, K., & Crossley, S. (2016). The relationship between lexical sophistication and
independent and source-based writing. Journal of Second Language Writing, 34,
12-24. https://doi.org/10.1016/j.jslw.2016.10.003
Kyle, K., Crossley, S., & Berger, C. (2018). The tool for the automatic analysis of
lexical sophistication (TAALES): Version 2.0. Behavior Research Methods,
50(3), 1030-1046. https://doi.org/10.3758/s13428-017-0924-4
Laufer, B., & Ravenhorst-Kalovski, G. C. (2010). Lexical threshold revisited: Lexical
text coverage, learners’ vocabulary size and reading comprehension. Reading in
a Foreign Language, 22(1), 15-30.
Lo, S., & Andrews, S. (2015). To transform or not to transform: Using generalized
linear mixed models to analyze reaction time data. Frontiers in Psychology, 6,
1171. https://doi.org/10.3389/fpsyg.2015.01171
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language
processing. MIT Press.
Nation, I. S. P., & Beglar, D. (2007). A vocabulary size test. The Language Teacher,
31(7), 9-13. Retrieved from https://jalt-
publications.org/files/pdf/the_language_teacher/07_2007tlt.pdf
Osborne, J. W., & Waters, E. (2002). Four assumptions of multiple regression that
researchers should always test. Practical Assessment, Research, & Evaluation,
8(2), 1-5. https://doi.org/10.7275/r222-hv23
Pek, J., Wong, O., & Wong, C. M. (2017). Data transformations for inference with
linear regression: Clarifications and recommendations. Practical Assessment,
Research & Evaluation, 22(9), 1-11. Retrieved from
https://pareonline.net/getvn.asp?v=22&n=9
Pinchbeck, G.G., Brown, D., McLean, S., & Kramer, B. (in review).
Plonsky, L., & Gonulal, T. (2015). Methodological synthesis in quantitative L2
research: A review of reviews and a case study of exploratory factor analysis.
Language Learning, 65(S1), 9-36. https://doi.org/10.1111/lang.12111
Schützenmeister, A., Jensen, U., & Piepho, H.-P. (2012). Checking normality and
homoscedasticity in the general linear model using diagnostic plots.
Communications in Statistics - Simulation and Computation, 41(2), 141-154.
https://doi.org/10.1080/03610918.2011.582560
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2),
461-464.
van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-
UK: A new and improved word frequency database for British English. The
Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.
https://doi.org/10.1080/17470218.2013.850521
Zhang, S., & Zhang, X. (2020). The relationship between vocabulary knowledge and L2
reading/listening comprehension: A meta-analysis. Language Teaching
Research. Advance online publication.
https://doi.org/10.1177/1362168820913998
Zipf, G. (1935). The psycho-biology of language. Houghton Mifflin.
https://doi.org/10.4324/9781315009421
Zucchini, W. (2000). An introduction to model selection. Journal of Mathematical
Psychology, 44(1), 41-61.
Appendix A: Models’ Assumption Checks
All parametric testing is driven by the general linear model, and the consequential
assumptions are presented here on the residual side. As Pearson’s r correlations are
simple regressions (the latter predating the former), we have also checked additional
regression assumptions such as linearity and influential cases (for the theory
underpinning the checks presented here, see Field, 2018; Osborne & Waters, 2002;
Schützenmeister et al., 2012).
Model 1: Pearson’s r Correlation Between Difficulty and Raw COCA Frequency
The scatter plot of standardized residuals (Y-axis) against standardized predicted
values (X-axis) suggested nonlinearity, as cases did not mirror the X-axis (i.e., they
were not spread evenly above and below it). There was also a pattern in the scatter
plot suggesting a heteroskedasticity violation. The observed Durbin-Watson value
(= .98) was outside of the 1 - 3 range, so autocorrelation (non-independence of the
residuals) was a concern. The maximum Cook’s distance value was 934.05 (maximum
acceptable value = 1), which suggested that at least one case had undue influence in
the model. These observations led to the conclusion that the parameters and estimates
of the model are biased. The residuals were also not normally distributed (kurtosis
absolute z-score = 9.17; above the 3.29 threshold; see Kim, 2013), which further
pointed to the model’s not being generalizable beyond the sample.
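The two case-level diagnostics reported above can be illustrated with a minimal sketch. It assumes a simple least-squares regression and the textbook definitions of the Durbin-Watson statistic and Cook’s distance; the function name `ols_diagnostics` and the six data points are hypothetical and are not the study’s data or software.

```python
import numpy as np

def ols_diagnostics(x, y):
    """Fit y = b0 + b1*x by least squares; return (Durbin-Watson, Cook's distances)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape
    # Durbin-Watson: squared successive residual differences over the residual
    # sum of squares; always between 0 and 4, with 2 indicating no autocorrelation.
    dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
    # Leverage of each case: diagonal of the hat matrix X (X'X)^-1 X'.
    hat = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    mse = np.sum(resid ** 2) / (n - p)
    # Cook's distance combines residual size and leverage for each case.
    cooks = (resid ** 2 / (p * mse)) * hat / (1.0 - hat) ** 2
    return dw, cooks

# Hypothetical data: five well-behaved cases plus one extreme,
# high-leverage case at index 5.
x = [1, 2, 3, 4, 5, 20]
y = [1, 2, 3, 4, 5, 0]
dw, cooks = ols_diagnostics(x, y)
print(f"Durbin-Watson = {dw:.2f}")
print(f"max Cook's distance = {cooks.max():.2f} at case {int(cooks.argmax())}")
```

In this toy example the extreme case dominates the fit, which is the kind of situation a maximum Cook’s distance far above 1 signals.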
Model 2: Pearson’s r Correlation Between Difficulty and Rank COCA Frequency
The scatter plot of standardized residuals against standardized predicted values had a
slight mirroring issue, and thus linearity might not have been met. The scatterplot of
the actual measurements likewise suggested a curved relationship. In the main, the
distribution in the scatter plot, despite the mirroring issue, seemed to be random and the
nonsignificant (p > .05) value observed in the F test for heteroskedasticity gave no
grounds to reject the null hypothesis that there was no violation. The observed
Durbin-Watson (= 1.13) and maximum Cook’s distance (= .44) values were acceptable.
Besides the possible violation of linearity, these observations allowed for acceptance
of the model’s parameters. The residuals were normally distributed (observed z-scores
of skew and kurtosis < 3.29) and thus there was empirical justification to generalize
from the sample to the population.
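The normality check on the residuals can be sketched as follows. The sketch assumes the bias-corrected sample skewness and excess kurtosis with their conventional standard errors, as reported by common statistics packages; the estimator actually used in the study is not stated, so this choice, the function name `skew_kurtosis_z`, and the residual values are all hypothetical.

```python
import math

def skew_kurtosis_z(values):
    """Return (z_skew, z_kurtosis): each statistic divided by its standard error."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    sum3 = sum(((v - mean) / sd) ** 3 for v in values)
    sum4 = sum(((v - mean) / sd) ** 4 for v in values)
    # Bias-corrected sample skewness and excess kurtosis.
    skew = n / ((n - 1) * (n - 2)) * sum3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * sum4 \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    # Conventional standard errors under normality.
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew / se_skew, kurt / se_kurt

# Hypothetical residuals: symmetric, but with two extreme cases (heavy tails).
residuals = [-1.0, 1.0] * 25 + [15.0, -15.0]
z_skew, z_kurt = skew_kurtosis_z(residuals)
THRESHOLD = 3.29  # threshold cited in the text (see Kim, 2013)
print(f"|z_skew| = {abs(z_skew):.2f}, |z_kurtosis| = {abs(z_kurt):.2f}")
print("normality doubtful" if max(abs(z_skew), abs(z_kurt)) > THRESHOLD else "acceptable")
```

Absolute z-scores below the threshold for both skew and kurtosis correspond to the "normally distributed" conclusion reported for Model 2; the toy residuals here deliberately fail the kurtosis check.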
Model 3: Five-Parameter Logistic Curve Between Difficulty and Rank COCA
Frequency
Although the 5P logistic curve has a much higher r² than the previous linear model
(0.61 vs. 0.28), it also has considerably more parameters, which can lead to a model
overfitting the data (Zucchini, 2000). Models should adhere to the principle of
parsimony unless a model of greater complexity is clearly superior to simpler models.
In such situations models can be compared using fit statistics such as the
Akaike Information Criterion (AIC; Akaike, 1973) and the Bayesian Information
Criterion (BIC; Schwarz, 1978), which impose penalties on models of greater
complexity. Here, in lieu of the AIC, we use the AICc, which is more appropriate for
smaller sample sizes (Burnham & Anderson, 2002). In this case, the 5P logistic curve
(AICc = -16.48; BIC = -4.54) had superior fit to the linear model (AICc = 17.6; BIC =
23.89), as lower values indicate better fit.
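How these criteria penalize extra parameters can be shown with a short sketch. It assumes the common least-squares forms of the criteria (computed from the residual sum of squares, with the error variance counted as a parameter); the source does not state which formulation or software was used, and the function name `aicc_bic`, the sample size, and the residual sums of squares below are hypothetical.

```python
import math

def aicc_bic(rss, n, k):
    """AICc and BIC for a least-squares fit.

    rss: residual sum of squares; n: number of cases;
    k: number of estimated parameters (here including the error variance).
    """
    fit_term = n * math.log(rss / n)
    aic = fit_term + 2 * k
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)   # small-sample correction
    bic = fit_term + k * math.log(n)
    return aicc, bic

# Hypothetical values, not the study's data.
n = 100
linear = aicc_bic(rss=72.0, n=n, k=3)     # intercept, slope, error variance
logistic = aicc_bic(rss=39.0, n=n, k=6)   # five curve parameters + error variance

for name, (aicc, bic) in {"linear": linear, "5P logistic": logistic}.items():
    print(f"{name}: AICc = {aicc:.2f}, BIC = {bic:.2f}")
```

Lower values indicate better fit after the complexity penalty. With equal residual error the simpler model would always be preferred; the more complex curve wins only when its reduction in error outweighs the penalty.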
Model 4: Pearson’s r Correlation Between Difficulty and TAALES-Enabled Log-
Transformed COCA Frequency
The scatter plot revealed no issues regarding linearity and heteroskedasticity (F test for
heteroskedasticity was also nonsignificant). The observed Durbin-Watson (= 1.52) and
maximum Cook’s distance (= .15) values were also acceptable. The residuals likewise
demonstrated a normal distribution. These observations allowed one to place trust in the
generalizability of the model.
Appendix B: TAALES-Enabled Log-Transformed COCA Frequency
Measurement
Durrant et al. (2019) demonstrated that corpus-governed frequency measurements could
be aggregated into scales that predicted L2 performance. In a similar manner, the five
TAALES-provided (Tool for the Automatic Analysis of Lexical Sophistication; Kyle et
al., 2018) COCA subcorpora frequency log variables (content words) were aggregated.
TAALES provides one variable for each of the five COCA subcorpora (academic,
news, magazine, fiction, and spoken). A principal components analysis (PCA) on these
five measurements revealed only one significant factor (eigenvalue = 4.33) accounting
for 86.5% of the variance. PCA was employed as we expected one factor (Plonsky &
Gonulal, 2015). The factor loadings (lowest value = .87) were high but, as with Durrant
et al. (2019), collinearity was not a concern as we were more interested in representing
the extent of the COCA. With homogeneity established via the PCA, internal reliability
was assessed using Cronbach’s alpha, with strong internal consistency observed (α = .96).
Thus, there was psychometric justification for the aggregation, in which each measurement
was standardized (i.e., transformed into z-scores) and then averaged.
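The aggregation procedure described above can be sketched as follows. The sketch uses simulated data in place of the five TAALES log-frequency variables, computes Cronbach’s alpha on the standardized variables (an assumption; the source does not say whether raw or standardized values were used), and obtains the PCA eigenvalues from the correlation matrix. The function name `aggregate_frequency` and all values are hypothetical.

```python
import numpy as np

def aggregate_frequency(log_freqs):
    """log_freqs: array of shape (n_words, n_subcorpora).

    Returns (composite z-score per word, PCA eigenvalues, Cronbach's alpha).
    """
    X = np.asarray(log_freqs, dtype=float)
    # Standardize each subcorpus variable to mean 0 and SD 1.
    z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # PCA on the correlation matrix: eigenvalues in descending order.
    eigenvalues = np.linalg.eigvalsh(np.corrcoef(z, rowvar=False))[::-1]
    # Cronbach's alpha from item variances and the variance of the sum score.
    k = z.shape[1]
    item_var_sum = z.var(axis=0, ddof=1).sum()
    total_var = z.sum(axis=1).var(ddof=1)
    alpha = k / (k - 1) * (1 - item_var_sum / total_var)
    # Composite: mean of the standardized variables for each word.
    composite = z.mean(axis=1)
    return composite, eigenvalues, alpha

# Simulated stand-in for five subcorpus log-frequency variables that share
# one underlying frequency dimension plus subcorpus-specific noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=200)
simulated = shared[:, None] + 0.3 * rng.normal(size=(200, 5))

composite, eigenvalues, alpha = aggregate_frequency(simulated)
print("first eigenvalue:", round(float(eigenvalues[0]), 2))
print("proportion of variance:", round(float(eigenvalues[0] / eigenvalues.sum()), 3))
print("Cronbach's alpha:", round(float(alpha), 2))
```

Because the eigenvalues of a correlation matrix sum to the number of variables, the first eigenvalue divided by five gives the proportion of variance attributed to the single component, which is how an eigenvalue of about 4.33 corresponds to roughly 86.5% of the variance.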
... Before aggregating person scores to derive item facility values, the underlying psychometric properties using the CTT framework were assessed using a sequential principle component analysis (PCA) and then KR-20 (Cronbach's alpha for dichotomous scoring) testing procedure (see Stewart et al., 2022). The PCA is used when one construct is hypothesised (Loewen & Gonulal, 2015) and is used to establish homogeneity (i.e., unidimensionality); and both assumptions are necessary as evidence of internal validity (Messick, 1994) and are prerequisites for KR-20 (Stewart et al., 2022). ...
... Before aggregating person scores to derive item facility values, the underlying psychometric properties using the CTT framework were assessed using a sequential principle component analysis (PCA) and then KR-20 (Cronbach's alpha for dichotomous scoring) testing procedure (see Stewart et al., 2022). The PCA is used when one construct is hypothesised (Loewen & Gonulal, 2015) and is used to establish homogeneity (i.e., unidimensionality); and both assumptions are necessary as evidence of internal validity (Messick, 1994) and are prerequisites for KR-20 (Stewart et al., 2022). The results of the PCA suggested homogeneity because an elbow distribution was observed (for use of such visual data in vocabulary testing; see Stewart et al., 2022;Vitta et al., 2023) Sheather (2009, pp.195-202) for an example illustrating the impact of multicollinearity on regression models. ...
... The PCA is used when one construct is hypothesised (Loewen & Gonulal, 2015) and is used to establish homogeneity (i.e., unidimensionality); and both assumptions are necessary as evidence of internal validity (Messick, 1994) and are prerequisites for KR-20 (Stewart et al., 2022). The results of the PCA suggested homogeneity because an elbow distribution was observed (for use of such visual data in vocabulary testing; see Stewart et al., 2022;Vitta et al., 2023) Sheather (2009, pp.195-202) for an example illustrating the impact of multicollinearity on regression models. ...
Article
Full-text available
This study examines how second/foreign language (L2) word difficulty estimates derived from item response theory (IRT) and classical test theory (CTT) frameworks are virtually identical in the context of vocabulary testing. This conclusion is reached via a two-stage process: (a) psychometric assessments of both approaches and (b) L2 word difficulty modelling with lexical sophistication. Using data collected from first/native language (L1) Japanese EFL learners, both approaches led to similar conclusions in terms of the psychometric properties of the construct. Furthermore, CTT- and IRT/Rasch-derived estimates for word difficulty yielded nearly identical results in the predictive models. Although the “CTT-vs-IRT” debates in the past few decades have concluded with a middle ground agreed upon in most educational settings, this study acts as a useful demonstration to L2 vocabulary researchers who appear to rely heavily on Rasch/IRT analysis. The findings have practical applications to the area of L2 word difficulty research. Namely, that Rasch (an IRT procedure) alone might not be sufficient for validation, although it could be preferable when conducting inferential testing because of its potential utility for reducing chances of Type II errors.
... The authors underscore that range as well as frequency contribute to the difficulty of learning foreign language words, in addition to contextual distinctiveness, polysemy, and word neighborhood density. More recently, Stewart et al. (2022) refute the argument by showing that frequency could account for up to 80% of vocabulary test scores, if the long-transformed frequency is applied and a wide range of frequency is considered. Note that the log-transformed frequency refers to the logarithmic transformation of raw frequency counts of words in a corpus, to normalize frequency data. ...
... It includes 14 million words gathered from nearly 95,000 English websites, providing lexical information such as the COCA frequency (raw and rank), range, and dispersion across different genres. Following Stewart et al. (2022), the current study included the logtransformed frequency values in the analysis. ...
... Standardizing involves transforming the values in a variable so that the mean is 0.00 and the standard deviation is 1.00. Standardizing these values placed them on the same scale and allowed us to combine them into a single aggregated variable representing COCA frequency (for a rationale of this procedure, see Stewart et al., 2022). We acknowledge that this process omitted three subdomains that are purchasable from COCA. ...
... Furthermore, Fig. 1 illustrates how LSDD covered between 73.30 % to 87.70 % of lemmas in COCA 1K to 10K bands, but only 49.30 % to 70.30 % of 11K to 20K. It might be the case that the present LSDD methodology is more suitable for the 1K to 10K COCA bands, which are more important for the majority of language learning contexts (Stewart et al., 2022). Because the regression coefficients driving LSDD scores were acquired from a study involving Asian EAP students (i.e., Vitta et al., 2023), RQ2 was posed to compare LSDD scores with other word difficulty proxies in predicting vocabulary test scores for a different sample from the same population. ...
Article
Full-text available
*** OPEN ACCESS *** Language teaching stakeholders generally rely on frequency-derived wordlists to determine words for pedagogical purposes. However, words that are instinctively easier for many learners, such as “pizza”, occur less frequently in reference corpora than words that might be considered more difficult, such as “physics”. Furthermore, research demonstrates that modeling frequency alongside other lexical sophistication variables predicts word difficulty better than frequency alone. This study constitutes a proof-of-concept; the concept being that a lexical sophistication-based approach to wordlist construction can produce lists that outperform frequency as word difficulty predictors. The method resulted in lexical sophistication-derived difficulty scores for 14,054 of the 20,000 most frequent Corpus of Contemporary American English lemmas. When compared with other commonly used wordlists, these scores successfully addressed the “pizza/physics” problem in that “pizza” was ranked easier than “physics”, and they also displayed larger correlations with word difficulty than other lists across two linguistic domains. More importantly, the scores also performed comparably to a knowledge-based vocabulary list, but contained almost three times as many lemmas for a fraction of the time and financial costs. We envisage that the present study's methodology can be used by researchers and language teaching stakeholders to create bespoke wordlists for a range of contexts.
... Given that the COCA corpus comprises sources from American English, it was thought to best represent the language to which learners had been exposed. In keeping with previous research in this area (Eguchi & Kyle, 2020;Kim & Crossley, 2018;Kim et al., 2018;Stewart et al., 2022;Vitta et al., 2023), all range and frequency measures were log transformed. As raw frequency measures of word frequency follow Zipfian distributions, log-transformed measures of these indices are better suited to linear regression analysis. ...
... Researchers in vocabulary testing have long acknowledged that frequency is not the only factor in word difficulty (Beglar, 2010;Nation & Beglar, 2007). Stewart et al. (2022) have argued that vocabulary size tests are organized around frequency to facilitate efficient learning of the words that are most useful in English: namely, those that are the most frequent. It has recently been argued that drawing from knowledge-based vocabulary lists would better target the abilities of the test-takers (Schmitt et al., 2021). ...
Article
Full-text available
Recent studies have shown that the frequency effect, although long used as a guide to word difficulty, fails to explain all variance in learner word knowledge. As such, a "more than frequency" conclusion has been offered to explain how lexical sophistication accounts for word difficulty. This study presents a multiple regression model of word-learning difficulty from a data set of monolin-gual Japanese first language (L1) learners. Vocabulary Size Test (VST) scores of 2,999 L1 Japanese university students were converted to logit scores to determine the word-learning difficulty of 80 target words. Five lexical sophistication variables were found to correlate with word-learning difficulty (frequency, 1 This article is based on data published in McLean et al. (2014). Derek N. Canning, Stuart McLean, Joseph P. Vitta 2 cognate status, age of acquisition, prevalence, and polysemy) above a practical significance threshold. These were subsequently entered into a regression model with the logit scores as the dependent variable. The model (R 2 = .55) indicates that three lexical sophistication variables significantly predicted VST scores: frequency (ß =-.28, p = .029), cognateness (ß =-.24, p = .005), and prevalence (ß = 0.22, p = .040). Despite suggestions that complexity studies be interpreted considering what is understood about the construct of linguistic complexity , researchers have rarely made explicit the differences between absolute and relative complexity variables. As some variables can be shown to vary in complexity according to the L1 population, these must be considered in discussions of test generalizability. Although frequency will continue to be the primary criterion for the selection of lexical items for teaching and testing, the cognate status of words can be used to predict the potential learning burden of the word more precisely for learners of different L1 backgrounds.
... These studies indicate that several EFL textbooks generally deviate from frequencybased word lists, starting after the first 1,000 words. The importance of learning vocabulary based on frequency, as pointed out by Stewart et al. (2021), is that not only is frequency an indicator of word difficulty, with higher frequency words being easier than low frequency words, it also relates to the number of words within any given text a learner would be able to comprehend. In other words, knowledge of low frequency words that appear in specific texts have less value than words that are general to a wider range of texts. ...
... Hashimoto and Egbert (2019) also stated that dispersion was one factor that can indicate the difficulty of a word, due to how often a learner is exposed to it. Nicklin et al. (2022) considered that dispersion also factors into the comprehensibility of proper nouns, noting that when there were gaps between occurrences of proper nouns, reader response times to those words were impacted. Therefore, there is evidence to suggest that how often a word is encountered, and how spread out it is across the texts, factors into their difficulty and obtainability. ...
Article
Full-text available
Studies relating to the vocabulary items within EFL textbooks have revealed a divergence from well-researched wordlists such as the New General Service List (NGSL) (Browne et al., 2013), and the BNC/COCA wordlist (Nakayama, 2022; Sun and Dang, 2020). In Japan, the Ministry of Education, Culture, Sports, Science and Technology (MEXT) recently updated its course of study in 2019 to increase the target vocabulary for junior high school students from 1,200 words to a range between 1,600 and 1,800 words, in addition to the 600 to 700 words taught in elementary school. To analyze the content of the increased vocabulary for Japanese junior high school students, this study examined a corpus of six EFL textbooks from the New Horizon series: three elementary texts and three junior high school texts, (published between 2020 and 2021) using the new JACET8000 wordlist (2016), generating data pertaining to lexical coverage, in-corpus frequency, and in-corpus dispersion. It was found that 42.9% of the first 3,000 words from the JACET list were not found in the corpus, and 50.4% of the high-frequency words studied by junior high school students occurred less than two times within the corpus. Additionally, 35% of analyzed words were found to have a dispersion value of zero, indicating that several items were isolated into single units of study. Lastly, factors contributing to lexical difficulty of the textbooks were also examined.
... the consensus is that less frequent words are typically more challenging. numerous studies have demonstrated a strong correlation between word difficulty and frequency (chen & meurers, 2016;Hashimoto, 2021;Kauchak, 2016;nishihara et al., 2006;stewart et al., 2022). Hashimoto (2021) noted a correlation of −0.50 between word frequency and difficulty. ...
Article
Full-text available
Word difficulty has long been a contentious topic in language acquisition research. There are several methods for word difficulty estimation, including objective indices like word frequency and length, alongside subjective ones, such as word familiarity. Although objective indices are currently widely used, word familiarity offers a more nuanced estimation as it reflects the genuine perceptions of language users independently of specific tasks. Despite its potential, the efficacy of word familiarity in gauging word difficulty remains unclear, and it is often assessed based on whether individuals know a word. This study explored the overarching concept of word familiarity, incorporating perspectives, such as ‘KNOW’, ‘LISTEN’, ‘SPEAK’, ‘READ’, and ‘WRITE’ to investigate their relationship with word difficulty. The findings indicated that anomaly detection significantly enhances estimation accuracy from 0.66 to 0.95 across elementary, intermediate, and advanced words. Notably, word familiarity in ‘WRITE’, ‘SPEAK’, and ‘LISTEN’ was particularly effective in estimating word difficulty.
... This frequency-difficulty assumption has been widely adopted by vocabulary test creators (McLean & Kramer, 2015;Webb et al., 2017). In recent years, this assumption has been the topic of debate, and several attempts have been made to re-examine the predictive relationship between numerous variables of lexical sophistication and word difficulty (Hashimoto, 2021;Hashimoto & Egbert, 2019;Robles-García et al., 2023;Stewart et al., 2022;Vitta et al., 2023). While these studies offer useful insights into this relationship, all of them share certain limitations that warrant further research. ...
Article
Full-text available
Word frequency has a long history of being considered the most important predictor of word difficulty and has served as a guideline for several aspects of second language vocabulary teaching, learning, and assessment. However, recent empirical research has challenged the supremacy of frequency as a predictor of word difficulty. Accordingly, applied linguists have questioned the use of frequency as the principal criterion in the development of wordlists and vocabulary tests. Despite being informative, previous studies on the topic have been limited in the way the researchers measured word difficulty and the statistical techniques they employed for exploratory data analysis. In the current study, meaning recall was used as a measure of word difficulty, and random forest was employed to examine the importance of various lexical sophistication metrics in predicting word difficulty. The results showed that frequency was not the most important predictor of word difficulty. Due to the limited scope, research findings are only generalizable to Vietnamese learners of English.
... However, the use of 500-word or smaller word-bands is recommended with higher-frequency words and/or lower-proficiency learners (Kremmel, 2016;McLean, 2021). Generally, as the frequency of a word decreases, its difficulty increases, and the likelihood that it is known decreases Stewart et al., 2022). ...
Article
It is often assumed that the most frequent English words are known by post-beginner second language learners. Yet the sheer frequency of these words and the important roles they play in discourse mean that confirmation of whether they are indeed known would be valuable for understanding second language vocabulary development and reading comprehension. This article reports on a study in which university learners with Japanese as their first language (L1) ( N = 200) were tested on their written receptive knowledge of 63 senses/functions of the first 44 words in the New JACET8000 word list. The study found that for 13 senses/functions item facility was < 0.9. That is, some gaps in receptive knowledge were uncovered which qualitative item analysis suggested may stem from relative frequency of exposure, instructional experiences, knowledge of one sense/function blocking the acquisition of another, as well as abstractness and lack of a direct L1 equivalent. Nevertheless, overall receptive knowledge of the tested senses/functions of these ultra-frequent words was extremely good. Hence, although miscomprehension may arise from occasional gaps in knowledge of these words, the assumption that ultra-frequent words are receptively known by post-beginner second language (L2) learners does seem reasonable.
Article
Full-text available
This study set out to investigate the relationship between L2 vocabulary knowledge (VK) and second-language (L2) reading/listening comprehension. More than 100 individual studies were included in this meta-analysis, which generated 276 effect sizes from a sample of almost 21,000 learners. The current meta-analysis had several major findings. First, the overall correlation between VK and L2 reading comprehension was .57 (p < .01) and that between VK and L2 listening was .56 (p < .01). If the attenuation effect due to reliability of measures was taken into consideration, the ‘true’ correlation between VK and L2 reading/listening comprehension may likely fall within the range of .56–.67, accounting for 31%–45% variance in L2 comprehension. Second, all three mastery levels of form–meaning knowledge (meaning recognition, meaning recall, form recall) had moderate to high correlations with L2 reading and L2 listening. However, meaning recall knowledge had the strongest correlation with L2 reading comprehension and form recall had the strongest correlation with L2 listening comprehension, suggesting that different mastery levels of VK may contribute differently to L2 comprehension in different modalities. Third, both word association knowledge and morphological awareness (two aspects of vocabulary depth knowledge) had significant correlations with L2 reading and L2 listening. Fourth, the modality of VK measure was found to have a significant moderating effect on the correlation between VK and L2 text comprehension: orthographical VK measures had stronger correlations with L2 reading comprehension as compared to auditory VK measures. Auditory VK measures, however, were better predictors of L2 listening comprehension. Fifth, studies with a shorter script distance between L1 and L2 yielded higher correlations between VK and L2 reading. Sixth, the number of items in vocabulary depth measures had a positive predictive power on the correlation between VK and L2 comprehension. 
Finally, correlations between VK and L2 reading/listening comprehension was found to be associated with two types of publication factors: year-of-publication and publication type. Implications of the findings were discussed.
Article
Full-text available
Vocabulary’s relationship to reading proficiency is frequently cited as a justification for the assessment of L2 written receptive vocabulary knowledge. However, to date, there has been relatively little research regarding which modalities of vocabulary knowledge have the strongest correlations to reading proficiency, and observed differences have often been statistically non-significant. The present research employs a bootstrapping approach to reach a clearer understanding of relationships between various modalities of vocabulary knowledge to reading proficiency. Test-takers (N = 103) answered 1000 vocabulary test items spanning the third 1000 most frequent English words in the New General Service List corpus (Browne, Culligan, & Phillips, 2013). Items were answered under four modalities: Yes/No checklists, form recall, meaning recall, and meaning recognition. These pools of test items were then sampled with replacement to create 1000 simulated tests ranging in length from five to 200 items and the results were correlated to the Test of English for International Communication (TOEIC®) Reading scores. For all examined test lengths, meaning-recall vocabulary tests had the highest average correlations to reading proficiency, followed by form-recall vocabulary tests. The results indicated that tests of vocabulary recall are stronger predictors of reading proficiency than tests of vocabulary recognition, despite the theoretically closer relationship of vocabulary recognition to reading.
Article
Modern vocabulary size tests are generally based on the notion that the more frequent a word is in a language, the more likely a learner will know that word. However, this assumption has been seldom questioned in the literature concerning vocabulary size tests. Using the Vocabulary of American-English Size Test (VAST) based on the Corpus of Contemporary American English (COCA), 403 English language learners were tested on a 10% systematic random sample of the first 5,000 most frequent words from that corpus. Pearson correlation between Rasch item difficulty (the probability that test-takers will know a word) and frequency was only r = 0.50 (r² = 0.25). This moderate correlation indicates that the frequency of a word can only predict which words are known with only a limited degree of and that other factors are also affecting the order of acquisition of vocabulary. Additionally, using vocabulary levels/bands of 1,000 words as part of the structure of vocabulary size tests is shown to be questionable as well. These findings call into question the construct validity of modern vocabulary size tests. However, future confirmatory research is necessary to comprehensively determine the degree to which frequency of words and vocabulary size of learners are related.
Article
Recently-developed tools which quickly and reliably quantify vocabulary use on a range of measures open up new possibilities for understanding the construct of vocabulary sophistication. To take this work forward, we need to understand how these different measures relate to each other and to human readers’ perceptions of texts. This study applied 356 quantitative measures of vocabulary use generated by an automated vocabulary analysis tool ( Kyle & Crossley, 2015 ) to a large corpus of assignments written for First-Year Composition courses at a university in the United States. Results suggest that the majority of measures can be reduced to a much smaller set without substantial loss of information. However, distinctions need to be retained between measures based on content vs. function words and on different measures of collocational strength. Overall, correlations with grades are reliable but weak.
Article
Frequency is often the only variable considered when researchers or teachers develop vocabulary materials for second language (L2) learners. However, researchers have also found that many other variables affect vocabulary acquisition. In this study, we explored the relationship between L2 vocabulary acquisition and a variety of lexical characteristics using vocabulary recognition test data from L2 English learners. Conducting best subsets multiple regression analysis to explore all possible combinations of variables, we produced a best‐fitting model of vocabulary difficulty consisting of six variables (R² = .37). The fact that many variables significantly contributed to the regression model and that a large amount of variance remained yet unexplained by the frequency variable considered in this study indicates that much more than frequency alone affects the likelihood that learners will learn certain L2 words.
Article
This study develops a model of second language (L2) writing quality in the context of a standardized writing test (TOEFL iBT) using a structural equation modeling (SEM) approach. A corpus of 480 test-takers’ responses to source-based and independent writing tasks was the basis for the model. Four latent variables were constructed: an L2 writing quality variable informed by scores of source-based and independent writing tasks, and lexical sophistication, syntactic complexity, and cohesion variables informed by lexical, syntactic, and cohesive features within the essays. The SEM analysis showed that an L2 writing quality model had a good fit, and was generalizable across writing prompts (with the exception of lexical features), gender, and learning contexts. The structural regression analysis indicated that 81.7% of the variance in L2 writing quality was explained by lexical decision reaction time scores (β = 0.932), lexical overlap between paragraphs (β = 0.434), and mean length of clauses via lexical decision reaction time scores (β = 0.607). These findings indicate that higher-rated essays tend to contain more sophisticated words that elicited longer response times in lexical decision tasks, greater lexical overlap between paragraphs, and longer clauses accompanying more sophisticated words. Implications for evaluating lexical, syntactic, and cohesive features in L2 writing are discussed.
This study introduces the second release of the Tool for the Automatic Analysis of Lexical Sophistication (TAALES 2.0), a freely available and easy-to-use text analysis tool. TAALES 2.0 is housed on a user’s hard drive (allowing for secure data processing) and is available on most operating systems (Windows, Mac, and Linux). TAALES 2.0 adds 316 indices to the original tool. These indices are related to word frequency, word range, n-gram frequency, n-gram range, n-gram strength of association, contextual distinctiveness, word recognition norms, semantic network, and word neighbors. In this study, we validated TAALES 2.0 by investigating whether its indices could be used to model both holistic scores of lexical proficiency in free writes and word choice scores in narrative essays. The results indicated that the TAALES 2.0 indices could be used to explain 58% of the variance in lexical proficiency scores and 32% of the variance in word-choice scores. Newly added TAALES 2.0 indices, including those related to n-gram association strength, word neighborhood, and word recognition norms, featured heavily in these predictor models, suggesting that TAALES 2.0 represents a substantial upgrade.
Lexical sophistication is an important component of writing proficiency. New lexical indices related to range, n-gram frequency, psycholinguistic word information, academic language, polysemy, and hypernymy have yielded new insights into the construct of lexical sophistication and its relationship with second language (L2) acquisition and writing. For example, recent studies have suggested that range and bigram indices are stronger indicators of lexical sophistication than frequency in the context of L2 acquisition and L2 writing and speaking proficiency. This study explores the relationship between these newly developed indices of lexical sophistication and holistic scores of writing proficiency in both independent and source-based writing tasks. The results suggest that range and bigrams are important predictors of essay quality in independent tasks, but that lexical sophistication indices are not strong predictors of essay quality in source-based tasks. The results also indicate that responses to source-based tasks tend to include more sophisticated lexical items than responses to independent tasks. Implications for second language writing assessment and pedagogy are discussed.