This is an Accepted Manuscript of an article published by Taylor & Francis in
Language Assessment Quarterly on October 21st, 2021, available
online: https://www.tandfonline.com/doi/abs/10.1080/15434303.2021.1992629
The Relationship Between Word Difficulty and Frequency: A
Response to Hashimoto (2021)
Jeffrey Stewarta*, Stuart McLeanb, Joseph P. Vittac, Christopher Nicklind,
Geoffrey G. Pinchbecke and Brandon Kramerf
aInstitute of Arts and Sciences, Tokyo University of Science, Tokyo, Japan; bBusiness
Administration, Momoyamagakuin University, Osaka, Japan; cFaculty of Languages
and Cultures, Kyushu University, Fukuoka, Japan; dCenter for Foreign Language
Research and Education, Rikkyo University, Tokyo, Japan; eSchool of Linguistics and
Language Studies, Carleton University, Ottawa, Canada; fSchool of Education,
Kwansei Gakuin University, Nishinomiya, Japan.
*Corresponding author: Jeffrey Stewart, Tokyo University of Science, Institute of Arts
and Sciences, Building 1, 1 Chome-3 Kagurazaka, Shinjuku City, Tokyo 162-8601.
jeffjrstewart@gmail.com
Hashimoto (2021) reported a correlation of -.50 (r² = .25) between word
frequency rank and difficulty, concluding that the construct of modern vocabulary
size tests is questionable. In this response we show that the relationship between
frequency and difficulty is clear, albeit non-linear, and demonstrate that if a wider
range of frequencies is tested and log transformations are applied, the correlation
can approach .80. Finally, while we acknowledge the great promise of
knowledge-based word lists, we note that a strong correlation between difficulty
and frequency is not, in fact, the primary reason size tests are organized by
frequency.
Keywords: vocabulary testing; frequency; yes/no tests; Zipf’s law; log
transformation
Introduction
We read Hashimoto’s (2021) paper in Language Assessment Quarterly with interest.
Hashimoto stated that L2 vocabulary size tests are based on the notion that the more
frequent a word is in a language, the more likely it is that a learner will know that word.
To demonstrate this, he correlated Rasch logit item difficulty estimates for words tested
on his Vocabulary of American-English Size Test (VAST) (Hashimoto, 2016;
Hashimoto & Egbert, 2019) with the rank of frequency counts for those same words in
the Corpus of Contemporary American English (COCA) (Davies, 2008-). The
correlation was only moderate (r = -.50; r² = .25), and from this he concluded that the
construct of modern vocabulary size tests was thrown into question.
We would like to raise the following points in response. First, while the
relationship between word frequency and difficulty is not entirely linear, a strong
relationship is nevertheless clearly present. After replicating Hashimoto’s results using a
data set based on COCA frequency and responses to target words from the Vocabulary
Size Test (VST) (Nation & Beglar, 2007; Beglar, 2010), we attained a rank/difficulty
correlation similar to Hashimoto’s (r = -.53, accounting for about 28% [r²] of the
variance), but we then observed a considerably improved r of -.78 (r² = .61)1 for a
5-parameter (5P) logistic curve, indicating a strong but non-linear relationship. Second,
after accounting for the non-linear nature of the data by log-transforming COCA
frequency counts, we attained a final linear model that indicated a stronger relationship,
r = .78 (r² = .61). While still not perfect, this coefficient demonstrates that the
relationship between word frequency and difficulty can be considerably stronger than
Hashimoto believes, depending on the choice of model used and the range of word
frequencies examined. Third, it should be noted that a strong correlation between word
frequency and difficulty is not, in fact, the primary reason that L2 vocabulary tests are
organized by word frequency, and the reasons for sectioning such tests by frequency
band should be considered when designing future tests.
In the response that follows we will discuss these findings and observations in
detail. Our aim is to demonstrate that the strength of the relationship between word
frequency and difficulty can change considerably depending on how it is modelled.
First, we investigate the nonlinear relationship between word difficulty and frequency.
Second, we investigate the relationship between word difficulty and log-transformed
frequency counts. Finally, we conclude by discussing the rationale for organizing
vocabulary tests by frequency and add our own thoughts on how vocabulary size tests
can be improved. We suggest how word lists based on learner difficulty rather than
frequency can be used for diagnostic purposes, and how such lists might be used in
conjunction with traditional frequency-based lists to optimize the design of language
pedagogy.
Methods
For this response, we examine a data set of word frequencies and self-reported
difficulties obtained from a Yes/No vocabulary test for illustrative purposes. We drew
on unpublished data from a Yes/No checklist test which used the same item stems as
what is perhaps the best-known vocabulary size test in the adult SLA literature, the VST
(Beglar, 2010; Nation & Beglar, 2007). Although we typically prefer testing word
difficulty with meaning-recall tests due to their often superior power to predict reading
proficiency (McLean et al., 2020; Zhang & Zhang, 2020), for this response learner
self-reports were used in order to keep our results more analogous to Hashimoto’s.2
Similarly, although 30 pseudowords were tested alongside the real items, as in
Hashimoto’s study this information was not incorporated into difficulty estimates for
the real items. Our data set contained responses to five items from each of the first 14
1,000-word band levels of the VST. These items were not selected for the test based on
prior knowledge of their difficulty (Nation, personal communication). The 70 items
were completed by 397 Japanese university students, none of whom were English
majors. The Yes/No test was completed online in the classroom, and the collected data
had a Cronbach’s alpha of .88. Students were presented with the item stems used in the
VST with the target word underlined and in bold (e.g., poor). Two of the items from
low-frequency bands could not be found in COCA’s online resource (Davies, 2020),
meaning it was not possible to compare their frequency and difficulty using
Hashimoto’s measures. They were therefore cut from the analysis, leaving a total of 68
items.3 Pearson correlation coefficients and a 5P logistic curve analysis were used to
investigate the relationship between item difficulty, as expressed by the Yes/No scores,
and COCA frequency and rank ratings.
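The analysis described here can be reproduced along the following lines. The sketch below is illustrative Python, not the code used for this response; the file and column names are hypothetical, and item difficulty is operationalized as the proportion of learners reporting knowledge of each word.

```python
import pandas as pd
from scipy import stats

# Hypothetical input files; the data set itself is not publicly available.
responses = pd.read_csv("yes_no_responses.csv")  # columns: learner_id, word, response (1 = "yes")
coca = pd.read_csv("coca_words.csv")             # columns: word, coca_rank, coca_freq

# Item "difficulty" here is the proportion of learners reporting knowledge of each word
difficulty = (responses.groupby("word")["response"].mean()
              .rename("p_known").reset_index())
items = coca.merge(difficulty, on="word")

# Pearson correlations analogous to those reported in this response
r_rank, _ = stats.pearsonr(items["coca_rank"], items["p_known"])
r_freq, _ = stats.pearsonr(items["coca_freq"], items["p_known"])
print(f"rank vs. proportion known:          r = {r_rank:.2f}")
print(f"raw frequency vs. proportion known: r = {r_freq:.2f}")
```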
The Non-Linear Relationship Between Word Difficulty and COCA
Frequency
Figure 1 displays the correlation we attained, r = .34 (r² = .12), between word
difficulty and COCA frequency, which was larger than Hashimoto’s, r = -.15 (r² = .02;
the conceptual direction is the same given his use of logits). However, an assumptions
check of the general linear model, which underpins all parametric testing, revealed
issues with the correlation, with violations ranging from nonlinearity to non-
independence and heteroskedasticity of the residuals (see Appendix A for assumptions
checks of correlations). As highlighted by Schützenmeister et al. (2012) and Osborne
and Waters (2002), such violations can render the model’s parameters and point
estimates biased, and in this case the number of violations made it imprudent to place
any credibility in the ability of the model (i.e., the correlation) to make inferences about
the population. Perhaps due to such issues, Hashimoto then correlated word difficulty
with COCA frequency rank. For illustrative purposes we do the same with our own
data, along with a line of fit for a linear regression, as can be seen in Figure 2.
Figure 1
Scatter Plot of Self-Reported Word Difficulty by COCA Frequency
Note. One outlier, the very high-frequency word see, has been excluded to retain
perspective on the remaining data points.
Figure 2
Scatter Plot of Self-Reported Word Difficulty by COCA Rank with a Linear Regression
Line
The scatter plot in Figure 2 represents an r coefficient of -.53 (r² = .28), near
Hashimoto’s r value of -.50 (r² = .25). However, closer inspection indicates that while
more workable, our data is still not quite linear in nature (see Appendix A). The
majority of high-frequency words in the COCA list are very easy for learners, yet these
points are treated as outliers from the model’s slope rather than confirmation of the
frequency-difficulty relationship. Initially, there is a sharp descent in difficulty for
COCA ranks between approximately 1 and 10,000, followed by a weaker relationship
thereafter for lower-frequency words. The data appears to be curved, indicating a
non-linear relationship between word frequency rank and difficulty. We therefore
attempted a non-linear regression with a 5P logistic curve (Burnham, 2020) for
exploratory purposes and attained a substantially better fit (r = -.78, r² = .61), as can be
seen in Figure 3.
Figure 3
Scatter Plot of Self-Reported Word Difficulty by COCA Rank Fit with a 5P Logistic
Curve
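Our 5P logistic fit was produced in JMP (Burnham, 2020); the following Python sketch shows an equivalent fit with SciPy for readers who wish to explore this approach. The data generated here are synthetic stand-ins for the 68-item data set, and all parameter values are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_5p(x, a, d, c, b, g):
    # Standard 5PL: asymptotes a and d, inflection point c, slope b, asymmetry g
    return d + (a - d) / (1.0 + (x / c) ** b) ** g

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(100, 30000, 68))           # hypothetical COCA ranks
y = logistic_5p(x, 0.95, 0.25, 8000.0, 2.0, 0.7)   # hypothetical "proportion known"
y = np.clip(y + rng.normal(0, 0.05, x.size), 0, 1)

popt, _ = curve_fit(logistic_5p, x, y,
                    p0=[1.0, 0.2, 5000.0, 1.0, 1.0],
                    bounds=([0, 0, 1, 0.1, 0.1], [2, 2, 1e6, 10, 10]))

# A pseudo-r for the non-linear fit: correlation of observed and fitted values
y_hat = logistic_5p(x, *popt)
print(f"5P logistic fit: r = {np.corrcoef(y, y_hat)[0, 1]:.2f}")
```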
The Relationship Between Word Difficulty and Log-Transformed Frequency
Counts
Although non-linear models of this nature may hold potential for capturing the
relationship between word difficulty and numerous measures of word frequency, a
drawback of this approach is that researchers often use word frequency as one of several
explanatory variables in various linear models. It is possible to accommodate word
frequency within common linear models such as multiple regression by using a standard
procedure in L2 vocabulary research, log transformation of raw frequency counts (e.g.,
Kyle & Crossley, 2015, 2016).
There is a firm theoretical basis for conducting log transformations on frequency
counts. According to Zipf’s (1935) law, frequency is inversely proportional to rank, in
that human language comprises a small number of highly frequent words, a larger
number of words of medium frequency, and multitudes of low-frequency words. When
plotted, this relationship follows a Zipfian distribution, which is illustrated in Figure 4a
with data from the 50 most frequent words in Alice in Wonderland4 and can be found
across organised language genres, spoken or written. Zipf interpreted the law in terms
of speaker and interlocutor requirements, whereby a limited set of highly frequent
vocabulary items conserves effort and increases the efficiency of spoken
communication (Manning & Schütze, 1999).
Figure 4
The Frequency Distribution of the 50 Most Frequent Words in Alice in Wonderland
Plotted as (a) Raw and (b) Log-Transformed Frequency Counts
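The Zipfian pattern in Figure 4 can be reproduced from any plain-text file. The sketch below (with a placeholder file name) counts word frequencies and shows that the raw counts of the 50 most frequent words are sharply curved against rank, while their logs are approximately linear against log rank, as Zipf’s law predicts.

```python
import re
from collections import Counter
import numpy as np

# Any plain-text file will do; the file name is a placeholder
with open("alice_in_wonderland.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

top50 = Counter(words).most_common(50)
freqs = np.array([count for _, count in top50], dtype=float)
ranks = np.arange(1, len(top50) + 1)

# Raw counts against rank form a sharply curved, Zipfian line...
print("raw:     r =", round(np.corrcoef(ranks, freqs)[0, 1], 2))
# ...while log counts against log ranks are approximately linear,
# which is why logging makes frequency tractable in linear models
print("log-log: r =", round(np.corrcoef(np.log(ranks), np.log(freqs))[0, 1], 2))
```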
Despite the ubiquitous nature of Zipfian distributions in natural language, such
distributions do not lend themselves to the typical statistical analyses employed in
linguistic and corpus-based research, which more often than not employ parametric
testing driven by the general linear model. As the name implies, this analysis assumes
that the observed and predicted case values within the model have a linear relationship,
which Osborne and Waters (2002) demonstrated by modeling regression residual plots.
When attending to this issue, researchers often log-transform frequency counts, which
improves such linearity, as can be seen in Figure 4b. This practice has been employed in
the operationalization of frequency as a component of lexical sophistication (e.g.,
Edwards & Collins, 2011; Kyle & Crossley, 2015, 2016; Kim & Crossley, 2018;
Durrant et al., 2019). An added benefit of log transformations over rank is that the
coefficients of log-transformed frequency predictors can easily be converted back to
meaningful associations with the dependent variable, aiding the interpretability of
model predictions. It should be noted that log transformations cannot be viewed as a
cure-all for linearity and other testing assumptions in all cases; in other situations the
transformation may create a variable which differs from the construct on which the
initial theoretical underpinnings were hypothesized (Field, 2018; Lo & Andrews, 2015).
However, due to the close resemblance of the Zipfian curve to a log-normal
distribution, the transformation is justifiable in this context. Hashimoto (2021)
acknowledged log-transformed frequency as a viable alternative to his chosen
methodology and recommended further research to investigate it.
Due to the non-linear shape of the data, we next applied a log transformation to
COCA frequency counts, employing the log-transformed raw frequency measures
provided automatically by TAALES. An important distinction to note is that TAALES
COCA variables are derived from orthographic (and/or phonographic) form data in lieu
of lemma frequency counts, and so may differ from lemma-based data somewhat (see
Appendix B for details). We observed a correlation of .78 (r² = .61), which met all
assumptions of the general linear model (see Appendix A). A scatter plot of the
relationship can be seen in Figure 5. The transformation resulted in a substantially more
linear relationship between the two variables and indicated that the relationship
between word frequency and difficulty can be considerably stronger than Hashimoto’s
results suggest.
Figure 5
Scatter Plot of Self-Reported Word Difficulty by Log of COCA Frequency
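In code, this final model amounts to one additional line relative to the Methods sketch above: log-transform the raw counts before correlating. The file and column names are again hypothetical; in our analysis, TAALES supplied such log-transformed values directly.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical merged item file, as in the Methods sketch: word, coca_freq, p_known
items = pd.read_csv("items_with_difficulty.csv")
items["log_freq"] = np.log10(items["coca_freq"])  # log-transform the raw counts

r_log, _ = stats.pearsonr(items["log_freq"], items["p_known"])
print(f"log frequency vs. proportion known: r = {r_log:.2f}")
```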
Discussion
Our illustrative examples come with some important caveats. While our analyses may
suggest future directions for the modelling of word frequency and difficulty, we would
like to emphasize that our example data set samples only 68 words from 14,000 word
families, and that these illustrative analyses should not be seen as conclusive until a
larger study can be conducted. Further research with more robust samples is needed to
confirm the implied trends. The goal of our analyses has simply been to demonstrate
that the strength of the relationship between word frequency and difficulty can change
considerably depending on how the relationship is modelled and how frequency is
operationalized.
An additional caveat should also be considered when comparing our illustrative
data to Hashimoto’s findings. Unlike the VST, which samples words from the first
14,000 word families, Hashimoto’s test, the VAST, only draws from the 5,000 most
common words in the COCA. Therefore, non-linearity will likely not be as important an
issue for Hashimoto’s data set if rank is used as the predictor variable, as the curve we
observed in our own data using rank was more apparent when items were sampled
beyond the 10,000 rank. The range of a predictor variable plays an important role in the
strength of a correlation. Our correlation was likely higher in large part due to the wider
variety of word frequencies sampled. Had Hashimoto tested word difficulty up to the
10,000th frequency rank, he likely would have seen a higher correlation, even before
applying a log transformation to the data. Conversely, although we do not have
sufficient numbers of items per test section to demonstrate this principle to a statistically
significant degree, the correlation between frequency and difficulty within a single
1,000-word band would likely be much lower.
Finally, the choice of corpora should be considered when drawing comparisons
between frequency and difficulty. Although we do not have a sufficient number of items
to demonstrate statistically significant differences, it is possible that a large general
corpus of English, such as the web-based corpus EnTenTen12 (Kilgarriff et al., 2014),
would work well for modelling a wide range of vocabulary difficulty, and that corpora
of informal speech (e.g., the BNC spoken demographic sub-corpus; BNC Consortium,
2001) or of TV/movie closed captions (e.g., CBBC and SUBTLEX-UK, van Heuven et
al., 2014; SUBTLEX-US, Brysbaert & New, 2009) would work well with a range of
vocabulary limited to high- and moderately-frequent words, such as that used in
Hashimoto (2021) (Pinchbeck et al., unpublished). These possibilities warrant further
exploration in future studies.
However, even if the precise strength of the relationship between word difficulty
and frequency remains in dispute, we believe there are other reasons to organize
vocabulary size tests by frequency. There is a sound rationale for organizing tests by
word frequency, and frequency will likely remain an important consideration for test
design even if learner word difficulty is accounted for in future tests. As has been
repeatedly noted in the L2 vocabulary acquisition literature, the 2,000 most common
word families in English comprise approximately 80% of spoken and written English,
and the first 5,000 likely cover over 90% (Laufer & Ravenhorst-Kalovski, 2010). As
vocabulary coverage follows a power law, learning less frequent words results in
diminishing returns, wherein more and more additional, less frequent words must be
learned to attain comparable gains in text coverage. Therefore, learning a relatively
low-frequency word known only by a given fraction of a population of learners will be
of less utility than learning a more frequent word known by the same fraction of
learners, even if the difficulty of the two is identical. To illustrate this, suppose two
learners each have English vocabulary sizes of 3,000 words. One learner knows 500
words at each of the first six 1,000-word levels. Another learner knows all of the 3,000
most frequent words (a 3,000-word lexical mastery level), but none of the next 3,000.
When reading a text, the second learner will know a considerably higher proportion of
the words on a given page, and in practice is likely to be judged as having stronger
fluency in the language. Therefore, an advantage of L2 vocabulary tests based on word
frequency is that, since they are organized by word frequency level, it is easy for
educators and researchers to focus on subsets of words that, if studied, are more likely
to markedly improve learner text coverage. In short, frequency-based vocabulary tests
provide a measure of knowledge of the vocabulary that is most useful within whatever
corpus-register is being used to represent the target language of instruction.
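A worked version of this two-learner example follows. The per-band coverage figures are hypothetical, chosen only to be consistent with the coverage estimates cited above (first 2,000 word families ≈ 80%, first 5,000 ≈ 90%), and the sketch assumes tokens are spread evenly within a band.

```python
# Hypothetical share of running words covered by each 1,000-word frequency band;
# the first two bands sum to ~80% and the first five to ~89%, roughly matching
# Laufer & Ravenhorst-Kalovski (2010)
band_coverage = [0.70, 0.10, 0.04, 0.03, 0.02, 0.01]

# Learner A knows 500 words at each of the first six 1,000-word bands
# (assuming tokens are spread evenly within a band, half a band = half its coverage)
coverage_a = sum(0.5 * c for c in band_coverage)

# Learner B knows all of the first three bands and none of the next three
coverage_b = sum(band_coverage[:3])

print(f"Learner A text coverage: {coverage_a:.0%}")  # ~45% of running words
print(f"Learner B text coverage: {coverage_b:.0%}")  # ~84% of running words
```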
We acknowledge the advantages of collecting learner difficulty data for
assessment, research, and pedagogy, and we are keenly interested in the potential uses
for this variable in the future. However, we believe that vocabulary tests will continue
to reference word frequency information despite this coming shift in the field. We will
conclude our response with an attempt to reconcile the debate about the validity of
frequency-based versus knowledge-based approaches to vocabulary testing. Rather than
argue the value of one over the other, we propose that these two approaches offer
complementary information to practitioners and to those who design and develop
language programs.
First, we propose that tests that sample vocabulary items based on frequency
and/or dispersion in a corpus should be used to measure the extent to which learners are
proficient in the vocabulary of a given target language. We appreciate that Hashimoto
intended his comments for size rather than levels tests. However, the primary rationale
for measuring learner L2 vocabulary size in the first place is that vocabulary knowledge
is an important predictor of broader forms of L2 proficiency. Vocabulary size estimates
that can be analyzed by frequency bands likely have greater utility for such predictions.
Frequency-based word lists, from which test items are sampled, provide us with an
operationalization of the words that learners should know, and provide rankings of
word usefulness. Furthermore, frequency-based pedagogical word lists provide a way to
prioritize target word forms in course syllabi that include vocabulary as an explicit
program outcome; this is the primary reason that the Academic Word List (AWL), and
analogous lists such as the Academic Vocabulary List (AVL), are commonly used in
EAP programs, since they were compiled using frequency and dispersion information
from academic texts.
Second, we propose that knowledge-based tests be used to understand what
learners already know, irrespective of the usefulness of those words in the target
language. Knowledge-based tests and knowledge-based scales of vocabulary would
allow teachers and language program materials developers to know more precisely
which words in candidate course text-passages learners are likely to know (and not
know), and this would provide better estimates of readability for individual
text-passages than would frequency-based lexical indexes. This approach would allow
individual learner factors to be incorporated into models of readability, which varies
somewhat according to the learner’s first language and other demographic factors. For
example, Japanese learners of English are much more likely to know English loanwords
that are commonly used in Japanese. Similarly, speakers of Romance languages have
advantages in learning cognates shared by other Romance languages, in addition to
those cognates that are shared with English. In this way, text-passages could be
optimally chosen for specific course levels and even for specific groups of learners
within classes.
Taken together, information from these two types of tests provides the two
things that teachers need to know: 1) what the learners need to learn (corpus-based scale
tests) and 2) what the learners know now (knowledge-based, item-scaled tests). The
nexus between where the learners are and where they need to go is where teaching and
learning ideally take place. Diagnostic vocabulary tests based on scales of knowledge
could assist with matching learners to texts and would provide teachers with an estimate
of what target vocabulary learners already know. Frequency-based word lists could then
be used to identify, from among the words that learners likely do not already know,
target words for explicit instruction (e.g., the AWL for an EAP class).
1 As with Hashimoto (2021) and as suggested by Plonsky (2013), r² is used as the effect size of
consequence as it describes the amount of variance shared between the variables (i.e., the
strength of association).
2 Meaning-recall responses were also available for the data set used in this paper. The correlation
to COCA rank was -0.542, compared to -0.533 for Yes/No responses, and the difference
was statistically non-significant (Steiger’s z = 0.245, p > 0.8). While it is possible a larger set
of items could establish a significant difference, this result suggests that, relative to
correlations of two proficiency levels for the same learner, test item format does not make
as large a difference with correlations to frequency data.
3 A sensitivity power analysis assuming a 5% alpha and 20% beta threshold revealed that our
sample size was powered to detect a minimal effect size of r = .32 (r² = .10), which is
reasonable vis-à-vis Hashimoto’s value of most interest (r = .50).
4 The data was created by Parr (n.d.) and was retrieved from
https://codepen.io/adrianparr/pen/jwmjmv?js-preprocessor=babel
References
Akaike, H. (1973). Information theory and an extension of the maximum likelihood
principle. In B. N. Petrov & F. Caski (Eds.), Proceedings of the 2nd
International Symposium on Information Theory (pp. 267–281). Akademiai
Kiado.
Beglar, D. (2010). A Rasch-based validation of the Vocabulary Size Test. Language
Testing, 27(1), 101-118. https://doi.org/10.1177/0265532209340194
BNC Consortium. (2001). The British national corpus (Version 2) (BNC World).
Oxford University, Computing Services [Distributor].
http://www.natcorp.ox.ac.uk
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical
evaluation of current word frequency norms and the introduction of a new and
improved word frequency measure for American English. Behavior Research
Methods, 41(4), 977–990. https://doi.org/10.3758/BRM.41.4.977
Burnham, D. (2020, May 13). Curve fitting. JMP Ahead. http://www.pega-
analytics.co.uk/blog/curve-fitting/
Burnham, K. & Anderson, D. (2002). Model selection and multimodel inference: A
practical information‐theoretic approach (2nd ed.). Springer.
Davies, M. (2008-). The Corpus of Contemporary American English (COCA).
Available online at https://www.english-corpora.org/coca/
Davies, M. (2020). Corpus of Contemporary American English: Top 60,000 words
(lemmas) in the corpus. Retrieved February 2021, from
https://www.english-corpora.org/coca
Durrant, P., Moxley, J., & McCallum, L. (2019). Vocabulary sophistication in first-year
composition assignments. International Journal of Corpus Linguistics, 24(1),
33–66. https://doi.org/10.1075/ijcl.17052.dur
Edwards, R., & Collins, L. (2011). Lexical frequency profiles and Zipf’s law. Language
Learning, 61(1), 1–30. https://doi.org/10.1111/j.1467-9922.2010.00616.x
Field, A. (2018). Discovering statistics using IBM SPSS Statistics (5th Ed.). Sage.
Hashimoto, B. J. (2016). Rethinking vocabulary size tests: Frequency versus item
difficulty (Unpublished master’s thesis). Brigham Young University. Retrieved
from
https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=6957&context=etd
Hashimoto, B. J., & Egbert, J. (2019). More than frequency: Exploring predictors of
word difficulty for second language learners. Language Learning, 69(4), 839-
872. https://doi.org/10.1111/lang.12353
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P.,
& Suchomel, V. (2014). The Sketch Engine: Ten years on. Lexicography, 1(1),
7-36.
Kim, H. (2013). Statistical notes for clinical researchers: Assessing normal distribution
using skewness and kurtosis. Restorative Dentistry & Endodontics, 38(1), 52-54.
https://doi.org/10.5395/rde.2013.38.1.52
Kyle, K., & Crossley, S. A. (2015). Automatically assessing lexical sophistication:
Indices, tools, findings, and application. TESOL Quarterly, 49(4), 757–786.
https://doi.org/10.1002/tesq.194.
Kyle, K., & Crossley, S. (2016). The relationship between lexical sophistication and
independent and source-based writing. Journal of Second Language Writing, 34,
12–24. https://doi.org/10.1016/j.jslw.2016.10.003
Kyle, K., Crossley, S., & Berger, C. (2018). The tool for the automatic analysis of
lexical sophistication (TAALES): Version 2.0. Behavior Research Methods,
50(3), 1030-1046. https://doi.org/10.3758/s13428-017-0924-4
Laufer, B., & Ravenhorst-Kalovski, G. C. (2010). Lexical threshold revisited: Lexical
text coverage, learners’ vocabulary size and reading comprehension. Reading in
a Foreign Language, 22(1), 15-30.
Lo, S., & Andrews, S. (2015). To transform or not to transform: Using generalized
linear mixed models to analyze reaction time data. Frontiers in Psychology, 6,
1171. https://doi.org/10.3389/fpsyg.2015.01171
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language
processing. MIT Press.
Nation, I. S. P., & Beglar, D. (2007). A vocabulary size test. The Language Teacher,
31(7), 9-13. Retrieved from https://jalt-
publications.org/files/pdf/the_language_teacher/07_2007tlt.pdf
Osborne, J. W., & Waters, E. (2002). Four assumptions of multiple regression that
researchers should always test. Practical Assessment, Research, & Evaluation,
8(2), 1-5. https://doi.org/10.7275/r222-hv23
Pek, J., Wong, O., & Wong, C. M. (2017). Data transformations for inference with
linear regression: Clarifications and recommendations. Practical Assessment,
Research & Evaluation, 22(9), 1-11. Retrieved from
https://pareonline.net/getvn.asp?v=22&n=9
Pinchbeck, G.G., Brown, D., McLean, S., & Kramer, B. (in review).
Plonsky, L., & Gonulal, T. (2015). Methodological synthesis in quantitative L2
research: A review of reviews and a case study of exploratory factor analysis.
Language Learning, 65(S1), 9–36. https://doi.org/10.1111/lang.12111
Schützenmeister, A., Jensen, U., & Piepho, H.-P. (2012). Checking normality and
homoscedasticity in the general linear model using diagnostic plots.
Communications in Statistics - Simulation and Computation, 41(2), 141–154.
https://doi.org/10.1080/03610918.2011.582560
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2),
461-464.
van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-
UK: A new and improved word frequency database for British English. The
Quarterly Journal of Experimental Psychology, 67(6), 1176–1190.
https://doi.org/10.1080/17470218.2013.850521
Zhang, S., & Zhang, X. (2020). The relationship between vocabulary knowledge and L2
reading/listening comprehension: A meta-analysis. Language Teaching
Research. Advance online publication.
https://doi.org/10.1177/1362168820913998
Zipf, G. (1935). The psycho-biology of language. Houghton Mifflin.
https://doi.org/10.4324/9781315009421
Zucchini, W. (2000). An introduction to model selection. Journal of Mathematical
Psychology, 44(1), 41-61.
Appendix A: Models’ Assumption Checks
All parametric testing is driven by the general linear model, and the consequent
assumptions concern the residuals. As a Pearson’s r correlation is equivalent to a simple
regression, we have also checked additional regression assumptions such as linearity
and influential cases (for the theory underpinning the checks presented here, see Field,
2018; Schützenmeister et al., 2012; Osborne & Waters, 2002).
Model 1: Pearson’s r Correlation Between Difficulty and Raw COCA Frequency
The scatter plot of standardized residuals (Y-axis) against standardized predicted values
(X-axis) suggested nonlinearity, as cases did not mirror the X-axis. There was also a
pattern in the scatter plot suggesting a heteroskedasticity violation. The observed
Durbin-Watson value (= .98) was outside of the 1-3 range, and autocorrelation
(non-independence of the residuals) was a concern. The maximum Cook’s distance
value was 934.05 (maximum acceptable value = 1), which suggested that at least one
case had undue influence on the model. These observations led to the conclusion that
the parameters and estimates of the model are biased. The residuals were also not
normally distributed (kurtosis absolute z-score = 9.17, above the 3.29 threshold; see
Kim, 2013), which further pointed to the model not being generalizable beyond the
sample.
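For readers who wish to run comparable diagnostics, the sketch below shows how such checks can be computed in Python with statsmodels. It is not the software used for this appendix, and the variables are synthetic stand-ins for raw COCA frequency and item difficulty.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import OLSInfluence
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(8, 2, 68)                         # hypothetical Zipf-like raw frequencies
y = 0.4 + 0.01 * np.log(x) + rng.normal(0, 0.1, 68) # hypothetical difficulty scores

model = sm.OLS(y, sm.add_constant(x)).fit()
resid = model.resid

print("Durbin-Watson:", durbin_watson(resid))                        # ~1-3 acceptable
print("Max Cook's D:", OLSInfluence(model).cooks_distance[0].max())  # <= 1 acceptable
lm, lm_p, f_stat, f_p = het_breuschpagan(resid, model.model.exog)
print("Heteroskedasticity F-test p:", f_p)                           # p > .05: no violation
# Normality of residuals via skew and kurtosis z-scores (|z| < 3.29; Kim, 2013)
print("Skew z:", stats.skewtest(resid).statistic)
print("Kurtosis z:", stats.kurtosistest(resid).statistic)
```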
Model 2: Pearson’s r Correlation Between Difficulty and Rank COCA Frequency
The scatter plot of standardized residuals against standardized predicted values had a
slight mirroring issue, and thus linearity might not have been met. The scatter plot of
the actual measurements, however, suggested a curved relationship. In the main, the
distribution in the residual scatter plot, despite the mirroring issue, seemed to be
random, and the nonsignificant (p > .05) value observed in the F test for
heteroskedasticity meant that the null hypothesis of no violation could not be rejected.
The observed Durbin-Watson (= 1.13) and maximum Cook’s distance (= .44) values
were acceptable. Aside from the possible violation of linearity, these observations
allowed for acceptance of the model’s parameters. The residuals were normally
distributed (observed z-scores of skew and kurtosis < 3.29), and thus there was
empirical justification to generalize from the sample to the population.
Model 3: Five-Parameter Logistic Curve Between Difficulty and Rank COCA
Frequency
Although the 5P logistic curve has a much higher r² than the previous linear model
(.61 vs. .28), it also has considerably more parameters, which can lead to a model
overfitting the data (Zucchini, 2000). Models should adhere to the principle of
parsimony unless a model of greater complexity is clearly superior to simpler models.
In such situations, models can be compared using information criteria such as the
Akaike Information Criterion (AIC; Akaike, 1973) and the Bayesian Information
Criterion (BIC; Schwarz, 1978), which impose penalties on models of greater
complexity. (Here, in lieu of the AIC, we use the AICc, which is more appropriate for
smaller sample sizes; Burnham & Anderson, 2002.) In this case, the 5P logistic curve
(AICc = -16.48; BIC = -4.54) had superior fit to the linear model (AICc = 17.6; BIC =
23.89).
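For least-squares fits, these criteria can be computed directly from the residual sum of squares. The sketch below uses the standard Gaussian-likelihood forms; the RSS values are hypothetical, so the printed numbers are illustrative only and will not reproduce the values reported above.

```python
import numpy as np

def aicc_bic(rss, n, k):
    # Gaussian log-likelihood up to a constant gives AIC = n*ln(RSS/n) + 2k
    aic = n * np.log(rss / n) + 2 * k
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)  # small-sample correction
    bic = n * np.log(rss / n) + k * np.log(n)     # BIC replaces 2k with k*ln(n)
    return aicc, bic

# Hypothetical residual sums of squares; k counts the estimated curve
# parameters plus the error variance
print("linear:      AICc = %.2f, BIC = %.2f" % aicc_bic(rss=1.80, n=68, k=3))
print("5P logistic: AICc = %.2f, BIC = %.2f" % aicc_bic(rss=0.95, n=68, k=6))
```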
Model 4: Pearson’s r Correlation Between Difficulty and TAALES-Enabled
Log-Transformed COCA Frequency
The scatter plot revealed no issues regarding linearity and heteroskedasticity (the F test
for heteroskedasticity was also nonsignificant). The observed Durbin-Watson (= 1.52)
and maximum Cook’s distance (= .15) values were also acceptable. The residuals
likewise demonstrated a normal distribution. These observations allowed one to place
trust in the generalizability of the model.
Appendix B: TAALES-Enabled Log-Transformed COCA Frequency
Measurement
Durrant et al. (2019) demonstrated that corpus-derived frequency measurements could
be aggregated into scales that predicted L2 performance. In a similar manner, the five
TAALES-provided (Tool for the Automatic Analysis of Lexical Sophistication; Kyle et
al., 2018) COCA subcorpora log-frequency variables (content words) were aggregated.
TAALES provides a variable for each of the five COCA subcorpora (academic, news,
magazine, fiction, and spoken). A principal components analysis (PCA) on these five
measurements revealed only one significant factor (eigenvalue = 4.33), accounting for
86.5% of the variance. PCA was employed as we expected one factor (Plonsky &
Gonulal, 2015). The factor loadings (lowest value = .87) were high, but as with Durrant
et al., collinearity was not a concern, as we were more interested in representing the
extent of the COCA. With homogeneity established via the PCA, internal reliability
was assessed using Cronbach’s alpha, with strong internal consistency observed (α =
.96). Thus, there was psychometric justification for the aggregation, in which each
measurement was standardized (i.e., transformed into z-scores) and then averaged.
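This aggregation procedure can be expressed compactly in code. The sketch below is an illustrative Python equivalent with hypothetical file and column names, not the software used for this appendix: PCA on the standardized measures, Cronbach’s alpha from their covariance matrix, and the final average of z-scores.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

cols = ["coca_acad_log", "coca_news_log", "coca_mag_log",
        "coca_fic_log", "coca_spok_log"]      # hypothetical TAALES column names
data = pd.read_csv("taales_output.csv")       # hypothetical TAALES output file

# Standardize each measure (z-scores), then run PCA; one dominant
# component is expected
z = (data[cols] - data[cols].mean()) / data[cols].std(ddof=1)
pca = PCA().fit(z)
print("Eigenvalues:", np.round(pca.explained_variance_, 2))
print("Variance explained by PC1:", f"{pca.explained_variance_ratio_[0]:.1%}")

# Cronbach's alpha from the covariance matrix of the five measures
k = len(cols)
cov = data[cols].cov().values
alpha = (k / (k - 1)) * (1 - np.trace(cov) / cov.sum())
print("Cronbach's alpha:", round(alpha, 2))

# The aggregated frequency score: the mean of the five z-scored measures
aggregate = z.mean(axis=1)
```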