Word prevalence norms for 62,000 English lemmas
Marc Brysbaert¹, Paweł Mandera¹, Samantha McCormick², & Emmanuel Keuleers³

¹ Department of Experimental Psychology, Ghent University
² Department of Psychology, University of Roehampton
³ Department of Cognitive Science and Artificial Intelligence, Tilburg University
Keywords: Word prevalence, word frequency, word processing, megastudy
Address: Marc Brysbaert
Department of Experimental Psychology
Ghent University
Henri Dunantlaan 2
B-9000 Gent
Belgium
Tel. +32 9 264 94 25
Fax. +32 9 264 64 96
E-mail: marc.brysbaert@ugent.be
Abstract
We present word prevalence data for 61,858 English words. Word prevalence refers to the number
of people who know the word. The measure was obtained on the basis of an online crowdsourcing
study involving over 220,000 people. Word prevalence data are useful for gauging the difficulty of words and, as such, for matching stimulus materials across experimental conditions or for selecting stimulus materials for vocabulary tests. Word prevalence also predicts word processing times,
over and above the effects of word frequency, word length, similarity to other words, and age of
acquisition, in line with previous findings in the Dutch language.
Researchers working with word stimuli are taught to select words primarily on the basis of word frequency, word
length, similarity to other words, and age of acquisition (e.g., Brysbaert, Buchmeier, Conrad, Jacobs,
Bölte, & Böhl, 2011). For instance, a researcher investigating the effect of emotional valence
(negative, neutral, positive) on word processing efficiency would be expected to match the stimuli on
those four variables.
In our work we gradually discovered that the above set of variables does not fully cover differences
in word knowledge. This is particularly true for low frequency words. Some of these words are
generally known (such as toolbar, screenshot, soulmate, uppercase, hoodie), whereas others are
hardly known by anyone (e.g., scourage, thunk, whicker, or caudle). Furthermore, none of the other word variables collected so far seems to fully capture these differences in word knowledge.
For a long time, we hoped that improved word frequency measures would solve the problem, but so far this hope has not been borne out: some words are much better known than expected from the frequency with which they occur in the corpora at our disposal for calculating word frequency measures. Subjective word familiarity ratings may be an alternative (Gernsbacher, 1984), but they have not yet been collected for most words. Such ratings can also be criticized because they are collected from a small number of people (who may be more or less familiar with some words for idiosyncratic reasons). Moreover, there is a difference between how many people know a word and how familiar they are with it. Some words score low on familiarity yet are known to nearly everyone (such as basilisk, obelisk, oxymoron, debacle, emporium, and armadillo).
The variable that currently seems to best capture differences in word knowledge is age of acquisition (AoA): words that are not known to the raters get high AoA scores. Indeed, some researchers in natural language processing have started using AoA values as a proxy for word difficulty in addition to word frequency. However, this is not the common understanding of AoA, which is considered to be the order in which known words were acquired.
To solve the issue of differences in word knowledge unrelated to word frequency, we decided to ask people directly which words they knew. This was first done in Dutch (Brysbaert, Stevens, Mandera, & Keuleers, 2016a; Keuleers, Stevens, Mandera, & Brysbaert, 2015) and gave rise to a new word characteristic, which we called word prevalence. The variable refers to the percentage of people who indicate that they know the word (in practice, the percentages are transformed to z-values; see below for more details). Word prevalence explained 6% extra variance in Dutch word processing times as measured with the lexical decision task. Even at the high end it had an effect: we observed a 20 ms difference in response times between words known to all participants and words known to only 99% of the participants (Brysbaert et al., 2016a).
The present article introduces the word prevalence measure for English and presents some of the
initial analyses.
Method
Stimulus materials. The stimuli consisted of a list of 61,858 English words, collected over the years at
the Center for Reading Research, Ghent University. The list is largely based on the SUBTLEX word
frequencies we collected, combined with word lists from psycholinguistic experiments and word lists
from freely available spelling checkers and dictionaries. The nonwords consisted of a list of 329,851
pseudowords generated by Wuggy (Keuleers & Brysbaert, 2010).
Participants and the vocabulary test used. For each vocabulary test, a random sample of 67 words
and 33 nonwords was selected. For each letter string, participants had to indicate whether they knew
the stimulus or not. At the end of the test, participants received information about their
performance in the form of a vocabulary score based on the percentage of correctly identified words
minus the percentage of nonwords identified as words. For instance, a participant who responded yes to 55 of the 67 words and to 2 of the 33 nonwords received feedback that they knew 55/67 - 2/33 = 76% of the English vocabulary. Participants could do the test multiple times and always got a
different sample of words and nonwords. The test was made available on a dedicated website
(http://vocabulary.ugent.be/). Access to the test was unlimited. Participants were asked whether
English was their native language, what their age and gender were, which country they came from,
and which studies they had completed (see also Brysbaert, Stevens, Mandera, & Keuleers, 2016b;
Keuleers et al., 2015). For the present purposes, we limited the analyses to the first three tests taken by native speakers of English from the USA and the UK.¹ All in all, we analyzed the data of 221,268 individuals who returned 265,346 sessions. Of these, 56% were completed by female participants and 44% by male participants.

¹ Other countries with English as a native language did not (yet) produce enough observations to make reliable word prevalence estimates for them.
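In code, the guessing-corrected feedback score is a one-liner. A minimal sketch in R (the function and argument names are ours, not part of the test software):

```r
# Vocabulary score with guessing correction: proportion of words accepted
# minus proportion of nonwords accepted (names are illustrative).
vocab_score <- function(yes_words, yes_nonwords, n_words = 67, n_nonwords = 33) {
  yes_words / n_words - yes_nonwords / n_nonwords
}

vocab_score(yes_words = 55, yes_nonwords = 2)  # 0.76, reported as "76%"
```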
Results
In the dataset we selected, each word was judged on average by 388 participants (282 from the USA
and 106 from the UK). The percentages of people indicating they knew the word ranged from 2%
(stotinka, adyta, kahikatea, gomuti, arseniuret, alsike, …) to 100% (…, you, young, yourself, zone,
zoned). Figure 1 shows the distribution of percentages known. The vast majority of words were
known to 90% or more of the participants.
Figure 1: Distribution of the percentages of words known, showing that most words were known to 90% or more of the participants (see the rightmost two columns of the graph).
Because the distribution of percentages known is strongly right-skewed and does not differentiate much between well-known words, it is useful to apply a probit transformation to the percentages (Brysbaert et al., 2016a). The probit function translates percentages known into z-values on the basis of the cumulative normal distribution. That is, a word known by 2.5% of the participants gets a word prevalence of -1.96; a word known by 97.5% of the participants gets a prevalence of +1.96. Because a percentage known of 0% would return a prevalence score of -∞ and a percentage known of 100% a prevalence score of +∞, the range was restricted to percentages known between 0.5% (prevalence = -2.576) and 99.5% (prevalence = +2.576).² Figure 2 shows the distribution of prevalence scores for the total list of words.

² The specific formula to use in Microsoft Excel is =NORM.INV(0.005+Pknown*0.99;0;1).
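The probit transformation with the rescaling from the footnote can be written directly in R; a minimal sketch (variable names are ours):

```r
# Proportions of participants knowing each word (illustrative values)
pknown <- c(0.00, 0.02, 0.50, 0.95, 1.00)

# Rescale [0, 1] to [0.005, 0.995] so the extremes do not map to -Inf/+Inf,
# then apply the probit (inverse cumulative normal) transformation.
prevalence <- qnorm(0.005 + pknown * 0.99)

round(prevalence, 3)  # bounded between -2.576 and +2.576
```

This mirrors the Excel formula in the footnote: NORM.INV is Excel's equivalent of R's qnorm.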
Figure 2: Distribution of word prevalence scores
Word prevalence has negative values for words known to less than 50% of the participants. This may
be confusing at first sight, but is rather informative. All words with negative prevalence scores are
uninteresting for experiments with RTs (because these words are not known well enough), but they
are interesting for word learning experiments and experiments capitalizing on differences in
accuracy.
Although the US and UK word prevalence scores correlate at r = .93, a few words differ considerably in prevalence between the two countries, due to cultural differences. Table 1 lists the extreme cases. Researchers who want to collect or analyze data from one country only may wish to exclude the deviating words or to use country-specific word prevalence data.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Insert Table 1 about here
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
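For a study run in one country, the exclusion suggested above is a simple filter. A hedged sketch in R, assuming a data frame with separate US and UK prevalence columns (column names and the cutoff are ours):

```r
# Drop words whose prevalence differs strongly between the US and the UK
# ('prev' with columns Word, Prevalence_US, Prevalence_UK is assumed;
# the 0.5 cutoff is illustrative, not a recommendation from the norms).
uk_safe <- subset(prev, abs(Prevalence_US - Prevalence_UK) < 0.5)
```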
Similarly, although the word prevalence scores correlate r = .97 between men and women, some
words deviate, as can be seen in Table 2. These tend to follow gender differences in interests (games,
weapons and technical matters for males; food, clothing and flowers for females). The high
correlations between the US and the UK measures and between males and females indicate that the
reliability of the prevalence measure is very high (with .93 as the lower limit).
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Insert Table 2 about here
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Uses of the word prevalence measure
Word prevalence as a predictor variable
By its nature, word prevalence will be a good predictor of word difficulty. Experimenters interested in word processing times naturally want to avoid stimuli that are unknown to many of the participants. This can now easily be achieved by using only words with a percentage known of 95% or more (a prevalence of 1.60 or more). Similarly, word prevalence can be used as an estimate of word difficulty for vocabulary tests. By ordering the words according to word prevalence (and word frequency), it is possible to delineate word difficulty bands from which stimuli can be selected.
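Both uses translate into a few lines of R. A sketch, assuming the norms have been loaded into a data frame (column names follow Table 7 but are assumptions here):

```r
# Keep only words known by 95% or more of the participants
# (a prevalence of 1.60 corresponds to 95% known under the rescaled probit).
well_known <- subset(prev, Prevalence >= 1.60)

# Delineate word difficulty bands for vocabulary-test construction
# (the break points are illustrative, not prescribed by the norms).
prev$band <- cut(prev$Prevalence,
                 breaks = c(-Inf, 0, 1, 1.6, 2, Inf),
                 labels = c("very hard", "hard", "medium", "easy", "very easy"))
table(prev$band)
```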
Word prevalence is also likely to be of interest to natural language processing (NLP) researchers writing algorithms to gauge the difficulty of texts. At present, word frequency is used as a proxy for word difficulty (e.g., Benjamin, 2012; De Clercq & Hoste, 2016; Hancke, Vajjala, & Meurers, 2012). Word prevalence is likely to be a better measure, given that word difficulty does not completely reduce to differences in word frequency.
Finally, word prevalence can be used to predict differences in word processing efficiency. In recent
years, researchers have started to collect reaction times (RTs) to thousands of words and tried to
predict RTs on the basis of word characteristics. Table 3 gives an overview of the word characteristics
included in the analyses and references to some of the articles involved.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Insert Table 3 about here
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Although many variables have been examined, most of them account for less than 1% of the variance
in word processing times, once the effects of word frequency, word length (letters), similarity to
other words (OLD20), and age of acquisition are partialled out. Brysbaert, Buchmeier, Conrad, Jacobs,
Bölte, & Böhl (2011), for example, analyzed the lexical decision times provided by the English Lexicon
Project (ELP; Balota et al., 2007), using the 20+ word characteristics included in ELP as predictors.
The three most important variables (word frequency, similarity to other words, and word length)
together accounted for 40.5% of the variance. The remaining variables together accounted for only
2% extra variance. Indeed, our work over recent years has shown that the objective of explaining as
much variance as possible in word processing times is better served by looking for improved word
frequency measures than by searching for new variables or interactions between variables. At the
same time, we do not appear to have found all possible sources of variation yet (see also Adelman,
Marquis, Sabatos-DeVito, & Estes, 2013). The systematic variance to be accounted for in megastudies
is typically larger than 80% (as estimated on the basis of the reliability of the scores).
To examine whether word prevalence is a variable that substantially increases the percentage of
variance explained in word processing times, we repeated the analysis of Brysbaert et al. (2011) on
the ELP lexical decision times and additionally included age of acquisition and word prevalence as
predictors. The variables we included were:
- Word frequency based on the SUBTLEX-US corpus (Brysbaert & New, 2009) and expressed as Zipf scores (Brysbaert, Mandera, & Keuleers, 2018; Van Heuven, Mandera, Keuleers, & Brysbaert, 2014). The Zipf score is a standardized, log-transformed measure of word frequency that is easy to interpret: words with Zipf scores of 1-3 can be considered low-frequency words and words with Zipf scores of 4-7 high-frequency words (see the sketch after this list).
- Word length in number of letters.
- Number of orthographic neighbors (words formed by changing 1 letter; information obtained
from ELP).
- Number of phonological neighbors (words formed by changing one phoneme; from ELP).
- Orthographic Levenshtein Distance (from ELP).
- Phonological Levenshtein Distance (from ELP).
- Number of phonemes (from ELP).
- Number of syllables (from ELP).
- Number of morphemes (from ELP).
- Age of acquisition (AoA; from Kuperman et al., 2012; lemma values applied to inflected
forms).
- Word prevalence.
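As announced in the frequency bullet above, the Zipf scale is the log10 frequency per billion words; computed here with the Laplace smoothing that reproduces the Zipf = 1.59 value this article reports for words observed once in SUBTLEX-US. A sketch in R (argument names are ours; the type count is approximate):

```r
# Zipf = log10(frequency per billion words), with Laplace smoothing:
#   zipf = log10((count + 1) / ((tokens + types) / 1e9))
zipf_score <- function(count, tokens, types) {
  log10((count + 1) / ((tokens + types) / 1e9))
}

# A word observed once in the ~51-million-token SUBTLEX-US corpus
# (~74,000 word types) gets Zipf = 1.59, the value used in Table 7.
zipf_score(1, tokens = 51e6, types = 74e3)
```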
We took the prevalence of an inflected form to be the same as that of its lemma when the inflected form was not in the database. As we were interested in RTs, only words with an accuracy of 75% or more in the ELP lexical decision task were included. In our analyses, we used the z-scores of participants' RTs rather than their absolute RTs, which eliminates variance due to participants being faster or slower than average. The percentage of variance in RTs that can be accounted for is substantially higher for z-scores than for raw RTs (as shown below, where the percentages exceed the 43% reported by Brysbaert et al., 2011). In total, we had complete data for 25,661 words. We analyzed both the ELP lexical decision times and the ELP naming latencies. Table 4 shows the correlations between the variables.
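The z-transformation of RTs is computed within participants before averaging over words. A minimal sketch in R (the data frame and column names are ours):

```r
# 'trials' is assumed to have one row per response, with columns
# subject, word, and rt (correct responses only).
trials$zrt <- ave(trials$rt, trials$subject,
                  FUN = function(x) (x - mean(x)) / sd(x))

# Item-level dependent variable: mean z-RT per word
zRT_by_word <- aggregate(zrt ~ word, data = trials, FUN = mean)
```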
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Insert Table 4 about here
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Table 4 illustrates the high correlations observed between the different word characteristics. In this
respect, word prevalence comes out well because it is rather unrelated to the variables associated
with word length. In addition, the correlation with frequency is rather limited (r = .487). This is higher
than the value observed in the Dutch analyses of Brysbaert et al. (2016a), probably because the
words from ELP were selected on the basis of a word frequency list. This means that known words
with a frequency of 0 in the corpus were excluded.
One way to find the relevant predictors of word processing times is to run a hierarchical regression analysis. As we were particularly interested in the added value of word prevalence, we first entered all the other variables and then word prevalence. To take non-linearities into account, the regression analysis included second-degree polynomials for word frequency, word length, AoA, and prevalence. Because the number of phonological neighbors and the number of phonemes were highly correlated with other variables and did not alter the picture, they were left out of the analysis.
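A hedged sketch of this hierarchical regression in R (the data frame and column names are ours; poly() adds orthogonal second-degree polynomials):

```r
# 'elp' is assumed to have one row per word with columns zRT, zipf,
# length, nsyll, nmorph, old20, pld20, aoa, and prevalence.
m_base <- lm(zRT ~ poly(zipf, 2) + poly(length, 2) + nsyll + nmorph +
               old20 + pld20 + poly(aoa, 2), data = elp)
m_full <- update(m_base, . ~ . + poly(prevalence, 2))

summary(m_base)$r.squared  # .662 in the analysis reported below
summary(m_full)$r.squared  # .698
anova(m_base, m_full)      # test of the increment due to prevalence
```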
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Insert Table 5 about here
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
When we entered all variables except for prevalence, we explained 66.2% of the variance in the z-
values of the lexical decision times (Table 5). When prevalence was added, we explained 69.8% of the
variance. Figure 3 shows the effects of the various variables.
Figure 3: Effects of the variables on the standardized ELP lexical decision times. First line: effects of
word frequency and length in letters; second line: number of syllables and number of morphemes;
third line: orthographic and phonological similarity to other words; last line: Age of acquisition and
word prevalence.
The results agree with what was found for Dutch. High-frequency words are processed faster than low-frequency words. Interestingly, when prevalence is added, the relation becomes linear, whereas before there was a floor effect for high-frequency words. Words with 6-8 letters are responded to fastest. In addition, response times increase when words contain more syllables, but tend to decrease for morphologically complex words when all the other variables are taken into account. Words that are similar in sound and spelling to many other words (i.e., words with low OLD and PLD values) are responded to faster. Words were responded to more slowly when they were acquired late. And finally, there is a robust effect of word prevalence. Interestingly, the effect is strongest at the high end, when all other variables have been accounted for. The effect is rather flat for words with a prevalence below 1.2 (which corresponds to a percentage known of 89%).
Table 5 and Figure 4 show that the effects were very similar for word naming, but that the
contribution of word prevalence was smaller than for lexical decision times (though still highly
significant).
Figure 4: Effects of the variables on the standardized ELP naming times. First line: effects of word
frequency and length in letters; second line: number of syllables and number of morphemes; third
line: orthographic and phonological similarity to other words; last line: Age of acquisition and word
prevalence.
Relation to other word characteristics
Kuperman et al. (2012) used a dataset of Clark and Paivio (2004) to gauge the relationship of AoA to 30+ other word features. We used the same dataset and added word prevalence to it, together with values from ELP, the concreteness ratings of Brysbaert, Warriner, and Kuperman (2014), and estimates of word valence, dominance, and arousal collected by Hollis, Westbury, and Lefsrud (2016).
Kuperman et al. (2012) found that an 8-factor solution best fit the data. We used the same structure, but in addition allowed the factors to intercorrelate (using the fa() function from the R package psych; Revelle, 2018). This resulted in a solution that was more straightforward to interpret. There were 907 words for which we had all measures. Table 6 shows the outcome.
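The corresponding call to the psych package looks as follows; a sketch under the assumption that 'dat' holds the 907 complete cases on all word features (the estimation method shown is the package default, an assumption on our part):

```r
library(psych)
library(GPArotation)  # needed for the oblique (oblimin) rotation

fa8 <- fa(dat, nfactors = 8, rotate = "oblimin", fm = "minres")
print(fa8$loadings, cutoff = 0.30)  # salient loadings, as in Table 6
fa8$Phi                             # correlations between the factors
```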
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Insert Table 6 about here
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Word prevalence loads on the same factor as word accuracy of ELP and various ratings of familiarity.
There is a second factor, word frequency, that is correlated r = .66 with the first factor. The other
factors refer to the similarity to other words, word length, affect (valence and arousal), and gender
ladenness of the words. The factor word frequency correlates with the factors similarity to other
words (r = .41), length (r = -.34), and valence (r = .30). Similarity to other words in addition correlates
with length (r = -.45). Valence also correlates with gender ladenness (r = .40). All other correlations
are below r = .3 (absolute values).
All in all, when we analyze the word attributes collected by Clark and Paivio (2004) and add the ones
collected since, we see that the features reduce to eight main word characteristics. Word prevalence
loads on a factor together with other measures of word familiarity. The factor is correlated with word
frequency as observed in various corpora. The word processing measures of ELP also load on the
prevalence/familiarity factor, in line with the impact of prevalence on word processing times we saw
above. On the other hand, the fact that the accuracy data of the ELP lexical decision experiment had the highest loading raises the question to what extent the factor measures word knowledge or the decision element in the yes/no vocabulary task. If word prevalence were related to the decision component, one would expect it to correlate only with lexical decision times and not with word processing times in other tasks (e.g., eye movements in reading). To some extent, this worry is contradicted by the ELP naming data (Figure 4), but more research is clearly needed to establish to what extent word prevalence is a true word feature (independent variable) or a word processing characteristic (dependent variable). Notice that a similar question has been raised about word frequency: whether it should be considered an independent variable or a dependent variable (Baayen, Milin, & Ramscar, 2016). Other evidence that word prevalence is related to word knowledge (i.e., is an independent variable) can be found in the country and gender differences (Tables 1 and 2) and in the age differences observed (Brysbaert et al., 2016b; Keuleers et al., 2015). These seem more related to differences in word knowledge than to differences in decision processes.
Word prevalence as a matching variable
In many studies, a new word attribute is the variable of interest. In such studies, the stimuli in the
various conditions must be matched on word frequency, word length, orthographic similarity to
other words, and age of acquisition. Even with this set of criteria, there is evidence that researchers
can select stimuli in such a way that they increase the chances of observing the hypothesized effect
(i.e., show an experimenter bias; Forster, 2000; Kuperman, 2015). We think word prevalence will be
an important variable to correct for this bias. Table 7 shows words with different percentages known
matched on frequency (Zipf = 1.59, meaning the words were observed only once in the SUBTLEX-US
corpus of 51 million words). The various words clearly illustrate the danger of experimenter bias
when word prevalence is not taken into account.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Insert Table 7 about here
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
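A quick way to verify that two stimulus sets are matched on prevalence is a simple test on the two sets of scores (a sketch; 'set_a' and 'set_b' are assumed vectors of prevalence values):

```r
# Means of matched conditions should not differ reliably on prevalence
t.test(set_a, set_b)
```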
As can be seen in Figures 3 and 4, matching words on prevalence is not only needed for words with
very divergent prevalence scores, but also for words with high prevalence scores, something that
cannot be achieved without the present dataset.
Word prevalence as a dependent variable
A final set of studies for which word prevalence will be interesting relates to the question of what causes differences in prevalence rates. As we have seen above, familiarity and word frequency are important variables, but not the only ones. Which other variables are involved?
The best way to answer this question is to examine the divergences between word prevalence and word frequency. Which words are more widely known than expected on the basis of their frequency, and which words are less well known than expected? As for the former, it is striking that many well-known low-frequency words are morphologically complex. The best-known very low-frequency words (Zipf = 1.59) are "binocular, distinctively, reusable, gingerly, preconditioned, legalization, distinctiveness, inaccurately, localize, resize, pitfall, unsweetened, unsaturated, undersize, compulsiveness", all words derived from simpler stems. Another set of words known more widely than predicted by their frequency are words mainly used at a young age, such as grandma (AoA = 2.6 yrs; prevalence = 2.4, frequency = 4.7), potty (AoA = 2.7 yrs; prevalence = 1.9, frequency = 3.2), yummy (AoA = 2.9 yrs; prevalence = 2.1, frequency = 3.7), nap (AoA = 3.0 yrs; prevalence = 2.3, frequency = 4.1), and unicorn (AoA = 4.8 yrs; prevalence = 2.6, frequency = 3.4). Words that denote utensils are also often known more widely than expected on the basis of their frequency, such as hinge (AoA = 8.6 yrs; prevalence = 2.2, frequency = 2.2), sanitizer (AoA = 10.9 yrs; prevalence = 2.1, frequency = 1.6), and wiper (AoA = 8.4 yrs; prevalence = 2.3, frequency = 2.8).
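The divergences discussed here can be located with a simple residual analysis: regress prevalence on frequency and rank the residuals. A sketch in R (column names are assumptions based on Table 7):

```r
# Words better known than their frequency predicts have large positive
# residuals ('prev' with columns Word, Prevalence, FreqZipfUS assumed).
fit <- lm(Prevalence ~ FreqZipfUS, data = prev)
prev$resid <- resid(fit)

head(prev[order(-prev$resid), c("Word", "Prevalence", "FreqZipfUS")], 15)
```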
Finally, the prevalence measure itself is likely to be of interest. One may want to investigate, for instance, to what extent prevalence scores depend on the way in which they were defined. Goulden, Nation, and Read (1990) presented students with 250 lemmas taken at random from a dictionary and tested them in the same way as we did (i.e., students had to indicate which words they knew). Students selected on average 80 words. Milton and Treffers-Daller (2013) used the same words but asked participants to give a synonym or explanation for each word they knew. Now students were correct on only 45 words. Two questions are important: (1) how strong is the correlation between both estimates of word knowledge, and (2) which measure best captures "word knowledge"?
As for the first question, Paul, Stallman, and O’Rourke (1990) reported high correlations between the
yes/no test and tests involving interviews and multiple choice questions. Surprisingly, no other
studies on this topic could be found with native speakers (there are more studies with second
language speakers, which largely – but not always – confirm the finding that the yes/no test
correlates well with other test formats). In addition, all studies looked at correlations across
participants and not at correlations across items (given that the interest was in assessing the
language proficiency of participants, not knowledge of individual words). In order to obtain some
more information, we presented three existing English multiple-choice vocabulary tests to 248 first-year psychology students at a British university.³ The three tests were Mill Hill Form 2 (Raven, 1958; 34 words), the Shipley test (Shipley, 1940; 40 words), and a TOEFL test (Landauer & Dumais, 1997; 80 words). When we correlated the scores on the items with the word prevalence measures, we obtained a correlation of r = .69 (N = 174), which is shown in Figure 5.

³ We are grateful to Dr. Joe Levy for his help in designing and running the experiments.
Figure 5: Correlation between word prevalence and probability correct on a multiple choice
vocabulary test.
Because the correlation is lower than hoped for, we may want to have a look at the outliers. The
upper left outlier (correct on MC test = .04, prevalence = 1.98) is the word “sultry”, an item from the
Mill Hill test. According to our test, 98% of the people claim to know the word, whereas the Mill Hill
test suggests that no-one really knows the meaning. If we look in a dictionary, “sultry” has two
meanings: (1) hot and humid, and (2) displaying or arousing sexual desire. If we look at semantic
vectors (Mandera, Keuleers, & Brysbaert, 2017), the closest synonyms are “breathy, steamy,
songstress, hot, sexy, alluring, languid”. The first associates given by humans are “sexy, hot, humid,
steamy, woman, warm, seductive” (De Deyne, Navarro, Perfors, & Storms, 2016). However, none of
these words are among the options available in the Mill Hill: Participants have to choose between
“instinctive, sulky, trivial, solid, severe, muggy”. No surprise then that no-one knows the intended
meaning. Another word in the upper left corner comes from the Shipley test: “pristine” (correct on
MC test = .24, prevalence = 2.3). The alternatives given in the Shipley test are “vain, sound, first,
level”, rather than one of the expected associates “clean, pure, or perfect”. On the right side of
Figure 5, we find the word “easygoing” from the TOEFL test (correct on MC test = .97, prevalence =
1.7). In all likelihood, the low prevalence score for this word reflects the fact that many people do not
consider easygoing as a correct English spelling (they arguably prefer the two-word expression “easy
going”). A similar reasoning explains why the word “impostor” is doing worse on word prevalence
(1.5) than on the Shipley test (0.9). Currently, the preferred spelling of the word is “imposter” (which
has a prevalence score of 2.2).
The deviations between the multiple-choice scores and word prevalence bring us to the second
question: Which measure best captures “word knowledge”? As we have seen, answers to multiple
choice questions (the most frequent way of testing vocabulary) not only depend on knowledge of the
target word but also on the alternatives presented. If they test a rare (or outdated) meaning of a
word, they can easily lead to low scores for that word (remember that test makers are not interested
in the scores on individual items; they are interested in average scores of individuals). On the other
hand, word prevalence scores are affected by the spelling of the word and only give information
about the most familiar meaning. Which is the “best” way of testing word knowledge? Although one
might be tempted to think that deeper knowledge is better, it may be that hazy knowledge is what
we use most of the time when we are reading text or hearing discourse. Indeed, it might be argued that no person, except for specialized lexicographers, knows the full meaning of the words they are using (Anderson & Freebody, 1981). Still, it would be good to have more information on the
relationship between results based on the yes/no format used here and other test formats. In
particular, correlations over items are important.
Availability
We have made available an Excel file with the Pknown and Prevalence values for the 61,858 words tested. Most
words are lemmas (i.e., without inflections). An exception was made for common irregular forms
(e.g., lice, went, wept, were) and nouns that had a different meaning in plural than in singular
(glasses, aliens). The file also includes the SUBTLEX-US word frequencies, expressed as Zipf scores.
Figure 6 gives a snapshot of the file.
The file further contains sheets with the differences between UK and US respondents, and between
male and female respondents, so that readers can make use of this information if they want to do so.
Finally, we make the databases available that were used for the various analyses reported in the
present article, so that readers can check them and, if desired, improve on them. These files are
available as supplementary materials and can also be found at https://osf.io/g4xrt/.
Figure 6: Snapshot of the data file with word prevalences, available as supplementary materials or at
https://osf.io/g4xrt/.
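For readers who want to load the norms into R, a minimal sketch (the file name is an assumption; the column names follow the snapshot in Figure 6; adjust both to the actual download from https://osf.io/g4xrt/):

```r
library(readxl)

# File name is illustrative; use the name of the downloaded supplementary file.
prev <- read_excel("WordPrevalence.xlsx", sheet = 1)
head(prev[, c("Word", "Pknown", "Prevalence", "FreqZipfUS")])
```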
References
Adelman, J. S., & Brown, G. D. (2007). Phonographic neighbors, not orthographic neighbors,
determine word naming latencies. Psychonomic Bulletin & Review, 14(3), 455-459.
Adelman, J. S., Brown, G. D., & Quesada, J. F. (2006). Contextual diversity, not word frequency,
determines word-naming and lexical decision times. Psychological Science, 17(9), 814-823.
Adelman, J. S., Marquis, S. J., Sabatos-DeVito, M. G., & Estes, Z. (2013). The unexplained nature of
reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(4), 1037-1053.
Anderson, R.C., & Freebody, P. (1981). Vocabulary knowledge. In Guthrie, J. (Ed.), Reading
comprehension and education (pp. 77-117). Newark, DE: International Reading Association.
Baayen, R. H., Milin, P., & Ramscar, M. (2016). Frequency in lexical processing. Aphasiology, 30(11),
1174-1220.
Balota, D. A., Yap, M. J., Cortese, M. J., Hutchison, K. A., Kessler, B., Loftis, B., Neely, J. H., Nelson, D. L., Simpson, G. B., & Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39, 445-459.
Benjamin, R. G. (2012). Reconstructing readability: Recent developments and recommendations in
the analysis of text difficulty. Educational Psychology Review, 24(1), 63-88.
Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A. (2011). The word
frequency effect: A review of recent developments and implications for the choice of frequency
estimates in German. Experimental Psychology, 58, 412-424.
Brysbaert, M., Mandera, P., & Keuleers, E. (2018). The word frequency effect in word processing: An updated review. Current Directions in Psychological Science, 27. doi:10.1177/0963721417727521

Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977-990.

Brysbaert, M., New, B., & Keuleers, E. (2012). Adding part-of-speech information to the SUBTLEX-US word frequencies. Behavior Research Methods, 44(4), 991-997.
Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016a). The impact of word prevalence on
lexical decision times: Evidence from the Dutch Lexicon Project 2. Journal of Experimental
Psychology: Human Perception and Performance, 42, 441-458.
Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016b). How many words do we know? Practical estimates of vocabulary size dependent on word definition, the degree of language input and the participant's age. Frontiers in Psychology, 7:1116. doi:10.3389/fpsyg.2016.01116
Brysbaert, M., Warriner, A.B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally
known English word lemmas. Behavior Research Methods, 46, 904-911.
Clark, J. M., & Paivio, A. (2004). Extensions of the Paivio, Yuille, and Madigan (1968) norms. Behavior Research Methods, Instruments, & Computers, 36(3), 371-383.

Coltheart, M. (1981). The MRC Psycholinguistic Database. Quarterly Journal of Experimental Psychology, 33A, 497-505.
Connell, L., & Lynott, D. (2012). Strength of perceptual experience predicts word processing
performance better than concreteness or imageability. Cognition, 125(3), 452-465.
Cortese, M. J., Hacker, S., Schock, J., & Santo, J. B. (2015). Is reading-aloud performance in
megastudies systematically influenced by the list context? Quarterly Journal of Experimental
Psychology, 68(8), 1711-1722.
Cortese, M. J., Yates, M., Schock, J., & Vilks, L. (2018). Examining word processing via a megastudy of
conditional reading aloud. Quarterly Journal of Experimental Psychology, 71.
https://doi.org/10.1177/1747021817741269
De Clercq, O., & Hoste, V. (2016). All mixed up? Finding the optimal feature set for general readability prediction and its application to English and Dutch. Computational Linguistics, 42(3), 457-490.
De Deyne, S., Navarro, D. J., Perfors, A., & Storms, G. (2016). Structure at every scale: A semantic
network account of the similarities between unrelated concepts. Journal of Experimental Psychology:
General, 145(9), 1228.
Dufau, S., Grainger, J., Midgley, K. J., & Holcomb, P. J. (2015). A thousand words are worth a picture:
Snapshots of printed-word processing in an event-related potential megastudy. Psychological
Science, 26(12), 1887-1897.
Ernestus, M., & Cutler, A. (2015). BALDEY: A database of auditory lexical decisions. Quarterly Journal
of Experimental Psychology, 68(8), 1469-1488.
Ferrand, L., Brysbaert, M., Keuleers, E., New, B., Bonin, P., Méot, A., ... & Pallier, C. (2011). Comparing word processing times in naming, lexical decision, and progressive demasking: Evidence from Chronolex. Frontiers in Psychology, 2:306.
Ferrand, L., Méot, A., Spinelli, E., New, B., Pallier, C., Bonin, P., ... & Grainger, J. (2018). MEGALEX: A
megastudy of visual and auditory word recognition. Behavior Research Methods, 50.
https://doi.org/10.3758/s13428-017-0943-1
Forster, K. I. (2000). The potential for experimenter bias effects in word recognition experiments.
Memory & Cognition, 28(7), 1109-1115.
Gernsbacher, M. A. (1984). Resolving 20 years of inconsistent interactions between lexical familiarity and orthography, concreteness, and polysemy. Journal of Experimental Psychology: General, 113(2), 256-281.
Goulden, R., Nation, P., & Read, J. (1990). How large can a receptive vocabulary be? Applied
Linguistics, 11(4), 341-363.
Hancke, J., Vajjala, S., & Meurers, D. (2012). Readability Classification for German using Lexical,
Syntactic, and Morphological Features. Proceedings of COLING 2012, 1063-1080.
Hollis, G., Westbury, C., & Lefsrud, L. (2016). Extrapolating human judgments from skip-gram vector representations of word meaning. Quarterly Journal of Experimental Psychology, 70(8), 1603-1619.

Juhasz, B. J., & Yap, M. J. (2013). Sensory experience ratings for over 5,000 mono- and disyllabic words. Behavior Research Methods, 45(1), 160-168.

Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42, 627-633.

Keuleers, E., Stevens, M., Mandera, P., & Brysbaert, M. (2015). Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment. Quarterly Journal of Experimental Psychology, 68(8), 1665-1692.
Kuperman, V. (2015). Virtual experiments in megastudies: A case study of language and emotion.
Quarterly Journal of Experimental Psychology, 68(8), 1693-1710.
Kuperman, V., Estes, Z., Brysbaert, M., & Warriner, A. B. (2014). Emotion and language: valence and
arousal affect word recognition. Journal of Experimental Psychology: General, 143(3), 1065-1081.
Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30
thousand English words. Behavior Research Methods, 44, 978-990.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis
theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-
240.
Liu, Y., Shu, H., & Li, P. (2007). Word naming and psycholinguistic norms: Chinese. Behavior Research
Methods, 39(2), 192-198.
Mandera, P., Keuleers, E., & Brysbaert, M. (2017). Explaining human performance in psycholinguistic
tasks with models of semantic similarity based on prediction and counting: A review and empirical
validation. Journal of Memory and Language, 92, 57-78
Milton, J., & Treffers-Daller, J. (2013). Vocabulary size revisited: the link between vocabulary size and
academic achievement. Applied Linguistics Review, 4(1), 151-172.
Paul, P. V., Stallman, A. C., & O'Rourke, J. P. (1990). Using three test formats to assess good and poor readers' word knowledge (Technical Report No. 509). Center for the Study of Reading, University of Illinois at Urbana-Champaign.
Raven, J. C. (1958). Guide to using the Mill Hill Vocabulary Scale with the Progressive Matrices Scales.
Oxford, England: H. K. Lewis & Co.
Revelle, W. (2018). Package 'psych'. Retrieved May 29, 2018, from https://cran.r-project.org/web/packages/psych/psych.pdf
Shipley, W. C. (1940). A self-administering scale for measuring intellectual impairment and
deterioration. The Journal of Psychology, 9(2), 371-377.
Schröter, P., & Schroeder, S. (2017). The Developmental Lexicon Project: A behavioral database to
investigate visual word recognition across the lifespan. Behavior Research Methods, 49(6), 2183-
2203.
Sze, W. P., Yap, M. J., & Rickard Liow, S. J. (2015). The role of lexical variables in the visual recognition
of Chinese characters: A megastudy analysis. Quarterly Journal of Experimental Psychology, 68(8),
1541-1570.
Tsang, Y. K., Huang, J., Lui, M., Xue, M., Chan, Y. W. F., Wang, S., & Chen, H. C. (2018). MELD-SCH: A
megastudy of lexical decision in simplified Chinese. Behavior Research Methods, 50.
https://doi.org/10.3758/s13428-017-0944-0
Tse, C. S., & Yap, M. J. (2018). The role of lexical variables in the visual recognition of two-character
Chinese compound words: A megastudy analysis. Quarterly Journal of Experimental Psychology, 71.
https://doi.org/10.1177/1747021817738965.
Tse, C. S., Yap, M. J., Chan, Y. L., Sze, W. P., Shaoul, C., & Lin, D. (2017). The Chinese Lexicon Project: A
megastudy of lexical decision performance for 25,000+ traditional Chinese two-character compound
words. Behavior Research Methods, 49(4), 1503-1519.
Van Heuven, W.J.B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). Subtlex-UK: A new and
improved word frequency database for British English. Quarterly Journal of Experimental Psychology,
67, 1176-1190.
Yap, M. J., & Balota, D. A. (2009). Visual word recognition of multisyllabic words. Journal of Memory
and Language, 60(4), 502-529.
Yap, M. J., Tan, S. E., Pexman, P. M., & Hargreaves, I. S. (2011). Is more always better? Effects of
semantic richness on lexical decision, speeded pronunciation, and semantic classification.
Psychonomic Bulletin & Review, 18(4), 742-750.
Table 1: Words much better known in the US than in the UK (top panel) and vice versa (bottom panel)

Better known in the US:
Word P(US) P(UK)
manicotti 0.90 0.16
ziti 0.81 0.08
tilapia 0.93 0.20
garbanzo 0.92 0.21
kabob 0.98 0.28
kwanza 0.90 0.22
crawdad 0.86 0.20
hibachi 0.90 0.26
sandlot 0.95 0.32
acetaminophen 0.93 0.33
tamale 0.91 0.32
kielbasa 0.84 0.24
conniption 0.76 0.17
chigger 0.80 0.22
tomatillo 0.80 0.22
provolone 0.97 0.40
albuterol 0.74 0.16
staph 0.85 0.28
goober 0.97 0.40
luau 0.83 0.26

Better known in the UK:
Word P(US) P(UK)
tippex 0.07 0.91
biro 0.16 0.99
tombola 0.17 0.97
chipolata 0.16 0.94
dodgem 0.18 0.95
yob 0.21 0.98
gazump 0.05 0.82
abseil 0.14 0.89
naff 0.19 0.94
kerbside 0.23 0.98
plaice 0.16 0.91
judder 0.19 0.94
chiropody 0.19 0.94
korma 0.21 0.95
bolshy 0.11 0.85
quango 0.08 0.82
pelmet 0.11 0.85
brolly 0.24 0.96
chaffinch 0.12 0.85
escalope 0.19 0.91
Table 2: Words better known by males than by females (top panel) and vice versa (bottom panel)

Better known by males:
Word P(Male) P(Female)
howitzer 0.84 0.53
thermistor 0.48 0.17
azimuth 0.58 0.27
femtosecond 0.47 0.15
milliamp 0.69 0.37
aileron 0.55 0.22
servo 0.61 0.28
degauss 0.59 0.26
boson 0.76 0.44
checksum 0.58 0.25
piezoelectricity 0.51 0.18
gauss 0.64 0.31
katana 0.80 0.47
shemale 0.88 0.54
neodymium 0.56 0.21
yakuza 0.69 0.32
teraflop 0.58 0.22
strafe 0.83 0.46
parsec 0.83 0.44
bushido 0.60 0.21

Better known by females:
Word P(Male) P(Female)
peplum 0.13 0.64
tulle 0.27 0.77
chignon 0.24 0.72
bandeau 0.35 0.81
freesia 0.27 0.72
chenille 0.34 0.76
kohl 0.36 0.77
verbena 0.30 0.70
doula 0.21 0.59
ruche 0.18 0.55
espadrille 0.36 0.73
damask 0.43 0.80
jacquard 0.39 0.74
whipstitch 0.37 0.71
boucle 0.16 0.50
taffeta 0.53 0.87
sateen 0.38 0.72
chambray 0.43 0.77
pessary 0.19 0.53
voile 0.34 0.68
Table 3: Variables investigated in word processing megastudies that correlate with response times. For each variable, exemplary studies are given in which the variable was examined (numbers refer to the key below). The studies cover lexical decision (LDT) and naming in Chinese, Dutch, English, French, and German.

Word frequency: 18, 14, 4, 20, 20, 11, 10, 15, 15
Word length (N letters): 17, 4, 20, 20, 11, 10, 15, 15
Age of acquisition: 16, 14, 4, 7, 7, 10, 10
Concreteness/imageability: 16, 14, 4, 8, 7, 10, 10
Orthographic similarity to other words: 4, 20, 20, 11, 10, 15, 15
Phonological similarity to other words: 19, 2, 10, 10
Word length (N phonemes): 20, 20, 11, 10
First phoneme: 20, 20, 10, 10
Visual complexity: 18, 14, 8
Semantic richness: 16, 14, 21
Contextual diversity: 18, 1, 1
Phonological consistency: 19, 20, 20
Word length (N syllables): 4, 20, 20
Phonological uniqueness point: 9, 11
Part of speech: 4, 3
Homophone density: 19, 14
Valence and arousal: 13, 13
Number of senses: 20, 20
Semantic neighborhood size: 20, 20
Perceptual strength: 5, 5
Sensory experience: 12, 12
Stress pattern: 12, 12
Orthographic uniqueness point: 11
Semantic transparency: 18
Pronunciation ambiguity: 17
Bigram frequency: 8
Consonant vowel proportion: 8
List context: 6

Key: 1 Adelman et al. (2006), 2 Adelman & Brown (2007), 3 Brysbaert et al. (2012), 4 Brysbaert et al. (2016), 5 Connell & Lynott (2012), 6 Cortese et al. (2015), 7 Cortese et al. (2018), 8 Dufau et al. (2015), 9 Ernestus & Cutler (2015), 10 Ferrand et al. (2011), 11 Ferrand et al. (2018), 12 Juhasz & Yap (2013), 13 Kuperman et al. (2014), 14 Liu et al. (2007), 15 Schröter & Schroeder (2017), 16 Sze et al. (2015), 17 Tsang et al. (2018), 18 Tse et al. (2017), 19 Tse & Yap (2018), 20 Yap & Balota (2009), 21 Yap et al. (2011)
Table 4: Correlations between the ELP variables and word prevalence (N = 25,661)

(Upper triangle of the correlation matrix; each row gives the correlations with the remaining columns, in the order Zipf, Ortho_N, Phono_N, OLD, PLD, NPhon, NSyll, NMorph, AoA, Prevalence, I_Zscore, I_NMG_Zscore.)

Length: -0.471 -0.570 -0.574 0.869 0.841 0.916 0.830 0.696 0.476 -0.150 0.654 0.627
Zipf: 0.374 0.408 -0.443 -0.445 -0.451 -0.386 -0.427 -0.561 0.487 -0.649 -0.522
Ortho_N: 0.810 -0.592 -0.536 -0.531 -0.495 -0.363 -0.380 0.128 -0.374 -0.379
Phono_N: -0.564 -0.580 -0.586 -0.522 -0.390 -0.393 0.128 -0.383 -0.375
OLD: 0.912 0.817 0.738 0.542 0.471 -0.230 0.647 0.587
PLD: 0.872 0.792 0.567 0.491 -0.224 0.650 0.599
NPhon: 0.860 0.664 0.509 -0.136 0.636 0.629
NSyll: 0.606 0.516 -0.151 0.614 0.591
NMorph: 0.308 -0.065 0.458 0.411
AoA: -0.425 0.603 0.560
Prevalence: -0.512 -0.392
I_Zscore: 0.753

Zipf = log word frequency based on SUBTLEX-US (Brysbaert & New, 2009), AoA = age of acquisition (Kuperman, Stadthagen-Gonzalez, & Brysbaert, 2012), I_Zscore = standardized RT in the ELP lexical decision task, I_NMG_Zscore = standardized RT in the ELP naming task. All other variables are explained in the text and come from the ELP website (Balota et al., 2007).
Table 5: Variance explained (R²) in the ELP data

Lexical decision times
Frequency + Length + AoA + NSyll + NMorph + OLD + PLD: .662
Frequency + Length + AoA + NSyll + NMorph + OLD + PLD + Prevalence: .698

Naming latencies
Frequency + Length + AoA + NSyll + NMorph + OLD + PLD: .539
Frequency + Length + AoA + NSyll + NMorph + OLD + PLD + Prevalence: .552
Table 6: Outcome of the factor analysis on the word features collected by Clark and Paivio (2004; N = 907). Also included are features from ELP (Balota et al., 2007), SUBTLEX-US (Brysbaert & New, 2009), AoA (Kuperman et al., 2012), concreteness (Brysbaert et al., 2014), and valence, dominance, and arousal (Hollis et al., 2016). The analysis shows that word prevalence loads on the same factor as familiarity (the prevalence factor is correlated r = .65 with the second factor, frequency). The eight factors are Prevalence, Frequency, Concreteness, Similarity, Length, Valence, Arousal, and Gender. For each variable, the salient factor loadings are given, followed by h² (the proportion of variance in the variable explained by the factors) and u² (the proportion left unexplained).

Lexical decision accuracy: 0.92 (h² = .77, u² = .23)
Familiarity rating 2: 0.90 (h² = .95, u² = .05)
Word prevalence: 0.84 (h² = .74, u² = .26)
Context availability: 0.84 (h² = .90, u² = .10)
Ease of definition (estimation): 0.67, 0.36 (h² = .80, u² = .20)
Pronounceability: 0.63 (h² = .72, u² = .28)
Lexical decision time: -0.54, 0.34 (h² = .80, u² = .20)
Naming time: -0.53, 0.38 (h² = .64, u² = .36)
Familiarity rating 1: 0.52, 0.42 (h² = .81, u² = .19)
Frequency SUBTLEX-US: 0.97 (h² = .99, u² = .01)
Contextual diversity SUBTLEX-US: 0.96 (h² = .99, u² = .01)
Kucera-Francis frequency: 0.86 (h² = .76, u² = .24)
Frequency HAL: 0.76 (h² = .82, u² = .18)
Thorndike-Lorge frequency: 0.66 (h² = .71, u² = .29)
Word availability (used in dictionary): 0.64 (h² = .58, u² = .42)
Word availability (produced as associate): 0.58 (h² = .72, u² = .28)
Number of childhood dictionaries: 0.58, 0.44 (h² = .70, u² = .30)
Concreteness rating Brysbaert: 0.91 (h² = .91, u² = .09)
Concreteness rating: 0.91 (h² = .90, u² = .10)
Imaginability rating: 0.91 (h² = .88, u² = .12)
Imaginability rating 2: 0.42, 0.67 (h² = .89, u² = .11)
Meaningfulness (number of associates produced): 0.62 (h² = .59, u² = .41)
Age of acquisition: -0.45 (h² = .75, u² = .25)
Age of acquisition Kuperman: -0.33, -0.39 (h² = .76, u² = .24)
Word ambiguity (estimation): -0.37 (h² = .31, u² = .69)
Number of rhyming words: 0.85 (h² = .76, u² = .24)
Number of similar looking words: 0.80 (h² = .70, u² = .30)
Number of similar sounding words: 0.79 (h² = .85, u² = .15)
Number of orthographic neighbors: 0.77 (h² = .72, u² = .28)
Number of phonological neighbors: 0.72 (h² = .72, u² = .28)
Words with same initial sounds: 0.56, 0.42 (h² = .38, u² = .62)
Number of letters: 0.82 (h² = .90, u² = .10)
Number of phonemes: 0.78 (h² = .89, u² = .11)
Number of syllables: 0.65 (h² = .75, u² = .25)
Number of morphemes: 0.64 (h² = .56, u² = .44)
PLD20: -0.39, 0.62 (h² = .83, u² = .17)
OLD20: -0.47, 0.55 (h² = .83, u² = .17)
Words with same initial letters: 0.45, 0.52 (h² = .28, u² = .72)
High-frequency words starting with same letters: 0.36, 0.45 (h² = .18, u² = .82)
Mean bigram frequency: 0.41 (h² = .19, u² = .81)
Valence: 0.92 (h² = .88, u² = .12)
Goodness: 0.91 (h² = .85, u² = .15)
Dominance: 0.88 (h² = .83, u² = .17)
Pleasantness: 0.84 (h² = .82, u² = .18)
Deviation of goodness from mean: 0.85 (h² = .75, u² = .25)
Deviation of pleasantness from mean: 0.82 (h² = .64, u² = .36)
Emotionality: 0.80 (h² = .78, u² = .22)
Arousal: 0.63 (h² = .57, u² = .43)
Associates to words (estimation): 0.31, 0.34 (h² = .50, u² = .50)
Gender ladenness rating 2: 0.98 (h² = .95, u² = .05)
Gender ladenness rating 1: 0.97 (h² = .96, u² = .04)

SS loadings per factor: 6.80, 6.96, 4.97, 5.23, 4.83, 3.54, 3.05, 2.05
Proportion of variance: .13, .14, .10, .10, .09, .07, .06, .04
Cumulative variance: .13, .27, .37, .47, .56, .63, .69, .73
Table 7: Twenty very low frequency words with various prevalence scores, illustrating the danger of
experimenter bias if words are selected on the basis of frequency only
Word Pknown Prevalence FreqZipfUS
zarzuela 0.09 -1.32 1.59
cleek 0.13 -1.10 1.59
fovea 0.21 -0.80 1.59
motet 0.25 -0.66 1.59
cantle 0.30 -0.51 1.59
jackleg 0.35 -0.38 1.59
scenarist 0.40 -0.26 1.59
ropy 0.45 -0.11 1.59
snaffle 0.51 0.01 1.59
ablate 0.55 0.12 1.59
karting 0.60 0.25 1.59
lionize 0.66 0.39 1.59
maraud 0.70 0.52 1.59
bluesy 0.75 0.66 1.59
endomorph 0.80 0.83 1.59
inundation 0.85 1.04 1.59
straggle 0.90 1.27 1.59
bullish 0.95 1.62 1.59
dishearten 0.98 1.99 1.59
binocular 1.00 2.45 1.59
... Nearly all unknown words are low in frequency, and it is very difficult to predict which of these words will be known by 40%, 60%, or 80% of the population (the type of words you need in order to find a reliable difference between low and high performers). Vermeiren et al. (2023) were helped because they had information about word prevalence (Brysbaert et al., 2019) and age of acquisition (Kuperman et al., 2012) for thousands of English words, but even then it took them several attempts to develop a valid test. ...
... An obvious error to avoid for such words is translating a difficult English word into an easy Spanish word (e.g., translate baneful to funesto, which translates back to disastrous). To find the best translation, we were helped by the fact that both English and Spanish have word prevalence values (Aguasvivas et al., 2018;Brysbaert et al., 2019). Word prevalence refers to the percentage of people reporting that they know the word in a yes/no form recognition task (i.e., the format used in Lextale). ...
... Texts with reading rates lower than 100 words per minute and higher than 668 words per minute were considered outliers, because they are 2.5 times slower/faster than usual reading speed (Kuperman et al., 2021). Slow reading rates are characteristic of text studying, fast reading rates of text scanning (Brysbaert, 2019). This resulted in the loss of 16.1% of text data (on a total of 231 × 14 = 3,234 texts read). ...
Article
Full-text available
This study examined the efficiency of item translation in a challenging language-testing situation. We created a Spanish translation of recently developed English vocabulary tests to assess word knowledge in Spanish-speaking students and highly educated adults, a group for whom it is a challenge to find words that some people know and others do not. The English tests were multiple-choice tests based on meaning recognition and consisted of a total of 150 items. From these, we were able to create two Spanish tests with 37 questions each. We constructed and validated the tests in two separate studies, including another established vocabulary test (Lextale-Esp, based on form recognition), general knowledge tests, and a test for reading comprehension. Two online studies with 161 and 196 participants confirmed that both vocabulary tests have reliability above .75 (.86 when combined) and correlate more highly with general knowledge and reading comprehension than Lextale-Esp. This shows that test translation is an efficient way to find useful items for language tests in different languages. All materials (including the general knowledge tests and the reading comprehension test) are freely available for research purposes.
... This study used the same set of 4,625 monomorphemic words and corresponding polysemy-form mapping values from Study 1. The following sublexical and lexical variables were included: orthographic length (number of letters; Gao et al., 2023), orthographic and phonological Levenshtein distances (the mean Levenshtein distance from a word to its 20 closest neighbors, i.e., those with the shortest sequence of substitutions, insertions, and deletions needed to transform one word into another; OLD20 and PLD20; Suárez et al., 2011;Yarkoni et al., 2008), log mean bigram frequency (Gao et al., 2023), Zipf SUBTLEXus word frequency (Brysbaert & New, 2009;Zipf, 1949), single-and biphone phonotactic probabilities (the relative frequencies of a word's sound sequences at the level of its phonemes and phoneme pairs; Vaden et al., 2009), phonographic neighborhood size (words differing in one letter and one phoneme; Adelman & Brown, 2007), feedforward (spelling-to-sound) and feedback (soundto-spelling) word onset token consistency , age of acquisition (Kuperman et al., 2012), prevalence (the number of people who know the word; Brysbaert et al., 2019), and part of speech . In addition, four measures related to either polysemy or form-meaning mappings were included: concreteness ratings (Brysbaert et al., 2014), SemD (Hoffman et al., 2013), iconicity ratings (Winter et al., 2024), and overall sound-meaning systematicity (Monaghan et al., 2014). ...
... The identical predictor variables from Study 2 plus the log transformed Wordnet number of senses were entered into a relative weights analysis with robust standard errors with the zRTs as the dependent variable, although word prevalence (Brysbaert et al., 2019) was omitted for the ECP analysis as it is derived from the same data set (see Mandera et al., 2020). All predictors were entered as continuous variables, excepting part of speech that was again entered as a categorical predictor with nouns as the reference category. ...
Article
Full-text available
Most English word forms convey multiple meanings, that is, they are semantically ambiguous. A relatively small proportion of these ambiguous forms are homonyms that convey distinct meanings (e.g., bank may refer to a financial institution or the land bordering a river), while the majority are polysemes that convey interrelated senses (mouth may refer to the mouth of a person or a river). Empirical investigations have demonstrated an advantage for polysemous word forms across various lexical processing tasks, suggesting differences in the way they are organized in semantic memory. However, polysemous forms also tend to be more frequent, comprise fewer phonemes and syllables, and occur in more dense neighborhoods involving more similar sounding words. The nature and extent of these systematic polysemy-form mappings and their influence on processing have yet to be fully investigated. The present study reports an analysis of a corpus of English monomorphemic words (N = 4,466), confirming that phonological features predict a significant proportion of variance (16.8%) in the number of senses conveyed by a word. A series of experiments using relative weight analyses of megastudy data sets of word recognition and production tasks demonstrates that these systematic polysemy-form mappings have a relatively important influence on lexical processing compared to other lexical and semantic variables. These findings suggest that polysemous word forms might be structured systematically to minimize cognitive costs and maintain a compact or kernel lexicon. Implications for current accounts of lexical ambiguity based solely on semantic similarity are discussed.
... Therefore, the low-frequency items may still be useful even for participants with diverse language proficiency levels. Indeed, Brysbaert et al. (2019) showed that words with low frequency in frequency corpus may be known by many language users (i.e., high word prevalence). Currently, word prevalence norm is not available in Chinese. ...
Article
Full-text available
To examine how individual differences in language skills affect language processing, it is essential to have good-quality tests that can assess such individual differences accurately. This study introduces a dictation test and a vocabulary knowledge test in Chinese, which aim to measure lexical expertise in proficient Chinese language users like university students. The psychometric properties of the two tests were examined with two groups of participants. In the first group, exploratory factor analyses confirmed that each of these tests was unidimensional, measuring a single underlying construct of lexical expertise. After removing some problematic items, the two tests also demonstrated satisfactory internal reliabilities. Although the test scores were only weakly correlated with self-reported measures of language proficiency, the correlation with word recognition performance was moderate. These results were successfully replicated with the second cross-validation group, confirming the reliability and convergent validity of the tests. An additional dataset further showed that the vocabulary test score was positively correlated with sentence comprehension performance. Taken together, the tests have acceptable psychometric quality and can serve as tools for examining individual differences in Chinese language processing. The tests are freely available online, and normative performance data are provided, facilitating their use in future research.
... Stimuli consisted of 40 cue words. We first collated concrete words (rated at least 3.5/5 in concreteness; Brysbaert et al., 2014) that were highly prevalent (scoring at least 1.96 in UK prevalence, corresponding to roughly 97.5% of people knowing the word; Brysbaert et al., 2019), in order to ensure that the cue words were familiar to participants and could easily evoke associations. We then filtered the set to include only words with available ratings from the Lancaster Sensorimotor Norms dataset (Lynott et al., 2020), which we required to calculate sensorimotor overlap values. ...
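A hypothetical sketch of this selection pipeline follows; the file names and column labels are placeholders standing in for the published norm files, not the study's actual materials.

    import pandas as pd

    conc = pd.read_csv("concreteness_norms.csv")   # columns: Word, Conc.M
    prev = pd.read_csv("prevalence_norms.csv")     # columns: Word, Prevalence_UK
    lanc = pd.read_csv("lancaster_norms.csv")      # Word plus sensorimotor ratings

    cues = (
        conc[conc["Conc.M"] >= 3.5]                             # concrete words only
        .merge(prev[prev["Prevalence_UK"] >= 1.96], on="Word")  # widely known words
        .merge(lanc[["Word"]], on="Word")                       # must have Lancaster ratings
    )
    print(len(cues), "candidate cue words")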
... Some words are known to everyone, while others are familiar to only a few. Norms for this variable were recently collected [Brysbaert et al. 2019]. This was done as follows. ...
The article deals with the semantic analysis of words marked as "unfamiliar" by more than 50% of the respondents during the collection of age-of-acquisition ratings. Each unfamiliar word is assigned to a more general lexico-semantic category and to a more specific category (subcategory). Lexico-semantic categories are defined according to the Russian National Corpus (ruscorpora.ru). The proportions of unfamiliar words falling within the categories (subcategories) are calculated. The authors make assumptions about the factors behind the increased representation of members of some categories (subcategories) among unfamiliar words.
Many abstract words refer to internal cognitive events or states, such as thinking or believing, or to cognitive products, such as theories, ideas, or whims (Binder et al., Cognitive Neuropsychology, 33, 130–174, 2016). Mental state information is proposed to be an important component in the grounding of abstract meaning (Kiefer et al., 2022; Muraki et al., 2022), such that our inner cognitive experiences form a foundational aspect of semantic representation. We tested this proposal by first collecting cognition ratings for over 8,000 English words. Then, we used the norms generated from our ratings to examine the unique variance explained by cognition ratings in performance on lexical-semantic tasks. We found a significant effect of cognition, such that there was a facilitative relationship between cognition ratings and behavioral responses, even when controlling for other key lexical and semantic variables. Specifically, words rated as more cognitive in nature elicited faster and more accurate task responses, especially for words with more abstract meanings. This study highlights a novel behavioral effect that is consistent with a multidimensional account of semantic representation.
Introduction: Post-stroke cognitive impairment (PSCI) affects up to 75% of stroke survivors but remains challenging to detect with traditional neuropsychological assessments. Recent advances in artificial intelligence and natural language processing have opened new avenues for cognitive screening through speech analysis, yet their application to PSCI remains largely unexplored. This study aims to characterize speech markers of PSCI in the first year post-stroke and evaluate their utility for predicting cognitive outcomes in a Singapore cohort.
Methods: This prospective mixed-methods study will recruit 30 stroke survivors from the Alexandra Hospital and National University Hospital in Singapore. Participants will be assessed at four timepoints: baseline (within 6 weeks of stroke onset), 3, 6, and 12 months post-stroke. At each visit, participants will complete the Montreal Cognitive Assessment (MoCA) and a standardized speech protocol comprising picture description and semi-structured conversation tasks. Speech recordings will be automatically transcribed using automated speech recognition (ASR) systems based on pretrained acoustic models, and comprehensive linguistic and acoustic features will be extracted. Machine learning models will be developed to predict MoCA-defined cognitive impairment. Statistical analysis will include correlation analysis between speech features and MoCA scores, as well as machine learning classification and regression models to predict cognitive impairment. Linear mixed-effects models will characterize trajectories of MoCA scores and speech features over time. Qualitative analysis will follow an inductive thematic approach to explore acceptability and usability of speech-based screening.
Discussion: This study represents a critical step toward developing speech-based digital biomarkers for PSCI detection that are sensitive, culturally appropriate, and clinically feasible. If validated, this approach could transform current models of PSCI care by enabling remote, frequent, and naturalistic monitoring of cognitive health, potentially improving outcomes through earlier intervention.
What mechanisms underlie linguistic generalization in large language models (LLMs)? This question has attracted considerable attention, with most studies analyzing the extent to which the language skills of LLMs resemble rules. As yet, it is not known whether linguistic generalization in LLMs could equally well be explained as the result of analogy. A key shortcoming of prior research is its focus on regular linguistic phenomena, for which rule-based and analogical approaches make the same predictions. Here, we instead examine derivational morphology, specifically English adjective nominalization, which displays notable variability. We introduce a method for investigating linguistic generalization in LLMs: focusing on GPT-J, we fit cognitive models that instantiate rule-based and analogical learning to the LLM training data and compare their predictions on a set of nonce adjectives with those of the LLM, allowing us to draw direct conclusions regarding the underlying mechanisms. As expected, rule-based and analogical models explain the predictions of GPT-J equally well for adjectives with regular nominalization patterns. However, for adjectives with variable nominalization patterns, the analogical model provides a much better match. Furthermore, GPT-J's behavior is sensitive to individual word frequencies, even for regular forms, a behavior that is consistent with an analogical account but not a rule-based one. These findings refute the hypothesis that GPT-J's linguistic generalization on adjective nominalization involves rules, suggesting analogy as the underlying mechanism. Overall, our study suggests that analogical processes play a bigger role in the linguistic generalization of LLMs than previously thought.
Background and Objectives: Frontotemporal dementia (FTD) includes behavioral-variant FTD (bvFTD) with predominant frontal atrophy and semantic behavioral-variant FTD (sbvFTD) with predominant right anterior temporal lobe (rATL) atrophy. These variants present diagnostic challenges because of overlapping symptoms and neuroanatomy. Accurate differentiation is crucial for clinical trial inclusion targeting TDP-43 proteinopathies. This study investigated whether automated speech analysis can distinguish between FTD-related rATL and frontal atrophy, potentially offering a noninvasive diagnostic tool.
Methods: This cross-sectional study used data from the University of California, San Francisco Memory and Aging Center. Using stepwise logistic regression and receiver-operating characteristic curve analysis, we analyzed 16 linguistic and acoustic features that were extracted automatically from audio-recorded picture description tasks. Voxel-based morphometry was used to investigate brain-behavior relationships.
Results: We evaluated 62 participants: 16 with FTD-related predominant frontal atrophy, 24 with predominant rATL atrophy, and 22 healthy controls (mean age 68.3 years, SD = 9.2; 53.2% female). Logistic regression identified 3 features (content units, lexical frequency, and familiarity) differentiating the overall FTD group from controls (area under the curve [AUC] = 0.973), adjusted for age. Within the FTD group, 5 features (adpositions/total words ratio, arousal, syllable pause duration, restarts, and words containing "thing") differentiated frontal from rATL atrophy (AUC = 0.943). Neuroimaging analyses showed that semantic features (lexical frequency, content units, and "thing" words) were linked to bilateral inferior temporal lobe structures, speech and lexical features (syllable pause duration and adpositions/total words ratio) to bilateral inferior frontal gyri, and socioemotional features (arousal) to areas known to mediate social cognition, including the right insula and bilateral anterior temporal structures. As a composite score, this set of 5 features was uniquely associated with rATL atrophy.
Discussion: Automated speech analysis demonstrated high accuracy in differentiating FTD subtypes and provided insights into the neural basis of language impairments. Automated speech analysis could enhance early diagnosis and monitoring of FTD, offering a scalable, noninvasive alternative to traditional methods, particularly in resource-limited settings. Future research should focus on further clinical validation with other neuroimaging or fluid biomarkers and longitudinal cognitive data, as well as external validation in larger and more diverse populations.
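As a generic illustration of the analysis style described above (logistic regression over automatically extracted speech features, evaluated by ROC AUC), here is a sketch with placeholder data; none of the feature names, group labels, or values are the study's own.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(0)
    X = rng.normal(size=(62, 5))       # 62 speakers x 5 speech features (placeholder)
    y = rng.integers(0, 2, size=62)    # 1 = rATL atrophy, 0 = frontal (placeholder)

    model = LogisticRegression(max_iter=1000)
    probas = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    print("cross-validated AUC:", roc_auc_score(y, probas))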
To examine the effect of lexical variables on two-character Chinese compound word processing, we performed item-level hierarchical regression analyses on lexical decision megastudy data for 18,983 two-character Chinese compound words. The first analysis determined the unique item-level variance explained by orthographic (frequency and stroke count), phonological (consistency, homophone density), and semantic (transparency) variables. Both character and word variables were considered. Results showed that orthographic and semantic variables each accounted for more collective variance than phonological variables, suggesting that skilled Chinese readers rely more on orthographic and semantic information than on phonological information when processing visually presented words. The second analysis tested interactive effects of lexical variables and showed significant semantic transparency × cumulative character frequency and word frequency × cumulative character frequency interactions. The effect of cumulative character frequency was stronger for transparent words than for opaque words, and stronger for low-frequency words than for high-frequency words. However, there was no semantic transparency × word frequency interaction in reaction times. Implications of the current findings for models of Chinese compound word processing are discussed.
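The hierarchical (blockwise) regression logic can be sketched as follows; the file and column names are placeholders, and the blocks mirror the orthographic/phonological/semantic grouping described above.

    import pandas as pd
    import statsmodels.formula.api as smf

    items = pd.read_csv("compound_items.csv")      # hypothetical item-level file
    blocks = [
        "word_freq + stroke_count",                # orthographic block
        "phon_consistency + homophone_density",    # phonological block
        "transparency",                            # semantic block
    ]
    formula, prev_r2 = "rt ~ 1", 0.0
    for block in blocks:
        formula += " + " + block                   # enter the next block
        r2 = smf.ols(formula, data=items).fit().rsquared
        print(block, "delta R^2:", round(r2 - prev_r2, 4))
        prev_r2 = r2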
Results from a megastudy of conditional reading aloud for 2,145 monosyllabic words are reported. In stepwise regression analyses, the predictor variables accounted for over 66% of the reaction time (RT) variance. Linear mixed-effects modelling of log RT indicated that every variable related to RT in either reading aloud or lexical decision was also related to RT in conditional reading aloud. Notably, differences among tasks were observed. Specifically, lexical decision showed stronger reliance on semantic information than the other two tasks, but conditional reading aloud also showed strong reliance on semantic information. Interestingly, feedback consistency affected reading aloud and conditional reading aloud but not lexical decision. Pairwise correlations revealed that conditional reading aloud performance showed moderately strong relationships to lexical decision and reading aloud performance, whereas reading aloud and lexical decision performance were only weakly related to each other. Conditional reading aloud thus produces reliable data that can be used to examine word processing. Theoretical challenges moving forward include how best to conceptualise and model the processes involved in this task.
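A minimal statsmodels sketch of the kind of linear mixed-effects model of log RT mentioned above, with by-participant random intercepts; the data file and predictor names are placeholders, not the study's variables.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    trials = pd.read_csv("conditional_naming_trials.csv")  # hypothetical trial-level file
    trials["log_rt"] = np.log(trials["rt"])
    m = smf.mixedlm("log_rt ~ log_freq + length + feedback_consistency",
                    data=trials, groups=trials["participant"]).fit()
    print(m.summary())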
Using the megastudy approach, we report a new database (MEGALEX) of visual and auditory lexical decision times and accuracy rates for tens of thousands of words. We collected visual lexical decision data for 28,466 French words and the same number of pseudowords, and auditory lexical decision data for 17,876 French words and the same number of pseudowords (synthesized tokens were used for the auditory modality). This constitutes the first large-scale database for auditory lexical decision, and the first database to enable a direct comparison of word recognition in different modalities. Different regression analyses were conducted to illustrate potential ways to exploit this megastudy database. First, we compared the proportions of variance accounted for by five word frequency measures. Second, we conducted item-level regression analyses to examine the relative importance of the lexical variables influencing performance in the different modalities (visual and auditory). Finally, we compared the similarities and differences between the two modalities. All data are freely available on our website (https://sedufau.shinyapps.io/megalex/) and are searchable at www.lexique.org, inside the Open Lexique search engine.
Here we report on MELD-SCH (MEgastudy of Lexical Decision in Simplified CHinese), a dataset that contains lexical decision data for 1,020 one-character, 10,022 two-character, 949 three-character, and 587 four-character simplified Chinese words obtained from 504 native Chinese users. It also includes a number of word-level and character-level variables. Analyses showed that the reliability of the dataset is satisfactory, as indicated by split-half correlations and comparisons with other datasets. Item-based regression showed that both word-level and character-level variables contributed significantly to the reaction times and error rates of lexical decision. Moreover, we discovered a U-shaped relationship between word length and reaction times, which had not been reported for Chinese before. MELD-SCH can facilitate research on Chinese word recognition by providing high-quality normative data and information on different linguistic variables. It also encourages researchers to extend their empirical findings, which are mostly based on one-character and two-character words, to words of different lengths.
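A U-shaped length effect of the kind reported here is typically tested by adding a quadratic length term to the item-level regression; a sketch with placeholder file and variable names:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    items = pd.read_csv("meld_sch_items.csv")      # hypothetical item-level file
    fit = smf.ols("rt ~ length + I(length ** 2) + np.log10(freq + 1)",
                  data=items).fit()
    # A positive coefficient on the squared term indicates a U shape.
    print(fit.params[["length", "I(length ** 2)"]])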
The word frequency effect refers to the observation that high-frequency words are processed more efficiently than low-frequency words. Although the effect was first described over 80 years ago, in recent years it has been investigated in more detail. It has become clear that considerable quality differences exist between frequency estimates and that we need a new standardized frequency measure which does not mislead users. Research also points to consistent individual differences in the word frequency effect, meaning that the effect will be present at different word frequency ranges for people with different degrees of language exposure. Finally, a few ongoing developments are mentioned, which point to the importance of semantic diversity rather than mere differences in the number of times words have been encountered, and to the importance of taking into account word prevalence in addition to word frequency.
With the Developmental Lexicon Project (DeveL), we present a large-scale study that was conducted to collect data on visual word recognition in German across the lifespan. A total of 800 children from Grades 1 to 6, as well as two groups of younger and older adults, participated in the study and completed a lexical decision and a naming task. We provide a database for 1,152 German words, comprising behavioral data from seven different stages of reading development, along with sublexical and lexical characteristics for all stimuli. The present article describes our motivation for this project, explains the methods we used to collect the data, and reports analyses on the reliability of our results. In addition, we explored developmental changes in three marker effects in psycholinguistic research: word length, word frequency, and orthographic similarity. The database is available online.
Similarity plays an important role in organizing the semantic system. However, given that similarity cannot be defined on purely logical grounds, it is important to understand how people perceive similarities between different entities. Despite this, the vast majority of studies focus on measuring similarity between very closely related items; little is known about concepts that are only very weakly related. In this article, we present 4 experiments showing that there are reliable and systematic patterns in how people evaluate the similarities between very dissimilar entities. We present a semantic network account of these similarities, showing that a spreading-activation mechanism defined over a word-association network naturally makes correct predictions about weak similarities, whereas simpler models based on direct neighbors between word pairs in the same network cannot.
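A toy sketch of the spreading-activation idea: activation diffuses from a cue word over an association network for a fixed number of steps, and weak similarity is read off the overlap of the resulting activation profiles. The network, decay parameter, and step count below are illustrative assumptions, not the study's model.

    from collections import defaultdict

    edges = {                       # toy word-association network
        "dog": ["cat", "bone", "pet"],
        "cat": ["dog", "milk", "pet"],
        "milk": ["cat", "cow"],
    }

    def activate(start, steps=2, decay=0.5):
        # Spread activation from the cue node for a fixed number of steps.
        act = defaultdict(float)
        act[start] = 1.0
        for _ in range(steps):
            nxt = defaultdict(float)
            for node, a in act.items():
                for nb in edges.get(node, []):
                    nxt[nb] += a * decay / len(edges[node])  # pass to neighbours
                nxt[node] += a * (1 - decay)                 # retain the rest
            act = nxt
        return act

    def similarity(w1, w2):
        # Cosine overlap of the two activation profiles.
        a1, a2 = activate(w1), activate(w2)
        keys = set(a1) | set(a2)
        dot = sum(a1[k] * a2[k] for k in keys)
        n1 = sum(v * v for v in a1.values()) ** 0.5
        n2 = sum(v * v for v in a2.values()) ** 0.5
        return dot / (n1 * n2)

    print(similarity("dog", "milk"))   # weak but nonzero similarity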
Based on an analysis of the literature and a large-scale crowdsourcing experiment, we estimate that an average 20-year-old native speaker of American English knows 42,000 lemmas and 4,200 non-transparent multiword expressions, derived from 11,100 word families. The numbers range from 27,000 lemmas for the lowest 5% to 52,000 for the highest 5%. Between the ages of 20 and 60, the average person learns 6,000 extra lemmas, or about one new lemma every 2 days. Knowledge of a word can be as shallow as knowing that it exists. In addition, people learn tens of thousands of inflected forms and proper nouns (names), which account for the substantially higher numbers of 'words known' mentioned in other publications.
Readability research has a long and rich tradition, but there has been too little focus on general readability prediction that does not target a specific audience or text genre. Moreover, although NLP-inspired research has focused on adding more complex readability features, there is still no consensus on which features contribute most to the prediction. In this article, we investigate in close detail the feasibility of constructing a readability prediction system for English and Dutch generic text using supervised machine learning. Based on readability assessments by both experts and crowdsourcing, we implement different types of text characteristics, ranging from easy-to-compute superficial text characteristics to features requiring deep linguistic processing, resulting in ten different feature groups. Both regression and classification set-ups are investigated, reflecting the two possible readability prediction tasks: scoring individual texts or comparing two texts. We show that going beyond correlation calculations, using a wrapper-based genetic algorithm for feature-combination optimization, is promising and provides considerable insight into which feature combinations contribute to overall readability prediction. Because we also have gold-standard information available for the features requiring deep processing, we are able to investigate the true upper bound of our Dutch system. Interestingly, the performance of our fully automatic readability prediction pipeline is on par with the pipeline using gold-standard deep syntactic and semantic information.
Using a megastudy approach, we developed a database of lexical variables and lexical decision reaction times (RTs) and accuracy rates for more than 25,000 traditional Chinese two-character compound words. Each word was responded to by about 33 native Cantonese speakers in Hong Kong. This resource provides a valuable adjunct to influential mega-databases, such as the Chinese single-character, English, French, and Dutch Lexicon Projects. Three analyses were conducted to illustrate the potential uses of the database. First, we compared the proportion of variance in lexical decision performance accounted for by six word frequency measures and established that the best predictor was Cai and Brysbaert’s (2010) contextual diversity subtitle frequency. Second, we ran virtual replications of three previously published lexical decision experiments and found convergence between the original experiments and the present megastudy. Finally, we conducted item-level regression analyses to examine the effects of theoretically important lexical variables in our normative data. This is the first publicly available large-scale repository of behavioral responses pertaining to Chinese two-character compound word processing (see the supplementary file), which should be of substantial interest to psychologists, linguists, and other researchers.