Mankind Quarterly, 65(2), 217-233 2024 Winter Edition
ISSN: 0025-2344
A New, Open Source English Vocabulary Test
Emil O. W. Kirkegaard, Meng Hu, Jurij Fedorov
Arthur Jensen Institute, Denmark; Corresponding author; Email: the.dfx@gmail.com
Independent researcher, France
Independent researcher, Denmark
Abstract
Vocabulary is one of the best measures of general intelligence (g). However, readily available English vocabulary tests
are often based on outdated words or are proprietary. To remedy this problem and to provide a large item pool for
computerized adaptive testing (CAT), we sought to construct a new English vocabulary test. We constructed a total
of 224 items using different formats (choose 1 of 5, choose 2 of 5, and choose 3 of 5). After initial experimentation,
we validated and normed the test using an online US sample from Prolific (N = 441). The resulting full test had
near-perfect overall reliability, 0.97. The reliability was high for a wide range of ability, above 0.90 for z scores from
−3.89 to 2.32. Differential item functioning testing found negligible bias for sex. The total test score showed a small
advantage in favour of women (d = 0.16) in the White sample but virtually no sex difference in the entire sample (d = 0.07). Abbreviated versions of the test were constructed based on multiple algorithms.
Keywords: Vocabulary, IRT, DIF, Predictive validity, Reliability, Test optimization, Prolific, Online testing
1 Introduction
There are many standardized tests available to measure general intelligence (g) to some degree (Jensen,
1998, ch. 4; Warne, 2020, pp. 73–84). These are commonly called intelligence tests. The various tests
vary in their g-loading, that is, the degree to which they measure “general” intelligence, or g. Vocabulary
tests are some of the most g-loaded, around .80 without adjustment for reliability (Kirkegaard, 2022b).
However, most available vocabulary tests for English are either outdated or expensive. For instance, a copy
of the current adult Wechsler test has a listed price of 1600 USD.1 Older tests such as the Wordsum are
based partly on outdated words (Dorius et al., 2016). In order to facilitate the measurement of English
vocabulary, and general intelligence indirectly, it is necessary to construct new English vocabulary tests,
something which has already been done for other types of tests (Dworak et al., 2020). The purpose of this
study was to construct and validate such a test. We desired a large pool of items such that they might be
used for computerized adaptive testing for faster testing sessions.

1 https://www.pearsonassessments.com/store/usassessments/en/Store/Professional-Assessments/Cognition-%26-Neuro/Wechsler-Adult-Intelligence-Scale-%7C-Fourth-Edition/p/100000392.html, archived at https://archive.ph/wip/cPj46.
Tests of crystallized ability, which includes vocabulary, have been subject to criticism. The most
common argument is that verbal or crystallized ability reflects nothing more than acculturated learning
(Roberts et al., 2000). However, this view is misguided. Compared to tests of fluid intelligence, crystallized
tests such as vocabulary correlate more strongly with the Raven test (Jensen, 1998, pp. 90, 120) and also
more strongly with reaction time, prototypical measures of fluid intelligence (Jensen, 1998, pp. 234–238).
Therefore, the opposition between fluid and crystallized tests, the former deemed as pure and the latter
as culturally loaded or biased, is based neither on empirical findings nor on sound theoretical grounds
(Jensen, 1966, p. 104). The non-verbal “culture-fair” tests elicit verbal mediation processes such as labelling,
associative network, categorization, and mnemonic elaboration (Jensen, 1967). The most cognitively
demanding items on the Raven matrices, a test considered (albeit wrongly) by some researchers as a pure
measure of g, require verbal processes rather than visuospatial processes (DeShon et al., 1995; Hinton, 2015; Lynn et al., 2004). And while a vocabulary test can be considered culturally loaded, it mainly reflects
fluid ability because most words in a person’s vocabulary are learned through inferences of their meaning,
by the eduction of relations and correlates (Jensen, 1998, p. 89). The higher the level of a person’s g,
the fewer encounters with a word are needed to correctly infer its meaning. A knowledge gap is thus a necessary outcome of a g gap (Jensen, 1973, pp. 89–90; 1980, pp. 110, 235).
2 Test Construction and Data
There were two overall steps in the construction of the test. First, we created 159 vocabulary questions, of
which 8 items were used merely as attention checks, using the format “[word/meaning]” followed by “Pick the synonym from the 5 words listed below” (choose 1 of 5). Most of the words used in our items are either adjectives or
nouns. The great majority of the words are used only once in creating the answer options. Figures A1–A3
in the Appendix show item examples. The items were created using several online dictionaries, such as Merriam-Webster. The synonym items were created by looking up synonyms on the dictionary site,
reading all their definitions, and using other sites to compare the list to make sure the words were properly
defined. It is important to read sentences featuring the word to make sure it is used in the same way as
the synonym. All definitions of the wrong answer words were read too, to make sure none of the many
definitions overlapped in the correct vs. wrong answer options. With multiple definitions for each word and
multiple answer choices, the chance of some obscure, unintended synonym overlap is high, so it is crucial to keep all definitions of each word in mind when creating an item. If test takers picked the same wrong answers,
then definitions were looked into again on multiple sites to make sure no error was made when making the
item.
The initial easy synonym items were made using the word list in Brysbaert et al. (2019). That study
mixed real words with fake words and asked 220K test takers to select the real words. This created a
list with 61,858 English words sorted by difficulty level and other criteria. Words were picked from each
difficulty level and the wrong answer options were picked from the same difficulty level. The synonym was
picked based on less stringent criteria. This works well for easy items, but such word groups are often too dissimilar to pose a challenge for more intelligent test takers. The more difficult items were later created by picking, more loosely, complex words whose style resembled that of their synonyms. Words were also searched
online to make sure they were used in real-life settings today and to make sure they were used the same way
as defined in the dictionary. But test takers were also told that the words were not always full synonyms in
case this could become an issue. Word definitions change fast, and this test could start to become outdated
in 30 years, thereby hampering intergenerational comparisons of verbal IQ scores (Roivainen, 2014). Ambiguity was therefore avoided as much as possible, ideally so that the test will remain usable for 50+ years.
After some initial rounds of testing using small samples, we collected data for 499 American subjects
using Prolific, an online platform which generally provides high-quality survey data (Douglas et al., 2023).
These were recruited using the “representative sample” setting such that they reflect the US adult population
in terms of age, sex, and race. The first survey was administered on 12th June 2023 with an average hourly
pay rate of $8.15 and the follow-up survey on 26th June 2023 with an average hourly pay rate of $8.27
(+20% upon completion). The first and second survey had a median time of completion of 36 and 19
minutes, respectively.
The follow-up survey was done because a preliminary analysis of the first survey data showed that
the vocabulary test was too easy (mean item pass rate = 81%). For this reason, we constructed another
set of 73 harder items using two additional formats: choose 2 of 5 and choose 3 of 5. In these formats, the subject is asked to select the 2 or 3 words from a list of 5 that are synonymous or closest in meaning. This task is much harder than the choose 1 of 5 format because the subject no longer has a 1/5 = 20% chance of guessing the correct answer at random. Instead, because there are C(5,2) = C(5,3) = 10 possible combinations, the chance of randomly picking the right one drops to 1/10 = 10%.
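These guessing probabilities can be verified with a trivial calculation (illustrative only):

```r
# chance of answering correctly by blind guessing, per item format
1 / 5             # choose 1 of 5: 0.20
1 / choose(5, 2)  # choose 2 of 5: 10 possible pairs, so 0.10
1 / choose(5, 3)  # choose 3 of 5: also 10 possible triples, so 0.10
```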
We then invited the same subjects who had participated in the first wave to take this second set of
items as well. In the hope of retaining a larger number of participants, we announced that participants who completed the follow-up survey would receive a 20% increase in their hourly pay rate. Of
the invited 499 subjects, 434 participated again, a retention rate of 86%. However, we were left with 441
and 383 subjects, respectively, after removal of bad quality data (i.e., participants who failed the attention
checks in the vocabulary test).2 The second survey was carried out 14 days after the first, which may have
contributed to the follow-up loss. This is unfortunate, but it reflects the time it took to design the new set
of items and carry out some initial small-scale tests. The complete set of vocabulary items then consisted
of 224 items for which we had complete data from 383 subjects.
During the base survey, participants were told that the survey takes about 25–30 minutes and is
composed of two main parts: a vocabulary test with 159 items, and a personality evaluation with 107 items.
They were also asked to answer some basic demographic questions and some ideology questions. Regarding
the vocabulary test, the participants were told that they need to find the synonym word in each of the 159
items and that the items are randomly sorted rather than sorted by difficulty. The participants were then
told that the synonym of the target word need not be a direct synonym but can sometimes cover the same
idea. We illustrate with a hypothetical example, using “exploit” as the target word and “car,” “house,”
“milk,” “little,” “sun” as answer options. In this case, milk is the right choice as it can mean “exploit” too
(e.g., he milked the company for millions). Included were 8 items that served as attention checks and were
not intended to measure verbal ability. The final version of this test was composed of 151 items.
In the follow-up survey, participants were also told that the test would be harder. It is unclear whether this announcement contributed to non-random attrition by discouraging some participants. Respondents who did not participate or were not retained for the follow-up study had lower scores on the easy test (Mean = 112, SD = 24.5, N = 58) compared to those who were retained (Mean = 123, SD = 22.4, N = 383).
However, non-random attrition is very common in surveys.
3 Test Properties
We scored each item as a binary (correct or incorrect). Because 14% of subjects did not fill out the second
set of harder items, they have missing data for these. As a result, a simple sum score cannot be used for
the entire dataset, but it can be used for the first half. Item response theory (IRT) scores, however, can
be used with the complete dataset. The one-factor 2PL model was fit to the 224 items using the mirt package (Chalmers et al., 2020). The results were mostly satisfactory. Figures 1 and 2 show the distribution of item pass rates and factor loadings based on the complete test.
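For readers who want to reproduce this step, a minimal sketch of the model fitting in R is given below. The object name items is assumed to be the N × 224 matrix of 0/1 responses (with NAs for unadministered hard items); the authors' actual scripts are in the OSF supplement and may differ in details.

```r
library(mirt)

# fit the one-factor 2PL model to the binary item matrix (missing responses allowed)
fit_2pl <- mirt(items, model = 1, itemtype = "2PL")

# item pass rates and standardized factor loadings (cf. Figures 1 and 2)
pass_rates <- colMeans(items, na.rm = TRUE)
loadings   <- summary(fit_2pl)$rotF[, 1]

# latent trait (IRT) scores, usable despite missing data on the harder items
theta <- fscores(fit_2pl, method = "EAP")[, 1]
```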
The first set of items appeared somewhat too easy (average pass rate of 81%) for this study population (American adults answering surveys online for pay). This was the problem we noticed after collecting data on the first 151 items. The second set of items was considerably harder, with an average pass rate of 44%. The complete test then had an average pass rate of 68%, as shown in Figure 1. Despite this, the easy and hard parts of the test displayed a high correlation (r = .81).
Most items had very good loadings, as shown in Figure 2, but there were exceptions: 2 items showed
negative loadings, and 4 other items had loadings below .25. These items were discarded. As the items had
already gone through multiple rounds of small-scale testing before the main survey data was collected, it
was not expected that many items would show poor loadings.
The relationship between pass rate and factor loading is displayed in Figure 3. This illustrates the
problem with constructing high-range vocabulary tests. The harder an item is, the lower its factor loading.
The putative explanation for this finding is that the harder items are composed of rarer words, to which
subjects have perhaps never been exposed, thus having had no chance to learn. While exposure to rare
words is itself caused by intelligence, e.g., reading habits or having smarter friends, some of this exposure
is also random. When exposure is not evenly distributed in the population, the difference in knowledge
of a given rare word is not necessarily due to ability differences, but due to coincidence. This source of
performance variance is random, and thus decreases the factor loading of such items because more variance
is attributed to causes unrelated to ability (Jensen, 1980, p. 157, 2001).
2 We also analysed the entire dataset, including participants who failed the attention checks, and all results were extremely similar to those presented here.
Figure 1: Distribution of item pass rates.
Figure 2: Distribution of item factor loadings.
The observed relationship between pass rate and factor loadings, aside from a few outliers, appears approximately linear.3 This linearity is noteworthy because if careless responding were systematically related to item difficulty, we would expect deviations from a linear trend. Instead, this linear relationship suggests that carelessness introduces random noise rather than a bias tied to item difficulty. This interpretation is consistent with the finding that “rapid-guessing” responding (a marker of low effort) is idiosyncratic, being related neither to item difficulty nor to decreasing effort over time (Wise & Kingsbury, 2016).

Figure 3: Scatterplot of item pass rates and factor loadings. Item names are displayed.

3 To assess whether the observed linear relationship between item factor loadings and item difficulty could be influenced by guessing parameters, we conducted a supplementary analysis. Using a 3PL IRT model, we derived the loading, difficulty, and guessing parameters and then performed a regression analysis with difficulty, guessing, and their interaction as predictors of item loadings. The interaction effect between guessing and difficulty was non-significant (p = .55). A similar result was obtained using parameters derived from a 2PL IRT model (p = .42), where the guessing parameter was fixed at 20% for the “pick 1 out of 5” item format and 10% for the “pick 2 out of 5” and “pick 3 out of 5” formats. This suggests that the observed relationship between difficulty and factor loadings is not an artifact of a difficulty-by-guessing interaction.
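The supplementary regression described in footnote 3 could be run along the following lines. This is only a sketch of the general approach; items is again the assumed response matrix, and the exact parameterization (e.g., whether difficulty is taken as the intercept d or as b = −d/a) may differ from the authors' analysis.

```r
library(mirt)

# fit a 3PL model and pull out the item parameters
fit_3pl <- mirt(items, model = 1, itemtype = "3PL")
pars <- as.data.frame(coef(fit_3pl, simplify = TRUE)$items)  # columns a1, d, g, u
pars$b <- -pars$d / pars$a1                                   # classical difficulty
pars$loading <- summary(fit_3pl)$rotF[, 1]                    # standardized loadings

# regress loadings on difficulty, guessing, and their interaction;
# a non-significant interaction suggests no difficulty-by-guessing artifact
summary(lm(loading ~ b * g, data = pars))
```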
Figure 4 shows the reliability of the overall score as a function of ability level. This result is derived from the IRT model.4 The test demonstrates consistent reliability across most of the ability spectrum. For example, the average reliability is .973 for the ability range from −2 to 2 z (covering 98% of the population) and .960 from −3 to 3 z. Put another way, the test maintains a reliability of at least .90 from −3.89 to 2.32 z, or at least .80 from −4.55 to 3.41 z. The overall empirical reliability was .973.
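Footnote 4 notes that this curve comes directly from the fitted mirt model. As a hedged sketch of the underlying computation, conditional reliability can be derived from the test information function as rxx(θ) = I(θ) / (I(θ) + 1), assuming the latent variance is fixed at 1:

```r
library(mirt)

# conditional reliability from the test information function
theta_grid <- matrix(seq(-6, 6, by = 0.1))
info <- testinfo(fit_2pl, theta_grid)
rxx  <- info / (info + 1)

# ability range over which reliability stays at or above .90
range(theta_grid[rxx >= 0.90])

# equivalent built-in plot, as used in footnote 4
# plot(fit_2pl, type = "rxx")
```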
Based on the above, we dropped the 5 items with factor loadings below .25, giving a new set of 219
good items. The IRT model was then refit. Results were nearly identical to those above. This is because
items with poor loadings don’t contribute much to the test. The average loading of items was .63.
In terms of creating norms, it is known that age has a large positive effect on performance on
vocabulary and other knowledge tests (Kavé, 2024), although this advantage can be due to measurement
bias (Fox et al., 2014). Figure 5 shows this relationship for the 219-item test.
The effect of age was almost perfectly linear.5 As such, adjusting the IRT scores for age is a simple matter of using a linear regression to obtain the residualized z scores. However, there is also heteroscedasticity in the residuals, i.e., the variance is greater at higher ages (p < .0001).6 This complicates the interpretation of the resulting residuals, as a z score of 1 does not correspond to the same centile at age 20 as at age 60. When the sample is split into 3 age groups (19–40, 40–62, and 62–83), the standard deviations for the groups are 0.83, 0.96, and 1.02, respectively. Figure 6 shows the boxplots.

Figure 4: Test reliability as a function of ability level.

Figure 5: The relationship between age and vocabulary score. Orange line shows the linear fit, blue line shows the nonlinear fit (LOESS).

4 Using the mirt package: plot(all_items_fit, type = "rxx").

5 This pattern is at odds with previous findings showing a decline in vocabulary test scores from the late 60s or early 70s onwards (Kavé, 2024). One possible explanation is that the older respondents are more highly selected than younger ones (Zelinski & Kennison, 2007), for instance because elderly individuals with lower ability may not use computers much and are therefore unlikely to know about Prolific. To evaluate this hypothesis, respondents aged 70 and above (N = 25) were compared with respondents below 70 (N = 358) using the combined sample (i.e., base plus follow-up survey). Not only did the elderly group perform better on the easy test (134 versus 122) and the hard test (39.6 versus 31.3), they were also faster on the easy test (1199 versus 1437 seconds) and the hard test (1270 versus 1406 seconds).

6 Heteroscedasticity was tested using the method of Kirkegaard and Gerritsen (2021), which is based on the absolute residuals and is implemented in the test_HS() function from the kirkegaard package.
Figure 6: Boxplots of vocabulary scores by age groups.
To remove both the mean and variance effects of age, we used a multistep approach:
1. Fit an OLS model to predict the mean IRT score as a function of age.
2. Fit an OLS model to predict the absolute residuals from model (1) as a function of age.
3. Subtract the predicted mean (1) from each score and divide by the predicted absolute residual (2) to obtain age-adjusted z scores.
4. Multiply the scores by 15 and add 100 to obtain IQ values.
The models were fit on the White-only subset (N = 327) to prevent the correlation between age and race from affecting the age adjustment (Whites were 7 years older on average). The details of the IRT-score-to-norm conversion are given in the supplement. Figure 7 shows the resulting score distribution by White vs. non-White status. By design, the score mean is 100 for the White subset (SD = 15). It was 95.2 for the non-White subset (SD = 16.1). There was no age heteroscedasticity in the resulting scores (p = .925).
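A minimal sketch of this norming procedure is given below. The data frame d and its columns (irt_score, age, race) are illustrative names, not the authors' actual code, and details such as the final rescaling may differ from the scripts in the supplement.

```r
# fit the age models on the White-only subset
white <- subset(d, race == "White")

# 1. age trend in the mean IRT score
m_mean <- lm(irt_score ~ age, data = white)

# 2. age trend in the spread, via the absolute residuals from model (1)
white$abs_resid <- abs(resid(m_mean))
m_spread <- lm(abs_resid ~ age, data = white)

# 3. age-adjusted z scores: subtract the predicted mean, divide by the predicted spread
d$z_adj <- (d$irt_score - predict(m_mean, newdata = d)) /
  predict(m_spread, newdata = d)

# rescale so the White norming group has mean 0 and SD 1
d$z_adj <- (d$z_adj - mean(d$z_adj[d$race == "White"])) /
  sd(d$z_adj[d$race == "White"])

# 4. convert to the IQ metric (mean 100, SD 15 in the White norming group)
d$iq <- 100 + 15 * d$z_adj
```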
4 Measurement Invariance by Sex
For practical use, it is important that tests are approximately measurement invariant for major demographic
groups. Here we tested for measurement invariance by sex, the largest subgroups we have (55% female,
45% male). We used the same approach used in prior studies (Dutton & Kirkegaard, 2021; Kirkegaard,
2022a). Each item was tested for invariance in a leave-one-out approach, which consists of freeing (i.e.,
estimating) one item at a time while leaving other items constrained across groups. Any items found to be
variant (non-invariant, i.e., differing in loading or difficulty or both) were allowed to vary by group (partial
invariance). Scores were then computed from both the original model and the partially invariant model to see how allowing these items to vary affected the total scores.
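In the mirt framework, this kind of leave-one-out DIF testing could be sketched as follows. The variable names and the exact invariance settings are assumptions on our part; the authors' own code is in the supplement.

```r
library(mirt)

# multiple-group 2PL with all item parameters constrained equal across sexes,
# while group means and variances are free
mg <- multipleGroup(items, model = 1, group = sex,
                    invariance = c("slopes", "intercepts",
                                   "free_means", "free_var"))

# leave-one-out ("drop") scheme: free one item's slope and intercept at a time
# and test whether fit improves, with Bonferroni-adjusted p values
dif_res <- DIF(mg, which.par = c("a1", "d"), scheme = "drop",
               p.adjust = "bonferroni")
dif_res
```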
Of the 219 items tested, we found that 4 were non-invariant (p < .05, with Bonferroni correction). However, their biasing effects were in opposite directions (2 pro-female, 2 pro-male), so the effect on the total score was negligible (Cohen's d = 0.07). Figure 8 shows the item probability functions for the biased items. The sex biases for complex multiple-select items are difficult to interpret intuitively as they involve multiple words. The words in the biased items are shown in Table 1.

Figure 7: Distribution of vocabulary IQ scores by White race status.
Figure 8: Items showing sex bias.
Certain patterns can be seen. The one item where the target word relates to clothing is biased in favour of women (a type of skirt), unsurprising given women's stronger interest in clothing. Two more target words relate to war or history (rebellious, colossal) and show the expected pro-male bias. Another word relates to alcohol, which men have more interest in, and shows a pro-male bias (drink). The last pro-female item doesn't seem to have any obvious explanation. Studies of item bias often find that there is no obvious explanation for biased items, and we can add to this here (Recueil, 2023).

Table 1: Items with detectable sex bias in intercepts or slopes (p < .05, Bonferroni). Correct choices in bold.
Target Word | Biased in Favour of | Options
A type of skirts | Female | autologous, culottes, mimetic, scrag, tomography
colossal | Male | leviathan, neuroses, proximal, slipstream, vicar
rebellious | Male | disintegrable, fatuously, hemodialysis, insurrectionary, opioid
3of5_21* | Female | profuse, riotous, bountiful, dinky, genteel
* No target word; participants must pick the 3 synonymous words out of 5.
The absence of measurement bias with respect to gender may seem perplexing at first, considering that vocabulary tests are culturally loaded. However, cultural load is a necessary but not a sufficient condition for cultural bias. Specifically, cultural bias results from differential exposure to the knowledge elicited by a test item among groups of equal ability. That the test contains a cultural component does not invalidate its predictive validity; quite the opposite is true (te Nijenhuis & van der Flier, 2003). The lack of severe measurement bias implies that IQ scores are fairly comparable across groups. In the present data, we found that males (Mean = 99.4, SD = 16.2, N = 199) had slightly higher scores than females (Mean = 98.2, SD = 14.8, N = 242).7

7 Prior to the data quality check, females had slightly higher scores than males (98.7 versus 97.6). This is because the participants who provided bad data were primarily men; those who failed our attention checks typically had low scores, likely due to lack of motivation.
5 Predictive Validity
We included a few questions in the first wave against which the vocabulary scores (based on the 219-item test) can be checked for the expected validity. The majority of these were items from the MMPI-2, which were intended for use in another study. Nevertheless, it is worth providing a few examples. When
asked “I was a slow learner in school” (yes/no), people who answered yes scored on average 6.8 points
lower in verbal IQ. Similarly, people who answered yes to “I like to read about science” scored 8.5 points
higher in verbal IQ. People who answered yes to “A person shouldn’t be punished for breaking a law that he
thinks is unreasonable” scored on average 6.2 points lower in verbal IQ.
Aside from the MMPI-2 items, we also asked people at the end of the first wave how many items they thought they would get correct on this test composed of easy items. The results, shown in Figure 9, indicate excellent self-knowledge (r = .68), in line with recent results using the same question phrasing with a 25-item science knowledge test (Kirkegaard & Gerritsen, 2021). We also asked participants about their ability level in a different way: “Compared to the other Prolific survey users who took this survey, how well do you think you did?” (in centiles). The correlation was still substantial, but lower (r = .50). The difference was far beyond chance (p < .001, paired correlations test). These correlations for the full test composed of 219 good items are very similar (r = .63 and r = .49, respectively).
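The “paired correlations test” compares two dependent correlations that share the test score as a common variable. One standard way to run such a test is sketched below; the correlation between the two self-estimates (r23) is a made-up placeholder, and the authors' exact procedure may have differed.

```r
library(psych)

# Steiger-type test of two dependent correlations sharing a variable:
# r12 = cor(score, raw self-estimate), r13 = cor(score, centile self-estimate),
# r23 = cor(raw self-estimate, centile self-estimate) -- placeholder value here
r.test(n = 441, r12 = 0.68, r13 = 0.50, r23 = 0.55)
```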
Figure 9: Self-estimated score on the easy part of the vocabulary test and actual number of correct answers.

6 Scale Abbreviation

Because the full scale of 219 items is too long to use in most situations, we constructed abbreviated scales. This was done using two main algorithms: step-forward optimization and max loading. In the first method, an initial set of 3 items is selected based on the highest factor loadings in the full model, as IRT requires at least 3 items to work. After this, the algorithm tries every possible 4-item model and selects the optimal one. The metric to optimize is the average of the reliability and the correlation with the full scale (both scaled so that 1 = the value from the full scale).
In the second method, items are also picked in step-forward fashion, but they are chosen based on
their factor loading in the full model. Three variants of this algorithm were used:
1. Selection based only on factor loading.
2. Selection based on the factor loading residualized by the difficulty.
3. Selection based on the factor loading, but stratified by difficulty group so that the selected items are representative of the full ability range.
The motive for the latter two variants is that selecting purely based on factor loading resulted in
predominantly easy items being chosen because of the relationship between the two parameters shown in
Figure 3. Furthermore, because many of the 219 items had very low or very high pass rates, their item
parameters could not be precisely estimated and they would be relevant for relatively few subjects. For this
reason, the abbreviated scale was constructed only from the items with pass rates between 5% and 95%,
resulting in exactly 100 items. Figure 10 shows the results.
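A rough sketch of the step-forward optimization loop is given below as a reference point for the selection method described above (not the authors' exact code). Here pool is the response matrix for the 100 candidate items, full_scores and full_rel are the IRT scores and empirical reliability of the full test, and loadings are the full-model factor loadings.

```r
library(mirt)

# combined index for a candidate item set: average of (scaled) reliability
# and correlation with the full-scale scores
score_candidate <- function(item_set) {
  fit <- mirt(pool[, item_set], model = 1, itemtype = "2PL", verbose = FALSE)
  fs  <- fscores(fit, method = "EAP", full.scores.SE = TRUE)
  rel <- empirical_rxx(fs)                    # empirical reliability of the short scale
  r_full <- cor(fs[, "F1"], full_scores)      # agreement with the full test
  (rel / full_rel + r_full) / 2
}

# start with the 3 best-loading items, then add one item at a time greedily
selected <- names(sort(loadings, decreasing = TRUE))[1:3]
while (length(selected) < 50) {
  remaining <- setdiff(colnames(pool), selected)
  gains <- sapply(remaining, function(it) score_candidate(c(selected, it)))
  selected <- c(selected, remaining[which.max(gains)])
}
```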
Based on these results, the step-forward optimization algorithm did the best, though the max loading
algorithm with balancing was not far behind. Runtime was about 1200 seconds for step-forward, and about
10 seconds for max loading, a ratio of about 120:1. Thus, to achieve a scale with reliability > .85, 14 items are needed, and to achieve > .90, 25 are needed. Researchers are thus free to pick whichever trade-off between speed of measurement and quality of measurement they want.
However, in order to facilitate the use of the abbreviated scales, we prepared norms for tests of 10
to 50 items in increments of 5. These can be found in the supplementary materials. Figure 11 shows the
reliability function of the abbreviated tests.
Figure 10: Optimization results for abbreviated scales up to 50 items. Combined index is the average of the other two values, scaled so that 1 = value of the full test.

Figure 11: Scale reliabilities as a function of latent ability.

The figure shows that very short tests can have good reliability (which we may define as > .90) for average test takers, but reliability rapidly drops off at the tails (e.g., a 10-item scale has a reliability of about .42 at 2 z). If decent (> .85) reliability is needed at the tails, a longer test is required. For instance, the 50-item scale provides at least .85 reliability from −2.14 to 1.84 z, thus covering about 95% of subjects.
Table 2 shows the reliability ranges by criterion for the abbreviated scales.
From the perspective of a researcher planning a study, one can use the table to choose a desired test. If one wants a very brief measurement and isn't particularly interested in the tails of the distribution, one could choose the 10-item test. It has an overall reliability of .82, which is acceptable, and measures with at least .80 reliability from a z score of −1.24 to 0.81, thus covering 68% of a normal distribution. If better measurement is sought, one could choose a 25-item test, which has an overall reliability of .90 and measures with at least .80 reliability from −1.78 to 1.54 z, covering 90% of a normal distribution. If precision at the tails is required, particularly the right tail, a longer test is needed (e.g., 50 items). Appendix Figure A4 further illustrates the data in Table 2.
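The Coverage column in Table 2 is simply the share of a standard normal distribution falling inside the reliability range; for example, for the 25-item scale at the .80 criterion:

```r
# coverage = share of a standard normal distribution inside the reliability range
pnorm(1.54) - pnorm(-1.78)  # ~0.90, matching the Coverage entry for 25 items at .80
```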
Table 2: Reliability ranges for the abbreviated and full scales.
Items in Scale  Minimum Reliability  Lower z  Upper z  Coverage  Overall Reliability
10  0.80  −1.24  0.81  0.68  0.82
10  0.85  −1.06  0.51  0.55  0.82
10  0.90  −0.69  0.09  0.29  0.82
15  0.80  −1.42  1.18  0.80  0.86
15  0.85  −1.24  0.87  0.70  0.86
15  0.90  −0.94  0.39  0.48  0.86
20  0.80  −1.60  1.42  0.87  0.89
20  0.85  −1.42  1.12  0.79  0.89
20  0.90  −1.12  0.69  0.62  0.89
25  0.80  −1.78  1.54  0.90  0.90
25  0.85  −1.54  1.24  0.83  0.90
25  0.90  −1.24  0.81  0.68  0.90
30  0.80  −1.90  1.72  0.93  0.91
30  0.85  −1.66  1.42  0.87  0.91
30  0.90  −1.36  1.00  0.75  0.91
35  0.80  −2.02  1.84  0.95  0.92
35  0.85  −1.78  1.54  0.90  0.92
35  0.90  −1.48  1.12  0.80  0.92
40  0.80  −2.14  1.96  0.96  0.93
40  0.85  −1.90  1.66  0.92  0.93
40  0.90  −1.54  1.24  0.83  0.93
45  0.80  −2.26  2.14  0.97  0.93
45  0.85  −2.02  1.78  0.94  0.93
45  0.90  −1.66  1.30  0.85  0.93
50  0.80  −2.38  2.20  0.98  0.94
50  0.85  −2.14  1.84  0.95  0.94
50  0.90  −1.78  1.36  0.88  0.94
100  0.80  −3.11  2.98  1.00  0.95
100  0.85  −2.74  2.56  0.99  0.95
100  0.90  −2.32  2.02  0.97  0.95
219  0.80  −4.55  3.41  1.00  0.97
219  0.85  −4.25  2.92  1.00  0.97
219  0.90  −3.89  2.32  0.99  0.97
7 Discussion
We sought to create a new English vocabulary test with high reliability and a broad range. After an initial
pool of 151 items, we added another 73 items, for a total of 224 items. Of these, we kept 219 items based
on a factor loading > .25 criterion. The resulting test was long but had exceptionally high reliability, .97,
which was also true for a large range of ability. Measurement invariance testing by sex revealed little bias.
We created abbreviated tests along with two algorithms, allowing researchers to choose shorter tests with
known norms for their particular purposes.
Because the test and the associated data are freely available for public use (CC-BY-NC license8), it will be an easy replacement for traditional, expensive tests. It is possible to use the items with computerized adaptive testing (Huang et al., 2022; Reckase, 2024), which would be even more efficient than abbreviated fixed-length tests but comes at the cost of requiring specialized software that may be difficult to use with most survey software.

8 https://creativecommons.org/licenses/by-nc/4.0/deed.en
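As a rough illustration of the CAT option, the companion mirtCAT package can run adaptive sessions directly from a fitted mirt model. The sketch below simulates one adaptive administration; the interface details (argument names, stopping rule) are our assumptions and should be checked against the package documentation.

```r
library(mirtCAT)

# simulate responses for one examinee at theta = 0.5 using the fitted 2PL model
pattern <- generate_pattern(fit_2pl, Theta = 0.5)

# adaptive administration: maximum-information item selection,
# stopping once the standard error of theta falls below 0.30
res <- mirtCAT(mo = fit_2pl, local_pattern = pattern,
               method = "MAP", criteria = "MI",
               design = list(min_SEM = 0.30))
summary(res)
```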
Our use and development of two new algorithms for abbreviating tests is of note. A number of such algorithms already exist, such as the genetic algorithm and ant colony optimization (Schroeders et al.,
2016; Yarkoni, 2010). We additionally used the genetic algorithm on the present dataset (not reported in
the results), but we found that it was much slower than the step-forward algorithm and produced equivalent
results. This indicates that getting stuck in local optima due to the constraint of only selecting one item
at a time and being unable to deselect an item once selected was not a concern. This finding does not
necessarily generalize to more advanced tests, especially those measuring multiple dimensions simultaneously.
The fact that a quite decent test in terms of overall reliability and range of reliability could be constructed
based on an item pool of only about 100 items is interesting, and may mean that constructing further tests
and abbreviations thereof will not be so difficult. This could prove valuable for the further development of
free-to-use tests to compete with commercially available tests.
Limitations
Limitations of the current study include the following. First, we used online survey data from Prolific.
Though we used the nationally representative sampling option, the data were not entirely representative
of the USA. Furthermore, there is a self-selection issue with regards to who participates in online for-pay
survey work (Callahan et al., 2007; van Alten et al., 2024). We don’t know how this potentially selects for
intelligence, though it is likely that at least some individuals are missing on the left tail, as they are unable
to participate in this kind of online work. Furthermore, it is possible that individuals are missing on the right
tail because they have more gainful employment opportunities. As such, the distribution, and the norms
created upon it, are probably somewhat compressed towards the center. Relatedly, Prolific participants
may have an advantage in terms of word knowledge and reading speed because they take Prolific research
surveys on a regular basis. The number of approved submissions of our initial 441 participants was quite high (Mean = 1494, Median = 1177, SD = 1278), confirming that they have extensive experience taking surveys. However, the number of approvals correlated only weakly with the 219-item verbal IQ score (r = .16).
Second, the current study only investigated measurement invariance with regards to sex. It is possible
that the test shows bias for other groups (e.g., age, race, or USA vs. Canada vs. UK vs. Australia). We
would, of course, expect that the test shows a large bias as a measure of general intelligence if it is used
with non-native speakers of English, though it would still function as a valid measure of English proficiency.
Third, though the authors were fluent in English, none were native speakers. This possibly biased the
construction of the items in some way. However, all authors were male, and this did not appear to bias the
items with regards to sex, as measurement invariance testing found almost no biased items.
Fourth, only subjects from the USA were sampled. As such, the norms may not be entirely accurate
for other English-speaking populations. This is the case even though they are based on the White subset.
This population may still differ in average intelligence from other White English-speaking populations around
the world.
Fifth, one concern about online cognitive tests is poor data quality due to careless or low-effort
responding (DeSimone & Harms, 2018; Huang et al., 2015; Maniaci & Rogge, 2014), which particularly
threatens low-stakes tests. It was recently reported, however, that test motivation has only a small impact
on test scores (Bates & Gignac, 2022). In our situation, on the other hand, our vocabulary tests were
especially lengthy, and the follow-up vocabulary test was quite difficult and even stressful to some (two
respondents reached out to us, complaining that the test was exceedingly hard). If the test felt boring, along
with being frustrating due to the difficulty level, then one would expect a positive relationship between the
time taken to complete the vocabulary test and the final score, as the respondents sped up their response
time and made more errors due to carelessness. But there was actually no positive correlation between time
and score for either the easy test (r = .14) or the hard test (r = .02).
Reasoning tests are interesting alternatives to verbal tests, especially for test takers who are not fluent
in English. It is, however, challenging to get good quality data from online survey participants when the test
requires abstract and/or novel thinking and the test is not monitored. We have indeed attempted multiple
sessions involving reasoning tests such as Raven-like matrices or figure weights, but many participants spent
merely a few seconds (sometimes less than 10 or even 5 seconds) to solve medium and high-difficulty items
and often ended up finishing the test well below our time limit with a low score. Indeed, the time spent
taking the test was correlated with the total score, which indicated that many participants lacked motivation.
Another piece of evidence is that the item factor loadings obtained from these reasoning tests were lower
than the loadings obtained from our vocabulary tests, which means random guessing was pulling the item
loadings down. On the other hand, the task involved in this vocabulary test is very straightforward and
requires very little time investment. Despite vocabulary tests being culturally loaded, these tests require a
person to de-contextualize a term and derive its meaning. For this reason, it is a very good measure of inductive reasoning, which is why it reflects g much more than it reflects verbal ability residualized from g (Jensen, 2001).
Supplementary materials
The study materials including data, R notebook, high resolution figures, complete set of items, item
parameters, and supplementary analyses are available at OSF: https://osf.io/6gcy4/.
References
Bates, T. C., & Gignac, G. E. (2022). Effort impacts IQ test scores in a minor way: A multi-study investigation with healthy adult volunteers. Intelligence, 92, 101652.

Brysbaert, M., Mandera, P., McCormick, S. F., & Keuleers, E. (2019). Word prevalence norms for 62,000 English lemmas. Behavior Research Methods, 51, 467–479.

Callahan, C. A., Hojat, M., & Gonnella, J. S. (2007). Volunteer bias in medical education research: An empirical study of over three decades of longitudinal data. Medical Education, 41(8), 746–753.

Chalmers, P., Pritikin, J., Robitzsch, A., Zoltak, M., Kim, K., Falk, C. F., . . . Oguzhan, O. (2020). mirt: Multidimensional item response theory (1.32.1) [Computer software]. Retrieved from https://CRAN.R-project.org/package=mirt

DeShon, R., Chan, D., & Weissbein, D. (1995). Verbal overshadowing effects on Raven's Advanced Progressive Matrices: Evidence for multidimensional performance determinants. Intelligence, 21(2), 135–155.

DeSimone, J. A., & Harms, P. D. (2018). Dirty data: The effects of screening respondents who provide low-quality data in survey research. Journal of Business and Psychology, 33, 559–577.

Dorius, S. F., Alwin, D. F., & Pacheco, J. (2016). Twentieth century intercohort trends in verbal ability in the United States. Sociological Science, 3, 383–412.

Douglas, B. D., Ewell, P. J., & Brauer, M. (2023). Data quality in online human-subjects research: Comparisons between MTurk, Prolific, CloudResearch, Qualtrics, and SONA. PLoS ONE, 18(3), e0279720.

Dutton, E., & Kirkegaard, E. (2021). The negative religiousness-IQ nexus is a Jensen effect on individual-level data: A refutation of Dutton et al.'s 'The myth of the stupid believer'. Journal of Religion and Health, 61, 3253–3275.

Dworak, E. M., Revelle, W., Doebler, P., & Condon, D. M. (2020). Using the International Cognitive Ability Resource as an open source tool to explore individual differences in cognitive ability. Personality and Individual Differences, 109906.

Fox, M. C., Berry, J. M., & Freeman, S. P. (2014). Are vocabulary tests measurement invariant between age groups? An item response analysis of three popular tests. Psychology and Aging, 29(4), 925–938.
Hinton, D. (2015). Uncovering the root cause of ethnic differences in ability testing: Differential test functioning, test familiarity and trait optimism as explanations of ethnic group differences (Unpublished doctoral dissertation). Aston University.

Huang, H.-T. D., Hung, S.-T. A., Chao, H.-Y., Chen, J.-H., Lin, T.-P., & Shih, C.-L. (2022). Developing and validating a computerized adaptive testing system for measuring the English proficiency of Taiwanese EFL university students. Language Assessment Quarterly, 19(2), 162–188.

Huang, J. L., Liu, M., & Bowling, N. A. (2015). Insufficient effort responding: Examining an insidious confound in survey data. Journal of Applied Psychology, 100(3), 828–845.

Jensen, A. R. (1966). Verbal mediation and educational potential. Psychology in the Schools, 3(2), 99–109.

Jensen, A. R. (1967). The culturally disadvantaged: Psychological and educational aspects. Educational Research, 10(1), 4–20.

Jensen, A. R. (1973). Educability and group differences. New York: Harper & Row.

Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.

Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.

Jensen, A. R. (2001). Vocabulary and general intelligence. Behavioral and Brain Sciences, 24(6), 1109–1110.

Kavé, G. (2024). Vocabulary changes in adulthood: Main findings and methodological considerations. International Journal of Language & Communication Disorders, 59(1), 58–67.

Kirkegaard, E. O. W. (2022a). The intelligence gap between Black and White survey workers on the Prolific platform. Mankind Quarterly, 63(1), 79–88.

Kirkegaard, E. O. W. (2022b, April 15). Which test has the highest g loading? Just Emil Kirkegaard Things. Retrieved from https://www.emilkirkegaard.com/p/which-test-has-the-highest-g-loading

Kirkegaard, E. O. W., & Gerritsen, A. (2021). Looking for evidence of the Dunning-Kruger effect: An analysis of 2400 online test takers. OpenPsych, 1(1).

Lynn, R., Allik, J., & Irwing, P. (2004). Sex differences on three factors identified in Raven's Standard Progressive Matrices. Intelligence, 32(4), 411–424.

Maniaci, M. R., & Rogge, R. D. (2014). Caring about carelessness: Participant inattention and its effects on research. Journal of Research in Personality, 48, 61–83.

Reckase, M. D. (2024). The influence of computerized adaptive testing (CAT) on psychometric theory and practice (Vol. 11, No. 1). Retrieved from https://jcatpub.net/index.php/jcat/article/view/115

Recueil, R. (2023, December 21). Bias is often unpredictable. Aporia. Retrieved from https://www.aporiamagazine.com/p/bias-is-often-unpredictable

Roberts, R. D., Goff, G. N., Anjoul, F., Kyllonen, P. C., Pallier, G., & Stankov, L. (2000). The Armed Services Vocational Aptitude Battery (ASVAB): Little more than acculturated learning (Gc)!? Learning and Individual Differences, 12(1), 81–103.

Roivainen, E. (2014). Changes in word usage frequency may hamper intergenerational comparisons of vocabulary skills: An Ngram analysis of Wordsum, WAIS, and WISC test items. Journal of Psychoeducational Assessment, 32(1), 83–87.
Schroeders, U., Wilhelm, O., & Olaru, G. (2016). Meta-heuristics in short scale construction: Ant colony optimization and genetic algorithm. PLOS ONE, 11(11), e0167110.

te Nijenhuis, J., & van der Flier, H. (2003). Immigrant–majority group differences in cognitive performance: Jensen effects, cultural effects, or both? Intelligence, 31(5), 443–459.

van Alten, S., Domingue, B. W., Faul, J., Galama, T., & Marees, A. T. (2024). Reweighting UK Biobank corrects for pervasive selection bias due to volunteering. International Journal of Epidemiology, 53(3), dyae054.

Warne, R. T. (2020). In the know: Debunking 35 myths about human intelligence. Cambridge University Press.

Wise, S. L., & Kingsbury, G. G. (2016). Modeling student test-taking motivation in the context of an adaptive achievement test. Journal of Educational Measurement, 53(1), 86–105.

Yarkoni, T. (2010). The abbreviation of personality, or how to measure 200 personality scales with 200 items. Journal of Research in Personality, 44(2), 180–198.

Zelinski, E. M., & Kennison, R. F. (2007). Not your parents' test scores: Cohort reduces psychometric aging effects. Psychology and Aging, 22(3), 546–557.
Appendix
Figure A1: An example of a choose 1 of 5 vocabulary item.
Figure A2: An example of a choose 2 of 5 vocabulary item.
Figure A3: An example of a choose 3 of 5 vocabulary item.
Figure A4: Reliability functions and coverages by item in test count and desired minimum reliability.
To read the figure, choose a number of items, e.g., 30, and a minimum reliability level desired, e.g.,
.85. The bars then show the range of ability where this level of reliability is reached and the percentage of a
normal distribution this covers. In this case, the coverage is 87%, with the range running from about −1.7 to 1.4 z. Alternatively, one may start with a desired coverage, say 95%, and a minimum reliability of .80, and look for the smallest test that reaches this, which would be 35 items.