Submitted: 25th of January 2021 DOI: 10.26775/OP.2021.07.05
Published: 5th of July 2021 ISSN: 2597-324X
An examination of the openpsychometrics.org vocabulary test
Emil O. W. Kirkegaard
OpenPsych
Abstract
We examined data from the popular free online 45-item "Vocabulary IQ Test" from https://openpsychometrics.org/tests/VIQT/. We used data from native English speakers (n = 9,278). Item response theory (IRT) analysis showed that most items had substantial g-loadings (mean = .59, SD = .22), but that some were problematic (4 items loading below .25). Nevertheless, we find that the site's scoring rules (which include a penalty for incorrect answers) give results that correlate very strongly (r = .92) with IRT-derived scores. This is also true when using nominal IRT. The empirical reliability was estimated to be about .90. Median test completion time was 9 minutes (median absolute deviation = 3.5) and was mostly unrelated to the score obtained (r = -.02). The test scores correlated well with the self-reported criterion variables educational attainment (r = .44) and age (r = .40). To examine the test for measurement bias, we employed both Jensen's method and differential item functioning (DIF) testing. With Jensen's method, we see strong associations with education (r = .89) and age (r = .88), and less so for sex (r = .32). With differential item functioning, we only tested the sex difference for bias. We find that some items display moderate biases in favor of one sex (13 items had p_Bonferroni < .05 evidence of bias). However, the item pool contains roughly even numbers of male-favored and female-favored items, so the test-level bias is negligible (|d| < 0.05). Overall, the test seems mostly well constructed, and is recommended for use with native English speakers.
Keywords: cognitive ability, intelligence, online testing, vocabulary, openpsychometrics.org, sex difference, measurement invariance, differential item functioning, Jensen's method, method of correlated vectors, sex bias
1 Introduction
Online psychological testing is popular. Unfortunately, most online tests lack validation. This is also true for cognitive ability tests. The main exception is the ICAR (International Cognitive Ability Resource), which has seen extensive validation studies (Condon & Revelle, 2014; Dworak et al., 2021; Merz et al., 2020; Young et al., 2019).[1] Various national Mensa websites provide free figure reasoning tests (Raven-like) that provide IQ-normed results, but with unknown psychometric properties and norm data.[2] Here we examine a lesser known test simply called "Vocabulary IQ Test", which is available at https://openpsychometrics.org/tests/VIQT/. This is a 45-item multiple choice vocabulary test. The test's origin and construction details are not given on the site, but it is presumably a newly developed test, considering that the website brands itself as open source. The response format is the select-2-of-5 format, a somewhat unusual format (e.g., not covered in introductory books such as Kline (2015)). Figure 1 shows a screenshot of the test with the first item shown.
The purpose of the present study was to examine the psychometric properties of this test, as well as to carry out a limited exploration of the related data.
Ulster Institute for Social Research, United Kingdom, Email: emil@emilkirkegaard.dk
[1] The ICAR test is widely available for public use: https://discovermyprofile.com/tests/Intelligence/-/-, https://www.idrlabs.com/iq-16/test.php, https://www.sapa-project.org/.
[2] There is generally a Mensa group in each country, and many of them provide their own online screening or "for fun" tests. Examples: Denmark https://mensa.dk/iqtest/, Norway https://test.mensa.no/, Sweden https://www.mensa.se/bli-medlem/provtest-r1/, Romania https://mensaromania.ro/testari-mensa/test-online/.
Figure 1: First item and test instructions.
2 Data
Data for 12,173 persons are publicly available at the data page (https://openpsychometrics.org/_rawdata/). To reduce language bias, we only used data from persons who reported being native English speakers. Inspection of the histogram of correct responses showed a small (<1%) lump of persons with near-zero scores. These are presumably users who clicked through the test without genuinely attempting it. We removed subjects with scores below 10 (less than 1%). The final sample had n = 9,278 subjects. Of these, 4,603 (49.6%) were female and 4,286 (46.2%) were male. The remainder did not disclose their sex or reported "Other". Aside from the 45 test items, the dataset also contains age, nationality, 25 items from a Big Five personality test, and the amount of time spent. Time spent was given in seconds. It had extreme skew (some people leave the browser tab open for days before completing the test), so it was converted to minutes and winsorised to a maximum of 120 minutes. The personality data were not used in the present study.
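As a minimal sketch of these cleaning steps (the file name, the column names engnat and elapse, and the assumption that the 45 items are already scored 0/1 are illustrative, not taken from the site's documentation):

```r
library(dplyr)

# Read the public raw-data file (file and column names are assumed for illustration)
d <- read.csv("VIQT_data.csv")

d_clean <- d %>%
  filter(engnat == 1) %>%                        # keep native English speakers only
  mutate(score    = rowSums(across(Q1:Q45)),     # number correct (items assumed pre-scored 0/1)
         time_min = pmin(elapse / 60, 120)) %>%  # seconds -> minutes, winsorised at 120 minutes
  filter(score >= 10)                            # drop the <1% click-through subjects with near-zero scores
```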
All data and code output are available in the supplementary materials.
3 Results
The select-2-of-5 format of the data is mathematically equivalent to the select-1-of-10 format, because there are 10 ways to pick 2 of the 5 options when duplicates and order are disregarded (i.e., (5 × 4)/2). By having people select two options, however, the format is more space efficient than enumerating the pairwise options and having the subject read 10 response options. The site's approach to scoring the test is to convert the responses to dichotomous correct/incorrect format, and then sum the correct responses minus the incorrect responses. A more advanced approach involves fitting an item response theory (IRT) model to the dichotomized items and then scoring the persons using the resulting model. A further refinement is to employ categorical/nominal IRT (Storme et al., 2019; Suh & Bolt, 2010). In this approach, each response option is allowed to have its own relationship to the underlying trait. The benefit of this approach comes from the fact that the different distractors (incorrect response options) do not have the same expected trait levels, that is, some responses are more obviously incorrect than others, and this variation can be used for more precise scoring of persons given sufficient sample size (for a machine learning example, see Cutler et al. (2019)). Here we scored the test data using 4 methods:
1) sum of correct responses,
2) sum of correct minus incorrect responses,
3) dichotomous/binary IRT using the 2PL (2-parameter logistic) model,
4) categorical/nominal IRT using the 2PLNRM (2-parameter logistic nominal response model) (Suh & Bolt, 2010).[3]
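A minimal sketch of these four scoring methods using the mirt package, as used throughout the paper (the matrices scored, holding dichotomous 0/1 responses, and resp, holding the raw option choices, and the answer key vector are assumptions about the data layout):

```r
library(mirt)

# 1) simple sum of correct responses
sum_score <- rowSums(scored)

# 2) sum of correct minus incorrect responses (the site's rule; unanswered treated as incorrect here)
penalty_score <- rowSums(scored) - rowSums(1 - scored)

# 3) dichotomous IRT: unidimensional 2PL model, EAP scores
fit_2pl   <- mirt(scored, model = 1, itemtype = "2PL")
theta_2pl <- fscores(fit_2pl, method = "EAP")

# 4) nominal IRT: 2PLNRM on the full categorical responses, using the answer key
fit_nrm   <- mirt(resp, model = 1, itemtype = "2PLNRM", key = key)
theta_nrm <- fscores(fit_nrm, method = "EAP")
```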
The simple sum is the most commonly used method, and can be interpreted as a latent variable model with equal loadings (McNeish & Wolf, 2020). The advantage here is the simplicity of use, especially for manual scoring by hand, and the fact that one does not need to estimate factor loadings. Estimation of factor loadings in small samples produces unreliable results, and it may be better to simply assume equal loadings (Gorsuch, 2015; Ree et al., 1998). The simple sum with subtraction for incorrect responses attempts to deal with differences in guessing rates by subtracting the expected score gains from guessing. This method should produce better estimates if all guessing is done completely at random and individuals simply vary in how much they guess. This assumption is not likely to be accurate, so it is unclear how this correction will affect estimates. The binary 2-parameter logistic model (2PL) allows items to vary in difficulty and factor loading. Thus, items that are more informative for a subject are given more weight in the scoring, and there is no bias from the binary nature of the data. This model should produce more accurate estimates than the simple sum when items actually vary in factor loadings, which almost any collection of items will do to a large degree. The nominal model further extends this by allowing different incorrect responses to be differentially informative. In the binary models, each response is assumed to be informative in only two degrees, namely whether it is correct or incorrect. In the nominal model, some incorrect responses are deemed more incorrect than others, and this is used to estimate ability. This approach should be slightly more effective if a large sample is available for model training (Storme et al., 2019).
In every case, the data were modeled as unidimensional. The resulting score is best considered an approximation of the general intelligence factor (g), but with some influence from an orthogonal verbal ability factor. The IRT analyses were done using the mirt package for R (Chalmers et al., 2020; mirt = Multidimensional Item Response Theory). Table 1 shows the correlations between cognitive scores and criterion variables.
Table 1: Correlations between four different variants of test scores and criterion variables. IRT = item response theory.

                    Sum score  Sum score penalty  IRT binary  IRT cat  Education  Age    Time spent
Sum score            1.00       0.95               0.97        0.96     0.43      0.38   -0.02
Sum score penalty    0.95       1.00               0.92        0.87     0.42      0.35   -0.02
IRT binary           0.97       0.92               1.00        0.99     0.44      0.40   -0.02
IRT cat              0.96       0.87               0.99        1.00     0.44      0.40   -0.02
Education            0.43       0.42               0.44        0.44     1.00      0.35    0.00
Age                  0.38       0.35               0.40        0.40     0.35      1.00    0.02
Time spent          -0.02      -0.02              -0.02       -0.02     0.00      0.02    1.00
The various scoring methods produced scores that were very strongly correlated, r’s .87 to .99. The two more
advanced scoring methods produced slightly stronger correlations with the criterion variables. Notably, the
penalty method produced the weakest results, perhaps due to being confounded with guessing strategies that
are not much related to cognitive ability. Since the two IRT methods produced equivalent results, we chose the
simpler binary version for further analysis.
In terms of reliability, Cronbach's alpha was .90, and the empirical reliability of the IRT scores was also about .90 (.89 for binary IRT, .90 for categorical IRT; see the empirical_rxx() function in mirt for details).
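As a sketch, both reliability figures can be obtained as follows (fit_2pl and scored are the objects from the scoring sketch above; the psych package supplies alpha()):

```r
library(psych)
library(mirt)

# Cronbach's alpha on the dichotomous item matrix
alpha(scored)$total$raw_alpha

# Empirical reliability of the IRT scores; requires the factor scores together with their standard errors
theta_se <- fscores(fit_2pl, method = "EAP", full.scores.SE = TRUE)
empirical_rxx(theta_se)
```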
With regards to time spent, it is possible there could be nonlinear associations. Figure 2 shows the scatterplot. While there was some evidence of a nonlinear, non-monotonic trend, it was trivial in size. With regards to sex differences, males obtained higher scores, as shown in Figure 3.
[3] We also tried other item models available in mirt's mirt() function, namely nominal, graded, gpcm, and gpcmIRT. These all produced worse results than 2PLNRM. See the mirt documentation (https://rdrr.io/cran/mirt/man/mirt.html) for implementation details.
Figure 2: Scatterplot of time spent and obtained score. Nonlinear fit provided by LOESS.
Figure 3: Density-histogram of scores by sex.
Quantitatively speaking, the male advantage is 0.28 Cohen's d [95% CI: -0.32 to -0.23, p < .0001]. While men had higher scores, women had higher dispersion, with standard deviations of 0.97 and 1.01, respectively. However, this female advantage in dispersion may be a function of the test ceiling, as more men than women obtained perfect scores (3.1% vs. 1.6%, and 2.4% of all subjects). To examine whether some of this gap may be due to test bias, we carried out differential item functioning (DIF) testing using the functions provided by mirt.[4] This approach involves an initial leave-one-out run to look for items that show detectable DIF when the other items are used as anchors. In the second step, these items are freely estimated for each sex, with the remaining items used as anchors (these are assumed to be unbiased). Finally, the total test can be scored using the invariant or partially invariant models (Meade, 2010), which shows the degree to which the items with bias impact the test scores. The results show negligible test-level bias, with estimates of -0.04 and 0.03 (positive values indicate items that favor males), depending on whether a multiple testing adjustment (Bonferroni) is applied or not. Figure 4 shows the item response functions.

[4] Specifically, we followed the approach recommended by the package developer, as presented in two workshops (Chalmers, 2015a, 2015b). We emailed Chalmers in May 2020 to ask if the approach was still considered valid, and he affirmed that it is.

Figure 4: Item response functions by sex.
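Before turning to the item-level results, here is a condensed sketch of that DIF workflow with mirt; the grouping vector sex and the Q1-Q45 item naming are assumptions, and the sketch mirrors the general approach described above rather than the paper's exact code.

```r
library(mirt)

# Step 1: fully constrained multiple-group model (all item parameters equal across sexes),
# then test each item for DIF by freeing its parameters while the others act as anchors
mg_constrained <- multipleGroup(scored, model = 1, group = sex,
                                invariance = c(colnames(scored), "free_means", "free_var"))
dif_res <- DIF(mg_constrained, which.par = c("a1", "d"),
               scheme = "drop", p.adjust = "bonferroni")

# Step 2: refit with the flagged items freely estimated per sex, the remaining items as anchors,
# then score everyone on the partially invariant model
flagged <- paste0("Q", c(8, 9, 13, 18, 19, 21, 24, 27, 29, 34, 37, 41, 43))  # the 13 flagged items per Table 2
anchors <- setdiff(colnames(scored), flagged)
mg_partial <- multipleGroup(scored, model = 1, group = sex,
                            invariance = c(anchors, "free_means", "free_var"))
theta_partial <- fscores(mg_partial, method = "EAP")
```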
It can be seen that some items are more informative than others (they have a greater maximal slope), and that some show notable sex bias (where the lines do not overlap; e.g., item 37 shows male bias and item 21 female bias). Table 2 provides item-level information.
Item  Pass rate  Difficulty  Discrimination  Loading  Male d  Male bias  Age r  Education r  Time spent r
1 0.99 -5.13 0.51 0.29 -0.06 0.00 -0.06 0.05 -0.21
2 0.94 -3.87 1.66 0.70 0.36 0.00 0.24 0.20 -0.09
3 0.99 -4.99 0.67 0.37 -0.04 0.00 -0.01 0.06 -0.21
4 0.50 0.02 0.71 0.38 0.20 0.00 0.11 0.22 0.01
5 0.98 -4.29 0.70 0.38 0.04 0.00 0.10 0.06 -0.11
6 0.85 -2.96 2.25 0.80 0.24 0.00 0.31 0.38 -0.05
7 0.49 0.04 0.12 0.07 0.00 0.00 -0.09 -0.05 -0.02
8 0.85 -2.88 2.09 0.77 0.05 -0.23 0.34 0.38 -0.01
9 0.83 -2.47 1.95 0.75 0.33 0.12 0.33 0.40 -0.02
10 0.97 -3.40 0.40 0.23 0.21 0.00 -0.12 0.01 -0.05
11 0.99 -4.68 0.83 0.44 0.11 0.00 0.01 0.14 -0.07
12 0.92 -4.00 2.26 0.80 0.14 0.00 0.34 0.42 -0.02
13 0.90 -3.48 2.10 0.78 0.02 -0.25 0.20 0.22 -0.06
14 0.98 -4.18 1.00 0.51 0.15 0.00 0.08 0.09 -0.07
15 0.85 -2.09 1.13 0.55 0.35 0.00 0.25 0.25 -0.02
16 0.97 -3.53 0.29 0.17 0.07 0.00 -0.08 0.00 -0.03
17 0.99 -5.81 1.36 0.62 -0.07 0.00 0.06 0.16 0.01
18 0.58 -0.43 1.43 0.64 0.41 0.31 0.30 0.34 0.01
19 0.65 -0.90 1.59 0.68 -0.05 -0.42 0.40 0.36 0.02
20 0.80 -2.43 2.30 0.80 0.17 0.00 0.33 0.34 0.00
21 0.58 -0.60 2.26 0.80 -0.13 -0.50 0.40 0.39 -0.05
22 0.98 -4.39 1.01 0.51 0.17 0.00 0.08 0.24 0.01
23 0.97 -3.51 0.61 0.34 0.09 0.00 -0.04 0.04 0.02
24 0.92 -4.64 2.79 0.85 0.07 -0.18 0.33 0.38 -0.01
25 0.61 -0.52 0.91 0.47 0.27 0.00 0.13 0.28 -0.01
26 0.91 -2.27 -0.01 -0.01 0.05 0.00 -0.12 -0.07 -0.01
27 0.31 1.43 2.22 0.79 0.35 0.10 0.28 0.39 0.00
28 0.69 -1.35 2.09 0.78 0.32 0.00 0.42 0.45 -0.03
29 0.86 -2.31 1.33 0.61 0.47 0.45 0.12 0.23 0.00
30 0.89 -2.52 1.05 0.52 0.00 0.00 0.12 0.23 -0.04
31 0.99 -5.24 0.86 0.45 0.15 0.00 -0.03 0.01 0.00
32 0.94 -2.79 0.52 0.29 0.03 0.00 -0.06 0.05 -0.03
33 0.89 -3.71 2.49 0.83 0.29 0.00 0.40 0.46 0.00
34 0.82 -2.89 2.54 0.83 -0.03 -0.32 0.40 0.42 -0.03
35 0.77 -1.59 1.33 0.61 0.13 0.00 0.43 0.45 0.02
36 0.58 -0.47 1.80 0.73 0.12 0.00 0.41 0.41 0.00
37 0.92 -4.10 2.33 0.81 0.57 0.32 0.29 0.34 0.02
38 0.43 0.41 1.65 0.70 0.40 0.00 0.39 0.31 0.04
39 0.79 -2.14 1.96 0.76 0.12 0.00 0.33 0.36 0.00
40 0.52 -0.15 1.76 0.72 0.37 0.00 0.22 0.43 0.02
41 0.72 -1.66 2.21 0.79 0.46 0.23 0.24 0.32 -0.07
42 0.61 -0.76 2.08 0.77 0.22 0.00 0.37 0.37 -0.01
43 0.31 1.24 1.90 0.74 0.10 -0.20 0.34 0.37 0.00
44 0.50 0.01 0.94 0.48 0.19 0.00 0.20 0.31 0.03
45 0.80 -1.54 0.81 0.43 0.16 0.00 0.23 0.25 -0.02
Table 2: Item statistics. The r columns (Age r, Education r, Time spent r) give the latent correlation with that variable (biserial; Uebersax, 2015). Male bias is measured in Cohen's d.
Of the 45 items, not all are good items, as scored using the site's key. The mean g-loading is .59 (SD = 0.22). Four items (7, 10, 16, and 26) had g-loadings below .25, and one below 0. These items should be revised or replaced. Of the 45 items, 13 showed evidence of sex bias (p_Bonferroni < .05). However, because the direction of bias was roughly symmetric around 0 (6 and 7 items), essentially no test-level bias was seen. The distribution of item sex bias is shown in Figure 5.
Jensen's method (also called the method of correlated vectors; Dragt, 2010; Jensen, 1998) is an alternative and simpler approach to examining the influence of latent variables. For any given scale, there are always a number of latent sources of variance, which may have different relationships to criterion variables. In the case of cognitive data, much research has been concerned with the relative influence of the general factor of intelligence (g) compared to other sources of variance (non-g) (Fernandes et al., 2014; te Nijenhuis et al., 2014; te Nijenhuis & van der Flier, 2013; Woodley of Menie et al., 2019). By theory, if g is the cause of the relationship between test scores and some criterion variable, then the items that are better measures of g should show stronger associations with that criterion variable. Figure 6 shows the scatterplots for the 4 criterion variables, and Table 3 shows the correlations between the item-level variables.
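In code, Jensen's method amounts to correlating the vector of item g-loadings with the vector of item-criterion correlations, for example (item_stats is an assumed data frame holding the columns of Table 2):

```r
# Method of correlated vectors: item g-loadings vs. item-level criterion correlations
with(item_stats, cor(loading, age_r))        # about .88 in this study
with(item_stats, cor(loading, education_r))  # about .89
with(item_stats, cor(loading, male_d))       # about .32
```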
The relationships for age and education are very strong. The relationship to male d is comparatively weaker, despite the DIF analysis finding that the gap was not due to test bias. Our interpretation is that the biased items upset the relationship to the g-loadings. To test this, we carried out a regression analysis that included the bias estimated from the DIF analysis. Results are shown in Table 4.
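A sketch of the corresponding models (variable names mirror Table 4 but are assumptions about the analysis data frame):

```r
m1 <- lm(male_d ~ loading,                          data = item_stats)
m2 <- lm(male_d ~ loading + difficulty,             data = item_stats)
m3 <- lm(male_d ~ loading + difficulty + male_bias, data = item_stats)
summary(m3)$adj.r.squared  # about .72, as reported in Table 4
```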
The regression results confirm the hypothesis. The sex-biased items are outliers in the plot, and including their estimated bias results in a well-fitting model (adj. R² = 72%). Figure 7 shows the item scatterplot with the biased items marked.
It can be seen in the plot that the outlying items with strong g-loadings are colored in the expected ways, with
male-biased items above the regression line, and female-biased below.
Figure 5: Density-histogram of item sex-bias. The vertical line shows the mean.
Figure 6: Jensen’s method applied to 4 criterion variables. Correlations are .88, .89, .32, and .25, respectively, for age,
education, male, and time spent.
Table 3: Item-level variables correlation matrix (45 items). Values in brackets are 95% confidence intervals. Only the lower triangle is shown; the matrix is symmetric.

                    (1)                   (2)                  (3)                  (4)                  (5)                 (6)                  (7)                 (8)
(1) Pass rate       1
(2) Difficulty      -0.95 [-0.97, -0.91]  1
(3) Discrimination  -0.23 [-0.49, 0.07]   0.13 [-0.17, 0.40]   1
(4) Loading         -0.25 [-0.51, 0.05]   0.15 [-0.15, 0.42]   0.97 [0.94, 0.98]    1
(5) Male d          -0.24 [-0.50, 0.06]   0.26 [-0.03, 0.52]   0.27 [-0.02, 0.52]   0.32 [0.03, 0.56]    1
(6) Male bias        0.09 [-0.21, 0.37]  -0.05 [-0.34, 0.25]  -0.17 [-0.44, 0.13]  -0.13 [-0.41, 0.17]   0.71 [0.53, 0.83]   1
(7) Age r           -0.43 [-0.65, -0.16]  0.40 [0.12, 0.62]    0.85 [0.75, 0.92]    0.88 [0.79, 0.93]    0.27 [-0.02, 0.52]  -0.22 [-0.48, 0.08]  1
(8) Education r     -0.45 [-0.66, -0.18]  0.40 [0.12, 0.62]    0.86 [0.76, 0.92]    0.89 [0.81, 0.94]    0.34 [0.05, 0.57]   -0.13 [-0.41, 0.17]  0.94 [0.90, 0.97]   1
(9) Time spent r    -0.43 [-0.64, -0.15]  0.46 [0.19, 0.67]    0.23 [-0.07, 0.49]   0.25 [-0.04, 0.51]   0.33 [0.04, 0.57]    0.07 [-0.23, 0.35]  0.35 [0.07, 0.59]   0.39 [0.11, 0.61]
Table 4: Regression models for item analysis (Jensen's method extended). Regression variables not standardized. * = p < .01, ** = p < .005, *** = p < .001.

Predictor/Model   Simple         Add difficulty   Add bias
Intercept         0.03 (0.066)   0.09 (0.076)     0.07 (0.043)
loading           0.24 (0.106)   0.21 (0.105)     0.28 (0.060***)
difficulty                       0.02 (0.013)     0.02 (0.007**)
male_bias                                         0.77 (0.081***)
R2 adj.           0.083          0.111            0.717
N                 45             45               45
4 Discussion
There were multiple findings of note. First, despite a few poor items, the test works well overall. The correlations
to self-reported educational attainment and age were expected, as were the positive Jensen’s method results for
these (Dragt,2010;Strenze,2015). The reliability was good, estimated around .90 across methods. As such, the
test can be recommended for public use. However, it should be noted that the norms are unknown. Under the
assumption that they are based on the test takers, they are possibly inaccurate insofar as test takers are not
representative of the general population. They may be smarter or duller, or have practiced the test more, or
cheated by looking up the word definitions during the test (Cavanagh,2014). To acquire better norm data, it is
necessary to administer the test to a large representative population.
Second, we examined the test for sex bias using DIF testing. On this test, males obtained somewhat higher scores, d = 0.28 (4.2 IQ points). We found evidence of sex bias in 13 of the 45 items. However, the directions of bias were roughly balanced (6 and 7 items), such that the test-level bias was near zero. This test can therefore justifiably be used to compare scores of male and female examinees. We also employed the simpler Jensen's method approach and found that the results were congruent with the DIF testing results. Jensen's method showed a positive slope for g-loading and male advantage on an item, and when the effect of item bias was removed, the model fit very well (adj. R² = 72%).

Figure 7: Jensen's method on items for male advantage, with colors for DIF-estimated item bias.
Jensen's method yields very strong results for education and age, indicating that older and more educated persons have greater vocabularies, and that this is related to the general factor of the test. This finding is in line with prior results using test-level analysis (Dragt, 2010). With regards to Jensen's method and item-level data, some prior studies have used suboptimal metrics (e.g., Al-Shahomee et al. 2017; Philippe et al. 2007), spawning a long list of critical papers (Wicherts, 2017, 2018a, 2018b; Wicherts & Johnson, 2009). Instead of the difficulty, pass rates were used, which are nonlinear. Instead of g-loadings, item-whole point-biserial correlations were used, which are affected by the pass rate. Because of this, items with pass rates close to 0.50 have higher 'g-loadings', and these are the same items that have larger group gaps when measured in pass rates, since a difference in latent ability of e.g. 1 d produces the largest pass rate difference for items whose overall pass rate is closest to 0.50. This confounding biases the resulting correlations in a positive direction. The present study did not use these faulty metrics and is thus unaffected by the criticism in those papers (see also Woodley of Menie et al. (2020) for another study using this approach).
Per the above results, the male advantage we find cannot be explained by bias in the items. Though overall IQ scores usually favor males in adults (Lynn, 2017), vocabulary scores do not. Table 5 shows a comparison of large representative studies of adults on vocabulary tests.
The meta-analysis by Hyde & Linn included studies of children as well as adults, whereas the remaining studies included only adults and usually had more than 1,000 subjects, mostly from standardization samples. Only the two studies from Taiwan show a male advantage of note, and the mean across all rows is 0.07 (without Taiwan, 0.02). Lynn (2021) carried out a meta-analysis of subtest results from Wechsler tests and found similar results. For instance, there was a median male advantage of only 0.12 d on the vocabulary scale across 34 studies, less than half the male advantage we find in this study. While we do not know why we observe a notable difference where others generally do not, we speculate this may be due to a sex difference in self-selection for online testing, such that duller men have a stronger tendency not to participate compared to duller women. It is known that participation in academic studies is related to intelligence and educational attainment, and that this effect differs by sex (Pirastu et al. 2021; but note that the direction of bias varied between 23andMe and UK Biobank!). However, it is not known how well this generalizes to online tests taken at leisure.
Table 5: Summary of large representative studies of sex differences in vocabulary in adults.
Test Country Year d citation
meta-analysis of 40 studies various until 1988 0.02 (Hyde & Linn,1988)
WAIS-3 vocabulary USA 1997 0.04 (Chen & Lynn,2020)
WAIS-3 vocabulary Taiwan 2001 0.31 (Chen & Lynn,2020)
WAIS-4 vocabulary USA 2008 0.05 (Chen & Lynn,2018)
WAIS-4 vocabulary Taiwan 2015 0.20 (Chen & Lynn,2018)
WAIS-4 vocabulary Chile 2013 0.02 (Lynn,2016)
WAIS-4 vocabulary South Korea 2011 -0.01 (Lynn & Hur,2016)
custom vocabulary test Brazil 2014 -0.03 (Flores-Mendoza et al.,2016)
Third, we find that the site's scoring approach of summing correct answers and subtracting the incorrect ones is inferior to the simpler approach of summing correct answers only. Furthermore, using an IRT scoring approach is slightly superior to both of these simpler approaches (r with age .40 vs. .35/.38; r with education .44 vs. .42/.43). However, we find that using the full categorical data is not better than using the dichotomized data (Storme et al., 2019). It is thus suggested that the website adopt a dichotomous IRT approach for scoring, reported alongside the sum of correct responses given the latter's ease of understanding. The current scoring rule that subtracts points for incorrect answers is suboptimal. The main limitation of this criterion analysis is that we only have two criterion variables to investigate. It would be preferable to repeat this method comparison using a wider range of criterion variables, and preferably in a larger dataset, so that precision would be sufficiently high to detect even small differences between correlations (e.g., r = .20 vs. .22).
References
Al-Shahomee, A. A., Nijenhuis, J. t., Hoek, M. v. d., Spanoudis, G., & Žebec, M. S. (2017). Spearman's hypothesis tested comparing young Libyan with European children on the items of the Standard Progressive Matrices. Mankind Quarterly, 57(3). doi: 10.46469/mq.2017.57.3.15
Cavanagh, T. (2014). Cheating on online assessment tests: Prevalence and impact on validity. Collected Faculty and Staff Scholarship. Retrieved from https://scholar.dominican.edu/all-faculty/174
Chalmers, P. (2015a). Multidimensional item response theory workshop in R. Retrieved from https://philchalmers.github.io/mirt/extra/mirt-Workshop-2015_Day-1.pdf
Chalmers, P. (2015b). Multidimensional item response theory workshop in R (day 2). Retrieved from https://philchalmers.github.io/mirt/extra/mirt-Workshop-2015_Day-2.pdf
Chalmers, P., Pritikin, J., Robitzsch, A., Zoltak, M., Kim, K., Falk, C. F., ... Oguzhan, O. (2020). mirt: Multidimensional item response theory (1.32.1) [computer software]. Retrieved from https://CRAN.R-project.org/package=mirt
Chen, H.-Y., & Lynn, R. (2018). Sex differences on the WAIS-IV in Taiwan and the United States. Mankind Quarterly, 59(1). doi: 10.46469/mq.2018.59.1.11
Chen, H.-Y., & Lynn, R. (2020). Sex differences on the WAIS-III in Taiwan and the United States. Mankind Quarterly, 61(2). doi: 10.46469/mq.2020.61.2.9
Condon, D. M., & Revelle, W. (2014). The international cognitive ability resource: Development and initial validation of a public-domain measure. Intelligence, 43, 52-64. doi: 10.1016/j.intell.2014.01.004
Cutler, A., Dunkel, C. S., McLoughlin, S., & Kirkegaard, E. O. W. (2019). Machine learning psychometrics: Improved cognitive ability validity from supervised training on item level data. International Society for Intelligence Research. Retrieved from https://www.researchgate.net/publication/334477851_Machine_learning_psychometrics_Improved_cognitive_ability_validity_from_supervised_training_on_item_level_data doi: 10.13140/RG.2.2.21354.26562
Dragt, J. (2010). Causes of group differences studied with the method of correlated vectors: A psychometric meta-analysis of Spearman's hypothesis. Retrieved from https://osf.io/qk8jm/
Dworak, E. M., Revelle, W., Doebler, P., & Condon, D. M. (2021). Using the international cognitive ability resource as an open source tool to explore individual differences in cognitive ability. Personality and Individual Differences, 169, 109906. doi: 10.1016/j.paid.2020.109906
Fernandes, H. B., Woodley, M. A., & te Nijenhuis, J. (2014). Differences in cognitive abilities among primates are concentrated on g: Phenotypic and phylogenetic comparisons with two meta-analytical databases. Intelligence, 46, 311-322. doi: 10.1016/j.intell.2014.07.007
Flores-Mendoza, C., Darley, M., & Fernandes, H. B. F. (2016). Cognitive sex differences in Brazil. Mankind Quarterly, 57(1). doi: 10.46469/mq.2016.57.1.4
Gorsuch, R. L. (2015). Factor analysis (Classic ed.). Routledge, Taylor & Francis Group.
Hyde, J. S., & Linn, M. C. (1988). Gender differences in verbal ability: A meta-analysis. Psychological Bulletin, 104(1), 53-69. doi: 10.1037/0033-2909.104.1.53
Jensen, A. R. (1998). The g factor: The science of mental ability. Praeger.
Kline, P. (2015). A handbook of test construction (psychology revivals): Introduction to psychometric design (1st ed.). Routledge. doi: 10.4324/9781315695990
Lynn, R. (2016). Sex differences on the WAIS-IV in Chile. Mankind Quarterly, 57(1). doi: 10.46469/mq.2016.57.1.5
Lynn, R. (2017). Sex differences in intelligence: The developmental theory. Mankind Quarterly, 58(1). doi: 10.46469/mq.2017.58.1.2
Lynn, R. (2021). Sex differences in verbal abilities in the Wechsler tests: A review. Mankind Quarterly, 61(3). doi: 10.46469/mq.2021.61.3.16
Lynn, R., & Hur, Y.-M. (2016). Sex differences on the WAIS-IV in the South Korean standardization sample. Mankind Quarterly, 57(1). doi: 10.46469/mq.2016.57.1.6
McNeish, D., & Wolf, M. G. (2020). Thinking twice about sum scores. Behavior Research Methods, 52, 2287-2305. doi: 10.3758/s13428-020-01398-0
Meade, A. W. (2010). A taxonomy of effect size measures for the differential functioning of items and scales. The Journal of Applied Psychology, 95(4), 728-743. doi: 10.1037/a0018966
Merz, Z. C., Lace, J. W., & Eisenstein, A. M. (2020). Examining broad intellectual abilities obtained within an mTurk internet sample. Current Psychology. doi: 10.1007/s12144-020-00741-0
Philippe, R. J., Ann, B. T., A, V. P., & Čvorović, J. (2007). Genetic and environmental contributions to population group differences on the Raven's Progressive Matrices estimated from twins reared together and apart. Proceedings of the Royal Society B: Biological Sciences, 274(1619), 1773-1777. doi: 10.1098/rspb.2007.0461
Pirastu, N., Cordioli, M., Nandakumar, P., Mignogna, G., Abdellaoui, A., Hollis, B., ... Ganna, A. (2021). Genetic analyses identify widespread sex-differential participation bias. BioRxiv, 2020.03.22.001453. doi: 10.1101/2020.03.22.001453
Ree, M. J., Carretta, T. R., & Earles, J. A. (1998). In top-down decisions, weighting variables does not matter: A consequence of Wilks' theorem. Organizational Research Methods, 1(4), 407-420. doi: 10.1177/109442819814003
Storme, M., Myszkowski, N., Baron, S., & Bernard, D. (2019). Same test, better scores: Boosting the reliability of short online intelligence recruitment tests with nested logit item response theory models. Journal of Intelligence, 7(3), 17. doi: 10.3390/jintelligence7030017
Strenze, T. (2015). Intelligence and success (In: Goldstein, S., Princiotta, D., & Naglieri, J., Eds.). Springer New York. Retrieved from http://link.springer.com/10.1007/978-1-4939-1562-0_25
Suh, Y., & Bolt, D. (2010). Nested logit models for multiple-choice item response data. Psychometrika, 75, 454-473. doi: 10.1007/s11336-010-9163-7
te Nijenhuis, J., Jongeneel-Grimen, B., & Kirkegaard, E. O. (2014). Are Head Start gains on the g factor? A meta-analysis. Intelligence, 46, 209-215. doi: 10.1016/j.intell.2014.07.001
te Nijenhuis, J., & van der Flier, H. (2013). Is the Flynn effect on g?: A meta-analysis. Intelligence, 41(6), 802-807. doi: 10.1016/j.intell.2013.03.001
Wicherts, J. M. (2017). Psychometric problems with the method of correlated vectors applied to item scores (including some nonsensical results). Intelligence, 60, 26-38. doi: 10.1016/j.intell.2016.11.002
Wicherts, J. M. (2018a). Ignoring psychometric problems in the study of group differences in cognitive test performance. Journal of Biosocial Science, 50(6), 868-869.
Wicherts, J. M. (2018b). This (method) is (not) fine. Journal of Biosocial Science, 50(6), 872-874. doi: 10.1017/S0021932018000184
Wicherts, J. M., & Johnson, W. (2009). Group differences in the heritability of items and test scores. Proceedings of the Royal Society B: Biological Sciences, 276(1667), 2675-2683. doi: 10.1098/rspb.2009.0238
Woodley of Menie, M. A., te Nijenhuis, J., Shibaev, V., Li, M., & Smit, J. (2019). Are the effects of lead exposure linked to the g factor? A meta-analysis. Personality and Individual Differences, 137, 184-191. doi: 10.1016/j.paid.2018.09.005
Woodley of Menie, M. A., Kirkegaard, E. O. W., & Meisenberg, G. (2020). Latent variable moderation of the negative fertility-item pass rate association in two, large datasets: An item response theory analysis. doi: 10.13140/RG.2.2.27203.96809
Young, S. R., Keith, T. Z., & Bond, M. A. (2019). Age and sex invariance of the international cognitive ability resource (ICAR). Intelligence, 77, 101399. doi: 10.1016/j.intell.2019.101399
To encourage broader assessment of cognitive abilities in research across scientific fields, Condon and Revelle (2014) developed the first cognitive assessment in the public domain, the International Cognitive Ability Resource (ICAR). Despite initial support for its psychometric properties, little is known about the construct validity of the ICAR across distinct groups of individuals. In order to meaningfully interpret ICAR scores across diverse populations, measurement invariance must be established. To this end, a multiple group confirmatory factor analysis (MGCFA) was conducted on the full 60-item ICAR (ICAR60) and the 16-item Sample Test (ICAR16) to test for invariance across self-reported biological sex and age groups. A moderated nonlinear factor analysis (MNLFA) was conducted on the ICAR16 to test for differential item functioning (DIF) across linear age, age squared, sex, and their respective interaction terms. The baseline MGCFA models proposed by the test developers fit the data well for males and females and across age groups based on the fit indices Mc, CFI, TLI, and RMSEA. In the MCGFA, both forms demonstrated acceptable changes in the Mc (∆< .02) and CFI (∆< .01) in both loading- and threshold-constrained models for the sex and age group models. The MNLFA supported the weak and strong measurement invariance of the ICAR16 found in the MGCFA; no meaningful differences in item thresholds or factor-loadings were found across age, sex, or their interaction terms. Overall findings provide evidence that both forms measure the same constructs across sex and age, and the same strength of the relations exists among the first-order factors and the items. Despite these findings, the internal consistency of the subscales suggests only the total score of the ICAR16 be used for research; those interested in the subscales of the ICAR are advised to use the ICAR60.