MANKIND QUARTERLY 2022 63.1 79-88
The Intelligence Gap between Black and White Survey
Workers on the Prolific Platform
Emil O.W. Kirkegaard*
Ulster Institute for Social Research, London
* Address for correspondence:
This brief report analyzes data from a series of studies carried out by
Bates and Gignac (2022), collected from paid survey takers on the Prolific
platform (total n = 3357). In this UK sample, Black-White gap sizes on
cognitive tests were substantial with an overall effect size d of 0.99
standard deviations adjusted for unreliability (unadjusted means = 0.84 d).
Testing for measurement invariance via differential item functioning found
either no bias or bias of trivial magnitude. We conclude that the Black-
White intelligence gap seen in Prolific workers is of similar magnitude to
the gap seen elsewhere in America.
Key Words: Black-White gap, Race, Intelligence, Cognitive ability,
Differential item functioning, Measurement invariance, Test bias, Survey,
Questionnaire, Prolific
Ethnic/racial groups vary in their average levels of intelligence. The
differences are generally, but not entirely, consistent across time and place (Lynn,
2008, 2015). Of particular interest is the difference between Africans and
Europeans, as these groups are present in large numbers in the United States,
Canada, the United Kingdom, and increasingly elsewhere. Historically, this is also the
best studied ethnic difference, with studies dating back over more than a century
in the United States. Overall, the gap size is about 15 IQ points in the United
States and has been stable since it was first measured (Roth et al., 2001). This
difference was still present in multiple large, broadly representative samples in
recent years, showing that claims of substantial narrowing were mistaken (Frisby
& Beaujean, 2015; Fuerst et al., 2021; Kirkegaard et al., 2019; Lasker et al.,
2019). This is in contrast to findings and predictions by Dickens and Flynn (2006).
Recently, Murray (2021) computed the Black (African American) and White
(European American) IQ gaps for various occupations. His findings are
reproduced in Table 1.
Table 1. Mean occupation IQs by race in the U.S., copied from Murray (2021).
100 is defined as the population mean, not the White mean. Gaps within
occupations are in Cohen’s d values using the standard deviations of those
subjects, so they reflect restriction of variance inside groups. Data derived from
NLSY79 and NLSY97.
[Table 1 body damaged in extraction; the numeric columns were lost. The listed occupations were: K-12 teachers, registered nurses, social workers, retail sales, child care, secretaries & AAs, and janitors.]
Social science increasingly relies on survey data gathered online. It is
possible that the use of online surveys induces selection bias towards smarter
subjects. A recent study found that, compared to a representative norming
sample, paid survey workers at Amazon’s MTurk platform
( had an average IQ (100) very close to the norming
sample (Merz et al., 2022). As far as we are aware, there is no published study
of whether the well-known ethnic gaps in intelligence are also present on the
Prolific platform, a rival to Amazon’s MTurk that is marketed at Academic
Research (; Palan & Schitter, 2018). This is an academic-
focused platform for buying and selling survey data. Subjects can join the platform
and participate in ongoing survey studies. Researchers can similarly run studies
on the platform. Prior research has found that data from online samples work
similarly to traditional student samples, but are more representative. Prolific offers
nationally representative data in terms of characteristics such as age, sex, race,
and education. The purpose of the present study was to examine the Black-White
gap on data derived from this platform.
We used data from a recent study of the effect of motivation on intelligence
test scores (Bates & Gignac, 2022), with research subjects from the United
Kingdom (T. Bates, personal communication). The study had multiple
subsamples, as described in the study:
(studies 1a–c)
Subjects in all studies were recruited from Prolific Academic, a crowd
sourcing online platform to recruit human subjects for research purposes.
For study 1a, we recruited 1001 adult subjects (age M = 28.41, SD = 6.04;
range: 18 to 39 years, 499 male and 499 female, 2 did not answer this
item). For study 1b, we recruited 1000 adult subjects (age M = 34.49, SD
= 11.75; range: 18 to 76 years) from Prolific Academic (497 male and 503
female). The sample was predominantly white (White = 89.7%; Asian =
4.5%; Black = 1.8%; South-East Asian = 1.4%; Other = 2.6%). For study
1c, we recruited 1006 adult subjects (age M = 24.31, SD = 4.79; range: 18
to 39 years) from Prolific Academic (502 male and 504 female). The
sample composition was: White = 41.5%; Asian = 0.9%; Black = 35.3%;
South-East Asian = 0.4%; Native American = 0.9%; Other = 21.1%.
(study 2)
We recruited 400 adult subjects (age M = 29.75, SD = 5.90; range: 18 to
40 years) from Prolific Academic (202 male and 198 female). The sample
was predominantly white (White = 92.5%; Asian = 3.0%; Black = 1.3%;
South East Asian = 0.5%; Other = 2.8%).
(study 3)
We recruited 801 adult subjects (age M = 36.11, SD = 12.89; range: 18 to
76 years) from Prolific Academic (402 male and 399 female). The sample
was predominantly white (White = 89.0%; Asian = 4.6%; Black = 2.2%;
South-East Asian = 1.0%; Other = 3.1%).
(study 4)
We recruited an additional 150 adult subjects (age M = 28.83, SD = 6.24;
range: 18 to 39 years) from Prolific Academic (75 male and 75 female).
The sample was predominantly white (White = 85.3%; Asian = 7.3%; Black
= 3.3%; South-East Asian = 0.7%; Other = 3.1%).
The first sample (1a) did not include any questions about race, so we were unable
to use the data from that study. Sample 1c oversampled Blacks to reach a
substantial percentage (35.3%), whereas 1b, 2, 3, and 4 sampled freely from
adults on the platform, resulting in very small percentages of Blacks (1.8% to
3.3%). The different studies used different, abbreviated tests (10 to 32 items):
(studies 1a–c)
Study 1b used the test of Single Word Comprehension (Warrington et al.,
1998). This test consists of 52 target words, each presented with two
potential response words arranged below it, and for each target subjects must select the word which is the best synonym (e.g., MARQUEE: Tent; Palace). Half are concrete and half abstract. Based on item-level data, we created a short form with 13 concrete items and 12 abstract items. Coefficient ω in our sample was 0.62. Study 1c used Form A (10 items) of the 20-item Visual Paper Folding test (Ekstrom et al., 1976). Dating
back in form not only to the work of Thurstone, but at least as early as
Binet (1905/1916), this scale consists of illustrations depicting a square
sheet of paper being folded two or three times and a hole punched in it.
The task is to select which of 5 graphical response options depicts how
the holes would appear if the sheet was unfolded. Matched versions are
provided as part of the Kit of Factor-Referenced Cognitive Tests. Each
block consisted of 10 items with a 3-minute time limit. Coefficient ω in our
sample was 0.68.
(studies 2–4)
Intelligence was measured using Form A and Form B of the Visual Paper
Folding test (Ekstrom et al., 1976). Each form includes 10 items (3-minute
time limit) and the forms are calibrated as approximately equally difficult.
For further details, see Study 1. For the total sample, coefficient ω was
estimated: Form A = 0.688; Form B = 0.627. Effort was measured using the 10-item Sundre and Thelk (2007) Student Opinion Scale. Internal consistency reliability was estimated at ω = 0.81 and 0.80 for the pre- and
post-effort conditions. The reliabilities for each experimental condition are
included in Table 2.
As studies 2 through 4 used the same intelligence test items, they were combined into a single dataset (by the original authors of the data). This left us with three samples to examine: 1b, 1c, and the combined 2–4. The data were published in the source article's repository (705617acaf734286844b1521ed87afdc), which is where we obtained them. The data, R code, and R notebook from the present study are also available online.
Our approach was the same across datasets:
1. Fit an item response model to the full dataset (including subjects who were
neither Black nor White).
2. Compute the g factor scores based on a g-only model. Standardize the scores
to the White subset.
3. Test the items for differential item functioning using the approach outlined by
Chalmers (2015), which was previously used in Dutton and Kirkegaard (2022),
Kirkegaard (2021), and Lasker et al. (2021). This was done using the mirt
package (Chalmers et al., 2020).
4. Compute the Black-White gap size in White standard deviations. Additionally,
compute the reliability-adjusted gap size based on the estimated reliability of
the test. This correction is done by converting the d value to a point biserial
correlation, adjusting for imperfect reliability using the Spearman correction,
and converting back to a d value. Reliability was estimated using the item
response theory based method implemented in the function empirical_rxx().
5. Bootstrap the confidence intervals and standard errors for the gap sizes (1000 bootstrap replicates).
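The reliability correction in step 4 can be sketched as follows. This is a minimal illustration of the d → point-biserial → Spearman-corrected → d route described above, assuming balanced groups; the function name and the example values (an observed 0.84 d on a test with reliability 0.72) are ours, for illustration only, not taken from the study's R code.

```python
import math

def adjust_d_for_reliability(d, rxx, p=0.5):
    """Correct a standardized mean difference for test unreliability.

    Follows the route described in the text: convert d to a point-biserial
    correlation, apply the Spearman correction for attenuation on the test
    side, and convert back to d.  p is the proportion of one group
    (0.5 = balanced groups).
    """
    q = 1.0 - p
    # d -> point-biserial correlation
    r = d / math.sqrt(d**2 + 1.0 / (p * q))
    # Spearman correction: disattenuate for imperfect test reliability
    r_c = r / math.sqrt(rxx)
    # point-biserial -> d
    return r_c / math.sqrt(p * q * (1.0 - r_c**2))

# Hypothetical example values, for illustration only
print(round(adjust_d_for_reliability(0.84, 0.72), 2))
```

With reliabilities around 0.7, an observed gap of 0.84 d rises to roughly 1.0 d, which matches the scale of the adjustment reported in this paper.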
Finally, we used the output of step (5) to perform a Hunter-Schmidt
psychometric meta-analysis of the various results. Table 2 shows the results.
Table 2. Black-White test score gaps by sample. d adjusted refers to values
adjusted for imperfect reliability. CI = confidence interval (bootstrap, centile
method). The test bias effect size was estimated from the differential item functioning partial fits; positive values indicate higher scores for the White group. Note
that one item was excluded from sample 1b because it had no variance in the
Black group, thus leaving 24/25 items.
[Table 2 body damaged in extraction. The column headers were: n total, d adjusted (95% CI), test bias d, test reliability, and test items. Of the cell values, only the adjusted-d confidence intervals survived; in row order (1b, 1c, 2–4) they were 0.38–1.51, 0.86–1.07, and 0.55–1.25.]
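The centile (percentile) bootstrap used for the Table 2 intervals can be illustrated with a small stand-alone sketch. The function names and the synthetic data below are ours; the study's own analysis was done in R.

```python
import random
import statistics

def cohens_d(x, y):
    # Gap in units of the second group's SD (the paper standardizes
    # scores to the White subset's standard deviation).
    return (statistics.mean(x) - statistics.mean(y)) / statistics.stdev(y)

def bootstrap_ci(x, y, stat=cohens_d, reps=1000, alpha=0.05, seed=1):
    """Centile bootstrap: resample each group with replacement,
    recompute the statistic, and read off empirical quantiles."""
    rng = random.Random(seed)
    draws = sorted(
        stat([rng.choice(x) for _ in x], [rng.choice(y) for _ in y])
        for _ in range(reps)
    )
    return draws[int(alpha / 2 * reps)], draws[int((1 - alpha / 2) * reps) - 1]

# Synthetic demonstration (not the study's data): two groups with a
# true gap of 1.0 d.
rng = random.Random(0)
high = [rng.gauss(1.0, 1.0) for _ in range(300)]
low = [rng.gauss(0.0, 1.0) for _ in range(300)]
lo, hi = bootstrap_ci(high, low)
print(f"95% CI: {lo:.2f} - {hi:.2f}")
```

The interval brackets the true gap of 1.0 d; its width shrinks as group sizes grow, which is why the balanced sample 1c yields a much narrower CI than the samples with very few Black subjects.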
Almost all the information about the gap comes from sample 1c, which had a fairly balanced split between Black and White subjects. Yet the three samples found much the same magnitudes when measurement error was adjusted for. We carried out a random-effects meta-analysis using metafor (Viechtbauer, 2015), which resulted in an overall adjusted effect size of 0.99 d (95% CI: 0.89–1.09), corresponding to 14.8 IQ points if one assumes the standard deviations are identical to those of the general population. The unadjusted overall effect size was 0.83 d (0.75–0.92). There was no detectable heterogeneity in either analysis (p > .05, I² = 0%).
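For readers unfamiliar with random-effects pooling, the computation can be sketched in Python. The paper used the metafor R package, whose default estimator differs; the DerSimonian-Laird estimator below is a common textbook stand-in, not a reproduction. The inputs are rough values back-calculated by us from the Table 2 confidence intervals (midpoint as the estimate, width/3.92 as the SE), for illustration only.

```python
import math

def dersimonian_laird(effects, ses):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator.
    (A simplified stand-in for metafor's rma(); not the paper's exact method.)"""
    w = [1.0 / se**2 for se in ses]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                      # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # heterogeneity share
    w_re = [1.0 / (se**2 + tau2) for se in ses]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    return pooled, math.sqrt(1.0 / sum(w_re)), tau2, i2

# Approximate inputs back-calculated from the Table 2 CIs; illustrative only.
d_adj = [0.945, 0.965, 0.90]   # samples 1b, 1c, 2-4 (CI midpoints)
se = [0.288, 0.054, 0.179]     # CI width / 3.92
pooled, se_pooled, tau2, i2 = dersimonian_laird(d_adj, se)
print(f"pooled d = {pooled:.2f}, tau^2 = {tau2:.2f}, I^2 = {i2:.0f}%")
```

With these approximated inputs the pooled estimate lands near 0.96 d with tau² = 0 and I² = 0%, in the neighborhood of the reported 0.99 d; the small discrepancy reflects the approximated inputs, not a disagreement in method.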
With regard to test bias, a small test-level bias of 0.05 d was found in the large sample. This was due to two biased items, one favoring each group (see Figure 1). However, the item favoring Whites was somewhat stronger in its effect, resulting in a 0.05 d bias at the test level in favor of Whites. This, however, is fairly trivial compared to the 1.00 d gap between Blacks and Whites.
Figure 1. Item plots for differential item functioning analysis. Items PFA10 (top)
and PFA3 (bottom) show bias favoring Blacks and Whites, respectively.
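The general idea of item-level bias testing can be illustrated with a method simpler than the mirt likelihood-based approach used in this paper: the Mantel-Haenszel procedure, which matches groups on the rest score (total score excluding the studied item) and pools the group-by-correctness odds ratios across score strata. The code and toy response patterns below are ours, shown only to convey the logic of DIF detection.

```python
import math

def mantel_haenszel_dif(ref, foc, item):
    """Mantel-Haenszel DIF statistic for one dichotomous item.

    Subjects are matched on the rest score; within each score stratum a
    2x2 table of group x correctness is formed and the common odds
    ratio is pooled across strata.  Returns the log odds ratio:
    0 means no DIF; positive means the item is easier for the
    reference group at matched ability.
    """
    def rest(r):
        return sum(r) - r[item]
    strata = {rest(r) for r in ref} & {rest(r) for r in foc}
    num = den = 0.0
    for k in strata:
        r_item = [r[item] for r in ref if rest(r) == k]
        f_item = [r[item] for r in foc if rest(r) == k]
        a, b = sum(r_item), len(r_item) - sum(r_item)  # reference: right / wrong
        c, d = sum(f_item), len(f_item) - sum(f_item)  # focal: right / wrong
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return math.log(num / den) if num > 0 and den > 0 else 0.0

# Toy item-response patterns (1 = correct); illustrative only.
patterns = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
print(mantel_haenszel_dif(patterns, patterns, 0))  # identical groups -> 0.0
```

An unbiased item yields a log odds ratio near zero even when the groups differ in overall ability, because the comparison is made within matched score strata; this is the same separation of impact from bias that the mirt analysis performs with model-based ability estimates.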
Our analysis of a moderately large online sample of UK adults from the
Prolific platform showed that the Black-White intelligence gap observed in
population-representative samples was also present here. This is not surprising given that it is present within many occupations studied, a nearly necessary consequence of the population-level differences coupled with racially fair selection (Murray, 2021; Roth et al., 2001). Our meta-analysis of the three samples found an overall gap size of 0.99 d (14.82 IQ points) when adjusted for
imperfect reliability. Adjusting for measurement error is important so as not to be
misled by differences between samples in measurement reliability, which would
otherwise show up as artificial heterogeneity between samples (Hunter &
Schmidt, 2015). If one does not adjust, the overall effect size is somewhat
reduced to 0.83 d (12.53 IQ points).
However, the moderately large size of the gap is remarkable considering that
according to Lynn and Fuerst (2021) most of the recent studies of adolescents in
the UK have shown smaller Black-White gaps of less than 10 points and in some
cases less than 5 points. Because these authors also report that the Black-White
gap diminished over time, it is likely that much of the discrepancy between their
adolescent data and the adult Prolific workers in the present study can be
explained as a cohort effect, with Flynn effects being stronger in the Black than the White population.
Because intelligence test scores are a strong predictor of job performance
(Schmidt et al., 2016), ethnic gaps in mean intelligence within occupations will
also result in job performance differences; these have also been meta-analyzed
(Roth et al., 2003, 2008). For survey workers, job performance would consist of
following instructions, not using multiple accounts, not filling in responses
dishonestly, and so on (Arthur et al., 2021). Given the intelligence gap, we expect that survey data from Black subjects should be of somewhat lower quality than data gathered from White subjects. This can be important because lower-quality survey responses might result in differential reliability for psychological
scales or tests by race, which could create spurious interactions between race
and other variables as the strength of association would differ by race due to the
differential reliability. In terms of actual results, a number of studies have
compared self-reports to objective measures of drug use. These have found that Blacks and Hispanics more often falsely report their drug use, as judged against urine test results, than do White subjects (Fendrich & Johnson, 2005; Hughes et al., 2010). A number of studies have found that scores on lying
scales of various tests are higher for Blacks and Hispanics than for Whites in the
United States (Pina et al., 2001; Reynolds & Paget, 1983; Reynolds & Richmond,
1978). Taken together, there is reason for caution when interpreting self-report
based findings that might instead be explained by differences in survey data
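The differential-reliability mechanism can be made concrete with the classical attenuation formula: the observed correlation equals the true correlation shrunk by the square roots of the two measures' reliabilities. The reliability values below are hypothetical, chosen only to show how equal true associations can look unequal across groups.

```python
import math

def observed_r(true_r, rxx, ryy):
    """Classical attenuation: the observed correlation is the true
    correlation shrunk by the square roots of the reliabilities of
    the two measures."""
    return true_r * math.sqrt(rxx * ryy)

# Identical true association in both groups, but lower measurement
# reliability (lower data quality) in one group yields a weaker
# observed association -- a spurious group x predictor interaction.
true_r = 0.50
r_group_a = observed_r(true_r, rxx=0.85, ryy=0.85)  # higher data quality
r_group_b = observed_r(true_r, rxx=0.65, ryy=0.65)  # lower data quality
print(round(r_group_a, 3), round(r_group_b, 3))     # -> 0.425 0.325
```

A researcher comparing these observed correlations could mistake the 0.10 difference for a substantive moderation effect when it is purely an artifact of data quality.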
The limitations of the study include, first, the open sampling of UK adults on the platform, as opposed to a nationally representative sample,
something which Prolific also offers for a higher price. Second, the tests used
were very short. It would be better to use a more diverse set of items, though this
has to be balanced against the cost of buying the data. As we did not collect the
data, we could not have chosen a different trade-off. Third, the use of a spatial test might have inflated the Black-White gap, because Blacks have been found to underperform especially strongly on spatial tests (Frisby & Beaujean, 2015); thus, even if the sample is representative of the general population, the results might still be consistent with a substantial narrowing of the Black-White gap on broader measures.
Arthur, W., Hagen, E. & George, F. (2021). The lazy or dishonest respondent: Detection
and prevention. Annual Review of Organizational Psychology and Organizational
Behavior 8: 105-137.
Bates, T.C. & Gignac, G.E. (2022). Effort impacts IQ test scores in a minor way: A multi-
study investigation with healthy adult volunteers. Intelligence 92: 101652.
Binet, A. & Simon, T. (1905/1916). New methods for the diagnosis of the intellectual
level of subnormals. L'Année Psychologique 12: 191-244.
Chalmers, P. (2015). Multidimensional item response theory workshop in R.
Chalmers, P., Pritikin, J., Robitzsch, A., Zoltak, M., Kim, K., Falk, C.F., Meade, A.,
Schneider, L., King, D., Liu, C.-W. & Oguzhan, O. (2020). mirt: Multidimensional item
response theory (1.32.1) [Computer software].
Dickens, W.T. & Flynn, J.R. (2006). Black Americans reduce the racial IQ gap: Evidence
from standardization samples. Psychological Science 17: 913-920.
Dutton, E. & Kirkegaard, E. (2022). The negative religiousness-IQ nexus is a Jensen
effect on individual-level data: A refutation of Dutton et al.’s ‘The Myth of the Stupid
Believer.’ Journal of Religion and Health 61(4): 3253-3275.
Ekstrom, R.B., French, J.W., Harman, H.H. & Dermen, D. (1976). Kit of factor-
referenced cognitive tests. Princeton NJ: Educational Testing Service.
Fendrich, M. & Johnson, T.P. (2005). Race/ethnicity differences in the validity of self-
reported drug use: Results from a household survey. Journal of Urban Health: Bulletin
of the New York Academy of Medicine 82(2 Suppl 3): iii67-81.
Frisby, C.L. & Beaujean, A.A. (2015). Testing Spearman’s hypotheses using a bi-factor
model with WAIS-IV/WMS-IV standardization data. Intelligence 51: 79-97.
Fuerst, J.G.R., Hu, M. & Connor, G. (2021). Genetic ancestry and general cognitive
ability in a sample of American youths. Mankind Quarterly 62: 186-216.
Lynn, R. & Fuerst, J.G. (2021). Recent studies of ethnic differences in the cognitive
ability of adolescents in the United Kingdom. Mankind Quarterly 61: 987-999.
Hughes, A., Heller, D. & Marsden, M.E. (2010). The validity of self-reported tobacco and
marijuana use, by race/ethnicity, gender, and age. In: Ninth Conference on Health
Survey Research Methods: Conference proceedings. Hyattsville, MD: U.S. Department
of Health and Human Services, Public Health Service, Centers for Disease Control and
Prevention, National Center for Health Statistics.
Hunter, J.E. & Schmidt, F.L. (2015). Methods of Meta-Analysis: Correcting Error and
Bias in Research Findings, 3rd ed. Sage.
Kirkegaard, E.O.W. (2021). An examination of the vocabulary
test. OpenPsych 1(1).
Kirkegaard, E.O.W., Woodley of Menie, M.A., Williams, R.L., Fuerst, J. & Meisenberg,
G. (2019). Biogeographic ancestry, cognitive ability and socioeconomic outcomes.
Psych 1(1): 1-25.
Lasker, J., Nyborg, H. & Kirkegaard, E.O.W. (2021). Spearman’s hypothesis in the
Vietnam Experience Study and National Longitudinal Survey of Youth ‘79. PsyArXiv.
Lasker, J., Pesta, B.J., Fuerst, J.G.R. & Kirkegaard, E.O.W. (2019). Global ancestry and
cognitive ability. Psych 1(1): 431-459.
Lynn, R. (2008). The Global Bell Curve: Race, IQ, and Inequality Worldwide.
Washington Summit.
Lynn, R. (2015). Race Differences in Intelligence, revised edition. Washington Summit.
Merz, Z.C., Lace, J.W. & Eisenstein, A.M. (2020). Examining broad intellectual abilities
obtained within an mTurk internet sample. Current Psychology 41: 2241-2249.
Murray, C.A. (2021). Facing Reality: Two Truths about Race in America. Encounter Books.
Palan, S. & Schitter, C. (2018). Prolific.ac: A subject pool for online experiments.
Journal of Behavioral and Experimental Finance 17: 22-27.
Pina, A.A., Silverman, W.K., Saavedra, L.M. & Weems, C.F. (2001). An analysis of the
RCMAS lie scale in a clinic sample of anxious children. Journal of Anxiety Disorders 15.
Reynolds, C.R. & Paget, K.D. (1983). National normative and reliability data for the
revised Children’s Manifest Anxiety Scale. School Psychology Review 12: 324-336.
Reynolds, C.R. & Richmond, B.O. (1978). What I think and feel: A revised measure of
children’s manifest anxiety. Journal of Abnormal Child Psychology 6: 271-280.
Roth, P., Bobko, P., McFarland, L. & Buster, M. (2008). Work sample tests in personnel
selection: A meta-analysis of Black–White differences in overall and exercise scores.
Personnel Psychology 61: 637-661.
Roth, P.L., Bevier, C.A., Bobko, P., Switzer, F.S. & Tyler, P. (2001). Ethnic group
differences in cognitive ability in employment and educational settings: A meta-analysis.
Personnel Psychology 54: 297-330.
Roth, P.L., Huffcutt, A.I. & Bobko, P. (2003). Ethnic group differences in measures of job
performance: A new meta-analysis. Journal of Applied Psychology 88: 694-706.
Schmidt, F.L., Oh, I. & Shaffer, J. (2016). The validity and utility of selection methods in
personnel psychology: Practical and theoretical implications of 100 Years of research
findings. Retrieved July 3, 2018.
Sundre, D.L. & Thelk, A.D. (2007). The Student Opinion Scale (SOS). A measure of
examinee motivation. Test manual. Harrisonburg: Center for Assessment and Research
Studies, James Madison University.
Viechtbauer, W. (2015). metafor: Meta-analysis package for R (1.9-8) [Computer software].
Warrington, E.K., McKenna, P. & Orpwood, L. (1998). Single word comprehension: A
concrete and abstract word synonym test. Neuropsychological Rehabilitation 8: 143-
ResearchGate has not been able to resolve any citations for this publication.
Full-text available
Black and Hispanic children in the United States have lower mean cognitive test scores than White children. The reasons for this are contested. The test score gap may be caused by socio-cultural factors, but the high heritability of g suggests that genetic variance might play a role. Differences between self-identified race or ethnicity (SIRE) groups could be the product of ancestral genetic differences. This genetic hypothesis predicts that genetic ancestry will predict g within these admixed groups. To investigate this hypothesis, we performed admixture-regression analyses with data from the Adolescent Brain Cognitive Development Cohort. Consistent with predictions from the genetic hypothesis, African and Amerindian ancestry were both found to be negatively associated with g. The association was robust to controls for multiple cultural, socioeconomic, and phenotypic factors. In the models with all controls the effects were as follows: (a) Blacks, African ancestry: b =-0.89, N = 1690; (b) Hispanics, African ancestry: b =-0.58, Amerindian ancestry: b =-0.86, N = 2021), and (c) a largely African-European mixed Other group, African ancestry: b =-1.08, N = 748). These coefficients indicate how many standard deviations g is predicted to change when an individual's African or Amerindian ancestry proportion changes from 0% to 100%. Genetic ancestry statistically explained the self-identified race and ethnicity (SIRE) differences found in the full sample. Lastly, within all samples, the relation between genetic ancestry and g was partially accounted for by cognitive ability and educational polygenic scores (eduPGS). These eduPGS were found to be significantly predictive of g within all SIRE groups, even when controlling for ancestry. The results are supportive of the genetic model.
Full-text available
A recent study by Dutton et al. (J Relig Health 59:1567–1579., 2020) found that the religiousness-IQ nexus is not on g when comparing different groups with various degrees of religiosity and the non-religious. It suggested, accordingly, that the nexus related to the relationship between specialized analytic abilities on the IQ test and autism traits, with the latter predicting atheism. The study was limited by the fact that it was on group-level data, it used only one measure of religiosity that measure may have been confounded by the social element to church membership and it involved relatively few items via which a Jensen effect could be calculated. Here, we test whether the religiousness-IQ nexus is on g with individual-level data using archival data from the Vietnam Experience Study, in which 4462 US veterans were subjected to detailed psychological tests. We used multiple measures of religiosity—which we factor-analysed to a religion-factor—and a large number of items. We found, contrary to the findings of Dutton et al. (2020), that the IQ differences with regard to whether or not subjects believed in God are indeed a Jensen effect. We also uncovered a number of anomalies, which we explore.
Full-text available
We examined data from the popular free online 45-item “Vocabulary IQ Test” from We used data from native English speakers (n = 9,278). Item response theory analysis (IRT) showed that most items had substantial g-loadings (mean = .59, sd = .22), but that some were problematic (4 items being lower than .25). Nevertheless, we find that using the site’s scoring rules (that include penalty for incorrect answers) give results that correlate very strongly (r = .92) with IRT-derived scores. This is also true when using nominal IRT. The empirical reliability was estimated to be about .90. Median test completion time was 9 minutes (median absolute deviation = 3.5) and was mostly unrelated to the score obtained (r = -.02). The test scores correlated well with self-reported criterion variables educational attainment (r = .44) and age (r = .40). To examine the test for measurement bias, we employed both Jensen’s method and differential item functioning (DIF) testing. With Jensen’s method, we see strong associations with education (r = .89) and age (r = .88), and less so for sex (r = .32). With differential item functioning, we only tested the sex difference for bias. We find that some items display moderate biases in favor of one sex (13 items had pbonferroni < .05 evidence of bias). However, the item pool contains roughly even numbers of male-favored and female-favored items, so the test level bias is negligible (|d| < 0.05). Overall, the test seems mostly well-constructed, and recommended for use with native English speakers.
Full-text available
Research from the 20th century showed that ethnic minorities under-performed White British on measures of cognitive ability in the United Kingdom. However, academic qualification results from the first two decades of the 21st century suggest minimal to reverse ethnic differences. To better understand the pattern of contemporary cognitive differences among adolescents in the 21st century, we analyzed academic achievements at age 16 in the GCSE and cognitive ability in four cognitive tests: the National Reference Test, NRT; the Programme for International Student Assessment, PISA; Cognitive Ability Test 3 (CAT3); and Center for Evaluation and Monitoring (CEM) 11-plus. Results from the PISA, CAT3 and CEM 11-plus tests correlate strongly across ethnic groups. These results show that Bangladeshi, Pakistani and Black students score approximately one half of a standard deviation below Indian and White students, while Chinese students perform significantly above the latter groups. In contrast, but consistent with academic qualifications, results based on the NRT suggest smaller ethnic gaps.
Full-text available
There are few empirically derived theories explaining group differences in cognitive ability. Spearman's hypothesis is one such theory which holds that group differences are a function of a given test's relationship to general intelligence, g. Research into this hypothesis has generally been limited to the application of a single method lacking sensitivity, specificity, and the ability to assess test bias: Jensen’s method of correlated vectors. In order to overcome the resulting empirical gap, we applied three different psychometrically sound methods to examine the hypothesis among American blacks and whites in the Vietnam Experience Study (VES) and the National Longitudinal Survey of Youth 1979 (NLSY ‘79). We first used multi-group confirmatory factor analysis to assess bias and evaluate the hypothesis directly; we found that strict factorial invariance was tenable in both samples and either the strong or the weak form of the hypothesis was supported, with 87 and 78% of the group differences attributable to g in the VES and NLSY ’79 respectively. Using item response theory metrics to avoid pass rate confounding, a strong relationship between g loadings and group differences (r = 0.80 and 0.79) was observed. Finally, assessing differential item functioning with item level data revealed that a handful of items functioned differently, but their removal did not affect gap sizes much beyond what would be expected from shortening tests, and assessing the effect this had on scores using an anchoring method, the differential functioning was found to be negligible in size. In aggregate, results supported Spearman's hypothesis but not test bias as an explanation for the cognitive differences between the groups we studied.
Full-text available
Widely used in social science research, samples of participants obtained via Amazon’s Mechanical Turk (mTurk) tend to be representative across many sociodemographic variables. However, to date, no research has investigated and reported the global cognitive ability level (i.e., intelligence) of samples obtained via mTurk. The present study contributes to the literature by investigating a previously well-validated, public domain measure of cognitive ability in a sample of American adults recruited via mTurk. As part of a larger cross-sectional, survey-based study, four hundred thirty-four (434) Americans (M age = 37.86; 35.7% men) completed a demographic questionnaire and the 16-item International Cognitive Ability Resource, Sample Test (ICAR-16). Results revealed a normal distribution of ICAR-16 scores across the current sample. Additionally, total scores were positively correlated with participants’ level of education, income, and self-estimated intelligence, but did not significantly correlate with participant age. No gender differences were identified on ICAR-16 total scores. Finally, ICAR-16 scores did not significantly differ from normative data derived from its validation study. These results suggested that American mTurk samples may be representative of the broader population in terms of global cognitive ability, and that the ICAR-16 is likely a reasonable, psychometrically sound, and inexpensive measure of global cognitive ability appropriate for use in mTurk samples.
Full-text available
Using data from the Philadelphia Neurodevelopmental Cohort, we examined whether European ancestry predicted cognitive ability over and above both parental socioeconomic status (SES) and measures of eye, hair, and skin color. First, using multi-group confirmatory factor analysis, we verified that strict factorial invariance held between self-identified African- and European-Americans. The differences between these groups, which were equivalent to 14.72 IQ points, were primarily (75.59%) due to differences in general cognitive ability (g), consistent with Spearman’s hypothesis. We found a relationship between European admixture and g. This relationship existed in samples of (a) self-identified monoracial African-Americans (B = 0.78, n = 2,179), (b) monoracial African and biracial African-European-Americans, with controls added for self-identified biracial status (B = 0.85, n = 2,407), and (c) combined European, African-European, and African-American participants, with controls for self-identified race/ethnicity (B = 0.75, N = 7,273). Controlling for parental SES modestly attenuated these relationships, whereas controlling for measures of skin, hair, and eye color did not. Next, we validated four sets of polygenic scores for educational attainment (eduPGS). MTAG, the multi-trait analysis of genome-wide association study (GWAS) eduPGS (based on 8,442 overlapping variants), predicted g in both the monoracial African-American (r = 0.111, n = 2,179, p < 0.001) and the European-American (r = 0.227, n = 4,914, p < 0.001) subsamples. We also found large race differences in the means of eduPGS (d = 1.89). Using the ancestry-adjusted association between MTAG eduPGS and g from the monoracial African-American sample as an estimate of the transracially unbiased validity of eduPGS (B = 0.124), the results suggest that as much as 20%–25% of the race difference in g can be natively explained by known cognitive ability-related variants. Moreover, path analysis showed that the eduPGS substantially mediated associations between cognitive ability and European ancestry in the African-American sample. Subtest differences, together with the effects of both ancestry and eduPGS, corresponded closely to subtest g-loadings, confirming a Jensen effect acting on ancestry-related differences. Finally, we confirmed measurement invariance along the full range of European ancestry in the combined sample using local structural equation modeling. Results converge on genetics as a partial explanation for group mean differences in intelligence.
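The 20%–25% figure can be checked with simple arithmetic: the eduPGS gap times the PGS validity gives the implied g gap, which is then compared with the total 14.72-point difference. All numbers come from the abstract itself; the only added assumption is the conventional IQ standard deviation of 15.

```python
# Back-of-envelope check of the 20-25% figure, using only numbers quoted in
# the abstract plus one assumption: the conventional IQ SD of 15.
d_edupgs = 1.89          # Black-White eduPGS gap (SD units)
beta_g   = 0.124         # ancestry-adjusted eduPGS validity for g (SD units)
gap_iq   = 14.72         # total group difference in IQ points

gap_d       = gap_iq / 15.0          # total gap in d units (~0.98)
explained_d = d_edupgs * beta_g      # g gap implied by the eduPGS gap
share       = explained_d / gap_d    # fraction of the gap accounted for

print(f"implied gap: {explained_d:.3f} d; share explained: {share:.1%}")
```

The result, roughly 24%, falls inside the 20%–25% range the abstract reports.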
The number of online experiments conducted with subjects recruited via online platforms has grown considerably in the recent past. While one commercial crowdworking platform, Amazon's Mechanical Turk, essentially established this field and has since dominated it, new alternatives offer services explicitly targeted at researchers. In this article, we present Prolific and lay out its suitability for recruiting subjects for social and economic science experiments. After briefly discussing key advantages and challenges of online experiments relative to lab experiments, we trace the platform's historical development, present its features, and contrast them with requirements for different types of social and economic experiments.
Test motivation has been suggested to strongly influence low-stakes intelligence scores, with, for instance, a recent meta-analysis of monetary incentive effects suggesting an average 9.6 IQ point impact (d = 0.64). Effects of such magnitude would have important implications for the predictive validity of intelligence tests. We report six studies (N = 4,208) investigating the association and potential causal link between effort and cognitive performance. In three tests of the association of motivation with cognitive test scores, we find a positive but modest linear association of scores with reported effort (N = 3,007: r ≈ 0.28). In three randomized control tests of the effects of monetary incentives on test scores (total N = 1,201), incentive effects were statistically non-significant in each study, showed no dose dependency, and jointly indicated an effect one quarter the size previously estimated (d = 0.166, ≈ 2.5 IQ points). These results suggest that, in neurotypical adults, individual differences in test motivation have, on average, a negligible influence on intelligence test performance. The association between test motivation and test performance likely partly reflects differences in ability, and subjective effort partly reflects outcome expectations.
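Both IQ-point figures quoted above follow from multiplying the effect size d by the conventional IQ standard deviation of 15; a one-line helper makes the arithmetic explicit:

```python
# Convert a Cohen's d to IQ points using the conventional IQ SD of 15.
def d_to_iq(d: float, sd: float = 15.0) -> float:
    return d * sd

# 0.64 d corresponds to about 9.6 IQ points (the meta-analytic estimate);
# 0.166 d corresponds to about 2.5 IQ points (the pooled RCT estimate).
print(d_to_iq(0.64))
print(d_to_iq(0.166))
```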
Self-report measures are characterized as being susceptible to threats associated with deliberate dissimulation or response distortion (i.e., social desirability responding) and careless responding. Careless responding typically arises in low-stakes settings (e.g., participating in a study for course credit) where some respondents are not motivated to respond to the items in a conscientious manner. In contrast, in high-stakes assessments (e.g., prehire assessments), because of the outcomes associated with their responses, respondents are motivated to present themselves in as favorable a light as possible and thus may respond dishonestly to accomplish this objective. In this article, we draw a distinction between the lazy respondent, which we associate with careless responding, and the dishonest respondent, which we associate with response distortion. We then seek to answer the following questions for both careless responding and response distortion: (a) What is it? (b) Why is it a problem or concern? (c) Why do people engage in it? (d) How pervasive is it? (e) Can it be prevented or mitigated, and how? (f) How is it detected? (g) What does one do when one detects it? We conclude with a discussion of suggested future research directions and some practical guidelines for practitioners and researchers. The expected final online publication date for the Annual Review of Organizational Psychology and Organizational Behavior, Volume 8, is January 21, 2021.
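One widely used screen for the "lazy respondent" is the longstring index: the length of the longest run of identical consecutive answers in a response vector. The sketch below is purely illustrative; the cutoff of 8 and the example vectors are arbitrary assumptions, not values from the article.

```python
from itertools import groupby

def longstring(responses):
    """Length of the longest run of identical consecutive responses."""
    return max(sum(1 for _ in run) for _, run in groupby(responses))

careful  = [3, 4, 2, 5, 1, 4, 3, 2, 5, 4]   # varied answers
careless = [4, 4, 4, 4, 4, 4, 4, 4, 4, 1]   # straight-lining

CUTOFF = 8  # arbitrary illustrative threshold
for resp in (careful, careless):
    flagged = longstring(resp) >= CUTOFF
    print(longstring(resp), "flagged" if flagged else "ok")
```

In practice the cutoff is chosen relative to the scale length and response format, and longstring is typically combined with other indices (e.g., response-time screens) rather than used alone.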