Are Headstart gains on the g factor? A meta-analysis
Jan te Nijenhuis, Birthe Jongeneel-Grimen, Emil O.W. Kirkegaard
University of Amsterdam, Work and Organizational Psychology, The Netherlands
University of Amsterdam, Amsterdam Medical Center, The Netherlands
University of Århus, Department of Linguistics, Denmark
Article info
Article history:
Received 12 March 2014
Received in revised form 21 May 2014
Accepted 1 July 2014
Available online xxxx
Keywords:
Jensen effect
Compensatory education
Abstract
Headstart studies of compensatory education tend to show impressive gains on IQ scores for
children from low-quality environments. However, are these gains on the g factor of
intelligence? We report a meta-analysis of the correlation between Headstart gains
on the subtests of IQ batteries and the g loadings of these same subtests (K = 8 studies,
total N = 602). A meta-analytic sample-size-weighted correlation of −.51 was found, which
became −.80 after corrections for measurement error. We conclude that the pattern in
Headstart gains on subtests of an IQ battery is highly similar to the pattern in test–retest gains
and is hollow with respect to g. So, Headstart leads to gains in IQ scores, but not to gains in g.
We discuss this finding in relation to the Flynn effect, training effects, and heritability.
© 2014 Elsevier Inc. All rights reserved.
1. Introduction
Changes in IQ scores are one of the big puzzles of intelligence
research. Minor changes are often due to measurement error,
but this is unlikely to be the cause of more substantial
fluctuations. For instance, raw scores on standard IQ tests have
been going up for decades (the Flynn effect; Lynn, 2013), and as
the effect is large and unidirectional, simple measurement error
does not offer sufficient explanatory power. Much research in
the past three decades has been centered on the Flynn effect, e.g.
the recent special issue in Intelligence (Thompson, 2013). The
nature of the effect is hotly debated. Some authors, like Lynn
(2013), believe it to be a real increase in intelligence, citing,
among other things, the similar rise in height as evidence. Many
non-specialists similarly treat the Flynn effect as a real increase
in intelligence (e.g. Somin, 2013). Hypothesized causes for a real
increase include: better nutrition (Flynn, 1987; Lynn, 2006),
heterosis (i.e. outbreeding, Mingroni, 2007), improvement in
hygiene (Eppig, Fincher, & Thornhill, 2010), and reduced lead
poisoning (Nevin, 2000).
An alternate explanation posits that the effect has little or
nothing to do with general intelligence, or g, itself. Jensen
(1998, p. 143) invented the method of correlated vectors to
check whether a phenomenon has to do with the underlying
latent variable of interest, i.e. g, or whether it has to do with
the non-g variance. Other researchers have since called
phenomena that show a positive relation to the g loading of
subtests "Jensen effects" (e.g. Colom, Juan-Espinosa, & García,
2001; Rushton, 1998). Wholly or partly genetically influenced
variables, such as subtest heritabilities (Rushton &
Jensen, 2010), dysgenic fertility (Woodley & Meisenberg,
2013), fluctuating asymmetry (Prokosch, Yeo, & Miller,
2005), brain size (Rushton & Ankney, 2009), inbreeding
depression (Jensen, 1998), and reaction times (Jensen, 1998)
have been shown to be Jensen effects.
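The method of correlated vectors described above can be sketched in a few lines. This is only an illustration: the loadings and effect sizes are hypothetical numbers, not data from any study, and `pearson_r` is an illustrative helper name.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length vectors with at least 2 entries")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical 7-subtest battery (illustrative values only).
g_loadings = [0.75, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45]
effect_sizes = [0.10, 0.15, 0.25, 0.20, 0.35, 0.40, 0.45]

# One g loading and one effect size per subtest; correlate the two vectors.
r = pearson_r(g_loadings, effect_sizes)
label = "Jensen effect" if r > 0 else "negative Jensen effect"
print(f"r(g x d) = {r:.2f} -> {label}")
```

A positive correlation would be read as a Jensen effect and a negative one as a negative Jensen effect; with only a handful of subtests per battery, a single such correlation is noisy, which is why the studies cited here aggregate them meta-analytically.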
On the other hand, environmental variables seem to be
negative Jensen effects. te Nijenhuis and van der Flier (2013)
reported a meta-analysis of the Flynn effect which yielded a
negative Jensen effect of −.38 (after corrections). Moreover,
in a newer study, Woodley, te Nijenhuis, Must, and Must
(2014) reexamined one of the datasets in this meta-analysis
and found that if one corrects for increased guessing at the
harder items (the Brand effect) then the negative Jensen
effect came even closer to −1, at −.82, indicating that the
gains may be more hollow with respect to g than previously
thought (see also Flynn, te Nijenhuis, & Metzen, 2014).
Intelligence 46 (2014) 209–215
Corresponding author (J. te Nijenhuis) at: Gouden Leeuw 746, 1103 KR Amsterdam, The Netherlands.
In a related study, te Nijenhuis, van Vianen, and van der
Flier (2007) reported a meta-analysis of 64 studies (total
N = 26,990) on score gains from test training yielding a
negative Jensen effect of −1.0 (after corrections). Score gains
from training are theoretically interesting because they
present a clear case that one can increase the proxy
(or manifest variable), IQ, without increasing the underlying
latent variable of interest, g. Whatever causes the Flynn effect
gains, it seems likely this effect is similarly mostly hollow
with respect to g; it represents no large gain in g. Accordingly,
we have not seen the substantial increase in the number of
geniuses in Western countries that we could expect to result
from a mean increase in g of a standard deviation or more
(Jensen, 1987, pp. 445–446). As Herrnstein and Murray
(1994, p. 364) point out, a mere 3 IQ point increase in g
would make a large difference on the tails of the distribution.
For instance, assuming a normal distribution with a mean of 100
and an SD of 15, it would increase the number of people above
IQ = 130, often taken as the threshold of giftedness, by roughly 58%
(from about 2.3% to about 3.6%). An increase of one or more SD in g could
not possibly be overlooked.
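The tail proportions in this example can be checked with a short calculation. The sketch below assumes IQ is normally distributed with a mean of 100 and an SD of 15; `share_above` is an illustrative helper name.

```python
from statistics import NormalDist

def share_above(threshold, mean, sd=15.0):
    """Proportion of a normal population scoring above `threshold`."""
    return 1.0 - NormalDist(mu=mean, sigma=sd).cdf(threshold)

before = share_above(130, mean=100)  # baseline population
after = share_above(130, mean=103)   # population mean shifted up by 3 points
relative_increase = after / before - 1.0

print(f"above 130 before the shift: {before:.1%}")
print(f"above 130 after the shift:  {after:.1%}")
print(f"relative increase:          {relative_increase:.0%}")
```

The same function can be called with a larger shift (for example `mean=115`) to see how quickly the upper tail grows when the whole distribution moves by a standard deviation.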
1.1. Compensatory education and IQ gains: g-loaded?
The largest program for compensatory education is
Project Headstart, which began as a program to improve
intellectual functioning and to increase academic achievement
(Caruso, Taylor, & Detterman, 1982) and has been
running since 1965. It is a public preschool program that was
designed for disadvantaged children to close the achievement
gaps between disadvantaged children and their more
advantaged peers (Soriano, Duenas, & LeBlanc, 2006). The
program is massive, involving 1 million children, and cost
almost 8 billion dollars in 2012 (U.S. Department of Health &
Human Services, 2012).
Several meta-analyses of Headstart studies showed that
children in the program outscored children in control groups
(Caruso et al., 1982; Ramey, Bryant, & Suarez, 1985; Nelson,
Westhues, & MacLeod, 2003; see also Protzko, Aronson, &
Blair, 2013). However, no one, to our knowledge, has yet
carried out an analysis to see if the gains are a Jensen effect.
In 1969, when Jensen published his famous article "How
much can we boost IQ and scholastic achievement?" (Jensen,
1969), he drew the conclusion that compensatory education
had been tried and had failed. Although initial IQ gains were
sometimes large, they diminished with time and so could not
be expected to close the gaps between racial and economic
groups. Spitz (1986) reviewed most of the literature on the
attempts to increase intelligence and his conclusions were
also mostly negative. He mentions (p. 103) that in the Perry
Preschool Program, the teachers seemed to focus on teaching
material that was similar to the content of subtests of the IQ
tests, so-called "teaching to the test". It is not unlikely that
highly comparable practices were present in many other
programs, including Headstart.
In the widely accepted model in Fig. 1, U is the variance
specific to each subtest, V. The teaching-to-the-test hypothesis
can be clearly stated in terms of the model. According to the
hypothesis, when one trains test takers on the exact subtests or
subtests very similar to those used in a test, the resultant effect
is on the U factors in the model (and maybe somewhat on the
group factors F), but there is no increase in the latent variable
g. If one assumes that test takers are taught comparably on all
the subtests, then this leads directly to the prediction that any
resultant training effect should have a strong negative
correlation with the g loading of the subtests. This is because,
for each V, the greater the influence of U, the smaller the
influence of g (through the group factors). If ability in U
increases, the gain will be larger on the Vs where g has a smaller
influence, that is, those that are less g-loaded (see also Jensen, 1998,
pp. 336–337).
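The prediction can be made concrete with a toy calculation. The sketch below assumes a simplified model without group factors, in which each standardized subtest is V = g·G + sqrt(1 − g²)·U, and assumes that training raises every specific factor U by the same amount. The loadings, the size of the training effect, and the function names are all hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def expected_gains(g_loadings, delta_u=0.5):
    """Subtest gain (in SD units) when only the specific factor U rises by delta_u SD.

    Under V = g*G + sqrt(1 - g^2)*U, a shift in U passes through to the subtest
    in proportion to sqrt(1 - g^2); g itself is left unchanged.
    """
    return [delta_u * sqrt(1.0 - g * g) for g in g_loadings]

g_loadings = [0.80, 0.72, 0.65, 0.58, 0.50, 0.42, 0.35]  # hypothetical battery
gains = expected_gains(g_loadings)
r = pearson_r(g_loadings, gains)
print(f"correlation between g loadings and training gains: {r:.2f}")
```

Under these assumptions the gains are largest on the least g-loaded subtests, so the correlation is strongly negative even though g has not changed at all. Unequal training across subtests, which the sketch ignores, would pull the correlation away from −1.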
This leads us to the present study. The goal was to determine
whether the gains from Headstart are similar to training effects,
with a strong negative Jensen effect, or whether they are
genuine increases in g, in which case they should show a strong
Jensen effect.
2. Method
Psychometric meta-analysis (Hunter & Schmidt, 2004)
aims to estimate what the results of studies would have been
if all studies had been conducted without methodological
limitations or flaws. The results of perfectly conducted
studies would allow a less obstructed view of the underlying
construct-level relationships (Schmidt & Hunter, 1999). The
goal of the present psychometric meta-analysis is to provide
reliable estimates of the true correlation between Headstart
gains and the magnitude of g loadings. As the techniques we
use are relatively unknown to the majority of readers, we
choose to give a detailed description of the techniques.
However, highly similar descriptions have also been used in
other recent publications.
2.1. Searching and screening studies
To identify studies for inclusion in the meta-analysis, both
electronic and manual searches for studies that contained
cognitive ability data of Headstart children or adults who
participated in a Headstart program as a child were conducted
in 2007. Four methods were used to obtain Headstart gains
from both published and unpublished studies for the present
meta-analysis. First, an electronic search of published research
was conducted, using PsycINFO, ERIC, PiCarta, Academic
Search Premier, Web of Science, and PubMed. The following
keyword combinations were used to conduct searches:
Headstart, Head Start, preschool children, and kindergarten
children in combination with the keywords IQ, intelligence,
intellectual development, g, GMA, general mental ability,
cognitive development, cognitive ability, and general cognitive
ability. Second, we browsed the tables of contents of several
major research journals of education, development, and of
intelligence, such as American Educational Research Journal
1968–2007, Journal of Educational Research 1965–2007, Intelligence
1977–2007, Psychological Science 1990–2007, Child
Development 1930–2007, and Developmental Psychology
1969–2007. Third, several well-known researchers who have
conducted cognitive ability research of Headstart, preschool,
and kindergarten children or adults who participated in a
Headstart program, preschool, or kindergarten as a child were
contacted in order to obtain any additional articles or
supplementary information. Finally, we checked the reference
list of all currently included empirical studies to identify any
potential articles that may have been missed by earlier search
2.2. Inclusion rules
Studies that reported IQ scores of Headstart children,
preschool, and kindergarten children were included in the
meta-analysis. We used the term "Headstart" in a generic
sense, so it included preschool and kindergarten children as
well. For a study to be included in the meta-analysis two
criteria had to be met: First, to get a reliable estimate of the
true correlation between Headstart gains and the g loadings,
the cognitive batteries had to have a minimum of seven
subtests; second, well-validated tests had to be used. The
general inclusion rules were applied and yielded six papers
which resulted in eight correlations between g and d
(Headstart gains).
2.3. Computation of Headstart gains
One of the goals of the present meta-analysis is to obtain a
reliable estimate of the true correlation between Headstart
gains (d) and g. To be able to compute d (Headstart gains) we
needed to compare the results of the intervention groups
against the results of comparison groups. A limitation of all
the studies included in this meta-analysis is that none of
them included a comparison group. In general, FACES (Head
Start Family and Child Experiences Survey study) students
entered the program with measures of vocabulary, letter
recognition, and math that were about one-half to a full
standard deviation below the national average (see Zill,
Resnick, Kim, O'Donnell, & Sorongon, 2003). We therefore
decided to compare the mean of the scaled scores of
Headstart children with an artificially generated comparison
group with total IQ scores one SD below the mean of the
scaled scores of the standardization groups of the particular
test in question, because such a simulated comparison group
is cognitively more similar to the Headstart children than the
national standardization groups. Headstart, preschool, and
kindergarten gains (d) were computed by subtracting the
mean of the comparison group from the mean of the
intervention group. The result was then divided by the
(mean) SD of the standardization group(s) of the particular
test in question.
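The computation of d described above can be sketched as follows. The subtest means are hypothetical, and applying the one-SD offset to each subtest's scaled-score norms (mean 10, SD 3 on Wechsler-style subtests) is an assumption about how the simulated comparison group translates to the subtest level; `headstart_gain` is an illustrative name.

```python
def headstart_gain(group_mean, norm_mean, norm_sd, offset_sd=1.0):
    """Standardized gain d against a simulated comparison group.

    The comparison mean is placed `offset_sd` standardization SDs below the
    standardization mean; the difference is then expressed in SD units.
    """
    comparison_mean = norm_mean - offset_sd * norm_sd
    return (group_mean - comparison_mean) / norm_sd

# Hypothetical subtest scaled-score means for a Headstart sample.
subtest_means = {"Vocabulary": 8.1, "Block Design": 9.4, "Mazes": 9.9}

d_vector = {
    name: headstart_gain(mean, norm_mean=10.0, norm_sd=3.0)
    for name, mean in subtest_means.items()
}
for name, d in d_vector.items():
    print(f"{name}: d = {d:.2f}")
```

Because the same constant is subtracted from every subtest, the choice of offset shifts all d values equally and leaves their correlation with the g loadings unchanged whenever all subtests share the same norm SD.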
2.4. Computation of g loadings
In general, g loadings were computed by submitting a
correlation matrix to a principal-axis factor analysis and
using the loadings of the subtests on the first unrotated
factor. In some cases g loadings were taken from studies
where other procedures were followed; these procedures
have been shown empirically to lead to highly comparable
results (Jensen & Weng, 1994). Finally, Pearson correlations
between Headstart gains and the g loadings were computed.
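A rough sketch of how g loadings can be obtained from a correlation matrix is given below. It is not the procedure of any particular statistics package: it extracts a single factor by iterated principal-axis factoring, uses the largest off-diagonal correlation in each row as the starting communality (squared multiple correlations are a common alternative), and finds the dominant eigenvector by power iteration. The function names and the test matrix are illustrative assumptions.

```python
from math import sqrt

def dominant_eigenpair(matrix, iterations=100):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    n = len(matrix)
    vec = [1.0 / sqrt(n)] * n
    for _ in range(iterations):
        prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(p * p for p in prod))
        vec = [p / norm for p in prod]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    value = sum(v * p for v, p in zip(vec, prod))
    return value, vec

def first_factor_loadings(corr, rounds=100):
    """Iterated one-factor principal-axis factoring of a correlation matrix."""
    n = len(corr)
    # Starting communalities: largest absolute off-diagonal correlation per row.
    communalities = [max(abs(corr[i][j]) for j in range(n) if j != i) for i in range(n)]
    loadings = [0.0] * n
    for _ in range(rounds):
        # Replace the unit diagonal with the current communality estimates.
        reduced = [[communalities[i] if i == j else corr[i][j] for j in range(n)]
                   for i in range(n)]
        value, vec = dominant_eigenpair(reduced)
        loadings = [sqrt(value) * v for v in vec]
        communalities = [l * l for l in loadings]
    return loadings

# Hypothetical check: build a matrix from known one-factor loadings and recover them.
true_loadings = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.6]
corr = [[1.0 if i == j else true_loadings[i] * true_loadings[j]
         for j in range(7)] for i in range(7)]
estimated = first_factor_loadings(corr)
print([round(l, 3) for l in estimated])
```

With real batteries the matrix is not exactly one-factor, so the first unrotated factor of a multi-factor solution can differ slightly from this single-factor extraction.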
2.5. Corrections for artifacts
Psychometric meta-analytical techniques (Hunter &
Schmidt, 2004) were applied using the software package
developed by Schmidt and Le (2004). Psychometric
meta-analysis is based on the principle that there are artifacts
in every dataset and that most of these artifacts can be
corrected. In the present meta-analysis we corrected for five
artifacts that alter the value of outcome measures listed by
Hunter and Schmidt (2004). These are: (1) sampling error,
(2) reliability of the vector of g loadings, (3) reliability of the
vector of Headstart gains (d), (4) restriction of range of g
loadings, and (5) deviation from perfect construct validity.
Fig. 1. Hierarchical model of human cognitive abilities, using a simplified form of the model in Jensen and Weng (1994). Circles are latent factors and squares are manifest variables.
2.5.1. Correction for sampling error
In many cases sampling error explains the majority of the
variation between studies, so the first step in a psychometric
meta-analysis is to correct the collection of effect sizes for
differences in sample size between the studies.
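As a minimal illustration of this first step, the sample-size-weighted mean correlation can be computed directly from the r and N values listed in Table 1. The correlations are entered with negative signs because the Results section states that all eight are negative; `weighted_mean_r` is an illustrative name, and the sketch does not reproduce the full variance decomposition of the Hunter and Schmidt procedure.

```python
# (r, N) pairs from Table 1, entered as negative correlations.
studies = [
    (-0.391, 42), (-0.298, 48), (-0.284, 49), (-0.386, 169),
    (-0.770, 94), (-0.356, 40), (-0.757, 38), (-0.665, 122),
]

def weighted_mean_r(pairs):
    """Mean correlation with each study weighted by its sample size."""
    total = sum(n for _, n in pairs)
    return sum(r * n for r, n in pairs) / total

total_n = sum(n for _, n in studies)
mean_r = weighted_mean_r(studies)
print(f"K = {len(studies)}, total N = {total_n}, weighted mean r = {mean_r:.2f}")
```

The result matches the reported total N of 602 and the mean observed correlation of −.51.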
2.5.2. Correction for reliability of the vector of g loadings
The values of r (g × Headstart gains) are attenuated by
the reliability of the vector of g loadings for a given battery.
When two samples have a comparable N, the average
correlation between vectors is an estimate of the reliability
of each vector. Several samples were compared that differed
little on background variables. For the comparisons using
children we chose samples that were highly comparable with
regard to age. Samples of children aged 3 to 5 years
were compared against other samples of children who did
not differ by more than 0.5 years in age. Samples of children
aged 6 to 17 years were compared against other
samples of children who did not differ by more than 1.5 years
in age. For the comparisons of adults we compared samples
aged 18 to 95 years.
We collected correlation matrices from test manuals,
books, articles, and technical reports. The large majority came
from North America, with a large number from European
countries, and also a substantial number from Korea, China,
Hong Kong, and Australia. This resulted in about 700 data
points, which led to 385 comparisons of g loadings of
comparable groups, each of which provided an indication of the
reliability for that group. To give an illustration of the
procedure, van Haasen et al. (1986) report correlation
matrices of the Dutch and the Flemish WISC-R for 22 samples
aged 6–16 years. Because these samples fall within the
6-to-17-year age band, we only compared samples that did not
differ by more than 1.5 years in age. The Ns in these samples were
comparable. This resulted in an average correlation of .78
(combined N = 3018; average N = 137).
A scatter plot of reliabilities against Ns should show that
the larger N becomes, the higher the value of the reliability
coefficients, with an asymptotic function between r (g × g)
and N expected. We checked to see which curve gave the best
fit to the expected asymptotic function. The logarithmic
regression line resembled quite well the expected asymptotic
distribution for reliabilities.
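A logarithmic regression of this kind can be sketched as an ordinary least-squares fit of reliability on ln(N). The data points below are hypothetical, the function names are illustrative, and capping predictions at 1.0 is a choice made for this sketch rather than something stated in the text.

```python
from math import log

def fit_log_curve(ns, reliabilities):
    """Least-squares fit of reliability = a + b * ln(N)."""
    xs = [log(n) for n in ns]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(reliabilities) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, reliabilities))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predicted_reliability(n, a, b):
    """Fitted value, capped at 1.0 because a reliability cannot exceed 1."""
    return min(1.0, a + b * log(n))

# Hypothetical (average N, vector reliability) points for illustration only.
sample_sizes = [40, 80, 150, 300, 600]
reliabilities = [0.60, 0.70, 0.77, 0.84, 0.90]

a, b = fit_log_curve(sample_sizes, reliabilities)
print(f"reliability ~ {a:.3f} + {b:.3f} * ln(N)")
print(f"predicted reliability at N = 100: {predicted_reliability(100, a, b):.2f}")
```

The fitted curve can then supply a reliability estimate for a study of any given sample size, which is how an N-dependent reliability would enter the later corrections.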
2.5.3. Correction for reliability of the vector of Headstart gains (d)
The values of r (g × Headstart gains) are attenuated by
the reliability of the vector of Headstart gains for a given
battery. When two samples have a comparable N, the average
correlation between vectors is an estimate of the reliability of
each vector. The reliability of the vector of Headstart gains
was estimated using the present datasets and by comparing
the samples that took the same test and that were
comparable with regard to age and sample size. As an
illustration of the procedure, consider the vectors of
Headstart gains from datasets on the WPPSI. McNamara,
Porterfield, and Miller (1969) tested children (N = 42) with
an average age of 5.8 years (age range 4.8 to 6.6 years);
Yater, Barclay, and Leskosky (1971) tested children (N = 48)
with an average age of 5.3 years (age range 4.8 to 6.0 years);
and Henderson and Rankin (1973) tested children (N = 49)
with an estimated mean age of 5.5 years (age range 5.0 to
6.0 years). The pairwise correlations between the d vectors of the
three studies are .90 for McNamara et al. and Yater et al.
(total N = 90; average N = 45), .64 for Yater et al. and
Henderson and Rankin (total N = 97; average N = 49), and .72
for McNamara et al. and Henderson and Rankin (total
N = 91; average N = 46). Lowe, Anderson, Williams, and
Currie (1987) also tested children (N = 169) on the WPPSI.
They had an average age of 5.9 years (age range 5.6 to
6.2 years). We decided not to compare vectors of Headstart
gains from the dataset in Lowe et al. (1987) because the
differences in sample size were too large.
An asymptotic function between r (d × d) and N is
expected. We checked to see which curve gave the best fit
to the expected asymptotic function. Fig. 2 presents the
scatter plot of the reliability of the vector of Headstart gains
and sample size, and the curve that fitted optimally.
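The two reliability corrections follow the classical disattenuation formula, in which the observed correlation is divided by the square root of the product of the two reliabilities. The sketch below is a simplification: it averages the three pairwise WPPSI correlations reported above instead of reading a reliability off the fitted curve, borrows the .78 value from the WISC-R illustration purely as an example, and uses a hypothetical observed correlation. `disattenuate` is an illustrative name.

```python
from math import sqrt

def disattenuate(r_observed, rel_g, rel_d):
    """Correct a correlation for unreliability in both vectors."""
    if not (0 < rel_g <= 1 and 0 < rel_d <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / sqrt(rel_g * rel_d)

# Pairwise d-vector correlations reported for the three WPPSI samples.
pairwise = [0.90, 0.64, 0.72]
rel_d = sum(pairwise) / len(pairwise)  # simple average as a reliability estimate

rel_g = 0.78        # example value taken from the WISC-R illustration in Section 2.5.2
r_observed = -0.30  # hypothetical observed r (g x d)

corrected = disattenuate(r_observed, rel_g, rel_d)
print(f"reliability of d vector ~ {rel_d:.2f}")
print(f"corrected r = {corrected:.2f}")
```

Because both reliabilities are below 1, the corrected correlation is always at least as far from zero as the observed one.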
2.5.4. Correction for restriction of range of g loadings
The values of r (g × Headstart gains) are attenuated by the
restriction of range of g loadings in many of the standard test
batteries. The most highly g-loaded batteries tend to have the
smallest range of variation in the subtests' g loadings. Jensen
(1998, pp. 381–382) showed that restriction in the magnitude
of g loadings strongly attenuates the correlation between g
loadings and standardized group differences. Hunter and
Schmidt (1990, pp. 47–49) state that the solution to range
variation is to define a reference population and express all
correlations in terms of that reference population. The Hunter
and Schmidt meta-analytical program computes what the
correlation in a given population would be if the standard
deviation were the same as in the reference population. The
standard deviations can be compared by dividing the standard
deviation of the study population by the standard deviation of
the reference group, that is u = SD_study / SD_ref. As references
we used tests that are broadly regarded as exemplary for the
measurement of the intelligence domain, namely the various
versions of the Wechsler tests for children and adults. The
average standard deviation of g loadings of the various versions
of the Wechsler-Bellevue (W-B), Wechsler Preschool and
Primary Scale of Intelligence (WPPSI), Wechsler Intelligence
Scale for Children (WISC), Wechsler Intelligence Scale for
Children-Revised (WISC-R), Wechsler Intelligence Scale for
Children-Third Edition (WISC-III), and the Wechsler
Intelligence Scale for Children-Fourth Edition (WISC-IV)
from datasets from countries all over the world was 0.132. We
used this value as our reference in the studies with children.
The average standard deviation of g loadings of the various
versions of the Wechsler Adult Intelligence Scale (WAIS),
Wechsler Adult Intelligence Scale-Revised (WAIS-R), and the
Wechsler Adult Intelligence Scale-Third Edition (WAIS-III)
from datasets from countries all over the world was 0.107. This
was used as the reference value in the studies with adults. In so
doing, the SD of g loadings of all test batteries was compared to
the average SD in g loadings in the Wechsler tests for,
respectively, children and adults.
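The range correction can be illustrated with the standard formula for direct range restriction (Thorndike's Case II). Whether the Schmidt and Le program applies exactly this form is an assumption of the sketch. The study SD and the observed correlation are hypothetical, while 0.132 is the children's reference value given above; the function names are illustrative.

```python
from math import sqrt

def range_ratio(sd_study, sd_reference):
    """u = SD of g loadings in the study battery / SD in the reference batteries."""
    return sd_study / sd_reference

def correct_for_range(r, u):
    """Estimate what r would be if the SD of g loadings equalled the reference SD."""
    big_u = 1.0 / u
    return big_u * r / sqrt(1.0 + r * r * (big_u * big_u - 1.0))

u = range_ratio(sd_study=0.100, sd_reference=0.132)  # 0.100 is hypothetical
r_corrected = correct_for_range(-0.40, u)             # -0.40 is hypothetical

print(f"u = {u:.3f}")
print(f"range-corrected r = {r_corrected:.2f}")
```

When u is below 1 the correction moves the correlation away from zero, and when a battery has a wider spread of g loadings than the reference (u above 1) the same formula shrinks it.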
2.5.5. Correction for deviation from perfect construct validity
The deviation from perfect construct validity in g
attenuates the values of r (g × Headstart gains). In making
up any collection of cognitive tests, we do not have a
perfectly representative sample of the entire universe of all
possible cognitive tests. Therefore any one limited sample of
tests will not yield exactly the same g as another such sample.
The sample values of g are affected by psychometric sampling
error, but the fact that g is very substantially correlated across
different test batteries implies that the differing obtained
values of g can all be interpreted as estimates of a "true" g
(Johnson, Bouchard, Krueger, McGue, & Gottesman, 2004;
Johnson, te Nijenhuis, & Bouchard, 2008). The values of r
(g × Headstart gains) are attenuated by psychometric
sampling error in each of the batteries from which a g factor
has been extracted.
The more tests and the higher their g loadings, the higher
the g saturation of the composite score. The Wechsler tests
have a large number of subtests with quite high g loadings.
This yields a highly g-saturated composite score. Jensen
(1998, pp. 90–91) states that the g score of the Wechsler tests
correlates more than .95 with the tests' IQ score. However,
shorter batteries with a substantial number of tests with
lower g loadings will lead to a composite with somewhat
lower g saturation. Jensen (1998, ch. 10) states that the
average g loading of an IQ score as measured by various
standard IQ tests lies in the +.80s. When we take this value
as an indication of the degree to which an IQ score is a
reflection of "true" g, we can estimate that a test's g score
correlates about .85 with "true" g. As g loadings represent the
correlations of tests with the g score, it is most likely that
most empirical g loadings will underestimate "true" g
loadings; therefore, empirical g loadings correlate about .85
with "true" g loadings. As the Schmidt and Le (2004)
computer program only includes corrections for the first
four artifacts, the correction for deviation from perfect
construct validity was carried out on the values of r
(g × Headstart gains) after correction for the first four
artifacts. To limit the risk of over-correction, we conservatively
chose the value of .90 for the correction.
3. Results
The results of the studies on the correlation between g
loadings and Headstart gains are shown in Table 1. The table
gives data derived from six studies, with participants
numbering a total of 602. It presents the reference for the
study, the cognitive ability test used, the correlation between
g loadings and Headstart gains, the sample size, and the mean
age (and range of age). It is clear that all these correlations
are negative, and about half are quite strongly so.
Table 2 lists the results of the psychometric meta-analysis
of the eight data points. The estimated true correlation has a
value of −.72, and artifacts explain 71% of the variance in the
observed correlations. Finally, a correction for deviation from
perfect construct validity in g was made, using a conservative
value of .90. This resulted in a value of −.80 for the final
estimated true negative Jensen effect.
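The last step can be checked by hand. The sketch below assumes that the correction consists of dividing the four-artifact estimate by the chosen validity value, which is consistent with the reported figures; `correct_for_construct_validity` is an illustrative name.

```python
def correct_for_construct_validity(rho, validity=0.90):
    """Divide rho by the assumed correlation between empirical and 'true' g loadings."""
    return rho / validity

rho_4 = -0.72  # estimated true correlation after the first four corrections
rho_5 = correct_for_construct_validity(rho_4)
less_conservative = correct_for_construct_validity(rho_4, validity=0.85)

print(f"rho-5 with .90: {rho_5:.2f}")
print(f"rho-5 with .85: {less_conservative:.2f}")
```

Using the less conservative value of .85 discussed in Section 2.5.5 would give an estimate somewhat further from zero, which is the over-correction risk the choice of .90 is meant to limit.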
Fig. 2. Scatter plot of reliability of the vector of Headstart gains and sample size and regression line.
Table 1
Studies of correlations between g loadings and Headstart gains.
Reference                                           Test       r       N     Age mean (range)
McNamara et al. (1969)                              WPPSI      −.391   42    5.80 (4.80–6.60)
Yater et al. (1971)                                 WPPSI      −.298   48    5.30 (4.80–6.00)
Henderson and Rankin (1973)                         WPPSI      −.284   49    5.50 (a)
Lowe et al. (1987)                                  WPPSI      −.386   169   5.90 (5.60–6.20)
Lowe et al. (1987)                                  WISC-R     −.770   94    9.80 (9.50–10.20)
Lowe et al. (1987)                                  WAIS-R     −.356   40    17.40 (17.10–17.80)
Krohn, Lamp, and Phelps                             K-ABC (b)  −.757   38    4.25 (3.30–4.75)
Gridley, Miller, Barke, Fischer, and Smith (1990)   K-ABC (b)  −.665   122   4.50 (3.17–5.42)
Note. In general, the g loadings were based on the correlation matrix taken
from test manuals or from the correlation matrix based on the largest
sample size we could find. A detailed description of all the data points for
this meta-analysis can be found in the Supplementary material.
(a) Estimated mean age.
(b) Kaufman Assessment Battery for Children.
4. Discussion
Studies of compensatory education sometimes show
impressive gains on IQ scores for children from low-quality
environments. Are the Headstart gains similar to training
effects and the Flynn effect, showing a strong negative Jensen
effect, or are they genuine increases in g, in which case they
should show a strong Jensen effect?
Results were strongly in line with the prediction that
Headstart involves a lot of teaching to the test, so that the
gains would be strongly at the level of the specific or group
factors. The gains involve mostly the non-g variance, which
means that they were mostly hollow in terms of g. The final
estimated true correlation of −.80, rather than a correlation
of exactly −1.0, need not mean that there was some gain in
g. It might instead indicate that the teachers did not give
equal amounts of training to activities related to each subtest.
The finding that the IQ gains from Headstart were mostly
on the non-g variance might explain why IQ gains from such
programs fade with time (Brody, 1992). IQ tests given to
people of different ages do not have the same items, as items
that are useful for discriminating between small children are
generally too easy for adults (Jensen, 1980). If one trains
young children on the specific factors U of one test battery, and one
later tests the same group with another test battery that has
different specific factors U, then the earlier training
would be irrelevant (barring any near-transfer effects), and
therefore any IQ gain would vanish.
Alternatively, one might view the fading of IQ gains in
light of the repeated finding that heritability increases with
age, or equivalently, environmentality decreases with age
(Plomin, DeFries, Knopik, & Neiderhiser, 2013). Since compensatory
education is an environmental effect, its strength
should decrease with time. As Jensen (1998, p. 184) pointed
out, the most g-loaded subtests are also the most heritable
ones, indicating that influencing g through environmental
interventions is not easily accomplished.
4.1. Future studies
In this study, we focused on the Headstart program.
Future studies should examine other compensatory educational
programs to see whether the IQ gains were g-loaded.
Indeed, the study of any phenomenon's relation to IQ scores
could benefit from applying the method of correlated vectors.
This is as true for compensatory education and dual n-back
training (e.g., Jaeggi et al., 2010; but see Chooi & Thompson,
2012) as it is for fluoride poisoning (Choi, Sun, Zhang, &
Grandjean, 2012) and myopia (Saw et al., 2004). The fact that
the literature on intelligence focuses so much on manifest
variables (i.e. IQ) leads to confusion in the press when
phenomena such as the Flynn effect are reported, as well as
when people observe that they can make their IQ scores go
up by taking a test more than once. The only remedy is to
focus on latent traits and always report the g loading.
A reviewer came up with an interesting suggestion for
additional analyses, starting with the observation that the
Headstart program is aimed at disadvantaged participants in
the lower tail of the intelligence distribution, and therefore
large and homogeneous g loadings are expected. So, it would
be interesting to compare the g vector for these participants
after the intervention with the vector of a comparable group
without the intervention. The prediction for a successful
program will be a significant reduction of the g loadings for
the intervention group. The theoretical implication of such a
result would be a reduction of the cognitive complexity of the
completed measures.
4.2. Limitations of the studies
Estimates of the reliabilities of the vectors of g loadings
were based on a very large number of high-quality studies.
However, reliabilities of vectors of Headstart gains were
based on a limited number of studies, albeit the complete
empirical literature on this topic, leading to non-optimal
estimates of the distribution of reliabilities.
4.3. Conclusion
Based on meta-analytical data and employment of the
method of correlated vectors we showed that there is a
strong and negative correlation of Headstart gains with g.
Headstart programs can raise IQ test scores successfully, but
not general mental ability per se. A very large amount of
money was spent on Headstart programs, and it most likely
led to increases in socially-desirable outcomes such as
adequate nutrition, self-care skills, and social skills. However,
our study shows it did not lead to the intended increase in
intelligence, as reflected in the very strong negative correlation
with g. The outcomes of our study could be included in a
cost–benefit analysis of Headstart programs.
Acknowledgments
We would like to thank Dr. John McNamara, Dr. Deborah
Stipek, Dr. Emily Krohn, and Dr. Craig Ramey for their
enthusiastic support of this project. We would also like to
thank reviewers Roberto Colom and Meredith Frey and one
anonymous reviewer for their constructive criticism of our
Appendix A. Supplementary data
Supplementary data to this article can be found online at
Table 2
Meta-analytical results for correlation between Headstart gains and g loadings after corrections for reliability and restriction of range.
K    N     r      SDr   Rho-4   SDrho-4   Rho-5   %VE   80% CI
8    602   −.51   .16   −.72    .01       −.80    71    −.58 to −.85
Note. K = number of correlations; N = total sample size; r = mean
observed correlation (sample size weighted); SDr = standard deviation of
observed correlation; rho-4 = observed correlation (corrected for
unreliability and range restriction); SDrho-4 = standard deviation of true
correlation; rho-5 = observed correlation (corrected for unreliability,
range restriction, and imperfect measurement of the construct); %VE =
percentage of variance accounted for by artifactual errors; 80% CI = 80%
credibility interval.
References marked with an asterisk indicate studies included in the
Brody, N. (1992). Intelligence. New York: Academic Press.
Caruso, D. R., Taylor, J. J., & Detterman, D. K. (1982). Intelligence research and
intelligent policy. In D. K. Detterman, & R. J. Sternberg (Eds.), How and
how much can intelligence be increased (pp. 4565). Norwood, NJ: Ablex
Publishing Corporation.
Choi, A. L., Sun, G., Zhang, Y., & Grandjean, P. (2012). Developmental fluoride
neurotoxicity: A systematic review and meta-analysis. Environmental
Health Perspectives,120, 13621368.
Chooi, W. T., & Thompson, L. A. (2012). Working memory training does not
improve intelligence in healthy young adults. Intelligence,40, 531542.
Colom, R., Juan-Espinosa, M., & Garcı́a, L. F. (2001). The secular increase in
test scores is a Jensen effect.Personality and Individual Differences,30,
Eppig, C., Fincher, C. L., & Thornhill, R. (2010). Parasite prevalence and the
worldwide distribution of cognitive ability. Proceedings of the Royal
Society B: Biological Sciences,277(1701), 38013808.
Flynn, J. R. (1987). Massive IQ gains in 14 nations: What IQ tests really
measure. Psychological Bulletin,101, 171.
Flynn, J. R., te Nijenhuis, J., & Metzen, D. (2014). The gbeyond Spearman's g:
Flynn's paradoxes resolved using four exploratory meta-analyses.
*Gridley, B. E., Miller, G., Barke, C., Fischer, W., & Smith, D. (1990). Construct
validity of the K-ABC with an at-risk preschool population. Journal of
School Psychology,28,3949.
*Henderson, R. W., & Rankin, R. J. (1973). WPPSI reliability and predictive
validity with disadvantaged Mexican-American children. Journal of
School Psychology,11,1620.
Herrnstein, R. J., & Murray, C. (1994). The Bell curve: Intelligence and class
structure in American life. New York: Free Press.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis. London: Sage.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis (2nd ed.).
London: Sage.
(2010). The relationship between n-back performance and matrix
reasoning Implications fortraining and transfer. Intelligence,38,625635.
Jensen, A. R. (1969). How much can we boost IQ and scholastic achievement.
Harvard Educational Review,39,1123.
Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.
Jensen, A. R. (1987). Differential psychology: Towards consensus. In S.
Modgil, & C. Modgil (Eds.), Arthur Jensen: Consensus and controversy.
New York: The Falmer Press.
Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT:
Jensen, A. R., & Weng, L. J. (1994). What is a good g?Intelligence,18, 231258.
Johnson, W., Bouchard, T. J., Jr., Krueger, R. F., McGue, M., & Gottesman, I. I.
(2004). Just one g: Consistent results from three test batteries.
Johnson, W., te Nijenhuis, J., & Bouchard, T. J., Jr. (2008). Still just 1 g:
Consistent results from five test batteries. Intelligence,36,8195.
*Krohn, E. J., Lamp, R. E., & Phelps, C. G. (1988). Validity of the K-ABC for a
Black preschool population. Psychology in the Schools,25,1521.
*Lowe, J. D., Anderson, H. N., Williams, A., & Currie, B. B. (1987). Long-term
predictive validity of the WPPSI and the WISC-R with Black school
children. Personality and Individual Differences,8, 551559.
Lynn, R. (2006). Race differences in intelligence: An evolutionary analysis.
Augusta, GA: Washington Summit Publishers.
Lynn, R. (2013). Who discovered the Flynn effect? A review of early studies
of the secular increase of intelligence. Intelligence,41,765769.
*McNamara, J. R., Porterfield, C. L., & Miller, L. E. (1969). The relationship of
the Wechsler Preschool and Primary Scale of Intelligence with the
coloured Progressive Matrices (1956) and the Bender Gestalt Test.
Journal of Clinical Psychology,25,6568.
Mingroni, M. A. (2007). Resolving the IQ paradox: Heterosis as a cause of the
Flynn effect and other trends. Psychological Review,114, 806.
Nelson, G., Westhues, A., & MacLeod, J. (2003). A meta-analysis of
longitudinal research on preschool prevention programs for children.
Prevention & Treatment,6.
Nevin, R. (2000). How lead exposure relates to temporal changes in IQ,
violent crime, and unwed pregnancy. Environmental Research,83,122.
Plomin, R., DeFries, J. C., Knopik, V. S., & Neiderhiser, J. M. (2013). Behavioral
genetics (6th ed.). New York: Worth.
Prokosch, M. D., Yeo, R. A., & Miller, G. F. (2005). Intelligence tests with
higher g-loadings show higher correlations with body symmetry:
Evidence for a general fitness factor mediated by developmental
stability. Intelligence,33, 203213.
Protzko, J., Aronson, J., & Blair, C. (2013). How to make a young child
smarter: Evidence from the database of raising intelligence. Perspectives
on Psychological Science,8,2540.
Ramey, C. T., Bryant, D. M., & Suarez, T. M. (1985). Preschool compensatory
education and the modifiability of intelligence: A critical review. In D.
Detterman (Ed.), Current topics in human intelligence. Norwood, NJ:
Ablex Publishing Company.
Rushton, J. P. (1998). The Jensen effectand the SpearmanJensen
hypothesisof BlackWhite IQ differences. Intelligence,26, 217225.
Rushton, J. P., & Ankney, C. D. (2009). Whole brain size and general mental
ability: A review. International Journal of Neuroscience,119, 692732.
Rushton, J. P., & Jensen, A. R. (2010). The rise and fall of the Flynn effect as a
reason to expect a narrowing of the BlackWhite IQ gap. Intelligence,38,
Saw, S. M., Tan, S. B., Fung, D., Chia, K. S., Koh, D., Tan, D. T., et al. (2004). IQ
and the association with myopia in children. Investigative Ophthalmology
& Visual Science,45, 29432948.
Schmidt, F. L., & Hunter, J. E. (1999). Theory testing and measurement error.
Intelligence,27, 183198.
Schmidt, F. L., & Le, H. (2004). Software for the HunterSchmidt meta-analysis
methods. Iowa City, IA: University of Iowa, Department of Management
& Organization, 42242.
Somin, I. (2013). Democracy and political ignorance: Why smaller government
is smarter. Stanford University Press.
Soriano, D., Duenas, M., & LeBlanc, P. (2006, August). The short-term and
long-term effects of Head Start education and no child left behind. Paper
presented at the meeting of the Association of Teacher Educators,
Philadelphia, PA.
Spitz, H. H. (1986). The raising of intelligence: A selected history of attempts to
raise retarded intelligence. Hillsdale, NJ: Erlbaum.
te Nijenhuis, J., & van der Flier, H. (2013). Is the Flynn effect on g?: A
meta-analysis. Intelligence,41, 802807.
te Nijenhuis, J., van Vianen, A. E., & van der Flier, H. (2007). Score gains on
g-loaded tests: No g.Intelligence,35, 283300.
Thompson, J. (2013). The Flynn effect re-evaluated. In James Thompson (Ed.),
Intelligence. Vol. 41, Issue 6.(pp.751858) (NovemberDecember 2013).
U.S. Department of Health and Human Services (2012). Head Start program
facts fiscal year 2012.
2012-hs-program-factsheet.html (Retrieved from)
van Haasen,P. P., de Bruyn, E. E. J., Pijl,Y. J., Poortinga, Y. H., Lutje Spelberg,H. C.
, Vander Steene, G., et al. (1986). WISC-R: Wechsler Intelligence Scale for
Children Revised; Nederlandstalige uitgave [WISC-R: Wechsler Intelligence
Scale for Children Revised; Dutch edition]. Lisse, The Netherlands: Swets.
Woodley, M. A., & Meisenberg, G. (2013). A Jensen effect on dysgenic
fertility: An analysis involving the National Longitudinal Survey of
Youth. Personality and Individual Differences,55, 279282.
Woodley, M. A., te Nijenhuis, J., Must, O., & Must, A. (2014). Controlling for
increased guessing enhances the independence of the Flynn effect from
g: The return of the Brand effect. Intelligence,43,2734.
*Yater, A. C., Barclay, A., & Leskosky, R. (1971). GoodenoughHarris drawing
test and WPPSI performance of disadvantaged preschool children.
Perceptual and Motor Skills,33, 967970.
Zill, N.,Resnick, G., Kim, K., O'Donnell, K.,& Sorongon, A. (2003). HeadStart faces
2003: A whole-child perspective on program performance. Administration for
Children and Families, U.S. Department of Health and Human Services
(Retrieved December 1, 2005, from:
opre/hs/faces/reports/faces00_4thprogress/faces 00_4thprogress.pdf).