The Winnower

Published March 4th, 2015

Examining the S factor in US states

Emil O. W. Kirkegaard1

Abstract

A dataset of 25 diverse socioeconomic indicators for US states was compiled and subjected to factor

analysis. Results showed that Washington DC was a strong outlier, but if it is excluded, then the S

factor correlated strongly with state IQ (based on NAEP) at .75.

Ethnoracial demographics of the states were related to the state’s IQ and S in the expected order

(White>Hispanic>Black).

Key words: USA, United States, states, social inequality, S factor, general socioeconomic factor, IQ,

intelligence, cognitive ability, NAEP, cognitive sociology

1. Introduction

In two previous studies, I analyzed the S factors in 33 Indian states (Kirkegaard, 2015a) and 31 Chinese regions (Kirkegaard, 2015b). Both studies found fairly strong S factors, and both S factors correlated positively with cognitive estimates (IQ or G). The purpose of this study was to examine the S factor in the US.

2. Data sources

State IQ data from McDaniel (2006) were used. He gave two sets of estimated IQs based on SAT-ACT

and on NAEP. Unfortunately, they only correlated .58, so at least one of them is not a very accurate

estimate of general intelligence.

McDaniel reports a few correlations between his IQs and socioeconomic variables: Gross State Product

per capita, median income and percent poverty. However, data for these variables is not given in the

article, so I could not copy them.

An analysis of US states should be a strong test of the S factor model because plenty of high quality

data are readily available and the number of cases is decent (50 or 51, depending on whether the capital

is included). Factor analysis requires a case to variable ratio of at least 2:1 to deliver reliable results

(Zhao, 2009). So, this means that one can do an S factor analysis with about 25 variables.

1 University of Aarhus, Denmark. Email: emil@emilkirkegaard.dk

A dataset of 25 diverse socioeconomic variables was compiled. There are two reasons to gather a very diverse sample of variables. First, for the method of correlated vectors to work (Jensen, 1998), there must be variation in the indicators’ loadings on the factor. Lack of variation causes restriction of range problems. Second, lack of diversity in the indicators of a latent variable leads to psychometric sampling error (Jensen & Weng, 1994).

The primary source was The 2012 Statistical Abstract website. I simply searched for “state” and picked

a diverse set of variables. An attempt was made to pick variables that weren’t strongly dependent on

geography. To increase reliability, I generally used all data for the last 10 years and averaged them.

Curious readers can read the datafile for details.

The following variables were chosen:

1. Murder rate per 100k, 10 years

2. Proportion with high school or more education, 4 years

3. Proportion with bachelor or more education, 4 years

4. Proportion with advanced degree or more, 4 years

5. Voter turnout, presidential elections, 3 years

6. Voter turnout, house of representatives, 6 years

7. Percent below poverty, 10 years

8. Personal income per capita, 1 year

9. Percent unemployed, 11 years

10. Internet usage, 1 year

11. Percent smokers, male, 1 year

12. Percent smokers, female, 1 year

13. Physicians per capita, 1 year

14. Nurses per capita, 1 year

15. Percent with health care insurance, 1 year

16. Percent in ‘Medicaid Managed Care Enrollment’, 1 year

17. Proportion of population urban, 1 year

18. Abortion rate, 5 years

19. Marriage rate, 6 years

20. Divorce rate, 6 years

21. Incarceration rate, 2 years

22. Gini coefficient, 10 years

23. Top 1%, proportion of total income, 10 years

24. Obesity rate, 1 year

Most of these are self-explanatory. For the economic inequality measures, I found 6 different measures. Because I wanted diversity, I chose the Gini coefficient and the top 1% income share, because these correlated the least and are both well known.

Additionally, racial demographic data were downloaded.

3. Analyses

3.1. Missing data

Figure 1 shows a matrixplot of the missing data.

We see that there aren’t many missing values. The missing data were imputed using irmi from the VIM

package (Templ, Alfons, Kowarik, & Prantner, 2015).

3.2. Extreme values

A useful feature of the matrixplot is that it shows in grey-tone the relative outliers for each variable.

Some outlying datapoints can be seen and these were inspected for possible data error.

The outlier in the two university degree variables is DC, presumably because it is the seat of government and home to a large lobbying sector. For the marriage rate, the outlier is Nevada, where many people travel to get married. The outliers in the physician and nurse rates are also DC.

Figure 1: Matrixplot of socioeconomic data before imputation.

Figure 2 shows the matrixplot after data imputation.

It looks much the same as before. This is good because it means the imputation did not radically

change the data.

Figure 2: Matrixplot of socioeconomic data after imputation.

3.3. Factor analysis

The data were factor analyzed using fa from the psych package (Revelle, 2015). The loadings are shown in Figure 3.

We see a wide spread of variable loadings. All but two of them load in the expected direction (socially valued outcomes load positively, the opposite load negatively), showing the existence of the S factor. The first exception is abortion rate, which loads +.60 but is often considered a negative outcome. This is open to discussion, however: higher abortion rates might be interpreted as reflecting less backward religiousness or more freedom for women (both good in my view). The second is marriage rate, which loads weakly at -.19. I’m not sure how to interpret that. In any case, for both variables it is debatable which direction is the desirable one.

Figure 3: S factor loadings. DC included.
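To make the idea of a loading on a general factor concrete, here is a minimal sketch that extracts the first principal component of a correlation matrix by power iteration. This is a simpler stand-in and not the fa procedure from psych used in the analysis; the function names and the three toy indicators are hypothetical and chosen only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def first_component(columns, iterations=500):
    """Loadings and variance share of the first principal component.

    `columns` is a list of indicators, each a list with one value per case.
    """
    p = len(columns)
    corr = [[pearson(columns[i], columns[j]) for j in range(p)] for i in range(p)]
    # Power iteration: repeatedly multiply by the correlation matrix and rescale.
    vector = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-uniform start
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(corr[i][j] * vector[j] for j in range(p)) for i in range(p)]
        eigenvalue = math.sqrt(sum(v * v for v in product))
        vector = [v / eigenvalue for v in product]
    # A loading is the correlation between an indicator and the component.
    loadings = [v * math.sqrt(eigenvalue) for v in vector]
    if loadings[0] < 0:  # the sign is arbitrary; orient on the first indicator
        loadings = [-value for value in loadings]
    return loadings, eigenvalue / p


# Hypothetical toy data: three indicators measured on six cases.
education = [1, 2, 3, 4, 5, 6]
income = [2, 1, 4, 3, 6, 5]
murder_rate = [6, 5, 4, 3, 1, 2]
loadings, variance_share = first_component([education, income, murder_rate])
print([round(value, 2) for value in loadings], round(variance_share, 2))
```

In this toy example the two socially valued indicators load positively and the murder rate loads negatively, which is the pattern described above. The variance share plays the same role as the factor size reported in the text, although a principal component will generally give somewhat different numbers than a factor analysis.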

3.4. Correlations with cognitive measures

Because we have two sets of IQ estimates, we will plot both to see which is superior. Figures 4 and 5 show the relationships between the IQ measures and S.

First, the SAT-ACT estimates are pretty strange for three states: California, Arizona and Nevada. I note that these are three adjacent states, so it is quite possibly some kind of regional testing practice that’s throwing off the estimates. Second, DC is a strong outlier in S, as we may have expected from our short discussion of extreme values above. It’s the only case that’s almost entirely a city.

Figure 4: ACT-SAT based IQ estimates and S. DC included.

Figure 5: NAEP based IQ estimates and S. DC included.

3.5. Dealing with outliers – Spearman’s correlation

There are various ways to deal with outliers. One simple way is to convert the data into ranks and then analyze them as usual, which gives Spearman’s correlation. Pearson’s correlations are sensitive to outliers and to departures from normality, both of which are common with higher-level data (states, countries). Figures 6 and 7 show the relationships between ACT-SAT, NAEP and S with ranked data.

Figure 6: ACT-SAT based IQ estimates and S. DC included. Rank data.

Figure 7: NAEP based IQ estimates and S. DC included. Rank data.

The rank order correlations are stronger as expected.
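The rank-based approach can be sketched as follows: rank both variables (ties get the mean of their ranks) and compute an ordinary Pearson correlation on the ranks. The function names and the toy numbers are hypothetical and not taken from the study's data.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data with one extreme case in the first variable.
s_scores = [1, 2, 3, 4, 50]
iq_scores = [1, 2, 3, 4, 5]
print(pearson(s_scores, iq_scores), spearman(s_scores, iq_scores))
```

In the toy data the single extreme case pulls the Pearson correlation well below the rank-order correlation, which is the kind of effect an outlier such as DC can have.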

3.6. Results without DC

An alternative approach is to exclude DC before carrying out the factor analysis. A parallel dataset was created without DC. Figure 8 shows the factor loadings.

These are very similar to before; excluding DC did not substantially change the results. The factor size increased from 30% to 36%, indicating that DC was distorting the general factor. The reason this happens is that DC is an odd case, scoring very high on some indicators (e.g. education) and very poorly on others (e.g. murder rate). Figures 9 and 10 show the IQ x S correlations again, but based on the dataset without DC.

Figure 8: S factor loadings. DC excluded.

Not surprisingly, we see an increase in the effect sizes from before: from .14 to .31 for the ACT-SAT IQs and from .43 to .69 for the NAEP IQs.

Figure 9: ACT-SAT based IQ estimates and S. DC excluded.

Figure 10: NAEP based IQ estimates and S. DC excluded.

3.6.1. Without DC and rank-order

Still, one may wonder what the results would be with rank-order data and DC removed. These are shown in Figures 11 and 12.

Compared to before, the effect size increased for the SAT-ACT IQ and decreased slightly for the NAEP IQ.

One could also do regression with weights based on some metric of the state population, and this may further change the results. Still, it seems safe to say that the cognitive measures correlate in the expected direction and that, with the removal of one odd case, the better measure performs at about the expected level with or without rank-order correlations.

Figure 11: ACT-SAT based IQ estimates and S. DC excluded. Rank data.

Figure 12: NAEP based IQ estimates and S. DC excluded. Rank data.
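The population-weighting idea mentioned above was not carried out in this study. As a rough sketch of what it could look like, the following computes a weighted Pearson correlation in which each case counts in proportion to a weight such as state population. The function name and the toy numbers are hypothetical.

```python
import math


def weighted_pearson(x, y, weights):
    """Pearson correlation in which each case counts in proportion to its weight."""
    total = sum(weights)
    mean_x = sum(w * a for w, a in zip(weights, x)) / total
    mean_y = sum(w * b for w, b in zip(weights, y)) / total
    cov = sum(w * (a - mean_x) * (b - mean_y) for w, a, b in zip(weights, x, y))
    var_x = sum(w * (a - mean_x) ** 2 for w, a in zip(weights, x))
    var_y = sum(w * (b - mean_y) ** 2 for w, b in zip(weights, y))
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: four cases with very different population sizes.
s_scores = [1, 2, 3, 4]
iq_scores = [1, 3, 2, 4]
populations = [30.0, 1.0, 1.0, 8.0]
print(weighted_pearson(s_scores, iq_scores, populations))
```

With equal weights the function reduces to the ordinary Pearson correlation, so any difference between the two reflects the weighting alone.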

3.7. Method of correlated vectors

The MCV (Jensen, 1998) can be used to test whether a specific latent variable underlying some data is responsible for the observed correlation between the factor score (or a factor score approximation such as IQ, an unweighted sum) and some criterion variable. Although originally invented for use on cognitive test data and the general intelligence factor, I have previously used it in other areas (e.g. Kirkegaard, 2014, 2015a).

Using the dataset without DC, the MCV result for NAEP is shown in Figure 13.

We see that MCV can reach high r’s when there is a large number of diverse variables. But note that the

value can be considered inflated because of the negative loadings of some variables. It is debatable

whether one should reverse them.

Figure 13: Method of correlated vectors applied to the S x NAEP relationship. DC excluded.

3.8. Racial proportions of states and S and IQ

A last question is whether the states’ racial proportions predict their S and IQ. There are many problems with this approach. First, the actual genomic proportions within these racial groups vary by state (Bryc, Durand, Macpherson, Reich, & Mountain, 2015). Second, within ‘pure-breed’ groups, general intelligence varies by state too (this was shown in the testing of draftees in the US in WW1). Third, there is an ‘other’ group that varies from state to state, presumably comprising different kinds of Asians (Japanese, Chinese, Indians, other SE Asians). Fourth, it is unclear how one should combine these proportions into an estimate used for correlation analysis, or how to model them. Standard multiple regression is unsuited for handling data like these because there is a perfect linear dependency among the proportions, i.e. the total proportion must add up to 1 (100%). Given the four problems above, one would not expect near-perfect results, but one would probably expect most to go in the right direction with moderate effect sizes.
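The linear dependency can be written out explicitly. In the sketch below the symbols are my own notation, not the paper's: p_W, p_B, p_H and p_O are a state's four proportions and c is an arbitrary constant.

```latex
\begin{align}
p_W + p_B + p_H + p_O &= 1 \\
\hat{y} &= \beta_0 + \beta_W p_W + \beta_B p_B + \beta_H p_H + \beta_O p_O \\
        &= (\beta_0 - c) + (\beta_W + c)\, p_W + (\beta_B + c)\, p_B
           + (\beta_H + c)\, p_H + (\beta_O + c)\, p_O
\end{align}
```

Because every choice of c gives identical predictions, a regression with an intercept and all four proportions has no unique set of coefficients. One proportion has to be dropped, or the weights fixed in advance.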

Perhaps the simplest way of analyzing the data is to look at the correlations. These are susceptible to confounds, e.g. if White% correlates differentially with the other racial proportions. However, they should get the basic directions correct, if not the effect size order too.

3.8.1. Racial proportions, NAEP IQ and S

For this analysis I use only the NAEP IQs and without DC, as I believe this is the best subdataset to

rely on. I correlate this with the S factor and each racial proportion. The results are:

Racial group   NAEP IQ   S
White           0.69      0.18
Black          -0.50     -0.42
Hispanic       -0.38     -0.08
Other          -0.26      0.20

For NAEP IQ, depending on what one thinks of the ‘other’ category, these have either exactly or

roughly the order one expects: W>O>H>B. If one thinks “other” is mostly East Asian (Japanese,

Chinese, Korean) with higher cognitive ability than Europeans, one would expect O>W>H>B. For S,

however, the order is O>W>H>B and the effect sizes are much weaker. In general, given the limitations

above, these are perhaps reasonable if somewhat on the weak side.

3.8.2. Estimating state IQ from racial proportions using racial IQs

One way to utilize all the four variables (White, Black, Hispanic and Other) without having MR assign

them weights is to assign them weights based on known group IQs and then calculate a weighted mean

estimated IQ for each state.

Depending on which estimates for group IQs one accepts, one might use something like the following:

State IQ est. = White*100 + Other*100 + Black*85 + Hispanic*90

Or if one thinks Other is somewhat higher than Whites (this is not entirely unreasonable, but recall that the NAEP includes reading tests which foreigners and Asians perform less well on), one might want to use 105 for the other group (#2). Or one might want to raise the Black and Hispanic IQs a bit if one thinks the group differences have narrowed, say, to 88 and 93 (#3). Or do both (#4). All the variations are shown in Table 1.

Variable   Race.IQ   Race.IQ2   Race.IQ3   Race.IQ4
Race.IQ    1.00      0.96       1.00       0.93
Race.IQ2   0.96      1.00       0.96       0.99
Race.IQ3   1.00      0.96       1.00       0.94
Race.IQ4   0.93      0.99       0.94       1.00
NAEP IQ    0.67      0.56       0.67       0.51
S          0.41      0.44       0.42       0.45

Table 1: Intercorrelations of state IQ estimates, NAEP and S.

As far as I can tell, there is no strong reason to pick any of these over the others. However, what we learn is that the correlation between the racial IQ estimate and the NAEP IQ estimate is somewhere between .51 and .67, and that between the racial IQ estimate and S is somewhere between .41 and .45. These are reasonable results given the problems of this analysis described above.
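The weighted-mean calculation described in this section can be sketched as below. The weights are the values quoted in the text and are taken as given. Matching variants #2 to #4 to the names Race.IQ2 to Race.IQ4 is my assumption, and the function name and the example proportions are hypothetical.

```python
# Group weights for the four variants described in the text. The mapping of
# variants #2 to #4 onto the names Race.IQ2 to Race.IQ4 is an assumption.
WEIGHT_SCHEMES = {
    "Race.IQ": {"white": 100, "other": 100, "black": 85, "hispanic": 90},
    "Race.IQ2": {"white": 100, "other": 105, "black": 85, "hispanic": 90},
    "Race.IQ3": {"white": 100, "other": 100, "black": 88, "hispanic": 93},
    "Race.IQ4": {"white": 100, "other": 105, "black": 88, "hispanic": 93},
}


def estimate_state_iq(proportions, weights):
    """Weighted mean of the group values, using a state's group proportions."""
    if abs(sum(proportions.values()) - 1.0) > 1e-6:
        raise ValueError("proportions must sum to 1")
    return sum(proportions[group] * weights[group] for group in weights)


# Hypothetical state, not taken from the dataset.
example_state = {"white": 0.60, "black": 0.20, "hispanic": 0.15, "other": 0.05}
for name, weights in WEIGHT_SCHEMES.items():
    print(name, round(estimate_state_iq(example_state, weights), 2))
```

Since the weights are fixed in advance rather than estimated, this sidesteps the linear dependency that rules out standard multiple regression on the four proportions.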

3.9. Added March 11: New NAEP data

Shortly after publication of this study, I came across a series of posts by science blogger The Audacious Epigone, who had also estimated IQs based on NAEP data. He has done this three times (for 2013, 2009 and 2005 data), so along with McDaniel’s estimates, this gives us 4 non-identical estimates. The intercorrelations of these new variables are shown in Table 2. NAEP.1 is a factor score extracted from the base NAEP variables.

             NAEP.IQ.13   NAEP.IQ.09   NAEP.IQ.05   NAEP M.   NAEP.1
NAEP.IQ.09   0.96
NAEP.IQ.05   0.83         0.89
NAEP M.      0.88         0.93         0.96
NAEP.1       0.95         0.99         0.95         0.97
S            0.81         0.76         0.64         0.69      0.75

Table 2: Intercorrelations between NAEP variables and S. NAEP M = McDaniel’s IQs.

We see that the intercorrelations between the NAEP estimates are not that high; they average only .86. Still, combining them should improve results because some measurement error is removed, and it does: NAEP IQ x S is now .75, up from .69.
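The average intercorrelation can be recomputed directly from Table 2. The sketch below does this for different subsets of the variables; the dictionary layout and the function name are my own. With the values as printed, the six correlations among the four IQ estimates average about .91, adding NAEP.1 gives about .93, and the mean of all 15 entries including the S row is about .86. The text does not say which set its .86 refers to.

```python
# Lower triangle of Table 2, keyed by variable pair.
TABLE_2 = {
    ("NAEP.IQ.09", "NAEP.IQ.13"): 0.96,
    ("NAEP.IQ.05", "NAEP.IQ.13"): 0.83,
    ("NAEP.IQ.05", "NAEP.IQ.09"): 0.89,
    ("NAEP M.", "NAEP.IQ.13"): 0.88,
    ("NAEP M.", "NAEP.IQ.09"): 0.93,
    ("NAEP M.", "NAEP.IQ.05"): 0.96,
    ("NAEP.1", "NAEP.IQ.13"): 0.95,
    ("NAEP.1", "NAEP.IQ.09"): 0.99,
    ("NAEP.1", "NAEP.IQ.05"): 0.95,
    ("NAEP.1", "NAEP M."): 0.97,
    ("S", "NAEP.IQ.13"): 0.81,
    ("S", "NAEP.IQ.09"): 0.76,
    ("S", "NAEP.IQ.05"): 0.64,
    ("S", "NAEP M."): 0.69,
    ("S", "NAEP.1"): 0.75,
}

IQ_ESTIMATES = {"NAEP.IQ.13", "NAEP.IQ.09", "NAEP.IQ.05", "NAEP M."}
ALL_NAEP = IQ_ESTIMATES | {"NAEP.1"}
ALL_VARIABLES = ALL_NAEP | {"S"}


def mean_correlation(variables):
    """Mean of the Table 2 entries whose two variables are both in `variables`."""
    selected = [r for (a, b), r in TABLE_2.items() if a in variables and b in variables]
    return sum(selected) / len(selected)


for label, subset in [("four IQ estimates", IQ_ESTIMATES),
                      ("plus NAEP.1", ALL_NAEP),
                      ("plus S", ALL_VARIABLES)]:
    print(label, round(mean_correlation(subset), 2))
```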

4. Discussion

Washington DC was found to be a strong outlier that caused problems with the data analysis. Future

studies should be careful about capital districts for this reason.

The correlations between IQ and S were strong, as expected from previous studies. The relationships

between demographic variables and IQ were strong, while those for S only weak to moderate.

Supplementary material

Data files and R source code available at https://osf.io/t7e5y/files/.

References

Bryc, K., Durand, E. Y., Macpherson, J. M., Reich, D., & Mountain, J. L. (2015). The genetic ancestry of African Americans, Latinos, and European Americans across the United States. American Journal of Human Genetics, 96(1), 37–53. https://doi.org/10.1016/j.ajhg.2014.11.010

Jensen, A. R. (1998). The g factor: the science of mental ability. Westport, Conn.: Praeger.

Jensen, A. R., & Weng, L.-J. (1994). What is a good g? Intelligence, 18(3), 231–258. https://doi.org/10.1016/0160-2896(94)90029-9

Kirkegaard, E. O. W. (2014). The international general socioeconomic factor: Factor analyzing international rankings. Open Differential Psychology. Retrieved from http://openpsych.net/ODP/2014/09/the-international-general-socioeconomic-factor-factor-analyzing-international-rankings/

Kirkegaard, E. O. W. (2015a). Indian states: G and S factors. The Winnower. Retrieved from https://thewinnower.com/papers/indian-states-g-and-s-factors

Kirkegaard, E. O. W. (2015b). The S factor in China. The Winnower. Retrieved from https://thewinnower.com/papers/the-s-factor-in-china

McDaniel, M. A. (2006). State preferences for the ACT versus SAT complicates inferences about SAT-derived state IQ estimates: A comment on Kanazawa (2006). Intelligence, 34(6), 601–606. https://doi.org/10.1016/j.intell.2006.07.005

Revelle, W. (2015). psych: Procedures for Psychological, Psychometric, and Personality Research (Version 1.5.4). Retrieved from http://cran.r-project.org/web/packages/psych/index.html

Templ, M., Alfons, A., Kowarik, A., & Prantner, B. (2015, February 19). VIM: Visualization and Imputation of Missing Values. CRAN. Retrieved from http://cran.r-project.org/web/packages/VIM/index.html

Zhao, N. (2009, March 23). The Minimum Sample Size in Factor Analysis. Retrieved November 16, 2016, from https://www.encorewiki.org/display/~nzhao/The+Minimum+Sample+Size+in+Factor+Analysis