Content uploaded by Emil O. W. Kirkegaard

Author content

All content in this area was uploaded by Emil O. W. Kirkegaard on Nov 16, 2015

Content may be subject to copyright.

An S factor among census tracts of Boston

Abstract

A factor analysis was carried out on 6 socioeconomic variables for 506 census tracts of Boston. An S

factor was found with positive loadings for median value of owner-occupied homes and average

number of rooms in these; negative loadings for crime rate, pupil-teacher ratio, NOx pollution, and the

proportion of the population of 'lower status'. The S factor scores were negatively correlated with the

estimated proportion of African Americans in the tracts r = -.36 [CI95 -0.43; -0.28]. This estimate was

biased downwards due to data error that could not be corrected for.

Introduction

The general socioeconomic factor (s/S

1

) is a similar construct to that of general cognitive ability (GCA;

g factor, intelligence, etc., (Gottfredson, 2002; Jensen, 1998). For ability data, it has been repeatedly

found that performance on any cognitive test is positively related to performance on any other test, no

matter which format (pen pencil, read aloud, computerized), and type (verbal, spatial, mathematical,

figural, or reaction time-based) has been tried. The S factor is similar. It has been repeatedly found that

desirable socioeconomic outcomes tend are positively related to other desirable socioeconomic

outcomes, and undesirable outcomes positively related to other undesirable outcomes. When this

pattern is found, one can extract a general factor such that the desirable outcomes have positive

loadings and then undesirable outcomes have negative loadings. In a sense, this is the latent factor that

underlies the frequently used term “socioeconomic status” except that it is broader and not just

restricted to income, occupation and educational attainment, but also includes e.g. crime and health.

So far, S factors have been found for country-level (Kirkegaard, 2014b), state/regional-level (e.g.

Kirkegaard, 2015), country of origin-level for immigrant groups (Kirkegaard, 2014a) and first name-

level data (Kirkegaard & Tranberg, In preparation). The S factors found have not always been strictly

general in the sense that sometimes an indicator loads in the 'wrong direction', meaning that either an

undesirable variable loads positively (typically crime rates), or a desirable outcome loads negatively.

These findings should not be seen as outliers to be explained away, but rather to be explained in some

coherent fashion. For instance, crime rates may load positively despite crime being undesirable because

the justice system may be better in the higher S states, or because of urbanicity tends to create crime

and urbanicity usually has a positive loading. To understand why some indicators sometimes load in the

wrong direction, it is important to examine data at many levels. This paper extends the S factor to a

new level, that of census tracts in the US.

Data source

While taking a video course on statistical learning based on James, Witten, Hastie, & Tibshirani (2013),

I noted that a dataset used as an example would be useful for an S factor analysis. The dataset concerns

506 census tracts of Boston and includes the following variables (Harrison & Rubinfeld, 1978):

• Median value of owner-occupied homes

• Average number of rooms in owner units.

• Proportion of owner units built before 1940.

• Proportion of the population that is 'lower status'. “Proportion of adults without, some high

school education and proportion of male workers classified as laborers)”.

• Crime rate.

• Proportion of residential land zoned for lots greater than 25k square feet.

• Proportion of nonretail business acres.

• Full value property tax rate.

• Pupil-teacher ratios for schools.

• Whether the tract bounds the Charles River.

• Weighted distance to five employment centers in the Boston region.

• Index of accessibility to radial highways.

• Nitrogen oxide concentration. A measure of air pollution.

• Proportion of African Americans.

See the original paper for a more detailed description of the variables.

This dataset has become very popular as a demonstration dataset in machine learning and statistics

which shows the benefits of data sharing (Wicherts & Bakker, 2012). As Gilley & Pace (1996) note

“Essentially, a cottage industry has sprung up around using these data to examine alternative statistical

techniques.”. However, as they re-checked the data, they found a number of errors. The corrected data

can be downloaded here, which is the dataset used for this analysis.

The proportion of African Americans

The variable concerning African Americans have been transformed by the following formula: 1000(x

- .63)

2

. Because one has to take the square root to reverse the effect of taking the square, some

information is lost. For example, if we begin with the dataset {2, -2, 2, 2, -2, -2} and take the square of

these and get {4, 4, 4, 4, 4, 4}, it is impossible someone to reverse this transformation and get the

original because they cannot tell whether 4 results from -2 or 2 being squared.

In case of the actual data, the distribution is shown in Figure 1.

Due to the transformation, the values around 400 actually mean that the proportion of blacks is around

0. The function for back-transforming the values is shown in Figure 2.

We can now see the problem of back-transforming the data. If the transformed data contains a value

between 0 and about 140, then we cannot tell which original value was with certainty. For instance, a

transformed value of 100 might correspond to an original proportion of .31 or .95.

To get a feel for the data, one can use the Racial Dot Map explorer and look at Boston. Figure 3 shows

the Boston area color-coded by racial groups.

As can be seen, the races tend to live rather separate with large areas dominated by one group. From

looking at it, it seems that Whites and Asians mix more with each other than with the other groups, and

that African Americans and Hispanics do the same. One might expect this result based on the groups'

relative differences in S factor and GCA (Fuerst, 2014). Still, this should be examined by numerical

analysis, a task which is left for another investigation.

Still, we are left with the problem of how to back-transform the data. The conservative choice is to use

only the left side of the function. This is conservative because any proportion above .63 will get back-

transformed to a lower value. E.g. .80 will become .46, a serious error. This is the method used for this

analysis.

Factor analysis

Of the variables in the dataset, there is the question of which to use for S factor analysis. In general

when doing these analyses, I have sought to include variables that measure something

socioeconomically important and which is not strongly influenced by the local natural environment.

For instance, the dummy variable concerning the River Charles fails on both counts. I chose the

following subset:

• Median value of owner-occupied homes

• Average number of rooms in owner units.

• Proportion of the population that is 'lower status'.

• Crime rate.

• Pupil-teacher ratios for schools.

• Nitrogen oxide concentration. A measure of air pollution.

Which concern important but different things. Figure 4 shows the loadings plot for the factor analysis

(reversed).

2

The S factor was confirmed for this data without exceptions, in that all indicator variables loaded in the

expected direction. The factor was moderately strong, accounting for 47% of the variance.

Relationship between S factor and proportions of African Americans

Figure 5 shows a scatter plot of the relationship between the back-transformed proportion of African

Americans and the S factor.

We see that there is a wide variation in S factor even among tracts with no or very few African

Americans. These low S scores may be due to Hispanics or simply reflect the wide variation within

Whites (there few Asians back then). The correlation between proportion of African Americans and S is

-.36 [CI95 -0.43; -0.28].

We see that many very low S points lie around S [-3 to -1.5]. Some of these points may actually be

census tracts with very high proportions of African Americans that were back-transformed incorrectly.

Discussion

The value of r = -.36 should not be interpreted as an estimate of effect size of ancestry on S factor for

census tracts in Boston because the proportions of the other sociological races were not used. A

multiple regression or similar method with all sociological races as the predictors is necessary to

answer this question. Still, the result above is in the expected direction based on known data concerning

the mean GCA of African Americans, and the relationship between GCA and socioeconomic outcomes

(Gottfredson, 1997).

Limitations

The back-transformation process likely introduced substantial error in the results.

Data are relatively old and may not reflect reality in Boston as it is now.

Supplementary material

Data, high quality figures and R source code is available at the Open Science Framework repository.

References

Fuerst, J. (2014). Ethnic/Race Differences in Aptitude by Generation in the United States: An

Exploratory Meta-analysis. Open Differential Psychology. Retrieved from

http://openpsych.net/ODP/2014/07/ethnicrace-differences-in-aptitude-by-generation-in-the-

united-states-an-exploratory-meta-analysis/

Gilley, O. W., & Pace, R. K. (1996). On the Harrison and Rubinfeld data. Journal of Environmental

Economics and Management, 31(3), 403–405.

Gottfredson, L. S. (1997). Why g matters: The complexity of everyday life. Intelligence, 24(1), 79–132.

http://doi.org/10.1016/S0160-2896(97)90014-3

Gottfredson, L. S. (2002). Where and Why g Matters: Not a Mystery. Human Performance, 15(1-2),

25–46. http://doi.org/10.1080/08959285.2002.9668082

Harrison, D., & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal

of Environmental Economics and Management, 5(1), 81–102.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (Eds.). (2013). An introduction to statistical

learning: with applications in R. New York: Springer.

Jensen, A. R. (1998). The g factor: the science of mental ability. Westport, Conn.: Praeger.

Kirkegaard, E. O. W. (2014a). Crime, income, educational attainment and employment among

immigrant groups in Norway and Finland. Open Differential Psychology. Retrieved from

http://openpsych.net/ODP/2014/10/crime-income-educational-attainment-and-employment-

among-immigrant-groups-in-norway-and-finland/

Kirkegaard, E. O. W. (2014b). The international general socioeconomic factor: Factor analyzing

international rankings. Open Differential Psychology. Retrieved from

http://openpsych.net/ODP/2014/09/the-international-general-socioeconomic-factor-factor-

analyzing-international-rankings/

Kirkegaard, E. O. W. (2015). Examining the S factor in Mexican states. The Winnower. Retrieved from

https://thewinnower.com/papers/examining-the-s-factor-in-mexican-states

Kirkegaard, E. O. W., & Tranberg, B. (In preparation). What is a good name? The S factor in Denmark

at the name-level. Open Differential Psychology. Retrieved from https://osf.io/t2h9c/

Rindermann, H. (2007). The g-factor of international cognitive ability comparisons: the homogeneity

of results in PISA, TIMSS, PIRLS and IQ-tests across nations. European Journal of

Personality, 21(5), 667–706. http://doi.org/10.1002/per.634

Wicherts, J. M., & Bakker, M. (2012). Publish (your data) or (let the data) perish! Why not publish

your data too? Intelligence, 40(2), 73–76. http://doi.org/10.1016/j.intell.2012.01.004

1 Capital S is used when the data are aggregated, and small s is used when it is individual level data. This follows the

nomenclature of (Rindermann, 2007).

2 To say that it is reversed is because the analysis gave positive loadings for undesirable outcomes and negative for

desirable outcomes. This is because the analysis includes more indicators of undesirable outcomes and the factor

analysis will choose the direction to which most indicators point as the positive one. This can easily be reversed by

multiplying with -1.

Figure 1: Transformed data for the proportion of blacks by census

tract.

Figure 2: The transformation function.

Figure 3: Racial dot map of Boston area.

Figure 4: Loadings plot for the S factor.

Figure 5: Scatter plot of S scores and the back-transformed

proportion of African Americans by census tract in Boston.