Project

The S factor

Goal: To understand the structure of social inequality


Project log

Emil O. W. Kirkegaard
added a research item
A dataset of the relative general social status (S factor) of 1,890 first names of persons living in Denmark was obtained from a previous study. 1,100 linguistic features were generated based on n-grams augmented by regex and each name was scored on each feature. An initial check using t-tests showed strong signal in the features taken as a whole (42.5 % of p values were < .05), and that this was due mostly to low status names having rarer patterns. OLS and lasso regression were used to combine the linguistic features into a single model. The results showed strong evidence of signal in the data. As a control, the main geographic origin of each name was inferred using data from behindthename.com. I validated this by comparing social status by origin group with data from official sources, r = .72, n = 28. The main origin for each name was then entered as a covariate and models were rerun. The results indicated that subtle linguistic features still provide substantial incremental validity, though a precise numerical estimate was difficult to arrive at. I validated this conclusion by training the model only on the subset of data identified as Danish. Model out of sample predictive validity was substantial in general, r = .75 (including origin covariate), and r = .46 in the Danish subset (linguistic features only). I conclude that it is possible to train fairly accurate social status predictors from subtle linguistic patterns in names. It's possible that humans might pick up on such cues to inform social perception when limited data is available.
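The feature-generation step described above can be illustrated with a minimal sketch. It assumes plain character bigrams with `^` and `$` marking the start and end of a name. The abstract does not give the exact n-gram and regex scheme, so the anchors, the function names, and the three example names are all hypothetical.

```python
def char_ngrams(name, n):
    """Return the set of character n-grams in a name.

    The name is lowercased and wrapped in '^' and '$' so that
    name-initial and name-final patterns become distinct features
    (an assumption made for this sketch).
    """
    padded = f"^{name.lower()}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def feature_matrix(names, n=2):
    """Score each name (row) on each n-gram feature (column) as 0/1."""
    vocab = sorted(set().union(*(char_ngrams(nm, n) for nm in names)))
    rows = [[1 if g in char_ngrams(nm, n) else 0 for g in vocab]
            for nm in names]
    return vocab, rows


# Hypothetical example names, not taken from the dataset.
names = ["Anna", "Hanne", "Brian"]
vocab, rows = feature_matrix(names, n=2)
```

Each row of `rows` could then serve as the predictor vector for one name in a regression on its S score.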
Emil O. W. Kirkegaard
added a research item
We compiled cognitive, ethnic, and socioeconomic data for the 63 provinces of Vietnam. The cognitive data came from math and reading achievement tests administered to 70,000 fifth-graders in 2001 (World Bank, 2004). Ethnic and socioeconomic data were coded from various official sources (e.g., The General Statistics Office of Vietnam). Analysis of the socioeconomic data revealed a general factor (S) that was robust to variations in extraction method and controls. The average cognitive ability of the provinces correlated .47 with the S factor. The strongest predictor of S, however, was ethnicity. Specifically, the percent of Vietnamese (Kinh) within each province correlated .74 with S. Moreover, this effect was not mediated by cognitive ability. The lack of mediation is inconsistent with results from earlier studies that examined relations between ethnicity, cognitive ability, and socioeconomic outcomes (see, e.g., Fuerst & Kirkegaard, 2016). Also inconsistent with prior studies, although latitude correlated positively with cognitive ability, it did so inversely with the S factor. We discuss several potential hypotheses for why these discrepant effects occurred.
Emil O. W. Kirkegaard
added a research item
Differences in intelligence have previously been found to be related to a wide range of inter-individual and international social outcomes. There is evidence indicating that intelligence differences are also related to different regional outcomes within nations. A quantitative and narrative review is provided for twenty-two countries (number of regions in parentheses): Argentina (24 to 437), Brazil (27 to 31), British Isles (12 to 392), to 79), Spain (15 to 48), Switzerland (47), Turkey (12), the USA (30 to 3100), and Vietnam (61). Between regions, intelligence is significantly associated with a wide range of economic, social, and demographic phenomena, including income (r unweighted = .56), educational attainment (r unweighted = .59), health (r unweighted = .49), general socioeconomic status (r unweighted = .55), and negatively with fertility (r unweighted = −.51) and crime (r unweighted = −.20). Proposed causal models for these differences are noted. It is concluded that regional differences in intelligence within nations warrant further focus; methodological concerns that need to be addressed in future research are detailed.
Emil O. W. Kirkegaard
added a research item
A dataset of socioeconomic, demographic and geographic data for US counties (N≈3,100) was created by merging data from several sources. A suitable subset of 28 socioeconomic indicators was chosen for analysis. Factor analysis revealed a clear general socioeconomic factor (S factor) which was stable across extraction methods and different samples of indicators (absolute split-half sampling reliability = .85). Self-identified race/ethnicity (SIRE) population percentages were strongly, but non-linearly, related to cognitive ability and S. In general, the effects of White% and Asian% were positive, while those for Black%, Hispanic% and Amerindian% were negative. The effect was unclear for Other/mixed%. The best model consisted of White%, Black%, Asian% and Amerindian% and explained 41/43% of the variance in cognitive ability/S among counties. SIRE homogeneity had a non-linear relationship to S, both with and without taking into account the effects of SIRE variables. Overall, the effect was slightly negative due to low S, high White% areas. Geospatial (latitude, longitude, and elevation) and climatological (temperature, precipitation) predictors were tested in models. In linear regression, they had little incremental validity. However, there was evidence of non-linear relationships. When models were fitted that allowed for non-linear effects of the environmental predictors, they were able to add a moderate amount of incremental validity. LASSO regression, however, suggested that much of this predictive validity was due to overfitting. Furthermore, it was difficult to make causal sense of the results. Spatial patterns in the data were examined using multiple methods, all of which indicated strong spatial autocorrelation for cognitive ability, S and SIRE (k nearest spatial neighbor regression [KNSNR] correlations of .62 to .89).
Model residuals were also spatially autocorrelated, and for this reason the models were re-fit controlling for spatial autocorrelation using KNSNR-based residuals and spatial local regression. The results indicated that the effects of SIREs were not due to spatially autocorrelated confounds except possibly for Black% which was about 50% weaker in the controlled analyses. Pseudo-multilevel analyses of both the factor structure of S and the SIRE predictive model showed results consistent with the main analyses. Specifically, the factor structure was similar across levels of analysis (states and counties) and within states. Furthermore, the SIRE predictors had similar betas when examined within each state compared to when analyzed across all states. It was tested whether the relationship between SIREs and S was mediated by cognitive ability. Several methods were used to examine this question and the results were mixed, but generally in line with a partial mediation model. Jensen's method (method of correlated vectors) was used to examine whether the observed relationship between cognitive ability and S scores was plausibly due to the latent S factor. This was strongly supported (r = .91, Nindicators=28). Similarly, it was examined whether the relationship between SIREs and S scores was plausibly due to the latent S factor. This did not appear to be the case.
Emil O. W. Kirkegaard
added a research item
Many studies have examined the correlations between national IQs and various country-level indexes of well-being. The analyses have been unsystematic and not gathered in one single analysis or dataset. In this paper I gather a large sample of country-level indexes and show that there is a strong general socioeconomic factor (S factor) which is highly correlated (.86-.87) with national cognitive ability using either Lynn and Vanhanen's dataset or Altinok's. Furthermore, the method of correlated vectors shows that the correlations between variable loadings on the S factor and cognitive measurements are .99 in both datasets using both cognitive measurements, indicating that it is the S factor that drives the relationship with national cognitive measurements, not the remaining variance.
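The method of correlated vectors used above can be sketched as follows: take each indicator's loading on the S factor, take each indicator's correlation with the cognitive measure, and correlate the two vectors across indicators. The numbers below are made up for illustration and are not values from the paper. The function names are likewise hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlated_vectors(loadings, criterion_correlations):
    """Correlate indicator loadings with indicator-criterion correlations.

    Both vectors have one entry per indicator, in the same order.
    """
    return pearson(loadings, criterion_correlations)


# Hypothetical values for five indicators (illustration only).
s_loadings = [0.90, 0.75, 0.60, -0.55, -0.80]
r_with_cognitive = [0.80, 0.70, 0.50, -0.45, -0.75]
mcv = correlated_vectors(s_loadings, r_with_cognitive)
```

A value near 1 means that indicators loading more strongly on S also relate more strongly to the cognitive measure, which is the pattern the abstract reports.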
Emil O. W. Kirkegaard
added a research item
A factor analysis was carried out on 6 socioeconomic variables for 506 census tracts of Boston. An S factor was found with positive loadings for median value of owner-occupied homes and average number of rooms in these; negative loadings for crime rate, pupil-teacher ratio, NOx pollution, and the proportion of the population of ‘lower status’. The S factor scores were negatively correlated with the estimated proportion of African Americans in the tracts r = -.36 [CI95 -0.43; -0.28]. This estimate was biased downwards due to data error that could not be corrected for.
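To illustrate what extracting a general factor from a handful of indicators involves, here is a minimal sketch. It substitutes the first principal component of the correlation matrix, found by power iteration, for a proper factor analysis, so it is not the method used in the paper. The six data rows are invented and the function names are hypothetical.

```python
from math import sqrt


def standardize(col):
    """Convert a column to z-scores (population standard deviation)."""
    n = len(col)
    m = sum(col) / n
    sd = sqrt(sum((v - m) ** 2 for v in col) / n)
    return [(v - m) / sd for v in col]


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of indicator columns."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def first_component_loadings(corr, iterations=500):
    """Loadings on the first principal component via power iteration.

    Starting from the first unit vector orients the component so that
    the first indicator loads positively.
    """
    p = len(corr)
    v = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return [x * sqrt(eigenvalue) for x in v]


# Invented values for six tracts (illustration only).
home_value = [24.0, 21.6, 34.7, 33.4, 16.5, 13.1]
rooms = [6.6, 6.4, 7.2, 7.0, 5.9, 5.6]
crime = [0.1, 0.3, 0.05, 0.08, 3.5, 8.0]

loadings = first_component_loadings(
    correlation_matrix([home_value, rooms, crime])
)
```

With these invented numbers, home value and rooms load positively and crime loads negatively, which mirrors the direction of the loadings reported in the abstract.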
Emil O. W. Kirkegaard
added an update
Project goal
To understand the structure of social inequality
Background and motivation
Social inequality in any one variable is related, more or less, to inequality in every other variable. Why is this? What determines the centrality/loading of an indicator? What causes the nexus?
 
Emil O. W. Kirkegaard
added 9 research items
Sizeable S factors were found across 3 different datasets (from years 1991, 2000 and 2010), which explained 56 to 71% of the variance. Correlations of extracted S factors with cognitive ability were strong ranging from .69 to .81 depending on which year, analysis and dataset is chosen. Method of correlated vectors supported the interpretation that the latent S factor was primarily responsible for the association (r’s .71 to .81).
Two methods are presented that allow for identification of mixed cases in the extraction of general factors. Simulated data is used to illustrate them.
We present and analyze data from a dataset of 2358 Danish first names and socioeconomic outcomes not previously made available to the public (“Navnehjulet”, the Name Wheel). We visualize the data and show that there is a general socioeconomic factor with indicator loadings in the expected directions (positive: income, owning your own place; negative: having a criminal conviction, being without a job). This result holds after controlling for age and for each gender alone. It also holds when analyzing the data in age bins. The factor loading of being married depends on analysis method, so it is more difficult to interpret. A pseudofertility is calculated based on the population size for the names for the years 2012 and 2015. This value is negatively correlated with the S factor score r = -.35 [95CI: -.39; -.31], but the relationship seems to be somewhat non-linear and there is an upward trend at the very high end of the S factor. The relationship is strongly driven by relatively uncommon names that have high pseudofertility and low to very low S scores. The n-weighted correlation is -.21 [95CI: -.25; -.17]. This dysgenic pseudofertility was mostly driven by Arabic and African names. All data and R code are freely available.
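The gap between the unweighted and the n-weighted correlation can be illustrated with a minimal sketch of a weighted Pearson correlation. Each name is weighted by the number of people who bear it. The five data points are invented to show how one rare name can drive the unweighted estimate. They are not values from the dataset, and the function name is hypothetical.

```python
from math import sqrt


def weighted_correlation(x, y, w):
    """Pearson correlation of x and y with non-negative case weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / sqrt(vx * vy)


# Invented values for five names (illustration only).
s_score = [1.0, 0.5, 0.0, -0.5, -2.0]
pseudofertility = [1.0, 1.02, 0.98, 1.01, 1.6]
bearers = [5000, 4000, 6000, 3000, 50]  # last name is rare

unweighted = weighted_correlation(s_score, pseudofertility, [1] * 5)
weighted = weighted_correlation(s_score, pseudofertility, bearers)
```

With these invented numbers the weighted correlation is still negative but much closer to zero than the unweighted one, because the rare name with low S and high pseudofertility contributes little once names are weighted by population.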
Emil O. W. Kirkegaard
added 2 research items
A dataset was compiled with 17 diverse socioeconomic variables for 32 departments of Colombia and the capital district. Factor analysis revealed an S factor. Results were robust to data imputation and removal of a redundant variable. 14 of 17 variables loaded in the expected direction. Extracted S factors correlated about .50 with the cognitive ability estimate. The Jensen coefficient for the S factor for this relationship was .60.
A dataset of 127 variables concerning socioeconomic outcomes for US states was analyzed. Of these, 81 were used in a factor analysis. The analysis revealed a general socioeconomic factor. This factor correlated .961 with one from a previous analysis of socioeconomic data for US states.
Emil O. W. Kirkegaard
added 3 research items
Two sets of socioeconomic data for 90-96 French departements were analyzed. One dataset was found in Lynn (1980) and contained four socioeconomic variables. Mixed results were found for this dataset, both with regard to the factor structure and the relationship to cognitive ability. Another dataset with 53 variables was created by compiling variables from the official French statistics bureau (Insee). This dataset contained an impure general socioeconomic (S) factor (some undesirable variables loaded positively), but after controlling for the presence of immigrants, the S factor became purer. This was especially salient for crime, unemployment and poverty variables. The two S factors correlated at r = 0.66 [CI95:0.52-0.76; N = 88]. The IQ scores from the 1950s dataset correlated at 0.33 [CI95:0.13-0.51, N = 88] with the S factor from the 2010-2015 dataset.
A dataset of 30 diverse socioeconomic variables was collected covering 32 London boroughs. Factor analysis of the data revealed a general socioeconomic factor. This factor was strongly related to GCSE (General Certificate of Secondary Education) scores (r's .683 to .786) and had weak to medium sized negative relationships to demographic variables related to immigrants (r's -.295 to -.558). Jensen's method indicated that these relationships were related to the underlying general factor, especially for GCSE (coefficients |.48| to |.69|). In multiple regression, about 60% of the variance in S outcomes could be accounted for using GCSE and one variable related to immigrants.
Two datasets of Japanese socioeconomic data for Japanese prefectures (N=47) were obtained and merged. After quality control, there were 44 variables for use in a factor analysis. Indicator sampling reliability analysis revealed poor reliability (54% of the correlations were |r| > .50). Inspection of the factor loadings revealed no clear S factor, with many indicators loading in the direction opposite to that expected. A cognitive ability measure was constructed from three scholastic ability measures (all loadings > .90). On first analysis, cognitive ability was not strongly related to 'S' factor scores, r = -.19 [CI95: -.45 to .19; N=47]. Jensen's method did not support the interpretation that the relationship is between latent 'S' and cognitive ability (r = -.15; N=44). Cognitive ability was nevertheless related to some socioeconomic indicators in expected ways. A reviewer suggested controlling for population size or population density. When this was done, a relatively clear S factor emerged. Using the best control method (log population density), indicator sampling reliability was high (93% |r|>.50). The scores were strongly related to cognitive ability r = .67 [CI95: .48 to .80]. Jensen's method supported the interpretation that cognitive ability was related to the S factor (r = .78) and not just to the non-general factor variance.
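One common way to control an indicator for log population density is to regress the indicator on it and keep the residuals. The abstract does not state the exact procedure, so the following is a sketch of that general approach and should not be read as the paper's implementation. The density and indicator values are invented, and the function name is hypothetical.

```python
from math import log


def residualize(y, x):
    """Residuals of y after simple linear regression on x (with intercept)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Invented values for five prefectures (illustration only).
density = [6000.0, 1200.0, 300.0, 150.0, 70.0]  # persons per km^2
indicator = [12.0, 8.5, 6.0, 5.5, 4.0]          # some socioeconomic rate

log_density = [log(d) for d in density]
adjusted = residualize(indicator, log_density)
```

Applying `residualize` to every indicator before the factor analysis would give indicators that are linearly uncorrelated with log population density, which is the sense of "controlling" assumed in this sketch.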