
What is a good name? The S factor in Denmark at the name-level

  • Ulster Institute for Social Research

Emil O. W. Kirkegaard1
Bo Tranberg2
We present and analyze data from a dataset of 2358 Danish first names and socioeconomic
outcomes not previously made available to the public (Navnehjulet, the Name Wheel). We visualize
the data and show that there is a general socioeconomic factor with indicator loadings in the
expected directions (positive: income, owning your own place; negative: having a criminal
conviction, being without a job). This result holds after controlling for age and for each gender
alone. It also holds when analyzing the data in age bins. The factor loading of being married
depends on analysis method, so it is more difficult to interpret.
A pseudofertility is calculated based on the population size for the names for the years 2012 and
2015. This value is negatively correlated with the S factor score r = -.35 [95CI: -.39; -.31], but the
relationship seems to be somewhat non-linear and there is an upward trend at the very high end of
the S factor. The relationship is strongly driven by relatively uncommon names who have high
pseudofertility and low to very low S scores. The n-weighted correlation is -.21 [95CI: -.25; -.17].
This dysgenic pseudofertility seems to be mostly driven by Arabic and African names.
All data and R code is freely available.
Key words: names, Denmark, Danish, social status, crime, income, education, age, scraping, S
factor, general socioeconomic factor
It has been noted that good outcomes tend to go together, but to our knowledge, the factor structure
of such relationships had not been examined until recently (Kirkegaard, 2014c). When it has been examined,
it has repeatedly been found that there is a general socioeconomic factor on which good outcomes
nearly always have positive loadings and bad outcomes have negative loadings.3 Recent studies
have examined S factors at the national, regional/state and country of origin-level; see (Kirkegaard,
2015c) for a review of regional/state-level studies, and (Kirkegaard, 2014a) for country of origin-level
studies. In this paper we exploit a unique dataset to examine the S factor at the name-level in
Denmark.
1 University of Aarhus, Department of Culture and Society. Email:
2 University of Aarhus, Department of Physics and Astronomy. Email:
3 Note that sometimes a factor is reversed such that the good outcomes have negative loadings, and the bad outcomes
have positive loadings. This reversing is quite arbitrary and depends on the balance of good and bad variables
included in the analysis. A preponderance of bad variables means that the factor will be reversed. If the factor is
thus reversed, one can just multiply all loadings by -1 to unreverse it.
The dataset
Last year the Danish newspaper Ugebrevet A4 published an interactive infographic called
"Navnehjulet" ("the Name Wheel"). It's simple: you just enter a first name and it shows you some
numbers about that name. The data was initially bought from Statistics Denmark and is based on
2012 data. There is no option available to download the dataset. A screenshot of the Name Wheel is
shown in Figure 1.
The more technical aspects of the scraping (“automatic downloading of the data”) are covered
elsewhere (Tranberg, 2015); here we focus on the data and the statistical analyses.
The statistical information shown for each name varies (presumably due to data availability), but in
the cases with full data, it includes:
1. Number of persons with the name.
2. 3 most common job types.
3. 3 most common living areas.
4. Average age.
5. Percents who rent and own their home. Note that this does not always sum to 100%.
6. Percentage with at least one conviction in the last 5 years.
7. Average monthly income in DKK.
8. Marital status (married, cohabiting, registered partner4, single).
9. Employee rate.
10. Student rate.
11. Outside the job market rate.
12. Independent rate.
13. Unemployment rate.
14. Chief executive rate.
4 This is a pre-2012 category as an alternative to marriage for same sex couples. One can no longer attain this legal
status, but one can retain it if one acquired it before 2012. See (Danish).
Figure 1: A screenshot of the Name Wheel with "Emil" entered.
Of note is that the unemployment variable includes only those who spent at least half the year
without work or who received dagpenge (a kind of unemployment benefit). The outside the job
market variable includes heterogeneous groups: førtidspensionister (pre-time retirees),
folkepensionister (ordinary retirees), efterlønsmodtagere (another type of pre-time retirement),
kontanthjælpsmodtagere (another type of unemployment benefit), and andre (others). As such, this
last variable is a mixture of situations that are normal (ordinary pension, efterløn) and some which
are used by unproductive members of society (førtidspension, kontanthjælp). Thus, interpretation of
that variable is not straightforward. There is a more detailed description of the variables available at
the website. We have taken a copy of this in case the site goes down (see supplementary material; in
We downloaded the data for all variables for each of the 2358 names in the database. The gender of
the names was usually not marked, but because they were sorted by gender, we could easily assign
them genders. The gender distribution is 1266 females and 1092 males, or 54% female. This is a
higher female percentage than in the actual population (50.3%5). This seems to be due to females
simply having a greater diversity of names. Table 1 shows the top 20 most common names by
gender.
5 Data from table FOLK1, year 2015Q1. Danish Statistics Agency.
Rank  Name (F)   Thousands  Name (M)   Thousands
1     Anne       46.690     Peter      49.550
2     Kirsten    43.405     Jens       48.506
3     Hanne      39.680     Lars       45.507
4     Mette      39.007     Michael    45.322
5     Anna       34.995     Henrik     42.775
6     Helle      34.346     Thomas     42.134
7     Susanne    31.593     Søren      41.616
8     Lene       31.270     Jan        38.903
9     Maria      28.651     Niels      38.050
10    Marianne   27.366     Christian  37.528
11    Inge       26.186     Martin     37.151
12    Karen      25.974     Jørgen     35.608
13    Lone       25.695     Hans       35.400
14    Bente      24.845     Anders     34.613
15    Camilla    24.712     Morten     34.230
16    Pia        24.424     Jesper     34.092
17    Louise     23.847     Ole        32.746
18    Charlotte  23.804     Per        32.576
19    Jette      23.775     Mads       31.055
20    Tina       23.320     Erik       30.769
Sum              603.585               768.131
Table 1: Top 20 most common names by gender.
As can be seen, the top 20 most common female names have a smaller sum than the male sum, by
A few names had their gender marked because they were unisex names. Such names were
quite rare (36 pairs).
Missing data
There is quite a bit of missing data: 20% of names have at least some missing data. For this
reason we examined the distribution of missing data to see if some of it could fruitfully be imputed
(Donders, van der Heijden, Stijnen, & Moons, 2006). The matrix plot is shown in Figure 2.6
6 This plot is made using the matrixplot() function from the VIM package (Templ, Alfons, Kowarik, & Prantner,
2015). The 5 character/string variables are left out because due to a bug in the function, such variables are always
shown as missing all data, whereas in fact in this case none of them had any missing data.
Note: Not all cases are shown due to insufficient resolution of the image.
We see that data is not missing at random but that some cases tend to have a lot of missing data. We
also see that some variables have no missing data (unisex, number, age, conviction).
Which kind of cases have missing data? It cannot be seen from the above, but the missingness is
strongly related to the number of persons with that name, which is not surprising. The data is
limited to names where there are 100 or more persons. To see the relationship, we sort the data by
number of persons and replot the matrix plot; Figure 3.
Another way to examine missingness is to examine the distribution of cases by the number of
missing datapoints. A histogram of this is shown in Figure 4.
Figure 2: Matrix plot for missing data.
Figure 3: Matrix plot of missing data, cases sorted by number of
persons with the name.
While about 20% of the cases have 13 missing datapoints, a small number of cases (71) have
only 2 missing datapoints. These can be imputed to slightly increase the sample size.
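The selection step described above can be sketched as follows. This is a minimal illustration in Python, not the authors' R code: the records, field names and the helper names `count_missing` and `split_for_imputation` are hypothetical, and the sketch only separates cases by their number of missing datapoints without performing any imputation.

```python
from collections import Counter

# Hypothetical records: one dict per name, None marks a missing datapoint.
names = [
    {"name": "A", "number": 5000, "age": 41.0, "income": 21000.0, "married": 48.0},
    {"name": "B", "number": 150, "age": 35.0, "income": None, "married": None},
    {"name": "C", "number": 110, "age": None, "income": None, "married": None},
]


def count_missing(record):
    """Number of missing (None) datapoints in one case."""
    return sum(value is None for value in record.values())


def split_for_imputation(records, max_missing=2):
    """Split cases into complete, imputable (1..max_missing gaps) and excluded."""
    complete, imputable, excluded = [], [], []
    for record in records:
        n_missing = count_missing(record)
        if n_missing == 0:
            complete.append(record)
        elif n_missing <= max_missing:
            imputable.append(record)
        else:
            excluded.append(record)
    return complete, imputable, excluded


# Tally of cases by number of missing datapoints (the quantity behind a histogram).
missing_histogram = Counter(count_missing(r) for r in names)
complete, imputable, excluded = split_for_imputation(names)
```

The threshold of 2 missing datapoints is exposed as a parameter so that the effect of a stricter or looser cut-off on the sample size can be checked.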
Getting an overview of the data
Before running numerical analyses on data, it is important to get a solid overview of it. This is
because one can rapidly identify patterns by eye that may go unnoticed by numerical analyses. For
instance, relying on correlations can miss important non-linear patterns, which can easily be
identified by eye if data are plotted using a moving average or similar (Lubinski & Humphreys,
1996).
The classic example of this is Anscombe's quartet (“Anscombe’s quartet,” 2016), four bivariate
datasets which have (almost) the same mean of x and y, variance/standard deviation,
correlation and regression coefficients (intercept and slope). However, plotting the data reveals that
they are very different.
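The point can be checked numerically. The sketch below, written in Python with only the standard library, computes the means and the Pearson correlation for the four published Anscombe datasets; the helper name `pearson` is our own.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Anscombe's four datasets; the first three share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean of x, mean of y and correlation per dataset.
summary = {
    label: (sum(x) / len(x), sum(y) / len(y), pearson(x, y))
    for label, (x, y) in quartet.items()
}

for label, (mean_x, mean_y, r) in summary.items():
    print(f"{label}: mean x = {mean_x:.2f}, mean y = {mean_y:.2f}, r = {r:.3f}")
```

All four datasets give nearly identical summary numbers, even though dataset II is a curve and dataset IV consists of one x value plus a single outlier, which only a plot reveals.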
Histograms are the easiest way to get a quick overview of the data structure. We plot selected
histograms in Figures 5-8. The rest are available in the supplementary material.
Figure 4: Histogram of cases by number of missing datapoints.
We see a power law distribution in that most of the names are held by only a few persons, while a
few are held by many thousands. The top 20 by gender were shown in Table 1. The mean and median
number of persons per name are 2209 and 316, respectively. Since the data is limited to names with
at least 100 persons, showing the least common names is not particularly interesting. The curious reader can
consult the supplementary material (results/number_ranks.csv).
The distribution of the mean age of names is a fat normal distribution. Top 5 youngest: Elliot,
Milas, Noam, Storm, Mynte (MMMMF); oldest: Valborg, Hertha, Dagny, Magna, Erna (all F).
Figure 5: Histogram of number of persons per name. Note that the x-
axis is log-scale.
Figure 6: Histogram of ages.
The income distribution is fairly normal with a long right tail. Presumably, a few very rich people
with uncommon names result in those names having very high incomes. The top scores are: Renè
(M), Leise (F), Frants (M), Heine (M) and Thorleif (M). The bottom scorers are dominated by
names that are very young and thus have very low incomes, e.g. Alberte (mean age 8, mean income
4893 DKK). These are of little interest so we shall not mention them.
It is clear that some names are much more criminal than others. The top scorers are: Alaa, Ferhat,
Walid, Rachid, Fadi (all male). The female top scorer is Vesna (top #51). These names are all
foreign, mostly Arabic, except the female name, which is reportedly Slovenian. This result is expected
because persons from Muslim countries
are highly overrepresented in crime statistics (Kirkegaard & Fuerst, 2014).
Variables by age and gender
Since the mean age of the names has central importance to the other variables (e.g. income) and
since gender is a suitable dichotomous variable, we plot the other variables by age and gender.
These are shown in Figures 9 to 17.
Figure 7: Histogram of incomes.
Figure 8: Histogram of mean convictions past 5 years.
We see the familiar pattern in that men earn more money than women. The difference is stable until
about age 45 where it increases. Interpretation is difficult because the data is cross-sectional, not
longitudinal and hence there are both age and cohort differences between the names. Still, one
would expect something to happen at about that age that increases the difference.
It is well-known that crime tends to be committed by younger males, and we see the same pattern here.
Recall that this is the percentage of persons with the name who have at least one conviction in the last 5
years. Thus, it has a bit of lag, which is probably why it is fairly high even for men in their 40s:
they could have gotten their conviction at age 35.
Figure 9: Income by age and gender.
Figure 10: Convictions by age and gender.
This variable is the odd one comprising both regular pensions as well as some unemployment
benefits and other benefits given to people who cannot/won't work (e.g. who had a work accident,
have severe psychological problems, are just lazy). As expected, it goes up heavily with age as
people go on pension.
There are known gender differences in rates of self-employment, and we see it here as well at all
ages. It seems to increase over the lifespan a bit being at maximum value perhaps around 45-50.
Figure 11: Being outside the job market by age and gender.
Figure 12: Being independently employed by age and gender.
This one is interesting in that it has an odd pattern at old age. Our guess is that the men who are
married tend to live longer which explains the male pattern, while the female pattern is explained by
the fact that women live longer than men and their husbands die off before them, leaving them
widowed (unmarried). In discussion with EOWK, A. J. Figueredo suggested that it may be due to
serial monogamy. Simply put, some men divorce their aging wives and marry a younger one. This
would tend to keep men married at older ages as well as decreasing the marriage rate of older
women.
This one is odd in that at middle age around 30 more women have their own home, but men catch
up later. One could think of it as men making an earlier investment of their resources into career,
while women are more interested in getting a home. And when men’s careers get going at age 45
and above, they acquire their homes. Again, due to the cross-sectional data, it is difficult to say.
Figure 14: Owning a home by age and gender.
Figure 13: Marital status by age and gender.
This is the variable for 'pure' unemployment. The gender difference is only slight at early to mid
ages, while it reverses in direction at older ages. It is somewhat odd that it is highest around age 35.
Girls and women generally acquire more formal education than men and we see it in the data here
as well.
Figure 16: Being a student by age and gender.
Figure 15: No job by age and gender.
Finally, there is a clear gender and age pattern in being an executive. Males are more likely at all
ages, but there is an increase around age 45, especially for men. This is presumably the explanation
of the pattern seen for income in Figure 9.
Is there an S factor among names?
Some of the variables are (almost) linearly dependent on each other. The two housing variables
(owning and renting one's home) sum to nearly 100, so using both in an analysis would perhaps
cause problems. The same is true for the 4 civil status variables (married, cohabiting, reg.
partnership, single), and the 6 employment variables (no.job, employee, outside the job market,
student, independent, executive). To be safe, one should probably not pick more than one from each
of these three sets.
To do a factor analysis we must however pick some of them. We decided on the following: no.job,
home ownership, married, conviction and income. The expectation is that no.job and conviction will
have negative loadings, while home ownership and income will have positive, and marriage perhaps
somewhat positive (Herrnstein & Murray, 1994).
What we want to measure is the general socioeconomic status factor (if it exists). However, gender
can disrupt the analysis. This is because men earn more money but are also more criminal. This may
lead to gender specific variance, which is error in the factor analysis. One could regress out the
effect of gender, but we instead divide the dataset into two which also allows for easier
interpretation of the results.
Age has a strong influence on the variables which may disrupt results. For instance, a very young
name will have lower income and a low conviction rate, which will result in high mixedness
(Kirkegaard, 2015b). For this reason, we use both the original variables for analysis and a version of
them where the effect of age has been partialed out. To do this, we regress every variable on age, age²
and age³.
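As a rough illustration of this correction, the following Python sketch regresses a variable on age, age² and age³ by ordinary least squares and keeps the residuals. It is not the authors' R code: the function names and the example data are hypothetical, and age is standardized before the powers are formed purely for numerical stability (this leaves the residuals unchanged, because the standardized cubic spans the same space as the raw one).

```python
from math import sqrt


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def partial_out_age(values, ages):
    """Residuals of values after an OLS fit on a cubic polynomial in age."""
    n = len(ages)
    mean_age = sum(ages) / n
    sd_age = sqrt(sum((a - mean_age) ** 2 for a in ages) / n)
    z = [(a - mean_age) / sd_age for a in ages]  # standardized age
    design = [[1.0, t, t ** 2, t ** 3] for t in z]
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(4)] for i in range(4)]
    xty = [sum(row[i] * v for row, v in zip(design, values)) for i in range(4)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in design]
    return [v - f for v, f in zip(values, fitted)]


# Hypothetical mean ages and mean incomes (thousand DKK) for ten names.
ages = [8, 15, 22, 30, 38, 45, 52, 60, 70, 80]
income_k = [4.9, 6.0, 12.5, 22.0, 27.5, 29.0, 28.0, 24.0, 17.0, 14.0]
residuals = partial_out_age(income_k, ages)
```

By construction the residuals have zero linear relationship with age, which matches the zero age correlations reported above the diagonal in the correlation matrices; any remaining age effect would have to be of a form that a cubic cannot capture.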
Some cases had some missing datapoints (refer back to Figure 2). We imputed the cases with 2 or
fewer missing datapoints and excluded the rest.
Correlation matrices
Figure 17: Being an executive by age and gender.
Before looking at the factor analysis results, we will look at the correlation matrices by gender and
together, as well as with and without partialing out the effects of age; Tables 2-4.
            no.job  own home  married  conviction  income  age
no.job               -0.43     0.42     0.28       -0.27   0.00
own home    -0.50               0.08    -0.31        0.68   0.00
married      0.18     0.43              0.00        0.09   0.00
conviction   0.42    -0.44    -0.13                -0.06   0.00
income      -0.21     0.73     0.45    -0.12                0.00
age         -0.28     0.58     0.57    -0.26        0.48
Table 2: Correlation matrix of S variables for both genders. Above diag., age partialed out.
            no.job  own home  married  conviction  income  age
no.job               -0.46     0.34     0.51       -0.37   0.00
own home    -0.53               0.13    -0.39        0.75   0.00
married      0.03     0.54              0.02        0.03   0.00
conviction   0.63    -0.59    -0.29                -0.35   0.00
income      -0.29     0.78     0.46    -0.40                0.00
age         -0.21     0.63     0.71    -0.35        0.60
Table 3: Correlation matrix of S variables for men. Above diag., age partialed out.
            no.job  own home  married  conviction  income  age
no.job               -0.43     0.55     0.26       -0.21   0.00
own home    -0.47              -0.12    -0.31        0.69   0.00
married      0.34     0.27              0.00       -0.04   0.00
conviction   0.42    -0.41    -0.06                -0.23   0.00
income      -0.12     0.74     0.39    -0.18                0.00
age         -0.33     0.56     0.46    -0.28        0.45
Table 4: Correlation matrix of S variables for women. Above diag., age partialed out.
Below the diagonal, one can see that the linear effect of age is often substantial, while above the
diagonal, the linear effect of age is zero, meaning that generally the partialization worked, at least
linearly speaking. Generally, the relationships were similar across gender. There are some
exceptions. To make them easier to see, Table 5 shows the delta (difference) correlation matrix.
            no.job  own home  married  conviction  income  age
no.job               -0.03    -0.20     0.25       -0.16   0.00
own home    -0.06               0.25    -0.09        0.06   0.00
married     -0.31     0.28              0.03        0.07   0.00
conviction   0.20    -0.18    -0.23                -0.13   0.00
income      -0.17     0.04     0.07    -0.22                0.00
age          0.12     0.06     0.26    -0.07        0.15
Table 5: Delta correlation matrix for genders. Higher values mean men's correlations are stronger.
The largest difference for the age-partialed data is the relationship between being married and
having no job (recall that this does not include those pensioned). Among female names, there is a
strong relationship between unemployment and being married, perhaps because women are more
often reliant on their husbands (being a homemaker) than the reverse. Both correlations were
positive, however. It could also have something to do with Muslim immigrants (about 10% of the population)
who are often married and where a large fraction of the women are unemployed.
Factor analyses
The loadings plots are shown in Figure 18.
The factors were not particularly strong, as shown in Table 6.
analysis Var%
S.Both 0.43
S.Male 0.51
S.Female 0.40
S.BothNA 0.35
S.MaleNA 0.39
S.FemaleNA 0.35
Table 6: Variance explained by S factors.
The factors decreased in size after correcting for age, which could be because age was inflating the
factor size, or because the correction was too strong. The gender difference in the marriage indicator
is strong: about 0 vs. about .5 after age correction. Notice that the home-ownership variable has
loadings near 1, so the S factor is about equal to that variable in these datasets. This is probably an
indicator sampling error that would be corrected if more indicators of greater diversity were available.7 Some previous S
factor studies have found the same when only a few indicators were used, e.g. Kirkegaard (2015a,
first analysis).
Still, the factor loadings are in the expected directions for all variables in all analyses.
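To make the idea of a general factor concrete, the sketch below extracts the first principal component from the age-partialed correlations for both genders reported in Table 2. Several caveats apply: this is Python rather than the authors' R code, it uses a principal component obtained by power iteration rather than the factor analysis method used in the paper (so the numbers will not match Table 6 or Figure 18), and the label "own home" for the second variable is our assumption because the variable name is missing from the extracted table.

```python
from math import sqrt

# Age-partialed correlations for both genders (above-diagonal values of Table 2).
# "own home" is an assumed label for the variable whose name is missing.
labels = ["no.job", "own home", "married", "conviction", "income"]
R = [
    [1.00, -0.43, 0.42, 0.28, -0.27],
    [-0.43, 1.00, 0.08, -0.31, 0.68],
    [0.42, 0.08, 1.00, 0.00, 0.09],
    [0.28, -0.31, 0.00, 1.00, -0.06],
    [-0.27, 0.68, 0.09, -0.06, 1.00],
]


def first_component(matrix, iterations=500):
    """Dominant eigenvalue and unit eigenvector via power iteration."""
    n = len(matrix)
    v = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sqrt(sum(x * x for x in w))  # norm of R v for unit v
        v = [x / eigenvalue for x in w]
    return eigenvalue, v


eigenvalue, vector = first_component(R)
loadings = {label: x * sqrt(eigenvalue) for label, x in zip(labels, vector)}

# The sign of a component is arbitrary; orient it so that income loads positively.
if loadings["income"] < 0:
    loadings = {label: -value for label, value in loadings.items()}

variance_share = eigenvalue / len(labels)

for label, value in loadings.items():
    print(f"{label}: {value:.2f}")
print(f"share of variance: {variance_share:.2f}")
```

The final sign flip mirrors the point made in the footnote on reversed factors: multiplying all loadings by -1 changes nothing of substance, so the orientation is fixed by convention.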
Given the similar factor loadings, one would also expect the extracted factor scores to be similar,
which Table 7 shows them to be.
7 Indicator sampling error is meant to be a generalized version of Jensen's psychometric sampling error, see e.g.
(Kranzler & Jensen, 1991).
Figure 18: Loadings plot for factor analyses.
S.both S.women
S.both 0.75 1.00 0.67 1.00 0.76 0.75 0.82 0.92 0.65 0.98 1.00 0.82 0.67 0.67 0.92 0.67
S.women 1.00 0.65 0.76
0.76 0.98 0.76
Table 7: Correlations between S factors across analyses.
Note: The apparently missing values are because the data does not overlap. There are no scores for
men in the S factor analyses with only women.
Using age bins instead
In the above analyses, we have analyzed data for all ages both with and without partialing the
effects of age out. However, age may be insufficiently dealt with by the chosen correction method,
and its effect may be so strong that not correcting for it also leads to spurious results. Hence we
employed a third method, that of age bins. The dataset is large enough that we can split it up into
age groups as well as gender and analyze each subgroup separately. While this does not entirely
remove the age effect, it is more likely to not introduce any spurious over-correction effects.
Concretely, we analyzed subgroups within 5 year brackets starting at age 20-25 and stopping at age
50-55. We do this for both genders together and each separately. The analysis procedure is the same
as above, namely extracting the general factor and examining the loadings and the factor sizes.
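The subgrouping step might look like the sketch below. It is an illustration in Python with hypothetical records and helper names; in particular, treating each bracket as half-open (20 up to but not including 25, and so on) is our assumption, since the text does not say how boundary ages were handled.

```python
from collections import defaultdict

# Lower edges of the seven 5-year brackets: 20-25, 25-30, ..., 50-55.
BIN_STARTS = range(20, 55, 5)


def age_bin(mean_age):
    """Label of the bracket containing mean_age, or None if outside 20-55."""
    for start in BIN_STARTS:
        if start <= mean_age < start + 5:
            return f"{start}-{start + 5}"
    return None


def group_by_bin_and_gender(records):
    """Map (age bin, gender) to the list of names falling in that subgroup."""
    groups = defaultdict(list)
    for record in records:
        label = age_bin(record["age"])
        if label is not None:
            groups[(label, record["gender"])].append(record["name"])
    return dict(groups)


# Hypothetical records for illustration only.
records = [
    {"name": "A", "gender": "F", "age": 23.4},
    {"name": "B", "gender": "M", "age": 23.9},
    {"name": "C", "gender": "M", "age": 51.0},
    {"name": "D", "gender": "F", "age": 67.2},  # outside the analyzed range
]
groups = group_by_bin_and_gender(records)
```

Each resulting subgroup would then be passed through the same factor extraction as the full dataset.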
Figures 19-21 show the factor loadings by age bin for both genders together and each separately.
Figure 19: Factor loadings by age bins, both genders together.
Figure 20: Factor loadings by age bins, males only
The most conspicuous finding is the marriage loadings which are now negative! Apparently, the
positive loadings from before were an age confound. The exception is the last two age groups where
the marriage indicator is positive, especially for the last group. The odd finding that for 50-55 year
olds, crime has a loading around 0 is presumably sampling error as well as reflecting the fact that
crime among people in their 50s is fairly rare. When the base rate is low, correlations become
weaker and factor loadings are based on the correlation patterns in the data (Ferguson, 2009). The
sample sizes are not terribly impressive: 126 to 257 names per age bin, with the smallest samples in
the last two groups. The samples for each gender separately are about half that.
For the male data, the marriage loadings are about 0. The two last age bins are again positive. The
other four loadings are somewhat stronger in males with criminality actually having stronger
(negative) correlations than unemployment. This is presumably because crime is more common
among males which means the correlations are stronger.
Finally, for the female data, marriage loadings are more strongly negative except for the last two
age bins, same as with the male data.
Figure 22 shows the factor strength by age bin and gender, together and separate.
Figure 21: Factor loadings by age bins, females only.
Generally the male-only analyses had the strongest S factors (6/7), with the female-only analyses
being above the one with both (5/7). One might interpret this as being due to the lower base rate of
crime making the correlations with the crime variable smaller for females which makes the factor
size smaller. The mixed-gender analyses usually had smaller factors, perhaps because of the
mixedness that results from combining the genders, as discussed earlier.
Pseudofertility and the S factor
Since the Name Wheel data contains the count of persons with each name in 2012, if we could find
some data for a later year for the same names, we could calculate a name-wise 'fertility', which we
shall call pseudofertility. It is the growth (or decrease) in number of persons with each name in
Denmark. This may be due to actual births, immigration or name-changes. This pseudofertility can
then be compared to the S factor score for each name to see if there is any relationship. A somewhat
negative relationship is expected due to low S immigrant names increasing their number via higher
than average fertility (at least in the first generation, (Kirkegaard, 2014b)) and immigration.
The Danish Statistics agency (Danish Statistics) maintains a web page where one can look up any
first or last name and see how many people have that name in the current year and last year. Using a
similar method to that used to scrape the data from the Name Wheel, we scraped the count data for
the years 2014 and 2015 for every name in our dataset. From these data, we calculated the
pseudofertility as the fractional increase (or decrease) of each name over both the period 2012-2015
and 2014-2015. The first should give a more reliable number since it covers a few years, as opposed
to the second which covers 1 year only. Their correlation is .95 (no outliers), so reliability was very
high.
Figure 23 shows the scatter plot of pseudofertility 2012-2015 and S factor score (age adjusted, both
genders together).
Figure 22: Factor sizes by age bin and gender
Overall, there is a medium-sized negative relationship, r = -.35 [95CI: -.39; -.31], between
pseudofertility and S factor score (age-controlled). As can be seen in the plot, this is mainly due to
the names left of 0 S (the below average). There appears to be an upward trend at the other end, but
there are relatively few datapoints, so it may be a fluke. The point sizes show that the names
creating the trend are relatively uncommon (few people have those names, relatively speaking). The
largest names cluster around S [0-1.5]. For this reason, we also calculated the weighted correlation
which is -.21 [95CI: -.25; -.17], so the effect is still reliable but substantially smaller as expected
from the inspection of the plot.
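The two quantities used here can be sketched in Python as follows. The counts and S scores are made-up illustration values, the function names are ours, and weighting each name by its 2012 count is an assumption about what "n-weighted" means; the authors' own computation was done in R.

```python
from math import sqrt


def pseudofertility(count_start, count_end):
    """Fractional increase (or decrease) in the number of persons with a name."""
    return (count_end - count_start) / count_start


def weighted_pearson(x, y, w):
    """Pearson correlation in which each case counts in proportion to its weight."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / sqrt(var_x * var_y)


# Hypothetical names: counts in 2012 and 2015, and S factor scores.
counts_2012 = [120, 400, 15000, 30000, 800]
counts_2015 = [180, 460, 14800, 29500, 790]
s_scores = [-1.8, -1.2, 0.4, 0.9, 0.1]

pf = [pseudofertility(a, b) for a, b in zip(counts_2012, counts_2015)]

# Unweighted: every name counts equally. Weighted: common names count more.
r_unweighted = weighted_pearson(s_scores, pf, [1] * len(pf))
r_weighted = weighted_pearson(s_scores, pf, counts_2012)
```

Weighting shifts the emphasis from rare names to common ones, which is why a relationship carried mostly by uncommon names shrinks once the weights are applied.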
We plotted the figure in very high resolution using vector graphics so that one can zoom in on any
given region. The reader can examine the pseudofertility_names.svg file in the supplementary
material to explore the figure. Looking at the names in the region creating the negative slope reveals
them to be almost exclusively immigrant names from Arabic or African countries, e.g.: Mohammad,
Hossein, Mostafa, Sayed, Malika, Mana, Slawomir, Omar (names from the region north of the
moving average near S = -1.5). Unfortunately the dataset does not contain information about the
immigration status of each name, so we could not exclude all of them and see if the 'dysgenic'
relationship holds without immigrants.
Thus, the name data reveals a small 'dysgenic' effect on S in line with modeling by (Kirkegaard &
Tranberg, 2015). If the trend were to continue, and assuming that everything else is equal, then the
average level of socioeconomic status would fall in Denmark and there would be increasing
socioeconomic inequality.
Discussion and conclusion
Despite being a new level of analysis (at least to us), the results were generally in line with those
from more 'traditional' country, regional/state-level and origin country-level analyses.
This dataset contained first names, but one could also analyze last names which are more familial in
nature. Such data was not available at the Name Wheel website, but it could probably be acquired
from the statistical agency if one is willing to pay.
Figure 23: Pseudofertility 2012-2015 and S factor scores. Point
sizes are proportional to the number of persons with the name.
The dataset is especially useful for researchers wishing to investigate the (in)accuracy of
stereotypes of names, see e.g. (Jussim, Cain, Crawford, Harber, & Cohen, 2009; Jussim, 2012).
As mentioned earlier, the data are an odd kind of cross-sectional data which makes it difficult to
infer causality. A given difference observed between names with a mean age of 20 and 40, could be
either an effect of age (being 20 versus 40), a cohort effect (being born in 1995 versus 1975), or
something more complicated.
The mean age of the names is tricky to interpret since the distribution of age of persons with the
name is not shown. This could be a normal distribution if the name was fashionable at some point
but then faded out. However, it could also be bi-modal. For instance, if a name was fashionable in
1965 and in 1995, there would be two groups of persons. One aged about 50 and one aged about 20.
If they are about evenly distributed the mean age of the name would be about 35 despite few people
with the name being that age.
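The arithmetic of this example is easy to verify. The sketch below uses a stylized, hypothetical name with 500 persons aged 20 and 500 aged 50.

```python
# Stylized bimodal name: two equally sized cohorts, aged 20 and 50.
ages = [20] * 500 + [50] * 500

mean_age = sum(ages) / len(ages)

# Share of persons whose age lies within 5 years of the mean age.
share_near_mean = sum(abs(age - mean_age) <= 5 for age in ages) / len(ages)

print(mean_age)         # 35.0
print(share_near_mean)  # 0.0
```

The mean age is 35 although nobody with the name is within 5 years of that age, which is why the mean alone cannot distinguish this case from a name that peaked once.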
Aside from the extra population data from Danish Statistics, the dataset only has data from one year
(2012). It would be better if data for more than one year was available. Both to avoid fluke effects,
but also to examine e.g. the effects of macroeconomics on the relationships between the variables.
To our knowledge, this is a new kind of grouped data and so methods for analyzing it have not been
well-tested. This should give some extra caution about the inferences drawn from it.
Supplementary material
Data, source code and figures are available at
References
Anscombe’s quartet. (2016, November 7). In Wikipedia. Retrieved from
Donders, A. R. T., van der Heijden, G. J. M. G., Stijnen, T., & Moons, K. G. M. (2006). Review: A
gentle introduction to imputation of missing values. Journal of Clinical Epidemiology,
59(10), 1087–1091.
Ferguson, C. J. (2009). Is psychological research really as good as medical research? Effect size
comparisons between psychology and medicine. Review of General Psychology, 13(2), 130.
Herrnstein, R. J., & Murray, C. (1994). The Bell Curve: Intelligence and Class Structure in
American Life. New York: Free Press.
Jussim, L. (2012). Social Perception and Social Reality: Why Accuracy Dominates Bias and Self-
Fulfilling Prophecy. Oxford University Press.
Jussim, L., Cain, T. R., Crawford, J. T., Harber, K., & Cohen, F. (2009). The Unbearable Accuracy
of Stereotypes. In Handbook of Prejudice, Stereotyping, and Discrimination (p. 608). Taylor
& Francis Group, LLC.
Kirkegaard, E. O. W. (2014a). Crime, income, educational attainment and employment among
immigrant groups in Norway and Finland. Open Differential Psychology. Retrieved from
Kirkegaard, E. O. W. (2014b). Criminality and fertility among Danish immigrant populations. Open
Differential Psychology. Retrieved from
Kirkegaard, E. O. W. (2014c). The international general socioeconomic factor: Factor analyzing
international rankings. Open Differential Psychology. Retrieved from
Kirkegaard, E. O. W. (2015a). Examining the S factor in Mexican states. The Winnower. Retrieved
Kirkegaard, E. O. W. (2015b). Finding mixed cases in exploratory factor analysis. The Winnower.
Retrieved from
Kirkegaard, E. O. W. (2015c). The S factor in Brazilian states. The Winnower. Retrieved from
Kirkegaard, E. O. W., & Fuerst, J. (2014). Educational attainment, income, use of social benefits,
crime rate and the general socioeconomic factor among 71 immigrant groups in Denmark.
Open Differential Psychology. Retrieved from
Kirkegaard, E. O. W., & Tranberg, B. (2015). Increasing inequality in general intelligence and
socioeconomic status as a result of immigration in Denmark 1980-2014. Retrieved from
Kranzler, J. H., & Jensen, A. R. (1991). Unitary g: Unquestioned postulate or empirical fact?
Intelligence, 15(4), 437–448.
Lubinski, D., & Humphreys, L. G. (1996). Seeing the forest from the trees: When predicting the
behavior or status of groups, correlate means. Psychology, Public Policy, and Law, 2(2),
Templ, M., Alfons, A., Kowarik, A., & Prantner, B. (2015, February 19). VIM: Visualization and
Imputation of Missing Values. CRAN. Retrieved from http://cran.r-
Tranberg, B. (2015, May 1). Data mining: “Navnehjulet.” Retrieved from
... The S factor has been found in numerous studies in the past two years at many different levels of analysis: between countries (Kirkegaard, 2014b), between regions within a country (Kirkegaard, 2015b, c, d, f, g, h, i, j, k), between districts within a city (Kirkegaard, 2015a), between people grouped by country of origin (Kirkegaard, 2014a; Kirkegaard & Fuerst, 2014) and between first names (Kirkegaard & Tranberg, 2015b). Substantial correlations with cognitive ability and demographic variables have often been reported as well. ...
... This near-identity of the factor with a single variable can happen when there is only a small number of variables. This was also found in the study of first names in Denmark (Kirkegaard & Tranberg, 2015b). ...
Two sets of socioeconomic data for 90-96 French departements were analyzed. One dataset was found in Lynn (1980) and contained four socioeconomic variables. Mixed results were found for this dataset, both with regards to the factor structure and the relationship to cognitive ability. Another dataset with 53 variables was created by compiling variables from the official French statistics bureau (Insee). This dataset contained an impure general socioeconomic (S) factor (some undesirable variables loaded positively), but after controlling for the presence of immigrants, the S factor became purer. This was especially salient for crime, unemployment and poverty variables. The two S factors correlated at r = 0.66 [CI95:0.52-0.76; N = 88]. The IQ scores from the 1950s dataset correlated at 0.33 [CI95:0.13-0.51, N = 88] with the S factor from the 2010-2015 dataset.
... This is related to aggregation issues with higher crime in wealthy urban areas. Aside from these, the pattern of loadings was much as expected and found in prior studies (Kirkegaard, 2014, 2016; Kirkegaard & Fuerst, 2017; Kirkegaard & Tranberg, 2015). Generally speaking, then, provinces that are better on one indicator, e.g. ...
Italy shows a strong north-south gradient in measures of well-being, with the northern areas being far wealthier than the southern. Less well known is that there is also a latitudinal gradient in intelligence. We combined numeracy scores based on age heaping data for Italian provinces from the censuses of 1861, 1871, and 1881 with modern data about scholastic ability from the INVALSI, and important social outcomes such as mortality and income (up to 107 provinces in analyses). We show that there is a strong stability of the intelligence differences across 150 years for the overlapping set of 69 provinces. Intelligence measured in the 1800s predicts overall well-being just as well as modern data, r’s .78 and .82, for age heaping and INVALSI, respectively. We discuss the findings in light of recent evidence of genetic differences in regional intelligence levels. Keywords: Age heaping, Intelligence, Italy, 19th century
... We used previously published name data (n = 1,903) for age adjusted social status in Denmark (Kirkegaard & Tranberg, 2015). The data originate from the Danish statistics agency (Danmarks Statistik, DST) who sold data about indicators of social status by first name to the magazine Ugebladet A4 ('The Weekly A4'). ...
It is well established that general intelligence varies in the population and is causal for variation in later life outcomes, in particular for social status and education. We linked IQ-test scores from the Danish draft test (Børge Prien Prøven, BPP) to social status for a list of 265 relatively common names in Denmark (85% male). Intelligence at the level of first name was strongly related to social status, r = .64. Ten names in the dataset were non-western, Muslim names. These names averaged an IQ of 81 (range 76-87) compared with 98 for the western, mostly Danish ones. Nonwestern names were also lower in social status, with a mean SES score of 2.66 standard deviations below that of western names. Mediation analysis showed that 30% of this very large gap can be explained by the IQ gap. Reasons for this relatively low level of mediation are discussed.
... For this reason, it was named the S factor by analogy with the g/G factor of cognitive ability (Kirkegaard, 2014b; Rindermann, 2007). S emerges not just when the units of analysis are immigrant groups in a given host country, but also when they are sovereign nations (Kirkegaard, 2014b), sub-national divisions (states, regions, counties, census tracts, city districts etc.; (Carl, 2016; Fuerst & Kirkegaard, 2016; Kirkegaard, 2016a,b)), first names (Kirkegaard & Tranberg, 2015b), and individuals (Kirkegaard & Fuerst, in print). When a dataset allows analyses at multiple levels and for multiple groups, it has been found that the factor structure is very stable (Kirkegaard, 2016a; Kirkegaard & Fuerst, in print). ...
The relationships between national IQs, Muslim% in origin countries and estimates of net fiscal contributions to public finances in Denmark (n=32) and Finland (n=11) were examined. The analyses showed that the fiscal estimates were near-perfectly correlated between countries (r = .89 [.56 to .98], n=9), and were well-predicted by national IQs (r’s .89 [.49 to .96] and .69 [.45 to .84]), and Muslim% (r’s -.75 [-.93 to -.27] and -.73 [-.86 to -.51]). Furthermore, general socioeconomic factor scores for Denmark were near-perfectly correlated with the fiscal estimates (r = .86 [.74 to .93]), especially when one outlier (Syria) was excluded (.90 [.80 to .95]). Finally, the monetary returns to higher country of origin IQs were estimated to be 917/470 Euros/person-year for a 1 IQ point increase, and -188/-86 for a 1% increase in Muslim%.
... The factor has been called the general socioeconomic factor (S factor) and is similar to the g factor of mental ability (Jensen, 1998; Kirkegaard, 2014b). The S factor has been replicated across numerous datasets at different levels of analysis (Kirkegaard, 2014a, 2014b, 2015a; Kirkegaard & Fuerst, 2014; Kirkegaard & Tranberg, 2015). ...
Two datasets of Japanese socioeconomic data for Japanese prefectures (N=47) were obtained and merged. After quality control, there were 44 variables for use in a factor analysis. Indicator sampling reliability analysis revealed poor reliability (54% of the correlations were |r| > .50). Inspection of the factor loadings revealed no clear S factor with many indicators loading in opposite than expected directions. A cognitive ability measure was constructed from three scholastic ability measures (all loadings > .90). On first analysis, cognitive ability was not strongly related to 'S' factor scores, r = -.19 [CI95: -.45 to .19; N=47]. Jensen's method did not support the interpretation that the relationship is between latent 'S' and cognitive ability (r = -.15; N=44). Cognitive ability was nevertheless related to some socioeconomic indicators in expected ways. A reviewer suggested controlling for population size or population density. When this was done, a relatively clear S factor emerged. Using the best control method (log population density), indicator sampling reliability was high (93% |r|>.50). The scores were strongly related to cognitive ability r = .67 [CI95: .48 to .80]. Jensen's method supported the interpretation that cognitive ability was related to the S factor (r = .78) and not just to the non-general factor variance.
... By now, S factors have been found between countries (Kirkegaard, 2014b), twice between country-of-origin groups within countries (Kirkegaard, 2014a), numerous times within countries (reviewed in Kirkegaard, 2015c), and at the level of first names (Kirkegaard & Tranberg, 2015). This paper analyses data for 33 Colombian departments including the capital district. ...
A dataset was compiled with 17 diverse socioeconomic variables for 32 departments of Colombia and the capital district. Factor analysis revealed an S factor. Results were robust to data imputation and removal of a redundant variable. 14 of 17 variables loaded in the expected direction. Extracted S factors correlated about .50 with the cognitive ability estimate. The Jensen coefficient for the S factor for this relationship was .60.
A factor analysis was carried out on 6 socioeconomic variables for 506 census tracts of Boston. An S factor was found with positive loadings for median value of owner-occupied homes and average number of rooms in these; negative loadings for crime rate, pupil-teacher ratio, NOx pollution, and the proportion of the population of ‘lower status’. The S factor scores were negatively correlated with the estimated proportion of African Americans in the tracts r = -.36 [CI95 -0.43; -0.28]. This estimate was biased downwards due to data error that could not be corrected for.
Two methods are presented that allow for identification of mixed cases in the extraction of general factors. Simulated data is used to illustrate them.
Sizeable S factors were found across 3 different datasets (from years 1991, 2000 and 2010), which explained 56 to 71% of the variance. Correlations of extracted S factors with cognitive ability were strong ranging from .69 to .81 depending on which year, analysis and dataset is chosen. Method of correlated vectors supported the interpretation that the latent S factor was primarily responsible for the association (r’s .71 to .81).
Two datasets of socioeconomic data were obtained from different sources. Both were factor analyzed and revealed a general factor (S factor). These factors were highly correlated with each other (.79 to .95), HDI (.68 to .93) and with cognitive ability (PISA; .70 to .78). The federal district was a strong outlier and excluding it improved results. Method of correlated vectors was strongly positive for all 4 analyses (r’s .78 to .92 with reversing).
We argue that if immigrants have a different mean general intelligence (g) than their host country and if immigrants generally retain their mean level of g, then immigration will increase the standard deviation of g. We further argue that inequality in g is an important cause of social inequality, so increasing it will increase social inequality. We build a demographic model to analyze change in the mean and standard deviation of g over time and apply it to data from Denmark. The simplest model, which assumes no immigrant gains in g, shows that g has fallen due to immigration from 97.1 to 96.4, and that for the same reason standard deviation has increased from 15.04 to 15.40, in the time span 1980 to 2014.
I present new predictive analyses for crime, income, educational attainment and employment among immigrant groups in Norway and crime in Finland. Furthermore I show that the Norwegian data contains a strong general socioeconomic factor (S) which is highly predictable from country-level variables (National IQ .59, Islam prevalence -.71, international general socioeconomic factor .72, GDP .55), and correlates highly (.78) with the analogous factor among immigrant groups in Denmark. Analyses of the prediction vectors show very high correlations (generally > ±.9) between predictors which means that the same variables are relatively well or weakly predicted no matter which predictor is used. Using the method of correlated vectors shows that it is the underlying S factor that drives the associations between predictors and socioeconomic traits, not the remaining variance (all correlations near unity).
Many studies have examined the correlations between national IQs and various country-level indexes of well-being. The analyses have been unsystematic and not gathered in one single analysis or dataset. In this paper I gather a large sample of country-level indexes and show that there is a strong general socioeconomic factor (S factor) which is highly correlated (.86-.87) with national cognitive ability using either Lynn and Vanhanen's dataset or Altinok's. Furthermore, the method of correlated vectors shows that the correlations between variable loadings on the S factor and cognitive measurements are .99 in both datasets using both cognitive measurements, indicating that it is the S factor that drives the relationship with national cognitive measurements, not the remaining variance.
We obtained data from Denmark for the largest 70 immigrant groups by country of origin. We show that three important socioeconomic variables are highly predictable from the Islam rate, IQ, GDP and height of the countries of origin. We further show that there is a general immigrant socioeconomic factor and that country of origin national IQs, Islamic rates, and GDP strongly predict immigrant general socioeconomic scores.
Social Perception and Social Reality reviews the evidence in social psychology and related fields and reaches three conclusions: 1. Although errors, biases, and self-fulfilling prophecies in person perception are real, reliable, and occasionally quite powerful, on average, they tend to be weak, fragile and fleeting; 2. Perceptions of individuals and groups tend to be at least moderately, and often highly accurate; and 3. Conclusions based on the research on error, bias, and self-fulfilling prophecies routinely greatly overstate their power and pervasiveness, and consistently ignore evidence of accuracy, agreement, and rationality in social perception. The weight of the evidence – including some of the most classic research widely interpreted as testifying to the power of biased and self-fulfilling processes – is that interpersonal expectations relate to social reality primarily because they reflect rather than cause social reality. This is the case not only of teacher expectations, but also social stereotypes, both as perceptions of groups, and as the bases of expectations regarding individuals. The time is long overdue to replace cherry-picked and unjustified stories emphasizing error, bias, the power of self-fulfilling prophecies and the inaccuracy of stereotypes with conclusions that more closely correspond to the full range of empirical findings, which includes multiple failed replications of classic expectancy studies, meta-analyses consistently demonstrating small or at best moderate expectancy effects, and high accuracy in social perception.