
# Finding mixed cases in exploratory factor analysis

Authors:
• Ulster Institute for Social Research

## Abstract and Figures

Two methods are presented that allow for identification of mixed cases in the extraction of general factors. Simulated data is used to illustrate them.
Emil O. W. Kirkegaard
## Introduction
General factors can be extracted from datasets where all, or nearly all, of the variables are correlated. At the
case level, such general factors are decreased in size if mixed cases are present. A mixed case is
an 'inconsistent' case according to the factor structure of the data.
A simple way of illustrating what I'm talking about is to use the matrixplot() function from the VIM
package for R (Templ, Alfons, Kowarik, & Prantner, 2015) with some simulated data.
For simulated dataset 1, start by imagining that we are measuring a general factor and that all our
Furthermore, there is no error of measurement and there is only one factor in the data (no group factors,
i.e. no hierarchical or bi-factor structure; Jensen & Weng, 1994). I have used datasets with 50 cases
and 25 variables to avoid the excessive sampling error of small samples and to keep a realistic number
of cases compared to the datasets examined in S factor studies (e.g. Kirkegaard, 2015). The matrix plot
is shown in Figure 1.
Figure 1: Matrix plot of dataset 1
No real data looks like this, but it is important to understand what to look for. Every indicator is on the
x-axis and the cases are on the y-axis. The cases are colored by their relative values, where darker
means higher values. So in this dataset we see that any case that does well on any particular indicator
does just as well on every other indicator. All the indicators have the same factor loading of 1, and the
proportion of variance explained is also 1 (100%), so there is little point in showing the loadings plot.
To move towards realism, we need to complicate this simulation in some way. The first way is to
introduce some measurement error. The amount of error introduced determines the factor loadings and
hence the size of the general factor. In dataset 2, the error amount is .5, and the signal multiplier varies
from .05 to .95, with all values equally likely (uniform distribution). The matrix and the loadings plots
are shown in Figures 2 and 3.
Figure 2: Matrix plot for dataset 2
By looking at the matrix plot we can still see a fairly simple structure. Some cases are generally darker
(or lighter) than others, but there is also a lot of noise, which is of course the error we introduced. The
loadings show quite a bit of variation. The size of this general factor is .45.
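To make the setup concrete, here is a minimal sketch in Python with NumPy of how a dataset like dataset 2 could be simulated and summarized. The exact generating process is not spelled out in the text, so the model below (indicator = signal multiplier × general factor + .5 × standard normal error) is an assumption, as are the function names `simulate_dataset` and `general_factor`. The first principal component of the correlation matrix is used as a stand-in for the general factor, so the numbers will not reproduce the .45 reported above.

```python
import numpy as np

def simulate_dataset(n_cases=50, n_vars=25, error_sd=0.5, seed=2):
    """Simulate indicators driven by one general factor plus noise (assumed model)."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(n_cases)                 # true general factor scores
    signal = rng.uniform(0.05, 0.95, size=n_vars)    # signal multiplier per indicator
    noise = rng.standard_normal((n_cases, n_vars))   # measurement error
    return g[:, None] * signal[None, :] + error_sd * noise

def general_factor(data):
    """Return first-component loadings and the proportion of variance explained."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)          # eigenvalues in ascending order
    vec = eigvecs[:, -1]
    if vec.sum() < 0:                                # the sign of an eigenvector is arbitrary
        vec = -vec
    loadings = vec * np.sqrt(eigvals[-1])
    return loadings, eigvals[-1] / corr.shape[0]

data = simulate_dataset()
loadings, var_explained = general_factor(data)
print(f"proportion of variance explained: {var_explained:.2f}")
print(f"loadings range from {loadings.min():.2f} to {loadings.max():.2f}")
```

Setting `error_sd=0.0` in this sketch gives the error-free situation of dataset 1, where the proportion of variance explained is 1.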
The next complication is to introduce the possibility of negative loadings (these are consistent with a
general factor as long as they load in the right direction; Kirkegaard, 2014). We go back to the
case of no measurement error for simplicity. Figures 4 and 5 show the matrix and loadings
plots.
The matrix plot looks odd, until we realize that some of the indicators are simply reversed. The
loadings plot shows this reversal. One could easily get back to a matrix plot like that in Figure 1 by
reversing all indicators with a negative loading (i.e. multiplying by -1). However, the possibility of
Figure 4: Matrix plot for dataset 3
For the 4th dataset, we begin with dataset 2 and create a mixed case. We do this by setting its
value on every indicator to 2, a strong positive value (about the 98th centile given a standard normal
distribution). Figure 6 shows the matrix plot. I won't bother with the loadings plot because it is not
strongly affected by a single mixed case.
Can you guess which case it is? Perhaps not. It is #50 (top line). One might expect it to be the same hue
all the way across. This, however, ignores the fact that the values in the different indicators vary due to
sampling error. So a value of 2 is not necessarily at the same centile, or equally far from the mean in
standard units, in every indicator. It is fairly close, though, which is why the color is very dark across all
indicators.
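The point about centiles can be checked with a short sketch. The simulated data below are a hypothetical stand-in for dataset 2 (indicators with unit population variance and loadings drawn uniformly from .05 to .95, which is an assumption); the purpose is only to show that the same raw value of 2 lands at a slightly different standardized distance in each indicator.

```python
import numpy as np
from statistics import NormalDist

# Centile of the value 2 under a standard normal distribution.
centile = 100 * NormalDist().cdf(2.0)
print(f"centile of 2 in N(0, 1): {centile:.1f}")   # 97.7

# Hypothetical stand-in data: 50 cases, 25 indicators with unit population variance.
rng = np.random.default_rng(4)
g = rng.standard_normal(50)
lam = rng.uniform(0.05, 0.95, size=25)              # assumed loadings
data = g[:, None] * lam + rng.standard_normal((50, 25)) * np.sqrt(1 - lam**2)

data[-1, :] = 2.0                                   # make case 50 (last row) the mixed case

# Standardized position of the value 2 within each indicator's sample distribution.
z_of_mixed = (2.0 - data.mean(axis=0)) / data.std(axis=0, ddof=1)
print(f"z of the mixed case ranges from {z_of_mixed.min():.2f} to {z_of_mixed.max():.2f}")
```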
For datasets with general factors, the highest value of a case tends to be on the most strongly loaded
indicator (Kirkegaard, 2014b), but this information is not easy to use in an eye-balling of the dataset.
Thus, it is not so easy to identify the mixed case.
roughly similar to that found in S factor analysis (there are still no true group factors in the data).
Figure 7 shows the matrix plot.
Figure 6: Matrix plot for dataset 4
Just looking at the dataset, it is fairly difficult to detect the general factor, but in fact the variance
explained is .38. The mixed case is easy to spot now (#50) since it is the only case that is consistently
dark across indicators, which is odd given that some of them have negative loadings. It 'shouldn't'
happen. The situation is however somewhat extreme in the mixedness of the case.
## Automatic detection
Eye-balling figures and data is a powerful tool for quick analysis, but it cannot give precise numerical
values that can be used for comparison between cases. To get around this, I developed two methods for
automatic identification of mixed cases.
### Method 1
A general factor only exists when multidimensional data can be usefully compressed, informationally
speaking, to 1-dimensional data (factor scores on the general factor). I encourage readers to consult the
very well-made visualization of principal component analysis (almost the same as factor analysis) at
this website. In this framework, mixed cases are those that are not well described or predicted by a
single score.
Thus, it seems to me that we can use this information as a measure of the mixedness of a case. The
method is:
1. Extract the general factor.
2. Extract the case-level scores.
3. For each indicator, regress it onto the factor scores. Save the residuals.
4. Calculate a suitable summary metric, such as the mean absolute residual, and rank the cases.
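The four steps can be sketched in code. This is a minimal illustration in Python with NumPy, not the R implementation mentioned later in the paper. It assumes the first principal component of the correlation matrix as a stand-in for the general factor; the function name `mean_absolute_residuals`, the simulated data and the alternating sign pattern of the loadings are all hypothetical.

```python
import numpy as np

def mean_absolute_residuals(data):
    """Method 1 sketch: per-case mean absolute residual after predicting
    each standardized indicator from the first-component scores."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    _, eigvecs = np.linalg.eigh(corr)                # eigenvalues in ascending order
    scores = z @ eigvecs[:, -1]                      # steps 1-2: case-level scores
    slopes = (scores @ z) / (scores @ scores)        # step 3: one simple regression per indicator
    residuals = z - np.outer(scores, slopes)         # intercepts are 0 because all variables are centered
    return np.abs(residuals).mean(axis=1)            # step 4: summary metric per case

# Hypothetical data in the spirit of dataset 5: noise plus some negative loadings.
rng = np.random.default_rng(5)
n_cases, n_vars = 50, 25
g = rng.standard_normal(n_cases)
signal = rng.uniform(0.05, 0.95, size=n_vars)
signal[1::2] *= -1                                   # assumption: every other indicator loads negatively
data = g[:, None] * signal[None, :] + 0.5 * rng.standard_normal((n_cases, n_vars))
data[-1, :] = 2.0                                    # the mixed case (case 50)

mar = mean_absolute_residuals(data)
ranking = np.argsort(mar)[::-1] + 1                  # case numbers, most mixed first
print("most mixed cases:", ranking[:3])
```

Ranking by mean absolute residual is only one possible summary; a median or a root mean square of the residuals would fit the same scheme.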
Figure 7: Matrix plot for dataset 5
Using this method on dataset 5 does in fact identify case 50 as the most mixed one. Mixedness varies
between cases due to sampling error. Figure 8 shows the histogram. The outlier on the right is case #50.
How extreme does a mixed case need to be for this method to find it? We can try reducing its
mixedness by assigning it less extreme values. Table 1 shows the effects of doing this.
| Mixedness value | Mean absolute residual |
|---|---|
| 2 | 1.91 |
| 1.5 | 1.45 |
| 1 | 0.98 |

Table 1: Mean absolute residual and mixedness
So we see that when the assigned value is 2 or 1.5, the mixed case is clearly distinguishable from the
rest of the cases, but 1 is about the limit, since the second-highest value is .80. Below this, the other
cases are similarly mixed, just due to the randomness introduced by measurement error.
### Method 2
Since mixed cases are poorly described by a single score, they don't fit well with the factor structure in
the data. Generally, this should result in the proportion of variance explained increasing when they are
removed. Thus the method is:
1. Extract the general factor from the complete dataset.
2. For every case, create a subset of the dataset where this case is removed.
3. Extract the general factors from each subset.
4. For each analysis, extract the proportion of variance explained and calculate the difference from
that of the full dataset.
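A leave-one-out loop covers these steps. The sketch below is in Python with NumPy and again uses the first principal component of the correlation matrix as a stand-in for the general factor, which is an assumption; `prop_variance`, `loo_improvement` and the simulated data are hypothetical names and values, not the paper's R code or datasets.

```python
import numpy as np

def prop_variance(data):
    """Proportion of variance explained by the first component of the correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    return np.linalg.eigvalsh(corr)[-1] / corr.shape[0]

def loo_improvement(data):
    """Method 2 sketch: change in proportion of variance explained when each case is left out."""
    full = prop_variance(data)                                  # step 1: full dataset
    return np.array([
        prop_variance(np.delete(data, i, axis=0)) - full        # steps 2-4: drop case i, refit, compare
        for i in range(data.shape[0])
    ])

# Hypothetical data: noise, alternating loading signs, and one mixed case set to 2.
rng = np.random.default_rng(5)
n_cases, n_vars = 50, 25
g = rng.standard_normal(n_cases)
signal = rng.uniform(0.05, 0.95, size=n_vars)
signal[1::2] *= -1                                              # assumption: every other indicator loads negatively
data = g[:, None] * signal[None, :] + 0.5 * rng.standard_normal((n_cases, n_vars))
data[-1, :] = 2.0                                               # the mixed case (case 50)

improvement = loo_improvement(data)
print("case with the largest improvement:", int(np.argmax(improvement)) + 1)
```

The loop refits the factor once per case, which is cheap at 50 cases but grows linearly with the number of cases.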
Figure 8: Histogram of absolute mean residuals from dataset 5
Using this method on the dataset also used above correctly identifies the mixed case. The histogram of
results is shown in Figure 9.
As with method 1, we then redo this analysis for other levels of mixedness. Results are shown in Table 2.
| Mixedness value | Improvement in proportion of variance |
|---|---|
| 2 | 1.91 |
| 1.5 | 1.05 |
| 1 | 0.50 |

Table 2: Improvement in proportion of variance and mixedness
We see the same as before, in that the mixed case is clearly identifiable as an outlier in mixedness
at both 2 and 1.5, while at 1 it is not, since the next-highest value is .45.
Large scale simulation with the above methods could be used to establish distributions to generate
confidence intervals from.
It should be noted that the improvement in proportion of variance is not independent of number of
cases (more cases means that a single case is less important, and non-linearly so), so the value cannot
simply be used to compare across cases without correcting for this problem. Correcting it is however
## Comparison of methods
The results from both methods should have some positive relationship. The scatter plot is shown in
Figure 10.
Figure 9: Histogram of differences in proportion of variance to the full analysis
We see that the truly mixed case is a strong outlier with both methods, which is good because it
really is a strong outlier. The correlation is strongly inflated because of this, to r = .70 with it, but only
.26 without it. The relative lack of a positive relationship without the true outlier is perhaps due
to range restriction in mixedness in the dataset, since the only source of mixedness besides case 50 is
measurement error. Whatever the exact interpretation, I suspect it doesn't matter, since the goal is to
find the true outliers in mixedness, not to agree on the relative ranks of the cases with relatively little
mixedness.[1]
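The inflation of the correlation by a single extreme case is easy to reproduce. The sketch below uses made-up numbers, not the mixedness values from the simulation above: 49 weakly related pairs of values plus one case that is extreme on both measures. The helper names `pearson` and `spearman` are hypothetical, and the rank-order correlation is included because it is much less sensitive to one such case.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Rank-order correlation; assumes no ties, which holds for continuous simulated values."""
    def rank(v):
        return np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

rng = np.random.default_rng(6)
x = rng.standard_normal(49)                  # made-up scores from one method
y = 0.3 * x + rng.standard_normal(49)        # weakly related scores from the other method

x_all = np.append(x, 8.0)                    # add one case that is extreme on both
y_all = np.append(y, 8.0)

r_without, r_with = pearson(x, y), pearson(x_all, y_all)
rho_without, rho_with = spearman(x, y), spearman(x_all, y_all)
print(f"Pearson r:    {r_without:.2f} without, {r_with:.2f} with the extreme case")
print(f"Spearman rho: {rho_without:.2f} without, {rho_with:.2f} with the extreme case")
```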
Figure 10: Scatter plot of method 1 and 2
[1] Concerning the agreement about rank-order, it is about .4 both with and without case 50. But this is based on a single simulation and I've seen some different values when re-running it. A large scale simulation is necessary.
## Implementation
I have implemented both of the above methods in R. They can be found in my unofficial psych2 collection of
useful functions located here.
## Supplementary material
Source code and figures are available at the Open Science Framework repository.
## References
Jensen, A. R., & Weng, L.-J. (1994). What is a good g? Intelligence, 18(3), 231–258. http://doi.org/10.1016/0160-2896(94)90029-9

Kirkegaard, E. O. W. (2014a). The international general socioeconomic factor: Factor analyzing international rankings. Open Differential Psychology. Retrieved from http://openpsych.net/ODP/2014/09/the-international-general-socioeconomic-factor-factor-analyzing-international-rankings/

Kirkegaard, E. O. W. (2014b). The personal Jensen coefficient does not predict grades beyond its association with g. Open Differential Psychology. Retrieved from http://openpsych.net/ODP/2014/10/the-personal-jensen-coefficient-does-not-predict-grades-beyond-its-association-with-g/

Kirkegaard, E. O. W. (2015). Examining the S factor in US states. The Winnower. Retrieved from https://thewinnower.com/papers/examining-the-s-factor-in-us-states

Templ, M., Alfons, A., Kowarik, A., & Prantner, B. (2015, February 19). VIM: Visualization and Imputation of Missing Values. CRAN. Retrieved from http://cran.r-project.org/web/packages/VIM/index.html