All content in this area was uploaded by Emil O. W. Kirkegaard on Nov 16, 2015

Finding mixed cases in exploratory factor analysis

Emil O. W. Kirkegaard

Abstract

Two methods are presented for identifying mixed cases in the extraction of general factors. Simulated data are used to illustrate them.

Introduction

General factors can be extracted from datasets where all, or nearly all, of the variables are correlated. At the case level, such general factors are decreased in size if mixed cases are present. A mixed case is a case that is 'inconsistent' with the factor structure of the data.

A simple way of illustrating what I mean is to use the matrixplot() function from the VIM package for R (Templ, Alfons, Kowarik, & Prantner, 2015) with some simulated data.

For simulated dataset 1, start by imagining that we are measuring a general factor and that all our indicator variables have a positive loading on this general factor, but that this loading varies in strength. Furthermore, there is no measurement error and there is only one factor in the data (no group factors, i.e. no hierarchical or bi-factor structure; Jensen & Weng, 1994). I have used datasets with 50 cases and 25 variables to avoid the excessive sampling error of small samples and to keep the number of cases realistic compared to the datasets examined in S factor studies (e.g. Kirkegaard, 2015). The matrix plot is shown in Figure 1.

Figure 1: Matrix plot of dataset 1

No real data looks like this, but it is important to understand what to look for. The indicators are on the x-axis and the cases on the y-axis. The cases are colored by their relative values, where darker means higher. So in this dataset we see that any case that does well on one particular indicator does just as well on every other. All the indicators have the same factor loading of 1, and the proportion of variance explained is also 1 (100%), so there is little point in showing the loadings plot.
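A minimal sketch of such error-free one-factor data, written here in Python rather than the author's R (the particular multiplier values are illustrative assumptions), shows why every indicator behaves identically:

```python
import numpy as np

# Dataset 1 sketch: one factor, no measurement error.
# Each indicator is the factor times a positive constant, so after
# standardisation every indicator has a loading of exactly 1.
rng = np.random.default_rng(1)
factor = rng.standard_normal(50)          # 50 cases
multipliers = rng.uniform(0.1, 1.0, 25)   # 25 indicators (illustrative values)
data = factor[:, None] * multipliers[None, :]

# Any case that does well on one indicator does just as well on the
# rest: all pairwise indicator correlations are 1.
corr = np.corrcoef(data.T)
print(np.allclose(corr, 1.0))  # True
```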

To move towards realism, we need to complicate this simulation in some way. The first is to introduce some measurement error. The amount of error introduced determines the factor loadings and hence the size of the general factor. In dataset 2, the error amount is .5, and the signal multiplier varies from .05 to .95, with all values equally likely (uniform distribution). The matrix and loadings plots are shown in Figures 2 and 3.
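The paper does not show its simulation code; a Python sketch of dataset 2 under the stated parameters (error amount .5, signal multipliers uniform on .05 to .95) might look like the following, with the first principal component standing in for the general factor:

```python
import numpy as np

rng = np.random.default_rng(2)
n_cases, n_vars = 50, 25
factor = rng.standard_normal(n_cases)
signal = rng.uniform(0.05, 0.95, n_vars)   # per-indicator signal multiplier
error = 0.5 * rng.standard_normal((n_cases, n_vars))
data = factor[:, None] * signal[None, :] + error

# Standardise and take the first principal component's share of
# variance as the "size" of the general factor.
X = (data - data.mean(axis=0)) / data.std(axis=0)
s = np.linalg.svd(X, compute_uv=False)
size = s[0] ** 2 / (s ** 2).sum()   # typically lands near the .45 reported
```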

Figure 2: Matrix plot for dataset 2

Figure 3: Loadings plot for dataset 2

By looking at the matrix plot we can still see a fairly simple structure. Some cases are generally darker (or lighter) than others, but there is also a lot of noise, which is of course the error we introduced. The loadings show quite a bit of variation. The size of this general factor is .45.

The next complication is to introduce the possibility of negative loadings (these are consistent with a general factor, as long as they load in the right direction; Kirkegaard, 2014a). For simplicity, we go back to the case of no measurement error. Figures 4 and 5 show the matrix and loadings plots.

The matrix plot looks odd, until we realize that some of the indicators are simply reversed. The

loadings plot shows this reversal. One could easily get back to a matrix plot like that in Figure 1 by

reversing all indicators with a negative loading (i.e. multiplying by -1). However, the possibility of

negative loadings does increase the complexity of the matrix plots.

Figure 4: Matrix plot for dataset 3

Figure 5: Loadings plot for dataset 3

For the 4th dataset, we begin with dataset 2 and create a mixed case. We do this by setting its value on every indicator to 2, a strongly positive value (the 98th centile of a standard normal distribution). Figure 6 shows the matrix plot. I won't bother with the loadings plot because it is not strongly affected by a single mixed case.
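Creating the mixed case is a one-line operation; this sketch uses a hypothetical helper name, add_mixed_case, not a function from the paper:

```python
import numpy as np

def add_mixed_case(data, case_index=-1, value=2.0):
    """Overwrite one case with the same strong value on every indicator
    (2 is roughly the 98th centile of a standard normal), making it
    inconsistent with any factor structure whose loadings vary."""
    out = data.copy()
    out[case_index, :] = value
    return out
```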

Can you guess which case it is? Perhaps not. It is #50 (the top line). One might expect it to be the same hue all the way across. This, however, ignores the fact that the values of the different indicators vary due to sampling error. So a value of 2 is not necessarily at the same centile, or equally far from the mean in standard units, in every indicator, but it is fairly close, which is why the color is very dark across all indicators.

For datasets with general factors, the highest value of a case tends to be on the most strongly loaded indicator (Kirkegaard, 2014b), but this information is not easy to use when eye-balling the dataset. Thus, it is not so easy to identify the mixed case.

Now we complicate things further by adding the possibility of negative loadings. This gets us data

roughly similar to that found in S factor analysis (there are still no true group factors in the data).

Figure 7 shows the matrix plot.

Figure 6: Matrix plot for dataset 4

Just looking at the dataset, it is fairly difficult to detect the general factor, but in fact the variance explained is .38. The mixed case is easy to spot now (#50), since it is the only case that is consistently dark across indicators, which is odd given that some of them have negative loadings. It 'shouldn't' happen. The case's mixedness is, however, somewhat extreme.

Automatic detection

Eye-balling figures and data is a powerful tool for quick analysis, but it cannot give the precise numerical values needed to compare cases. To get around this, I developed two methods for automatic identification of mixed cases.

Method 1

A general factor only exists when multidimensional data can be usefully compressed, informationally speaking, to one dimension (factor scores on the general factor). I encourage readers to consult the very well-made interactive visualization of principal component analysis (almost the same as factor analysis) at this website. In this framework, mixed cases are those that are not well described or predicted by a single score.

Thus, it seems to me that we can use this information as a measure of the mixedness of a case. The method is:

1. Extract the general factor.

2. Extract the case-level scores.

3. For each indicator, regress it onto the factor scores. Save the residuals.

4. Calculate a suitable summary metric, such as the mean absolute residual, and rank the cases.
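The four steps can be sketched in Python (the author's implementation is in R; here the first principal component approximates the general factor, which the text notes is almost the same as factor analysis):

```python
import numpy as np

def mixedness_method1(data):
    """Method 1: regress each indicator onto general-factor scores and
    summarise each case by its mean absolute residual."""
    # Standardise the indicators
    X = (data - data.mean(axis=0)) / data.std(axis=0)
    # Steps 1-2: first principal component scores as a factor-score proxy
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, 0] * s[0]
    # Step 3: regress each indicator onto the scores, keep the residuals
    resid = np.empty_like(X)
    for j in range(X.shape[1]):
        slope, intercept = np.polyfit(scores, X[:, j], 1)
        resid[:, j] = X[:, j] - (slope * scores + intercept)
    # Step 4: mean absolute residual per case; higher = more mixed
    return np.abs(resid).mean(axis=1)
```

Sorting the returned values in descending order puts the most mixed cases first.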

Using this method on dataset 5 does in fact identify case 50 as the most mixed one. Mixedness varies between cases due to sampling error. Figure 8 shows the histogram.

Figure 7: Matrix plot for dataset 5

The outlier on the right is case #50.

How extreme does a mixed case need to be for this method to find it? We can try reducing its

mixedness by assigning it less extreme values. Table 1 shows the effects of doing this.

Mixedness value    Mean absolute residual
2                  1.91
1.5                1.45
1                  0.98

Table 1: Mean absolute residual and mixedness

So we see that at 2 and 1.5 the case is clearly distinguishable from the rest, but 1 is about the limit, since the second-highest value is .80. Below this, the other cases are similarly mixed, purely due to the randomness introduced by measurement error.

Method 2

Since mixed cases are poorly described by a single score, they don't fit the factor structure of the data well. Generally, removing them should increase the proportion of variance explained. Thus the method is:

1. Extract the general factor from the complete dataset.

2. For every case, create a subset of the dataset with that case removed.

3. Extract the general factor from each subset.

4. For each analysis, extract the proportion of variance explained and calculate the difference from that of the full dataset.
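A Python sketch of these steps, again approximating the general factor with the first principal component rather than the paper's factor analysis:

```python
import numpy as np

def prop_variance_pc1(data):
    """Proportion of variance explained by the first principal
    component, used here as a stand-in for the general factor."""
    X = (data - data.mean(axis=0)) / data.std(axis=0)
    s = np.linalg.svd(X, compute_uv=False)
    return s[0] ** 2 / (s ** 2).sum()

def mixedness_method2(data):
    """Method 2: for each case, refit without it and record how much
    the proportion of variance explained improves over the full fit."""
    full = prop_variance_pc1(data)
    return np.array([
        prop_variance_pc1(np.delete(data, i, axis=0)) - full
        for i in range(data.shape[0])
    ])
```

The case whose removal most improves the variance explained is the best candidate for a mixed case.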

Figure 8: Histogram of mean absolute residuals from dataset 5

Using this method on the dataset used above correctly identifies the mixed case. The histogram of results is shown in Figure 9.

As with method 1, we then redo this analysis for other levels of mixedness. Results are shown in Table 2.

Mixedness value    Improvement in proportion of variance
2                  1.91
1.5                1.05
1                  0.50

Table 2: Improvement in proportion of variance and mixedness

We see the same pattern as before: both 2 and 1.5 are clearly identifiable as outliers in mixedness, while 1 is not, since the next-highest value is .45.

Large-scale simulation with the above methods could be used to establish distributions from which to generate confidence intervals.

It should be noted that the improvement in proportion of variance is not independent of the number of cases (more cases mean that a single case is less important, and non-linearly so), so the value cannot simply be compared across datasets without correcting for this problem. Correcting for it is, however, beyond the scope of this article.

Comparison of methods

The results from both methods should be positively related. The scatter plot is shown in Figure 10.

Figure 9: Histogram of differences in proportion of variance from the full analysis

We see that the true mixedness case is a strong outlier by both methods -- which is good, because it really is a strong outlier. The correlation is strongly inflated by this, to r = .70 with it included, but only .26 without. The relative lack of a positive relationship without the true outlier is perhaps due to range restriction in mixedness in the dataset: aside from case 50, the only mixedness present is due to measurement error. Whatever the exact interpretation, I suspect it doesn't matter, since the goal is to find the true outliers in mixedness, not to agree on the relative ranks of the cases with relatively little mixedness.1

Implementation

I have implemented both above methods in R. They can be found in my unofficial psych2 collection of

useful functions located here.

Supplementary material

Source code and figures are available at the Open Science Framework repository.

References

Jensen, A. R., & Weng, L.-J. (1994). What is a good g? Intelligence, 18(3), 231–258.

http://doi.org/10.1016/0160-2896(94)90029-9

Kirkegaard, E. O. W. (2014a). The international general socioeconomic factor: Factor analyzing international rankings. Open Differential Psychology. Retrieved from http://openpsych.net/ODP/2014/09/the-international-general-socioeconomic-factor-factor-analyzing-international-rankings/

Figure 10: Scatter plot of method 1 and 2

1 Concerning the agreement about rank-order, it is about .4 both with and without case 50. But this is based on a single simulation and I've seen some different values when re-running it. A large scale simulation is necessary.

Kirkegaard, E. O. W. (2014b). The personal Jensen coefficient does not predict grades beyond its association with g. Open Differential Psychology. Retrieved from http://openpsych.net/ODP/2014/10/the-personal-jensen-coefficient-does-not-predict-grades-beyond-its-association-with-g/

Kirkegaard, E. O. W. (2015). Examining the S factor in US states. The Winnower. Retrieved from

https://thewinnower.com/papers/examining-the-s-factor-in-us-states

Templ, M., Alfons, A., Kowarik, A., & Prantner, B. (2015, February 19). VIM: Visualization and Imputation of Missing Values. CRAN. Retrieved from http://cran.r-project.org/web/packages/VIM/index.html