A New Look at the Big Five Factor Structure Through Exploratory
Structural Equation Modeling
Herbert W. Marsh
University of Oxford
Oliver Lüdtke
University of Tübingen, Tübingen, Germany, and Max Planck
Institute for Human Development, Berlin, Germany
Bengt Muthén
University of California, Los Angeles
Tihomir Asparouhov
Muthén & Muthén, Los Angeles, California
Alexandre J. S. Morin
University of Sherbrooke
Ulrich Trautwein
University of Tübingen, Tübingen, Germany, and Max Planck
Institute for Human Development, Berlin, Germany
Benjamin Nagengast
University of Oxford
NEO instruments are widely used to assess Big Five personality factors, but confirmatory factor analyses
(CFAs) conducted at the item level do not support their a priori structure due, in part, to the overly
restrictive CFA assumptions. We demonstrate that exploratory structural equation modeling (ESEM), an
integration of CFA and exploratory factor analysis (EFA), overcomes these problems with responses
(N = 3,390) to the 60-item NEO–Five-Factor Inventory: (a) ESEM fits the data better and results in
substantially more differentiated (less correlated) factors than does CFA; (b) tests of gender invariance
with the 13-model ESEM taxonomy of full measurement invariance of factor loadings, factor variances–
covariances, item uniquenesses, correlated uniquenesses, item intercepts, differential item functioning,
and latent means show that women score higher on all NEO Big Five factors; (c) longitudinal analyses
support measurement invariance over time and the maturity principle (decreases in Neuroticism and
increases in Agreeableness, Openness, and Conscientiousness). Using ESEM, we addressed substantively
important questions with broad applicability to personality research that could not be appropriately
addressed with the traditional approaches of either EFA or CFA.
Keywords: exploratory structural equation modeling, factorial and measurement invariance, Big Five
personality structure, differential item functioning
Supplemental materials: http://dx.doi.org/10.1037/a0019227.supp
Arguably, the most important advance in personality psychol-
ogy in the past half century has been the emerging consensus that
individual differences in adults’ personality characteristics can be
organized in terms of five broad trait domains: Extraversion,
Agreeableness, Conscientiousness, Neuroticism, and Openness.
These Big Five factors now serve as a common language in the
field, facilitating communication and collaboration. Although
there are several Big Five instruments (e.g., Benet-Martinez &
John, 1998; Caprara & Perugini, 1994; Goldberg, 1990; Gosling,
Rentfrow, & Swann, 2003; John & Srivastava, 1999; Paunonen,
2003; Paunonen & Ashton, 2001; Saucier, 1998), the family of
NEO instruments—including the 60-item NEO–Five-Factor In-
ventory (NEO-FFI; Costa & McCrae, 1992; McCrae & Costa,
2004) considered here—appear to be the most widely used instru-
ments and to have received the most attention over recent years
(Boyle, 2008).
Herbert W. Marsh and Benjamin Nagengast, Department of Education,
University of Oxford, Oxford, England; Oliver Lüdtke and Ulrich Trautwein,
University of Tübingen, Tübingen, Germany, and Center for Educational
Research, Max Planck Institute for Human Development, Berlin, Germany;
Bengt Muthén, Graduate School of Education and Information Studies,
University of California, Los Angeles; Tihomir Asparouhov, Muthén & Muthén,
Los Angeles, California; Alexandre J. S. Morin, Department of Psychology,
University of Sherbrooke, Sherbrooke, Quebec, Canada.
This research was supported in part by a grant to Herbert W. Marsh from
the United Kingdom’s Economic and Social Research Council.
Correspondence concerning this article should be addressed to Herbert
W. Marsh, Department of Education, University of Oxford, 15 Norham
Gardens, Oxford OX2 6PY, United Kingdom. E-mail: herb.marsh@education.ox.ac.uk
Psychological Assessment
2010, Vol. 22, No. 3, 471–491
© 2010 American Psychological Association
1040-3590/10/$12.00 DOI: 10.1037/a0019227
Factor analysis has been at the heart of these exciting breakthroughs.
Exploratory factor analyses (EFAs) have consistently
identified the Big Five factors, and an impressive body of empirical
research supports their stability and predictive validity (see
McCrae & Costa, 1997). However, confirmatory factor analyses
(CFAs) have failed to provide clear support for the five-factor
model on the basis of standard measures such as the NEO instru-
ments. For example, in a particularly relevant study comparing
EFA and CFA factor structures based on NEO–Personality Inven-
tory (NEO-PI) responses, Vassend and Skrondal (1997) reported
highly discrepant findings, leading them to conclude
(i) that the original NEO-PI model as well as later EFA-based revi-
sions are false or at least unsatisfactory, and (ii) that at present we do
not know how the NEO-PI scales should be modeled with the aim of
obtaining a common, acceptable NEO-PI version. (p. 157)
Problematic results based on CFAs have led some researchers to
question the appropriateness of CFA for Big Five research (see
Borkenau & Ostendorf, 1990; Church & Burke, 1994; McCrae,
Zonderman, Costa, Bond, & Paunonen, 1996; Parker, Bagby, &
Summerfeldt, 1993; Vassend & Skrondal, 1997). However, many
of the methodological and statistical advances in quantitative psy-
chology in the last 2 decades are associated with latent-variable
approaches such as CFA and structural equation models (SEMs).
Hence, failure to embrace these new and evolving methodologies
(throwing the baby out with the bathwater) would have dire
consequences—particularly for a field of research so fundamen-
tally based on factor analysis. Indeed, assumptions of factorial and
measurement invariance (in relation to multiple groups, time,
covariates, and outcomes) that underpin nearly all Big Five studies
cannot be appropriately evaluated with traditional approaches to
EFA and thus have been largely ignored in Big Five EFA research.
Here we outline a new approach to factor analysis—an integration
of EFA and CFA—that has the potential to resolve this dilemma
and has wide applicability to all disciplines of psychology that are
based on the measurement of latent constructs. Thus, our study is
a substantive-methodological synergy (Marsh & Hau, 2007), dem-
onstrating the importance of applying new and evolving method-
ological approaches to substantively important issues.
Methodological Focus: Exploratory Structural
Equation Modeling (ESEM)
EFA Versus CFA
Many measurement instruments used in psychological assess-
ment apparently have well-defined EFA structures but are not
supported by CFAs (Marsh et al., 2009). This concern led McCrae
et al. (1996) to conclude:
In actual analyses of personality data from Borkenau and Ostendorf
(1990) to Holden and Fekken (1994), structures that are known to be
reliable showed poor fits when evaluated by CFA techniques. We
believe this points to serious problems with CFA itself when used to
examine personality structure. (p. 568; also see Costa & McCrae,
1992, 1995; McCrae & Costa, 1997)
Church and Burke (1994) similarly concluded on the basis of their
empirical research that
Poor fits of a priori models highlighted not only the limited specificity
of personality structure theory, but also the limitations of confirmatory
factor analysis for testing personality structure models. (p. 93)
They argued that the independent clusters model (ICM) used in
CFA studies, which requires each indicator to load on only one
factor, is too restrictive for personality research, because indicators
are likely to have secondary loadings unless researchers resort to
using a small number of near-synonyms to infer each factor.
Marsh et al. (2009) claimed that, consistent with these concerns,
many ad hoc strategies used to compensate for the inappropriate-
ness of CFA in psychological research more generally are dubious,
counterproductive, misleading, or simply wrong. Of particular
relevance to the present investigation, the inappropriate imposition
of zero factor loadings usually leads to distorted factors with
positively biased factor correlations that might lead to biased
estimates in SEMs incorporating other constructs (also see Marsh
et al., 2009). In a similar vein, Marsh (2007; Marsh, Hau, &
Grayson, 2005) concluded that many psychological instruments
used in applied research do not even meet minimum criteria of
acceptable fit according to current standards.
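The argument about positively biased factor correlations can be illustrated with a small population-level sketch. It is not part of the original analyses: the six items, the loading values (.70 target, .25 cross-loading), and all variable names are hypothetical. The sketch builds the model-implied item correlations for two truly orthogonal factors and then asks what factor correlation an independent clusters model, with cross-loadings fixed at zero, would need in order to reproduce them.

```python
import numpy as np

# Hypothetical population: 6 items, 2 orthogonal factors.
# Items 1-3 target factor A, items 4-6 target factor B; each item
# also has a small cross-loading (0.25) on its non-target factor.
loadings = np.array([
    [0.70, 0.25],
    [0.70, 0.25],
    [0.70, 0.25],
    [0.25, 0.70],
    [0.25, 0.70],
    [0.25, 0.70],
])
factor_corr = np.eye(2)  # the factors are truly uncorrelated

# Model-implied correlation matrix: Sigma = Lambda Phi Lambda' + Theta,
# with uniquenesses chosen so that every item has unit variance.
common = loadings @ factor_corr @ loadings.T
sigma = common + np.diag(1.0 - np.diag(common))

# Correlation between two items on the same factor, and between two
# items that target different factors.
within_item_r = sigma[0, 1]
cross_item_r = sigma[:3, 3:].mean()

# An ICM with equal target loadings implies within = lambda^2 and
# cross = lambda^2 * phi, so reproducing sigma requires this phi:
icm_factor_corr = cross_item_r / within_item_r

print(f"true factor correlation:     {factor_corr[0, 1]:.2f}")
print(f"ICM-implied factor correlation: {icm_factor_corr:.2f}")
```

In this toy case the restricted model can match the item correlations only by reporting a substantial factor correlation, even though the generating factors are orthogonal. Real data would add sampling error and unequal loadings, so the size of the bias here should not be read as typical.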
Apparently, many applied researchers persist with inappropriate
ICM-CFA models because they believe that EFA approaches are
outdated and that methodological advances associated with CFAs
are not applicable to EFAs. Here we demonstrate how it is possible
to apply EFA rigorously in a way that allows researchers to define
more appropriately the underlying factor structure and to still
apply the advanced statistical methods typically associated with
CFAs and SEMs. This is accomplished with the ESEM procedure
recently implemented in the Mplus statistical package (Version
5.2, Muthén & Muthén, 2008). Within the ESEM framework, the
applied personality researcher has access to typical SEM parame-
ter estimates, standard errors, goodness-of-fit statistics, and statis-
tical advances normally associated with CFA and SEMs (see
Asparouhov & Muthén, 2009; Marsh et al., 2009). Here we apply
ESEM to NEO-FFI responses.
Tests of Factorial and Measurement Invariance
We know of no CFAs carried out at the item level—particularly
for research based on the NEO-FFI instrument used to measure the
Big Five personality factors—that provide acceptable support for
the a priori Big Five factor structure. This is remarkable, given the
widespread acceptance of the Big Five factor structure and the
NEO-FFI. Hence, it is not surprising that research into the Big
Five factor structure on responses to individual items continues to
be based almost entirely on EFA (for exceptions, see Benet-
Martinez & John, 1998; Dolan, Oort, Stoel, & Wicherts, 2009;
Gustavsson, Eriksson, Hilding, Gunnarsson, & Ostensson, 2008;
also see Reise, Smith, & Furr, 2001). We suggest that this failure
to apply CFA models in Big Five research is due in large part to
the inappropriateness of the typical ICM-CFA structure. Although
identification of the appropriate factor structure is important in its
own right, there are many other important advantages to the use of
CFA that cannot be easily incorporated into EFA and thus have
been largely ignored in Big Five personality research. Thus, for
example, studies that use Big Five scale scores (or factor scores
based on EFAs) are not corrected for measurement error. Although
it is possible to correct for a simple form of measurement error
(i.e., the typical correction for attenuation based on reliability
estimates), in many applications the error structure is more com-
plex (e.g., longitudinal studies as considered here), so the typical
correction for attenuation is not sufficient.
A particularly important application of CFA techniques is to test
the assumptions about the invariance of the Big Five factor struc-
ture over multiple groups or over time (Gustavsson et al., 2008;
Nye, Roberts, Saucier, & Zhou, 2008; Reise et al., 2001). Unless
the underlying factors are measuring the same construct in the
same way and the measurements themselves are operating in the
same way (across groups or over time), mean differences and other
comparisons are likely to be invalid. Although some aspects of
factor similarity can be addressed in part with EFA approaches
(e.g., the similarity of the factor loadings), most cannot. In partic-
ular, an important assumption in the comparison of Big Five
factors over different groups (e.g., men and women) or over time
is the invariance of item intercepts. More specifically, it is impor-
tant to ascertain that mean differences based on latent constructs
(Big Five factors) are reflected in each of the individual items used
to infer the latent constructs. For example, if the apparent level of
gender differences in Extraversion varies substantially from item
to item for different items used to infer this construct, then the
gender differences based on the corresponding latent construct are
idiosyncratic to the particular items used to infer Extraversion.
Similarly, if responses to individual Extraversion items differ
systematically with age (for different respondents) or over time
(for the same respondents), then findings based on comparisons of
scale scores might be invalid. In each case, these results would
suggest that conclusions about differences in Extraversion do not
generalize over even the set of items used in the instrument—let
alone the population of items that could have been used. Hence,
conclusions about differences in Extraversion might be idiosyn-
cratic to the particular set of items and not be generalizable. From
this perspective, it is important to evaluate the invariance of
different aspects of the factor structure at the level of the individual
item. Although issues of noninvariance of item intercepts (hereaf-
ter referred to as differential item functioning) are well known in
evaluating the appropriateness of standardized achievement tests,
these issues have been largely ignored in Big Five research (but
see Jackson et al., 2009; Nye et al., 2008; Reise et al., 2001).
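The role of item intercepts can be made concrete with a short numerical sketch. It assumes a simple linear measurement model in which the expected item mean in a group equals the item intercept plus the loading times the group's latent mean. The three items, their loadings and intercepts, and the latent mean difference of 0.3 are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical three-item scale; loadings are equal across groups.
loadings = np.array([0.7, 0.6, 0.8])
intercepts_men = np.array([2.0, 2.5, 3.0])
latent_mean_men, latent_mean_women = 0.0, 0.3


def expected_item_means(intercepts, loadings, latent_mean):
    """Expected item means under a linear factor model."""
    return intercepts + loadings * latent_mean


men = expected_item_means(intercepts_men, loadings, latent_mean_men)

# Case 1: intercepts invariant across groups. Every item-level gap is
# the loading times the same latent mean difference.
women_invariant = expected_item_means(intercepts_men, loadings, latent_mean_women)
invariant_gap = women_invariant - men

# Case 2: differential item functioning on item 3 (intercept lower by
# 0.4 for women). The item-level gap no longer tracks the latent gap.
intercepts_women = intercepts_men + np.array([0.0, 0.0, -0.4])
women_dif = expected_item_means(intercepts_women, loadings, latent_mean_women)
dif_gap = women_dif - men

print("gap / loading, invariant intercepts:", invariant_gap / loadings)
print("gap / loading, DIF on item 3:       ", dif_gap / loadings)
```

With invariant intercepts each item implies the same latent difference; with the shifted intercept, item 3 points in the opposite direction, so any conclusion about the group difference would depend on which items happened to be included.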
Substantive Focus on Big Five Personality Factors
and the NEO-FFI
Gender Differences in Personality Traits
There is a long history of the search for gender differences in
personality research (e.g., Feingold, 1994; Hall, 1984; Maccoby &
Jacklin, 1974). Noting that Feingold (1994) had organized his
review in part on the basis of the five broad factors and 30 facets
of the NEO-PI, Costa, Terracciano, and McCrae (2001) greatly
expanded the research based on the 30 facets measured by the
NEO-PI-R for responses from 26 countries (N = 23,031). Interestingly,
they found that gender differences within the set of six
facets comprising each of the Big Five factors were not entirely
consistent. Women had consistently higher scores across six facets
representing Neuroticism and Agreeableness, whereas gender dif-
ferences were consistently small for Conscientiousness. However,
gender differences were less consistent for Extraversion and Open-
ness; for each of these Big Five factors at least two (of six) facets
favored women and at least two favored men. Hence, the size and
even the direction of gender differences would differ depending on
which facet (or mix of facets) was considered. Thus, even at the
facet level there is apparently differential item (facet) functioning
for some of the Big Five factors that compromises conclusions
based on Big Five measures that are aggregated across facets.
Logically, this implies that there is also likely to be differential
item functioning at the level of individual items in relation to
gender differences for NEO-FFI responses considered here.
Although there is considerable study-to-study variation in ob-
served gender differences that may be a function of age, nation-
ality, and the particular instrument considered, there is clear sup-
port for the conclusions that women tend to score higher than men
in relation to Neuroticism and Agreeableness. Although less con-
sistent, there is also evidence that women score higher on Consci-
entiousness and Extraversion but no clear support for evidence of
gender differences in Openness. There is no evidence that men
score higher than women on any of the Big Five factors as
measured and labeled on the NEO-FFI (although women’s higher
scores on Neuroticism are sometimes summarized as lower scores
on emotional stability). Particularly relevant to the current study
(based on late-adolescent responses by Germans), Schmitt, Realo,
Voracek, and Allik (2008) reported that for their German sample
(N = 790), women scored higher than men did on all Big Five
factors: Neuroticism (d = 0.48), Extraversion (d = 0.12), Agreeableness
(d = 0.09), Conscientiousness (d = 0.23), and Openness
(d = 0.11). Similarly, Donnellan and Lucas (2008) found that for
the late-adolescent sample (ages 16–19 years) most relevant to the
present investigation, German women consistently scored higher
than German men did: Neuroticism (d = 0.47), Extraversion (d =
0.24), Agreeableness (d = 0.31), Conscientiousness (d = 0.34),
and Openness (d = 0.36).
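For readers less familiar with the d metric used in these comparisons, the following sketch computes a standardized mean difference with a pooled standard deviation. This is one common definition; the cited studies may have used a different variant, and the function name and the group statistics in the usage lines are hypothetical.

```python
import math


def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Standardized mean difference (group a minus group b), pooled SD."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


# Hypothetical scale-score summaries for women (a) and men (b).
d_example = cohens_d(mean_a=2.60, mean_b=2.36, sd_a=0.50, sd_b=0.50,
                     n_a=100, n_b=100)
print(round(d_example, 2))
```

A positive value indicates a higher mean in the first group, which matches the sign convention in the text, where positive d values reflect higher scores for women.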
Longitudinal Invariance: Stability and Change in
Personality Traits
The literature on personality development distinguishes several
types of personality change and continuity (Caspi & Shiner, 2006;
Lüdtke, Trautwein, & Husemann, 2009). Here we distinguish
between correlational (rank-order), mean-level, and structural sta-
bility over time.
For correlational stability, cross-sectional and longitudinal re-
search (Roberts & DelVecchio, 2000; see also Fraley & Roberts,
2005; Klimstra, Hale, Raaijmakers, Branje, & Meeus, 2009;
Lüdtke et al., 2009) shows that correlational stability increases
with age, particularly for the middle-to-late adolescent period that
is the focus of the present investigation.
Studies of mean-level change with respect to life-span changes
in Big Five traits show that most people become more dominant,
agreeable, conscientious, and emotionally stable. Caspi, Roberts,
and Shiner (2005) coined the term maturity principle to describe
these findings of increasing psychological maturity from adoles-
cence to middle age. In their meta-analysis of longitudinal studies,
Roberts, Walton, and Viechtbauer (2006) also found substantial
increases in Openness. For the 18–22 age group most relevant to
the present investigation, Robins, Fraley, Roberts, and Trz-
esniewski (2001) found that, over a 4-year period, Agreeableness
(d = 0.44), Conscientiousness (d = 0.27), and Openness (d =
0.22) increased and Neuroticism (d = –0.49) decreased. No statistically
significant change was found for Extraversion. In summary,
although results from these studies are not entirely consistent,
there is general support for the maturity principle of increases
in all Big Five factors (or decreases in Neuroticism instead of
increases in Emotional Stability) except, perhaps, for Extraversion.
Structural stability assesses the extent to which the same factors
are being assessed in different groups or over time. At least some
level of structural invariance is a prerequisite for assessing either
mean differences between groups or stability over time. If the
nature of the factors changes so that factors are qualitatively
different, then interpretations of stability over time are question-
able. It is most appropriate to evaluate factorial and measurement
invariance on the basis of responses to individual items. However,
personality researchers have been remarkably unsuccessful in ob-
taining acceptable levels of goodness of fit for the a priori Big Five
factor CFA structure when analysis of the structure is based on
responses to individual items in studies of the NEO-FFI instrument
considered here. Indeed, this might be considered the major lim-
itation in Big Five personality research, particularly in relation to
testing assumptions underpinning the valid assessment of stability
over time as well as the valid comparison of latent means across
groups. For this reason some studies have sought to formally test full
measurement invariance based on mean responses averaged across
different items, facet scores (e.g., Gignac, 2009; McCrae et al., 1996;
Saucier, 1998; Small, Hertzog, Hultsch, & Dixon, 2003), parcel
scores (Allemand, Zimprich, & Hendriks, 2008; Allemand, Zimprich,
& Hertzog, 2007; Lüdtke et al., 2009; Marsh, Trautwein, Lüdtke,
Köller, & Baumert, 2006), or scale scores (e.g., Mroczek & Spiro,
2003). Although these analyses are potentially useful, they have
important limitations when conducted without prior verification of
measurement invariance at the item level—an assumption underlying
tests of mean differences (over time or across groups) and differential
item functioning that could compromise the validity of interpretations
based on analyses of aggregated scores (see later discussion for
further elaboration). In the present investigation, we address these
concerns, introducing a new ESEM approach that integrates the logic
of the EFA approach typically used in Big Five personality research
and the CFA approach widely argued to be inappropriate to Big Five
research.
The Present Investigation:
A Substantive-Methodological Synergy
Our study is a substantive-methodological synergy, demonstrat-
ing the power and flexibility of ESEM methods that integrate CFA
and EFA (on the basis of the Mplus statistical package; Muthén &
Muthén, 2008) to address substantively important issues about the
Big Five factor structure on the basis of responses to the 60-item
NEO-FFI instrument. We begin by comparing CFA and ESEM
approaches, testing the assumption that ESEM models fit better
than corresponding CFA models. For both CFA and ESEM mod-
els, we include both freely estimated uniquenesses (reflecting a
combination of measurement-error-specific variances) and a priori
correlated uniquenesses (CUs; covariances between the specific
variance components associated with two different items from the
same Big Five facet). Big Five theory posits that the Big Five
factors should be reasonably orthogonal, but constraining all (non-
target) cross-loadings to be zero in the ICM-CFA model is posited
to systematically inflate and bias estimates of the factor correla-
tions. Hence, support for the prediction that Big Five factors are
reasonably orthogonal is hypothesized to be stronger in ESEM
models than in CFA models.
We then extend ESEM to test a 13-model taxonomy of mea-
surement invariance, testing invariance of factor loadings, factor
variances–covariances, item uniquenesses, CUs, item intercepts,
and latent means—with a specific focus on gender differences in
the latent means of the Big Five factors. Of particular interest are
tests of the invariance of item intercepts that are an implicit
assumption in the comparison of latent (or manifest) group means
but are largely ignored in previous Big Five research (but see
Jackson et al., 2009; Nye et al., 2008; Reise et al., 2001). We
expect, on the basis of previous research, systematic differences,
mostly reflecting higher means for women (particularly for the
late-adolescent German sample considered here). We also predict
that, consistent with previous research, there is differential item
functioning in NEO-FFI responses (noninvariance of item inter-
cepts) that would compromise the interpretation of latent mean
comparisons, but we explore alternatives to circumvent this prob-
lem.
Finally, we apply ESEM to test–retest data, testing a set of
models of measurement invariance over time with the inclusion of
CUs relating responses to the same item on multiple occasions.
Although these (within-group) tests of longitudinal invariance
largely parallel those based on (between-group) tests over gender,
the substantive implications are quite different. Indeed, given that
participants are tested in their final year of high school at Time 1
(T1) and are tested 2 years after graduation at Time 2 (T2), it is
reasonable that there might be systematic changes in Big Five
latent means. We expect to see, based on the maturity principle,
decreases in Neuroticism and increases in Agreeableness, Open-
ness, and Conscientiousness.
Previous research has suggested a problem with the evaluation
of stability over time for NEO-FFI responses that is especially
relevant to the present investigation. NEO-FFI responses consis-
tently have high levels of short-term test–retest stability (.86–.90;
McCrae & Costa, 2004; Robins et al., 2001) and internal consis-
tency (.68–.86; Costa & McCrae, 1992). However, this research
suggests problems associated with a complex error structure in that
test–retest correlations are larger than internal consistency mea-
sures of reliability. In particular, test–retest correlations would be
greater than 1.0 if corrected for (internal consistency) unreliability.
This suggests that observed test–retest correlations are more pos-
itively biased by CUs associated with specific variances of the
same items administered on different occasions than negatively
biased by the failure to control for measurement error in the
factors. Traditional EFA approaches are unable to appropriately
distinguish between measurement error on each occasion, CUs
over time, and true stability of latent traits over time, but these
issues can be addressed by ESEM, as demonstrated in the present
investigation.
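The claim that corrected test–retest correlations would exceed 1.0 follows from the classical correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliabilities. The sketch below applies that formula to values taken from the ranges quoted above. Because the text reports only ranges, the particular pairing of a retest correlation with an internal consistency value is an assumption made for illustration.

```python
import math


def disattenuate(r_observed, reliability_1, reliability_2):
    """Classical correction for attenuation."""
    return r_observed / math.sqrt(reliability_1 * reliability_2)


# Illustrative pairings from the quoted ranges (retest .86-.90,
# internal consistency .68-.86); the same reliability is assumed at
# both occasions.
best_case = disattenuate(0.90, 0.86, 0.86)   # most reliable scale
worst_case = disattenuate(0.86, 0.68, 0.68)  # least reliable scale

print(f"corrected retest r, alpha = .86: {best_case:.3f}")
print(f"corrected retest r, alpha = .68: {worst_case:.3f}")
```

Both pairings give a corrected value above 1.0, which is inadmissible for a correlation. This is the pattern that points to correlated uniquenesses across occasions rather than to true stability alone.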
Method
Participants
The data come from a large, ongoing German study (Transfor-
mation of the Secondary School System and Academic Careers
[TOSCA]; see Köller, Watermann, Trautwein, & Lüdtke, 2004;
also see Lüdtke et al., 2009; Marsh, Trautwein, et al., 2006). A
random sample of 149 upper secondary schools in a single German
state was selected to be representative of the traditional and voca-
tional gymnasium school types attended by the college-bound
student population. At T1, the students (N = 3,390; 45% men,
55% women) were in their final year of upper secondary schooling
(M age = 19.51, SD = 0.77). Two trained research assistants
administered materials in each school, and students participated
voluntarily, without any financial incentive. At T1, all students
were asked to provide written consent to be contacted again later
for a second wave of data collection. At T2, 2 years after gradu-
ation from high school, participants completed an extensive ques-
tionnaire taking about 2 hr in exchange for a financial reward of 10
euros (US$13).
For evaluation of longitudinal stability, our analyses are re-
stricted to the responses by the 1,570 (39% men, 61% women)
students who completed the NEO-FFI at both T1 and T2. To test
for attrition effects, we compared continuers, who participated at
both time points, to dropouts, who participated in only the first
wave. Continuers had slightly lower grade point averages (M =
2.3 vs. 2.5) and were more likely to be female. Selectivity effects
exceeding d = 0.10 were found for two of the Big Five scale
scores; continuers had higher Conscientiousness and Agreeable-
ness scores. Although dropouts and continuers differed statistically
significantly in some domains, the magnitude of these differences
was small and indicative of only small selectivity effects. We also
compare, as part of the analysis, factor structures based on all
students at T1 as well as those who completed instruments at both
T1 and T2.
Measures: Big Five Dimensions
The 60-item NEO-FFI (Costa & McCrae, 1992) provides a short
measure of the Big Five personality factors (Costa & McCrae,
1989). For each factor, McCrae and Costa (1989) selected 12 items
from the 180 items of the longer NEO-PI (and the full 240-item
NEO-PI-R), based primarily on correlations between each NEO-PI
item and factor scores (McCrae & Costa, 1989). We measured the
Big Five factors using the German version (Borkenau & Osten-
dorf, 1993) of the NEO-FFI, whose responses have high reliability,
validity, and comparability with responses to the original English-
language version (e.g., Borkenau & Ostendorf, 1993). In our study,
items were rated on a 4-point scale ranging from 1 (strongly
disagree) to 4 (strongly agree). Psychometric analyses of the
4-point response format show that this format has some advantages
over a 5-point scale (Lüdtke, Trautwein, Nagy, & Köller, 2004).
Coefficient alpha reliabilities at T1 and T2, respectively, were .78
and .80 (Extraversion), .72 and .73 (Agreeableness), .83 and .84
(Conscientiousness), .83 and .87 (Neuroticism), and .73 and .74
(Openness). Hence, consistent with previous research (e.g., Church
& Burke, 1994; McCrae et al., 1996), there are small increases in
reliability with increased age during this late-adolescent period.
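As a reminder of how the coefficient alpha values reported here are defined, the following sketch computes alpha from a respondents-by-items score matrix using the usual variance formula. The function name and the small response matrix are hypothetical; they are not the study data.

```python
import numpy as np


def coefficient_alpha(item_scores):
    """Coefficient alpha for a (respondents x items) score matrix."""
    item_scores = np.asarray(item_scores, dtype=float)
    n_items = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1.0 - item_variances.sum() / total_variance)


# Hypothetical responses of five people to four items on a 1-4 scale.
responses = np.array([
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [4, 3, 4, 4],
    [2, 1, 2, 1],
])
print(round(coefficient_alpha(responses), 2))
```

Alpha rises with the number of items and with the average inter-item covariance, which is one reason a 12-item scale can reach values in the .70s and .80s even when individual items are only moderately correlated.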
Statistical Analyses
Analyses were conducted with Mplus (Version 5.2; Muthén &
Muthén, 2008). Preliminary analyses consisted of a traditional
CFA based on the Mplus robust maximum likelihood estimator
(MLR), with standard errors and tests of fit that are robust in
relation to nonnormality and nonindependence of observations
(Muthén & Muthén, 2008). The main focus is on the application of
ESEM to responses to the 60-item NEO Big Five personality
instrument. The ESEM approach differs from the typical CFA
approach in that all factor loadings are estimated, subject to con-
straints so that the model can be identified (for further details of
the ESEM approach and identification issues, see technical appen-
dix, Appendix 1 in the online supplemental materials; also see
Asparouhov & Muthén, 2009). Here we used an oblique geomin
rotation (the default in Mplus) with an epsilon value of .5 and
the MLR estimation. A critical advantage of the ESEM approach
is the ability to test full measurement invariance for an EFA
solution in relation to multiple groups or occasions.
Factorial and measurement invariance. Marsh et al. (2009)
proposed a 13-model taxonomy of invariance tests that integrated
factor analysis (e.g., Jöreskog & Sörbom, 1988; Marsh, 1994,
2007) and measurement invariance (e.g., Meredith, 1964, 1993;
Meredith & Teresi, 2006) traditions to testing invariance over
multiple groups or occasions. Following the measurement invari-
ance tradition, we use terminology proposed by Meredith (1964,
1993) that has achieved broad acceptance. Although tests of in-
variance are frequently based on covariance matrices emerging
from the factor analysis tradition, tests of full measurement invari-
ance begin with raw data (or mean augmented covariance matri-
ces) and should be done at the item level to evaluate item func-
tioning.
In the Meredith (1964, 1993) tradition, the sequence of invari-
ance testing generally begins with a model with no invariance of
any parameter estimates (i.e., all parameters are freely estimated)
such that only similarity of the overall pattern of parameters is
evaluated (configural invariance). Technically, this model might
not be an invariance model in that it does not require any estimated
parameters to be the same. However, it does provide both a test of
the ability of the a priori model to fit the data in each group (or
occasion) without invariance constraints and a baseline for com-
paring other models that do impose equality constraints on the
parameter estimates across groups or over time. Configural invari-
ance models are followed by tests of weak measurement invari-
ance that are satisfied if factor loadings are invariant over groups
or occasions, although Byrne, Shavelson, and Muthén (1989) also
argued for the usefulness of a less demanding test of partial
invariance in which some parameter estimates are not constrained
to be invariant. Strong measurement invariance is satisfied if the
indicator means (i.e., the intercepts of responses to individual
items) and factor loadings are invariant over groups. If factor
loadings and item intercepts are invariant over groups, then
changes in the latent factor means can reasonably be interpreted as
changes in the latent constructs. Strict measurement invariance is
satisfied if factor loadings, item intercepts, and item uniquenesses
are all invariant across groups or over time. Strict measurement
invariance is required in order to compare Big Five (manifest)
scale scores (or factor scores) over time or across different groups.
As comparisons based on latent constructs are corrected for mea-
surement error, they require only strong measurement invariance.
The taxonomy of 13 partially nested models (Marsh et al., 2009)
expands this measurement invariance tradition; models vary from
the least restrictive model of configural invariance with no invari-
ance constraints to a model of complete invariance that posits strict
invariance as well as the invariance of the latent means and of the
factor variance–covariance matrix (see Table 1; for a more ex-
tended discussion of these issues, see also Marsh et al., 2009). All
models except the configural invariance model (Model 1) assume
the invariance of factor loadings, but it is possible to test, for
example, the invariance of indicator uniquenesses with or without
the invariance of item intercepts. However, models with freely
estimated indicator intercepts and freely estimated latent means are
not identified. So in models with freely estimated intercepts, the
latent means are fixed to be zero. Then, when the intercepts are
constrained to equality across groups (or occasions), the latent
means are constrained to be zero in one group (or occasion) and
freely estimated in the second group (or occasion). In this manner,
the latent means in the second group (or occasion) and their
statistical significance reflect the differences between the two
groups (or occasions).
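The nesting relations among the models in Table 1 can be sketched by treating each model as the set of parameter types it constrains to be invariant. The following is a simplified illustration under that assumption only: it checks set inclusion and does not model the identification rule for latent means described above. The names TAXONOMY and nested_under are hypothetical.

```python
# Invariance constraints per model, transcribed from Table 1
# (FL = factor loadings, Uniq = item uniquenesses,
#  FVCV = factor variances-covariances, Inter = item intercepts,
#  LFMn = latent factor means).
TAXONOMY = {
    1: set(),
    2: {"FL"},
    3: {"FL", "Uniq"},
    4: {"FL", "FVCV"},
    5: {"FL", "Inter"},
    6: {"FL", "Uniq", "FVCV"},
    7: {"FL", "Uniq", "Inter"},
    8: {"FL", "FVCV", "Inter"},
    9: {"FL", "Uniq", "FVCV", "Inter"},
    10: {"FL", "Inter", "LFMn"},
    11: {"FL", "Uniq", "Inter", "LFMn"},
    12: {"FL", "FVCV", "Inter", "LFMn"},
    13: {"FL", "Uniq", "FVCV", "Inter", "LFMn"},
}

def nested_under(model):
    """Models whose constraint set is a proper subset of this model's constraints."""
    return sorted(m for m, c in TAXONOMY.items() if c < TAXONOMY[model])

strict = nested_under(7)     # less constrained models under which Model 7 is nested
complete = nested_under(13)  # Model 13 is nested under every other model
```

Under this simplification, the results for Models 7 and 13 agree with the bracketed entries in Table 1 ([1–3, 5] and [1–12], respectively).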
Here we demonstrate the application of tests of measurement
invariance over gender and across time on the basis of our taxon-
omy of invariance tests (see Table 1). Such tests have typically
used SEM/CFA. Related multiple-group methods have been pro-
posed for EFA (e.g., Cliff, 1966; Meredith, 1964), but they mainly
focus on the similarity of factor patterns rather than formal tests of
invariance (but also see Dolan et al., 2009). However, the ESEM
model can be extended to multiple groups or longitudinal analyses
such that the ESEM solution is estimated separately for each group
or occasion and parameters can be constrained to be invariant
across groups or over time (Marsh et al., 2009; also see technical
appendix, Appendix 1 in the supplemental materials).
CUs. In general, the use of ex post facto CUs should be
avoided (e.g., Marsh, 2007), but there are some circumstances in
which a priori CUs should be included. When the same items are
used on multiple occasions, there are likely to be correlations
between the unique components of the same item administered on
the different occasions that cannot be explained in terms of cor-
relations between the factors. Indeed, Marsh and Hau (1996;
Marsh, 2007), Jöreskog (1979), and others have argued that the
failure to include these CUs is likely to systematically bias param-
eter estimates such that test–retest correlations among matching
latent factors are systematically inflated, which can then system-
atically bias other parameter estimates (especially in SEMs). In the
extreme, test–retest correlations might be so substantially inflated
that the failure to include appropriate CUs can result in improper
solutions such as a nonpositive definite factor variance–covariance
matrix or estimated test–retest correlations that are greater than 1.0
(e.g., Marsh, Martin, & Debus, 2001; Marsh, Martin, & Hau,
2006). Previous research showed that short-term test–retest corre-
lations for NEO-FFI factors are systematically larger than internal
consistency estimates of reliability so that disattenuated test–retest
correlations would be greater than 1.0 (see earlier discussion). This
suggests that there are likely to be substantial CUs in the test–retest data
considered here. For this reason, Marsh and Hau argued that CUs
relating responses to the same items on different occasions should
always be included in the a priori model, but it is easy to evaluate
the extent to which the exclusion of these a priori CUs affects the
fit of the model and the nature of parameter estimates (particularly
test–retest stability coefficients) by constraining them to be zero.
Importantly, it is difficult to either test or correct complex struc-
tures of measurement error with EFAs and scale scores typically
used in Big Five research.
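The point about disattenuated test–retest correlations exceeding 1.0 follows from the standard correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. The sketch below uses hypothetical values chosen only to show the effect; they are not estimates from this study.

```python
from math import sqrt

def disattenuated_correlation(r_test_retest, reliability_t1, reliability_t2):
    """Classical correction for attenuation of an observed correlation."""
    return r_test_retest / sqrt(reliability_t1 * reliability_t2)

# Hypothetical values: an observed test-retest correlation larger than
# both internal consistency estimates
r_corrected = disattenuated_correlation(0.85, 0.80, 0.82)
```

Whenever the observed test–retest correlation is larger than the geometric mean of the two reliability estimates, the corrected value is greater than 1.0, which is an inadmissible correlation.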
As described in more detail by McCrae and Costa (2004), in the
NEO-PI-R (with 240 items), each of the Big Five factors was
represented by six facets, and each facet was represented by
multiple items. However, in the construction of the (short)
NEO-FFI, items were selected to best represent each of the Big
Five factors without reference to the facets. More specifically,
each Big Five factor was represented by a factor score (based on
an EFA with varimax rotation), and items were selected that were
most highly correlated with this factor score. Hence, some facets
are overrepresented (relative to the design of the full NEO-PI-R),
whereas other facets are represented by a single item or not
represented at all. We posited that items that came from the same
facet of a specific Big Five factor would have higher correlations
than would items that came from different facets of the same Big
Five factor—beyond correlations that could be explained in terms
of the common Big Five factor that they represented. Here we
modeled these potentially inflated correlations due to facets as
CUs relating each pair of items from the same facet. Based on the
mapping of NEO-FFI items onto the NEO-PI-R facets (R. McCrae,
personal communication, December 1, 2008; also see Appendix 2
of the supplemental materials), this resulted in an a priori set of 57
CUs inherent to the design of the NEO-FFI. Although we argue
that this set of a priori CUs should be included in all factor
analyses of NEO-FFI responses, we systematically evaluate mod-
els with and without these CUs as well as the invariance of these
CUs over multiple (gender) groups and over time.
Goodness of fit.
CFA/SEM research typically focuses on the
ability of a priori models to fit the data as summarized by sample
Table 1
Taxonomy of Invariance Tests for Evaluating Measurement
Invariance of Big Five Responses Across Multiple Groups or
Over Multiple Occasions
Model  Parameters constrained to be invariant
1      None (configural invariance)
2      FL [1] (weak factorial/measurement invariance)
3      FL, Uniq [1, 2]
4      FL, FVCV [1, 2]
5      FL, Inter [1, 2] (strong factorial/measurement invariance)
6      FL, Uniq, FVCV [1–4]
7      FL, Uniq, Inter [1–3, 5] (strict factorial/measurement invariance)
8      FL, FVCV, Inter [1, 2, 4, 5]
9      FL, Uniq, FVCV, Inter [1–8]
10     FL, Inter, LFMn [1, 2, 5] (latent mean invariance)
11     FL, Uniq, Inter, LFMn [1–3, 5, 7, 10] (manifest mean invariance)
12     FL, FVCV, Inter, LFMn [1, 2, 4–6, 8, 10]
13     FL, Uniq, FVCV, Inter, LFMn [1–12] (complete factorial invariance)
Note. Models with freely estimated LFMn constrain intercepts to be
invariant across groups, whereas models in which intercepts are free imply
that mean differences are a function of intercept differences. Values in
brackets represent nesting relations in which the estimated parameters of
the less general model are a subset of the parameters estimated in the more
general model under which it is nested. All models are nested under Model
1 (with no invariance constraints), whereas Model 13 (complete invariance)
is nested under all other models. FL = factor loadings; Uniq = item
uniquenesses; FVCV = factor variances–covariances; Inter = item intercepts;
LFMn = latent factor means. Parts of this table were adapted from
“Exploratory Structural Equation Modeling, Integrating CFA and EFA:
Application to Students’ Evaluations of University Teaching,” by H. W.
Marsh, B. Muthén, T. Asparouhov, O. Lüdtke, A. Robitzsch, A. J. S.
Morin, and U. Trautwein, 2009, Structural Equation Modeling, 16, p. 443,
Table 1. Copyright 2009 by Taylor & Francis.
size independent indices of fit (e.g., Marsh, 2007; Marsh, Balla, &
Hau, 1996; Marsh, Balla, & McDonald, 1988; Marsh et al., 2005).
Here we consider the root-mean-square error of approximation
(RMSEA), the Tucker–Lewis index (TLI), and the comparative fit
index (CFI), as operationalized in Mplus in association with the
MLR estimator (Muthén & Muthén, 2008). We also considered the
robust chi-square test statistic and evaluation of parameter esti-
mates. For both the TLI and CFI, values greater than .90 and .95,
respectively, typically reflect acceptable and excellent fit to the
data. For the RMSEA, values less than .05 and .08 reflect a close
fit and a reasonable fit to the data, respectively (Marsh, Hau, &
Wen, 2004). However, we emphasize that these cutoff values
constitute only rough guidelines; there is considerable evidence
that realistically large factor structures (e.g., instruments with at
least 50 items and at least five factors) are typically unable to
satisfy even the minimally acceptable standards of fit (Marsh,
2007; Marsh et al., 2005; also see Marsh, Hau, Balla, & Grayson,
1998). However, because there are few applications of ESEM—
and none that fully evaluate the appropriateness of the traditional
CFA indices of fit—it is unclear how relevant these CFA indices
and proposed cutoff values are for ESEM studies (Marsh et al.,
2009).
In CFA studies it is typically more useful to compare the relative
fit of a taxonomy of nested (or partially nested) models designed
a priori to evaluate particular aspects of interest than to compare
that of single models (Marsh, 2007; Marsh et al., 2009). Any two
models are nested so long as the set of parameters estimated in the
more restrictive model is a subset of the parameters estimated in
the less restrictive model. This comparison can be based on a
chi-square difference test, but this test suffers the same problems
as the chi-square test used to test goodness of fit that led to the
development of fit indices (see Marsh et al., 1998). For this reason,
researchers have posited a variety of ad hoc guidelines to evaluate
when differences in fit are sufficiently large to reject a more
parsimonious model (i.e., the more highly constrained model with
fewer estimated parameters) in favor of a more complex model. It
has been suggested that support for the more parsimonious model
requires a change in CFI of less than .01 (Chen, 2007; Cheung &
Rensvold, 2001) or a change in RMSEA of less than .015 (Chen,
2007). Marsh (2007) noted that some indices (e.g., TLI and
RMSEA) incorporate a penalty for parsimony so that the more
parsimonious model can fit the data better than a less parsimonious
model can (i.e., the gain in parsimony is greater than the loss in
fit). Hence, a more conservative guideline is that the more parsi-
monious model is supported if the TLI or RMSEA is as good as or
better than that for the more complex model. Nevertheless, all
these proposals should be considered as rough guidelines or rules
of thumb.
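These rules of thumb can be written down as a small checklist. The sketch below is only an illustration of the guidelines as stated here; the function name is hypothetical, and the example values are those reported for two gender-invariance models (MG1B and MG2B) in Table 3.

```python
def supports_parsimonious(fit_complex, fit_parsimonious,
                          max_cfi_drop=0.01, max_rmsea_rise=0.015):
    """Apply the rough guidelines to a pair of nested models.

    Each argument is a dict with keys "CFI", "TLI", and "RMSEA".
    Returns one boolean per guideline; True favors the parsimonious model.
    """
    d_cfi = fit_complex["CFI"] - fit_parsimonious["CFI"]        # loss in CFI
    d_rmsea = fit_parsimonious["RMSEA"] - fit_complex["RMSEA"]  # rise in RMSEA
    return {
        "delta_CFI": d_cfi < max_cfi_drop,
        "delta_RMSEA": d_rmsea < max_rmsea_rise,
        # More conservative guideline for indices that penalize complexity
        "TLI_as_good": fit_parsimonious["TLI"] >= fit_complex["TLI"],
        "RMSEA_as_good": fit_parsimonious["RMSEA"] <= fit_complex["RMSEA"],
    }

# MG1B (no loading invariance) versus MG2B (loadings invariant), from Table 3
mg1b = {"CFI": 0.911, "TLI": 0.892, "RMSEA": 0.028}
mg2b = {"CFI": 0.907, "TLI": 0.896, "RMSEA": 0.027}
checks = supports_parsimonious(mg1b, mg2b)
```

All four checks favor the more parsimonious model in this example. The output is a set of separate flags rather than a single verdict because the guidelines are meant to inform, not replace, a substantive evaluation of the parameter estimates.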
Especially in relation to the taxonomy of invariance tests, sup-
port for the invariance of a set of parameters should be based in
part on the similarity of parameters in models that do not impose
invariance constraints as well as on the goodness of fit in models
that do. Here we focus on both the similarity of the patterns of
parameters and the levels of the parameter estimates. For example,
here we evaluate the similarity of factor loadings on the basis of
various CFA and ESEM models—whether the same item has a
relatively high or low factor loading across different groups (or
occasions)—with a profile similarity index (PSI). To compute the
PSI, we simply construct a column that contains all the factor
loadings for one group and a second column of corresponding
factor loadings for the second group and then correlate the values
from the two columns. Hence the PSI is merely the correlation
between the two sets of factor loadings. To evaluate levels of the
parameter estimates, we compare descriptive statistics for the set
of coefficients in each group. Ultimately, however, an evaluation
of goodness of fit must be based upon a subjective integration of
many sources of information, including fit indices, a detailed
evaluation of parameter estimates in relation to a priori hypotheses,
previous research, and common sense.
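Because the PSI is simply a Pearson correlation between two matched columns of estimates, it can be sketched in a few lines. The loadings below are hypothetical and serve only to show the computation; the function name is likewise an assumption.

```python
from math import sqrt

def profile_similarity_index(estimates_a, estimates_b):
    """Pearson correlation between two matched vectors of parameter estimates."""
    n = len(estimates_a)
    if n != len(estimates_b) or n < 2:
        raise ValueError("need two vectors of equal length (at least 2 values)")
    mean_a = sum(estimates_a) / n
    mean_b = sum(estimates_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(estimates_a, estimates_b))
    var_a = sum((a - mean_a) ** 2 for a in estimates_a)
    var_b = sum((b - mean_b) ** 2 for b in estimates_b)
    return cov / sqrt(var_a * var_b)

# Hypothetical factor loadings for the same five items in two groups
group1 = [0.62, 0.45, 0.05, -0.10, 0.51]
group2 = [0.58, 0.49, 0.02, -0.06, 0.55]
psi = profile_similarity_index(group1, group2)
```

A high PSI indicates a similar pattern of estimates but says nothing about their level, which is why descriptive statistics for each set of coefficients are compared as well.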
Results
Big Five Factor Structure: ESEM Versus CFA
The starting point for the present investigation is to test our a
priori hypothesis that the ESEM model provides a better fit to
NEO-FFI responses than does a traditional ICM-CFA model.
Indeed, as emphasized by Marsh et al. (2009), the ESEM analysis
is predicated on the assumption that ESEM performs noticeably
better than does the ICM-CFA model in terms of goodness of fit
(see Table 2) and the construct validity of the interpretation of the
factor structure.
The ICM-CFA solution does not provide an acceptable fit to the
data (CFI = .685, TLI = .672; see TGCFA1A in Table 2),
consistent with previous research. The next model (TGCFA1B)
incorporates a priori CUs (based on the facet structure of the
NEO-PI-R; see earlier discussion and Appendix 2 of the supple-
mental materials); results are still inadequate, albeit improved
Table 2
Summary of Goodness-of-Fit Statistics for Total Group Models (Time 1 Data)
Model and description               χ²    df   CFI   TLI  NFParm  RMSEA
Total group CFA
  TGCFA1A: no CUs; no gender    15,488  1700  .685  .672     190   .049
  TGCFA1B: CUs; no gender       12,567  1643  .750  .731     247   .044
Total group ESEM
  TGESEM1A: no CUs; no gender    8,013  1480  .851  .821     410   .036
  TGESEM1B: CUs; no gender       5,201  1423  .914  .893     467   .028
Note. CFI = comparative fit index; TLI = Tucker–Lewis index; NFParm =
number of free parameters; RMSEA = root-mean-square error of
approximation; CFA = confirmatory factor analysis; ESEM = exploratory
structural equation modeling; CUs = a priori correlated uniquenesses (based
on the facet design of the instrument).
(CFI = .750, TLI = .731). The corresponding ESEM solutions fit
the data much better. Although the fit of the total group with no a
priori CUs is still not acceptable (TGESEM1A: CFI = .851,
TLI = .821; see Table 2), the inclusion of CUs results in a
marginally acceptable fit to the data (TGESEM1B: CFI = .914,
TLI = .893, RMSEA = .028).
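The degrees of freedom in Table 2 can be reproduced from the number of free parameters. The sketch below assumes that the data points for each group are the 60 item means plus the unique variances and covariances among the 60 items; this assumption is ours, but it reproduces the reported values. The function name is hypothetical.

```python
def model_df(n_items, n_free_params, n_groups=1):
    """Degrees of freedom for a mean-and-covariance structure model."""
    # Each group contributes p means plus p(p + 1)/2 variances and covariances.
    moments_per_group = n_items + n_items * (n_items + 1) // 2
    return n_groups * moments_per_group - n_free_params

# Number of free parameters (NFParm) for each model in Table 2
nfparm = {"TGCFA1A": 190, "TGCFA1B": 247, "TGESEM1A": 410, "TGESEM1B": 467}
dfs = {name: model_df(60, k) for name, k in nfparm.items()}
```

With 60 items there are 1,890 data points per group, so each of the 57 a priori CUs costs one degree of freedom (e.g., 1700 versus 1643 for the two CFA models).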
It is also instructive to compare parameter estimates based on
the ICM-CFA and ESEM solutions (see Appendix 3 of the sup-
plemental materials). In both types of models, the factor loadings
tend to be modest, with few factor loadings greater than .70 and
some factor loadings less than .30. Although CFA factor loadings
(Mdn = .47) are slightly higher than those for the ESEM model
(Mdn = .46), the differences are typically small and the pattern of
factor loadings is similar for the CFA and ESEM solutions. To
quantify this subjective evaluation, we computed a PSI in which
the vector of 60 CFA factor loadings was related to the corresponding
vector of 60 EFA target loadings. The PSI (r = .87)
demonstrated that ESEM and CFA factor loadings were highly
related. Consistent with McCrae and Costa (2004), the 14 items
that they noted as potentially weak also had lower factor loadings
than the remaining 46 items did for both ICM-CFA (M = .38 vs.
.49, respectively) and ESEM (M = .32 vs. .48, respectively)
solutions. Although a few of these 14 items performed well here,
we note that these same items also did well in the original McCrae
and Costa study. Importantly, almost all 60 items load more
positively on the ESEM factor that each was designed to measure
and less positively on all other factors.
A detailed evaluation of the factor correlations among the Big
Five factors demonstrates a critical advantage of the ESEM approach
over the ICM-CFA approach. Although patterns of correlations
are similar, the CFA factor correlations (–.502 to +.400;
Mdn absolute value = .197) tend to be systematically larger than
the ESEM factor correlations (–.205 to +.140; Mdn absolute
value = .064). Thus, for example, the negative correlation between
Neuroticism and Extraversion is –.502 on the basis of the CFA
solution but only –.205 for the ESEM solution. Similarly, the
correlation between Extraversion and Conscientiousness is +.400
for the CFA results but only +.104 for the ESEM results. In this
respect, the ESEM solution is more consistent with a priori pre-
dictions that the Big Five personality factors are reasonably or-
thogonal.
Clearly the ESEM solution is superior to the CFA solution, in
terms of both fit and distinctiveness of the factors that are consis-
tent with Big Five theory. The comparison of results from these
two models provides the initial and most important test for the
appropriateness of the ESEM model—at least relative to the CFA
model. It is also important to emphasize that the goodness of fit for
the ESEM model is apparently far better than what has ever been
achieved in previous research with the NEO-FFI on the basis of
factor analyses conducted at the item level.
Invariance Over Gender
How stable is the NEO-FFI factor structure over gender? Are there
systematic gender differences in latent means, and are the underlying
assumptions that are needed to justify interpretations of these results
met? To address these questions, we applied our taxonomy of 13
ESEM models (see Table 1). The basic strategy is to apply the set of
13 models designed to test different levels of factorial and measure-
ment invariance, ranging from the least demanding model, which
imposes no invariance constraints (configural invariance), to the most
demanding model, which posits complete gender invariance in rela-
tion to the Big Five factor structure, latent means, and item intercepts.
However, application of this taxonomy of models is complicated by
two features that are partially idiosyncratic to this application: the a
priori CUs and tests of partial invariance of item intercepts (Byrne et
al., 1989). The results already presented on the basis of the total
sample indicate that a priori CUs are necessary to achieve even a
minimally acceptable fit to the data. However, it is also important to
determine the extent to which these a priori CUs are invariant over
gender and how these influence the behavior of the various models.
For all 13 models we begin by evaluating the 57 a priori CUs.
Hence, we first test models with no CUs (e.g., MG1 in Table 3
corresponds to the first model in the invariance taxonomy in Table
1). We then test two additional variations: one in which the a priori
CUs are allowed to vary for men and women (submodels labeled
A in the Description column of Table 3, as in MG1A) and another
in which the CUs are constrained to be invariant over responses by
men and women (submodels labeled B in Table 3, as in MG1B).
Hence, within this set of three submodels there is a systematic
nesting to evaluate the a priori CUs and their invariance over
gender in relation to each of the 13 invariance models described in
Table 1.
For the models that posit gender differences in latent means for
the Big Five factors, we also test several models to evaluate partial
invariance. Submodels labeled C posit partial invariance (i.e., item
intercepts identified in preliminary analyses are freely estimated
and not constrained to be invariant over gender—see subsequent
discussion) but with no CUs. In submodels labeled D the set of
57 a priori CUs is added, and in submodels labeled E these a priori
CUs are constrained to be equal over gender. Hence, within this set
of five submodels there is a systematic nesting that allows evalu-
ation of the CUs and their invariance over gender, partial invari-
ance, and combinations of these constraints.
Model MG1 (see Table 3), with no invariance constraints, does
not provide an acceptable fit to the data (TLI = .823, CFI = .852).
Indeed, these fit statistics are approximately the same as those
based on the total group ESEM model (see TGESEM1A in Table 2)
with twice the degrees of freedom (2960 vs. 1480) and twice the
number of estimated parameters (820 vs. 410). However, consis-
tent with earlier results, the inclusion of the set of a priori CUs
substantially improves the fit to a marginally acceptable level
(TLI = .891, CFI = .912; see MG1A in Table 3). Importantly,
constraining these a priori CUs to be invariant over gender (see
MG1B in Table 3) resulted in almost no change in fit. For fit
indices that control for parsimony, the fit is essentially unchanged
or slightly better for MG1B than for MG1A, respectively (.891 to
.892 for TLI; .028 to .028 for RMSEA). For the CFI that is
monotonic with parsimony, the change (.912 to .911) is clearly less
than the .01 value typically used to support invariance constraints.
These results are substantively important, demonstrating that the
sizes of the 57 a priori CUs are reasonably invariant over gender.
For each of the 13 models used to test the factorial invariance of
the full mean structure (see Table 1), the inclusion of this set of a
priori CUs substantially improves the goodness of fit to a similar
degree. Furthermore, for each of these tests comparing freely
estimated CUs and constraining CUs to be invariant over gender,
there is support for the invariance of the CUs. The consistency of
Table 3
Summary of Goodness-of-Fit Statistics for All Gender Invariance (IN) Models (Time 1 Data)
Model and description                                               χ²    df   CFI   TLI  NFParm  RMSEA
MG1 (configural IN)
  MG1: no IN (configural IN)                                     9,373  2960  .852  .823     820   .036
  MG1A: MG1 with CUs (not invariant over sex)                    6,654  2846  .912  .891     934   .028
  MG1B: MG1A with CUs IN (invariant over sex)                    6,743  2903  .911  .892     877   .028
MG2 (FL; weak factorial/measurement IN)
  MG2: IN = FL (weak factorial/measurement IN)                   9,831  3235  .848  .833     545   .035
  MG2A: MG2 with CUs                                             7,124  3121  .908  .895     659   .028
  MG2B: MG2A with CUs IN                                         7,218  3178  .907  .896     602   .027
MG3 (FL & Uniq)
  MG3: IN = FL, Uniq                                            10,264  3295  .839  .827     485   .035
  MG3A: MG3 with CUs                                             7,513  3181  .900  .889     599   .028
  MG3B: MG3A with CUs IN                                         7,644  3238  .898  .889     542   .028
MG4 (FL & FVCV)
  MG4: IN = FL, FVCV                                             9,908  3250  .846  .833     530   .035
  MG4A: MG4 with CUs                                             7,204  3136  .906  .894     643   .028
  MG4B: MG4A with CUs IN                                         7,296  3193  .905  .895     587   .028
MG5 (FL & Inter; strong factorial/measurement IN)
  MG5: IN = FL, Inter (strong factorial/measurement IN)         10,937  3290  .824  .810     490   .037
  MG5A: MG5 with CUs                                             7,982  3176  .889  .876     604   .030
  MG5B: MG5A with CUs IN                                         8,079  3233  .888  .878     547   .033
  MG5C: MG5 with P-IN, no CUs                                    9,951  3267  .846  .833     513   .035
  MG5D: MG5C with CUs                                            7,223  3153  .906  .895     627   .028
  MG5E: MG5D with CUs IN                                         7,316  3210  .905  .895     570   .027
MG6 (FL, FVCV, Uniq)
  MG6: IN = FL, FVCV, Uniq                                      10,346  3310  .838  .826     470   .035
  MG6A: MG6 with CUs                                             7,602  3196  .898  .887     584   .029
  MG6B: MG6A with CUs IN                                         7,731  3253  .897  .888     527   .028
MG7 (FL, Uniq, Inter; strict factorial/measurement IN)
  MG7: IN = FL, Uniq, Inter (strict factorial/measurement IN)   11,377  3350  .815  .804     430   .038
  MG7A: MG7 with CUs                                             8,376  3236  .881  .870     544   .031
  MG7B: MG7A with CUs IN                                         8,505  3293  .880  .871     487   .031
  MG7C: MG7 with Inter (P-IN), no CUs                           10,383  3327  .837  .827     453   .035
  MG7D: MG7C with CUs                                            7,611  3213  .899  .888     567   .028
  MG7E: MG7D with CUs IN                                         7,744  3270  .897  .888     510   .028
MG8 (FL, FVCV, Inter)
  MG8: IN = FL, FVCV, Inter                                     11,012  3305  .822  .809     475   .037
  MG8A: MG8 with CUs                                             8,060  3191  .888  .875     589   .030
  MG8B: MG8A with CUs IN                                         8,156  3248  .887  .877     532   .030
  MG8C: MG8 with Inter (P-IN), no CUs                           10,029  3282  .844  .832     498   .035
  MG8D: MG8C with CUs                                            7,303  3168  .905  .893     612   .028
  MG8E: MG8D with CUs IN                                         7,397  3225  .904  .894     555   .028
MG9 (FL, Uniq, FVCV, Inter)
  MG9: IN = FL, FVCV, Uniq, Inter                               11,458  3365  .813  .803     415   .038
  MG9A: MG9 with CUs                                             8,464  3251  .880  .869     529   .031
  MG9B: MG9A with CUs IN                                         8,591  3308  .878  .870     472   .031
  MG9C: MG9 with Inter (P-IN), no CUs                           10,467  3342  .836  .826     438   .035
  MG9D: MG9C with CUs                                            7,700  3228  .897  .887     552   .029
  MG9E: MG9D with CUs IN                                         7,829  3285  .895  .887     495   .029
MG10 (FL, Inter, LFMn; latent mean IN)
  MG10: IN = FL, Inter, LFMn                                    11,550  3295  .809  .795     485   .039
  MG10A: MG10 with CUs                                           8,625  3181  .874  .860     599   .032
  MG10B: MG10A with CUs IN                                       8,720  3238  .873  .862     542   .032
  MG10C: MG10 with Inter (P-IN), no CUs                         10,466  3272  .834  .820     508   .036
  MG10D: MG10C with CUs                                          7,749  3158  .894  .881     622   .029
  MG10E: MG10D with CUs IN                                       7,842  3215  .893  .882     565   .029
MG11 (FL, Uniq, Inter, LFMn; manifest mean IN)
  MG11: IN = FL, Uniq, Inter, LFMn                              11,990  3355  .801  .790     425   .039
  MG11A: MG10 with CUs                                           9,020  3241  .867  .854     539   .032
  MG11B: MG10A with CUs IN                                       9,149  3298  .865  .855     482   .032
  MG11C: MG10 with Inter (P-IN), no CUs                         10,902  3332  .825  .814     448   .037
  MG11D: MG10C with CUs                                          8,141  3218  .886  .875     562   .030
  MG11E: MG10D with CUs IN                                       8,272  3275  .885  .875     505   .030
(table continues)
this pattern of results over the wide variety of different models is
impressive and provides clear support for the inclusion of these a
priori CUs based on the design of the NEO-FFI. However, in order
to facilitate communication of the results, we will focus primarily
on models in which CUs are included and constrained to be
invariant over gender (e.g., Model MG1B for Model 1).
Descriptive similarity of solutions for men and women. Before
formally testing the invariance of different parameters over
gender, it is useful to evaluate the similarity of solutions when
these parameters are freely estimated for men and women (see
Appendix 4 of the supplemental materials). Of particular impor-
tance are the factor loadings. First we evaluate how similar the
pattern of factor loadings is for men and women based on a PSI
(i.e., the relation between the 300 factor loadings based on re-
sponses by men and those based on responses by women). The
extremely high PSI (r = .97) indicates that the pattern of factor
loadings is similar. Furthermore, the actual values of the factor
loadings are similar across the two groups. Nontarget loadings are
consistently small for both groups (Men: –.33 to +.32, Mdn =
–.01; Women: –.38 to +.32, Mdn = –.01), whereas target loadings
were consistently higher (Men: .05 to .74, Mdn = .46; Women: .10
to .73, Mdn = .46). Although there are apparently a few weak
items, even these items are typically weak across both groups. The
pattern of factor correlations for the two groups is also similar
(PSI = .93), whereas the absolute values of the correlations are
consistently small (Men: .01 to .20, Mdn = .06; Women: .00 to
.25, Mdn = .06). Item uniquenesses are also similar for the two
groups (PSI = .91), as are the values for the two groups (Men: .43
to .99, Mdn = .72; Women: .47 to .99, Mdn = .73).
The invariance of item intercepts is especially important for
subsequent tests of measurement invariance. The pattern of item
intercepts is similar for the two groups (PSI ? .94), but intercepts
are somewhat higher for women (2.49 to 6.32, Mdn = 3.46) than
for men (3.52 to 5.95, Mdn = 3.42). A nominal test of this
difference was statistically significant (M for
men = 3.52, M for women = 3.83), t(59) = 7.15, p < .001
(similar tests of significance on each of the other sets of parameters
were nonsignificant). These differences in intercepts are consistent
with higher mean ratings by women, but more appropriate tests of
this observation require more formal tests of mean structure in-
variance pursued in the next section.
In summary, descriptive summaries of parameter estimates in
Appendix 4 of the supplemental materials suggest that the factor
solutions—with the possible exception of item intercepts—are
similar for the two groups. We now pursue formal tests of this
invariance in relation to the taxonomy of invariance models pre-
sented in Table 1.
Tests of invariance over gender. Weak factorial/measurement
invariance tests whether the factor loadings are the same
for men and women. Model MG2B (along with MG2 and
MG2A) tests the invariance of factor loadings over gender. The
critical comparison between the more parsimonious MG2B
(with factor loadings invariant) and less parsimonious MG1B
(with no factor loading invariance) supports the invariance of
factor loadings over gender. Fit indices that control for model
parsimony are as good or better for the more parsimonious
MG2B (TLI = .896 vs. .892; RMSEA = .027 vs. .028), whereas
the difference in CFI (.907 vs. .911) is less than the value of .01
typically used to reject the more parsimonious model.
Strong measurement invariance requires that item inter-
cepts—as well as factor loadings—be invariant over groups. The
critical comparison is thus between Models MG2B and MG5B and
tests whether differences in the 60 intercepts can be explained in
terms of five latent means (i.e., a complete absence of differential
functioning). The change in df ? 55 represents the 60 new con-
straints on item intercepts minus the five latent factor means that
are now freely estimated. However, the fit of MG5B (CFI ? .888,
TLI ? .878) is not acceptable and is worse than the fit of the
corresponding model MG2B (CFI ? .907, TLI ? .896). Hence,
gender differences at the level of item means cannot be explained
in terms of the factor means, and there is differential item func-
tioning between gender groups.
Because there is strong evidence that item intercepts are not
completely invariant and invariance of item intercepts is so central
to the evaluation of latent mean differences, we pursued alternative
tests of partial invariance of item intercepts (see Models MG5C–
Weak factorial/measure-
Table 3 (continued)

Model and description                           χ2    df   CFI   TLI  NFParm  RMSEA
MG12 (FL, FVCV, Inter, LFMn)
  MG12: IN = FL, FVCV, Inter, LFMn          11,638  3310  .808  .794     470   .039
  MG12A: MG12 with CUs                       8,717  3196  .873  .859     584   .032
  MG12B: MG12A with CUs IN                   8,812  3253  .872  .860     527   .032
  MG12C: MG12 with Inter (P-IN), no CUs     10,552  3287  .832  .819     493   .036
  MG12D: MG12C with CUs                      7,838  3173  .892  .888     607   .029
  MG12E: MG12D with CUs IN                   7,931  3230  .892  .881     550   .029
MG13 (FL, Uniq, FVCV, Inter, LFMn; complete factorial IN)
  MG13: IN = FL, Inter, Uniq, FVCV, LFMn    12,084  3370  .799  .789     410   .039
  MG13A: MG13 with CUs                       9,121  3256  .865  .853     524   .033
  MG13B: MG13A with CUs IN                   9,249  3313  .863  .854     467   .033
  MG13C: MG13 Inter (P-IN), no CUs          10,994  3347  .824  .813     433   .037
  MG13D: MG13C with CUs                      8,240  3233  .884  .873     547   .030
  MG13E: MG13D with CUs IN                   8,368  3290  .883  .873     490   .030

Note. For multiple-group (MG) IN models, IN refers to the sets of parameters constrained to be invariant across the multiple groups. CFI = comparative
fit index; TLI = Tucker–Lewis index; NFParm = number of free parameters; RMSEA = root-mean-square error of approximation; CUs = correlated
uniquenesses; FL = factor loadings; Uniq = item uniquenesses; FVCV = factor variances–covariances; Inter = item intercepts; P-IN = partial IN;
LFMn = latent factor means.
MG5E in Table 3). We identified, on the basis of (ex post facto)
modification indices in which we freed parameters one at a time,
23 (of 60) item intercepts that contributed most to the misfit
associated with the complete invariance of item intercepts (see
Appendix 2 of the supplemental materials). The results support
partial invariance of item intercepts. For example, fit indices that
control for parsimony are nearly the same for MG5E compared
with MG2B (.895 vs. .896 for TLI; .027 vs. .027 for RMSEA),
whereas the difference in CFIs (.905 vs. .907) is less than the .01
value that would lead to the rejection of constraints imposed in
MG5E. However, these results should be interpreted cautiously
because the modifications were made ex post facto (see subsequent
discussion about partial invariance).
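The decision rule applied in these comparisons can be sketched in a few lines of Python. This is an illustration only, not the authors' procedure: the function name `supports_invariance` is hypothetical, the .01 cutoff is the conventional value cited in the text, and the CFI values are those reported above for MG2B, MG5B, and MG5E.

```python
def supports_invariance(cfi_free, cfi_constrained, cutoff=0.01):
    """Return True when the drop in CFI produced by the added
    invariance constraints is smaller than the cutoff."""
    # Round to three decimals, the precision at which CFIs are reported.
    delta = round(cfi_free - cfi_constrained, 3)
    return delta < cutoff


# CFI values as reported in the text (hypothetical container name).
cfi = {"MG2B": 0.907, "MG5B": 0.888, "MG5E": 0.905}

# Full intercept invariance (MG5B) versus loadings only (MG2B).
full_intercepts = supports_invariance(cfi["MG2B"], cfi["MG5B"])
# Partial intercept invariance (MG5E) versus loadings only (MG2B).
partial_intercepts = supports_invariance(cfi["MG2B"], cfi["MG5E"])

print(full_intercepts, partial_intercepts)
```

Under this rule the full-invariance model loses too much fit, whereas the partial-invariance model does not, which matches the conclusion drawn in the text.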
Strict measurement invariance requires that item uniquenesses,
item intercepts, and factor loadings all be invariant over the
groups. Here, the critical comparison is between Models MG5 and
MG7; the change in df = 60 represents the 60 new constraints for
item uniquenesses. Although Model MG7B does not provide an
adequate goodness of fit to the data, the addition of the ex post
facto partial-invariance strategy for the intercepts substantially
improves the fit. However, the fit of MG7E (CFI = .897, TLI =
.888) is only marginally acceptable and is apparently worse than
the fit of the corresponding model MG5E (CFI = .905, TLI =
.895). Nevertheless, comparison of all the various pairs of models that
test this invariance of the uniquenesses (MG3B vs. MG2B; MG6B
vs. MG4B; MG7B vs. MG5B; MG7E vs. MG5E; MG9B vs.
MG8B; MG9E vs. MG8E; MG11B vs. MG10B; MG13B vs.
MG12B; MG13E vs. MG12E) consistently results in a change in
CFIs that is slightly less than the .01 value typically used to
support the more parsimonious model with uniquenesses invariant.
Although it would be possible to pursue a strategy of partial
invariance of uniquenesses, we did not do so because the evalua-
tion of latent mean differences that is our main focus does not
depend on the invariance of uniquenesses.
Factor variance–covariance invariance is typically not a focus
of measurement invariance, but it is frequently an important focus
of studies of the invariance of covariance structures—particularly
studies of the discriminant validity of multidimensional constructs
that might subsequently be extended to include relations with other
constructs. Although the comparison of correlations among Big
Five factors across groups is common, these are typically based on
manifest scores that do not control for measurement error and
make implicit invariance assumptions that are rarely tested. Here,
the most basic comparison is between Models MG2 (factor loadings
invariant) and MG4 (factor loadings and factor variance–covariance
invariant). The change in df = 15 represents the 10
factor covariances and five factor variances. The results provide
reasonable support for the additional invariance constraints, both
in terms of the values for the fit indices and their comparison with
MG2. For example, fit indices that control for parsimony are
nearly the same for MG4B compared with MG2B (.895 vs. .896
for TLI; .028 vs. .027 for RMSEA), whereas the difference in CFIs
(.905 vs. .907) is less than the .01 cutoff value that would lead to
the rejection of constraints imposed in MG4B.
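The df = 15 in this comparison follows from counting the unique elements of a 5 × 5 factor variance–covariance matrix. A minimal sketch of that count is given below; the function name is hypothetical and introduced only for illustration.

```python
def n_variance_covariance_params(n_factors):
    """Count the unique elements of a symmetric factor
    variance-covariance matrix with n_factors factors."""
    variances = n_factors                            # diagonal elements
    covariances = n_factors * (n_factors - 1) // 2   # below-diagonal elements
    return variances, covariances, variances + covariances


# Five Big Five factors: 5 variances + 10 covariances = 15 constraints.
variances, covariances, total = n_variance_covariance_params(5)
print(variances, covariances, total)
```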
Tests of the invariance of the latent factor variance–covariance
matrix, as is the case with other comparisons, could be based on
any pair of the six models in Table 3 that differ only in relation to
whether the factor variance–covariance matrix is free or not.
Although each of these pairs of models differs by df = 15,
corresponding to the parameters in the variance–covariance ma-
trix, they are not equivalent; support for the invariance of the
variance–covariance matrix could be found in some of those
comparisons but not in others. Although we suggest that the
comparison between Models MG4 and MG2 is the most basic
comparison, valuable information can also be obtained from the
other comparisons as well. Especially if there are systematic,
substantively important differences in the interpretations on the
basis of these different comparisons, further scrutiny would be
warranted in that true differences in the factor variance–
covariance matrix might be “absorbed” into differences in other
parameters that are not constrained to be invariant. Fortunately,
this complication is not evident in the present investigation, be-
cause support for the invariance of factor variance–covariance
matrix is consistent across each of these alternative comparisons.
Finally, we are now in a position to address the issue of the
invariance of the factor means across the two groups. The final
four models (see MG10–MG13 in Table 3) in the taxonomy all
constrain mean differences between men and women to be
zero—in combination with the invariance of other parameters.
Again, there are several models that could be used to test gender
mean invariance; they include (a) MG5 versus MG10, (b) MG7
versus MG11, (c) MG8 versus MG12, and (d) MG9 versus MG13.
However, our earlier inspection of item intercepts suggests that
there are systematic gender differences in latent means. Hence, it
is not surprising that Models MG10–MG13 are also rejected. These results
imply that latent means representing the Big Five factors differ
systematically for men and women. Consistent with a priori predictions,
latent means are systematically higher for women on all
Big Five factors, although the largest differences are for
Neuroticism and Conscientiousness.
An alternative, pragmatic approach to the comparison of the
means for the different models is to evaluate the extent to which
the pattern of latent mean gender differences varies as a function of
the models considered. Hence, in Table 4 we summarize gender
differences on the basis of each of the 24 models that provide
estimates of gender differences. The set of 276 PSIs among all
possible pairs of the 24 profiles varied from .852 to .999 (mean r =
.957). Therefore, the pattern of gender differences was similar
across the different models. This suggests, at least in this applica-
tion, that gender differences are reasonably robust in relation to
violations of underlying assumptions of gender invariance in the
various models.
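The profile comparison described above can be illustrated with a short Python sketch. It rests on an assumption suggested by the reported "mean r": that a PSI is the Pearson correlation between two profiles of estimates. The helper `pearson` and the dictionary names are hypothetical; the three profiles are the Table 4 gender differences (NEUR, EXTR, OPEN, AGRE, CONC) for MG5, MG5C, and MG9E, chosen only as examples.

```python
from itertools import combinations
from math import comb, sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Latent mean gender differences from Table 4 (three of the 24 models).
profiles = {
    "MG5": [0.622, 0.317, 0.378, 0.173, 0.597],
    "MG5C": [0.524, 0.436, 0.362, 0.289, 0.571],
    "MG9E": [0.614, 0.400, 0.330, 0.304, 0.578],
}

# One similarity value per unordered pair of profiles.
psi = {
    (a, b): pearson(profiles[a], profiles[b])
    for a, b in combinations(profiles, 2)
}

# Number of unordered pairs among all 24 profiles.
n_pairs_24 = comb(24, 2)

for pair, value in psi.items():
    print(pair, round(value, 3))
print(n_pairs_24)
```

The pair count reproduces the 276 PSIs reported in the text; applying the same loop to all 24 rows of Table 4 would yield the full set.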
Invariance Over Time
With some adaptation, it is possible to apply the same set of 13
models to test the invariance of the Big Five factor structure over
time using the ESEM approach with test–retest data. As with the
tests of invariance over gender, we hypothesized that the same set
of 57 a priori CUs (based on the design of the NEO instrument) are
required. Because there are parallel CUs for T1 and T2 responses,
we can also test the invariance of these CUs over time. However,
we also posit a second a priori set of 60 CUs to account for the
residual associations between matching items at T1 and T2 (see
earlier discussion). Here we distinguish within-wave CUs
(WWCUs) and cross-wave CUs (CWCUs). The WWCUs consist
of 57 WWCUs that are specific to the design of the NEO-FFI
already considered in previous analyses. In the longitudinal models
considered here, we also posit that the same set of WWCUs affect
responses at T1 and T2, and we test their invariance over time.
CWCUs are the set of 60 CWCUs relating uniquenesses associated
with matching items at T1 and T2. In these longitudinal models,
we evaluate the effect of their inclusion on goodness of fit and on
other parameter estimates in the model—particularly latent test–
retest correlations of the same construct over time.
Longitudinal factor structure of NEO-FFI responses. Configural
invariance refers to tests of whether the a priori model fits
the data when no invariance constraints are imposed (see LIM1 in
Table 5). In LIM1, no CUs are posited (neither WWCUs nor
CWCUs) and the fit of LIM1 is poor (CFI = .737, TLI = .712).
In LIM1A, the inclusion of the 60 CWCUs improves the fit
substantially (CFI = .886, TLI = .874), but the fit is still not acceptable.
In LIM1B, the two sets of 57 WWCUs (but not CWCUs) are added
to Model LIM1 and then constrained to be invariant over time in
LIM1C. Based on goodness of fit, there is a modest increase in fit
associated with the addition of WWCUs and little or no decrement
in fit associated with holding them invariant over the two waves of
data. However, both of these models are technically improper in
that the factor variance–covariance matrix is not positive definite
(suggesting that some single latent variable or combination of
latent variables is a linear combination of some other variable or
some different combination of variables). Clearly this dictates
caution in the interpretation of the results or, perhaps, that these
models should simply be rejected as misspecified. Although these
problems support our contention that CWCUs should be included,
we return to this issue shortly.
In Model LIM1D, all the a priori CUs are included (the two
sets of WWCUs and the one set of CWCUs). Then, in LIM1E,
the two sets of WWCUs are constrained to be invariant over
time. Unlike in the previous two longitudinal models, solutions
based on these models are fully proper, represent a substantial
improvement in goodness of fit over previous models, and are
at least marginally acceptable in terms of goodness of fit (TLIs
and CFIs are greater than .90). Furthermore, Model LIM1E
provides good support for the invariance of the WWCUs over
time (T1 and T2).
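The bookkeeping behind the LIM1 submodels can be checked against the number of free parameters (NFParm) reported in Table 5. The sketch below is only a consistency check under the counts stated in the text (60 CWCUs, 57 WWCUs per wave); the variable names are hypothetical.

```python
N_CWCU = 60  # cross-wave CUs linking matching items at T1 and T2
N_WWCU = 57  # within-wave CUs in each wave

# Number of free parameters reported in Table 5.
nfparm = {
    "LIM1": 845, "LIM1A": 905, "LIM1B": 959,
    "LIM1C": 902, "LIM1D": 1019, "LIM1E": 962,
}

# Parameters each submodel should add to LIM1 under the stated design.
expected = {
    "LIM1A": N_CWCU,               # CWCUs only
    "LIM1B": 2 * N_WWCU,           # WWCUs free in both waves
    "LIM1C": N_WWCU,               # WWCUs invariant over waves
    "LIM1D": N_CWCU + 2 * N_WWCU,  # CWCUs plus free WWCUs
    "LIM1E": N_CWCU + N_WWCU,      # CWCUs plus invariant WWCUs
}

added = {model: nfparm[model] - nfparm["LIM1"] for model in expected}
print(added == expected)
```

Constraining the WWCUs to be invariant over time halves their contribution (57 rather than 114 parameters), which is why LIM1E is more parsimonious than LIM1D.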
It is also instructive to compare the parameter estimates based
on T1 and T2 ESEM solutions (see Appendix 4 of the supplemen-
tal materials). The sizes of the factor loadings tend to be modest,
with few factor loadings greater than .70 and some target factor
loadings less than .30. However, the pattern of loadings is similar
across the two waves (PSI = .98). Although T2 target loadings
(.10 to .72, Mdn = .50) are slightly higher than the T1 target
loadings (.05 to .72, Mdn = .48), the differences are small. For
both waves of data, the average nontarget loading is close to zero
but quite variable (T1: –.43 to .27, Mdn = .00; T2: –.41 to .26,
Mdn = .00). Also, the pattern of the 10 T1
factor correlations is similar to that of the matching T2 factor correlations
Table 4
Patterns of Gender Differences on Big Five Latent Mean Factors

Model and description                   NEUR  EXTR  OPEN  AGRE  CONC
MG5 (strong factorial/measurement IN)
  MG5: IN = FL, Inter                   .622  .317  .378  .173  .597
  MG5A: MG5 with CUs                    .647  .330  .363  .156  .660
  MG5B: MG5A with CUs IN                .646  .330  .361  .157  .660
  MG5C: MG5 with P-IN, no CUs           .524  .436  .362  .289  .571
  MG5D: MG5C with CUs                   .553  .429  .333  .306  .598
  MG5E: MG5D with CUs IN                .552  .430  .334  .307  .596
MG7 (strict factorial/measurement IN)
  MG7: IN = FL, Uniq, Inter             .621  .322  .381  .176  .600
  MG7A: MG7 with CUs                    .642  .338  .365  .159  .667
  MG7B: MG7A with CUs IN                .643  .337  .364  .158  .667
  MG7C: MG7 with P-IN, no CUs           .525  .443  .365  .294  .576
  MG7D: MG7C with CUs                   .551  .439  .335  .312  .605
  MG7E: MG7D with CUs IN                .551  .437  .335  .311  .603
MG8
  MG8: IN = FL, FVCV, Inter             .680  .285  .374  .163  .579
  MG8A: MG8 with CUs                    .706  .294  .361  .156  .641
  MG8B: MG8A with CUs IN                .708  .292  .358  .156  .641
  MG8C: MG8 with P-IN, no CUs           .586  .405  .359  .281  .552
  MG8D: MG8C with CUs                   .614  .398  .332  .302  .577
  MG8E: MG8D with CUs IN                .614  .398  .332  .302  .576
MG9
  MG9: IN = FL, FVCV, Uniq, Inter       .680  .287  .374  .164  .577
  MG9A: MG9 with CUs                    .706  .297  .358  .156  .639
  MG9B: MG9A with CUs IN                .707  .295  .357  .155  .641
  MG9C: MG9 with P-IN, no CUs           .588  .408  .359  .283  .553
  MG9D: MG9C with CUs                   .615  .401  .331  .305  .578
  MG9E: MG9D with CUs IN                .614  .400  .330  .304  .578

Note. See Tables 1 and 2 for a description of the models. Each of the 28 models provides estimates of gender differences in the Big Five factors under
different assumptions. The pattern of gender differences across the 28 models is similar, with the correlation varying from .848 to .999 (mean r = .959).
NEUR = Neuroticism; EXTR = Extraversion; OPEN = Openness; AGRE = Agreeableness; CONC = Conscientiousness; MG = multiple group; IN =
invariance (for multiple-group IN models, IN refers to the sets of parameters constrained to be invariant across the MGs); FL = factor loadings; Inter =
item intercepts; CUs = correlated uniquenesses; P-IN = partial IN; Uniq = item uniquenesses.
(PSI = .954). In each case, the absolute value of correlations is
modest (T1: Mdn r = .096; T2: Mdn r = .088). Finally, the pattern
of intercepts is also similar (PSI = .966), although T1 intercepts
are consistently somewhat lower than those at T2 (T1: Mdn =
3.56, M = 3.75; T2: Mdn = 3.61, M = 3.83). Results for
T1 responses are similar to those considered earlier (see Table 2), but
this is hardly surprising, because the T1 responses considered here are
a subset of the data considered earlier. What is important, however, is
that the factor solution for T1 is highly similar to that based on T2
responses by the same students. Next we pursue more formal tests of
these observations for ESEM models of longitudinal invariance. On
the basis of our initial analyses, we consider primarily submodel E, which includes
CWCUs and invariant WWCUs.
Invariance of NEO-FFI factor structure over time. Weak
factorial/measurement invariance tests the invariance of factor
loadings over time. Because model LIM2E (with factor loadings
invariant over time) is so much more parsimonious than is LIM1E
(factor loadings free), it is not surprising that the CFI is marginally
better for LIM1E (.912) than for LIM2E (.907; see Table 5).
However, this difference is less than the .01 difference typically
taken as support for the less parsimonious model. Furthermore,
indices that take into account parsimony (TLI and RMSEA) are
nearly identical for the two models. Consistent with this observation,
factor loadings for T1 and T2 when invariance constraints
were not imposed were very similar (see earlier discussion).
Strong measurement invariance requires that item intercepts,
as well as factor loadings, be invariant over time, and the
critical comparison is between Models LIM2E (factor loadings
invariant) and LIM5E (factor loadings and item intercepts invariant).
The CFI for LIM5E (.899) is marginally lower than those for
LIM2E (ΔCFI = .008) and particularly LIM1E (ΔCFI = .013),
and these differences approach or exceed the nominal .01 cutoff.
This difference is also evident in differences in TLIs that control
for parsimony (.893 vs. .901 and .902 for LIM5E, LIM2E, and
LIM1E, respectively). These results indicate that there is only
modest support for invariance of item intercepts and suggest that
there might be differential item functioning over time. Furthermore,
this pattern of results is replicated in the comparison of other
models that differ only in terms of intercept invariance (e.g.,
LIM8E vs. LIM4E, LIM9E vs. LIM6E). Because the invariance of
item intercepts is so central to the evaluation of latent mean
differences, we pursued alternative tests of partial invariance of
item intercepts. We identified, on the basis of (ex post facto)
modification indices, 11 (of 60) item intercepts that contributed
most to the misfit associated with the complete invariance of
item intercepts. We conclude, on the basis of submodel LIM5Ep
(the p indicating partial invariance; CFI = .904, TLI = .898),
that there is at least reasonable support for the partial invariance
of item intercepts. Although the improved fit of this submodel
(LIM5Ep) over the corresponding submodel of full intercept
invariance (LIM5E) is not large, for now we focus on models of
partial intercept invariance (based on freeing these 11 item
intercepts) rather than complete intercept invariance (but return
to this issue in subsequent discussion).
Strict measurement invariance requires that item uniquenesses,
as well as item intercepts and factor loadings, be invariant over
time. The critical submodel LIM7Ep tests the invariance of factor
loadings and item uniquenesses and partial invariance of item
intercepts (CFI = .899, TLI = .894). Consistent with interpretations
of previous models, comparison of this submodel LIM7Ep
with model LIM5Ep suggests modest support for the invariance of
item uniquenesses (ΔCFI = .005, ΔTLI = .004). Additional comparisons
of models differing only by the inclusion of invariant
item uniquenesses support this conclusion. Although it would be
possible to pursue tests of partial invariance of uniquenesses, we
did not do so because the evaluation of latent mean differences does not
depend on the invariance of uniquenesses.
Tests of the invariance of the latent factor variance–covariance
matrix, as is the case with other comparisons, could be based on
any pair of models in Table 5 that differ only in relation to whether
the factor variance–covariance matrix is free or not. The most
basic comparison (LIM4E vs. LIM2E) suggests good support for
the invariance of the factor variance–covariance matrix (ΔCFI =
.000, ΔTLI = .000). Other pairs of models in Table 5 that differ
only in relation to whether the factor variance–covariance matrix
is free or not also show good support for the invariance of the
factor variance–covariance matrix over time (also see related
test–retest correlations in Table 6).
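The nesting of these longitudinal models can be verified from the degrees of freedom reported in Table 5. The sketch below assumes that the df logic described for the gender analyses (60 intercept constraints minus five freed latent means, 60 uniqueness constraints, 15 variance–covariance constraints) carries over to the longitudinal models; the dictionary names are hypothetical.

```python
# Degrees of freedom reported in Table 5.
df = {
    "LIM2E": 6693, "LIM4E": 6708, "LIM5E": 6748,
    "LIM5Ep": 6737, "LIM7Ep": 6797,
}

n_items, n_factors, n_freed = 60, 5, 11

# Each entry pairs the observed change in df with the expected value.
checks = {
    "intercepts (LIM5E vs. LIM2E)":
        (df["LIM5E"] - df["LIM2E"], n_items - n_factors),
    "freed intercepts (LIM5E vs. LIM5Ep)":
        (df["LIM5E"] - df["LIM5Ep"], n_freed),
    "uniquenesses (LIM7Ep vs. LIM5Ep)":
        (df["LIM7Ep"] - df["LIM5Ep"], n_items),
    "variances-covariances (LIM4E vs. LIM2E)":
        (df["LIM4E"] - df["LIM2E"], n_factors * (n_factors + 1) // 2),
}

for label, (observed, expected) in checks.items():
    print(label, observed, expected)
```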
Finally, we are now in a position to address the issue of the
invariance of the latent factor means over time. Submodels
LIM10Ep–LIM13Ep each test the invariance of latent mean dif-
ferences in combination with the invariance of other parameter
estimates. Because there are only five latent mean differences, the
additional parsimony associated with these models is not substan-
tial in comparison with the corresponding models that do not
constrain latent mean differences to be invariant. In each case, the
fit of models positing no latent mean differences is at least mar-
ginally poorer than the corresponding models in which latent mean
differences are freely estimated: Differences in CFI (.005 to .006)
and TLI (.006 to .007) are based on comparisons of submodels
LIM10E and LIM5E, LIM11E and LIM7E, LIM12E and LIM8E,
and LIM13E and LIM9E. However, support for systematic differ-
ences in latent means is only marginal.
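The ranges quoted for these four comparisons can be reproduced from the CFI and TLI values in Table 5. The following sketch is a simple arithmetic check; the container names are hypothetical, and differences are rounded to the three decimals at which the indices are reported.

```python
# (CFI, TLI) pairs reported in Table 5.
fit = {
    "LIM5E": (0.899, 0.893), "LIM10E": (0.893, 0.887),
    "LIM7E": (0.895, 0.890), "LIM11E": (0.889, 0.883),
    "LIM8E": (0.899, 0.893), "LIM12E": (0.893, 0.887),
    "LIM9E": (0.894, 0.890), "LIM13E": (0.889, 0.883),
}

# (means free, means constrained to be equal over time)
pairs = [
    ("LIM5E", "LIM10E"), ("LIM7E", "LIM11E"),
    ("LIM8E", "LIM12E"), ("LIM9E", "LIM13E"),
]

delta_cfi = [round(fit[free][0] - fit[fixed][0], 3) for free, fixed in pairs]
delta_tli = [round(fit[free][1] - fit[fixed][1], 3) for free, fixed in pairs]

print(min(delta_cfi), max(delta_cfi))
print(min(delta_tli), max(delta_tli))
```

All four decrements fall below the conventional .01 cutoff, which is why the fit indices alone give only marginal support for latent mean differences.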
Because evaluation of latent means is a central, a priori feature
of these models, we present mean differences for each of the 28
models that result in mean differences (see Table 7) rather than
rely exclusively on indices of fit—especially given that the results
based on the fit indices do not seem conclusive. There is a
remarkably similar pattern to the mean differences. The set of 378
PSIs between all possible pairs of profiles varies from .993 to over
.999 (mean PSI = .998). There are, however, small but systematic
differences in the size of means based on complete and partial
invariance constraints. In each case the absolute value of mean
differences based on complete invariance models is slightly larger
than that based on partial invariance. Thus, for example, the
standardized mean values for Neuroticism decline about .23 over
time for models of complete invariance but only about .20 for
models with partial invariance. For Agreeableness, there is an
increase of about .30 for models of complete invariance but in-
creases of only about .26 for models of partial invariance. There
are smaller increases in Openness and Conscientiousness that are
also slightly larger for models with complete invariance. Only for
measures of Extraversion are the standardized mean differences
consistently close to zero (statistically nonsignificant).
The changes in these latent mean differences over time—
especially the decrease in Neuroticism and the increases in Agree-
ableness, Openness, and Conscientiousness—are consistent with
the maturity principle (Caspi et al., 2005) discussed earlier. Indeed,
Table 5
Summary of Goodness-of-Fit Statistics for All Longitudinal Invariance (IN) Models (Time 1/Time 2 Data)

Model and description                                                     χ2    df   CFI   TLI  NFParm  RMSEA
LIM1 (configural IN)
  LIM1: no IN (configural IN)                                         22,586  6535  .737  .712     845   .040
  LIM1A: LIM1 with 60 CWCUs                                           13,439  6475  .886  .874     905   .026
  LIM1B: LIM1 with 57 WWCUs (free)^a                                  19,608  6421  .784  .760     959   .036
  LIM1C: LIM1 with 57 WWCUs (IN)^a                                    19,689  6478  .783  .761     902   .036
  LIM1D: LIM1 with 60 CWCUs & 57 WWCUs (free)                         11,700  6361  .912  .902    1019   .023
  LIM1E: LIM1 with 60 CWCUs & 57 WWCUs (IN)                           11,775  6418  .912  .902     962   .023
LIM2 (FL; weak factorial/measurement IN)
  LIM2: IN = FL (weak factorial/measurement IN)                       23,310  6810  .729  .716     570   .039
  LIM2A: LIM2 with 60 CWCUs                                           14,031  6750  .881  .874     630   .026
  LIM2B: LIM2 with 57 WWCUs (free)^a                                  20,277  6696  .777  .763     684   .036
  LIM2C: LIM2 with 57 WWCUs (IN)^a                                    20,373  6753  .777  .764     627   .036
  LIM2D: LIM2 with 60 CWCUs & 57 WWCUs (free)                         12,269  6636  .908  .901     744   .023
  LIM2E: LIM2 with 60 CWCUs & 57 WWCUs (IN)                           12,363  6693  .907  .901     687   .023
LIM3 (FL & Uniq)
  LIM3: IN = FL, Uniq                                                 23,618  6870  .725  .715     510   .039
  LIM3A: LIM3 with 60 CWCUs                                           14,341  6810  .877  .871     570   .027
  LIM3B: LIM3 with 57 WWCUs (free)^a                                  20,544  6756  .774  .761     624   .036
  LIM3C: LIM3 with 57 WWCUs (IN)^a                                    20,707  6713  .772  .761     567   .036
  LIM3D: LIM3 with 60 CWCUs & 57 WWCUs (free)                         12,543  6696  .904  .898     684   .024
  LIM3E: LIM3 with 60 CWCUs & 57 WWCUs (IN)                           12,695  6753  .903  .897     627   .024
LIM4 (FL & FVCV)
  LIM4: IN = FL, FVCV                                                 23,351  6825  .729  .717     555   .039
  LIM4A: LIM4 with 60 CWCUs                                           14,069  6765  .880  .874     615   .026
  LIM4B: LIM4 with 57 WWCUs (free)^a                                  20,309  6711  .777  .763     684   .036
  LIM4C: LIM4 with 57 WWCUs (IN)^a                                    20,407  6768  .776  .764     612   .036
  LIM4D: LIM4 with 60 CWCUs & 57 WWCUs (free)                         12,298  6651  .907  .901     729   .023
  LIM4E: LIM4 with 60 CWCUs & 57 WWCUs (IN)                           12,393  6708  .907  .901     672   .023
LIM5 (FL & Inter; strong factorial/measurement IN)
  LIM5D: IN = FL, Inter, with 60 CWCUs & 57 WWCUs (free)              12,796  6691  .900  .893     689   .024
  LIM5E: LIM5D with 60 CWCUs & 57 WWCUs (IN)                          12,888  6748  .899  .893     632   .024
  LIM5Dp: LIM5D with Inter (P-IN), 60 CWCUs & 57 WWCUs (free)         12,524  6680  .904  .898     700   .024
  LIM5Ep: LIM5D with Inter (P-IN), with 60 CWCUs & 57 WWCUs (IN)      12,619  6737  .904  .898     643   .024
LIM6 (FL, FVCV, Uniq)
  LIM6D: IN = FL, FVCV, Uniq, with 60 CWCUs & 57 WWCUs (free)         12,578  6711  .904  .898     669   .024
  LIM6E: LIM6D with 60 CWCUs & 57 WWCUs (IN)                          12,729  6768  .901  .897     612   .024
LIM7 (FL, Uniq, Inter; strict factorial/measurement IN)
  LIM7D: IN = FL, Uniq, Inter, with 60 CWCUs & 57 WWCUs (free)        13,070  6751  .896  .890     629   .024
  LIM7E: LIM7D with 60 CWCUs & 57 WWCUs (IN)                          13,222  6808  .895  .890     572   .024
  LIM7Dp: LIM7D with Inter (P-IN), with 60 CWCUs & 57 WWCUs (free)    12,799  6740  .901  .895     640   .024
  LIM7Ep: LIM7D with Inter (P-IN), with 60 CWCUs & 57 WWCUs (IN)      12,950  6797  .899  .894     583   .024
LIM8 (FL, FVCV, Inter)
  LIM8D: IN = FL, FVCV, Inter, with 60 CWCUs & 57 WWCUs (free)        12,826  6706  .900  .893     674   .024
  LIM8E: LIM8D with 60 CWCUs & 57 WWCUs (IN)                          12,919  6763  .899  .893     617   .024
  LIM8Dp: LIM8D with Inter (P-IN), with 60 CWCUs & 57 WWCUs (free)    12,554  6695  .904  .898     685   .024
  LIM8Ep: LIM8D with Inter (P-IN), with 60 CWCUs & 57 WWCUs (IN)      12,649  6752  .903  .898     628   .024
LIM9 (FL, Uniq, FVCV, Inter)
  LIM9D: IN = FL, FVCV, Uniq, Inter, with 60 CWCUs & 57 WWCUs (free)  13,106  6766  .896  .890     614   .024
  LIM9E: LIM9D with 60 CWCUs & 57 WWCUs (IN)                          13,257  6823  .894  .890     557   .025
  LIM9Dp: LIM9D with Inter (P-IN), with 60 CWCUs & 57 WWCUs (free)    12,834  6755  .900  .895     625   .024
  LIM9Ep: LIM9D with Inter (P-IN), with 60 CWCUs & 57 WWCUs (IN)      12,985  6812  .899  .894     568   .024
LIM10 (FL, Inter, LFMn; latent mean IN)
  LIM10D: IN = FL, Inter, LFMn, with 60 CWCUs & 57 WWCUs (free)       13,166  6696  .894  .887     684   .025
  LIM10E: LIM10D with 60 CWCUs & 57 WWCUs (IN)                        13,258  6753  .893  .887     627   .025
given the relatively short interval between the two measures, it
might be surprising that the differences are as large as they are.
However, it is also important to note that these results are based on
responses by the same students in their final year of high school
and again several years later, a period during which changes in
maturity might be expected to be significant.
Discussion
Summary and Implications
The a priori Big Five factors are clearly identified by both
ESEM and ICM-CFA. The pattern and even the sizes of factor
loadings are similar for the two approaches. However, the ESEM
solution fits the data much better than does the ICM-CFA solution
and results in substantially less correlated factors (Mdn absolute
r = .06 vs. .20) that are consistent with Big Five theory.
Subsequent ESEM analyses support measurement invariance
over gender and over time—analyses that could not have been
done appropriately with traditional EFA approaches (or ICM-CFA
models that were not able to fit the data). The gender invariance
analysis showed that women scored higher on all five NEO-FFI
factors, whereas the analysis of test–retest data was supportive of
the maturity principle (Caspi et al., 2005). Although consistent
with previous research based on manifest variables, this is appar-
ently the first research to even pursue these issues in relation to
latent Big Five factors and appropriate tests of full measurement
and structural invariance in relation to a detailed taxonomy of
invariance models (e.g., see Table 1). This is critical in that
measurement invariance assumptions are prerequisite to making
valid mean comparisons—particularly the assumption of strong
measurement invariance with full or at least partial invariance of
item intercepts. Whereas we focused on mean differences across
gender and over time, strong measurement invariance require-
ments are equally relevant to all Big Five studies of mean differ-
ences for other groups or relations with other constructs. More
generally, we recommend that subsequent CFA studies routinely
consider ESEM solutions as a viable alternative, even when the fit
of CFA solutions is apparently acceptable.
Strengths, Limitations, and Directions for Further
Research
The size of factor correlations. Big Five factors are posited
to be relatively uncorrelated. This was a key issue in the McCrae
et al. (1996; also see Parker et al., 1993) criticism of CFA, because
they suggested that forcing an ICM-CFA structure would lead to
inflated correlations among the Big Five factors. Our results support
this contention. In an ICM-CFA solution, the relation between
a specific item and a nontarget factor that would be accounted for
by a cross-loading can be represented only through the factor
correlation between the two factors. If there are at least moderate
cross-loadings in the true population model and these are constrained
to be zero as in the ICM-CFA model, then estimated factor
correlations are likely to be inflated and the differences can be
substantial (e.g., .34 vs. .72; Marsh et al., 2009). This issue is also
relevant to research based on simple scale scores and EFA factor
scores. Correlations based on (a) ICM-CFA latent factors are likely
Table 5 (continued)

Model and description                                                            χ2    df   CFI   TLI  NFParm  RMSEA
  LIM10Dp: LIM10D with Inter (P-IN), with 60 CWCUs & 57 WWCUs (free)         12,765  6685  .900  .894     695   .024
  LIM10Ep: LIM10D with Inter (P-IN), 60 CWCUs & 57 WWCUs (IN)                12,859  6742  .900  .894     638   .024
LIM11 (FL, Uniq, Inter, LFMn; manifest mean IN)
  LIM11D: IN = FL, Uniq, Inter, LFMn, with 60 CWCUs & 57 WWCUs (free)        13,440  6756  .890  .884     624   .025
  LIM11E: LIM11D with 60 CWCUs & 57 WWCUs (IN)                               13,593  6813  .889  .883     567   .025
  LIM11Dp: LIM11D with Inter (P-IN), 60 CWCUs & 57 WWCUs (free)              13,039  6745  .897  .891     635   .024
  LIM11Ep: LIM11D with Inter (P-IN), 60 CWCUs & 57 WWCUs (IN)                13,191  6802  .895  .890     578   .024
LIM12 (FL, FVCV, Inter, LFMn)
  LIM12D: IN = FL, FVCV, Inter, LFMn, with 60 CWCUs & 57 WWCUs (free)        13,196  6711  .894  .887     669   .025
  LIM12E: LIM12D with 60 CWCUs & 57 WWCUs (IN)                               13,289  6768  .893  .887     612   .025
  LIM12Dp: LIM12D with Inter (P-IN), 60 CWCUs & 57 WWCUs (free)              12,794  6700  .900  .894     680   .024
  LIM12Ep: LIM12D with Inter (P-IN), 60 CWCUs & 57 WWCUs (IN)                12,889  6757  .899  .894     623   .025
LIM13 (FL, Uniq, FVCV, Inter, LFMn; complete factorial IN)
  LIM13D: IN = FL, Uniq, FVCV, Inter, LFMn, with 60 CWCUs & 57 WWCUs (free)  13,476  6771  .890  .884     609   .025
  LIM13E: LIM13D with 60 CWCUs & 57 WWCUs (IN)                               13,628  6828  .889  .883     552   .025
  LIM13Dp: LIM13D with Inter (P-IN), 60 CWCUs & 57 WWCUs (free)              13,074  6817  .896  .891     620   .024
  LIM13Ep: LIM13D with Inter (P-IN), 60 CWCUs & 57 WWCUs (IN)                13,226  6817  .895  .890     563   .024

Note. For multiple-group IN models, IN refers to the sets of parameters constrained to be invariant across the multiple groups. The p in model names (e.g.,
LIM5Dp) indicates partial IN (P-IN). CFI = comparative fit index; TLI = Tucker–Lewis index; NFParm = number of free parameters; RMSEA =
root-mean-square error of approximation; LIM = longitudinal IN model; CWCUs = cross-wave correlated uniquenesses (CUs); WWCUs = within-wave
CUs; FL = factor loadings; Uniq = item uniquenesses; FVCV = factor variances–covariances; Inter = item intercepts; LFMn = latent factor means.
^a Model results in improper solutions and should be interpreted cautiously (or ignored).