The Thorny Relation between Measurement Quality and Fit Index Cut-Offs in Latent Variable Models
Daniel MCNEISH, Ji AN, & Gregory R. HANCOCK
University of Maryland, College Park
Correspondence for this manuscript should be sent to the first author, Daniel McNeish, who is
now at the University of North Carolina, Chapel Hill, 100 E. Franklin Street Suite 200, Chapel
Hill, NC 27599, USA. (Email:
Abstract

Latent variable modeling is a popular and flexible statistical framework. Concomitant with fitting
latent variable models is assessment of how well the theoretical model fits the observed data.
Although firm cut-offs for these fit indices are often cited, recent statistical proofs and simulations
have shown that these fit indices are highly susceptible to measurement quality. For instance, an
RMSEA value of 0.06 (conventionally thought to indicate good fit) can actually indicate poor fit
with poor measurement quality (e.g., standardized factors loadings of around 0.40). Conversely,
an RMSEA value of 0.20 (conventionally thought to indicate very poor fit) can indicate acceptable
fit with very high measurement quality (standardized factor loadings around 0.90). Despite the
wide-ranging effect on applications of latent variable models, the high level of technical detail
involved with this phenomenon has curtailed the exposure of these important findings to empirical
researchers who are employing these methods. This paper briefly reviews these methodological
studies in minimal technical detail and provides a demonstration to easily quantify the large
influence measurement quality has on fit index values and how greatly the cut-offs would change
if they were derived under an alternative level of measurement quality. Recommendations for best
practice are also discussed.
The Thorny Relation between Measurement Quality and Fit Index Cut-Offs in Latent
Variable Models
Latent variable models have burgeoned in popularity and have become a fixture in the toolkits
of psychologists analyzing empirical data. As a critical step in the evaluation of latent variable
models, researchers routinely investigate how well their theoretical model fits to the observed
data. Such assessments of data-model fit are crucial in the appraisal of psychological theories:
favorable data-model fit can lend support to a theory while poor data-model fit can call a theory
into question. Despite the substantial role data-model fit plays in the assessment of theories tested
via latent variable models, the appropriate manner to assess data-model fit has long been a
controversial issue in the methodological literature. Lively debates have erupted over whether the
minimum fit function chi-square test is the only true measure of fit or whether its tendency to
become highly powered with larger samples may necessitate the use of descriptive, approximate
goodness of fit indices (AFIs; see Antonakis, Bendahan, Jacquart, & Lalive, 2010; Barrett, 2007;
Bentler, 2007; Browne, MacCallum, Kim, Andersen, & Glaser, 2002; Chen, Curran, Bollen,
Kirby, & Paxton 2008; Credé & Harms, 2015; Hayduk, Cummings, Boadu, Pazderka-Robinson, &
Boulianne, 2007; Hayduk & Glaser, 2000; Hayduk, 2014; McIntosh, 2007; Miles & Shevlin,
2007; Mulaik, 2007; Steiger, 2007; Tomarken & Waller, 2003). If one concludes that AFIs are
appropriate, because they are largely descriptive measures, the absence of traditional inferential
properties (i.e., p-values) immediately raises the question of which values are indicative of
acceptable data-model fit. Throughout the 1980s and 1990s, there was little consensus about
which values of which AFIs could be considered reflective of a well-fitting model and researchers
often relied on experience, intuition, or subjective criteria to varying degrees (Marsh, Hau, &
Wen, 2004).
Hu and Bentler (1999; hereafter referred to as HB) attempted to address this considerable
issue by conducting a simulation study with an impressive breadth of conditions, ultimately
yielding empirically-based recommendations for values that are indicative of acceptable data-
model fit. That is, HB determined the values that maximally discerned models known to be incorrect
from those known to be correct, recommending a cut-off value for the standardized root mean
square residual (SRMR; Jöreskog & Sörbom, 1981) less than or equal to 0.08, a root mean square
error of approximation (RMSEA; Steiger & Lind, 1980) value less than or equal to 0.06, and a
comparative fit index (CFI; Bentler, 1990) value greater than or equal to 0.95, among other
indices. These recommended cut-off values have achieved near canonical status (as evidenced in
part by the work’s 35,000+ citations on Google Scholar) and nearly any researcher working with
latent variable models is likely familiar with these cut-off values.
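Operationally, these recommendations reduce to a handful of threshold comparisons. The following Python sketch (our own illustration; the function name and example index values are hypothetical, not from any published model) checks a model's reported indices against the HB cut-offs:

```python
def meets_hb_cutoffs(srmr, rmsea, cfi):
    """Check reported fit indices against the Hu & Bentler (1999) cut-offs:
    SRMR <= .08, RMSEA <= .06, CFI >= .95."""
    return {
        "SRMR": srmr <= 0.08,
        "RMSEA": rmsea <= 0.06,
        "CFI": cfi >= 0.95,
    }

# Hypothetical fitted model whose indices clear all three cut-offs:
print(meets_hb_cutoffs(srmr=0.04, rmsea=0.04, cfi=0.975))
```

As the rest of this paper argues, such a check treats the cut-offs as fixed properties of the indices, which is precisely the assumption that measurement quality undermines.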
Despite several cautions by HB themselves advising against over-generalizing their findings
to conditions and models outside of what was contained in the simulation, many applied
researchers, textbook authors, journal editors, and reviewers have endorsed the HB criteria as
applicable to latent variable models broadly (Marsh et al., 2004; Jackson, Gillaspy, & Purc-
Stephenson, 2009). For instance, in a review of reporting practices for confirmatory factor analysis
(CFA) in general applied psychology studies, Jackson et al. (2009) found that almost 60% of
studies explicitly used or referenced the HB recommendations to judge the fit of the model.
Additionally, Jackson et al. (2009) showed implicit evidence of the omnipresent nature of the cut-
offs: the average fit index values across over 350 published psychology studies were 0.060,
0.062, and 0.933 for the SRMR, RMSEA, and CFI, respectively, demonstrating that these cut-offs
have essentially become the hurdle which empirical researchers must clear in order to publish their
findings. To foreshadow the motivation for the current study, Jackson et al. (2009) stated, “we also
did not find evidence that warnings about strict adherence to Hu and Bentler’s suggestions were
being heeded” (p. 18). One such warning that is particularly important is the role of measurement
quality, which will be the focus of the remainder of this paper.
Hypothetical Scenario
Before delving into the specifics of latent variable data-model fit indices, we will advance a
hypothetical example to set the stage for our treatment of the role of measurement quality on fit
index cut-off values. Imagine that two studies are conducted, one testing Model A and the other
testing Model B. These models have the same latent structure and number of indicator variables,
are based upon the same sample size, and the data in each case meet all standard distributional
assumptions. Using commonly reported AFIs from Jackson et al. (2009), suppose Model A’s AFI
values surpass HB’s recommendations such that the RMSEA is 0.04, SRMR is 0.04, and CFI is
0.975, whereas Model B has an RMSEA of 0.20, an SRMR of 0.14, and a CFI of 0.775. Based on
this information and knowing that the model type, model complexity, adherence to assumptions,
and sample sizes are equivalent, we presume that many researchers would instinctively assess the
fit of Model A to be superior to Model B and, if Model B were under consideration at a
prestigious journal or conference, that the theory underlying it may be subjected to a steady stream
of criticism or it may be dismissed outright based upon the seemingly egregious data-model fit.
However, in fairly routine scenarios found in psychology, there are circumstances in which
the AFI values such as those from Model B can not only be indicative of adequate data-model fit
in an absolute sense, but can better distinguish between well-fitting and poorly fitting models than
the fit index values reported for Model A, despite the apparently stable nature of AFI cut-off
values which are applied indiscriminately across studies. We ask readers to retain this motivating
hypothetical example in mind as we continue below, and we will return to this hypothetical
scenario in the concluding paragraphs of this paper after we review the literature and report results
from our illustrative simulation.
The Origin and Issues of the Hu and Bentler Cut-Offs
Although many empirical researchers treat the HB cut-offs more or less as firm rules, these
cut-offs were derived via a Monte Carlo simulation study rather than mathematically. With such
an approach, the results are necessarily constrained to the conditions featured in the simulation
study and are not broadly generalizable to all types of models and data types.
Although the conditions within the seminal HB study were expansive (the original paper spanned
50 journal pages upon publication), one could not realistically expect HB to cover all possible
conditions that may arise in empirical studies. As such, several studies have spoken to some of the
problems of over-generalizing the HB recommendations and also possible shortcomings in the HB
simulation design. These criticisms range from how realistic HB's induced misspecifications
were (Marsh et al., 2004) to possible confounding of the model types that could have
differentially affected performance of different AFIs (Fan & Sivo, 2005).
Another major criticism that will be the focus of this study is the quality of measurement.
Although HB manipulated many conditions when deriving their cut-off values, the strength of the
standardized factor loadings was kept constant throughout their entire study. That is, HB tested
many different sample sizes, degrees of deviation from normality, and model types; however, all
of these conditions were tested with factor loadings that were always near 0.70 (most loadings were
0.70 in their models, a few were either 0.75 or 0.80 but the loadings were not systematically
manipulated in the study). Recalling that Monte Carlo derived values are only applicable to
conditions included in the study, the absence of multiple measurement quality conditions
undoubtedly limits the generalizability of the HB cut-offs because measurement quality is quite
variable not only from discipline to discipline, but also from study to study within a single
discipline. (In the methodological literature, measurement quality does not necessarily have a strict
definition and can be used to refer to validity, reliability, or generalizability of a particular scale. In
this paper, we use “measurement quality” to refer to the strength of the standardized factor
loadings, which is highly related to reliability.) Studies with standardized loadings that exceed 0.70 are commonly (but not
exclusively) found when measuring more concrete constructs such as cognitive abilities,
reasoning abilities, or attitudes (for recent empirical examples in the Journal of Personality
Assessment, see e.g., Bardeen, Fergus, Hannan, & Orcutt, 2016; Jashanloo, 2015; Rice,
Richardson, & Tueller, 2014; You, Leung, Lai, & Fu, 2013). Studies aiming to capture less well-
defined constructs such as creativity, risky or substance use behaviors, or abilities in very young
children tend to (but do not exclusively) have standardized loadings below 0.70 (for recent
empirical examples in the Journal of Personality Assessment, see e.g., Allan, Lonigan, & Phillips,
2015; Demianczyk, Jenkins, Henson, & Conner, 2014; Fergus, Valentiner, McGrath, Gier-
Lonsway, & Kim, 2012; Ion et al., 2016; Michel, Pace, Edun, Sawhney, & Thomas, 2014). In
practice, researchers can reasonably expect to see standardized factor loadings with magnitudes
between about 0.40 and 0.90 both in their own work and when reading the work of others. Despite
the considerable difference between a variable that loads at 0.40 on a latent variable and one that
loads at 0.90, the original HB study did not differentiate between these situations.
Issues of Measurement Quality and the Reliability Paradox
The lack of multiple measurement quality conditions in HB has been questioned over the
last decade, with recent studies noting how such an oversight can greatly limit the broad
validity of fit index cut-offs in latent variable models (Beauducel & Wittman, 2005; Cole &
Preacher, 2014; Hancock & Mueller, 2011; Heene, Hilbert, Draxler, Ziegler, & Bühner, 2011;
Kang, McNeish, & Hancock, 2016; Miles & Shevlin, 2007; Saris, Satorra, & van der Veld, 2009;
Savalei, 2012; Steiger, 2000). While some studies have merely noted the issues with such an
omission, other studies have gone so far as to mathematically prove that measurement quality
directly affects the values of AFIs. We will review the findings from each of these studies next.
For a given set of misspecifications in a latent variable model, holding all else equal,
models with poor measurement quality appear to fit much better than models with excellent
measurement quality. This phenomenon was first noted in a study investigating properties of
RMSEA by Saris and Satorra (1992) but has been developed further over the last few years.
Hancock and Mueller (2011) coined the phrase reliability paradox to describe this relationship.
The paradoxical nature of the phenomenon is evoked by the fact that researchers often strive for
the highest measurement quality possible for their latent variables, but, once obtained, AFIs will
be far worse than if measurement quality were much poorer. Using a population study, Hancock
and Mueller (2011) systematically showed how, with one hypothetical model, evaluations of data-
model fit slowly deteriorate as a function of measurement quality, even when all other model and
design factors are held constant. In their study, they kept the degree of misspecification, sample
size, and the model identical and only changed the magnitude of the standardized factor loadings
from 0.40 to 0.95. For example, in their hypothetical model, the RMSEA value with standardized
loadings of 0.40 was 0.00 while the RMSEA value with standardized loadings of 0.95 was 0.10.
Hancock and Mueller (2011) further showed that standard error estimates of structural parameters
are much larger with poorer measurement quality and that Lagrange multiplier test statistics (more
commonly known as modification indices) similarly lose their effectiveness to flag paths that
should be introduced into the model to improve its fit when measurement quality is poorer.
Hancock and Mueller (2011) concluded that the nature of the AFI cut-offs is in direct
contrast to best data analytic practice: poor measurement quality is rewarded while good
measurement quality is punished.
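The population version of this phenomenon can be reproduced numerically. The sketch below (our own simplified construction in Python, not the exact model Hancock and Mueller analyzed) fits a two-factor CFA that wrongly fixes the factor covariance to zero against a population covariance matrix, then converts the minimized ML discrepancy into a population RMSEA. The true factor correlation of 0.3 and the three-indicators-per-factor layout are assumed purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def implied_sigma(lam, theta, phi):
    """Model-implied covariance for a 2-factor CFA with 3 indicators per factor."""
    L = np.zeros((6, 2))
    L[:3, 0] = lam[:3]
    L[3:, 1] = lam[3:]
    Phi = np.array([[1.0, phi], [phi, 1.0]])
    return L @ Phi @ L.T + np.diag(theta)

def fml(sigma_model, sigma_pop):
    """Maximum likelihood discrepancy between model-implied and population matrices."""
    p = sigma_pop.shape[0]
    _, logdet_m = np.linalg.slogdet(sigma_model)
    _, logdet_p = np.linalg.slogdet(sigma_pop)
    return logdet_m - logdet_p + np.trace(sigma_pop @ np.linalg.inv(sigma_model)) - p

def population_rmsea(loading, phi_true=0.3):
    """Population RMSEA of a model that wrongly fixes the factor covariance to zero."""
    lam_true = np.full(6, loading)
    theta_true = 1.0 - lam_true**2              # standardized indicators
    sigma_pop = implied_sigma(lam_true, theta_true, phi_true)

    def objective(x):                           # 6 loadings + 6 log residual variances
        return fml(implied_sigma(x[:6], np.exp(x[6:]), 0.0), sigma_pop)

    start = np.concatenate([lam_true, np.log(theta_true)])
    f0 = minimize(objective, start, method="Nelder-Mead",
                  options={"maxiter": 20000, "maxfev": 20000}).fun
    df = 9                                      # 21 unique moments - 12 free parameters
    return float(np.sqrt(max(f0, 0.0) / df))

# Identical misspecification, different measurement quality:
print(population_rmsea(0.4))   # small: the omitted covariance is barely visible
print(population_rmsea(0.9))   # much larger: the same omission now looks severe
```

The omitted factor covariance is the same in both runs; only the loadings change, yet the population RMSEA is substantially worse under high measurement quality.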
Heene et al. (2011) had a similar motivation to Hancock and Mueller (2011) but conducted
their study via a simulation and mathematically derived the direct effect of measurement quality
on AFIs. Heene et al. (2011) used HB to inspire their simulation conditions, although they did
make several alterations including (1) using different magnitudes of factor loadings for each item
on each factor (i.e., a congeneric scale) rather than having the loadings be equal (i.e., a tau-
equivalent scale), (2) the number of manifest variables per factor was changed to 15 and 45 which
more closely represents a scale or instrument rather than the much smaller factors used by HB, and
(3) different standardized factor loading conditions were used with high (factor loadings near
0.80), medium (factor loadings near 0.60), and low (factor loadings near 0.40) factor reliability
conditions. Similar to Hancock and Mueller (2011), results showed that RMSEA, SRMR, CFI,
and TML values in misspecified models were seen as fitting the data well under the low factor
reliability condition whereas the high factor reliability condition showed RMSEA, SRMR, CFI,
and TML values that would routinely call for rejecting the model. When the model was perfectly
specified (i.e., the model used to generate data and the model used to fit the data were identical),
factor reliability was inconsequential and the RMSEA, SRMR, CFI, and TML values were indicative of
good fit regardless of the factor loading condition (however, measurement quality was only
inconsequential if the model were perfectly specified which is highly unlikely in empirical
studies). Noting that simulations are not broadly generalizable, Heene et al. (2011) went on to
explain via a mathematical derivation why the reliability paradox exists. Although the proof is
somewhat in-depth, the rationale stems from the fact that the eigenvalues of the model-implied
covariance matrix are bounded below by the residual variances, which are a function of the factor
loadings: larger factor loadings lead to lower residual variances. That is, if the factor loadings are
high, the latent variable explains a larger amount of variance in the manifest variable and the
associated residual is low as a result. As the factor loadings decrease (and residual variances
increase), the lower bound of the eigenvalues decreases, which means that TML and AFI values
decrease as well (provided that the model is not perfectly specified and has at least some trivial
misspecification), making models appear to fit relatively better if cut-off criteria are held constant.
The lower bound of the model fit criteria changes as a function of measurement quality but the
cut-off for good fit is constant, meaning that the relative distance between the lower bound and the
cut-off is not constant and that studies with poor measurement quality therefore have an easier
path to reaching a conclusion of acceptable data-model fit.
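The eigenvalue argument can be verified directly in a simplified single-factor case: for standardized indicators that all load at λ, the implied correlation matrix λλᵀ + (1 − λ²)I has smallest eigenvalue 1 − λ², the residual variance. A short numpy check (our own illustration, not Heene et al.'s exact models):

```python
import numpy as np

def implied_corr(loading, p=5):
    """Implied correlation matrix of a standardized one-factor model
    with p indicators that all load at `loading`."""
    lam = np.full((p, 1), loading)
    return lam @ lam.T + (1 - loading**2) * np.eye(p)

for lam in (0.4, 0.9):
    smallest = np.linalg.eigvalsh(implied_corr(lam)).min()
    print(f"loading {lam}: smallest eigenvalue = {smallest:.2f}, "
          f"residual variance = {1 - lam**2:.2f}")
```

With loadings of 0.40 the eigenvalues are bounded below by 0.84, but with loadings of 0.90 the bound drops to 0.19, which is why the same misspecification registers as far worse fit under high measurement quality.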
Miles and Shevlin (2007) conducted an illustrative population analysis on incremental fit
indices to show how they are minimally affected by model conditions, including the magnitude of
the factor loadings. Incremental fit indices are those that compare the improvement in fit of the
model of interest to some baseline model (usually an independence model that simply models the
variance and mean for each individual manifest variable but does not allow the manifest variables
to covary). Miles and Shevlin (2007) compared three models: one with perfectly reliable manifest
indicator variables, one with manifest indicators with 0.80 reliability, and one with manifest
indicator variables with 0.50 reliability. Results showed that CFI, the Tucker-Lewis Index, and the
incremental fit index were able to demonstrate good data-model fit with a trivially misspecified
model (i.e., one that should fail to be rejected for practical purposes) with high reliabilities (i.e.,
factor loadings with strong magnitudes) despite the fact that both RMSEA and TML would
handily reject the model. On these grounds, Miles and Shevlin (2007) advocate for wider use of
incremental fit indices, which are less affected by the reliability paradox because the effect
of measurement quality is partially included in the baseline model as well as the model of interest.
Incremental indices are still not immune, however, and Miles and Shevlin (2007) list their primary
conclusion, somewhat tongue-in-cheek, as “If you wish your model to fit, … ensure that your
measures are unreliable” (p. 874).
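The incremental logic can be made concrete with the standard CFI formula, which compares the noncentrality (χ² − df) of the fitted model to that of the baseline model; because low reliability shrinks both noncentralities, their ratio is less distorted than an absolute index. A sketch with hypothetical chi-square values:

```python
def cfi(chisq_model, df_model, chisq_baseline, df_baseline):
    """Comparative fit index (Bentler, 1990): 1 minus the ratio of the fitted
    model's noncentrality to the baseline model's noncentrality."""
    nc_model = max(chisq_model - df_model, 0)
    nc_baseline = max(chisq_baseline - df_baseline, 0)
    if nc_baseline == 0:
        return 1.0
    return 1 - min(nc_model / nc_baseline, 1.0)

# Hypothetical chi-square values for a fitted model and its independence baseline:
print(cfi(chisq_model=150, df_model=87, chisq_baseline=2000, df_baseline=105))
```

Because measurement quality inflates or deflates both chi-squares in the ratio, CFI partially cancels the effect that RMSEA and TML absorb in full, which is the mechanism behind Miles and Shevlin's recommendation.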
Saris et al. (2009) conducted simulations and provided population analyses to explore
whether TML and AFIs are actually detecting misspecifications in the model or whether they are
instead more sensitive to attributes of the model that are unrelated to misspecifications. Their
analyses showed that the magnitude of misspecification was just as related to incidental conditions
such as sample size and measurement quality as to actual misspecifications. Related to
measurement quality, their results showed that, for a constant degree of misspecification, as
standardized factor loadings were increased from 0.70 (as in HB) to 0.90, the RMSEA went from
demonstrating great fit (0.00) to very poor fit (0.14) and crossing the HB cut-off value at a
standardized loading value of 0.85 (see Table 3 in their article for complete results). Saris et al.
(2009) make a case for abandoning global fit measures (i.e., TML and AFIs) and replacing
assessment of latent variable models with single-parameter tests such as expected parameter
change or modification indices. Saris et al. (2009) also note that all possible misspecifications are
not theoretically equivalent and that the goal of data-model fit assessments should not be strictly
concerned with detecting any type of misspecification (which is the case when testing the fit of the
model globally) but rather data-model fit assessment should focus on detecting theoretical
meaningful misspecifications.
Kang et al. (2016) extended Hancock and Mueller (2011) to the context of multiple-group
latent variables models where a primary interest is determining whether parameter estimates are
invariant across groups (e.g., whether items function similarly across different groups of people).
These invariance tests typically feature difference in fit index tests (e.g., ΔTML, ΔCFI; see Cheung
& Rensvold, 2002 for additional details) so the goal of the study was to examine whether
measurement quality similarly affects differences in fit indices or whether the reliability paradox is
confined to AFIs in their raw form. Kang et al. (2016) found that, for the purpose of testing either
measurement invariance (similarity of loadings across groups) or structural invariance (similarity of
structural paths across groups), only ΔMcDonald’s Non-Centrality Index was reasonably
unaffected by measurement quality and only for measurement invariance. For all other conditions,
as the measurement quality increased, the indices were much more likely to find non-invariance
while conditions with poor measurement quality (i.e., loadings around 0.40) tended to conclude that invariance held.
Bridging the Gap to Empirical Researchers
These methodological studies have noted the existence of this phenomenon but, due to a
strong methodological focus, previous studies emphasize why it occurs rather than how it affects
cut-off values that a majority of researchers are using (or at least referencing as guidelines and/or
being subjected to through peer review). That is, the tangible implications of broadly applying the
HB cut-offs have yet to be demonstrated to a primarily non-statistical audience, despite the fact that
non-statisticians and their substantive theories are the most widely affected.
The goal of this paper is not to extend the methodological conclusions of the so-called
reliability paradox to novel situations or to provide additional insight about the mechanism by
which it functions. As discussed in the preceding section, there are already several rigorous
methodological studies that achieve this goal. More importantly, despite the wide-ranging
implications of these methodological studies and the potentially serious implications for the
evaluation of applied research, these findings are largely confined to the pages of technical
journals and are examined from a more theoretical perspective. Instead, our goal is to elucidate
these findings to as broad an audience as possible by stripping the technical language and detail to
demonstrate the magnitude of the practical implications as plainly as possible. Thus, this paper is
not attempting to pass these ideas off as original but rather to illuminate highly relevant
methodological considerations that have yet to find their way into discussions of empirical studies.
To accomplish this goal, we next provide an illustrative simulation to show (1) how the behavior
of the fit index cut-offs varies as a function of measurement quality and (2) how different the
widely used cut-off values would be had HB used even a slightly different measurement quality
condition in their study. We will then discuss the effect this has when interpreting models
in empirical studies, ways that researchers can report AFIs to acknowledge this issue, and how it
may affect what journal editors and reviewers deem worthy of publication in top-tier outlets.
Illustrative Simulation Design
Although example analyses from real datasets are often the preferred method to demonstrate
methodological issues to non-statisticians, the nature of the problem at hand does not lend itself to
be examined in such a manner. That is, to fully grasp the severity of the issue, all components of
the data and associated model (sample size, number of latent variables, number of indicators per
latent variable, severity of misspecification) must be held constant with the exception of the
magnitude of the standardized factor loadings. To avoid possible confounds, the extent of the
misspecifications that are present in the model must be known to ensure that models only differ in
the standardized factor loadings, which cannot be discerned with real data. Therefore, we will
generate our own data that satisfy these requirements with a small illustrative simulation. We
realize that not all readers may be familiar with interpreting simulation studies, so we will provide
guidance throughout this section to facilitate proper interpretation of this demonstration.
In order to elucidate the effect of measurement quality on AFI cut-offs, we begin with the
original model used in HB. To briefly overview HB’s original conditions, their “simple” true
model was a CFA model with three covarying exogenous latent variables, each with five manifest
indicators that had factor loadings mostly equal to 0.70 with a few loadings of 0.75 or 0.80. The
path diagram of the data generation model is presented in Figure 1. The degrees of
misspecification included a “minor” condition such that one factor covariance path was omitted
and a “severe” condition which omitted two factor covariance paths from the model. Samples of
different sizes (N = 150, 250, 500, 1000, 2500, 5000) were drawn from seven conditions that
differed in terms of normality and independence. We only replicate the simple model from HB
under multivariate normality and only for HB’s “minor misspecification” condition.
In each cell of the simulation design, 1000 datasets were generated according to the HB
simple model as presented in Figure 1. We then fit the HB model containing a “minor”
misspecification, which purposefully omits the factor covariance path between Factor 1 and Factor
2. The model is therefore not correct, and the TML statistic and AFI values should detect that this
model is misspecified. Addressing the primary interest of the paper, the effect of varying
degrees of measurement quality on AFI cut-offs, we generated data with factor loadings
that were equivalent across all 15 indicator variables. Population values for the standardized factor
loadings were manipulated to range from 0.40 to 0.90, in 0.10 increments (unlike HB’s
standardized loading conditions which were constrained near 0.70). Data were generated and
modeled within PROC CALIS in SAS 9.3. During the process, we tracked RMSEA, SRMR, CFI,
and TML because, as mentioned previously, these indices are widely reported in empirical studies
(Jackson et al., 2009).
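The data-generation step can be sketched as follows (the authors used SAS 9.3 PROC CALIS; this Python translation is our own, and the factor correlations shown are assumed placeholder values rather than HB's exact ones). The population covariance is Σ = ΛΦΛᵀ + Θ, with residual variances 1 − λ² so that the indicators are standardized:

```python
import numpy as np

def generate_dataset(loading, n=500, seed=None):
    """Draw one sample from a 3-factor CFA with 5 indicators per factor,
    all standardized loadings equal to `loading`. The factor correlations
    (0.5, 0.4, 0.3) are assumed here for illustration only."""
    rng = np.random.default_rng(seed)
    Lambda = np.kron(np.eye(3), np.full((5, 1), loading))   # 15 x 3 loading matrix
    Phi = np.array([[1.0, 0.5, 0.4],
                    [0.5, 1.0, 0.3],
                    [0.4, 0.3, 1.0]])
    Theta = (1 - loading**2) * np.eye(15)                   # residual variances
    Sigma = Lambda @ Phi @ Lambda.T + Theta                 # population covariance
    return rng.multivariate_normal(np.zeros(15), Sigma, size=n)

data = generate_dataset(loading=0.7, n=500, seed=1)
print(data.shape)  # one replication: 500 cases, 15 indicators
```

Repeating this draw 1000 times per loading condition, and fitting the misspecified model to each dataset, yields the distributions of fit index values summarized below.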
For readers who are not familiar with simulation studies, this paragraph will conceptually
describe the logic of the simulation. Readers familiar with interpreting simulation results can skip
the remainder of this section without loss of continuity. The advantage of the simulation design is
that we generated the data according to a certain model, so we are able to control the magnitude of
model misspecification, the magnitude of the standardized factor loadings, and we can determine
if a model fit to the generated data is correct. This luxury is not available when using real data
because one cannot be certain of the “correct” model for the data (i.e., whether the model-implied
covariance matrix perfectly reproduces the observed covariance matrix) or be certain of the level
of misspecification. Furthermore, the measurement quality in real data cannot be manipulated. We
start by generating data from the model in Figure 1 that has standardized factor loadings equal to
0.40. From this data, we then fit a misspecified model that should fit somewhat poorly. We then
record the model fit criteria for the model. This process is repeated with 1000 unique generated
datasets so that we have adequate information to inspect the distribution of the model fit criteria
(we repeat the process instead of generating a single dataset to avoid succumbing to any
idiosyncratic nuances that could occur based on random chance). We then repeat the process with
data generated from standardized factor loadings equal to 0.50, 0.60, and so forth until 0.90 (with
1000 different generated datasets for each standardized loading value). We then compare the
distributions for each of the standardized factor loading conditions to show how the values of the
fit indices values change as the values of the standardized factor loadings change even though we
know that the degree of misspecification is exactly the same across the entire simulation (because
we have control over how the data are created).
For each of the models, we calculate the percentage of the replications in which the SRMR,
RMSEA, and CFI values for the fitted model exceed a particular cut-off (i.e., a conclusion of poor
fit) for each standardized loading magnitude condition, similar to HB. For each index, we explore
the percentage of models that would be declared poorly fitting based on the HB cut-off
recommendations: 0.06 for RMSEA, 0.08 for SRMR, and 0.95 for CFI. Additionally, we
investigate how many models would be declared poorly fitting if values of each index
conventionally thought to indicate unambiguously good fit were used as the cut-offs (0.04 for
RMSEA, 0.04 for SRMR, and 0.97 for CFI), and likewise for values conventionally thought to indicate
unambiguously poor data-model fit (0.20 for RMSEA, 0.14 for SRMR, and 0.775 for CFI).
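The tallying itself is simple: for each condition, count the proportion of replications whose index falls on the poor-fit side of a given cut-off. A minimal sketch with made-up index values:

```python
import numpy as np

def percent_rejected(values, cutoff, higher_is_worse=True):
    """Percentage of replications declared poorly fitting: for RMSEA/SRMR a
    value above the cutoff signals poor fit; for CFI a value below it does."""
    values = np.asarray(values)
    flags = values > cutoff if higher_is_worse else values < cutoff
    return 100 * flags.mean()

# Made-up RMSEA values from five replications, judged against the HB cut-off:
rmseas = [0.051, 0.063, 0.070, 0.058, 0.066]
print(percent_rejected(rmseas, 0.06))
# Made-up CFI values, where values below the cut-off signal poor fit:
print(percent_rejected([0.96, 0.93, 0.94], 0.95, higher_is_worse=False))
```

Each cell of Table 1 is the result of this calculation applied to the 1000 replications of the corresponding loading condition.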
researchers who adhere to the principle that TML is the only philosophically defensible assessment
of data-model fit, we also tracked the number of replications in which TML would reject the null at
the 0.05 and 0.01 levels of significance. We only report results for n = 500 in the interest of
succinctness of presentation, although similar patterns of results hold for other sample sizes as well.
Misspecified Model
Table 1 presents the percentage of models declared poorly fitting for the misspecified model.
In Table 1, each row represents a different standardized loading condition. Each column represents
the percentage of the time in which the particular fit criteria reported that the model did not fit the
data well. In Table 1, the values are expected to be rather high as the misspecification should be
detected a large portion of the time. The 0.70 loading row is bolded to denote the condition that
mirrors conditions used in HB.
From Table 1, it can be seen that AFI values were above the “unambiguously good” cut-off
(the leftmost column for each index) 100% of the time when the standardized loadings were 0.70
(values closer to 100% mean that AFIs identify that the model did not fit well). As an example, the
100% value in the 0.04 column and 0.70 row for RMSEA means that 100% of the models fit to the
simulated data returned RMSEA values greater than 0.04. TML also rejected 100% of models at
both the 0.05 and 0.01 levels when the loadings were 0.70. Using the HB recommendations (the
middle column for each index), the misspecified model always had AFI values beyond the cut-off
for RMSEA and CFI while SRMR was beyond the cut-off about half the time (SRMR is lower
because it tends to be less sensitive to the type of misspecification used and the misspecification
was not severe; Fan & Sivo, 2007). This confirms that the cut-offs recommended by HB indeed
perform well when the standardized loadings are 0.70, as was demonstrated in their study.
Essentially none of the fitted models had an RMSEA above 0.20, an SRMR above 0.14, or a CFI
below 0.775 when the loadings were 0.70 meaning that these values indicate excessively poor fit
as expected because, although the models are noticeably misspecified, the misspecification is
moderate and AFIs therefore do not reach such seemingly extreme values. Despite the fact that
these general guidelines have been ported to SEM broadly, this interpretation of fit index values is
only applicable when the standardized loadings are 0.70.
Consider the exact same model, featuring the exact same misspecification, with the exact
same sample size but now the standardized loadings are 0.40 instead of 0.70. Now, close to 95%
of the replications have an RMSEA less than 0.04 and all the models have an RMSEA below 0.06
which indicates good fit based on HB cut-offs. Consider what this means: recall that with
standardized loadings equal to 0.70, none of the 1000 fitted models have an RMSEA value above
0.20. This indicates that if a model yields an RMSEA of 0.20 in practice, then this would be
indicative of excessively poor fit because 0 of the 1000 misspecified models output an RMSEA
value that high. With loadings of 0.40, none of the models have an RMSEA above 0.06. By
similar logic, this means 0.06 indicates poor fit with lower standardized loadings; however, for
researchers who routinely rely on HB cut-offs, an RMSEA of 0.06 would indicate good fit and
increase the probability that the theory under investigation would gain traction in the literature.
This phenomenon is not restricted to RMSEA: based on SRMR, a quarter of the
replications would have values below 0.04 and all replications are below the 0.08 HB cut-off when
the standardized loadings are equal to 0.40. As noted by Miles and Shevlin (2007), CFI is the least
susceptible, but about 15% of models are above 0.95 and about 5% are above 0.975. Instead of all
replications being rejected by TML, only 56% were rejected at the 0.01 level of significance and
77% were rejected at the 0.05 level of significance, a marked decrease in power compared to the
scenario where the loadings are 0.70, as noted in the derivation by Heene et al. (2011).
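The power drop can be sketched with the noncentral chi-square approximation of Saris and Satorra (1992): the same structural misspecification induces a smaller noncentrality parameter when loadings are low, so the same critical value is crossed less often. The noncentrality and degrees-of-freedom values below are hypothetical placeholders, not those of the simulated model; scipy is assumed.

```python
from scipy.stats import chi2, ncx2

def tml_power(ncp, df, alpha=0.05):
    """Approximate power of the TML test of exact fit when the statistic is
    noncentral chi-square with noncentrality `ncp` under misspecification."""
    crit = chi2.ppf(1.0 - alpha, df)          # central chi-square critical value
    return 1.0 - ncx2.cdf(crit, df, ncp)      # P(reject | misspecified model)

# Illustrative (hypothetical) noncentralities: low loadings -> small ncp -> low power
for ncp in (10.0, 40.0, 120.0):
    print(round(tml_power(ncp, df=87), 3))
```

Because the noncentrality shrinks with poorer measurement quality, TML behaves much like an underpowered test at a small sample size, as the next paragraphs discuss.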
Now consider the opposite extreme in Table 1, where the standardized loadings are 0.90
instead of 0.70. With excellent measurement quality, essentially every replication has an
RMSEA value above 0.20, an SRMR value above 0.14, and a CFI value below 0.775 while TML
again is able to correctly reject the model as misfitting. Using RMSEA as an example, this means
that when the loadings are 0.90, an RMSEA cut-off of 0.20 can distinguish good-fitting models
from poorly fitting (but only moderately misspecified) models just as well as 0.06 does when the
loadings are equal to 0.70. If using the HB cut-offs in practice, however, a model with an RMSEA
of 0.20 would never be considered anywhere close to fitting well.
To depict these results visually, Figure 2 shows the empirical distribution of values for
RMSEA with loading conditions of 0.40, 0.70, and 0.90. The difference between the distributions
is stark as there is no overlap whatsoever. The 0.70 loading distribution is slightly above the HB
cut-off which makes sense because the misspecification was non-trivial but moderate in
magnitude. On the other hand, the 0.40 loading distribution is completely below the HB cutoff
while the 0.90 loading distribution is almost entirely above an RMSEA of 0.20. Needless to say,
RMSEA functions very differently depending on the measurement quality and the exact same
degree of misspecification can be viewed in a very different light as a result. Similar patterns are
also present in the empirical distribution plots for SRMR and CFI, which are shown in Figure 3.
The SRMR distributions in the left panel display similar behavior to the RMSEA and the CFI
values are still affected, but to a lesser degree as there is some overlap across the factor loadings
conditions (as has been noted in Miles & Shevlin, 2007).
Consider the ramifications of these results: if one has excellent measurement quality, the
HB cut-offs of 0.06/0.08/0.95 for RMSEA/SRMR/CFI could be discarded in favor of values of
0.20/0.14/0.775, and one would still be able to identify nearly all moderately misspecified models;
the classification accuracy remains unblemished. Therefore, based on this model, if the standardized
loadings were 0.90 and one obtained AFI values of 0.12/0.12/0.85, a researcher could be confident
that the model fits approximately, possibly containing only trivial misspecifications of the same
magnitude as a 0.06/0.08/0.95 trio of fit indices with standardized loadings equal to 0.70 but no
moderate or severe misspecifications, even though these values are conventionally thought to
indicate unambiguously poor fit. As a rhetorical argument, we challenge readers to consider the last
time they saw a study confidently report 0.12/0.12/0.85 for RMSEA/SRMR/CFI in a positive light
in a top-tier outlet. However, based on the logic of AFI cut-offs, these values are just as good at
discriminating between good and bad models with loadings equal to 0.90 as HB cut-offs are for
models with loadings equal to 0.70.
More problematically, consider the case of poorer measurement quality: even a
0.04/0.04/0.975 trio of RMSEA/SRMR/CFI values does not guarantee much in terms of the model
fitting well despite the fact that most researchers would be pleased to achieve these AFI values for
their model and studies with less than enviable measurement quality are routinely published with
less reassuring fit values. Furthermore, TML has far less power to detect misspecifications under
such circumstances. The ramifications of this result are that many models that appear to fit well
based on the HB cut-offs are based on theories that, in actuality, may not be well supported by the
data if more nuanced assessments of data-model fit were employed or if these studies featured
more rigorous measurement models. Conversely, many models may appear to fit poorly and be
disregarded but may actually fit well if the quality of measurement is strong.
The Issue for Empirical Researchers
In essence, even though researchers strive to measure their latent variables with the highest
quality manifest variables, relying on strict AFI cut-offs to judge data-model fit ends up punishing
this diligence and rewarding studies whose models feature much poorer measurement quality
when a single cut-off is applied broadly. Stated more drastically, the meaning of good data-model
fit changes as a function of measurement quality even if all other conditions are held constant; for
example, an RMSEA value of 0.06 can be considered poor fit with low measurement quality,
adequate fit with loadings near 0.70 as in HB, or great fit with high measurement quality.
Moreover, although the criteria for determining good or bad fit with TML are unaltered under
different loading conditions (i.e., the interpretation of inferential tests is consistent), TML is far less
powerful with poorer measurement quality. Somewhat ironically, measurement quality with AFIs
is rather analogous to sample size with TML, sample size being the primary issue AFIs were
designed to mitigate.
With TML, a more sound methodological design (i.e., large sample size) makes good data-
model fit more difficult to obtain while good fit can be achieved more readily under the less
desirable condition of a smaller sample size, holding all else constant. Similarly with AFIs, good
data-model fit is difficult to achieve with more sound methodological designs (i.e., better
measurement quality) but less desirable design conditions (i.e., poorer measurement quality) will
result in better fit if all else is constant. At a basic level, using AFIs instead of TML effectively
trades problems associated with sample size for problems associated with measurement quality.
Empirical researchers are aware of the perils of sample size with TML, yet most are unacquainted
with the issues associated with measurement quality with AFIs.
To highlight the broader issue with an analogy, imagine a researcher fits a model to a sample
size of 25 (casting issues related to estimation difficulties aside) and reports a non-significant TML
value, concluding that the model fits well and that the theory is upheld. Reviewers and critical
readers would immediately cast doubt about this conclusion and would instinctively note that the
small sample size renders the inferential TML essentially powerless to detect violations unless they
are massive. Conversely, if a researcher presented a model with a sample of 25,000 and a
significant TML test, many reviewers and critical readers would note that the model may still fit
reasonably well and that the TML test is vastly overpowered in this context and may be detecting
trivial differences.
This exact same scenario exists with measurement quality, although it has largely gone
unnoticed in empirical literatures up to this point. If the standardized loadings are low, AFIs may
appear to indicate great fit, but readers should question whether the model actually fits or whether
the model is simply too underpowered to detect meaningful misfit. By a similar token, if
measurement quality is high, seemingly poor AFIs may be attributable to either (1) poor fit or (2)
a model that is overpowered and is detecting trivial discrepancies between the model and the data.
Although it is the current state of affairs, examining AFIs without taking measurement quality into
account is as egregious as interpreting p-values without taking sample size into account: a TML p-
value means very different things with 100 versus 10,000 people, just as an RMSEA of 0.06 means
something very different with high or low measurement quality. Conclusions from studies with
small samples are questioned (as they should be) but studies with poor measurement quality rarely
receive the same type of treatment with respect to data-model fit despite the fact that AFIs with
poor measurement quality are similarly underpowered as TML with small samples (and
overpowered for large samples and good measurement quality). More bluntly, the exercise of
comparing AFIs to a single, predetermined cut-off is akin to interpreting p-values as if every
dataset had the exact same sample size.
Conclusions and Recommendations for Practice
It is increasingly clear that no single cut-off value for any particular AFI can be broadly
applied across latent variable models. At this point, readers may hope to see revised
recommendations for adjudicating fit across a broader set of circumstances (measurement quality
in particular); although this is a logical next step, we refrain from doing so in an attempt to dissuade
the overgeneralizations that have run rampant in assessments of data-model fit. Even if updated
recommendations were provided to account for varying levels of measurement quality, these
recommendations would be just as susceptible as the original recommendations to factors that
obscure AFI interpretation and comparability such as model complexity, the number of indicators
per factor, and sample size.
Although the atmosphere appears to be changing as methodological research continues to
expose weaknesses with popular recommendations, current practice largely still operates under the
assumption that there is a single cut-off that can be used for each fit-index, which is akin to
recommending a single sample size to achieve adequate power across all statistical models.
Imagine a scenario in which all studies with, say, 200 or fewer observations were considered
to be underpowered. Yet, if one were researching a phenomenon with a large effect size, a sample
size of, say, 50 might be more than sufficient to detect true differences but, because the sample
size was below 200, the study might be poorly received in this hypothetical universe. On the other
hand, if the phenomenon of interest had a small, but non-zero, effect size, a sample size potentially
much larger than 200 would be needed to detect that a difference exists; however, many studies
would conclude that there are no significant differences if researchers repeatedly tested samples of
200 in accordance with the recommendation. By comparison, if one is researching a construct for
which very high measurement quality can be obtained, one need not subscribe to such stringent
AFI criteria to be confident that the model does not feature any non-trivial misspecifications. Conversely,
if one is researching a construct that cannot be measured very reliably, the currently employed cut-
offs are not suitable and would be very likely to overlook potentially meaningful misspecifications
in the model.
With regard to the reporting of results in empirical studies, it is common to report fit and
paths/correlations of the structural model but often studies do not report the standardized loadings
from the latent variables to the manifest variables (only 31% of reviewed studies reported the
standardized loadings in Jackson et al., 2009). We are cognizant of space limitations in journals
that publish empirical studies and the fact that many studies test multiple models, so reporting
each individual loading for each model may not be feasible (although this is the preferable option
if possible). However, as illustrated in the above simulation, it is vital to have a general idea of
the values of the standardized loadings in order to assess AFIs because, without this context, the
values of the AFIs are uninterpretable. We encourage applied researchers to report some type of
information pertaining to standardized loadings of the latent variables in the final model to help
contextualize the AFIs (e.g., mean, median, range). Alternatively, a measure of reliability that is
based on the magnitude of the standardized loadings such as McDonald’s omega for construct
reliability (McDonald, 1970) or coefficient H for maximal reliability (Hancock & Mueller, 2001)
may be able to succinctly provide such a contextualization.
Otherwise, readers, reviewers, and
editors have essentially no information upon which to judge the fit of the model, and an SRMR
value of 0.07, for example, can be interpreted many different ways conditional on measurement
quality. This does not solve the issue related to the reliability paradox, but it would help
researchers to be more upfront about the conditions from which their AFIs come. Additionally, it
would reward researchers who exercised due diligence to construct more reliable measures and
temper some inappropriate claims made from data with low reliability.
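Both summary measures suggested above are simple functions of the standardized loadings, so readers can compute them from whatever loading summary a paper reports. The sketch below (plain Python, assuming unit-variance standardized indicators) reproduces values in line with those given in the footnote for the three loading conditions; small discrepancies can arise from rounding.

```python
def mcdonald_omega(loadings):
    """Construct reliability (McDonald, 1970) from standardized loadings."""
    s = sum(loadings)
    theta = sum(1.0 - l ** 2 for l in loadings)  # unique variances
    return s ** 2 / (s ** 2 + theta)

def coefficient_h(loadings):
    """Maximal reliability, coefficient H (Hancock & Mueller, 2001)."""
    s = sum(l ** 2 / (1.0 - l ** 2) for l in loadings)
    return s / (1.0 + s)

# With equal standardized loadings the two coincide algebraically; the values
# below correspond to the 0.40/0.70/0.90 conditions with five indicators.
for lam in (0.40, 0.70, 0.90):
    print(lam, round(mcdonald_omega([lam] * 5), 2), round(coefficient_h([lam] * 5), 2))
```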
As a final note to put the implications of these findings into perspective, consider again the
two sets of AFIs mentioned near the beginning of the paper. As a reminder, in Model A, RMSEA
= 0.040, SRMR = 0.040, and CFI = 0.975; and in Model B, RMSEA = 0.20, SRMR = 0.14, and
CFI = 0.775. Under current practice, where the HB criteria have become common reference points,
Model A would be universally seen as fitting the data better than Model B, which would likely be
desk-rejected at many reputable journals. [Footnote: We acknowledge that standardized loadings
are not interchangeable with reliability, but they are related and may serve as a fair approximation
that is concise to report. If it helps contextualize this study, the coefficient H values for the 0.40,
0.70, and 0.90 loading conditions were 0.48, 0.83, and 0.96, respectively. The McDonald's omega
values for these conditions were 0.49, 0.83, and 0.96, respectively.] However, if one does not somehow condition on
measurement quality, this assertion can be highly erroneous. If the factor loadings in Model A had
standardized values of 0.40 and the factor loadings in Model B had standardized values of 0.90,
Model B actually indicates better data-model fit and has higher power to detect the same moderate
misspecification in the same model based on the results of our illustrative simulation study
(assuming multivariate normality). Reverting back to Table 1, about 25% of moderately
misspecified models produced SRMR values below 0.04, about 5% of models resulted in CFI values
above 0.975, and nearly 95% of models produced an RMSEA value below 0.04 with poor
measurement quality. Conversely, with excellent measurement quality, essentially none of the
misspecified models produced an SRMR value less than 0.14, an RMSEA value less than 0.20, or
a CFI value less than 0.775. Even though the AFI values of Model B appear quite poor upon first
glance, under certain conditions, even these seemingly unsatisfactory values could indicate
acceptable fit with possibly only trivial misspecifications present in the model. More importantly,
the seemingly poor Model B AFI values better classify models with excellent measurement quality
compared to the seemingly pristine Model A AFI values when measurement quality is poor. To
put the thesis of this paper into a single sentence, information about the quality of the
measurement must be reported along with AFIs in order for the values to have any interpretative
meaning.

References
Allan, N. P., Lonigan, C. J., & Phillips, B. M. (2015). Examining the factor structure and
structural invariance of the PANAS across children, adolescents, and young adults. Journal of
Personality Assessment, 97, 616-625.
Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A
review and recommendations. The Leadership Quarterly, 21, 1086-1120.
Bardeen, J. R., Fergus, T. A., Hannan, S. M., & Orcutt, H. K. (2016). Addressing psychometric
limitations of the Difficulties in Emotion Regulation Scale through item modification. Journal of
Personality Assessment, 98, 298-309.
Barrett, P. (2007). Structural equation modelling: Adjudging model fit. Personality and
Individual Differences, 42, 815-824.
Beauducel, A., & Wittmann, W. W. (2005). Simulation study on fit indexes in CFA based on
data with slightly distorted simple structure. Structural Equation Modeling, 12, 41-75.
Bentler, P. M. (2007). On tests and indices for evaluating structural models. Personality and
Individual Differences, 42, 825-829.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of
covariance structures. Psychological Bulletin, 88, 588-606.
Browne, M. W., MacCallum, R. C., Kim, C. T., Andersen, B. L., & Glaser, R. (2002). When fit
indices and residuals are incompatible. Psychological Methods, 7, 403-421.
Chen, F., Curran, P. J., Bollen, K. A., Kirby, J., & Paxton, P. (2008). An empirical evaluation of
the use of fixed cutoff points in RMSEA test statistic in structural equation models. Sociological
Methods & Research, 36, 462-494.
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing
measurement invariance. Structural Equation Modeling, 9, 233-255.
Cole, D. A., & Preacher, K. J. (2014). Manifest variable path analysis: Potentially serious and
misleading consequences due to uncorrected measurement error. Psychological Methods, 19, 300-315.
Credé, M., & Harms, P. D. (2015). 25 years of higher-order confirmatory factor analysis in the
organizational sciences: A critical review and development of reporting
recommendations. Journal of Organizational Behavior, 36, 845-872.
Demianczyk, A. C., Jenkins, A. L., Henson, J. M., & Conner, B. T. (2014). Psychometric
evaluation and revision of Carver and White's BIS/BAS scales in a diverse sample of young
adults. Journal of Personality Assessment, 96, 485-494.
Fan, X., & Sivo, S. A. (2005). Sensitivity of fit indexes to misspecified structural or measurement
model components: Rationale of two-index strategy revisited. Structural Equation
Modeling, 12, 343-367.
Fan, X., & Sivo, S. A. (2007). Sensitivity of fit indices to model misspecification and model
types. Multivariate Behavioral Research, 42, 509-529.
Fergus, T. A., Valentiner, D. P., McGrath, P. B., Gier-Lonsway, S. L., & Kim, H. S. (2012). Short
forms of the social interaction anxiety scale and the social phobia scale. Journal of Personality
Assessment, 94, 310-320.
Hancock, G. R., & Mueller, R. O. (2011). The reliability paradox in assessing structural relations
within covariance structure models. Educational & Psychological Measurement, 71, 306-324.
Hancock, G. R., & Mueller, R. O. (2001). Rethinking construct reliability within latent variable
systems. In R. Cudeck, S. du Toit, & D. Sörbom (Eds.), Structural equation modeling: Present
and future: A festschrift in honor of Karl Jöreskog (pp. 195-216). Lincolnwood, IL: Scientific
Software International.
Hayduk, L. A. (2014). Shame for disrespecting evidence: the personal consequences of
insufficient respect for structural equation model testing. BMC Medical Research
Methodology, 14, 124.
Hayduk, L. A., & Glaser, D. N. (2000). Jiving the four-step, waltzing around factor analysis, and
other serious fun. Structural Equation Modeling, 7, 1-35.
Hayduk, L., Cummings, G., Boadu, K., Pazderka-Robinson, H., & Boulianne, S. (2007). Testing!
Testing! One, two, three: Testing the theory in structural equation models! Personality and
Individual Differences, 42, 841-850.
Heene, M., Hilbert, S., Draxler, C., Ziegler, M., & Bühner, M. (2011). Masking misfit in
confirmatory factor analysis by increasing unique variances: a cautionary note on the usefulness of
cutoff values of fit indices. Psychological Methods, 16, 319-336.
Hu, L. T., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to
underparameterized model misspecification. Psychological Methods, 3, 424-453.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.
Ion, A., Iliescu, D., Aldhafri, S., Rana, N., Ratanadilok, K., Widyanti, A., & Nedelcea, C. (2016).
A cross-cultural analysis of personality structure through the lens of the HEXACO model. Journal
of Personality Assessment, OnlineFirst.
Jackson, D. L., Gillaspy Jr, J. A., & Purc-Stephenson, R. (2009). Reporting practices in
confirmatory factor analysis: an overview and some recommendations. Psychological
Methods, 14, 6-23.
Joshanloo, M. (2016). Factor Structure of Subjective Well-Being in Iran. Journal of Personality
Assessment, 98, 435-443.
Jöreskog, K. G., & Sörbom, D. (1981). LISREL V: Analysis of linear structural relationships by
maximum likelihood and least squares methods. University of Uppsala, Department of Statistics.
Kang, Y., McNeish, D.M., & Hancock, G. R. (2016). The role of measurement quality on
practical guidelines for assessing measurement and structural invariance. Educational and
Psychological Measurement, 76, 533-561.
Marsh, H. W., Hau, K. T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis
testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and
Bentler's (1999) findings. Structural Equation Modeling, 11, 320-341.
McDonald, R. P. (1970). The theoretical foundations of principal factor analysis, canonical factor
analysis, and alpha factor analysis. British Journal of Mathematical and Statistical
Psychology, 23, 1-21.
McIntosh, C. (2007). Rethinking fit assessment in structural equation modelling: A commentary
and elaboration on Barrett (2007). Personality and Individual Differences, 42, 859-867.
Michel, J. S., Pace, V. L., Edun, A., Sawhney, E., & Thomas, J. (2014). Development and
validation of an explicit aggressive beliefs and attitudes scale. Journal of Personality
Assessment, 96, 327-338.
Miles, J., & Shevlin, M. (2007). A time and a place for incremental fit indices. Personality and
Individual Differences, 42, 869-874.
Millsap, R. E. (2007). Structural equation modeling made difficult. Personality and Individual
Differences, 42, 875-881.
Mulaik, S. (2007). There is a place for approximate fit in structural equation modelling.
Personality and Individual Differences, 42, 883-891.
Rice, K. G., Richardson, C. M., & Tueller, S. (2014). The short form of the revised almost perfect
scale. Journal of Personality Assessment, 96, 368-379.
Saris, W. E., Satorra, A., & Van der Veld, W. M. (2009). Testing structural equation models or
detection of misspecifications?. Structural Equation Modeling, 16, 561-582.
Saris, W. E., & Satorra, A. (1992). Power evaluations in structural equation models. In Bollen, K.
A., & Long, S. (Eds.), Testing structural equation models (pp. 181-204). London, England: Sage.
Savalei, V. (2012). The relationship between root mean square error of approximation and model
misspecification in confirmatory factor analysis models. Educational and Psychological
Measurement, 72, 910-932.
Steiger, J. H. (2007). Understanding the limitations of global fit assessment in structural equation
modeling. Personality and Individual Differences, 42, 893-898.
Steiger, J. H. (2000). Point estimation, hypothesis testing, and interval estimation using the
RMSEA: Some comments and a reply to Hayduk and Glaser. Structural Equation Modeling, 7, 149-162.
Steiger, J.H. & Lind, J.C. (1980, May). Statistically-based tests for the number of common
factors. Paper presented at the annual meeting of the Psychometric Society, Iowa City, IA.
Tomarken, A. J., & Waller, N. G. (2003). Potential problems with "well fitting" models. Journal
of Abnormal Psychology, 112, 578-598.
You, J., Leung, F., Lai, K. K. Y., & Fu, K. (2013). Factor structure and psychometric properties of
the Pathological Narcissism Inventory among Chinese university students. Journal of Personality
Assessment, 95, 309-318.
Table 1
Proportion of the replications declared poorly fitting for various loading magnitudes under
different criteria.
Note: The values in the second row are the hypothetical AFI cut-off values to which the
percentage of rejected models corresponds. The bolded row represents the condition used in Hu
and Bentler (1999).
Figure 1. Confirmatory factor analysis model used in Hu and Bentler (1999). As in HB, in the
population model for the simulation, ϕ12 = 0.50, ϕ13 = 0.40, and ϕ23 = 0.30.
Figure 2. Empirical distribution of RMSEA for selected factor loading conditions. The vertical
black line corresponds to the HB cut-off of 0.06. Values to the right of the vertical black line
would be classified as having poor data-model fit, values to the left indicate good data-model fit.
Figure 3. Empirical distribution of SRMR (left panel) and CFI (right panel) for selected factor
loading conditions. The vertical black line corresponds to the HB cut-off of 0.08 for SRMR and
0.95 for CFI. Values to the right of the vertical black line would be classified as having poor data-
model fit, values to the left indicate good data-model fit.
... Cognitive interview data and qualitative analyses, which informed item development, were often used to inform decisions. Statistical information informing item selection decisions included: (1) low item variance or very low item-rest correlations; (2) high rates of item non-response or not applicable responses resulting from context-specific items; (3) weak standardized factor loadings (< 0.6) or weak discrimination parameters in the IRT models (< 1.35) [40,41]; (4) examination of IRT item location parameters, item-category response function plots and item information plots to assess the contribution of each item to measuring the latent variable; (5) indications of problems with item performance based on the 1-2 largest modification indices at each model iteration, such as correlated errors (which violates the conditional independence assumption) or cross-loading of an item between domains (indicative of weak discriminant validity of an item); (6) evidence of differential item functioning by sociodemographic variables; and (7) areas of local strain based on residual correlations. A more detailed description of how each of these criteria were used is provided in Supplementary Appendix. ...
... Standard cut-offs widely used in the literature as indicative of good fit were used (SRMR < 0.05, RMSEA < 0.06, and TLI and CFI > 0.95) [43]. However, a RMSEA of 0.06 can indicate poor fit in a model with low standardized factor loadings (e.g., 0.40) while a RMSEA of 0.20 can indicate good fit in a model with high standardized factor loadings (e.g., > 0.90) [40]. So, we also assessed the fit of individual items using CFA standardized factor loadings, IRT item discrimination parameters, and the infit and outfit mean square fit statistics. ...
... Standardized factor loadings in the range 0.60-0.74 were interpreted as high, and values ≥ 0.75 were interpreted as very high [40]. IRT item discrimination parameters in the range 1.35-1.69 ...
Full-text available
Purpose To select and scale items for the seven domains of the Patient-Reported Inventory of Self-Management of Chronic Conditions (PRISM-CC) and assess its construct validity. Methods Using an online survey, data on 100 potential items, and other variables for assessing construct validity, were collected from 1055 adults with one or more chronic health conditions. Based on a validated conceptual model, confirmatory factor analysis (CFA) and item response models (IRT) were used to select and scale potential items and assess the internal consistency and structural validity of the PRISM-CC. To further assess construct validity, hypothesis testing of known relationships was conducted using structural equation models. Results Of 100 potential items, 36 (4–8 per domain) were selected, providing excellent fit to our hypothesized correlated factors model and demonstrating internal consistency and structural validity of the PRISM-CC. Hypothesized associations between PRISM-CC domains and other measures and variables were confirmed, providing further evidence of construct validity. Conclusion The PRISM-CC overcomes limitations of assessment tools currently available to measure patient self-management of chronic health conditions. This study provides strong evidence for the internal consistency and construct validity of the PRISM-CC as an instrument to assess patient-reported difficulty in self-managing different aspects of daily life with one or more chronic conditions. Further research is needed to assess its measurement equivalence across patient attributes, ability to measure clinically important change, and utility to inform self-management support.
... While never intended to become gold standard thresholds of performance, they have unfortunately become entrenched in modeling practice (Marsh et al., 2004). A further complication with the use of AFI cutoffs is that their performance differs based on research conditions, such as sample size (Chen et al., 2008), the number of variables in the model (Kenny and McCoach, 2003), magnitude of the correlations amongst the observed variables (Fornell and Larcker, 1981;Marsh et al., 2004), and measurement quality (McNeish et al., 2018;Browne et al., 2002). When working from the cutoffs alone, conflicting messages can create ambiguity associated with model evaluation. ...
... Model fit was assessed with the standardized root mean square residual (SRMR); values below 0.08 indicate good fit (McNeish et al., 2018). Table 5 shows the direct effects of all studied variables; H1a showed a positive and significant impact of EL on EGIB, and therefore H1 is supported (β = 0.543; t = 12.761; p < 0.001). ...
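The SRMR referenced in this excerpt summarizes the average discrepancy between the observed and model-implied correlations. A minimal sketch of the computation, with small illustrative matrices (not data from any cited study):

```python
import math

def srmr(observed, implied):
    """SRMR over the lower triangle (including the diagonal) of two
    correlation matrices supplied as lists of lists."""
    p = len(observed)
    total, count = 0.0, 0
    for i in range(p):
        for j in range(i + 1):
            total += (observed[i][j] - implied[i][j]) ** 2
            count += 1
    return math.sqrt(total / count)

obs = [[1.00, 0.50, 0.40],
       [0.50, 1.00, 0.30],
       [0.40, 0.30, 1.00]]
imp = [[1.00, 0.45, 0.45],
       [0.45, 1.00, 0.35],
       [0.45, 0.35, 1.00]]
print(round(srmr(obs, imp), 4))  # 0.0354, well under the 0.08 cutoff
```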
To enhance environmental protection and sustainable development, green innovation (GI) is an inevitable choice for enterprises. This study incorporates social identity theory and social learning theory to explore the impact of ethical leadership on employee GI behavior. In addition, this study also examines the mediating effects of green organizational identity (GOI) and the moderating role of strategic flexibility (SF). Using structural equation modeling, an empirical survey was conducted among 300 Chinese manufacturing companies. The study found that ethical leadership (EL) positively affects employees' GI behavior (EGIB). It also positively impacts GOI, which in turn leads to EGIB. In addition, the study confirmed that GOI plays a mediating role in the relationship between EL and EGIB. The results further indicated that SF positively enhances the effect of GOI on EGIB. The findings make important contributions to theory and practice in the current research context.
... Model fit was assessed in a manner similar to phase one. We used model fit indices in tandem with the strength of the factor loadings to consider overall measurement quality (McNeish et al., 2018). ...
Confirmatory factor analyses (CFA) are widely used in the organizational literature. As a result, understanding how to properly conduct these analyses, report the results, and interpret their implications is critically important for advancing organizational research. The goal of this paper is to summarize the complexities of CFA models and, therefore, to provide a resource for journal reviewers and researchers who are using CFA in their research. The topics covered in this paper include the estimation process, power analyses, model fit, and model modifications, among other things. In addition, this paper concludes with a checklist that summarizes the key points that are discussed and can be used to evaluate future studies that incorporate CFA.
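Among the fit indices such a CFA checklist typically covers is the comparative fit index (CFI), which compares the fitted model's excess noncentrality with that of a baseline (independence) model. The sketch below uses illustrative chi-square values, not results from any cited study:

```python
def cfi(chi2_m, df_m, chi2_b, df_b):
    """CFI = 1 - max(chi2_m - df_m, 0) / max(chi2_m - df_m, chi2_b - df_b, 0)."""
    num = max(chi2_m - df_m, 0.0)
    den = max(chi2_m - df_m, chi2_b - df_b, 0.0)
    return 1.0 if den == 0 else 1.0 - num / den

# Illustrative values: target model vs. independence baseline
print(round(cfi(chi2_m=150.0, df_m=50, chi2_b=2000.0, df_b=66), 3))  # ~0.948
```

Because the baseline model enters the denominator, a weakly correlated set of indicators (a poor baseline in the other direction) can inflate or deflate CFI independently of the target model's misspecification, which is one reason fixed cutoffs can mislead.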
Conference Paper
OBJECTIVES AND PRACTICAL RELEVANCE. The study reports on technology and business undergraduate adolescents' (n = 76) innovativeness competence personality traits using the Global Innovativeness (GI) scale (Hurt et al., 1977; Goldsmith & De Witt, 2003), converging with the Ryan & Tipu (2013) Full Range Leadership and Innovation Management Questionnaire scales (FRL & IM), which were previously validated (Uusi-Kaakkuri et al., 2016). The aim is to evaluate the IQS's reliability and validity and discuss its results from the perspective of the author's recent publications on FRL support for IM thematics. Piloted outcomes from the GI provide new insight and a comforting fit for innovativeness psychological coherence, invariance measurement, and test reliability. The GI is like the brightest zodiac of the star chart for positioning personality qualities. In contrast, organizations seek continuous improvement of their business models for higher innovation capabilities. FRL and IM skills are essential to combine to be successful in today's social systems (Day & Sammons, 2013). High GI promotes curiosity to try, creativity, and leadership, and helps in facing ambiguities in periodic liveliness. The validated Ryan & Tipu (2013) GI instrument's potential is the quintessential set of traversed features that facilitate a perceptual innovativeness flow of inventions. METHODS. Study variables were tested for reliability via construct and composite validity. Factor analysis (FA) with principal component analysis (PCA) showed, in general, a meritorious (.876) Kaiser-Meyer-Olkin (KMO) value without altering the endogenous variable structures by eliminating weaknesses through modifications. The weaknesses indicated the most interesting discoveries (for continuous improvement). Weaknesses are visible in the cross-tabulated correlation/covariance structure, in which each square-rooted average variance extracted surpasses comparable values for almost every correlating loading.
Nonetheless, the meritorious model was acceptable for further analysis, factorizing successfully. RESEARCH QUESTIONS. The research questions (RQs) are formed at the top levels and cover the treatment of psychometric sub-concepts. The RQs ask at what level, association, and context the core characters of the IQS, FLQS, and IMQS relate to the existing framework and model fit indices. DATA FINDINGS. As empirical results for RQ 1, the barometric examination showed mainly negatively skewed sum-variable relations for the Global Innovativeness (GI) latent competencies: Willing to Try (WT), Creativity (CR), Opinion Leader (OL), and Ambiguities (AM). The broad leadership competencies compare the Full Range Leadership (FRL) and Innovation Management (IM) parent concepts with GI. That leaves us RQ 2 on the one hand. The confidence interval comparison of WT (β = .632**), CR (β = .798**), OL (β = .803**), and AM (β = .562**) shows meaningful, strong correlations with the GI sum, leaving the research hypotheses valid due to these connections. Meta-level predictors of GI support FRL (β = .263**) and IM (β = .166**). On the other hand, RQ 3 addresses discriminant validity via squared correlation loadings. The study discriminates the weak latent variables for RQ 3. AVE accounted for ~19 to 39% of the variance in individuals' innovative behaviors, as Mirjana et al. (2018) theory dictates. CONCLUSIONS. The sieved GI instrument latent variables are finally considered, elaborating on the problem areas of cause-and-effect relationships and solutions with FRL and IM. Positive evidence for an idealized innovative cognitive style results in leadership practice recommendations. MANAGERIAL IMPLICATIONS. The supportive structure of innovation competency needs to be strengthened by linking authentic corporate collaboration and an EU-level leadership vision.
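The discriminant-validity check described in this abstract, comparing each construct's square-rooted average variance extracted (AVE) against its correlations with other constructs, is the Fornell-Larcker criterion. A minimal sketch with hypothetical standardized loadings (none of these numbers are from the study above):

```python
import math

def ave(loadings):
    """Average variance extracted: mean of squared standardized loadings."""
    return sum(l * l for l in loadings) / len(loadings)

def discriminant_ok(loadings_a, loadings_b, r_ab):
    """Fornell-Larcker criterion: sqrt(AVE) of each construct must exceed
    the correlation between the two constructs."""
    return math.sqrt(ave(loadings_a)) > abs(r_ab) and \
           math.sqrt(ave(loadings_b)) > abs(r_ab)

# Hypothetical standardized loadings for two constructs
a = [0.70, 0.80, 0.75]
b = [0.65, 0.72, 0.78]
print(discriminant_ok(a, b, r_ab=0.60))  # True: both sqrt(AVE)s exceed 0.60
```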
The Job Demands-Resources (JDR) framework has established job and personal resources as essential elements motivating people to perform. Whilst the purpose of job resources in this motivational process is well established, the role of personal resources is still quite ambiguous. Within the JDR framework, personal resources could (a) directly affect performance, (b) indirectly affect the relationship between a job resource and a performance outcome and (c) moderate the job resource-performance relationship. Grit has recently emerged as a promising personal resource as it could potentially act as a direct antecedent, mediator and moderator within the motivational process of the JDR. To further the debate on the role of personal resources, this paper explores the function of grit (as a personal resource) within the person-environment fit (job resource) and task performance relationship. Specifically, the aim is to determine whether grit directly or indirectly affects the relationship between person-environment fit and task performance. Finally, it aims to investigate whether grit moderates this relationship. Data were collected from 310 working adults through electronic surveys, and the relationships were explored through structural equation modelling. When controlling for age and gender, the results showed a positive association between person-environment fit, grit and task performance. Further, grit was also found to indirectly affect the relationship between person-environment fit and task performance. However, no moderating effect could be established. This signifies the importance of grit as a psychological process, rather than a buffering element, that may explain how person-environment fit affects performance outcomes.
Racial attitudes, beliefs, and motivations lie at the center of many influential theories of prejudice and discrimination. The extent to which such theories can meaningfully explain behavior hinges on accurate measurement of these latent constructs. We evaluated the validity properties of 25 race-related scales in a sample of 910,066 respondents using various tools, including dynamic fit indices, item response theory, and nomological nets. Despite showing adequate internal reliability, many scales demonstrated poor model fit and had latent score distributions showing clear floor or ceiling effects, results that illustrate deficiencies in these measures' ability to capture their intended latent construct. Nomological nets further suggested that the theoretical space of "racial prejudice" is crowded with scales that may not capture meaningfully distinct latent constructs. We provide concrete recommendations for both scale selection and scale renovation and outline implications for overlooking measurement issues in the study of prejudice and discrimination.
Structural equation modeling (SEM) is an important statistical method in social science research. In the first two decades of the 21st century, great progress has been made in methodological research on SEM in China's mainland. The publications cover five aspects: model development, parameter estimation, model evaluation, measurement invariance and the special data processing in SEM. SEM development includes the research on measurement models, structural models, and complete models, as well as the SEM in population heterogeneity studies and longitudinal studies. The research on the measurement models involves bi-factor model, exploratory structural equation model, measurement models for special design (e.g., random intercept factor analysis model, fixed-links model, and the Thurston model), and formative measurement models. The research on the structural models involves the actor-partner interdependence model. The research on the complete models focuses on item parceling. The SEM in the study of population heterogeneity involves latent class/profile model, factor mixture model, and multi-level latent class model. The SEM in longitudinal studies includes models describing development trajectories and differences, such as the latent growth model, the piecewise growth model, the latent class growth model, the growth mixture model, the piecewise growth mixture model, the latent transition model and the cross-lagged model. The publications on parameter estimation methods mainly involve the introduction of methodology (including the partial least square method and the Bayesian method) and the comparison of different parameter estimation methods. Advances in the model evaluation include fit indices and their corresponding critical values, selection of fit indices, model evaluation criteria beyond fit indices, and comparison and selection among alternative models. 
The development of measurement invariance involves three topics: (1) the introduction of different models with testing process and model evaluation criteria for measurement invariance analysis; (2) measurement invariance analysis in a particular model or data (e.g., second order factor model and ordered categorical data); (3) new methods of measurement invariance analysis (e.g., alignment and projection method). In addition, research into special data processing methods in SEM addresses issues of missing data, non-continuous data, non-normal data, and latent variable scores. Finally, recent advances in SEM methodological research abroad are introduced to help researchers understand some cutting-edge topics in this field, which offers implications for future directions of SEM methodological research.
Objective: to develop a new justified scale of fear of crime, based on the theory of constructed emotion, qualitative interviews and factor analysis. Methods: dialectical approach to cognition of social phenomena, using the general scientific and specific scientific methods of cognition based on it. Results: Fear of crime researchers have long debated how best to define and measure fear of crime. There is disagreement about the definition of fear of crime, which has led to inconsistent measurement. Our goal was to develop a new fear of crime scale using a theory of emotion and rigorous methodology. Scale development involved five major stages: in-depth interviews to understand how people describe their fear of crime, qualitative analysis to develop questionnaire items, pretesting, factor analyses, and psychometric validation. Qualitative interviews (N = 29) revealed that people use words like “fear”, “worry”, and “concern” interchangeably. After qualitative analysis led to an initial item pool, factor analyses yielded a 10-item, one-factor scale. Quantitative analyses (N = 665) revealed standardized factor loadings between 0.715 and 0.888, an internal consistency of α = 0.945, and convergent and divergent validity. Our new measure will allow greater precision when researching fear of crime. Scientific novelty: this study introduced the theory of constructed emotion to the study of fear of crime. The wide range of interviewees’ descriptions of their fear of crime is consistent with the theory of constructed emotion. Many interviewees conflated fear, worry, concern, and other emotion words, which illustrates the concept of emotional granularity. When someone uses words like “fear” and “concern” interchangeably, it suggests that that person’s experience of those emotions is the same in that context.
The theory of constructed emotion posits that emotions are subjective and depend on the present context, someone’s previous experiences, and their understanding and use of emotion words. According to the qualitative interviews, fear of crime encompasses many feelings including concern, unpleasant affect, worry, anxiety, paranoia, and panic. These findings will allow future research to further build theory on fear of crime. Practical significance: the main provisions and conclusions of the article can be used in scientific, pedagogical and law enforcement activities when considering issues related to levels of fear of crime. The article was first published in English by Criminology, Criminal Justice, Law & Society and The Western Society of Criminology, hosted by Scholastica. Original publication: Etopio, Au. L., & Berthelot, E. R. (2022). Defining and Measuring Fear of Crime: A New Validated Scale Created from Emotion Theory, Qualitative Interviews, and Factor Analyses. Criminology, Criminal Justice, Law & Society, 23(1), 46–67.
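The reported loadings (0.715 to 0.888) and internal consistency can be loosely cross-checked with a composite reliability (McDonald's omega) computation from standardized loadings. The ten loadings below are illustrative values within the reported range, not the scale's actual item loadings:

```python
def omega(loadings):
    """Composite reliability (McDonald's omega) from standardized loadings,
    assuming a one-factor model with uncorrelated errors."""
    s = sum(loadings)
    error = sum(1.0 - l * l for l in loadings)
    return s * s / (s * s + error)

# Ten illustrative loadings within the reported 0.715-0.888 range
lam = [0.80] * 10
print(round(omega(lam), 3))  # ~0.947, in the vicinity of the reported 0.945
```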
Across 5 different samples, totaling more than 1,600 participants from India, Indonesia, Oman, Romania, and Thailand, the authors address the question of cross-cultural replicability of a personality structure, while exploring the utility of exploratory structural equation modeling (ESEM) as a data analysis technique in cross-cultural personality research. Personality was measured with an alternative, non–Five-Factor Model (FFM) personality framework, provided by the HEXACO–PI (Lee & Ashton, 2004). The results show that the HEXACO framework was replicated in some of the investigated cultures. The ESEM data analysis technique proved to be especially useful in investigating the between-group measurement equivalence of broad personality measures across different cultures.
Goodness-of-fit (GOF) indexes provide "rules of thumb"—recommended cutoff values for assessing fit in structural equation modeling. Hu and Bentler (1999) proposed a more rigorous approach to evaluating decision rules based on GOF indexes and, on this basis, proposed new and more stringent cutoff values for many indexes. This article discusses potential problems underlying the hypothesis-testing rationale of their research, which is more appropriate to testing statistical significance than evaluating GOF. Many of their misspecified models resulted in a fit that should have been deemed acceptable according to even their new, more demanding criteria. Hence, rejection of these acceptable-misspecified models should have constituted a Type 1 error (incorrect rejection of an "acceptable" model), leading to the seemingly paradoxical results whereby the probability of correctly rejecting misspecified models decreased substantially with increasing N. In contrast to the application of cutoff values to evaluate each solution in isolation, all the GOF indexes were more effective at identifying differences in misspecification based on nested models. Whereas Hu and Bentler (1999) offered cautions about the use of GOF indexes, current practice seems to have incorporated their new guidelines without sufficient attention to the limitations noted by Hu and Bentler (1999).
Subjective well-being is predominantly conceived as having 3 components: life satisfaction, positive affect, and negative affect. This article reports 2 studies that seek to investigate the factor structure of subjective well-being in Iran. One-, two-, and three-factor models of subjective well-being were evaluated using confirmatory factor analysis (CFA) and exploratory structural equation modeling (ESEM). The results of Study 1 (N = 2,197) and Study 2 (N = 207) show that whereas the 1- and 2-factor models do not fit the data well, the 3-factor model provides an adequate fit. These results indicate that the 3 components of subjective well-being constitute 3 interrelated, yet distinct, factors. The analyses demonstrate how traditional CFA and ESEM can be combined to obtain a clear picture of the measurement model of subjective well-being and generate new insights about individual items and cross-loadings needed to derive more parsimonious measures. Nuances relating to the assessment of subjective well-being in more collectivist and Muslim countries are discussed.
Through its frequent use, a pattern has emerged showing psychometric limitations of the Difficulties in Emotion Regulation Scale (DERS; Gratz & Roemer, 2004). This 3-part study sought to (a) determine whether these limitations are due to a method effect by rewording all reverse-coded items in a straightforward manner and submitting them to exploratory factor analysis (EFA), and (b) examine the tenability of an adaptation of the original measure. EFA results from Study 1 (N = 743) supported retention of 29 modified items across 5 factors. Consistent with the original theoretical underpinnings of the DERS, Awareness and Clarity items loaded on the same factor. In Study 2 (N = 738), confirmatory factor analysis (CFA) was used to examine the factor structure of the pool of items identified in Study 1. All of the modified subscales clustered strongly with one another and evidenced large loadings on a higher-order emotion regulation construct. These results were replicated in Study 3 (N = 918). Results from Study 3 also provided support for the reliability and validity of scores on the modified version of the DERS (i.e., internal consistency, convergent and criterion-related validity). These findings provide psychometric support for a modified version of the DERS.
Hayduk and Glaser (2000) asserted that the most commonly used point estimate of the Root Mean Square Error of Approximation index of fit (Steiger & Lind, 1980) has two significant problems: (a) the frequently cited target value of .05 is not a stable target, but a "sample size adjustment"; and (b) the truncated point estimate Rt = max(R, 0) effectively throws away a substantial part of the sampling distribution of the test statistic with "proper models," rendering it useless a substantial portion of the time. In this article, I demonstrate that both issues discussed by Hayduk and Glaser are actually not problems at all. The first "problem" derives from a false premise by Hayduk and Glaser that Steiger (1995) specifically warned about in an earlier publication. The second so-called problem results from the point estimate satisfying a fundamental property of a good estimator and can be shown to have virtually no negative implications for statistical practice.
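The truncation Rt = max(R, 0) debated here simply clips the estimated noncentrality to zero whenever the chi-square statistic falls below its degrees of freedom, i.e., when the model fits better than expected by chance. A minimal sketch with illustrative numbers (not taken from either paper):

```python
import math

def rmsea_point(chi2, df, n):
    """Truncated RMSEA point estimate, Rt = max(R, 0): a negative estimated
    noncentrality (chi2 < df) is clipped to zero before the square root."""
    noncentrality = (chi2 - df) / (df * (n - 1))
    return math.sqrt(max(noncentrality, 0.0))

# A "proper" model fitting better than expected by chance is truncated to 0:
print(rmsea_point(chi2=42.0, df=50, n=300))           # 0.0
# A model with excess misfit yields a positive point estimate:
print(round(rmsea_point(chi2=80.0, df=50, n=300), 3))  # 0.045
```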