The Thorny Relation between Measurement Quality and Fit Index Cut-Offs in Latent Variable Models
Daniel MCNEISH, Ji AN, & Gregory R. HANCOCK
University of Maryland, College Park
Correspondence concerning this manuscript should be addressed to the first author, Daniel McNeish, who is
now at the University of North Carolina, Chapel Hill, 100 E. Franklin Street Suite 200, Chapel
Hill, NC 27599, USA. (Email: email@example.com)
MEASUREMENT QUALITY AND FIT INDICES 1
Latent variable modeling is a popular and flexible statistical framework. Concomitant with fitting
latent variable models is assessment of how well the theoretical model fits the observed data.
Although firm cut-offs for these fit indices are often cited, recent statistical proofs and simulations
have shown that these fit indices are highly susceptible to measurement quality. For instance, an
RMSEA value of 0.06 (conventionally thought to indicate good fit) can actually indicate poor fit
with poor measurement quality (e.g., standardized factors loadings of around 0.40). Conversely,
an RMSEA value of 0.20 (conventionally thought to indicate very poor fit) can indicate acceptable
fit with very high measurement quality (standardized factor loadings around 0.90). Despite the
wide-ranging effect on applications of latent variable models, the high level of technical detail
involved with this phenomenon has curtailed the exposure of these important findings to empirical
researchers who are employing these methods. This paper briefly reviews these methodological
studies in minimal technical detail and provides a demonstration to easily quantify the large
influence measurement quality has on fit index values and how greatly the cut-offs would change
if they were derived under an alternative level of measurement quality. Recommendations for best
practice are also discussed.
The Thorny Relation between Measurement Quality and Fit Index Cut-Offs in Latent Variable Models
Latent variable models have burgeoned in popularity and have been a fixture in the toolkits
of psychologists analyzing empirical data. As a critical step in the evaluation of latent variable
models, researchers routinely investigate how well their theoretical model fits to the observed
data. Such assessments of data-model fit are crucial in the appraisal of psychological theories –
favorable data-model fit can lend support to a theory while poor data-model fit can call a theory
into question. Despite the substantial role data-model fit plays in the assessment of theories tested
via latent variable models, the appropriate manner of assessing data-model fit has long been a
controversial issue in the methodological literature. Lively debates have erupted over whether the
minimum fit function chi-square test is the only true measure of fit or whether its tendency to
become highly powered with larger samples may necessitate the use of descriptive, approximate
goodness of fit indices (AFIs; see Antonakis, Bendahan, Jacquart, & Lalive, 2010; Barrett, 2007;
Bentler, 2007; Browne, MacCallum, Kim, Andersen, & Glaser, 2002; Chen, Curran, Bollen,
Kirby, & Paxton 2008; Credé & Harms, 2015; Hayduk, Cummings, Boadu, Pazderka-Robinson, &
Boulianne, 2007; Hayduk & Glaser, 2000; Hayduk, 2014; McIntosh, 2007; Miles & Shevlin,
2007; Mulaik, 2007; Steiger, 2007; Tomarken & Waller, 2003). If one concludes that AFIs are
appropriate, then, because they are largely descriptive measures lacking traditional inferential
properties (i.e., p-values), the question immediately arises as to which values are indicative of
acceptable data-model fit. Throughout the 1980s and 1990s, there was little consensus about
which values of which AFIs could be considered reflective of a well-fitting model and researchers
often relied on experience, intuition, or subjective criteria to varying degrees (Marsh, Hau, &
Wen, 2004).
Hu and Bentler (1999; hereafter referred to as HB) attempted to address this considerable
issue by conducting a simulation study with an impressive breadth of conditions, ultimately
yielding empirically-based recommendations for values that are indicative of acceptable data-
model fit. That is, HB determined the values that maximally discerned models known to be incorrect
from those known to be correct, recommending a cut-off value for the standardized root mean
square residual (SRMR; Jöreskog & Sörbom, 1981) less than or equal to 0.08, a root mean square
error of approximation (RMSEA; Steiger & Lind, 1980) value less than or equal to 0.06, and a
comparative fit index (CFI; Bentler, 1990) value greater than or equal to 0.95, among other
indices. These recommended cut-off values have achieved near canonical status (as evidenced in
part by the work’s 35,000+ citations on Google Scholar) and nearly any researcher working with
latent variable models is likely familiar with these cut-off values.
Despite several cautions by HB themselves advising against over-generalizing their findings
to conditions and models outside of what was contained in the simulation, many applied
researchers, textbook authors, journal editors, and reviewers have endorsed the HB criteria as
applicable to latent variable models broadly (Marsh et al., 2004; Jackson, Gillaspy, & Purc-
Stephenson, 2009). For instance, in a review of reporting practices for confirmatory factor analysis
(CFA) in general applied psychology studies, Jackson et al. (2009) found that almost 60% of
studies explicitly used or referenced the HB recommendations to judge the fit of the model.
Additionally, Jackson et al. (2009) showed implicit evidence of the omnipresent nature of the cut-
offs – the average fit index values across over 350 published psychology studies were 0.060,
0.062, and 0.933 for the SRMR, RMSEA, and CFI, respectively, demonstrating that these cut-offs
have essentially become the hurdle which empirical researchers must clear in order to publish their
findings. To foreshadow the motivation for the current study, Jackson et al. (2009) stated, “we also
did not find evidence that warnings about strict adherence to Hu and Bentler’s suggestions were
being heeded” (p. 18). One such warning that is particularly important is the role of measurement
quality, which will be the focus of the remainder of this paper.
Before delving into the specifics of latent variable data-model fit indices, we will advance a
hypothetical example to set the stage for our treatment of the role of measurement quality on fit
index cut-off values. Imagine that two studies are conducted, one testing Model A and the other
testing Model B. These models have the same latent structure and number of indicator variables,
are based upon the same sample size, and the data in each case meet all standard distributional
assumptions. Using commonly reported AFIs from Jackson et al. (2009), suppose Model A’s AFI
values surpass HB’s recommendations such that the RMSEA is 0.04, SRMR is 0.04, and CFI is
0.975, whereas Model B has an RMSEA of 0.20, an SRMR of 0.14, and a CFI of 0.775. Based on
this information and knowing that the model type, model complexity, adherence to assumptions,
and sample sizes are equivalent, we presume that many researchers would instinctively assess the
fit of Model A to be superior to Model B and, if Model B were under consideration at a
prestigious journal or conference, that the theory underlying it may be subjected to a steady stream
of criticism or it may be dismissed outright based upon the seemingly egregious data-model fit.
However, in fairly routine scenarios found in psychology, there are circumstances in which
the AFI values such as those from Model B can not only be indicative of adequate data-model fit
in an absolute sense, but can better distinguish between well-fitting and poorly fitting models than
the fit index values reported for Model A, despite the apparently stable nature of AFI cut-off
values which are applied indiscriminately across studies. We ask readers to retain this motivating
hypothetical example in mind as we continue below, and we will return to this hypothetical
scenario in the concluding paragraphs of this paper after we review the literature and report results
from our illustrative simulation.
The Origin and Issues of the Hu and Bentler Cut-Offs
Although many empirical researchers treat the HB cut-offs more or less as firm rules, these
cut-offs were derived via a Monte Carlo simulation study rather than mathematically. With such
an approach, the results are necessarily constrained to the conditions featured in the simulation
study and are not broadly generalizable to all model and data types.
Although the conditions within the seminal HB study were expansive (the original paper spanned
50 journal pages upon publication), one could not realistically expect HB to cover all possible
conditions that may arise in empirical studies. As such, several studies have spoken to some of the
problems of over-generalizing the HB recommendations and also to possible shortcomings in the
HB simulation design. These criticisms range from how realistic HB’s induced misspecifications
were (Marsh et al., 2004) to possible confounding with model type that could have differentially
affected the performance of different AFIs (Fan & Sivo, 2005).
Another major criticism that will be the focus of this study is the quality of measurement.
Although HB manipulated many conditions when deriving their cut-off values, the strength of the
standardized factor loadings was kept constant throughout their entire study. That is, HB tested
many different sample sizes, degrees of deviation from normality, and model types; however, all
of these conditions were tested with factor loadings that were always near 0.70 (most loadings
were 0.70 in their models and a few were either 0.75 or 0.80, but the loadings were not
systematically manipulated in the study). Recalling that Monte Carlo derived values are only
applicable to the conditions included in the study, the absence of multiple measurement quality conditions
undoubtedly limits the generalizability of the HB cut-offs because measurement quality is quite
variable not only from discipline to discipline, but also from study to study within a single
discipline.¹

¹ In the methodological literature, measurement quality does not necessarily have a strict definition and can be used
to refer to the validity, reliability, or generalizability of a particular scale. In this paper, we use “measurement
quality” to refer to the strength of the standardized factor loadings, which is highly related to reliability.

Studies with standardized loadings that exceed 0.70 are commonly (but not
exclusively) found when measuring more concrete constructs such as cognitive abilities,
reasoning abilities, or attitudes (for recent empirical examples in the Journal of Personality
Assessment, see e.g., Bardeen, Fergus, Hannan, & Orcutt, 2016; Joshanloo, 2015; Rice,
Richardson, & Tueller, 2014; You, Leung, Lai, & Fu, 2013). Studies aiming to capture less well-
defined constructs such as creativity, risky or substance use behaviors, or abilities in very young
children tend to (but do not exclusively) have standardized loadings below 0.70 (for recent
empirical examples in the Journal of Personality Assessment, see e.g., Allan, Lonigan, & Phillips,
2015; Demianczyk, Jenkins, Henson, & Conner, 2014; Fergus, Valentiner, McGrath, Gier-
Lonsway, & Kim, 2012; Ion et al., 2016; Michel, Pace, Edun, Sawhney, & Thomas, 2014). In
practice, researchers can reasonably expect to see standardized factor loadings with magnitudes
between about 0.40 and 0.90 both in their own work and when reading the work of others. Despite
the considerable difference between a variable that loads at 0.40 on a latent variable and one that
loads at 0.90, the original HB study did not differentiate between these situations.
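The practical meaning of this range can be made concrete through composite reliability. The short sketch below (Python; the loading values are illustrative, not taken from any cited study) computes coefficient omega for a five-indicator congeneric scale and shows that loadings of 0.40 versus 0.90 correspond to scale reliabilities of roughly 0.49 versus 0.96:

```python
def omega(loadings):
    """Composite reliability (coefficient omega) for a one-factor
    scale with standardized loadings."""
    s = sum(loadings)
    resid = sum(1 - l**2 for l in loadings)  # standardized residual variances
    return s**2 / (s**2 + resid)

# Five indicators per factor, as in the Hu & Bentler "simple" model
print(round(omega([0.4] * 5), 3))  # low measurement quality
print(round(omega([0.9] * 5), 3))  # high measurement quality
```

Under equal standardized loadings and residuals, omega coincides with coefficient alpha; with unequal (congeneric) loadings the two diverge.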
Issues of Measurement Quality and the Reliability Paradox
The lack of multiple measurement quality conditions in HB has been questioned over the
last decade, with recent studies noting how such an oversight can greatly limit the broad
validity of fit index cut-offs in latent variable models (Beauducel & Wittman, 2005; Cole &
Preacher, 2014; Hancock & Mueller, 2011; Heene, Hilbert, Draxler, Ziegler, & Bühner, 2011;
Kang, McNeish, & Hancock, 2016; Miles & Shevlin, 2007; Saris, Satorra, & van der Veld, 2009;
Savalei, 2012; Steiger, 2000). While some studies have merely noted the issues with such an
omission, other studies have gone so far as to mathematically prove that measurement quality
directly affects the values of AFIs. We will review the findings from each of these studies next.
For a given set of misspecifications in a latent variable model, holding all else equal,
models with poor measurement quality appear to fit much better than models with excellent
measurement quality. This phenomenon was first noted in a study investigating properties of
RMSEA by Saris and Satorra (1992) but has been developed further over the last few years.
Hancock and Mueller (2011) coined the phrase reliability paradox to describe this relationship.
The paradoxical nature of the phenomenon is evoked by the fact that researchers often strive for
the highest measurement quality possible for their latent variables, but, once obtained, AFIs will
be far worse than if measurement quality were much poorer. Using a population study, Hancock
and Mueller (2011) systematically showed how, with one hypothetical model, evaluations of data-
model fit slowly deteriorate as a function of measurement quality, even when all other model and
design factors are held constant. In their study, they kept the degree of misspecification, sample
size, and the model identical and only changed the magnitude of the standardized factor loadings
from 0.40 to 0.95. For example, in their hypothetical model, the RMSEA value with standardized
loadings of 0.40 was 0.00 while the RMSEA value with standardized loadings of 0.95 was 0.10.
Hancock and Mueller (2011) further showed that standard error estimates of structural parameters
are much larger with poorer measurement quality, and that Lagrange multiplier test statistics
(more commonly known as modification indices) similarly lose their effectiveness for identifying
paths that should be introduced into the model to improve its fit when measurement quality is
poorer. Hancock and Mueller (2011) concluded that the nature of the AFI cut-offs is in direct
contrast to best data analytic practice – poor measurement quality is rewarded while good
measurement quality is punished.
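The pattern Hancock and Mueller (2011) describe is easy to reproduce at the population level. The sketch below (Python with NumPy) uses our own simplified setup rather than their exact model: three standardized factors with five equally loading indicators each and factor correlations of 0.5 (an assumption for illustration), where the misspecified model omits the covariance between the first two factors. Instead of re-estimating the misspecified model, we plug the true values of the remaining parameters into its implied covariance matrix; this shortcut slightly overstates the population discrepancy F0 relative to a genuine ML fit but preserves the qualitative pattern. Population RMSEA is then sqrt(F0/df):

```python
import numpy as np

def implied_cov(lam, phi12, phi=0.5, k=5, m=3):
    """Model-implied covariance: m factors, k standardized indicators each."""
    Lam = np.kron(np.eye(m), np.full((k, 1), lam))
    Phi = np.full((m, m), phi)
    np.fill_diagonal(Phi, 1.0)
    Phi[0, 1] = Phi[1, 0] = phi12
    Theta = np.eye(k * m) * (1.0 - lam**2)
    return Lam @ Phi @ Lam.T + Theta

def f0(S_true, S_model):
    """Population ML discrepancy between true and modeled covariance matrices."""
    p = S_true.shape[0]
    _, ld_m = np.linalg.slogdet(S_model)
    _, ld_t = np.linalg.slogdet(S_true)
    return ld_m - ld_t + np.trace(S_true @ np.linalg.inv(S_model)) - p

df = 88  # 120 covariance moments minus 32 free parameters in this sketch
for lam in (0.4, 0.7, 0.9):
    F0 = f0(implied_cov(lam, phi12=0.5), implied_cov(lam, phi12=0.0))
    print(lam, round(np.sqrt(F0 / df), 3))  # population RMSEA grows with loadings
```

Holding the misspecification fixed, the discrepancy rises steeply with the loadings: the same omitted covariance is simply easier to detect when residual noise is small.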
Heene et al. (2011) had a similar motivation to Hancock and Mueller (2011) but conducted
their study via a simulation and mathematically derived the direct effect of measurement quality
on AFIs. Heene et al. (2011) used HB to inspire their simulation conditions, although they did
make several alterations, including (1) using different magnitudes of factor loadings for each item
on each factor (i.e., a congeneric scale) rather than having the loadings be equal (i.e., a tau-
equivalent scale), (2) changing the number of manifest variables per factor to 15 and 45, which
more closely represents a scale or instrument than the much smaller factors used by HB, and
(3) using high (factor loadings near 0.80), medium (factor loadings near 0.60), and low (factor
loadings near 0.40) factor reliability
conditions. Similar to Hancock and Mueller (2011), results showed that RMSEA, SRMR, CFI,
and TML values in misspecified models were seen as fitting the data well under the low factor
reliability condition whereas the high factor reliability condition showed RMSEA, SRMR, CFI,
and TML values that would routinely call for rejecting the model. When the model was perfectly
specified (i.e., the model used to generate data and the model used to fit the data were identical),
factor reliability was inconsequential and the RMSEA, SRMR, CFI, and TML values were indicative of
good fit regardless of the factor loading condition (however, measurement quality was only
inconsequential if the model were perfectly specified which is highly unlikely in empirical
studies). Noting that simulations are not broadly generalizable, Heene et al. (2011) went on to
explain via a mathematical derivation why the reliability paradox exists. Although the proof is
somewhat in-depth, the rationale stems from the fact that the eigenvalues of the model-implied
covariance matrix are bounded below by the residual variances which are a function of the factor
loadings – larger factor loadings lead to lower residual variances. That is, if the factor loadings are
high, the latent variable explains a larger amount of variance in the manifest variable and the
associated residual variance is low as a result. As the factor loadings decrease (and residual variances
increase), the lower bound of the eigenvalues increases, which means that TML and AFI values
decrease as well (provided that the model is not perfectly specified and has at least some trivial
misspecification), making models appear to fit relatively better if cut-off criteria are held constant.
The lower bound of the model fit criteria changes as a function of measurement quality but the
cut-off for good fit is constant, meaning that the relative distance between the lower bound and the
cut-off is not constant and that studies with poor measurement quality therefore have an easier
path to reaching a conclusion of acceptable data-model fit.
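The eigenvalue bound itself is easy to verify numerically. In the small sketch below (Python with NumPy; a single-factor, five-indicator case chosen for simplicity), the implied matrix is Σ = ΛΛ′ + θI with θ = 1 − λ², and its smallest eigenvalue equals θ exactly, so lower loadings raise the floor of the spectrum:

```python
import numpy as np

for lam in (0.4, 0.9):
    Lam = np.full((5, 1), lam)                 # one factor, five indicators
    theta = 1.0 - lam**2                       # standardized residual variance
    Sigma = Lam @ Lam.T + np.eye(5) * theta    # implied covariance matrix
    eig_min = np.linalg.eigvalsh(Sigma).min()
    print(lam, theta, round(eig_min, 2))       # smallest eigenvalue equals theta
```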
Miles and Shevlin (2007) conducted an illustrative population analysis on incremental fit
indices to show how they are minimally affected by model conditions, including the magnitude of
the factor loadings. Incremental fit indices are those that compare the improvement in fit of the
model of interest to some baseline model (usually an independence model that simply models the
variance and mean for each individual manifest variable but does not allow the manifest variables
to covary). Miles and Shevlin (2007) compared three models: one with perfectly reliable manifest
indicator variables, one with manifest indicators with 0.80 reliability, and one with manifest
indicator variables with 0.50 reliability. Results showed that CFI, the Tucker-Lewis Index, and the
incremental fit index were able to demonstrate good data-model fit with a trivially misspecified
model (i.e., one that should fail to be rejected for practical purposes) with high reliabilities (i.e.,
factor loadings with strong magnitudes) despite the fact that both RMSEA and TML would
soundly reject the model. On these grounds, Miles and Shevlin (2007) advocate for wider use of
incremental fit indices, which are less affected by the reliability paradox because the effect
of measurement quality is partially included in the baseline model as well as the model of interest.
Incremental indices are still not immune, however, and Miles and Shevlin (2007) list their primary
conclusion, somewhat tongue-in-cheek, as “If you wish your model to fit, … ensure that your
measures are unreliable” (p. 874).
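A rough population sketch shows this buffering mechanism: the misfit of the independence baseline also grows with the loadings, so the ratio that drives incremental indices moves less than the raw discrepancy. The code below (Python with NumPy) uses our simplified three-factor layout with assumed factor correlations of 0.5, and approximates a population CFI as 1 − F0,model/F0,baseline; it is an illustration, not Miles and Shevlin's exact analysis:

```python
import numpy as np

def implied_cov(lam, phi12, phi=0.5, k=5, m=3):
    """Three correlated factors, five standardized indicators each."""
    Lam = np.kron(np.eye(m), np.full((k, 1), lam))
    Phi = np.full((m, m), phi)
    np.fill_diagonal(Phi, 1.0)
    Phi[0, 1] = Phi[1, 0] = phi12
    return Lam @ Phi @ Lam.T + np.eye(k * m) * (1.0 - lam**2)

def f0(S_true, S_model):
    """Population ML discrepancy."""
    p = S_true.shape[0]
    _, ld_m = np.linalg.slogdet(S_model)
    _, ld_t = np.linalg.slogdet(S_true)
    return ld_m - ld_t + np.trace(S_true @ np.linalg.inv(S_model)) - p

for lam in (0.4, 0.9):
    S = implied_cov(lam, phi12=0.5)
    F_model = f0(S, implied_cov(lam, phi12=0.0))  # omits one factor covariance
    F_base = f0(S, np.eye(15))                    # independence baseline
    print(lam, round(1.0 - F_model / F_base, 3))  # population CFI analogue
```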
Saris et al. (2009) conducted simulations and provided population analyses to explore
whether TML and AFIs are actually detecting misspecifications in the model or whether they are
instead more sensitive to attributes of the model that are unrelated to misspecifications. Their
analyses showed that global fit statistics were just as related to incidental conditions
such as sample size and measurement quality as they were to actual misspecifications. Related to
measurement quality, their results showed that, for a constant degree of misspecification, as
standardized factor loadings were increased from 0.70 (as in HB) to 0.90, the RMSEA went from
demonstrating great fit (0.00) to very poor fit (0.14), crossing the HB cut-off value at a
standardized loading value of 0.85 (see Table 3 in their article for complete results). Saris et al.
(2009) make a case for abandoning global fit measures (i.e., TML and AFIs) and replacing
assessment of latent variable models with single-parameter tests such as expected parameter
change or modification indices. Saris et al. (2009) also note that all possible misspecifications are
not theoretically equivalent and that the goal of data-model fit assessment should not be strictly
concerned with detecting any type of misspecification (which is the case when testing the fit of the
model globally) but rather should focus on detecting theoretically relevant misspecifications.
Kang et al. (2016) extended Hancock and Mueller (2011) to the context of multiple-group
latent variable models where a primary interest is determining whether parameter estimates are
invariant across groups (e.g., whether items function similarly across different groups of people).
These invariance tests typically feature fit index difference tests (e.g., ΔTML, ΔCFI; see Cheung
& Rensvold, 2002 for additional details) so the goal of the study was to examine whether
measurement quality similarly affects differences in fit indices or whether the reliability paradox is
confined to AFIs in their raw form. Kang et al. (2016) found that, for the purpose of testing either
measurement invariance (similarity of loadings across groups) or structural invariance (similarity
of structural paths across groups), only ΔMcDonald’s Non-Centrality Index was reasonably
unaffected by measurement quality and only for measurement invariance. For all other conditions,
as the measurement quality increased, the indices were much more likely to find non-invariance
while conditions with poor measurement quality (i.e., loadings around 0.40) concluded that invariance was present.
Bridging the Gap to Empirical Researchers
These methodological studies have noted the existence of this phenomenon but, due to a
strong methodological focus, previous studies emphasize why it occurs rather than how it affects
cut-off values that a majority of researchers are using (or at least referencing as guidelines and/or
being subjected to through peer review). That is, the tangible implications of broadly applying the
HB cut-offs have yet to be demonstrated to a primarily non-statistical audience, despite the fact that
non-statisticians and their substantive theories are the most widely affected.
The goal of this paper is not to extend the methodological conclusions of the so-called
reliability paradox to novel situations or to provide additional insight about the mechanism by
which it functions. As discussed in the preceding section, there are already several rigorous
methodological studies that achieve this goal. More importantly, despite the wide-ranging
implications of these methodological studies and the potentially serious consequences for the
evaluation of applied research, these findings are largely confined to the pages of technical
journals and are examined from a more theoretical perspective. Instead, our goal is to elucidate
these findings to as broad an audience as possible by stripping the technical language and detail to
demonstrate the magnitude of the practical implications as plainly as possible. Thus, this paper is
not attempting to pass these ideas off as original but rather to illuminate highly relevant
methodological considerations that have yet to find their way into discussions of empirical studies.
To accomplish this goal, we next provide an illustrative simulation to show (1) how the behavior
of the fit index cut-offs varies as a function of measurement quality and (2) how different the
widely used cut-off values would be had HB employed even a slightly different measurement
quality condition in their study. We will then discuss the effect this has when interpreting models
in empirical studies, ways that researchers can report AFIs to acknowledge this issue, and how it
may affect what journal editors and reviewers deem worthy of publication in top-tier outlets.
Illustrative Simulation Design
Although example analyses from real datasets are often the preferred method to demonstrate
methodological issues to non-statisticians, the nature of the problem at hand does not lend itself
to being examined in such a manner. That is, to fully grasp the severity of the issue, all components of
the data and associated model (sample size, number of latent variables, number of indicators per
latent variable, severity of misspecification) must be held constant with the exception of the
magnitude of the standardized factor loadings. To avoid possible confounds, the extent of the
misspecifications that are present in the model must be known to ensure that models only differ in
the standardized factor loadings, which cannot be discerned with real data. Therefore, we will
generate our own data that satisfy these requirements with a small illustrative simulation. We
realize that not all readers may be familiar with interpreting simulation studies, so we will provide
guidance throughout this section to facilitate proper interpretation of this demonstration.
In order to elucidate the effect of measurement quality on AFI cut-offs, we begin with the
original model used in HB. To briefly overview HB’s original conditions, their “simple” true
model was a CFA model with three covarying exogenous latent variables, each with five manifest
indicators that had factor loadings mostly equal to 0.70 with a few loadings of 0.75 or 0.80. The
path diagram of the data generation model is presented in Figure 1. The degrees of
misspecification included a “minor” condition such that one factor covariance path was omitted
and a “severe” condition which omitted two factor covariance paths from the model. Samples of
different sizes (𝑁=150, 250, 500, 1000, 2500, 5000) were drawn from seven conditions that
differed in terms of normality and independence. We only replicate the simple model from HB
under multivariate normality and only for HB’s “minor misspecification” condition.
In each cell of the simulation design, 1000 datasets were generated according to the HB
simple model as presented in Figure 1. We then fit to each dataset the HB model containing a “minor”
misspecification which purposefully omits the factor covariance path between Factor 1 and Factor
2. The model is therefore not correct and the TML statistic and AFI values should detect that this
model is misspecified. Addressing the primary interest of the paper – to investigate the effect of
varying degrees of measurement quality on AFI cut-offs – we generated data with factor loadings
that were equivalent across all 15 indicator variables. Population values for the standardized factor
loadings were manipulated to range from 0.40 to 0.90, in 0.10 increments (unlike HB’s
standardized loading conditions which were constrained near 0.70). Data were generated and
modeled within PROC CALIS in SAS 9.3. During the process, we tracked RMSEA, SRMR, CFI,
and TML because, as mentioned previously, these indices are widely reported in empirical studies
(Jackson et al., 2009).
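For readers without SAS, the data-generation step can be mirrored in a few lines. The sketch below (Python with NumPy) draws one N = 500 sample from a standardized version of the Figure 1 model with all loadings at 0.70; the factor correlations of 0.5 are an assumption for illustration rather than HB's exact values:

```python
import numpy as np

rng = np.random.default_rng(2024)
lam, n = 0.7, 500
Lam = np.kron(np.eye(3), np.full((5, 1), lam))          # 3 factors, 15 indicators
Phi = np.full((3, 3), 0.5)                              # factor correlations (assumed)
np.fill_diagonal(Phi, 1.0)
Sigma = Lam @ Phi @ Lam.T + np.eye(15) * (1 - lam**2)   # population covariance
data = rng.multivariate_normal(np.zeros(15), Sigma, size=n)
print(data.shape)
```

The misspecified model (omitting the Factor 1 with Factor 2 covariance) would then be fit to each such dataset with any CFA routine.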
FIGURE 1 ABOUT HERE
For readers who are not familiar with simulation studies, this paragraph will conceptually
describe the logic of the simulation. Readers familiar with interpreting simulation results can skip
the remainder of this section without loss of continuity. The advantage of the simulation design is
that we generated the data according to a certain model, so we are able to control the magnitude of
model misspecification, the magnitude of the standardized factor loadings, and we can determine
if a model fit to the generated data is correct. This luxury is not available when using real data
because one cannot be certain of the “correct” model for the data (i.e., whether the model-implied
covariance matrix perfectly reproduces the observed covariance matrix) or be certain of the level
of misspecification. Furthermore, the measurement quality in real data cannot be manipulated. We
start by generating data from the model in Figure 1 that has standardized factor loadings equal to
0.40. From these data, we fit a misspecified model that should fit somewhat poorly. We then
record the model fit criteria for the model. This process is repeated with 1000 unique generated
datasets so that we have adequate information to inspect the distribution of the model fit criteria
(we repeat the process instead of generating a single dataset to avoid succumbing to any
idiosyncratic nuances that could occur based on random chance). We then repeat the process with
data generated from standardized factor loadings equal to 0.50, 0.60, and so forth up to 0.90 (with
1000 different generated datasets for each standardized loading value). We then compare the
distributions for each of the standardized factor loading conditions to show how the values of the
fit indices values change as the values of the standardized factor loadings change even though we
know that the degree of misspecification is exactly the same across the entire simulation (because
we have control over how the data are created).
For each of the models, we calculate the percentage of the replications in which the SRMR,
RMSEA, and CFI values for the fitted model exceed a particular cut-off (i.e., a conclusion of poor
fit) for each standardized loading magnitude condition, similar to HB. For each index, we explore
the percentage of models that would be declared poorly fitting based on the HB cut-off
recommendations – 0.06 for RMSEA, 0.08 for SRMR, and 0.95 for CFI. Additionally, we
investigate how many models would be declared poorly fitting if values of each index
conventionally thought to indicate unambiguously good fit were used as the cut-offs (0.04 for
RMSEA, 0.04 for SRMR, and 0.97 for CFI) and if values conventionally thought to indicate
unambiguously poor data-model fit were used (0.20 for RMSEA, 0.14 for SRMR, and 0.775 for CFI). For
researchers who adhere to the principle that TML is the only philosophically defensible assessment
of data-model fit, we also tracked the number of replications in which TML would reject the null at
the 0.05 and 0.01 levels of significance. We only report results for N = 500 in the interest of
succinct presentation, although similar patterns of results hold for other sample sizes as well.
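Readers who want to preview this logic without an SEM fitter can approximate the sampling behavior of the fit statistic directly: under standard asymptotics, TML for a misspecified model follows a noncentral chi-square distribution with noncentrality (N − 1)·F0. The sketch below (Python with NumPy) uses our simplified layout with assumed factor correlations of 0.5 and df = 88, and plugs true parameter values into the misspecified model to get F0, so the exact percentages will not match Table 1; it draws 1,000 such statistics per loading condition and computes each replication's sample RMSEA:

```python
import numpy as np

def implied_cov(lam, phi12, phi=0.5, k=5, m=3):
    """Three correlated factors, five standardized indicators each."""
    Lam = np.kron(np.eye(m), np.full((k, 1), lam))
    Phi = np.full((m, m), phi)
    np.fill_diagonal(Phi, 1.0)
    Phi[0, 1] = Phi[1, 0] = phi12
    return Lam @ Phi @ Lam.T + np.eye(k * m) * (1.0 - lam**2)

def f0(S_true, S_model):
    """Population ML discrepancy."""
    p = S_true.shape[0]
    _, ld_m = np.linalg.slogdet(S_model)
    _, ld_t = np.linalg.slogdet(S_true)
    return ld_m - ld_t + np.trace(S_true @ np.linalg.inv(S_model)) - p

rng = np.random.default_rng(7)
n, df, reps = 500, 88, 1000
for lam in (0.4, 0.7, 0.9):
    F0 = f0(implied_cov(lam, 0.5), implied_cov(lam, 0.0))
    T = rng.noncentral_chisquare(df, (n - 1) * F0, size=reps)    # simulated TML
    rmsea = np.sqrt(np.maximum(T - df, 0) / (df * (n - 1)))      # sample RMSEA
    print(lam, f"{np.mean(rmsea > 0.06):.1%} flagged by the 0.06 cut-off")
```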
Table 1 presents the percentage of models declared poorly fitting for the misspecified model.
In Table 1, each row represents a different standardized loading condition. Each column represents
the percentage of the time that the particular fit criterion indicated that the model did not fit the
data well. In Table 1, the values are expected to be rather high, as the misspecification should be
detected a large portion of the time. The 0.70 loading row is bolded to denote the condition that
mirrors conditions used in HB.
TABLE 1 ABOUT HERE
From Table 1, it can be seen that AFI values were above the “unambiguously good” cut-off
(the leftmost column for each index) 100% of the time when the standardized loadings were 0.70
(values closer to 100% mean that AFIs identify that the model did not fit well). As an example, the
100% value in the 0.04 column and 0.70 row for RMSEA means that 100% of the models fit to the
simulated data returned RMSEA values greater than 0.04. TML also rejected 100% of models at
both the 0.05 and 0.01 levels when the loadings were 0.70. Using the HB recommendations (the
middle column for each index), the misspecified model always had AFI values beyond the cut-off
for RMSEA and CFI while SRMR was beyond the cut-off about half the time (SRMR is lower
because it tends to be less sensitive to the type of misspecification used and the misspecification
was not severe; Fan & Sivo, 2007). This confirms that the cut-offs recommended by HB indeed
perform well when the standardized loadings are 0.70, as was demonstrated in their study.
Essentially none of the fitted models had an RMSEA above 0.20, an SRMR above 0.14, or a CFI
below 0.775 when the loadings were 0.70 meaning that these values indicate excessively poor fit
as expected because, although the models are noticeably misspecified, the misspecification is
moderate and AFIs therefore do not reach such seemingly extreme values. Despite the fact that
these general guidelines have been ported to SEM broadly, this interpretation of fit index values is
only applicable when the standardized loadings are 0.70.
Consider the exact same model, featuring the exact same misspecification, with the exact
same sample size but now the standardized loadings are 0.40 instead of 0.70. Now, close to 95%
of the replications have an RMSEA less than 0.04 and all the models have an RMSEA below 0.06
which indicates good fit based on HB cut-offs. Consider what this means – recall that with
standardized loadings equal to 0.70, none of the 1000 fitted models have an RMSEA value above
0.20. This indicates that if a model yields an RMSEA of 0.20 in practice, then this would be
indicative of excessively poor fit because 0 of the 1000 misspecified models output an RMSEA
value that high. With loadings of 0.40, none of the models have an RMSEA above 0.06. By
similar logic, this means 0.06 indicates poor fit with lower standardized loadings; however, for
researchers who routinely rely on HB cut-offs, an RMSEA of 0.06 would indicate good fit and
increase the probability that the theory under investigation would gain traction in the literature.
This phenomenon is not restricted to RMSEA – based on SRMR, a quarter of the
replications would have values below 0.04 and all replications are below the 0.08 HB cut-off when
the standardized loadings are equal to 0.40. As noted by Miles and Shevlin (2007), CFI is the least
susceptible but about 15% of models are above 0.95 and about 5% are above 0.975. Instead of all
replications being rejected by TML, only 56% were rejected at the 0.01 level of significance and
77% were rejected at the 0.05 level of significance, a marked decrease in power compared to the
scenario where the loadings are 0.70, as noted in the derivation by Heene et al. (2011).
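To make the link between TML and RMSEA concrete, recall that the RMSEA point estimate is a simple rescaling of the chi-square statistic by the degrees of freedom and sample size. A sketch of the standard formula (the numbers in the example are purely illustrative, not simulation output):

```python
import math

def rmsea(chi2: float, df: int, n: int) -> float:
    """RMSEA point estimate: sqrt(max(chi2 - df, 0) / (df * (n - 1))).
    Note that some software divides by n rather than n - 1."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# The same chi-square maps to different RMSEA values as n changes,
# and any chi-square at or below df yields an RMSEA of exactly zero.
print(round(rmsea(chi2=250.0, df=87, n=500), 3))
print(rmsea(chi2=80.0, df=87, n=500))
```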
Now consider the opposite extreme in Table 1 where the standardized loadings are 0.90
instead of 0.70. With excellent measurement quality, essentially every replication has an
RMSEA value above 0.20, an SRMR value above 0.14, and a CFI value below 0.775 while TML
again is able to correctly reject the model as misfitting. Using RMSEA as an example, this means
that when the loadings are 0.90 that an RMSEA cut-off of 0.20 can distinguish good fitting models
from poorly fitting (but only moderately misspecified) models just as well as 0.06 when the
loadings are equal to 0.70. If using the HB cut-offs in practice however, a model with an RMSEA
of 0.20 would never be considered anywhere close to fitting well.
To depict these results visually, Figure 2 shows the empirical distribution of values for
RMSEA with loading conditions of 0.40, 0.70, and 0.90. The difference between the distributions
is stark as there is no overlap whatsoever. The 0.70 loading distribution is slightly above the HB
cut-off which makes sense because the misspecification was non-trivial but moderate in
magnitude. On the other hand, the 0.40 loading distribution is completely below the HB cutoff
while the 0.90 loading distribution is almost entirely above an RMSEA of 0.20. Needless to say,
RMSEA functions very differently depending on the measurement quality and the exact same
degree of misspecification can be viewed in a very different light as a result. Similar patterns are
also present in the empirical distribution plots for SRMR and CFI, which are shown in Figure 3.
The SRMR distributions in the left panel display similar behavior to the RMSEA and the CFI
values are still affected, but to a lesser degree as there is some overlap across the factor loadings
conditions (as has been noted in Miles & Shevlin, 2007).
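CFI's relative robustness is partly explained by its construction: it scales the model's misfit against the misfit of the baseline (null) model, and lower measurement quality tends to shrink both quantities. A sketch of the standard formula (the example chi-square values are hypothetical, not drawn from the simulation):

```python
def cfi(chi2_m: float, df_m: int, chi2_b: float, df_b: int) -> float:
    """Comparative fit index from model (m) and baseline (b) chi-squares:
    1 - max(chi2_m - df_m, 0) / max(chi2_b - df_b, chi2_m - df_m, 0)."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m)
    return 1.0 if d_b == 0.0 else 1.0 - d_m / d_b

# With strong loadings the baseline misfit is enormous, so even
# noticeable model misfit leaves CFI fairly high:
print(round(cfi(chi2_m=250.0, df_m=87, chi2_b=5000.0, df_b=105), 3))
# With weak loadings, both model and baseline misfit shrink:
print(round(cfi(chi2_m=130.0, df_m=87, chi2_b=600.0, df_b=105), 3))
```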
INSERT FIGURE 2 ABOUT HERE
INSERT FIGURE 3 ABOUT HERE
Consider the ramifications of these results – if one has excellent measurement quality, the
HB cut-offs of 0.06/0.08/0.95 for RMSEA/SRMR/CFI could be discarded in favor of values of
0.20/0.14/0.775 and one would still be able to identify nearly all moderately misspecified models – the
classification accuracy remains unblemished. Therefore, based on this model, if the standardized
loadings were 0.90 and one obtained AFI values of 0.12/0.12/0.85, a researcher could be confident
that the model fits approximately, possibly containing only trivial misspecifications of the same
magnitude as a 0.06/0.08/0.95 trio of fit indices with standardized loadings equal to 0.70 but no
moderate or severe misspecifications, even though these values are conventionally thought to
indicate unambiguously poor fit. As a rhetorical argument, we challenge readers to consider the last
time they saw a study confidently report 0.12/0.12/0.85 for RMSEA/SRMR/CFI in a positive light
in a top-tier outlet. However, based on the logic of AFI cut-offs, these values are just as good at
discriminating between good and bad models with loadings equal to 0.90 as HB cut-offs are for
models with loadings equal to 0.70.
More problematically, consider the case of poorer measurement quality – even a
0.04/0.04/0.975 trio of RMSEA/SRMR/CFI values does not guarantee much in terms of the model
fitting well despite the fact that most researchers would be pleased to achieve these AFI values for
their model and studies with less than enviable measurement quality are routinely published with
less reassuring fit values. Furthermore, TML has far less power to detect misspecifications under
such circumstances. The ramifications of this result are that many models that appear to fit well
based on the HB cut-offs are based on theories that, in actuality, may not be well supported by the
data if more nuanced assessments of data-model fit were employed or if these studies featured
more rigorous measurement models. Conversely, many models may appear to fit poorly and be
disregarded but may actually fit well if the quality of measurement is strong.
The Issue for Empirical Researchers
In essence, even though researchers strive to measure their latent variables with the highest
quality manifest variables, relying on strict AFI cut-offs to judge data-model fit ends up punishing
this diligence and rewarding studies whose models feature much poorer measurement quality
when a single cut-off is applied broadly. Stated more drastically, the meaning of good data-model
fit changes as a function of measurement quality even if all other conditions are held constant – for
example, an RMSEA value of 0.06 can be considered poor fit with low measurement quality,
adequate fit with loadings near 0.70 as in HB, or great fit with high measurement quality.
Moreover, although the criterion for determining good or bad fit with TML is unaltered under
different loading conditions (i.e., the interpretation of inferential tests is consistent), TML is far less
powerful with poorer measurement quality. Somewhat ironically, measurement quality with AFIs
is rather analogous to sample size with TML – with sample size being the primary issue AFIs were
designed to mitigate.
With TML, a more sound methodological design (i.e., large sample size) makes good data-
model fit more difficult to obtain while good fit can be achieved more readily under the less
desirable condition of a smaller sample size, holding all else constant. Similarly with AFIs, good
data-model fit is difficult to achieve with more sound methodological designs (i.e., better
measurement quality) but less desirable design conditions (i.e., poorer measurement quality) will
result in better fit if all else is constant. At a basic level, using AFIs instead of TML effectively
trades problems associated with sample size for problems associated with measurement quality.
Empirical researchers are aware of the perils of sample size with TML, yet most are unacquainted
with the issues associated with measurement quality with AFIs.
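The sample-size dependence of TML can be quantified directly with the noncentral chi-square approach of Saris and Satorra (1992). The sketch below assumes a hypothetical population discrepancy-function value F0 and degrees of freedom of 87, roughly those of a 15-indicator, three-factor CFA; the specific numbers are illustrative:

```python
from scipy.stats import chi2, ncx2

def tml_power(f0: float, df: int, n: int, alpha: float = 0.05) -> float:
    """Approximate power of the likelihood-ratio test of exact fit when
    the population discrepancy is f0 (noncentrality = (n - 1) * f0)."""
    crit = chi2.ppf(1 - alpha, df)  # rejection threshold under the null
    return float(1 - ncx2.cdf(crit, df, (n - 1) * f0))

# Holding the misspecification fixed, power climbs steeply with n:
for n in (100, 500, 5000):
    print(n, round(tml_power(f0=0.05, df=87, n=n), 3))
```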
To highlight the broader issue with an analogy, imagine a researcher fits a model to a sample
size of 25 (casting issues related to estimation difficulties aside) and reports a non-significant TML
value, concluding that the model fits well and that the theory is upheld. Reviewers and critical
readers would immediately cast doubt about this conclusion and would instinctively note that the
small sample size renders the inferential TML essentially powerless to detect violations unless they
are massive. Conversely, if a researcher presented a model with a sample of 25,000 and a
significant TML test, many reviewers and critical readers would note that the model may still fit
reasonably well and that the TML test is vastly overpowered in this context and may be detecting
trivial discrepancies between the model and the data.
This exact same scenario exists with measurement quality although it has largely gone
unnoticed in empirical literatures up to this point. If the standardized loadings are low, AFIs may
appear to indicate great fit, but readers should question whether the model actually fits or whether
the model is simply too underpowered to detect meaningful misfit. By a similar token, if
measurement quality is high, seemingly poor AFIs may be attributable to either (1) poor fit or (2)
a model that is overpowered and is detecting trivial discrepancies between the model and the data.
Although it is the current state of affairs, examining AFIs without taking measurement quality into
account is as egregious as interpreting p-values without taking sample size into account – a TML p-
value means very different things with 100 versus 10,000 people just as an RMSEA of 0.06 means
something very different with high or low measurement quality. Conclusions from studies with
small samples are questioned (as they should be) but studies with poor measurement quality rarely
receive the same type of treatment with respect to data-model fit despite the fact that AFIs with
poor measurement quality are similarly underpowered as TML with small samples (and
overpowered for large samples and good measurement quality). More bluntly, the exercise of
comparing AFIs to a single, predetermined cut-off is akin to interpreting p-values as if every
dataset had the exact same sample size.
Conclusions and Recommendations for Practice
It is increasingly clear that no single cut-off value for any particular AFI can be broadly
applied across latent variable models. At this point, readers may hope to see revised
recommendations for adjudicating fit across a broader set of circumstances (measurement quality
in particular); although this is a logical next step, we refrain from doing so in an attempt to discourage
the overgeneralizations that have run rampant in assessments of data-model fit. Even if updated
recommendations were provided to account for varying levels of measurement quality, these
recommendations would be just as susceptible as the original recommendations to factors that
obscure AFI interpretation and comparability such as model complexity, the number of indicators
per factor, and sample size.
Although the atmosphere appears to be changing as methodological research continues to
expose weaknesses with popular recommendations, current practice largely still operates under the
assumption that there is a single cut-off that can be used for each fit-index, which is akin to
recommending a single sample size to achieve adequate power across all statistical models.
Imagine a scenario in which all studies with, say, 200 or fewer observations were considered
to be underpowered. Yet, if one were researching a phenomenon with a large effect size, a sample
size of, say, 50 might be more than sufficient to detect true differences but, because the sample
size was below 200, the study might be poorly received in this hypothetical universe. On the other
hand, if the phenomenon of interest had a small, but non-zero, effect size, a sample size potentially
much larger than 200 would be needed to detect that a difference exists; however, many studies
would conclude that there are no significant differences if researchers repeatedly tested samples of
200 in accordance with the recommendation. By comparison, if one is researching a construct for
which very high measurement quality can be obtained, one need not subscribe to such stringent
AFI criteria to be confident that the model is free of non-trivial misspecifications. Conversely,
if one is researching a construct that cannot be measured very reliably, the currently employed cut-
offs are not suitable and would be very likely to overlook potentially meaningful misspecifications
in the model.
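The hypothetical above is easy to quantify with a standard power computation for a two-group mean comparison. The sketch below uses the usual normal approximation (with alpha fixed at 0.05); the effect sizes and per-group sample sizes are the illustrative ones, not values from any real study:

```python
import math

def phi(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_power(d: float, n_per_group: int) -> float:
    """Normal-approximation power of a two-sided, two-sample comparison
    of means with standardized effect size d at alpha = 0.05."""
    z_crit = 1.959964  # Phi^{-1}(0.975)
    return phi(d * (n_per_group / 2) ** 0.5 - z_crit)

# A large effect is well powered at n = 25 per group, while a small
# effect remains badly underpowered even at n = 100 per group:
print(round(approx_power(d=0.8, n_per_group=25), 2))
print(round(approx_power(d=0.2, n_per_group=100), 2))
```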
With regard to the reporting of results in empirical studies, it is common to report fit and
paths/correlations of the structural model but often studies do not report the standardized loadings
from the latent variables to the manifest variables (only 31% of the studies reviewed by Jackson
et al., 2009, reported standardized loadings). We are cognizant of space limitations in journals
that publish empirical studies and the fact that many studies test multiple models, so reporting
each individual loading for each model may not be feasible (although this is the preferable option
if possible). However, as illustrated in the above simulation, it is vital to have a general idea of
the values of the standardized loadings in order to assess AFIs because, without this context, the
values of the AFIs are uninterpretable. We encourage applied researchers to report some type of
information pertaining to standardized loadings of the latent variables in the final model to help
contextualize the AFIs (e.g., mean, median, range). Alternatively, a measure of reliability that is
based on the magnitude of the standardized loadings such as McDonald’s omega for construct
reliability (McDonald, 1970) or coefficient H for maximal reliability (Hancock & Mueller, 2001)
may be able to succinctly provide such a contextualization.
Otherwise, readers, reviewers, and editors have essentially no information upon which to judge the
fit of the model, and an SRMR value of 0.07, for example, can be interpreted in many different
ways conditional on measurement
quality. This does not solve the issue related to the reliability paradox, but it would help
researchers to be more upfront about the conditions from which their AFIs come. Additionally, it
would reward researchers who exercised due diligence to construct more reliable measures and
temper some of the inappropriate claims made from data with low reliability.
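Both coefficient H and McDonald's omega are quick to compute from a factor's completely standardized loadings. A minimal sketch (assuming a single factor with standardized indicators, so that each uniqueness is 1 − λ²; the loading values shown are the simulation conditions):

```python
def coefficient_h(loadings):
    """Coefficient H, maximal construct reliability (Hancock & Mueller, 2001):
    H = S / (1 + S), where S = sum of loading**2 / (1 - loading**2)."""
    s = sum(l**2 / (1 - l**2) for l in loadings)
    return s / (1 + s)

def mcdonald_omega(loadings):
    """McDonald's (1970) omega: (sum of loadings)**2 divided by itself
    plus the summed uniquenesses."""
    num = sum(loadings) ** 2
    return num / (num + sum(1 - l**2 for l in loadings))

# Five equal standardized loadings per factor:
for lam in (0.40, 0.70, 0.90):
    lams = [lam] * 5
    print(lam, round(coefficient_h(lams), 2), round(mcdonald_omega(lams), 2))
```

With equal loadings the two coincide; they diverge when loadings vary across indicators, with H weighting the stronger indicators more heavily.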
We acknowledge that standardized loadings are not interchangeable with reliability, but they
are related and may serve as a fair approximation that is concise to report. If it helps contextualize
this study, the coefficient H values for the 0.40, 0.70, and 0.90 loading conditions were 0.48, 0.83,
and 0.96, respectively; the McDonald's omega values for these conditions were 0.49, 0.83, and
0.96, respectively.

As a final note to put the implications of these findings into perspective, consider again the
two sets of AFIs mentioned near the beginning of the paper. As a reminder, in Model A, RMSEA
= 0.040, SRMR = 0.040, and CFI = 0.975; and in Model B, RMSEA = 0.20, SRMR = 0.14, and
CFI = 0.775. Under current practice where the HB criteria have become common reference points,
Model A would be universally seen as fitting the data better than Model B, which would likely be
desk-rejected at many reputable journals. However, if one does not somehow condition on
measurement quality, this assertion can be highly erroneous. If the factor loadings in Model A had
standardized values of 0.40 and the factor loadings in Model B had standardized values of 0.90,
Model B actually indicates better data-model fit and has higher power to detect the same moderate
misspecification in the same model based on the results of our illustrative simulation study
(assuming multivariate normality). Returning to Table 1, about 25% of moderately
misspecified models produced SRMR values below 0.04, about 5% of models resulted in CFI values
above 0.975, and nearly 95% of models produced an RMSEA value below 0.04 with poor
measurement quality. Conversely, with excellent measurement quality, essentially none of the
misspecified models produced an SRMR value less than 0.14, an RMSEA value less than 0.20, or
a CFI value less than 0.775. Even though the AFI values of Model B appear quite poor upon first
glance, under certain conditions, even these seemingly unsatisfactory values could indicate
acceptable fit with possibly only trivial misspecifications present in the model. More importantly,
the seemingly poor Model B AFI values better classify models with excellent measurement quality
compared to the seemingly pristine Model A AFI values when measurement quality is poor. To
put the thesis of this paper into a single sentence, information about the quality of the
measurement must be reported along with AFIs in order for the values to have any interpretative
value.

References
Allan, N. P., Lonigan, C. J., & Phillips, B. M. (2015). Examining the factor structure and
structural invariance of the PANAS across children, adolescents, and young adults. Journal of
Personality Assessment, 97, 616-625.
Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A
review and recommendations. The Leadership Quarterly, 21, 1086-1120.
Bardeen, J. R., Fergus, T. A., Hannan, S. M., & Orcutt, H. K. (2016). Addressing psychometric
limitations of the Difficulties in Emotion Regulation Scale through item modification. Journal of
Personality Assessment, 98, 298-309.
Barrett, P. (2007). Structural equation modelling: Adjudging model fit. Personality and
Individual Differences, 42, 815-824.
Beauducel, A., & Wittmann, W. W. (2005). Simulation study on fit indexes in CFA based on
data with slightly distorted simple structure. Structural Equation Modeling, 12, 41-75.
Bentler, P. M. (2007). On tests and indices for evaluating structural models. Personality and
Individual Differences, 42, 825-829.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of
covariance structures. Psychological Bulletin, 88, 588-606.
Browne, M. W., MacCallum, R. C., Kim, C. T., Andersen, B. L., & Glaser, R. (2002). When fit
indices and residuals are incompatible. Psychological Methods, 7, 403-421.
Chen, F., Curran, P. J., Bollen, K. A., Kirby, J., & Paxton, P. (2008). An empirical evaluation of
the use of fixed cutoff points in RMSEA test statistic in structural equation models. Sociological
Methods & Research, 36, 462-494.
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing
measurement invariance. Structural Equation Modeling, 9, 233-255.
Cole, D. A., & Preacher, K. J. (2014). Manifest variable path analysis: Potentially serious and
misleading consequences due to uncorrected measurement error. Psychological Methods, 19, 300-
Credé, M., & Harms, P. D. (2015). 25 years of higher‐order confirmatory factor analysis in the
organizational sciences: A critical review and development of reporting
recommendations. Journal of Organizational Behavior, 36, 845-872.
Demianczyk, A. C., Jenkins, A. L., Henson, J. M., & Conner, B. T. (2014). Psychometric
evaluation and revision of Carver and White's BIS/BAS scales in a diverse sample of young
adults. Journal of Personality Assessment, 96, 485-494.
Fan, X., & Sivo, S. A. (2005). Sensitivity of fit indexes to misspecified structural or measurement
model components: Rationale of two-index strategy revisited. Structural Equation
Modeling, 12, 343-367.
Fan, X., & Sivo, S. A. (2007). Sensitivity of fit indices to model misspecification and model
types. Multivariate Behavioral Research, 42, 509-529.
Fergus, T. A., Valentiner, D. P., McGrath, P. B., Gier-Lonsway, S. L., & Kim, H. S. (2012). Short
forms of the social interaction anxiety scale and the social phobia scale. Journal of Personality
Assessment, 94, 310-320.
Hancock, G. R., & Mueller, R. O. (2011). The reliability paradox in assessing structural relations
within covariance structure models. Educational & Psychological Measurement, 71, 306-324.
Hancock, G. R., & Mueller, R. O. (2001). Rethinking construct reliability within latent variable
systems. In R. Cudeck, S. du Toit, & D. Sörbom (Eds.), Structural equation modeling: Present
and future—A festschrift in honor of Karl Jöreskog (pp. 195–216). Lincolnwood, IL: Scientific
Software International.
Hayduk, L. A. (2014). Shame for disrespecting evidence: the personal consequences of
insufficient respect for structural equation model testing. BMC Medical Research
Methodology, 14, 124.
Hayduk, L. A., & Glaser, D. N. (2000). Jiving the four-step, waltzing around factor analysis, and
other serious fun. Structural Equation Modeling, 7, 1-35.
Hayduk, L., Cummings, G., Boadu, K., Pazderka-Robinson, H., & Boulianne, S. (2007). Testing!
Testing! One, two, three–testing the theory in structural equation models!. Personality and
Individual Differences, 42, 841-850.
Heene, M., Hilbert, S., Draxler, C., Ziegler, M., & Bühner, M. (2011). Masking misfit in
confirmatory factor analysis by increasing unique variances: a cautionary note on the usefulness of
cutoff values of fit indices. Psychological Methods, 16, 319-336.
Hu, L. T., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to
underparameterized model misspecification. Psychological Methods, 3, 424-453.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.
Ion, A., Iliescu, D., Aldhafri, S., Rana, N., Ratanadilok, K., Widyanti, A., & Nedelcea, C. (2016).
A cross-cultural analysis of personality structure through the lens of the HEXACO model. Journal
of Personality Assessment, OnlineFirst.
Jackson, D. L., Gillaspy Jr, J. A., & Purc-Stephenson, R. (2009). Reporting practices in
confirmatory factor analysis: an overview and some recommendations. Psychological
Methods, 14, 6-23.
Joshanloo, M. (2016). Factor Structure of Subjective Well-Being in Iran. Journal of Personality
Assessment, 98, 435-443.
Jöreskog, K. G., & Sörbom, D. (1981). LISREL V: Analysis of linear structural relationships by
maximum likelihood and least squares methods. University of Uppsala, Department of Statistics.
Kang, Y., McNeish, D.M., & Hancock, G. R. (2016). The role of measurement quality on
practical guidelines for assessing measurement and structural invariance. Educational and
Psychological Measurement, 76, 533-561.
Marsh, H. W., Hau, K. T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis
testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and
Bentler's (1999) findings. Structural Equation Modeling, 11, 320-341.
McDonald, R. P. (1970). The theoretical foundations of principal factor analysis, canonical factor
analysis, and alpha factor analysis. British Journal of Mathematical and Statistical
Psychology, 23, 1-21.
McIntosh, C. (2007). Rethinking fit assessment in structural equation modelling: A commentary
and elaboration on Barrett (2007). Personality and Individual Differences, 42, 859-867.
Michel, J. S., Pace, V. L., Edun, A., Sawhney, E., & Thomas, J. (2014). Development and
validation of an explicit aggressive beliefs and attitudes scale. Journal of Personality
Assessment, 96, 327-338.
Miles, J., & Shevlin, M. (2007). A time and a place for incremental fit indices. Personality and
Individual Differences, 42, 869-874.
Millsap, R. E. (2007). Structural equation modeling made difficult. Personality and Individual
Differences, 42, 875-881.
Mulaik, S. (2007). There is a place for approximate fit in structural equation modelling.
Personality and Individual Differences, 42, 883-891.
Rice, K. G., Richardson, C. M., & Tueller, S. (2014). The short form of the revised almost perfect
scale. Journal of Personality Assessment, 96, 368-379.
Saris, W. E., Satorra, A., & Van der Veld, W. M. (2009). Testing structural equation models or
detection of misspecifications?. Structural Equation Modeling, 16, 561-582.
Saris, W. E., & Satorra, A. (1992). Power evaluations in structural equation models. In K. A.
Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 181-204). London, England: Sage.
Savalei, V. (2012). The relationship between root mean square error of approximation and model
misspecification in confirmatory factor analysis models. Educational and Psychological
Measurement, 72, 910-932.
Steiger, J. H. (2007). Understanding the limitations of global fit assessment in structural equation
modeling. Personality and Individual Differences, 42, 893-898.
Steiger, J. H. (2000). Point estimation, hypothesis testing, and interval estimation using the
RMSEA: Some comments and a reply to Hayduk and Glaser. Structural Equation Modeling, 7,
Steiger, J.H. & Lind, J.C. (1980, May). Statistically-based tests for the number of common
factors. Paper presented at the annual meeting of the Psychometric Society, Iowa City, IA.
Tomarken, A. J., & Waller, N. G. (2003). Potential problems with "well fitting" models. Journal
of Abnormal Psychology, 112, 578-598.
You, J., Leung, F., Lai, K. K. Y., & Fu, K. (2013). Factor structure and psychometric properties of
the Pathological Narcissism Inventory among Chinese university students. Journal of Personality
Assessment, 95, 309-318.
Table 1
Proportion of the replications declared poorly fitting for various loading magnitudes under the
misspecified model
Note: The values in the second row are the hypothetical AFI cut-off values to which the
percentage of rejected models corresponds. The bolded row represents the condition used in Hu
and Bentler (1999).
Figure 1. Confirmatory factor analysis model used in Hu and Bentler (1999). As in HB, in the
population model for the simulation, ϕ12 = 0.50, ϕ13 = 0.40, and ϕ23 = 0.30.
Figure 2. Empirical distribution of RMSEA for selected factor loading conditions. The vertical
black line corresponds to the HB cut-off of 0.06. Values to the right of the vertical black line
would be classified as having poor data-model fit; values to the left indicate good data-model fit.
Figure 3. Empirical distribution of SRMR (left panel) and CFI (right panel) for selected factor
loading conditions. The vertical black line corresponds to the HB cut-off of 0.08 for SRMR and
0.95 for CFI. For SRMR, values to the right of the vertical black line would be classified as having
poor data-model fit and values to the left good data-model fit; for CFI, the direction is reversed,
with values to the left of the line indicating poor data-model fit.