
The Thorny Relation between Measurement Quality and Fit Index Cut-Offs in Latent Variable

Models

Daniel MCNEISH, Ji AN, & Gregory R. HANCOCK

University of Maryland, College Park

AUTHOR NOTES:

Correspondence concerning this manuscript should be sent to the first author, Daniel McNeish, who is now at the University of North Carolina, Chapel Hill, 100 E. Franklin Street Suite 200, Chapel Hill, NC 27599, USA (Email: dmcneish@email.unc.edu).

MEASUREMENT QUALITY AND FIT INDICES 1

Abstract

Latent variable modeling is a popular and flexible statistical framework. Concomitant with fitting

latent variable models is assessment of how well the theoretical model fits the observed data.

Although firm cut-offs for these fit indices are often cited, recent statistical proofs and simulations

have shown that these fit indices are highly susceptible to measurement quality. For instance, an

RMSEA value of 0.06 (conventionally thought to indicate good fit) can actually indicate poor fit

with poor measurement quality (e.g., standardized factors loadings of around 0.40). Conversely,

an RMSEA value of 0.20 (conventionally thought to indicate very poor fit) can indicate acceptable

fit with very high measurement quality (standardized factor loadings around 0.90). Despite the

wide-ranging effect on applications of latent variable models, the high level of technical detail

involved with this phenomenon has curtailed the exposure of these important findings to empirical

researchers who are employing these methods. This paper briefly reviews these methodological

studies in minimal technical detail and provides a demonstration to easily quantify the large

influence measurement quality has on fit index values and how greatly the cut-offs would change

if they were derived under an alternative level of measurement quality. Recommendations for best

practice are also discussed.


The Thorny Relation between Measurement Quality and Fit Index Cut-Offs in Latent

Variable Models

Latent variable models have burgeoned in popularity and have been a fixture in the toolkits

of psychologists analyzing empirical data. As a critical step in the evaluation of latent variable

models, researchers routinely investigate how well their theoretical model fits to the observed

data. Such assessments of data-model fit are crucial in the appraisal of psychological theories –

favorable data-model fit can lend support of a theory while poor data-model fit can call a theory

into question. Despite the substantial role data-model fit plays in the assessment of theories tested

via latent variable models, the appropriate manner to assess data-model fit has long been a

controversial issue in the methodological literature. Lively debates have erupted over whether the

minimum fit function chi-square test is the only true measure of fit or whether its tendency to

become highly powered with larger samples may necessitate the use of descriptive, approximate

goodness of fit indices (AFIs; see Antonakis, Bendahan, Jacquart, & Lalive, 2010; Barrett, 2007;

Bentler, 2007; Browne, MacCallum, Kim, Andersen, & Glaser, 2002; Chen, Curran, Bollen,

Kirby, & Paxton 2008; Credé & Harms, 2015; Hayduk, Cummings, Boadu, Pazderka-Robinson, &

Boulianne, 2007; Hayduk & Glaser, 2000; Hayduk, 2014; McIntosh, 2007; Miles & Shevlin,

2007; Mulaik, 2007; Steiger, 2007; Tomarken & Waller, 2003). If one concludes that AFIs are

appropriate, then, because AFIs are largely descriptive measures, the absence of traditional inferential properties (i.e., p-values) immediately raises the question of which values are indicative of

acceptable data-model fit. Throughout the 1980s and 1990s, there was little consensus about

which values of which AFIs could be considered reflective of a well-fitting model and researchers

often relied on experience, intuition, or subjective criteria to varying degrees (Marsh, Hau, &

Wen, 2004).


Hu and Bentler (1999; hereafter referred to as HB) attempted to address this considerable

issue by conducting a simulation study with an impressive breadth of conditions, ultimately

yielding empirically-based recommendations for values that are indicative of acceptable data-

model fit. That is, HB determined values that maximally discerned models known to be incorrect

from those known to be correct, recommending a cut-off value for the standardized root mean

square residual (SRMR; Jöreskog & Sörbom, 1981) less than or equal to 0.08, a root mean square

error of approximation (RMSEA; Steiger & Lind, 1980) value less than or equal to 0.06, and a

comparative fit index (CFI; Bentler, 1990) value greater than or equal to 0.95, among other

indices. These recommended cut-off values have achieved near canonical status (as evidenced in

part by the work’s 35,000+ citations on Google Scholar) and nearly any researcher working with

latent variable models is likely familiar with these cut-off values.

Despite several cautions by HB themselves advising against over-generalizing their findings

to conditions and models outside of what was contained in the simulation, many applied

researchers, textbook authors, journal editors, and reviewers have endorsed the HB criteria as

applicable to latent variable models broadly (Marsh et al., 2004; Jackson, Gillaspy, & Purc-

Stephenson, 2009). For instance, in a review of reporting practices for confirmatory factor analysis

(CFA) in general applied psychology studies, Jackson et al. (2009) found that almost 60% of

studies explicitly used or referenced the HB recommendations to judge the fit of the model.

Additionally, Jackson et al. (2009) showed implicit evidence of the omnipresent nature of the cut-

offs – the average fit index values across over 350 published psychology studies were 0.060,

0.062, and 0.933 for the SRMR, RMSEA, and CFI, respectively, demonstrating that these cut-offs

have essentially become the hurdle which empirical researchers must clear in order to publish their

findings. To foreshadow the motivation for the current study, Jackson et al. (2009) stated, “we also

did not find evidence that warnings about strict adherence to Hu and Bentler’s suggestions were


being heeded” (p. 18). One such warning that is particularly important is the role of measurement

quality, which will be the focus of the remainder of this paper.

Hypothetical Scenario

Before delving into the specifics of latent variable data-model fit indices, we will advance a

hypothetical example to set the stage for our treatment of the role of measurement quality on fit

index cut-off values. Imagine that two studies are conducted, one testing Model A and the other

testing Model B. These models have the same latent structure and number of indicator variables,

are based upon the same sample size, and the data in each case meet all standard distributional

assumptions. Using commonly reported AFIs from Jackson et al. (2009), suppose Model A’s AFI

values surpass HB’s recommendations such that the RMSEA is 0.04, SRMR is 0.04, and CFI is

0.975, whereas Model B has an RMSEA of 0.20, an SRMR of 0.14, and a CFI of 0.775. Based on

this information and knowing that the model type, model complexity, adherence to assumptions,

and sample sizes are equivalent, we presume that many researchers would instinctively assess the

fit of Model A to be superior to Model B and, if Model B were under consideration at a

prestigious journal or conference, that the theory underlying it may be subjected to a steady stream

of criticism or it may be dismissed outright based upon the seemingly egregious data-model fit.

However, in fairly routine scenarios found in psychology, there are circumstances in which

the AFI values such as those from Model B can not only be indicative of adequate data-model fit

in an absolute sense, but can better distinguish between well-fitting and poorly fitting models than

the fit index values reported for Model A, despite the apparently stable nature of AFI cut-off

values which are applied indiscriminately across studies. We ask readers to retain this motivating

hypothetical example in mind as we continue below, and we will return to this hypothetical

scenario in the concluding paragraphs of this paper after we review the literature and report results

from our illustrative simulation.


The Origin and Issues of the Hu and Bentler Cut-Offs

Although many empirical researchers treat the HB cut-offs more or less as firm rules, these cut-offs were derived via a Monte Carlo simulation study rather than mathematically. With such an approach, the results are necessarily constrained to the conditions featured

in the simulation study and are not broadly generalizable to all types of models and data types.

Although the conditions within the seminal HB study were expansive (the original paper spanned

50 journal pages upon publication), one could not realistically expect HB to cover all possible

conditions that may arise in empirical studies. As such, several studies have spoken to some of the problems of over-generalizing the HB recommendations and also to possible shortcomings in the HB simulation design. These criticisms range from how realistic HB's induced misspecifications were (Marsh et al., 2004) to possible confounding of the model types that could have differentially affected the performance of different AFIs (Fan & Sivo, 2005).

Another major criticism, which will be the focus of this study, is the quality of measurement.¹ Although HB manipulated many conditions when deriving their cut-off values, the strength of the standardized factor loadings was kept constant throughout their entire study. That is, HB tested many different sample sizes, degrees of deviation from normality, and model types; however, all of these conditions were tested with factor loadings that were always near 0.70 (most loadings were 0.70 in their models and a few were either 0.75 or 0.80, but the loadings were not systematically manipulated in the study). Recalling that Monte Carlo-derived values are only applicable to conditions included in the study, the absence of multiple measurement quality conditions

undoubtedly limits the generalizability of the HB cut-offs because measurement quality is quite

variable not only from discipline to discipline, but also from study to study within a single discipline.

¹ In the methodological literature, measurement quality does not necessarily have a strict definition and can be used to refer to the validity, reliability, or generalizability of a particular scale. In this paper, we use “measurement quality” to refer to the strength of the standardized factor loadings, which is highly related to reliability.

Studies with standardized loadings that exceed 0.70 are commonly (but not

exclusively) found when measuring more concrete constructs such as cognitive abilities,

reasoning abilities, or attitudes (for recent empirical examples in the Journal of Personality

Assessment, see e.g., Bardeen, Fergus, Hannan, & Orcutt, 2016; Jashanloo, 2015; Rice,

Richardson, & Tueller, 2014; You, Leung, Lai, & Fu, 2013). Studies aiming to capture less well-

defined constructs such as creativity, risky or substance use behaviors, or abilities in very young

children tend to (but do not exclusively) have standardized loadings below 0.70 (for recent

empirical examples in the Journal of Personality Assessment, see e.g., Allan, Lonigan, & Phillips,

2015; Demianczyk, Jenkins, Henson, & Conner, 2014; Fergus, Valentiner, McGrath, Gier-

Lonsway, & Kim, 2012; Ion et al., 2016; Michel, Pace, Edun, Sawhney, & Thomas, 2014). In

practice, researchers can reasonably expect to see standardized factor loadings with magnitudes

between about 0.40 and 0.90 both in their own work and when reading the work of others. Despite

the considerable difference between a variable that loads at 0.40 on a latent variable and one that

loads at 0.90, the original HB study did not differentiate between these situations.
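To make concrete how loading magnitude translates into reliability, the short sketch below (our illustration, not from the original manuscript) computes two common reliability coefficients, McDonald's omega and Hancock and Mueller's coefficient H, for a hypothetical single factor measured by five standardized indicators:

```python
import numpy as np

def omega(loadings):
    """McDonald's omega for a unidimensional scale with standardized indicators."""
    lam = np.asarray(loadings, dtype=float)
    common = lam.sum() ** 2           # variance attributable to the factor
    unique = (1.0 - lam ** 2).sum()   # summed residual variances
    return common / (common + unique)

def coefficient_h(loadings):
    """Hancock and Mueller's coefficient H (maximal construct reliability)."""
    lam = np.asarray(loadings, dtype=float)
    s = (lam ** 2 / (1.0 - lam ** 2)).sum()
    return s / (1.0 + s)

for lam in (0.40, 0.90):
    print(f"loading {lam}: omega = {omega([lam] * 5):.2f}, "
          f"H = {coefficient_h([lam] * 5):.2f}")
```

With equal loadings the two coefficients coincide: loadings of 0.40 imply reliability near .49, whereas loadings of 0.90 imply reliability near .96, underscoring how different these two measurement situations are.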

Issues of Measurement Quality and the Reliability Paradox

The lack of multiple measurement quality conditions in HB has been questioned over the

last decade, with recent studies noting how such an oversight can greatly limit the broad

validity of fit index cut-offs in latent variable models (Beauducel & Wittman, 2005; Cole &

Preacher, 2014; Hancock & Mueller, 2011; Heene, Hilbert, Draxler, Ziegler, & Bühner, 2011;

Kang, McNeish, & Hancock, 2016; Miles & Shevlin, 2007; Saris, Satorra, & van der Veld, 2009;

Savalei, 2012; Steiger, 2000). While some studies have merely noted the issues with such an

omission, other studies have gone so far as to mathematically prove that measurement quality

directly affects the values of AFIs. We will review the findings from each of these studies next.


For a given set of misspecifications in a latent variable model, holding all else equal,

models with poor measurement quality appear to fit much better than models with excellent

measurement quality. This phenomenon was first noted in a study investigating properties of

RMSEA by Saris and Satorra (1992) but has been developed further over the last few years.

Hancock and Mueller (2011) coined the phrase reliability paradox to describe this relationship.

The paradoxical nature of the phenomenon is evoked by the fact that researchers often strive for

the highest measurement quality possible for their latent variables, but, once obtained, AFIs will

be far worse than if measurement quality were much poorer. Using a population study, Hancock

and Mueller (2011) systematically showed how, with one hypothetical model, evaluations of data-

model fit slowly deteriorate as a function of measurement quality, even when all other model and

design factors are held constant. In their study, they kept the degree of misspecification, sample

size, and the model identical and only changed the magnitude of the standardized factor loadings

from 0.40 to 0.95. For example, in their hypothetical model, the RMSEA value with standardized

loadings of 0.40 was 0.00 while the RMSEA value with standardized loadings of 0.95 was 0.10.

Hancock and Mueller (2011) further showed that standard error estimates of structural parameters

are much larger with poorer measurement quality, and that Lagrange multiplier test statistics (more commonly known as modification indices) similarly lose their effectiveness at flagging paths that should be introduced into the model to improve its fit when measurement quality is poorer. Hancock and Mueller (2011) concluded that the nature of the AFI cut-offs is in direct

contrast to best data analytic practice – poor measurement quality is rewarded while good

measurement quality is punished.

Heene et al. (2011) had a similar motivation to Hancock and Mueller (2011) but conducted

their study via a simulation and mathematically derived the direct relation of measurement quality

on AFIs. Heene et al. (2011) used HB to inspire their simulation conditions although they did


make several alterations including (1) using different magnitudes of factor loadings for each item

on each factor (i.e., a congeneric scale) rather than having the loadings be equal (i.e., a tau-

equivalent scale), (2) the number of manifest variables per factor was changed to 15 and 45 which

more closely represents a scale or instrument rather than the much smaller factors used by HB, and

(3) different standardized factor loading conditions were used with high (factor loadings near

0.80), medium (factor loadings near 0.60), and low (factor loadings near 0.40) factor reliability

conditions. Similar to Hancock and Mueller (2011), results showed that RMSEA, SRMR, CFI,

and TML values in misspecified models indicated that the model fit the data well under the low factor

reliability condition whereas the high factor reliability condition showed RMSEA, SRMR, CFI,

and TML values that would routinely call for rejecting the model. When the model was perfectly

specified (i.e., the model used to generate data and the model used to fit the data were identical),

factor reliability was inconsequential and the RMSEA, SRMR, CFI, and TML values were indicative of

good fit regardless of the factor loading condition (however, measurement quality was only

inconsequential if the model were perfectly specified which is highly unlikely in empirical

studies). Noting that simulations are not broadly generalizable, Heene et al. (2011) went on to

explain via a mathematical derivation why the reliability paradox exists. Although the proof is

somewhat in-depth, the rationale stems from the fact that the eigenvalues of the model-implied

covariance matrix are bounded below by the residual variances which are a function of the factor

loadings – larger factor loadings lead to low residual variances. That is, if the factor loadings are

high, the latent variable explains a larger amount of variance in the manifest variable and the

associated residual is low as a result. As the factor loadings decrease (and residual variances increase), the lower bound of the eigenvalues increases, shrinking the inverse-covariance weight applied to any residual misfit; TML and AFI misfit values therefore decrease as well (provided that the model is not perfectly specified and has at least some trivial misspecification), making models appear to fit relatively better if cut-off criteria are held constant.


The lower bound of the model fit criteria changes as a function of measurement quality but the

cut-off for good fit is constant, meaning that the relative distance between the lower bound and the

cut-off is not constant and that studies with poor measurement quality therefore have an easier

path to reaching a conclusion of acceptable data-model fit.
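Heene et al.'s eigenvalue argument is easy to verify numerically. The sketch below (our illustration with hypothetical values: two factors correlated at 0.5, five standardized indicators per factor) builds the model-implied covariance matrix and confirms that its smallest eigenvalue equals the residual variance, a bound that shrinks as the loadings grow:

```python
import numpy as np

def implied_cov(loading, n_per_factor=5, phi=0.5):
    """Model-implied covariance for a hypothetical two-factor model with
    equal standardized loadings and residual variances of 1 - loading**2."""
    p = 2 * n_per_factor
    L = np.zeros((p, 2))
    L[:n_per_factor, 0] = loading
    L[n_per_factor:, 1] = loading
    Phi = np.array([[1.0, phi], [phi, 1.0]])
    Theta = (1.0 - loading ** 2) * np.eye(p)
    return L @ Phi @ L.T + Theta

for lam in (0.40, 0.90):
    Sigma = implied_cov(lam)
    # L @ Phi @ L.T has rank 2 < 10, so the smallest eigenvalue of Sigma
    # is exactly the residual variance 1 - lam**2.
    print(f"loading {lam}: residual variance {1 - lam**2:.2f}, "
          f"smallest eigenvalue {np.linalg.eigvalsh(Sigma).min():.2f}")
```

With loadings of 0.40 the eigenvalues are bounded below by 0.84, but with loadings of 0.90 the bound drops to 0.19; the smaller this bound, the larger the inverse-covariance weights applied to any given misspecification, and hence the larger the resulting TML and AFI misfit values.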

Miles and Shevlin (2007) conducted an illustrative population analysis on incremental fit

indices to show how they are minimally affected by model conditions, including the magnitude of

the factor loadings. Incremental fit indices are those that compare the improvement in fit of the

model of interest to some baseline model (usually an independence model that simply models the variance and mean of each individual manifest variable but does not allow the manifest variables to covary). Miles and Shevlin (2007) compared three models: one with perfectly reliable manifest

indicator variables, one with manifest indicators with 0.80 reliability, and one with manifest

indicator variables with 0.50 reliability. Results showed that CFI, the Tucker-Lewis Index, and the

incremental fit index were able to demonstrate good data-model fit with a trivially misspecified

model (i.e., one that should fail to be rejected for practical purposes) with high reliabilities (i.e.,

factor loadings with strong magnitudes) despite the fact that both RMSEA and TML would

handily reject the model. On these grounds, Miles and Shevlin (2007) advocate for wider use of incremental fit indices, which are less affected by the reliability paradox because the effect of measurement quality is partially included in the baseline model as well as the model of interest.

Incremental indices are still not immune, however, and Miles and Shevlin (2007) list their primary

conclusion, somewhat tongue-in-cheek, as “If you wish your model to fit, … ensure that your

measures are unreliable” (p. 874).

Saris et al. (2009) conducted simulations and provided population analyses to explore

whether TML and AFIs are actually detecting misspecifications in the model or whether they are

instead more sensitive to attributes of the model that are unrelated to misspecifications. Their


analyses showed that the values of these fit measures were just as related to incidental conditions such as sample size and measurement quality as they were to actual misspecifications. Related to

measurement quality, their results showed that, for a constant degree of misspecification, as

standardized factor loadings were increased from 0.70 (as in HB) to 0.90, the RMSEA went from

demonstrating great fit (0.00) to very poor fit (0.14), crossing the HB cut-off value at a standardized loading of 0.85 (see Table 3 in their article for complete results). Saris et al.

(2009) make a case for abandoning global fit measures (i.e., TML and AFIs) and replacing

assessment of latent variable models with single-parameter tests such as expected parameter

change or modification indices. Saris et al. (2009) also note that all possible misspecifications are

not theoretically equivalent and that the goal of data-model fit assessments should not be strictly

concerned with detecting any type of misspecification (which is the case when testing the fit of the

model globally) but rather should focus on detecting theoretically meaningful misspecifications.

Kang et al. (2016) extended Hancock and Mueller (2011) to the context of multiple-group latent variable models, where a primary interest is determining whether parameter estimates are invariant across groups (e.g., whether items function similarly across different groups of people).

These invariance tests typically feature tests of differences in fit indices (e.g., ΔTML, ΔCFI; see Cheung

& Rensvold, 2002 for additional details) so the goal of the study was to examine whether

measurement quality similarly affects differences in fit indices or whether the reliability paradox is

confined to AFIs in their raw form. Kang et al. (2016) found that, for the purpose of testing either

measurement invariance (similarity of loadings across groups) or structural invariance (similarity of structural paths across groups), only ΔMcDonald's Non-Centrality Index was reasonably

unaffected by measurement quality and only for measurement invariance. For all other conditions,

as the measurement quality increased, the indices were much more likely to find non-invariance


while conditions with poor measurement quality (i.e., around 0.40) concluded that invariance was

established.

Bridging the Gap to Empirical Researchers

These methodological studies have noted the existence of this phenomenon but, due to a

strong methodological focus, previous studies emphasize why it occurs rather than how it affects

cut-off values that a majority of researchers are using (or at least referencing as guidelines and/or

being subjected to through peer review). That is, the tangible implications of broadly applying the

HB cut-offs have yet to be demonstrated to a primarily non-statistical audience, despite the fact that

non-statisticians and their substantive theories are the most widely affected.

The goal of this paper is not to extend the methodological conclusions of the so-called

reliability paradox to novel situations or to provide additional insight about the mechanism by

which it functions. As discussed in the preceding section, there are already several rigorous

methodological studies that achieve this goal. More importantly, despite the wide-ranging

implications of these methodological studies and the potentially serious implications for the

evaluation of applied research, these findings are largely confined to the pages of technical

journals and are examined from a more theoretical perspective. Instead, our goal is to elucidate

these findings to as broad an audience as possible by stripping the technical language and detail to

demonstrate the magnitude of the practical implications as plainly as possible. Thus, this paper is

not attempting to pass these ideas off as original but rather to illuminate highly relevant

methodological considerations that have yet to find their way into discussions of empirical studies.

To accomplish this goal, we next provide an illustrative simulation to show (1) how the behavior of the fit index cut-offs varies as a function of measurement quality and (2) how different the widely used cut-off values would be had HB used even a slightly different measurement quality condition in their study. We will then discuss the effect this has when interpreting models


in empirical studies, ways that researchers can report AFIs to acknowledge this issue, and how it

may affect what journal editors and reviewers deem worthy of publication in top-tier outlets.

Illustrative Simulation Design

Although example analyses from real datasets are often the preferred method to demonstrate

methodological issues to non-statisticians, the nature of the problem at hand does not lend itself to being examined in such a manner. That is, to fully grasp the severity of the issue, all components of

the data and associated model (sample size, number of latent variables, number of indicators per

latent variable, severity of misspecification) must be held constant with the exception of the

magnitude of the standardized factor loadings. To avoid possible confounds, the extent of the

misspecifications that are present in the model must be known to ensure that models only differ in

the standardized factor loadings, which cannot be discerned with real data. Therefore, we will

generate our own data that satisfy these requirements with a small illustrative simulation. We

realize that not all readers may be familiar with interpreting simulation studies, so we will provide

guidance throughout this section to facilitate proper interpretation of this demonstration.

In order to elucidate the effect of measurement quality on AFI cut-offs, we begin with the

original model used in HB. To briefly overview HB’s original conditions, their “simple” true

model was a CFA model with three covarying exogenous latent variables, each with five manifest

indicators that had factor loadings mostly equal to 0.70 with a few loadings of 0.75 or 0.80. The

path diagram of the data generation model is presented in Figure 1. The degrees of

misspecification included a “minor” condition such that one factor covariance path was omitted

and a “severe” condition which omitted two factor covariance paths from the model. Samples of

different sizes (N = 150, 250, 500, 1000, 2500, 5000) were drawn from seven conditions that

differed in terms of normality and independence. We only replicate the simple model from HB

under multivariate normality and only for HB’s “minor misspecification” condition.


In each cell of the simulation design, 1000 datasets were generated according to the HB simple model as presented in Figure 1. We then fit the HB model containing a “minor” misspecification, which purposefully omits the factor covariance path between Factor 1 and Factor

2. The model is therefore not correct and the TML statistic and AFI values should detect that this

model is misspecified. Addressing the primary interest of the paper – to investigate the effect of

varying degrees of measurement quality on AFI cut-offs – we generated data with factor loadings

that were equivalent across all 15 indicator variables. Population values for the standardized factor

loadings were manipulated to range from 0.40 to 0.90, in 0.10 increments (unlike HB’s

standardized loading conditions which were constrained near 0.70). Data were generated and

modeled within PROC CALIS in SAS 9.3. During the process, we tracked RMSEA, SRMR, CFI,

and TML because, as mentioned previously, these indices are widely reported in empirical studies

(Jackson et al., 2009).

FIGURE 1 ABOUT HERE
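The data generation step can be sketched as follows (our stand-in for the PROC CALIS setup, not the authors' code; the factor correlations of 0.5, 0.4, and 0.3 are hypothetical values, as HB's exact population values are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2016)

def generate_sample(loading, n=500, phi=(0.5, 0.4, 0.3)):
    """Draw one n x 15 multivariate-normal dataset from the 3-factor model."""
    L = np.zeros((15, 3))
    for f in range(3):
        L[5 * f:5 * (f + 1), f] = loading
    Phi = np.array([[1.0, phi[0], phi[1]],
                    [phi[0], 1.0, phi[2]],
                    [phi[1], phi[2], 1.0]])
    Sigma = L @ Phi @ L.T + np.diag(np.full(15, 1.0 - loading ** 2))
    return rng.multivariate_normal(np.zeros(15), Sigma, size=n)

data = generate_sample(0.70)
print(data.shape)  # (500, 15)
```

Each call draws one replication; looping over loading values 0.40 through 0.90 with 1000 replications apiece reproduces the design described above.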

For readers who are not familiar with simulation studies, this paragraph will conceptually

describe the logic of the simulation. Readers familiar with interpreting simulation results can skip

the remainder of this section without loss of continuity. The advantage of the simulation design is

that we generated the data according to a certain model, so we are able to control the magnitude of

model misspecification, the magnitude of the standardized factor loadings, and we can determine

if a model fit to the generated data is correct. This luxury is not available when using real data

because one cannot be certain of the “correct” model for the data (i.e., whether the model-implied

covariance matrix perfectly reproduces the observed covariance matrix) or be certain of the level

of misspecification. Furthermore, the measurement quality in real data cannot be manipulated. We

start by generating data from the model in Figure 1 that has standardized factor loadings equal to

0.40. From these data, we then fit a misspecified model that should fit somewhat poorly. We then


record the model fit criteria for the model. This process is repeated with 1000 unique generated

datasets so that we have adequate information to inspect the distribution of the model fit criteria

(we repeat the process instead of generating a single dataset to avoid succumbing to any

idiosyncratic nuances that could occur based on random chance). We then repeat the process with

data generated from standardized factor loadings equal to 0.50, 0.60, and so forth up to 0.90 (with 1000 different generated datasets for each standardized loading value). We then compare the

distributions for each of the standardized factor loading conditions to show how the fit index values change as the standardized factor loadings change, even though we know that the degree of misspecification is exactly the same across the entire simulation (because we have control over how the data are created).
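The logic just described can be sketched in code. Rather than re-running the full 1000-replication design (the original used SAS PROC CALIS), the sketch below computes the population analogue: it fits the misspecified model, with the Factor 1-Factor 2 covariance wrongly fixed to zero, directly to the error-free population covariance matrix and converts the minimized maximum-likelihood discrepancy into a population RMSEA. The factor correlations (0.5, 0.4, 0.3) are hypothetical stand-ins for HB's values:

```python
import numpy as np
from scipy.optimize import minimize

def implied_cov(lam, theta, phi12, phi13, phi23):
    """Implied covariance: three correlated factors, five indicators each."""
    L = np.zeros((15, 3))
    L[0:5, 0], L[5:10, 1], L[10:15, 2] = lam[0:5], lam[5:10], lam[10:15]
    Phi = np.array([[1.0, phi12, phi13],
                    [phi12, 1.0, phi23],
                    [phi13, phi23, 1.0]])
    return L @ Phi @ L.T + np.diag(theta)

def f_ml(S, Sigma):
    """Maximum-likelihood discrepancy between covariance matrices S and Sigma."""
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    return logdet_Sigma - logdet_S + np.trace(S @ np.linalg.inv(Sigma)) - S.shape[0]

def population_rmsea(loading, phi=(0.5, 0.4, 0.3)):
    """Population RMSEA when the Factor 1-Factor 2 covariance is wrongly fixed to 0."""
    lam_true = np.full(15, loading)
    theta_true = 1.0 - lam_true ** 2
    S = implied_cov(lam_true, theta_true, *phi)  # error-free population matrix

    def objective(x):  # free: 15 loadings, 15 residuals, 2 factor covariances
        return f_ml(S, implied_cov(x[:15], x[15:30], 0.0, x[30], x[31]))

    x0 = np.concatenate([lam_true, theta_true, [phi[1], phi[2]]])
    bounds = [(0.05, 0.99)] * 15 + [(0.02, 1.0)] * 15 + [(-0.95, 0.95)] * 2
    fmin = minimize(objective, x0, method="L-BFGS-B", bounds=bounds).fun
    df = 15 * 16 // 2 - 32  # 120 unique moments minus 32 free parameters
    return np.sqrt(max(fmin, 0.0) / df)

for lam in (0.40, 0.70, 0.90):
    print(f"loadings {lam}: population RMSEA = {population_rmsea(lam):.3f}")
```

Under this setup the population RMSEA grows steadily as the loadings move from 0.40 to 0.90 even though the omitted covariance path is identical throughout, which is the reliability paradox in miniature.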

Results

For each of the models, we calculate the percentage of the replications in which the SRMR,

RMSEA, and CFI values for the fitted model exceed a particular cut-off (i.e., a conclusion of poor

fit) for each standardized loading magnitude condition, similar to HB. For each index, we explore

the percentage of models that would be declared poorly fitting based on the HB cut-off

recommendations – 0.06 for RMSEA, 0.08 for SRMR, and 0.95 for CFI. Additionally, we investigate how many models would be declared poorly fitting if the cut-off were instead a value of each index conventionally thought to indicate unambiguously good fit (0.04 for RMSEA, 0.04 for SRMR, and 0.97 for CFI) or a value conventionally thought to indicate unambiguously poor data-model fit (0.20 for RMSEA, 0.14 for SRMR, and 0.775 for CFI). For

researchers who adhere to the principle that TML is the only philosophically defensible assessment

of data-model fit, we also tracked the number of replications in which TML would reject the null at

the 0.05 and 0.01 levels of significance. We only report results for N = 500 in the interest of succinctness of presentation, although similar patterns of results hold for other sample sizes as

well.

Misspecified Model

Table 1 presents the percentage of models declared poorly fitting for the misspecified model.

In Table 1, each row represents a different standardized loading condition. Each column represents

the percentage of the time in which the particular fit criteria reported that the model did not fit the

data well. In Table 1, the values are expected to be rather high as the misspecification should be

detected a large portion of the time. The 0.70 loading row is bolded to denote the condition that

mirrors conditions used in HB.

TABLE 1 ABOUT HERE

From Table 1, it can be seen that AFI values were beyond the “unambiguously good” cut-off

(the leftmost column for each index) 100% of the time when the standardized loadings were 0.70

(values closer to 100% mean that AFIs identify that the model did not fit well). As an example, the

100% value in the 0.04 column and 0.70 row for RMSEA means that 100% of the models fit to the

simulated data returned RMSEA values greater than 0.04. TML also rejected 100% of models at

both the 0.05 and 0.01 levels when the loadings were 0.70. Using the HB recommendations (the

middle column for each index), the misspecified model always had AFI values beyond the cut-off

for RMSEA and CFI while SRMR was beyond the cut-off about half the time (SRMR is lower

because it tends to be less sensitive to the type of misspecification used and the misspecification

was not severe; Fan & Sivo, 2007). This confirms that the cut-offs recommended by HB indeed

perform well when the standardized loadings are 0.70 as was demonstrated in their study.

Essentially none of the fitted models had an RMSEA above 0.20, an SRMR above 0.14, or a CFI below 0.775 when the loadings were 0.70, meaning that these values indicate excessively poor fit, as expected: although the models are noticeably misspecified, the misspecification is moderate, and the AFIs therefore do not reach such seemingly extreme values. Despite the fact that

these general guidelines have been ported to SEM broadly, this interpretation of fit index values is

only applicable when the standardized loadings are 0.70.

Consider the exact same model, featuring the exact same misspecification, with the exact

same sample size but now the standardized loadings are 0.40 instead of 0.70. Now, close to 95%

of the replications have an RMSEA less than 0.04 and all the models have an RMSEA below 0.06

which indicates good fit based on HB cut-offs. Consider what this means – recall that with

standardized loadings equal to 0.70, none of the 1000 fitted models have an RMSEA value above

0.20. This indicates that if a model yields an RMSEA of 0.20 in practice, then this would be

indicative of excessively poor fit because 0 of the 1000 misspecified models output an RMSEA

value that high. With loadings of 0.40, none of the models have an RMSEA above 0.06. By

similar logic, this means 0.06 indicates poor fit with lower standardized loadings; however, for

researchers who routinely rely on HB cut-offs, an RMSEA of 0.06 would indicate good fit and

increase the probability that the theory under investigation would gain traction in the literature.
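For readers who want the mechanics explicit, the RMSEA point estimate is a simple transformation of the TML statistic, its degrees of freedom, and the sample size. A sketch of the standard formula follows; note that some software divides by n rather than n − 1, and the df = 87 in the usage example is an arbitrary illustration rather than a value taken from this study:

```python
def rmsea(t_ml, df, n):
    """Point estimate of RMSEA from the maximum likelihood test
    statistic T_ML, model degrees of freedom, and sample size n."""
    ncp_hat = max(t_ml - df, 0.0)  # estimated noncentrality, floored at zero
    return (ncp_hat / (df * (n - 1))) ** 0.5
```

For example, rmsea(174.0, 87, 500) is roughly 0.045, while any statistic at or below its degrees of freedom yields an RMSEA of exactly 0.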

This phenomenon is not restricted to RMSEA – based on SRMR, a quarter of the

replications would have values below 0.04 and all replications are below the 0.08 HB cut-off when

the standardized loadings are equal to 0.40. As noted by Miles and Shevlin (2007), CFI is the least susceptible, but about 15% of models are above 0.95 and about 5% are above 0.975. Instead of all

replications being rejected by TML, only 56% were rejected at the 0.01 level of significance and

77% were rejected at the 0.05 level of significance, a marked decrease in power compared to the

scenario where the loadings are 0.70, as noted in the derivation by Heene et al. (2011).

Now consider the opposite extreme in Table 1 where the standardized loadings are 0.90

instead of 0.70. With excellent measurement quality, essentially every replication has an

RMSEA value above 0.20, an SRMR value above 0.14, and a CFI value below 0.775, while TML again correctly rejects the model as misfitting. Using RMSEA as an example, this means that when the loadings are 0.90, an RMSEA cut-off of 0.20 can distinguish good-fitting models

from poorly fitting (but only moderately misspecified) models just as well as 0.06 when the

loadings are equal to 0.70. If using the HB cut-offs in practice, however, a model with an RMSEA

of 0.20 would never be considered anywhere close to fitting well.

To depict these results visually, Figure 2 shows the empirical distribution of values for

RMSEA with loading conditions of 0.40, 0.70, and 0.90. The difference between the distributions

is stark, as there is no overlap whatsoever. The 0.70 loading distribution is slightly above the HB cut-off, which makes sense because the misspecification was non-trivial but moderate in

magnitude. On the other hand, the 0.40 loading distribution is completely below the HB cut-off

while the 0.90 loading distribution is almost entirely above an RMSEA of 0.20. Needless to say,

RMSEA functions very differently depending on the measurement quality and the exact same

degree of misspecification can be viewed in a very different light as a result. Similar patterns are

also present in the empirical distribution plots for SRMR and CFI, which are shown in Figure 3.

The SRMR distributions in the left panel display similar behavior to the RMSEA and the CFI

values are still affected, but to a lesser degree as there is some overlap across the factor loadings

conditions (as has been noted in Miles & Shevlin, 2007).

INSERT FIGURE 2 ABOUT HERE

INSERT FIGURE 3 ABOUT HERE

Consider the ramifications of these results – if one has excellent measurement quality, the

HB cut-offs of 0.06/0.08/0.95 for RMSEA/SRMR/CFI could be replaced with values of 0.20/0.14/0.775 and one would still be able to identify nearly all moderately misspecified models; the

classification accuracy remains unblemished. Therefore, based on this model, if the standardized

loadings were 0.90 and one obtained AFI values of 0.12/0.12/0.85, a researcher could be confident that the model fits approximately, possibly containing only trivial misspecifications of the same magnitude as a 0.06/0.08/0.95 trio of fit indices with standardized loadings equal to 0.70, but no moderate or severe misspecifications, even though these values are conventionally thought to indicate unambiguously poor fit. As a rhetorical argument, we challenge readers to consider the last

time they saw a study confidently report 0.12/0.12/0.85 for RMSEA/SRMR/CFI in a positive light

in a top-tier outlet. However, based on the logic of AFI cut-offs, these values are just as good at

discriminating between good and bad models with loadings equal to 0.90 as HB cut-offs are for

models with loadings equal to 0.70.

More problematically, consider the case of poorer measurement quality – even a

0.04/0.04/0.975 trio of RMSEA/SRMR/CFI values does not guarantee much in terms of the model

fitting well despite the fact that most researchers would be pleased to achieve these AFI values for

their model and studies with less than enviable measurement quality are routinely published with

less reassuring fit values. Furthermore, TML has far less power to detect misspecifications under

such circumstances. The ramifications of this result are that many models that appear to fit well

based on the HB cut-offs are based on theories that, in actuality, may not be well supported by the

data if more nuanced assessments of data-model fit were employed or if these studies featured

more rigorous measurement models. Conversely, many models may appear to fit poorly and be

disregarded but may actually fit well if the quality of measurement is strong.

The Issue for Empirical Researchers

In essence, even though researchers strive to measure their latent variables with the highest

quality manifest variables, relying on strict AFI cut-offs to judge data-model fit ends up punishing

this diligence and rewarding studies whose models feature much poorer measurement quality

when a single cut-off is applied broadly. Stated more drastically, the meaning of good data-model

fit changes as a function of measurement quality even if all other conditions are held constant; for example, an RMSEA value of 0.06 can be considered poor fit with low measurement quality,

adequate fit with loadings near 0.70 as in HB, or great fit with high measurement quality.

Moreover, although the criterion for determining good or bad fit with TML is unaltered under

different loading conditions (i.e., the interpretation of inferential tests is consistent), TML is far less

powerful with poorer measurement quality. Somewhat ironically, measurement quality with AFIs

is rather analogous to sample size with TML – with sample size being the primary issue AFIs were

designed to mitigate.

With TML, a more sound methodological design (i.e., large sample size) makes good data-

model fit more difficult to obtain while good fit can be achieved more readily under the less

desirable condition of a smaller sample size, holding all else constant. Similarly with AFIs, good

data-model fit is difficult to achieve with more sound methodological designs (i.e., better

measurement quality) but less desirable design conditions (i.e., poorer measurement quality) will

result in better fit if all else is constant. At a basic level, using AFIs instead of TML effectively

trades problems associated with sample size for problems associated with measurement quality.

Empirical researchers are aware of the perils of sample size with TML, yet most are unacquainted

with the issues associated with measurement quality with AFIs.

To highlight the broader issue with an analogy, imagine a researcher fits a model to a sample

size of 25 (casting issues related to estimation difficulties aside) and reports a non-significant TML

value, concluding that the model fits well and that the theory is upheld. Reviewers and critical

readers would immediately cast doubt about this conclusion and would instinctively note that the

small sample size renders the inferential TML essentially powerless to detect violations unless they

are massive. Conversely, if a researcher presented a model with a sample of 25,000 and a

significant TML test, many reviewers and critical readers would note that the model may still fit reasonably well and that the TML test is vastly overpowered in this context and may be detecting

trivial differences.
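The sample-size analogy above can be made concrete with a MacCallum-style power computation for the chi-square test of exact fit. The sketch below is illustrative only: the degrees of freedom (df = 87) and the fixed population discrepancy (f0 = 0.02) are our own assumed values rather than quantities from this study, and a simple Monte Carlo stands in for closed-form noncentral chi-square probabilities:

```python
import random

random.seed(1)

def chisq_draw(df, ncp=0.0):
    """One draw from a (noncentral) chi-square distribution with df
    degrees of freedom and noncentrality parameter ncp."""
    delta = (ncp / df) ** 0.5  # spread the noncentrality evenly across terms
    return sum((random.gauss(0.0, 1.0) + delta) ** 2 for _ in range(df))

def tml_power(n, df, f0, alpha=0.05, reps=5000):
    """Monte Carlo power of the likelihood ratio test when the
    population discrepancy function value f0 is held fixed."""
    # Critical value estimated from the central (null) distribution
    null = sorted(chisq_draw(df) for _ in range(reps))
    crit = null[int((1 - alpha) * reps)]
    # Under misfit f0, the statistic is approximately noncentral
    # chi-square with noncentrality (n - 1) * f0
    ncp = (n - 1) * f0
    rejections = sum(chisq_draw(df, ncp) > crit for _ in range(reps))
    return rejections / reps
```

Holding the misfit fixed, tml_power(25, 87, 0.02) stays near the nominal alpha level while tml_power(5000, 87, 0.02) is essentially 1, mirroring the n = 25 versus n = 25,000 contrast described above.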

This exact same scenario exists with measurement quality although it has largely gone

unnoticed in empirical literatures up to this point. If the standardized loadings are low, AFIs may

appear to indicate great fit, but readers should question whether the model actually fits or whether the fit assessment is simply too underpowered to detect meaningful misfit. By a similar token, if

measurement quality is high, seemingly poor AFIs may be attributable to either (1) poor fit or (2)

a model that is overpowered and is detecting trivial discrepancies between the model and the data.

Although it is the current state of affairs, examining AFIs without taking measurement quality into

account is as egregious as interpreting p-values without taking sample size into account – a TML p-

value means very different things with 100 versus 10,000 people just as an RMSEA of 0.06 means

something very different with high or low measurement quality. Conclusions from studies with

small samples are questioned (as they should be) but studies with poor measurement quality rarely

receive the same type of treatment with respect to data-model fit, despite the fact that AFIs with poor measurement quality are as underpowered as TML with small samples (and as overpowered with good measurement quality as TML is with large samples). More bluntly, the exercise of

comparing AFIs to a single, predetermined cut-off is akin to interpreting p-values as if every

dataset had the exact same sample size.

Conclusions and Recommendations for Practice

It is increasingly clear that no single cut-off value for any particular AFI can be broadly

applied across latent variable models. At this point, readers may hope to see revised

recommendations for adjudicating fit across a broader set of circumstances (measurement quality

in particular); although this is a logical next step, we refrain from doing so in an attempt to discourage the overgeneralizations that have run rampant in assessments of data-model fit. Even if updated recommendations were provided to account for varying levels of measurement quality, these

recommendations would be just as susceptible as the original recommendations to factors that

obscure AFI interpretation and comparability such as model complexity, the number of indicators

per factor, and sample size.

Although the atmosphere appears to be changing as methodological research continues to

expose weaknesses with popular recommendations, current practice largely still operates under the

assumption that there is a single cut-off that can be used for each fit-index, which is akin to

recommending a single sample size to achieve adequate power across all statistical models.

Imagine a scenario in which all studies with, say, 200 or fewer observations were considered

to be underpowered. Yet, if one were researching a phenomenon with a large effect size, a sample

size of, say, 50 might be more than sufficient to detect true differences but, because the sample

size was below 200, the study might be poorly received in this hypothetical universe. On the other

hand, if the phenomenon of interest had a small, but non-zero, effect size, a sample size potentially

much larger than 200 would be needed to detect that a difference exists; however, many studies

would conclude that there are no significant differences if researchers repeatedly tested samples of

200 in accordance with the recommendation. By comparison, if one is researching a construct for

which very high measurement quality can be obtained, one need not subscribe to such stringent

AFI criteria to be confident that the model is free of non-trivial misspecifications. Conversely,

if one is researching a construct that cannot be measured very reliably, the currently employed cut-

offs are not suitable and would be very likely to overlook potentially meaningful misspecifications

in the model.

With regard to the reporting of results in empirical studies, it is common to report fit and

paths/correlations of the structural model but often studies do not report the standardized loadings

from the latent variables to the manifest variables (only 31% of reviewed studies reported the standardized loadings in Jackson et al., 2009). We are cognizant of space limitations in journals

that publish empirical studies and the fact that many studies test multiple models, so reporting

each individual loading for each model may not be feasible (although this is the preferable option

if possible). However, as illustrated in the above simulation, it is vital to have a general idea of

the values of the standardized loadings in order to assess AFIs because, without this context, the

values of the AFIs are uninterpretable. We encourage applied researchers to report some type of

information pertaining to standardized loadings of the latent variables in the final model to help

contextualize the AFIs (e.g., mean, median, range). Alternatively, a measure of reliability that is

based on the magnitude of the standardized loadings such as McDonald’s omega for construct

reliability (McDonald, 1970) or coefficient H for maximal reliability (Hancock & Mueller, 2001)

may be able to succinctly provide such a contextualization.² Otherwise, readers, reviewers, and

editors have essentially no information upon which to judge the fit of the model and an SRMR

value of 0.07, for example, can be interpreted many different ways conditional on measurement

quality. This does not solve the issue related to the reliability paradox, but it would help

researchers to be more upfront about the conditions from which their AFIs come. Additionally, it

would reward researchers who exercised due diligence to construct more reliable measures and

temper some of the inappropriate claims made from data with low reliability.
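Both coefficients mentioned above are simple functions of the standardized loadings. A minimal sketch, assuming a single factor whose indicators are standardized (so each indicator's error variance is 1 − λ²):

```python
def mcdonalds_omega(loadings):
    """McDonald's omega: construct reliability of the unit-weighted sum
    of indicators, from standardized loadings on a single factor."""
    s = sum(loadings)
    err = sum(1.0 - l ** 2 for l in loadings)  # standardized error variances
    return s ** 2 / (s ** 2 + err)

def coefficient_h(loadings):
    """Coefficient H: maximal reliability of the optimally weighted
    composite, from standardized loadings on a single factor."""
    s = sum(l ** 2 / (1.0 - l ** 2) for l in loadings)
    return s / (1.0 + s)
```

With five indicators per factor, as in the simulation, mcdonalds_omega([0.7] * 5) and coefficient_h([0.9] * 5) round to the 0.83 and 0.96 reported in footnote 2; with equal loadings the two coefficients coincide, and coefficient H exceeds omega when loadings are unequal.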

As a final note to put the implications of these findings into perspective, consider again the

two sets of AFIs mentioned near the beginning of the paper. As a reminder, in Model A, RMSEA

= 0.040, SRMR = 0.040, and CFI = 0.975; and in Model B, RMSEA = 0.20, SRMR= 0.14, and

CFI = 0.775. Under current practice, where the HB criteria have become common reference points, Model A would be universally seen as fitting the data better than Model B, which would likely be desk-rejected at many reputable journals.

² We acknowledge that standardized loadings are not interchangeable with reliability, but they are related and may serve as a fair approximation that is concise to report. If it helps contextualize this study, the coefficient H values for the 0.40, 0.70, and 0.90 loading conditions were 0.48, 0.83, and 0.96, respectively; the McDonald's omega values for these conditions were 0.49, 0.83, and 0.96, respectively.

However, if one does not somehow condition on

measurement quality, this assertion can be highly erroneous. If the factor loadings in Model A had

standardized values of 0.40 and the factor loadings in Model B had standardized values of 0.90,

Model B actually indicates better data-model fit and has higher power to detect the same moderate

misspecification in the same model based on the results of our illustrative simulation study

(assuming multivariate normality). Returning to Table 1, about 25% of moderately

misspecified models produced SRMR below 0.04, about 5% of models resulted in CFI values

below 0.975, and nearly 95% of models produced an RMSEA value below 0.04 with poor

measurement quality. Conversely, with excellent measurement quality, essentially none of the

misspecified models produced an SRMR value less than 0.14, an RMSEA value less than 0.20, or

a CFI value less than 0.775. Even though the AFI values of Model B appear quite poor at first

glance, under certain conditions, even these seemingly unsatisfactory values could indicate

acceptable fit with possibly only trivial misspecifications present in the model. More importantly,

the seemingly poor Model B AFI values better classify models with excellent measurement quality

compared to the seemingly pristine Model A AFI values when measurement quality is poor. To

put the thesis of this paper into a single sentence, information about the quality of the

measurement must be reported along with AFIs in order for the values to have any interpretative

value.


References

Allan, N. P., Lonigan, C. J., & Phillips, B. M. (2015). Examining the factor structure and

structural invariance of the PANAS across children, adolescents, and young adults. Journal of

Personality Assessment, 97, 616-625.

Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A

review and recommendations. The Leadership Quarterly, 21, 1086-1120.

Bardeen, J. R., Fergus, T. A., Hannan, S. M., & Orcutt, H. K. (2016). Addressing psychometric

limitations of the Difficulties in Emotion Regulation Scale through item modification. Journal of

Personality Assessment, 98, 298-309.

Barrett, P. (2007). Structural equation modelling: Adjudging model fit. Personality and

Individual Differences, 42, 815-824.

Beauducel, A., & Wittmann, W. W. (2005). Simulation study on fit indexes in CFA based on data with slightly distorted simple structure. Structural Equation Modeling, 12, 41-75.

Bentler, P. M. (2007). On tests and indices for evaluating structural models. Personality and Individual Differences, 42, 825-829.

Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of

covariance structures. Psychological Bulletin, 88, 588-606.

Browne, M. W., MacCallum, R. C., Kim, C. T., Andersen, B. L., & Glaser, R. (2002). When fit

indices and residuals are incompatible. Psychological Methods, 7, 403-421.

Chen, F., Curran, P. J., Bollen, K. A., Kirby, J., & Paxton, P. (2008). An empirical evaluation of

the use of fixed cutoff points in RMSEA test statistic in structural equation models. Sociological

Methods & Research, 36, 462-494.

Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing

measurement invariance. Structural Equation Modeling, 9, 233-255.

Cole, D. A., & Preacher, K. J. (2014). Manifest variable path analysis: Potentially serious and

misleading consequences due to uncorrected measurement error. Psychological Methods, 19, 300-

315.

Credé, M., & Harms, P. D. (2015). 25 years of higher‐order confirmatory factor analysis in the

organizational sciences: A critical review and development of reporting

recommendations. Journal of Organizational Behavior, 36, 845-872.

Demianczyk, A. C., Jenkins, A. L., Henson, J. M., & Conner, B. T. (2014). Psychometric

evaluation and revision of Carver and White's BIS/BAS scales in a diverse sample of young

adults. Journal of Personality Assessment, 96, 485-494.


Fan, X., & Sivo, S. A. (2005). Sensitivity of fit indexes to misspecified structural or measurement

model components: Rationale of two-index strategy revisited. Structural Equation

Modeling, 12, 343-367.

Fan, X., & Sivo, S. A. (2007). Sensitivity of fit indices to model misspecification and model

types. Multivariate Behavioral Research, 42, 509-529.

Fergus, T. A., Valentiner, D. P., McGrath, P. B., Gier-Lonsway, S. L., & Kim, H. S. (2012). Short

forms of the social interaction anxiety scale and the social phobia scale. Journal of Personality

Assessment, 94, 310-320.

Hancock, G. R., & Mueller, R. O. (2001). Rethinking construct reliability within latent variable systems. In R. Cudeck, S. du Toit, & D. Sörbom (Eds.), Structural equation modeling: Present and future—A festschrift in honor of Karl Jöreskog (pp. 195–216). Lincolnwood, IL: Scientific Software International.

Hancock, G. R., & Mueller, R. O. (2011). The reliability paradox in assessing structural relations within covariance structure models. Educational and Psychological Measurement, 71, 306-324.

Hayduk, L. A. (2014). Shame for disrespecting evidence: the personal consequences of

insufficient respect for structural equation model testing. BMC Medical Research

Methodology, 14, 124.

Hayduk, L. A., & Glaser, D. N. (2000). Jiving the four-step, waltzing around factor analysis, and

other serious fun. Structural Equation Modeling, 7, 1-35.

Hayduk, L., Cummings, G., Boadu, K., Pazderka-Robinson, H., & Boulianne, S. (2007). Testing!

Testing! One, two, three–testing the theory in structural equation models!. Personality and

Individual Differences, 42, 841-850.

Heene, M., Hilbert, S., Draxler, C., Ziegler, M., & Bühner, M. (2011). Masking misfit in

confirmatory factor analysis by increasing unique variances: a cautionary note on the usefulness of

cutoff values of fit indices. Psychological Methods, 16, 319-336.

Hu, L. T., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to

underparameterized model misspecification. Psychological Methods, 3, 424-453.

Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:

Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.

Ion, A., Iliescu, D., Aldhafri, S., Rana, N., Ratanadilok, K., Widyanti, A., & Nedelcea, C. (2016).

A cross-cultural analysis of personality structure through the lens of the HEXACO model. Journal of

Personality Assessment, OnlineFirst.

Jackson, D. L., Gillaspy Jr, J. A., & Purc-Stephenson, R. (2009). Reporting practices in

confirmatory factor analysis: an overview and some recommendations. Psychological

Methods, 14, 6-23.


Joshanloo, M. (2016). Factor structure of subjective well-being in Iran. Journal of Personality

Assessment, 98, 435-443.

Jöreskog, K. G., & Sörbom, D. (1981). LISREL V: Analysis of linear structural relationships by

maximum likelihood and least squares methods. University of Uppsala, Department of Statistics.

Kang, Y., McNeish, D.M., & Hancock, G. R. (2016). The role of measurement quality on

practical guidelines for assessing measurement and structural invariance. Educational and

Psychological Measurement, 76, 533-561.

Marsh, H. W., Hau, K. T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis

testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and

Bentler's (1999) findings. Structural Equation Modeling, 11, 320-341.

McDonald, R. P. (1970). The theoretical foundations of principal factor analysis, canonical factor

analysis, and alpha factor analysis. British Journal of Mathematical and Statistical

Psychology, 23, 1-21.

McIntosh, C. (2007). Rethinking fit assessment in structural equation modelling: A commentary

and elaboration on Barrett (2007). Personality and Individual Differences, 42, 859-867.

Michel, J. S., Pace, V. L., Edun, A., Sawhney, E., & Thomas, J. (2014). Development and

validation of an explicit aggressive beliefs and attitudes scale. Journal of Personality

Assessment, 96, 327-338.

Miles, J., & Shevlin, M. (2007). A time and a place for incremental fit indices. Personality and

Individual Differences, 42, 869-874.

Millsap, R. E. (2007). Structural equation modeling made difficult. Personality and Individual

Differences, 42, 875-881.

Mulaik, S. (2007). There is a place for approximate fit in structural equation modelling.

Personality and Individual Differences, 42, 883-891.

Rice, K. G., Richardson, C. M., & Tueller, S. (2014). The short form of the revised almost perfect

scale. Journal of Personality Assessment, 96, 368-379.

Saris, W. E., & Satorra, A. (1992). Power evaluations in structural equation models. In K. A. Bollen & S. Long (Eds.), Testing structural equation models (pp. 181-204). London, England: Sage.

Saris, W. E., Satorra, A., & Van der Veld, W. M. (2009). Testing structural equation models or detection of misspecifications? Structural Equation Modeling, 16, 561-582.

Savalei, V. (2012). The relationship between root mean square error of approximation and model

misspecification in confirmatory factor analysis models. Educational and Psychological

Measurement, 72, 910-932.


Steiger, J. H. (2007). Understanding the limitations of global fit assessment in structural equation

modeling. Personality and Individual Differences, 42, 893-898.

Steiger, J. H. (2000). Point estimation, hypothesis testing, and interval estimation using the

RMSEA: Some comments and a reply to Hayduk and Glaser. Structural Equation Modeling, 7,

149-162.

Steiger, J.H. & Lind, J.C. (1980, May). Statistically-based tests for the number of common

factors. Paper presented at the annual meeting of the Psychometric Society, Iowa City, IA.

Tomarken, A. J., & Waller, N. G. (2003). Potential problems with "well fitting" models. Journal

of Abnormal Psychology, 112, 578-598.

You, J., Leung, F., Lai, K. K. Y., & Fu, K. (2013). Factor structure and psychometric properties of

the Pathological Narcissism Inventory among Chinese university students. Journal of Personality

Assessment, 95, 309-318.


Table 1

Percentage of the replications declared poorly fitting for various loading magnitudes under different criteria.

            RMSEA                SRMR                 CFI                TML
Loading | 0.04  0.06  0.20 | 0.04  0.08  0.14 | 0.975 0.950 0.775 | 0.05  0.01
 0.40   |  5.3   0.0   0.0 | 74.3   0.0   0.0 |  95.3  84.2   0.6 | 77.4  56.3
 0.50   | 84.3   1.7   0.0 | 99.3   0.0   0.0 | 100    99.4   0.3 | 88.7  96.6
 0.60   | 99.4  91.3   0.0 | 100    0.3   0.0 | 100   100     0.2 | 99.8 100
 0.70*  | 100   100    0.0 | 100   48.7   0.0 | 100   100     1.1 | 100  100
 0.80   | 100   100    0.0 | 100   99.9   0.5 | 100   100    19.9 | 100  100
 0.90   | 100   100   99.8 | 100  100    94.8 | 100   100    99.3 | 100  100

Note: The column headings are the hypothetical AFI cut-off values (and, for TML, the significance levels) to which the percentage of rejected models corresponds. The asterisked row (0.70) represents the condition used in Hu and Bentler (1999).


Figure 1. Confirmatory factor analysis model used in Hu and Bentler (1999): three correlated factors (ξ1, ξ2, ξ3) measured by fifteen indicators (X1 through X15). As in HB, in the population model for the simulation, ϕ12 = 0.50, ϕ13 = 0.40, and ϕ23 = 0.30.


Figure 2. Empirical distribution of RMSEA for selected factor loading conditions. The vertical

black line corresponds to the HB cut-off of 0.06. Values to the right of the vertical black line

would be classified as having poor data-model fit; values to the left, as having good data-model fit.


Figure 3. Empirical distribution of SRMR (left panel) and CFI (right panel) for selected factor

loading conditions. The vertical black line corresponds to the HB cut-off of 0.08 for SRMR and

0.95 for CFI. Values to the right of the vertical black line would be classified as having poor data-model fit; values to the left, as having good data-model fit.