Thanks Coefficient Alpha, We’ll Take it From Here
Daniel McNeish
Utrecht University &
University of North Carolina, Chapel Hill
Daniel MCNEISH is now at the University of North Carolina, Chapel Hill; 100 E. Franklin
Street Suite 200, Chapel Hill, NC, USA, 27599. Email: He wrote
the first version of this paper while an assistant professor in the Department of Methodology
and Statistics, Utrecht University, the Netherlands. All subsequent revisions were completed
at UNC. The information in this paper has not been previously disseminated at a conference
or electronically.
I am indebted to Denis Dumas, Greg Hancock, Katherine Muenks, and Kathryn Wentzel for
conversations that inspired the motivation for this paper. I especially would like to thank
Gjalt-Jorn Peters for his expertise and assistance with the R code included in this paper.
Empirical studies in psychology commonly report Cronbach's alpha as a measure of internal
consistency reliability despite the fact that many methodological studies have shown that
Cronbach's alpha is riddled with problems stemming from unrealistic assumptions. In many
circumstances, violating these assumptions yields estimates of reliability that are too small,
making measures look less reliable than they actually are. Although methodological critiques
of Cronbach's alpha are being cited with increasing frequency in empirical studies, in this
tutorial we discuss how the trend is not necessarily improving methodology used in the
literature. That is, many studies continue to use Cronbach's alpha without regard for its
assumptions or merely cite methodological papers advising against its use to rationalize
unfavorable Cronbach's alpha estimates. This tutorial first provides evidence that
recommendations against Cronbach’s alpha have not appreciably changed how empirical
studies report reliability. Then, we summarize the drawbacks of Cronbach's alpha
conceptually without relying on mathematical or simulation-based arguments so that these
arguments are accessible to a broad audience. We continue by discussing several alternative
measures that make less rigid assumptions and can provide justifiably higher estimates of
reliability than Cronbach’s alpha. We conclude with empirical examples to illustrate
advantages of alternative measures of reliability including omega total, Revelle’s omega
total, the greatest lower bound, and Coefficient H. A detailed software appendix is also
provided to help researchers implement alternative methods.
Thanks Coefficient Alpha, We’ll Take it From Here
In many areas of psychology and in the behavioral sciences more broadly, variables that
are of interest (e.g., motivation, depression, cognitive abilities) are not directly observable and
are therefore measured with scales or instruments comprised of a set of items. These items indirectly
measure the variable of interest by inferring that some underlying construct manifests itself
through these items. For example, an MRI study cannot directly measure the amount of
extraversion present in a person’s brain. Rather, items are created and administered to an
individual. If the individual has high extraversion, this trait manifests itself through certain
responses to the items.
Because most measurement in psychology is done through the use of indirect
measurement tools, researchers often report a measure of reliability to demonstrate that the items
composing the measure are reliable, meaning that the scores based on the items are reasonably
consistent, the responses to the scale are reproducible, and that responses are not simply
comprised of random noise. Put another way, a reliability analysis provides evidence that the
scale is consistently measuring the same thing (although this is distinct from concluding that the
scale is measuring the intended construct, which is a question of scale validity).
In psychology studies, the most commonly used reliability index, by a wide margin, is
Cronbach’s alpha. In a review of reliability reporting practices conducted by Hogan, Benjamin,
and Brezinski (2000), about two-thirds (66%) of studies reporting a reliability measure selected
Cronbach’s alpha. Of those reporting a type of reliability that requires only a single
administration (e.g., not test-retest or interrater reliability), 87% (548 out of 633) reported
Cronbach’s alpha (or the KR-20, which is a special case of alpha where all items are binary;
Crocker & Algina, 2008). Indeed, Cronbach’s alpha can be universally found in the pages of
psychology journals in any subfield. As of October 2014, the seminal Cronbach (1951) paper
that first introduced Cronbach’s alpha was the 64th most cited English language research paper
on Google Scholar in any field and, within psychology, is only surpassed by the paper of Baron
and Kenny (1986) on mediation and moderation and the seminal paper of Bandura (1977) on
self-efficacy (van Noorden, Maher, & Nuzzo, 2014). In the last 20 years, however, many
methodological articles have appeared which question how Cronbach’s alpha is applied (Bentler,
2007; Crutzen, 2007; Crutzen & Peters, 2015; Cortina, 1993; Dunn, Baguley, & Brunsden, 2014;
Geldhof, Preacher, & Zyphur, 2014; Graham, 2006; Green & Hershberger, 2000; Green & Yang,
2009a, 2009b; Peters, 2014; Raykov, 1997a,1997b, 1998, 2004; Raykov & Shrout, 2002; Revelle
& Zinbarg, 2009; Schmitt, 1996; Sijtsma, 2009; Teo & Fan, 2013; Yang & Green, 2011;
Zinbarg, Yovel, Revelle, & McDonald, 2006; Zinbarg, Revelle, Yovel, & Li, 2005). These
articles argue that the assumptions made by Cronbach’s alpha are commonly violated in types of
data and models with which psychological researchers work. These arguments have led to the
development of alternative reliability measures whose assumptions are more in-line with
psychological data (Hancock & Mueller, 2001; Jackson & Agunwamba, 1977; McDonald, 1970,
1999; Revelle, 1979). Software routines for calculating these measures are also available in R
packages such as MBESS (Kelley, 2007), psych (Revelle, 2008), or the scaleStructure
function in the userfriendlyscience package (Peters, 2014).
The articles to which we referred in the previous paragraphs are actually fairly well-
known, even among non-methodological researchers. For instance, based on Google Scholar
citation counts, Sijtsma (2009) has over 800 citations, Zinbarg et al. (2005) over 450, Hancock
and Mueller (2001) almost 400, Yang and Green (2011) over 125, and Dunn et al. (2014) over
100 as of October 2016. Although such seemingly high awareness of issues with Cronbach’s
alpha appears reassuring, it does not appear that there have been substantial changes in the use of
Cronbach's alpha.
To provide evidence for this claim and to show the enduring status of Cronbach’s alpha,
we reviewed articles in three flagship APA journals from educational psychology (the Journal of
Educational Psychology; JEP), social psychology (the Journal of Personality and Social
Psychology; JPSP), and clinical psychology (the Journal of Abnormal Psychology, JAP) from
January 2014 until October 2016. We located studies through Google Scholar by searching for
the string “reliability” within these journals. This resulted in 369 total studies (131 from JEP,
118 from JPSP, and 120 from JAP). We filtered out studies that reported types of reliability that
are not of interest to this paper (e.g., interrater reliability), studies where reliability only
appeared in the references, or where reliability was not used in a psychometric sense. This
netted 118 total studies (52 from JEP, 31 from JPSP, and 35 from JAP). Of these 118 studies,
109 (92%) solely used Cronbach’s alpha to assess reliability of the scales used in their study
while 9 (8%) reported an alternative reliability measure either by itself or in addition to
Cronbach’s alpha. Despite the large number of citations of articles calling for alternative
reliability measures, reliability reporting in these flagship APA journals (which have stringent
methodological requirements) appears unchanged from the results reported in the Hogan et al.,
(2000) review. In fact, the aforementioned studies advising against Cronbach’s alpha were nearly
invisible in these APA journals. For example, none of the five aforementioned, highly-cited
papers which advocate for alternative measures were cited more than once each in the 118
reviewed papers.
This evidence suggests that researchers continue to almost exclusively rely on
Cronbach’s alpha as a measure of scale reliability. The pattern that methodological studies are
well-cited but do not appear in flagship journals may suggest that researchers are aware of the
issues with Cronbach’s alpha but are reluctant to adopt new methods because these methods are
not as widely known or accepted, that reviewers may not be familiar with the alternative
methods, that the editorial process does not require more rigorous methods so researchers do not
invest time to learn them, or that researchers are unsure how to obtain estimates of alternative
measures for their data because many are not offered in popular general software packages like
SPSS, SAS, or Stata. This also suggests that the more rigorous methodological work advising
against Cronbach’s alpha has not impacted psychologists as much as it has psychometricians or
statisticians working in psychological domains. Sijtsma (2009) aptly summarizes this by stating,
“while much of Cronbach’s paper was and still is accessible to many psychologists, the work by
Lord, Novick, and Lewis and many others since may have gone unnoticed by most
psychologists. This is truly an example of the gap that has grown between psychometrics and
psychology and that prevents new and interesting psychometric results” (p. 115).
Though it appears promising that methodological papers are highly cited, there is limited
evidence that the findings, conclusions, and recommendations are being incorporated in
empirical studies. This may be taken to suggest that these studies are either being misinterpreted
or not being read in their entirety, possibly because many appear in journals that are aimed at
methodologists and statisticians and therefore may be written at too technical a level for empirical
researchers with less quantitative training to fully benefit from the arguments being presented.
Consistent with recent recommendations from Sharpe (2013) concerning bridging innovations in
the use of statistical methods in psychology to empirical researchers, the aim of this tutorial
paper is to state as plainly and succinctly as possible why Cronbach’s alpha is often
inappropriate in empirical contexts and why researchers would benefit from abandoning
Cronbach’s alpha in favor of alternative measures. Though there are many resources for readers
capable of following mathematically-based arguments, far fewer resources exist for the large
number of psychological researchers operating below such a level of mathematical
sophistication. As such, the scope of this paper is intended to be very broad to elucidate the
general idea that widespread adoption and continued use of Cronbach’s alpha is detrimental. We
heavily cite previous work in this area that can provide additional technical or nuanced detail on
the issues discussed herein.
To outline this paper, we first discuss the basics behind Cronbach’s alpha including the
restrictive assumptions that often obviate its use. We then overview some of the more
conceptually clear, leading alternatives that can be employed to yield better estimates of
reliability than Cronbach’s alpha. This is followed by a brief comparison of scenarios in which
these alternatives have specific advantages and disadvantages. Rather than lay out mathematical
or logical arguments for why Cronbach’s alpha should not be used as has been the primary
method of previous papers on the topic, we demonstrate some of the issues with Cronbach’s
alpha using example analyses from publicly available datasets. We end with a discussion of why
prolonged use of Cronbach’s alpha is detrimental and how alternative measures are better suited
to accomplish the same goal, often to researchers’ benefit. We provide a heavily annotated
software appendix to help readers employ these methods in their own research so that they can
abandon Cronbach’s alpha in favor of better alternatives.
Basics of Reliability and Cronbach’s Alpha
From a theoretical standpoint, some observed score X for a trait or construct is considered
to have two latent components: the true component T and an error component E, such that
X = T + E. From a classical test theory perspective (Novick & Lewis, 1967), reliability is
considered to be greater when the variance of the true score component accounts for a higher
proportion of variance in the observed scores relative to the variance attributable to the error
component. More formally, reliability is defined by the ratio of the true score variance to the
observed score variance, ρ_XX' = Var(T) / Var(X). Under this more formal definition, reliability
can also be interpreted as the correlation between scores on two consecutive administrations,
assuming the respondent does not recall their answers from the first administration (hence the
choice of ρ as the symbol for reliability).
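The ratio definition above can be illustrated with a short simulation. This sketch is ours (the paper's software appendix uses R; we use Python here purely for illustration), and the variance values are arbitrary assumptions: with Var(T) = 4 and Var(E) = 1, the reliability ratio should approach 4/5 = 0.8.

```python
# Illustrative sketch (ours, not from the paper's R appendix): simulate
# X = T + E and check that Var(T)/Var(X) recovers the reliability ratio.
# The variance values below are arbitrary assumptions.
import random

random.seed(1)
n = 100_000
var_t, var_e = 4.0, 1.0                      # assumed true-score and error variances

t = [random.gauss(0, var_t ** 0.5) for _ in range(n)]
x = [ti + random.gauss(0, var_e ** 0.5) for ti in t]

def var(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

rel = var(t) / var(x)                        # rho_XX' = Var(T) / Var(X)
print(round(rel, 2))                         # should be near 4 / (4 + 1) = 0.8
```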
Although the definition of reliability is relatively straightforward, obtaining an estimate
of reliability is not always so easy. Historically, many methods for assessing reliability (parallel
forms, test-retest, test-retest with parallel forms; Crocker & Algina, 2008) required multiple test
administrations which were then correlated to form an estimate of reliability. Due to logistical
issues of multiple administrations, the ability to calculate reliability from a single test
administration was highly desirable. Cronbach (1951) addressed this in his seminal paper on
internal consistency reliability, the type of reliability on which this paper focuses. Rather than
inspecting the correlation between separate administrations, internal consistency reliability
inspects the relation of each item to all other items from a single administration. If respondents
provide similar answers to a set of items, then their responses would reasonably generalize to
other items from a similar domain, and the set of items would be considered to have high internal
consistency reliability (Crocker & Algina, 2008).
Cronbach’s Alpha
Cronbach’s alpha (Cronbach, 1951) is by far the most common measure of internal
consistency reliability.
Cronbach’s alpha is calculated by

α = (k / (k − 1)) × (1 − Σ σ²_i / σ²_X),

where k is the number of items, σ²_i is the variance of individual item i (i = 1, …, k), and σ²_X is
the variance of the total score across all items on the scale. This formula is often reported in reduced form as

α = k² s̄_ij / s²_X,

where s̄_ij is the mean covariance between all pairs of items on the scale and s²_X is the total score variance (Geldhof
et al., 2014). One can interpret the value of Cronbach’s alpha in one of many different ways:
1. Cronbach’s alpha is the correlation of the scale of interest with another scale of the same
length that intends to measure the same construct, with different items, taken from the
same hypothetical pool of items (Kline, 1986).
2. The square root of Cronbach’s alpha is an estimate of the correlation between observed
scores and true scores (Nunnally & Bernstein, 1994)
3. Cronbach’s alpha is the proportion of the variance of the scale that can be attributed to a
common source (DeVellis, 1991).
4. Cronbach’s alpha is the average of all possible split-half reliabilities from the set of items
(Pedhazur & Schmelkin, 1991).
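The two algebraically equivalent formulas above can be checked numerically. The following is an illustrative Python sketch of our own (the item responses are made-up numbers; the paper's appendix uses R): it computes Cronbach's alpha from the variance form and from the reduced mean-covariance form and confirms they match.

```python
# Our illustrative Python sketch (the paper's appendix uses R): Cronbach's
# alpha computed two ways -- the variance form and the reduced form based on
# the mean inter-item covariance -- which are algebraically identical.
# The item responses below are made-up numbers.

def variance(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

def covariance(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def alpha_variance_form(items):
    # items: list of k lists, one list of responses per item
    k = len(items)
    total = [sum(resp) for resp in zip(*items)]      # total score per respondent
    return (k / (k - 1)) * (1 - sum(variance(it) for it in items) / variance(total))

def alpha_reduced_form(items):
    # alpha = k^2 * (mean inter-item covariance) / (total score variance)
    k = len(items)
    total = [sum(resp) for resp in zip(*items)]
    covs = [covariance(items[i], items[j])
            for i in range(k) for j in range(k) if i != j]
    return k ** 2 * (sum(covs) / len(covs)) / variance(total)

items = [[3, 4, 2, 5, 4, 3],
         [2, 4, 2, 5, 3, 3],
         [3, 5, 1, 4, 4, 2]]
a1, a2 = alpha_variance_form(items), alpha_reduced_form(items)
assert abs(a1 - a2) < 1e-12                          # the two forms agree exactly
```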
Under certain assumptions, Cronbach’s alpha is a consistent estimate of the population
internal consistency; however, these assumptions are quite rigid and are precisely why
methodologists have argued against the use of Cronbach’s alpha (Gignac et al., 2007; Graham,
2006; Novick & Lewis, 1967; Revelle & Zinbarg, 2009; Yang & Green, 2011). (Readers should
note that there are several criticisms of Cronbach’s alpha concerning the degree to which it truly
measures internal consistency; e.g., Revelle & Zinbarg, 2009; Sijtsma, 2009. These arguments
can become rather abstract and theoretical so, given the intent of this paper, we will not delve
into the specifics, and we will use “internal consistency” as a simplification of what Cronbach’s
alpha intends to measure. Do note, however, that Cronbach’s alpha being a true measure of
internal consistency has been called into question on multiple occasions.) The assumptions
of Cronbach’s alpha are:
1. The scale adheres to tau equivalence
2. Scale items are on a continuous scale and normally distributed
3. The errors of the items do not covary
4. The scale is unidimensional
These assumptions have been stated in other locations (e.g., Green & Yang, 2009a; Yang &
Green, 2011) and demonstrated mathematically (e.g., Bentler, 2009; Sijtsma, 2009) but their
importance (and rigidity) may not necessarily be understood or appreciated in empirical work.
The following subsections will expound these assumptions.
Assumption 1: Tau equivalence. Tau equivalence is the statistically precise way to state
that each item on a scale contributes equally to the total scale score. To put this assumption
into perspective, imagine that an exploratory factor analysis is run on the scale and a single
factor is extracted (as a researcher would desire). For the tau equivalence assumption to be
upheld, the standardized factor loadings for each item would need to be nearly identical to all
other items on the scale. Figure 1 below shows what hypothetical SPSS output would look like
for a five item scale that does meet tau-equivalence (left panel) and a scale that does not meet tau
equivalence (right panel).
Figure 1. Hypothetical SPSS exploratory factor analysis output for standardized factor loadings
of a 5 item scale that meets tau equivalence (left) and that does not meet tau equivalence (right)
Tau-equivalence tends to be unlikely for most scales that are used in empirical research:
some items strongly relate to the construct while others are more weakly related. Furthermore, if a
scale captures only a single construct, it is unlikely that all the items devised by researchers
capture the construct to an equal degree (Cortina, 1993; Yang & Green, 2011). Put more
technically, most scales are congeneric (Geldhof et al., 2014; Graham, 2006; Peterson & Kim,
2013) which means that the items measure the same construct, but they do so with different
degrees of precision (Raykov, 1997a). Such disparities between the quality of the individual
items do not mean that the weaker items necessarily need to be removed, but they do violate the
assumptions made by Cronbach’s alpha, with the result being that Cronbach’s alpha will be too
low (Miller, 1995).
In the likely event that the assumption of tau equivalence is violated, Cronbach’s alpha
becomes a lower-bound estimate of internal consistency rather than a true estimate, provided that
errors are reasonably uncorrelated (Graham, 2006; Sijtsma, 2009; Yang & Green, 2011). This
results in Cronbach’s alpha estimates that can vastly underestimate the actual value of reliability
even if just a single item on the scale is responsible for the violation of tau equivalence
(Raykov, 1997b). A simulation by Green and Yang (2009) found that Cronbach’s alpha may
underestimate the true reliability by as much as 20% when tau equivalence is violated (e.g., if the
true reliability is 0.70, Cronbach’s alpha would estimate reliability in the mid-0.50s).
Furthermore, the degree of underestimation is greatest when scales have a fairly small number of
items (e.g., fewer than 10), which is often the case in empirical psychological research (Graham, 2006).
Assumption 2: Continuous Items with Normal Distributions. As noted in discussions
of Equation 1, Cronbach’s alpha is largely based on the observed covariances (or correlations)
between items. In most software implementations of Cronbach’s alpha (such as in SAS and
SPSS), these item covariances are calculated using a Pearson covariance matrix (Gadermann,
Guhn, & Zumbo, 2012). A well-known assumption of Pearson covariance matrices is that all
variables are continuous in nature. Otherwise, the elements of the matrix can be substantially
biased downward (i.e., the magnitudes will be closer to 0 than they should be; Flora & Curran,
2004). However, it is particularly common for psychological scales to contain items that are
discrete (e.g., Likert or binary response scales), which violates this assumption. If discrete items
are treated as continuous, the covariance estimates will be attenuated, which ultimately results in
underestimation of Cronbach’s alpha because the relations between items will appear smaller
than they actually are.
To accommodate items that are not on a continuous scale, the covariances between items
can instead be estimated with a polychoric covariance (or correlation) matrix rather than with a
Pearson covariance matrix. Polychoric covariance matrices assume that there is an underlying
normal distribution to discrete responses. For instance, imagine a three category Likert item
whose response choices consist of Agree, Neutral, and Disagree. A polychoric covariance matrix
first assumes that these response choices map onto a normal distribution whereby there is no
longer three distinct categories but a continuous range of “agreement”. Then thresholds are
estimated which can conceptually be thought of as cut-points on the continuous agreement scale
that separate the response categories. So, respondents at the 40th percentile or below on the
hypothetical agreement continuum may be considered in the “Disagree” category, respondents
between the 40th and 80th percentile on the hypothetical agreement continuum would correspond
to the “Neutral” category, and respondents above the 80th percentile would correspond to the
“Agree” category (the percentile cut-points are estimated and would change for each item).
(Likert scales with many response options can often be treated as continuous without any adverse
effects. The definition of how many response options constitutes “many” has been debated in the
methodological literature. In latent variable models broadly, Rhemtulla, Brosseau-Liard, and
Savalei (2012) recommend 5; in the specific context of Cronbach’s alpha, Gadermann et al.
(2012) recommended 7 response options.)
Provided that it is reasonable to assume that a normal distribution underlies the discrete options,
the polychoric covariance estimates correct the attenuation that occurs when discrete items are
treated as continuous (Carroll, 1961). Gadermann et al. (2012) demonstrate how using a
polychoric covariance matrix with Cronbach’s alpha can address underestimation of reliability
attributable to discrete items.
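The attenuation described above is easy to demonstrate: discretizing two correlated continuous variables into a few ordered categories and then computing an ordinary Pearson correlation shrinks the estimate toward zero. Below is a self-contained Python sketch of our own (the 0.7 correlation and the cut-points are arbitrary assumptions, not values from the paper):

```python
# Our illustration of attenuation (not from the paper): two continuous
# variables correlated about 0.7 are coarsened into three ordered categories;
# the Pearson correlation of the discretized versions is noticeably smaller.
# The correlation and thresholds are arbitrary assumptions.
import random

random.seed(2)
n, rho = 50_000, 0.7
x = [random.gauss(0, 1) for _ in range(n)]
y = [rho * xi + (1 - rho ** 2) ** 0.5 * random.gauss(0, 1) for xi in x]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)
    sa = (sum((u - ma) ** 2 for u in a) / len(a)) ** 0.5
    sb = (sum((v - mb) ** 2 for v in b) / len(b)) ** 0.5
    return cov / (sa * sb)

def likert3(v):
    # crude 3-point "Disagree/Neutral/Agree" coding via fixed cut-points
    return [0 if vi < -0.5 else (1 if vi < 0.5 else 2) for vi in v]

r_cont = pearson(x, y)
r_disc = pearson(likert3(x), likert3(y))
assert r_disc < r_cont   # treating discrete items as continuous attenuates r
```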
Another related and less commonly considered assumption is that both the true scores
and the errors are normally distributed (e.g., van Zyl, Neudecker, & Nel, 2000; Zimmerman,
Zumbo, & LaLonde, 1993). Studies investigating the effect of non-normal distributions on
Cronbach’s alpha have been mixed. Zimmerman et al. (1993) generally conclude that
Cronbach’s alpha is fairly robust to deviation from normality. On the other hand, Sheng and
Sheng (2012) reported that leptokurtic distributions lead to negative bias (i.e., reliability
estimates are too low) while platykurtic distributions lead to positive bias (i.e., reliability
estimates are too high). In the simulation in Sheng and Sheng (2012), these biases dissipated as
sample size and the magnitude of the true reliability increased.
Assumption 3: Uncorrelated errors. Although frequently overlooked (Zumbo & Rupp,
2004), the assumption that errors are uncorrelated is also required when utilizing Cronbach’s
alpha. Correlated errors occur when sources other than the construct being measured cause item
responses to be related to one another. Correlated errors between items may arise for a variety of
reasons including the order of the items on the scale (Cronbach & Shavelson, 2004; Green &
Hershberger, 2000), speeded tests (Rozeboom, 1966), transient responses where feelings or
opinions may change over the course of the scale (Becker, 2000; Green, 2003), or unmodeled
multidimensionality of a scale (Steinberg & Thissen, 1996). Unlike the tau equivalence
assumption, the impact of correlated errors does not necessarily bias Cronbach’s alpha estimates
in a predictable direction, meaning that violations can lead to either overestimates or
underestimates of reliability. When errors are correlated, the correlations are often positive which
will result in Cronbach’s alpha overestimating the reliability (Bentler, 2009; Green &
Hershberger, 2000; Green & Yang, 2009b). When correlated errors are not accounted for in the
calculation of reliability, Cronbach’s alpha can be overestimated by as much as 20% (Gessaroli
& Folske, 2002).
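The inflation of Cronbach's alpha under positively correlated errors can be seen with population (model-implied) quantities, no simulation needed. In this sketch of ours, each item is the construct T plus a shared nuisance term u (the correlated error) plus unique noise; all variance values are arbitrary assumptions:

```python
# Our model-implied (population) illustration, not from the paper: each item is
# x_i = T + u + e_i, where u is a nuisance factor shared by all items -- a
# positively correlated error. Alpha counts u as "true" shared variance and so
# overestimates the reliability of the composite with respect to T alone.
# All variance values are arbitrary assumptions.
k = 4
var_t, var_u, var_e = 1.0, 0.3, 1.0

item_var = var_t + var_u + var_e          # variance of each item
item_cov = var_t + var_u                  # T and u are both shared across items
var_total = k * item_var + k * (k - 1) * item_cov

alpha = (k / (k - 1)) * (1 - k * item_var / var_total)
true_rel = (k ** 2 * var_t) / var_total   # reliability w.r.t. the construct only
assert alpha > true_rel                   # correlated errors inflate alpha
```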
Some reasons for error covariances are innocuous while others are much more
problematic. For instance, if error covariances are necessary because of item order effects, error
covariances can be incorporated to yield appropriate estimates. On the other hand, if the error
covariances are needed due to unmodeled dimensions in the scale, this eliminates nearly all
support for using the scale (i.e., the assumption of unidimensionality is violated; this
assumption is discussed next). Unfortunately, which of these mechanisms is
responsible for the covariances is difficult to determine empirically. It is difficult to test whether
error covariances are non-null because there are often not sufficient degrees of freedom to
include many error covariances into the model. Possible solutions to such a violation are
discussed in subsequent sections.
Assumption 4: Unidimensionality. Though Cronbach’s alpha is sometimes thought to
be a measure of unidimensionality because its colloquial definition is that it measures “how well
items stick together”, unidimensionality is an assumption that needs to be verified prior to
calculating Cronbach’s alpha rather than being the focus of what Cronbach’s alpha measures
(Cortina, 1993; Crutzen & Peters, 2015; Green, Lissitz, & Mulaik, 1977; Schmitt, 1996).
Although the terminology is not universally accepted (c.f., Sijtsma, 2009), Schmitt (1996) makes
the distinction between unidimensionality and internal consistency. He defines internal
consistency as the interrelatedness of a set of items while unidimensionality is the degree to
which the items all measure the same underlying construct.
Green et al. (1977) note that internal consistency is necessary for unidimensionality but
that internal consistency is not sufficient for demonstrating unidimensionality. That is, items that
measure different things can still have a high degree of interrelatedness, so a large Cronbach’s
alpha value does not necessarily guarantee that the scale measures a single construct. As a result,
violations of unidimensionality do not necessarily bias estimates of Cronbach’s alpha. In the
presence of a multidimensional scale, Cronbach’s alpha may still estimate the interrelatedness of
the items accurately and the interrelatedness of multidimensional items can in fact be quite high
(Cortina, 1993; Schmitt, 1996; Sijtsma, 2009).
Many papers (e.g., Crutzen & Peters, 2015; Schmitt, 1996; Green & Yang, 2009)
recommend beginning any reliability analysis with an inspection of the factor structure of the
scale, specifically examining whether a one-factor model fits well via inferential tests like the
minimum fit function chi square statistic or via fit index values. Though vitally important to the
interpretation of scales, a review by Crutzen and Peters (2015) found that only 2.4% of health
psychology studies reported any information about the dimensionality of the scale beyond
assessments of reliability. Many leading alternatives to Cronbach’s alpha (discussed in detail in
the next section), make explicit use of the factor analytic approach to reliability, facilitating the
presentation of dimensionality and reliability side-by-side.
Alternatives to Cronbach’s Alpha
There are many methods available to assess the reliability of scales. Hattie (1985) reviews
about 30 such methods and there are undoubtedly many additional methods that have been
developed in the 30+ years since this review was published. Our intention is not to update Hattie
(1985) by providing a broad overview of all the possible alternatives to Cronbach’s alpha that are
available. Instead, we focus on three particular methods: omega coefficients, Coefficient H, and
the greatest lower bound. These three alternatives are selected because (1) they have been shown
to perform well in previous studies, (2) they do not make as strict assumptions as Cronbach’s
alpha, and (3) they are conceptually similar to Cronbach’s alpha, so the idea of each should be
relatively familiar if one understands Cronbach’s alpha.
Omega and Composite Reliability
Composite reliability is conceptually related to Cronbach’s alpha in that it assesses
reliability via a ratio of the variability explained by items compared to the total variance of the
entire scale (Bentler, 2007; Geldhof et al., 2014; Raykov, 1997a, 1997b, 1998). Omega
(McDonald, 1970, 1999) is a commonly recommended measure of composite reliability that is
available in multiple software programs. Omega is designed for congeneric scales, where the
items vary in how strongly they are related to the construct being measured (i.e., in a factor
analysis setting, the loadings would not be assumed to be equal); in other words, tau
equivalence is not assumed. Composite reliability is appropriate when the items from a scale are
unit-weighted to form the total scale score but the scale itself is congeneric (Bentler, 2007;
Geldhof et al., 2014). A unit-weighted scale means that the total score of the scale is calculated
by adding up the raw scores (or reverse coded raw scores, if appropriate) of the individual items:
each item is weighted equally.
There are multiple variations of omega including omega hierarchical, omega total, and
what we will refer to as “Revelle’s omega total”. Omega hierarchical is useful for scales that
may not be truly unidimensional and may contain additional minor dimensions (Zinbarg et al.,
2006). Omega hierarchical attempts to parse out the variability attributable to sub-factors and
calculates reliability for a general factor that applies to all items. Although highly advantageous,
omega hierarchical differs from Cronbach’s alpha conceptually, so we will only provide a broad
overview here (although we do recommend its use if researchers believe that the items in the
scale are organized in hierarchical factors).
Omega total, on the other hand, assumes that the scale is unidimensional and estimates
the reliability for the composite of items on the scale (which is conceptually similar to
Cronbach’s alpha). In the R software environment, two packages (MBESS and psych)
calculate versions of omega total. However, they yield different results because MBESS uses a
different specification which generally tends to be more conservative and yields estimates closer
to Cronbach’s alpha (Peters, 2014; Revelle & Zinbarg, 2009; Revelle, 2016). We overview the
properties and formulas for each version of omega total in the next subsections. Though both
versions are typically referred to as "omega total", we assign different names to each version to
help keep them distinct. We refer to the omega total value based on the psych R package
specification as “Revelle’s omega total”. We use “omega total” to refer to the version calculated
by the MBESS R package (and as presented in many other sources).
Omega total. Under the assumption that the construct variance is constrained to 1 and
that there are no error covariances, omega total is calculated from factor analysis estimates such
that

$$\omega_{Total} = \frac{\left(\sum_{i=1}^{k}\lambda_{i}\right)^{2}}{\left(\sum_{i=1}^{k}\lambda_{i}\right)^{2}+\sum_{i=1}^{k}\theta_{ii}} \qquad (2)$$

where $\lambda_{i}$ is the factor loading (not necessarily standardized) for the ith item on the scale, $\theta_{ii}$ is the
error variance for the ith item, and k is the number of items on the scale. Omega total can only be
calculated if the scale is first factor analyzed to obtain the factor loadings and error variances.
This is necessary because tau equivalence is no longer assumed and the potentially differential
contribution of each item to the scale must be assessed.
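As a concrete illustration of Equation 2, omega total can be computed in a few lines once the factor analysis has been run. The loadings and error variances below are hypothetical values for a 4-item congeneric scale, not output from a real instrument:

```python
# Omega total (Equation 2): (sum of loadings)^2 divided by
# (sum of loadings)^2 plus the summed error variances
def omega_total(loadings, error_variances):
    true_var = sum(loadings) ** 2      # variance attributable to the construct
    return true_var / (true_var + sum(error_variances))

# Hypothetical factor analysis estimates (standardized items, so
# each error variance is 1 - loading^2)
loadings = [0.8, 0.7, 0.6, 0.5]
error_variances = [0.36, 0.51, 0.64, 0.75]

print(round(omega_total(loadings, error_variances), 3))  # → 0.749
```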
Although perhaps not immediately intuitive, Equation 2 is identical to the Cronbach's
alpha formula in Equation 1 under the condition of tau equivalence (Geldhof et al., 2014). The
condensed equation for Cronbach's alpha that appears under Equation 1 can alternatively be
written as

$$\alpha = \frac{k^{2}\bar{\sigma}_{ij}}{\sigma_{X}^{2}}$$

where $\bar{\sigma}_{ij}$ is the average covariance between pairs of items and $\sigma_{X}^{2}$ is the total variance of the
scale. From factor analysis path tracing rules, the model-implied covariance for a pair of items
(with no error covariances) that load on the same factor is equal to the square of the loadings
(times the factor variance, which is assumed to be equal to 1). Under tau equivalence, all the
loadings are equal, so the total true score variance is equal to the item covariance for a single
pair of items, repeated $k^{2}$ times. In both Equation 1 and Equation 2, this variance is divided by
the total variance of the scale. The denominator in Equation 2 is the factor analysis
representation of $\sigma_{X}^{2}$ from Equation 1. As such, omega total is a more general version of
Cronbach's alpha and actually subsumes Cronbach's alpha as a special case. More simply, if tau
equivalence is met, omega total will yield the same result as Cronbach's alpha, but omega total
has the flexibility to accommodate congeneric scales, unlike Cronbach's alpha.
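The special-case relationship can be verified numerically. The sketch below builds the model-implied covariance matrix for a hypothetical tau-equivalent scale (all loadings 0.6, all error variances 0.5) and shows that Cronbach's alpha computed from that matrix matches omega total computed from the factor solution:

```python
import numpy as np

k = 4
lam = 0.6      # common loading under tau equivalence
theta = 0.5    # error variance for every item

# Model-implied covariance matrix: loading products off-diagonal,
# loading^2 + error variance on the diagonal
cov = np.full((k, k), lam * lam)
np.fill_diagonal(cov, lam * lam + theta)

# Cronbach's alpha from the covariance matrix (Equation 1)
alpha = (k / (k - 1)) * (1 - np.trace(cov) / cov.sum())

# Omega total from the factor solution (Equation 2)
omega = (k * lam) ** 2 / ((k * lam) ** 2 + k * theta)

print(round(alpha, 4), round(omega, 4))  # identical under tau equivalence
```

If the loadings are made unequal, alpha falls below omega total, illustrating the downward bias discussed above.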
Similar to Cronbach’s alpha, omega total overestimates reliability if errors have a positive
covariance. The omega total formula in Equation 2 assumes that errors are uncorrelated, though
it can be generalized to cases where this assumption is violated by altering the denominator term
to account for error covariance such that,
$$\omega_{TCov} = \frac{\left(\sum_{i=1}^{k}\lambda_{i}\right)^{2}}{\left(\sum_{i=1}^{k}\lambda_{i}\right)^{2}+\sum_{i=1}^{k}\theta_{ii}+2\sum\sum_{i<j}\theta_{ij}}$$

where $\theta_{ij}$ is the error covariance between the ith and jth items.
If the residual covariances may be attributable to additional minor dimensions, then omega
hierarchical will yield a more accurate estimate of the reliability of the scale (Zinbarg et al.,
2006). Extensions of omega total are also available for cases where the factor variance is not
assumed to be 1 (Raykov, 2004) and when the data contain multiple groups (Zinbarg, Revelle, &
Yovel, 2007). These extensions, however, are outside the scope of this introduction and will not
be discussed further. Note that, although the inclusion of the error covariances in the
denominator appropriately takes the extra source of variation into account, it does not solve the
broader issue of why there is error covariance. That is, the error covariance may be attributable
to a model misspecification where an important factor has been omitted from the model (Green
& Hershberger, 2000), or design-driven aspects of the scale may have led to the correlated errors
(e.g., speeded tests; Cole, Ciesla, & Steiger, 2007). Bentler (2009) nicely summarizes this issue
by stating, "It would seem that the question of whether to consider correlated errors as factors
and hence part of the common factor space, or as residual covariances and hence as part of the
unique space, should be left up to the goals of the investigator" (p. 139).

Revelle's omega total. Though similar in name and idea, Revelle's omega total can yield
quite different (and typically larger) estimates of reliability than omega total because it uses a
different, more sophisticated variance decomposition. In Revelle's omega total, a factor model is
estimated as with omega total; however, the solution is then transformed with a Schmid-Leiman
rotation (Schmid & Leiman, 1957). Though we will not go into full detail regarding this rotation
because it is rather technical and outside the scope of this paper (for full details, see Mansolf &
Reise, 2016, or Wolff & Preising, 2005), the general idea is to rotate the factor solution to a
bifactor model where there is one general factor and several minor factors. More specifically,
each item will load on the single general factor (g), one or more group factors (f), and an
item-specific factor (s). The communality is then calculated by squaring the loadings on the
general factor and the group factor(s) but not the item-specific factors (Revelle, 2016).
The formula for Revelle’s omega total is essentially the same as Equation 2; however, it
is more complex to account for the differential variance decomposition and additional minor
factors. Namely, Revelle's omega total is equal to

$$\omega_{RT} = \frac{\left(\sum_{i=1}^{k}\lambda_{gi}\right)^{2}+\sum_{f=1}^{F}\left(\sum_{i=1}^{k_{f}}\lambda_{fi}\right)^{2}}{V_{Total}}$$

where $\lambda_{gi}$ is the loading of the ith item on the general factor, $\lambda_{fi}$ is the standardized loading of
the ith item on the fth group factor, k is the total number of items, F is the total number of group
factors, and $k_{f}$ is the number of items that load on the fth group factor. $V_{Total}$ is the total variance
after rotation, which is equal to the sum of each element of the sample Pearson (or polychoric)
correlation matrix (in matrix notation, this can be succinctly written as $\mathbf{1}'\mathbf{R}\mathbf{1}$, where $\mathbf{R}$ is the
sample correlation matrix).
Omega hierarchical is based on the exact same Schmid-Leiman transformation except
that it only considers contributions of the general factor and disregards the loadings of both the
group factors and the item-specific factors,

$$\omega_{H} = \frac{\left(\sum_{i=1}^{k}\lambda_{gi}\right)^{2}}{V_{Total}}$$

For interested readers, Kelley and Pornprasertmanit (2016) provide a highly readable description
of omega hierarchical and when it should be used. Readers looking for complete details on
omega hierarchical are referred to Zinbarg et al. (2005).
Though the formulas may look intimidating, the idea is quite straightforward because
software will handle the rotation and complexities of the formula. Explanations of how these
values are extracted from the data are provided in the software appendix.
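To make the pieces of these formulas concrete, the following sketch builds a small model-implied correlation matrix from hypothetical Schmid-Leiman loadings (one general factor and two group factors; none of these numbers come from a real dataset) and then applies both formulas. In practice, the loadings would come from software as described in the appendix:

```python
import numpy as np

# Hypothetical standardized Schmid-Leiman loadings for 6 items:
# one general factor g, and two group factors (items 1-3 and items 4-6)
g = np.array([0.6, 0.6, 0.6, 0.5, 0.5, 0.5])
f1 = np.array([0.4, 0.4, 0.4, 0.0, 0.0, 0.0])
f2 = np.array([0.0, 0.0, 0.0, 0.3, 0.3, 0.3])

# Model-implied correlation matrix: common parts, with unit diagonal
# (the remainder of each diagonal is item-specific variance)
lam = np.column_stack([g, f1, f2])
R = lam @ lam.T
np.fill_diagonal(R, 1.0)

V_total = R.sum()  # total variance after rotation (1'R1)

# Revelle's omega total: general + group factor contributions
omega_rt = (g.sum() ** 2 + f1.sum() ** 2 + f2.sum() ** 2) / V_total

# Omega hierarchical: contribution of the general factor only
omega_h = g.sum() ** 2 / V_total

print(round(omega_rt, 3), round(omega_h, 3))
```

Because omega hierarchical discards the group factors' contributions, it is necessarily smaller than Revelle's omega total for the same solution.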
Coefficient H and Maximal Reliability
Should researchers want to use the information present from the factor loadings to create
a scale that is optimally-weighted where each item contributes different amounts of information
to the overall scale score (instead of each item being given the same weight with unit-weighting),
then maximal reliability is a more appropriate measure of the scale’s reliability (Bentler, 2007;
Hancock & Mueller, 2001; Raykov, 2004).
Hancock and Mueller (2001) derived Coefficient H
as a measure of maximal reliability for an optimally-weighted scale. Similar to the form of
omega total presented in Equation 2, Coefficient H requires the (standardized) factor loadings
from a unidimensional factor analysis of the scale (or from unidimensional subscales).
Coefficient H is calculated by

$$H = \left[1+\left(\sum_{i=1}^{k}\frac{\ell_{i}^{2}}{1-\ell_{i}^{2}}\right)^{-1}\right]^{-1}$$

where k is again the number of items on the scale and $\ell_{i}$ is the standardized factor loading for
the ith item. Unlike Equation 2, notice that the squaring of the factor loadings occurs prior to
summing over each of the items. Both Cronbach's alpha and omega (all versions) are adversely
affected by items with negative loadings, whereas Coefficient H squares the loadings first so that
magnitude (and not sign) is the only important feature. This means that negatively worded items
do not need to be reverse coded with Coefficient H.

When using optimal weighting, the contribution of each item to the scale score is based on the
magnitude of its standardized factor loading. For example, an item with a standardized loading
of 0.90 would have a much larger impact on the scale score than an item with a standardized
loading of 0.50.
There are several other features of Coefficient H that differentiate it from omega total.
First, error variances are not included in the denominator of the equation. This means that items
with weak factor loadings do not negatively affect Coefficient H as they do in the computation of
omega total. In Equation 2, an item with a weak loading will necessarily have a large error
variance (i.e., the underlying construct accounts for a small percentage of the variance, so the
remaining variance must be attributable to error). In Coefficient H, the scale is not penalized for
featuring weaker items because its intended use is for optimally-weighted scales. For example,
whereas adding an item completely unrelated to the construct of interest to a scale reduces
reliability for Cronbach’s alpha and omega (which are appropriate for unit-weighted scales), with
optimal-weighted scales, an unrelated item’s factor loading will essentially be 0 and the
information from this item would not affect the scale scores. Put another way, in unit-weighted
scales, every item receives equal treatment so an unrelated item hurts the scale; in optimally-
weighted scales, items are differentially weighted so an unrelated item does not hurt reliability
because the item simply receives very little or zero consideration when scoring the scale.
Another property exclusive to Coefficient H is that the reliability of the scale cannot be less than
the squared loading (the definition of reliability in factor analytic models) of the single best item
(Geldhof et al., 2014).
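These properties are easy to verify numerically. The sketch below uses hypothetical standardized loadings; note how flipping a loading's sign leaves Coefficient H unchanged, how a near-zero item barely moves it, and how H never falls below the best single item's squared loading:

```python
# Coefficient H for an optimally-weighted scale:
# H = [1 + (sum of l_i^2 / (1 - l_i^2))^-1]^-1
def coefficient_h(loadings):
    s = sum(l ** 2 / (1 - l ** 2) for l in loadings)  # loadings squared first
    return 1 / (1 + 1 / s)

base = [0.8, 0.7, 0.6, 0.5]
h = coefficient_h(base)
print(round(h, 3))  # → 0.784

# The sign of a loading is irrelevant (no reverse coding needed)
assert coefficient_h([-0.8, 0.7, 0.6, 0.5]) == h

# A nearly unrelated item barely changes H (it would receive ~0 weight)
print(round(coefficient_h(base + [0.05]), 3))

# H cannot fall below the squared loading of the single best item
assert h >= max(l ** 2 for l in base)
```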
Greatest Lower Bound
The greatest lower bound (GLB) is a class of methods for assessing reliability which are
all based on the same conceptual idea. First introduced by Jackson and Agunwamba (1977), the
GLB is based on the classical test theory approach to reliability. First, the GLB extends the
classical test theory formula from $X = T + E$ to

$$Cov(\mathbf{X}) = Cov(\mathbf{T}) + Cov(\mathbf{E})$$

such that the covariance matrix of all observed scores X is equal to the covariance matrix of all
true scores T plus the covariance matrix of all the errors E (Shapiro & ten Berge, 2000; ten
Berge & Sočan, 2004). Conceptually, Jackson and Agunwamba (1977) argued that the greatest
lower bound for reliability could be calculated from the estimate of the covariance matrix of E
with the largest trace that is consistent with the data (provided that Cov(T) and Cov(E) are
non-negative definite). Once the estimated covariance matrix for E with the largest trace is
found, GLB reliability is calculated by

$$GLB = 1 - \frac{\mathrm{trace}[Cov(\mathbf{E})]}{\sigma_{X}^{2}}$$

where $\sigma_{X}^{2}$ is the variance of the observed scores. More simply, the goal is to determine the
maximal values for the error component of the observed scores that are consistent with the data
because reliability calculated with these maximum errors will yield the lowest possible value for
reliability (Sočan, 2000). Jackson and Agunwamba (1977) showed that Cronbach's alpha and
other single-administration measures like split-half reliability are based on the same principle as
the GLB, with the exception that they inefficiently estimate Cov(E) and therefore do not exceed
the theoretical GLB value.
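The definition can be illustrated without solving the full optimization problem. In this sketch (hypothetical numbers, and deliberately not a GLB solver), we evaluate the reliability expression at one feasible error covariance matrix, i.e., one for which Cov(X) − Cov(E) remains non-negative definite; the GLB itself is the smallest such value over all feasible Cov(E):

```python
import numpy as np

# Hypothetical observed correlation matrix for a 4-item scale,
# implied by loadings [0.8, 0.7, 0.6, 0.5] with uncorrelated errors
lam = np.array([0.8, 0.7, 0.6, 0.5])
cov_x = np.outer(lam, lam)
np.fill_diagonal(cov_x, 1.0)

# One feasible candidate for Cov(E): the factor-model uniquenesses
cov_e = np.diag(1 - lam ** 2)

# Feasibility check: Cov(T) = Cov(X) - Cov(E) must be non-negative definite
cov_t = cov_x - cov_e
assert np.linalg.eigvalsh(cov_t).min() >= -1e-9

# Reliability at this candidate: 1 - trace(Cov(E)) / total observed variance;
# the GLB is the minimum of this quantity over all feasible Cov(E)
reliability = 1 - np.trace(cov_e) / cov_x.sum()
print(round(reliability, 3))
```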
Though appealing theoretically, a major challenge for GLB reliability is its computation.
The difficulty stems from finding the estimate of Cov(E) that maximizes the trace. (The trace of
a matrix is computed by adding up all of its diagonal elements, and a matrix is non-negative
definite when none of its eigenvalues are negative.) In fact, a simple analytical solution is
generally impossible, so several iterative methods have been proposed to determine this matrix,
with leading candidates being the minimum rank factor analysis (MRFA) approach of ten Berge
and Kiers (1991) and the GLB algebraic solution from Moltner and Revelle (2015) (both of
which can be implemented in R). An additional limitation of GLB reliability is that it tends to
overestimate reliability with smaller sample sizes (e.g., bias is rather large with a sample size of
100 but is reasonable with a sample size of 500; Shapiro & ten Berge, 2000; Trizano-Hermosilla
& Alvarado, 2016).
Practical Comparison of Methods
Table 1 compares the six aforementioned methods (Cronbach’s alpha, omega total,
Revelle’s omega total, omega hierarchical, Coefficient H, and the GLB) based on practical
considerations. That is, because adopting new statistical approaches often entails a steep learning
curve, Table 1 does not compare strict statistical properties or asymptotic behavior but rather
overviews which software can compute each method, whether the method is calculable by hand,
notable conceptual advantages, and notable conceptual disadvantages. Alternatives to
Cronbach’s alpha tend to have very little support in general software, so the easiest measures to
report are omega total or Coefficient H because they can be calculated using a simple
spreadsheet. More computationally intensive measures are only currently supported in R. We
realize that R is not the first-choice software for many psychologists, so extensive annotated R
code is provided in an appendix to assist in calculating measures that require more computational
resources (e.g., Schmid-Leiman transformation, MRFA).
Empirical Examples
In this section, we provide example analyses to demonstrate the shortcomings of
Cronbach’s alpha. The first example dataset is based on a subsample of the Early Childhood
Longitudinal Study Kindergarten (ECLS-K) from the United States’ National Center for
Educational Statistics. The data include 21,054 students and thousands of variables such as direct
cognitive assessments of students, teacher reports of students, parental reports of students, and
detailed information about demographics and students' home life at seven time points. The data
are publicly available from the United States' National Center for Educational Statistics and are
intended to allow researchers to answer research
questions pertaining to child development, school readiness, and experiences in schools. We
used a subsample consisting of 1,977 students who had complete math and reading scores at all
seven waves of the study. Socioeconomic status is not captured by a single variable in ECLS-K;
therefore, researchers have argued and demonstrated that it is more fruitful to form a
socioeconomic status scale from variables that capture its different aspects
(Curran & Kellogg, 2016; Lubienski & Crane, 2010). In this example, we use 9 variables:
Mother’s Education, Father’s Education, Household income (in dollars), parents’ expectation of
child’s eventual education level, the number of books the child has, whether the child qualifies
for free or reduced lunch, whether the parent volunteers at school, whether there is a computer in
the house (these data were collected in the late 1990s when home computers were not
ubiquitous), and whether the child is enrolled in music lessons. These variables were collected
during the fall semester of the child’s kindergarten year. The first example primarily
demonstrates how the assumption of tau equivalence adversely affects Cronbach’s alpha in ways
that do not affect other measures. Differences between reliability for optimally-weighted and
unit-weighted scales are also shown.
The second example contains responses to 25 Likert items from the Big Five Inventory
for personality traits. The data contain responses from 2800 people and were collected as part of
the Synthetic Aperture Personality Assessment (SAPA) project (Revelle, Wilt, & Rosenthal,
2010). The data are freely available in the psych R package as the “bfi” data. This example
shows how the various measures are similar when tau equivalence is approximately met and how
the measures diverge when scales are congeneric. The data in this example are based on Likert
items, so the example also shows how reliability is attenuated if discrete responses are treated as
continuous and how discrete items similarly affect the alternative measures as well.
Although we listed other assumptions earlier in the text, these examples primarily focus
on violations of the tau equivalence and continuous item assumptions. This is intentional
because these are the most frequently violated assumptions of Cronbach's alpha and the simplest
to relax.
ECLS-K Example
To demonstrate the large violation of tau equivalence in these data, we first perform a
likelihood ratio test comparing a model with constrained standardized loadings across all items
to a model with standardized loadings freely estimated for all items. We reverse coded the Free
or Reduced Lunch variable because its loading was negative, which would adversely affect fit.
With all loadings constrained, $\chi^{2}(35) = 625.33$, SRMR = .12, and McDonald's Centrality = .83;
the standardized loading for all items was estimated to be 0.48. When loadings were allowed to
be unconstrained, $\chi^{2}(27) = 160.52$, SRMR = .05, and McDonald's Centrality = .96. A likelihood
ratio test of these two models results in a value of $\chi^{2}(8) = 464.81$, which is clearly significant
(the 0.05 cut-off is 15.51) and indicates that the model with constrained loadings fits
significantly worse. The standardized loadings for the unconstrained model are presented in
Table 2, which clearly show a wide range of standardized factor loadings (Range: 0.21 to 0.76).
The fit indices also provide evidence that the scale is unidimensional because a one-factor
solution fits the data reasonably well.

Hu and Bentler (1999) recommend McDonald's Centrality > .90 and SRMR < .09 as a
combinational rule that minimizes the sum of Type-I and Type-II errors (p. 26), while
McDonald's Centrality > .93 and SRMR < .06 also worked fairly well but tended to over-reject
true models. We use these criteria to establish goodness-of-fit throughout these examples
because factor models for scales with few items tend to have few degrees of freedom, for which
RMSEA vastly over-rejects well-fitting models (Kenny, Kaniskan, & McCoach, 2014), and
because the sample size in both models is rather large, which may render the chi-square test
overpowered (e.g., Hu & Bentler, 1998). Note that there has been a steady wave of criticism
against generalizing the Hu and Bentler cut-offs (e.g., Marsh, Hau, & Wen, 2004; Hancock &
Mueller, 2011), although our examples fall fairly closely to their original simulation design (a
factor model with 5 items per factor and standardized loadings near 0.70).

Table 2 provides the reliability estimates using Cronbach's alpha, omega total,
Revelle’s omega total, the GLB, and Coefficient H. If Cronbach’s alpha is used, the value is in
the mid .70s which would result in the scale being seen as “acceptable” using common
guidelines from Kline (1986) and DeVellis (1991). However, recall that the loadings in this
example are highly discrepant and that this negatively biases Cronbach’s alpha estimates. Using
an alternative measure of reliability results in noticeable increases in reliability estimates, as high
as 10% with Coefficient H.
Although many researchers would consider removing the Music Lessons variable due to
its low loading, we have retained it to demonstrate the difference in reliability estimates for unit-
weighted and optimally-weighted scales. For Cronbach’s alpha, both omega totals, and the GLB,
a weakly related item decreases reliability because each item receives equal consideration when
computing scale scores. However, optimally weighted scales (for which Coefficient H is
appropriate) differentially weight each item based on its factor loading. As a result, Coefficient H
in this case is higher (5% higher than Revelle’s omega total) because the Music Lessons variable
is heavily down-weighted and the other, more reliable items would be weighted much more
heavily when scale scores are computed. As a reminder, even though it may be appealing to
report Coefficient H in such a case because it is higher, it is only appropriate if the scale score is
calculated using optimal weights.
Big Five Inventory Example
Unlike the previous example where tau equivalence was badly violated, this example
features five subscales with various gradations of (possible) violations to tau equivalence. Table
3 shows the standardized factor loadings based on the Pearson and polychoric correlation
matrices. Both sets of results were obtained in R using the psych package and the
scaleStructure wrapper from the userfriendlyscience package (details are
provided in the appendix). Each subscale in this dataset contains five items that are intended to
be unidimensional (i.e., each item only measures a single construct). To assess the
unidimensionality of these subscales, SRMR and McDonald’s Centrality are provided for each
subscale; the values for each subscale meet the suggested guidelines and we continue under the
assumption that unidimensionality for each sub-scale is preserved.
Upon initial inspection of Table 3, the various subscales adhere to tau equivalence to
varying degrees. The loadings for the Conscientiousness subscale are rather close to one another
(magnitude range: 0.55 to 0.67 using a Pearson covariance matrix, 0.58 to 0.72 using a
polychoric covariance matrix). On the other hand, the loadings for the Agreeableness subscale
are quite variable (Range: 0.37 to 0.76 using a Pearson covariance matrix, 0.43 to 0.80 using a
polychoric covariance matrix). To more rigorously demonstrate the similarity of the loadings on
the Conscientiousness subscale, we constrained the standardized loadings to be equal and
compared the fit to a model where all loadings are freely estimated. The likelihood ratio test was
$\chi^{2}(4) = 28.17$, $p < .01$, but the changes in the SRMR ($\Delta$SRMR = .0125) and McDonald's
Centrality ($\Delta$McDonald's Centrality = .0048) were rather small. When sample size is large,
some studies have recommended using changes in fit indices instead of the likelihood ratio test
(e.g., Cheung & Rensvold, 2002; F.F. Chen, 2007). Although the field has not uniformly
accepted this approach (e.g., Barrett, 2007), these changes in fit indices between models are
below the recommended cut-offs (less than .025 for SRMR when testing loadings, greater than
-.005 for McDonald's Centrality; Chen, 2007). We proceed by allowing the loadings to be freely
estimated, but we treat the Conscientiousness subscale as an exemplar of
the behavior of the various reliability measures when tau equivalence is roughly appropriate.
Table 4 shows the estimated reliability using Cronbach’s alpha, omega total, Revelle’s
omega total, the GLB (using the MRFA approach), and Coefficient H using both a Pearson
covariance matrix and a polychoric covariance matrix. First, notice that when the subscale is
very closely tau equivalent (as in the Conscientiousness subscale), there are small differences
between the various reliability measures. However, the difference between the estimates grows
larger as the subscales deviate from tau equivalence, with relative percentage increases over
Cronbach's alpha ranging from 5 to 12% across subscales.
This example also shows the effect of treating truly discrete items as continuous when
calculating reliability, which is an assumption of all methods because each use the inter-item
covariance matrix in some form in their calculation. Even though item responses are on a six
point Likert scale, the reliability estimates using the polychoric covariance matrix are noticeably
larger because treating the items as continuous attenuates the covariances. Across each subscale,
the estimates based on the polychoric covariance matrix are between .02 to .11 points higher for
the same measure than if the Pearson covariance matrix is used. Regardless of which method is
used to calculate reliability, it is important to consider the scale of the items.
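This attenuation is easy to demonstrate by simulation: discretizing continuous responses into Likert categories shrinks the Pearson correlation relative to the correlation among the underlying continuous responses (which is what the polychoric correlation attempts to recover). A small sketch with arbitrary simulation settings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate two continuous latent responses correlated at rho = 0.6
rho, n = 0.6, 50_000
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)

# Cut each variable into a 6-point Likert scale at equally spaced thresholds
cuts = [-1.5, -0.75, 0, 0.75, 1.5]
likert = np.digitize(z, cuts)

r_continuous = np.corrcoef(z[:, 0], z[:, 1])[0, 1]
r_discrete = np.corrcoef(likert[:, 0], likert[:, 1])[0, 1]

# The Pearson correlation of the discretized items is attenuated
print(round(r_continuous, 2), round(r_discrete, 2))
```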
Among the various alternatives to Cronbach’s alpha, the expected trends can be seen in
this example. First, Cronbach’s alpha consistently yields the lowest estimate of reliability. This is
expected because Cronbach's alpha is the only method making the tau equivalence assumption,
which is rarely tenable and is inappropriate for at least four of the five subscales in this example.
(When a scale is perfectly tau equivalent, omega total and Coefficient H will be identical to
Cronbach's alpha, provided that all other assumptions are met. With tau equivalence, there is no
difference between unit weighting and optimal weighting because, with optimal weighting and
tau equivalence, each item receives the same weight. The GLB will not necessarily be equal to
Cronbach's alpha, even if a scale is tau equivalent; Sočan, 2000.)
Second, when subscales have an item that is noticeably poor relative to the other items
(e.g., Item1 on Agreeableness, Item4 on Openness), Coefficient H tends to provide larger
reliability estimates than omega total, the GLB, and sometimes than Revelle’s omega total
because the scale would be better scored using optimal weighting (to down-weight the impact of
the poor item). When subscales have factor loadings in the same general vicinity (but not
necessarily close enough to be considered approximately tau equivalent), the GLB and Revelle’s
omega total yield higher estimates than Coefficient H. In the case of approximate tau
equivalence, Coefficient H converges to Cronbach’s alpha whereas the GLB is known to exceed
Cronbach’s alpha in such instances (e.g., Sočan, 2000). When there is moderate separation
between the loadings of the various items (as on the Neuroticism subscale), Coefficient H and the
GLB are approximately equal.
Take-Home Message
The take-home message of these examples is that there is a vast discrepancy in the
reliability estimates when applying the conventional Cronbach’s alpha compared to employing
alternative methods. In the Big Five Inventory example, Cronbach’s alpha for the Openness
subscale using a Pearson covariance matrix is .61, which would be classified as borderline poor
(Kline, 1986, and DeVellis, 1991, designate the "poor" classification at <.60) and would likely
need to be defended if a manuscript were submitted for publication. However, by appropriately
accounting for the discreteness of the responses and using a method that does not mandate tau
equivalence, Revelle’s omega total, the GLB, and Coefficient H estimate the reliability to be
well above .70. The GLB yields the highest estimate at .76, 25% higher than the Cronbach’s
alpha estimate based on the Pearson covariance matrix.
Although Cronbach’s alpha is familiar, commonly reported, and easy to obtain in
software, it is rarely an appropriate measure of reliability - its assumptions are overly rigid and
almost always violated. Worse yet, under the near ubiquitous violation of tau equivalence,
Cronbach’s alpha estimates make scales appear much less reliable than they are in actuality.
Moreover, Cronbach's alpha is a special case of the alternative measures overviewed in this
paper, meaning that even if all of its assumptions are met, some methods (omega total and
Coefficient H) will yield the exact same values while others (Revelle's omega total and the
GLB) have been shown to routinely exceed Cronbach's alpha. Quite plainly, there is no situation
where Cronbach's alpha is the optimal method for assessing reliability.
Despite a steady stream of criticism against Cronbach’s alpha, researchers continue to
report it in flagship APA journals, as reviewed in the introduction. A common tactic when
reporting unfavorable values of Cronbach’s alpha is to appeal to the weakness of the method.
This approach, while well-intended, is highly problematic for the scientific process because it
impedes the ability to identify scales with less desirable properties. That is, if a scale has a
Cronbach’s alpha value of 0.40, the value could be low because (1) the scale is not reliable or (2)
the scale is sufficiently reliable but assumption violations led to downwardly biased estimates of
Cronbach’s alpha. This uncertainty leads towards a dichotomy where either (1) the use of the
scale is supported because reliability is sufficiently high (e.g., 0.70 or greater) or (2) Cronbach’s
alpha should be higher but was underestimated because assumptions were violated and the scale
is still usable. Such a dichotomy hides a third option which is simply that the scale is not reliable.
In the long run, it does the field little good to use faulty methods whose results may subsequently
be disregarded; the process of scale validation at such a point becomes highly subjective and not
readily falsifiable, eroding the credibility of psychometric analysis.
Given that many psychologists employ latent variable methods (item response theory,
confirmatory factor analysis, or exploratory factor analysis) to explore their scales rather than
classical test theory, it is difficult to excuse the continued use of Cronbach’s alpha. Specifically,
the vital assumption of tau equivalence is quite easy to inspect by examining the similarity of the
factor loadings. Even the classic eyeball test can be an effective approximation in many cases.
For instance, in the ECLS-K example, formal tests are not likely necessary to determine that
standardized loadings of 0.21 and 0.76 are not approximately equal. If the factor loadings are not
equivalent for all items on the scale, then Cronbach’s alpha is not appropriate and its use will
adversely affect results by making reliability appear lower than it actually is. Other measures are
susceptible to other assumption violations, but we remind readers that there are ways in which
these could be addressed such as omega hierarchical for the presence of minor dimensions,
including error covariances between items for design-driven reasons, or basing estimates on a
polychoric rather than Pearson covariance matrix if item responses are discrete rather than
continuous. We would like to note that Likert items, even with many categories, attenuate the
item covariances that are used in all methods we discuss in this paper, which results in
downwardly biased estimates of reliability. Therefore, it tends to be in researchers' best interest
to acknowledge the potential discreteness of items.
Although there have been previous calls to abandon Cronbach’s alpha, Revelle and
Zinbarg (2009) noted that software for other methods was somewhat limited and that empirical
researchers may be hesitant because of the undoubted attraction to methods that have simple
software applications. Although the GLB and Revelle’s omega total are best estimated in R
because of some computational complexities, omega total and Coefficient H are fairly
straightforward to compute manually or with spreadsheets and do not require sophisticated or
iterative processes. In the Appendix, we provide annotated R code that can be used to estimate
these alternative measures. Some of the functionality included in these packages may require
additional analyses in R, which we realize may not be helpful to users who are unfamiliar with or
who dislike using R (though the scaleStructure function can eliminate the need for these
additional analyses for most of the alternative measures). In an attempt to make these measures
more accessible, we provide an Excel spreadsheet on the first author’s personal website and on
the Open Science Framework that allows researchers to compute Coefficient H and omega total
using only the standardized factor loadings. Guidance for using this spreadsheet is also provided
in the appendix.
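As a minimal sketch of what that spreadsheet computes (assuming a unidimensional model with standardized loadings and uncorrelated errors; the function names here are ours, not from any package), omega total and Coefficient H can be obtained directly from the loadings:

```r
# Omega total and Coefficient H computed from standardized factor loadings,
# assuming a unidimensional model with uncorrelated errors.
omega_total <- function(lambda) {
  sum(lambda)^2 / (sum(lambda)^2 + sum(1 - lambda^2))
}

coefficient_h <- function(lambda) {
  s <- sum(lambda^2 / (1 - lambda^2))
  s / (1 + s)
}

loadings <- c(0.80, 0.70, 0.60, 0.50)  # hypothetical standardized loadings
round(omega_total(loadings), 2)   # 0.75
round(coefficient_h(loadings), 2) # 0.78
```

Because Coefficient H is based on an optimally weighted composite while omega total is based on a unit-weighted composite, H is never smaller than omega total for the same set of loadings.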
This paper is not intended to fully cover all the nuances and issues associated with
Cronbach’s alpha or calculating and reporting scale reliability as this literature is rather
extensive. Other researchers have provided more technical information on this topic for those
seeking a deeper understanding of the issues surrounding reliability. Geldhof et al. (2014)
provide further guidance on calculating reliability with Cronbach’s alpha, omega total, and
Coefficient H when data come from a multilevel structure. Kelley and colleagues have several
recent papers discussing the importance of confidence intervals around reliability estimates and
discuss how to compute such intervals for many measures which have been included in their
MBESS R package (e.g., Kelley & Cheng, 2012; Kelley & Pornprasertmanit, 2016; Terry &
Kelley, 2012). Zhang and Yuan (2016) discuss robust methods to compute Cronbach’s alpha and
omega total with non-normal or missing data and also provide the R package
coefficientalpha. We presented only a few of the possible alternatives to Cronbach’s
alpha. Bentler’s rho (Bentler, 1968) has also been recommended and is easy to compute in the
EQS software, while Sijtsma (2009) has advocated the explained common variance (ECV)
method. We focused on unidimensional scales, although there is a growing trend in the literature
to assess the reliability of multidimensional scales. Bifactor and hierarchical models (where there
is a single general factor and several subscale factors) are more appropriate for these types of
scales, and there are alternative reliability measures designed for them (Reise, 2012; Reise,
Bonifay, & Haviland, 2013; Reise, Morizot, & Hays, 2007).
In conclusion, we hope that we have sufficiently demonstrated why Cronbach’s alpha is
obsolete and that it is time for the field to move on to better, more general alternatives. As seen
in the empirical examples, the practical differences among the competing alternatives tend to be
rather small, though the examples showed that the GLB, Revelle’s omega total, and Coefficient H
tend to provide the highest estimates of reliability. We realize that readers may be hoping for
guidance on which of the aforementioned methods should be the “successor” to Cronbach’s alpha.
Although some of these comparisons have been noted in the literature and some general
relations are known (such as those presented in Table 1), these results should not be taken as
rigorous and comprehensive, since they are anecdotal and not based on analytic derivations or
simulation results (though such comparisons would undoubtedly be a fruitful avenue of future
research). The common theme we hope to convey is that Cronbach’s alpha is outperformed by
all of these methods. We believe that the most important message empirical researchers receive
from this article is that using any of the alternatives is preferable to continued use of Cronbach’s
alpha. Cronbach’s alpha had a good run and was able to hold down the fort for the field for over
50 years, but methodological reinforcements have indeed arrived.
Footnote: This phrase was used by a reviewer, which we adopted because we thought it very aptly
described the current state of affairs.
References
Bandura, A. (1977). Self-efficacy: toward a unifying theory of behavioral change. Psychological
Review, 84, 191-215.
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social
psychological research: Conceptual, strategic, and statistical considerations. Journal of
Personality and Social Psychology, 51, 1173-1182.
Barrett, P. (2007). Structural equation modelling: Adjudging model fit. Personality and
Individual Differences, 42, 815-824.
Becker, G. (2000). How important is transient error in estimating reliability? Going beyond
simulation studies. Psychological Methods, 5, 370-379.
Bentler, P. M. (2009). Alpha, dimension-free, and model-based internal consistency
reliability. Psychometrika, 74, 137-143.
Bentler, P. M. (2007). Covariance structure models for maximal reliability of unit-weighted
composites. In S. Lee (Ed.), Handbook of computing and statistics with applications: Vol. 1.
Handbook of latent variable and related models (pp. 1-19). New York: Elsevier.
Bentler, P. M. (1968). Alpha-maximized factor analysis (Alphamax): Its relation to alpha and
canonical factor analysis. Psychometrika, 33, 335-345.
Carroll, J. B. (1961). The nature of the data, or how to choose a correlation
coefficient. Psychometrika, 26, 347-372.
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing
measurement invariance. Structural Equation Modeling,9, 233-255.
Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement
invariance. Structural Equation Modeling, 14, 464-504.
Cole, D. A., Ciesla, J. A., & Steiger, J. H. (2007). The insidious effects of failing to include
design-driven correlated residuals in latent-variable covariance structure analysis. Psychological
Methods, 12, 381-398.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and
applications. Journal of Applied Psychology, 78, 98-104.
Crocker, L., & Algina, J. (2008). Introduction to classical and modern test theory. New York:
Cronbach, L.J., & Shavelson, R.J. (2004). My current thoughts on coefficient alpha and
successor procedures. Educational and Psychological Measurement, 64, 391-418.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Crutzen, R. (2007). Time is a jailer: What do alpha and its alternatives tell us about
reliability? European Health Psychologist, 16, 70-74.
Crutzen, R., & Peters, G. J. Y. (2015). Scale quality: Alpha is an inadequate estimate and factor-analytic evidence is needed first of all. Health Psychology Review. Advance online publication.
Curran, F. C., & Kellogg, A. T. (2016). Understanding science achievement gaps by
race/ethnicity and gender in kindergarten and first grade.Educational Researcher, 45, 273-282.
DeVellis, R. F. (1991). Scale Development. London: Sage.
Dunn, T. J., Baguley, T., & Brunsden, V. (2014). From alpha to omega: A practical solution to
the pervasive problem of internal consistency estimation. British Journal of Psychology, 105,
Flora, D. B., & Curran, P. J. (2004). An empirical evaluation of alternative methods of
estimation for confirmatory factor analysis with ordinal data.Psychological Methods, 9, 466-491.
Gadermann, A. M., Guhn, M., & Zumbo, B. D. (2012). Estimating ordinal reliability for Likert-
type and ordinal item response data: A conceptual, empirical, and practical guide. Practical
Assessment, Research & Evaluation,17, 1-13.
Geldhof, G. J., Preacher, K. J., & Zyphur, M. J. (2014). Reliability estimation in a multilevel
confirmatory factor analysis framework. Psychological Methods, 19, 72-91.
Gessaroli, M. E., & Folske, J. C. (2002). Generalizing the reliability of tests comprised of
testlets. International Journal of Testing, 2, 277-295.
Gignac, G. E., Bates, T. C., & Lang, K. (2007a). Implications relevant to CFA model misfit,
reliability, and the five factor model as measured by the NEOFFI. Personality and Individual
Differences, 43, 1051-1062.
Graham, J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability
what they are and how to use them. Educational and Psychological Measurement, 66, 930-944.
Green, S. B. (2003). A coefficient alpha for test-retest data. Psychological Methods, 8, 88-101.
Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977). Limitations of coefficient alpha as an index
of test unidimensionality. Educational and Psychological Measurement, 37, 827-838.
Green, S. B., & Hershberger, S. L. (2000). Correlated errors in true score models and their effect
on coefficient alpha. Structural Equation Modeling, 7, 251-270.
Green, S. B., & Yang, Y. (2009a). Commentary on coefficient alpha: A cautionary
tale. Psychometrika, 74, 121-135.
Green, S. B., & Yang, Y. (2009b). Reliability of summed item scores using structural equation
modeling: An alternative to coefficient alpha. Psychometrika, 74, 155-167.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255-282.
Hancock, G. R., & Mueller, R. O. (2011). The reliability paradox in assessing structural relations
within covariance structure models. Educational and Psychological Measurement, 71, 306-324.
Hancock, G. R., & Mueller, R. O. (2001). Rethinking construct reliability within latent variable
systems. In R. Cudeck, S. du Toit, & D. Sörbom (Eds.), Structural equation modeling: Present
and future: A festschrift in honor of Karl Jöreskog (pp. 195-216). Lincolnwood, IL: Scientific
Software International.
Hattie, J. (1985). Methodology review: assessing unidimensionality of tests and items. Applied
Psychological Measurement, 9, 139-164.
Hogan, T. P., Benjamin, A., & Brezinski, K. L. (2000). Reliability methods: A note on the
frequency of use of various types. Educational and Psychological Measurement, 60, 523-531.
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis:
Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.
Hu, L. T., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to
underparameterized model misspecification.Psychological Methods, 3, 424-453.
Jackson, P. H., & Agunwamba, C. C. (1977). Lower bounds for the reliability of the total score
on a test composed of non-homogeneous items: I: Algebraic lower bounds. Psychometrika, 42,
Kelley, K. (2007). Methods for the behavioral, educational, and social sciences: An R
package. Behavior Research Methods, 39, 979-984.
Kelley, K., & Cheng, Y. (2012). Estimation of and confidence interval formation for reliability
coefficients of homogeneous measurement instruments. Methodology, 8, 39-50.
Kelley, K., & Pornprasertmanit, S. (2016). Confidence intervals for population reliability
coefficients: Evaluation of methods, recommendations, and software for composite
measures. Psychological Methods, 21, 69-92.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2015). The performance of RMSEA in models
with small degrees of freedom. Sociological Methods & Research, 44, 486-507.
Kline, P. (1986). A handbook of test construction: Introduction to psychometric design. London:
Lubienski, S., & Crane, C. C. (2010). Beyond free lunch: Which family background measures
matter? Education Policy Analysis Archives, 18, 11.
Mansolf, M. & Reise, S.P. (2016) Exploratory bifactor analysis: The Schmid-Leiman
orthogonalization and Jennrich-Bentler analytic rotations. Multivariate Behavioral Research, 51,
Marsh, H. W., Hau, K. T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-
testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and
Bentler's (1999) findings.Structural Equation Modeling, 11, 320-341.
McDonald, R. P. (1999). Test theory: A unified approach. Mahwah, NJ: Erlbaum.
McDonald, R. P. (1970). The theoretical foundations of principal factor analysis, canonical
factor analysis, and alpha factor analysis. British Journal of Mathematical and Statistical
Psychology, 23, 1-21.
Miller, M. B. (1995). Coefficient alpha: A basic introduction from the perspectives of classical
test theory and structural equation modeling. Structural Equation Modeling, 2, 255-273.
Moltner, A., & Revelle, W. (2015). Find the greatest lower bound to reliability. Available online.
Novick, M. R., & Lewis, C. (1967). Coefficient alpha and the reliability of composite
measurements. Psychometrika, 32, 1-13.
Nunnally, J. C., & Bernstein, I. H. (1994). The assessment of reliability. In J. C. Nunnally & I. H.
Bernstein (Eds.), Psychometric theory (pp. 248-292). New York: McGraw-Hill.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated
approach. Hillsdale, NJ: Erlbaum.
Peters, G. J. Y. (2014). The alpha and the omega of scale reliability and validity: why and how to
abandon Cronbach’s alpha and the route towards more comprehensive assessment of scale
quality. European Health Psychologist, 16, 56-69.
Peterson, R. A., & Kim, Y. (2013). On the relationship between coefficient alpha and composite
reliability. Journal of Applied Psychology, 98, 194-198.
Raykov, T. (1997a). Estimation of composite reliability for congeneric measures. Applied
Psychological Measurement, 21, 173-184.
Raykov, T. (1997b). Scale reliability, Cronbach's coefficient alpha, and violations of essential
tau-equivalence with fixed congeneric components. Multivariate Behavioral Research, 32, 329-
Raykov, T. (1998). Coefficient alpha and composite reliability with interrelated
nonhomogeneous items. Applied Psychological Measurement, 22, 375-385.
Raykov, T. (2004). Behavioral scale reliability and measurement invariance evaluation using
latent variable modeling. Behavior Therapy, 35, 299-331.
Raykov, T., & Shrout, P. E. (2002). Reliability of scales with general structure: Point and
interval estimation using a structural equation modeling approach. Structural Equation
Modeling, 9, 195-212.
Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral
Research, 47, 667-696.
Reise, S. P., Bonifay, W. E., & Haviland, M. G. (2013). Scoring and modeling psychological
measures in the presence of multidimensionality. Journal of Personality Assessment, 95, 129-
Reise, S. P., Morizot, J., & Hays, R. D. (2007). The role of the bifactor model in resolving
dimensionality issues in health outcomes measures. Quality of Life Research, 16, 19-31.
Revelle, W. (2016, May). Using R and the psych package to find ω. Accessed October 19, 2016
Revelle, W. (2008). psych: Procedures for personality and psychological research (R package
version 1.0-51).
Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate
Behavioral Research, 14, 57-74.
Revelle, W., Wilt, J., & Rosenthal, A. (2010). Individual differences in cognition: New methods
for examining the personality-cognition link. In A. Gruszka, G. Matthews, & B. Szymura (Eds.),
Handbook of individual differences in cognition: Attention, memory and executive control (pp.
27-49). New York: Springer.
Revelle, W., & Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: Comments
on Sijtsma. Psychometrika, 74, 145-154.
Rhemtulla, M., Brosseau-Liard, P. É., & Savalei, V. (2012). When can categorical variables be
treated as continuous? A comparison of robust continuous and categorical SEM estimation
methods under suboptimal conditions. Psychological Methods, 17, 354-373.
Rozeboom, W. W. (1966). Scaling theory and the nature of measurement. Synthese, 16, 170-233.
Schmid, J., & Leiman, J. M. (1957). The development of hierarchical factor
solutions. Psychometrika, 22, 53-61.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350-353.
Shapiro, A., & Ten Berge, J. M. (2000). The asymptotic bias of minimum trace factor analysis,
with applications to the greatest lower bound to reliability.Psychometrika, 65, 413-425.
Sheng, Y., & Sheng, Z. (2012). Is coefficient alpha robust to non-normal data? Frontiers in
Psychology, 3, 34.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s
alpha. Psychometrika, 74, 107-120.
Sočan, G. (2000). Assessment of reliability when test items are not essentially τ-
equivalent. Developments in Survey Methodology, 15, 23-35.
Steinberg, L., & Thissen, D. (1996). Uses of item response theory and the testlet concept in the
measurement of psychopathology. Psychological Methods, 1, 81-97.
ten Berge, J. M., & Sočan, G. (2004). The greatest lower bound to the reliability of a test and the
hypothesis of unidimensionality. Psychometrika,69, 613-625.
ten Berge, J. M., & Kiers, H. A. (1991). A numerical approach to the approximate and the exact
minimum rank of a covariance matrix.Psychometrika, 56, 309-315.
Terry, L., & Kelley, K. (2012). Sample size planning for composite reliability coefficients:
Accuracy in parameter estimation via narrow confidence intervals. British Journal of
Mathematical and Statistical Psychology, 65, 371-401.
Teo, T., & Fan, X. (2013). Coefficient Alpha and beyond: Issues and alternatives for educational
research. The Asia-Pacific Education Researcher, 22, 209-213.
Trizano-Hermosilla, I., & Alvarado, J. M. (2016). Best alternatives to Cronbach’s alpha
reliability in realistic conditions: Congeneric and asymmetrical measurements. Frontiers in
Psychology, 7, 769.
van Noorden, R., Maher, B., & Nuzzo, R. (2014). The top 100 papers. Nature, 514, 550-553.
van Zyl, J. M., Neudecker, H., & Nel, D. G. (2000). On the distribution of the maximum
likelihood estimator of Cronbach's alpha. Psychometrika, 65, 271-280.
Wolff, H. G., & Preising, K. (2005). Exploring item and higher order factor structure with the
Schmid-Leiman solution: Syntax codes for SPSS and SAS. Behavior Research Methods, 37, 48-
Yang, Y., & Green, S. B. (2011). Coefficient alpha: A reliability coefficient for the 21st
century? Journal of Psychoeducational Assessment, 29, 377-392.
Zhang, Z., & Yuan, K. H. (2016). Robust coefficients alpha and omega and confidence intervals
with outlying observations and missing data: Methods and software. Educational and
Psychological Measurement, 76, 387-411.
Zimmerman, D. W., Zumbo, B. D., & Lalonde, C. (1993). Coefficient alpha as an estimate of
test reliability under violation of two assumptions. Educational and Psychological
Measurement, 53, 33-49.
Zinbarg, R. E., Yovel, I., Revelle, W., & McDonald, R. P. (2006). Estimating generalizability to
a latent variable common to all of a scale's indicators: A comparison of estimators for
ωh. Applied Psychological Measurement, 30, 121-144.
Zinbarg, R. E., Revelle, W., & Yovel, I. (2007). Estimating ω h for structures containing two
group factors: Perils and prospects. Applied Psychological Measurement, 31, 135-157.
Zinbarg, R. E., Revelle, W., Yovel, I., & Li, W. (2005). Cronbach’s α, Revelle’s β, and
McDonald’s ω H: Their relations with each other and two alternative conceptualizations of
reliability. Psychometrika, 70, 123-133.
Zumbo, B.D., & Rupp, A.A. (2004). Responsible modeling of measurement data for appropriate
inferences: Important advances in reliability and validity theory. In D. Kaplan (Ed.), The SAGE
handbook of quantitative methodology for the social sciences (pp. 73-92). Thousand Oaks: Sage.
Zumbo, B. D., Gadermann, A. M., & Zeisser, C. (2007). Ordinal versions of coefficients alpha
and theta for Likert rating scales. Journal of Modern Applied Statistical Methods, 6, 4.
Table 1
Comparison of practical considerations for six different methods
Method: Cronbach's Alpha
Implementation: Ubiquitous in general software (e.g., SPSS, SAS, Stata, R)
Notable advantages: Familiar to readers and reviewers
Notable drawbacks: Underestimates reliability when its assumptions are violated; requires tau equivalence

Method: Omega Total
Implementation: Available in the MBESS R package via the ci.reliability function or via the scaleStructure function in the userfriendlyscience package. Also calculable with a spreadsheet (provided in the appendix). No built-in option for computing a polychoric covariance matrix, though separate factor analysis procedures offer one, which does not affect ease of manual calculation
Notable advantages: Most conceptually related to Cronbach's alpha (Cronbach's alpha is a special case). Formula can be extended to take design-driven error covariances into account
Notable drawbacks: Tends to yield conservative estimates compared to other alternative methods

Method: Revelle's Omega Total
Implementation: Available in the psych R package via the omega function or via the scaleStructure function in the userfriendlyscience package. Not calculable manually
Notable advantages: Tends to justifiably exceed Omega Total and often exceeds the GLB
Notable drawbacks: Assumptions of the Schmid-Leiman transformation must be met

Method: Omega Hierarchical
Implementation: Available in the psych R package via the omega function or via the scaleStructure function in the userfriendlyscience package. Not calculable manually. Includes a built-in option for internally computing and using a polychoric covariance matrix
Notable advantages: Accounts for and excludes effects of minor dimensions
Notable drawbacks: Most conceptually distant from traditional Cronbach's alpha. Also dependent on the Schmid-Leiman transformation

Method: Greatest Lower Bound (GLB)
Implementation: Available in the psych R package via the glb.fa or glb.algebraic function, or via the scaleStructure function in the userfriendlyscience package. Not calculable manually. No built-in option for computing a polychoric covariance matrix
Notable advantages: Exceeds Cronbach's alpha, even if all of alpha's assumptions are met
Notable drawbacks: No analytic solution; current software does not offer a polychoric option

Method: Coefficient H
Implementation: Very simple to calculate in a spreadsheet (provided in the appendix); calculated by default in the scaleStructure function in the userfriendlyscience package for continuous items
Notable advantages: Designed for optimally weighted scales; not affected by the addition of poor items
Notable drawbacks: Misleading if the scale is scored with unit weights
Table 2
ECLS-K example standardized factor loadings, estimated reliability using different methods, and
model fit indices

Left panel, items with standardized loadings: Free or Reduced Lunch, Mom Education, Dad Education, Household Income, Expect Education, Number of Books, Music Lessons, Computer at Home, Parent Volunteers
Right panel, reliability methods with estimates and % Increase: Cronbach's Alpha, Omega Total, Revelle's Omega Total, Greatest Lower Bound, Coefficient H
Fit indices: SRMR, McDonald's Centrality
[Numeric entries are not recoverable in this copy.]

Note: SRMR = standardized root mean squared residual; % Increase = the percent relative
increase of reliability compared to Cronbach's alpha. The Free or Reduced Lunch variable was
reverse coded when calculating Cronbach's alpha and both Omega Totals so that all covariances
would be positive.
Table 3
Standardized factor loadings for Big Five example, treating the items as continuous with a
Pearson covariance matrix and discrete with a polychoric covariance matrix

Columns: Pearson covariance matrix; polychoric covariance matrix. [Loading values are not recoverable in this copy.]

Note: SRMR = standardized root mean squared residual, MC = McDonald Centrality
Table 4
Comparison of subscale reliabilities for model in Big Five Inventory example using Cronbach's
Alpha, both versions of omega total, the GLB, and Coefficient H

Columns, repeated for the Pearson and polychoric covariance matrices: Cronbach's Alpha, Omega Total, Omega Revelle, Greatest Lower Bound, Coefficient H. [Reliability values are not recoverable in this copy.]

Note: Omega Revelle = Revelle's Omega Total from the psych R package. Items with negative
loadings were recoded when calculating Cronbach's alpha and both omega totals so that all
covariances would be positive.
Software Code and Associated Screenshots for Obtaining Alternative Estimates of Reliability
Using R
Basics and Installing Packages
Because R is open source, new statistical packages are being added almost daily. In R, a
“package” is a set of procedures that can be used to perform certain statistical analyses. This is
equivalent to the “Proc” commands in SAS, procedures in SPSS, or commands in Stata. For
example, to fit a linear multilevel model, SAS uses the Proc Mixed procedure, SPSS uses the
MIXED procedure, Stata uses the xtmixed command, and R would use the lme4 package.
In R, not all packages are available by default upon opening the program (in fact, only very basic
packages are available). The packages needed to calculate scale reliability (psych, MBESS, and
userfriendlyscience) are not included and must be installed. This is done with the
following code:
install.packages(c("psych", "MBESS", "userfriendlyscience"))
Note that code in R is case-sensitive so capitalization is important. After running this code, you
will likely be prompted to select a “mirror site” which is the location from where these packages
are downloaded. A list of geographic locations may appear; it makes little difference which is
selected and they all contain the same information. These packages may take a few minutes to
install. Installing packages only needs to be done once per machine. Once the packages are
installed, they do not need to be installed again.
Loading the Data
Undoubtedly, one of the most difficult tasks when working with new software is to
successfully load the desired dataset. In this appendix, we use the data from the Big Five
Inventory example because it is included as an internal example within the psych package.
After installing the psych package, the Big Five Inventory dataset can be loaded with the
following code,
data(bfi, package="psych")
In general, there are multiple ways to load data into R. Although the pathway to the file can be
explicitly stated, it is often easier to find the desired file from a dialog menu. The following code
shows how to input datafiles into R that are saved in either the .csv, .sav (SPSS), .dta (Stata), or
permanent SAS data set formats.
require(foreign) # after installing a package, the require command tells R to use the package
dat <- read.csv(file.choose()) # CSV
dat <- read.spss(file.choose(), to.data.frame=TRUE) # SPSS (.sav)
dat <- read.dta(file.choose()) # Stata (.dta)
dat <- read.ssd(file.choose()) # permanent SAS data set
If the userfriendlyscience package is already installed, then one can use the getDat()
function to import data. This function determines the appropriate format and will automatically
import the data and assign it the name “dat”.
To simplify the analysis, we will break the full data into 5 separate datasets such that
each of the 5 subscales is contained within its own data set.
agre <- bfi[, 1:5]
cons <- bfi[, 6:10]
extr <- bfi[, 11:15]
neur <- bfi[, 16:20]
open <- bfi[, 21:25]
The name of the left side of the arrow is the new data name. On the right side of the arrow is the
old dataset (called bfi here because that is the default name for this data when loaded in R)
and a set of brackets. Within these brackets, users specify which parts of the data matrix to use.
The first value is blank because we want all the rows (people). The second set of numbers corresponds
to the columns in the data. So, for the Agreeableness dataset (agre), we want the first 5 columns
of the bfi data. The Conscientious dataset (cons) is composed of the 6th through 10th columns of
the bfi and so on.
Reverse Scoring
As is common in psychometric scales, some items may need to be reverse scored (this is required
for appropriate calculation of some reliability coefficients like Cronbach’s alpha). This can be
done with the invertItems function that is part of the userfriendlyscience package.
agreRev <- invertItems(agre, 1)
consRev <- invertItems(cons, c(4, 5))
extrRev <- invertItems(extr, c(1, 2))
openRev <- invertItems(open, c(2, 5))
This code creates a new R object (agreRev, consRev, extrRev, openRev) from the original R
data. After the invertItems function, the first value within the parentheses is the data set to
reverse score. After the comma, the numbers listed are the columns in the data that should be
reverse scored. The “c” indicates that a list will follow and is needed if multiple items are reverse
scored. So, the Agreeableness scale will reverse score Item 1, the Conscientiousness scale will
reverse score Items 4 and 5, and so on. The Neuroticism scale does not contain any items that
need to be reverse scored.
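For readers who prefer not to install userfriendlyscience, the same reversal can be done in base R. This is a sketch assuming items scored on a 1-6 scale, which is the response format of the bfi items; the reverse_item helper is ours, not part of any package:

```r
# Reverse scoring in base R: for an item scored from 1 to 6,
# the reversed value is (min + max) - x = 7 - x.
reverse_item <- function(x, min_score = 1, max_score = 6) {
  (min_score + max_score) - x
}

reverse_item(c(1, 6, 3))  # 6 1 4
```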
Cronbach’s Alpha
Cronbach’s alpha can be calculated as part of many different functions. The simplest is to use the
alpha function from the psych R package. If relevant items are reverse scored as discussed
previously, then the only argument of the alpha function is the dataset.
alpha(agreRev)
alpha(consRev)
alpha(extrRev)
alpha(neur)
alpha(openRev)
The output for the Agreeableness scale is as follows. The estimate of Cronbach’s alpha can be
found in the first row of the output under std.alpha.
Omega Total
To calculate the measure that we call omega total (not Revelle’s omega total), one must go
outside of the psych package to the MBESS package.
In the MBESS package, the ci.reliability function will estimate omega total as well as its
confidence interval.
require(MBESS) # only necessary the first time the package is used
ci.reliability(agreRev)
This yields the following output,
The estimate of omega total is the first value which appears beneath $est. On the Agreeableness
subscale, omega total is estimated to be 0.71 with a 95% confidence interval of [.69, .73]
Revelle’s Omega Total
Revelle’s omega total is calculated from the omega function in the psych package. The
omega function also outputs Cronbach’s alpha as well, so it can be used in lieu of the alpha
function. Again, the only argument needed in the function to obtain Revelle’s omega total using
a Pearson covariance matrix is the data set.
omega(agreRev)
omega(consRev)
omega(extrRev)
omega(neur)
omega(openRev)
The output from this function for the Agreeableness subscale is as follows:
The Alpha row shows Cronbach’s alpha, which matches the output from the alpha function.
Revelle’s omega total is the last value in the first set of values which is listed as 0.77. Notice that
this value is not the same as omega total because it uses a variance decomposition based on a
Schmid-Leiman transformation (the details of which are provided below the output).
A convenient option in the omega function is that a polychoric covariance matrix can be
estimated and used internally, which is possible by specifying only two additional words in the function call:
omega(agreRev, poly=TRUE)
omega(consRev, poly=TRUE)
omega(extrRev, poly=TRUE)
omega(neur, poly=TRUE)
omega(openRev, poly=TRUE)
The output from the omega function with the polychoric option for the Agreeableness subscale is
as follows:
Notice that the Alpha and (Revelle’s) Omega Total values are much higher than in the previous
output. The alpha function does not feature this poly option, so Cronbach’s alpha with a
polychoric covariance matrix is best run through the omega function.
The computation of Revelle’s omega total is a little involved and there are not many sources that
describe this version of the omega coefficient (outside of documentation for the psych R
package). For the remainder of this section, we outline where Revelle’s omega total comes from
to elucidate what it is calculating. In Equation 4 of the main text, we
defined Revelle’s omega total as

$$\omega_{RT} = 1 - \frac{\sum_{i=1}^{k}\left(1 - \lambda_{gi}^{2} - \sum_{f=1}^{F}\lambda_{fi}^{2}\right)}{V_{X}} \quad (A1)$$

Revelle (2016) notes that the subtracted quantity for each item, $\lambda_{gi}^{2} + \sum_{f=1}^{F}\lambda_{fi}^{2}$, is equal to the communality $h_{i}^{2}$ of that item, so the formula can be rewritten as

$$\omega_{RT} = 1 - \frac{\sum_{i=1}^{k}\left(1 - h_{i}^{2}\right)}{V_{X}}$$

This can be simplified to

$$\omega_{RT} = 1 - \frac{\sum_{i=1}^{k} u_{i}^{2}}{V_{X}} \quad (A2)$$

where $u_{i}^{2} = 1 - h_{i}^{2}$ is the uniqueness of the ith item (a.k.a. the error variance).
Using the polychoric covariance analysis of the Agreeableness subscale above, the communalities
appear in the “h2” column and the uniquenesses appear in the “u2” column. The sum of the
uniquenesses is equal to

$$.81 + .01 + .13 + .68 + .57 = 2.20,$$

which is the numerator of the fraction in Equation A2.
Unfortunately, the denominator $V_{X}$ does not appear in the output. Fortunately, this value is quite
simple to calculate in R. Recall that $V_{X}$ is equal to the sum of all elements of the sample
correlation matrix. The polychoric correlation matrix in R can be saved as an object with the
following code,

mat <- polychoric(agreRev)$rho

The sum function can then be used to add all the individual elements,

sum(mat)

which yields $V_{X} = 12.73$, so that

$$\omega_{RT} = 1 - \frac{2.20}{12.73} = 0.8272 \approx 0.83,$$

matching the output above.
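The arithmetic above can be checked directly (a sketch using the uniquenesses and the correlation-matrix sum reported in the output):

```r
# Uniquenesses (the u2 column) from the polychoric analysis of the
# Agreeableness subscale, and the sum of the polychoric correlation matrix.
u2 <- c(0.81, 0.01, 0.13, 0.68, 0.57)
V_x <- 12.73

omega_rt <- 1 - sum(u2) / V_x
round(omega_rt, 2)  # 0.83
```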
Omega hierarchical is similar, except that its numerator is equal to the variance explained by
only the common factor. This can be found by adding up all the values in the “g” column and
squaring the sum (be sure to add first and then square the sum; do not square first and then add the
squares). In the polychoric Agreeableness example,

$$(.34 + .70 + .79 + .52 + .62)^{2} = 8.82$$

$V_{X}$ is still equal to the same value (12.73), so omega hierarchical is equal to
$8.82 / 12.73 \approx 0.69$.
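The same check for omega hierarchical, using the general-factor loadings reported in the g column:

```r
# General factor (g) loadings from the polychoric Agreeableness output.
g <- c(0.34, 0.70, 0.79, 0.52, 0.62)
V_x <- 12.73

omega_h <- sum(g)^2 / V_x  # add first, then square the sum
round(omega_h, 2)  # 0.69
```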
Greatest Lower Bound
The glb.fa function in the psych package estimates the greatest lower bound. Similar to
other methods in the psych package, the only necessary argument of the function is the data set.

glb.fa(agreRev)

The output for the Agreeableness subscale is as follows,
The greatest lower bound estimate appears as the first item in the output, after $glb.
Unfortunately, the glb.fa function does not offer the option to use a polychoric covariance
matrix internally and therefore uses a Pearson covariance matrix. This can be circumvented by
separately estimating a polychoric covariance or correlation matrix and using that as the input
instead of the raw data, although it can be a bit tricky to save a polychoric correlation matrix
as a data frame in R.
First, the polychoric matrix is estimated with the polychoric function from the psych package.
Rather than immediately outputting the results, the output is saved to an object (called “mat” in
the code below). The output contains both the polychoric correlation matrix and thresholds; the
thresholds are not needed, so we want to exclude them and only save the matrix. In doing so, we
also must convert the object to a data frame. The R code for doing so for the Agreeableness
subscale is as follows,
The glb.fa function can accept a correlation matrix as input, so we can use the saved
polychoric correlation matrix as the input of the function.
This will provide the desired output,
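Putting the pieces together, and using the object name mat from the text, the sequence might look like the following sketch (agre again stands for the raw Agreeableness item data):

```r
library(psych)

# Estimate the polychoric correlations and save the full output
mat <- polychoric(agre)

# Keep only the correlation matrix (dropping the thresholds) and
# convert it to a data frame
polyCorr <- as.data.frame(mat$rho)

# Pass the polychoric correlation matrix to glb.fa in place of raw data
glb.fa(polyCorr)
```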
scaleStructure Function
Although the above analyses are not difficult to perform because the commands are quite
straightforward, for inexperienced or reluctant R users the scaleStructure function (from the
userfriendlyscience package) can estimate these quantities in a single pass and summarize the output.
scaleStructure(dat=agreRev, ci=FALSE)
scaleStructure(dat=consRev, ci=FALSE)
scaleStructure(dat=extrRev, ci=FALSE)
scaleStructure(dat=neur, ci=FALSE)
scaleStructure(dat=openRev, ci=FALSE)
The ci=FALSE argument indicates that we do not want confidence intervals for the estimates (although best
practice suggests that these are helpful to report).
The output for the Agreeableness subscale from this function is shown on the next page.
The function goes through the previously outlined methods, estimates reliability, saves the
output, and summarizes them in one window. The first set of output shows the results assuming a
Pearson covariance matrix followed by results that use a polychoric covariance matrix. It also
differentiates between omega total and Revelle’s omega total and is the only R package of which
the author is aware that provides estimates of Coefficient H.
Using Excel
Although R is the best available software option for estimating alternatives to Cronbach’s alpha
(and it is open source), we realize that some users may be hesitant to adopt a new software
program, especially to use methods with which they are unfamiliar. In an attempt to make these
methods as broadly accessible as possible, we have included two Excel spreadsheets for
calculating omega total and Coefficient H using only the standardized loadings from a factor
analysis. These loadings can be obtained from any software program of the user's choosing, so
learning new software is not required.
The provided Excel spreadsheet has two tabs, one for coefficient H and one for omega total. The
spreadsheet allows for up to 36 items. A factor analysis must be conducted to obtain the factor
loadings. This can be done in any program of the user’s choosing. Then, these loadings are
placed into Column B of the spreadsheet. For omega total, the spreadsheet is set up to
automatically calculate the uniqueness terms from the standardized loadings. Column G for
Coefficient H and Column F for omega total will then display the estimates of these measures.
Using the Agreeableness subscale example from the previous section, we will first
obtain the standardized factor loadings via maximum likelihood with the fa function
from the psych package in R. These loadings need not be obtained from R and can be estimated
in any program of the user's choice (e.g., Mplus, SPSS, SAS, Stata).
fa(agre, nfactors=1, fm="ml")
The output of this analysis yields the following,
The “ML1” column contains the standardized factor loadings for this scale (these correspond to
those provided in Table 3 of the main text). Taking these loadings and entering them into the
Excel spreadsheet for Coefficient H and omega total gives
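For readers who prefer to verify the spreadsheet arithmetic directly, both quantities can also be computed from the standardized loadings in a few lines of R. The loadings below are hypothetical placeholders, not the values from Table 3; substitute the "ML1" column from the output above:

```r
# Hypothetical standardized loadings (replace with the "ML1" column)
l <- c(.70, .80, .60, .75, .65)

# Omega total: squared sum of loadings divided by the squared sum
# plus the summed uniquenesses (1 - loading^2)
omega_total <- sum(l)^2 / (sum(l)^2 + sum(1 - l^2))

# Coefficient H (Hancock & Mueller)
H <- 1 / (1 + 1 / sum(l^2 / (1 - l^2)))

omega_total
H
```

This mirrors what the spreadsheet columns compute automatically once the loadings are entered into Column B.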