Managing Validity Versus Reliability Trade-Offs in
Scale-Building Decisions
Jeremy D. W. Clifton
University of Pennsylvania
Scale builders strive to maximize dual priorities: validity and reliability. While the literature is full of tips
for increasing one, the other, or both simultaneously, how to navigate tensions between them is less clear.
Confusion shrouds the nature, prevalence, and practical implications of trade-offs between validity and
reliability—formerly called paradoxes. This confusion results in most trade-offs being resolved de facto
at validity’s expense despite validity being de jure the higher priority. Decades-long battles against clear
measurement malpractice persist because unspecified trade-offs render scale-building decisions favoring
validity perennially unattractive to scale builders. In light of this confusion, the goal of this article is to
make plain that the source of validity versus reliability trade-offs is systematic error that contributes to
item communality. Moreover, straightforward, nontrivial trade-offs pervade the scale-building process.
This article highlights common trade-offs in 6 contexts: item content, item construction, item difficulty,
item scoring, item order, and item analysis. I end with 5 recommendations for managing trade-offs and
out 7 “dirty tricks” often used to exploit them when nobody’s looking. In short, reviewers should require
scale builders to declare how validity and reliability will be prioritized and penalize those who resolve
trade-offs in goal-inconsistent ways.
Translational Abstract
When psychologists build surveys, they try to maximize validity and reliability. Yet sometimes these
goals are in tension such that improving one means sacrificing the other—a trade-off. This article
discusses why and when such trade-offs occur so that scale builders can better manage them and
consumers of scale-based research can increase demand for better scales.
Keywords: scale development, validity, reliability, measurement, attenuation paradox
While some scientists enjoy the luxury of measuring tangible
objects with tools like rulers and weights, psychologists—along
with sociologists, program evaluators, and other social scientists—
frequently face the problem of latency (Ghiselli, Campbell, &
Zedeck, 1981). We cannot see, taste, touch, smell, or feel many of
the phenomena we most want to study, including intelligence,
poverty, depression, extraversion, fear, belief in God, and so forth.
One way psychologists have overcome the latency problem—
especially since the decline of behaviorism— builds on the insight
that, despite many downsides, asking people for their thoughts on
themselves can be very useful. As Kazdin (1998, p. 281) points
out, the individual is in “a unique position to report his or her own
thoughts, feelings, wishes, and dreams.” Thus, psychologists have
built and tested numerous questionnaires (often called scales) that
ask individuals a series of questions about themselves (often called
items). Using different techniques, psychologists then pool item
responses into scale scores and analyze scores. But how do any of
us know if these scores reflect an actual latent thing in nature, and
how do we know if that thing is what we intended to measure? As
discussed below, these are the central questions of reliability and
validity, respectively. All neophyte scale builders know that good
scales must be both reliable and valid. Often both goals are
furthered simultaneously. For example, both are increased by
removing double-barreled items and both decreased by adding
items with vague terms. Spotting and managing trade-offs between
reliability and validity, however, requires additional sophistication.
Psychometricians have long been aware of some trade-offs
between validity and reliability, referring to them as “paradoxes”
(e.g., Clark & Watson, 1995; Dawis, 1987). Gulliksen (1945) was
perhaps the first to recognize one specifically in the context of item
difficulty (i.e., item means). In this same narrow context, Loevinger (1954) coined the term attenuation paradox to label and
mathematically demonstrate that validity sometimes increases with
increases in reliability, while at other times validity decreases with
increases in reliability.

This article was published Online First August 15, 2019. I am grateful to Martin Seligman, Angela Duckworth, Robert DeRubeis, Alicia Clifton, and David Yaden for feedback on this article. Neither this article nor its novel contributions has previously been publicly disseminated and there are no conflicts of interest to declare. Correspondence concerning this article should be addressed to Jeremy D. W. Clifton, Department of Psychology, University of Pennsylvania, 3701 Market Street, Suite #203, Philadelphia, PA 19104. E-mail:

This document is copyrighted by the American Psychological Association or one of its allied publishers. This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

Psychological Methods, © 2019 American Psychological Association, 2020, Vol. 25, No. 3, 259–270. ISSN: 1082-989X

Since then, the term attenuation paradox has remained obscure and inconsistently applied, sometimes concerning only item difficulty or only item redundancy, while other times being used as an umbrella term for all possible tension
between reliability and validity (e.g., Boyle, 1991; Dawis, 1987; John & Soto, 2007; Ponterotto & Ruckdeschel, 2007; Robinson, Shaver, & Wrightsman, 1991). For example, Williams and Zimmerman (1982, p. 169) see the attenuation paradox as one specific
trade-off that happens to be the “tip of the iceberg” of many others,
while McCrae, Kurtz, Yamagata, and Terracciano (2011) use the
term as a reference for all trade-offs. Yet the cause, nature, and
prevalence of this larger group of trade-offs remains unclear.
This article, written for both psychometricians and nonpsycho-
metricians interested in the sophisticated measurement of latent
variables, has three goals. First, I will demonstrate that straight-
forward nontrivial trade-offs pervade virtually every stage of the
modern scale-building process. In doing so, I will show that many
oft-discussed measurement problems have validity-versus-
reliability trade-offs at their core that make resolving them in favor
of validity seem perennially unattractive to scale builders; an
“iceberg” of trade-offs exists that is not at all paradoxical or
mysterious. Moreover, as McDonald (2011) notes, the word paradox is an inappropriate label for predictable and understandable
effects that result from clear violations of measurement theory; I
avoid the term paradox and instead propose trade-off. Second, in
light of few guidelines for navigating trade-offs, I will offer some
practical suggestions on how to spot trade-offs and make informed
decisions given one’s measurement paradigm. Third, I seek to
nudge consumer preferences in order to expand the market for
slightly less reliable yet more valid scales. Like any artisan pro-
ducing goods, scale builders will drift toward constructing prod-
ucts people want (and cite) even if quality suffers.
To accomplish these goals, this article has three parts. The first
briefly reviews the concepts of reliability and validity to clarify the
source of theoretical trade-offs between them: systematic error
contributing to item or score communality. The second establishes
the practical, pervasive, and nontrivial nature of these trade-offs by
highlighting specific decisions scale builders face in six areas—
item content, item construction, item difficulty, item scoring, item
order, and item analysis—that typically result in sacrificing valid-
ity for reliability. The third provides recommendations on how
scale builders should manage trade-offs given their measurement paradigm.
Reliability and Validity
In order to understand the nature of trade-offs between reliabil-
ity and validity, one must deeply understand each. This section
defines both and identifies the theoretical source of trade-offs
between them.
A necessary but insufficient condition of measurement accuracy
is producing consistent readings when measuring the same phe-
nomenon. This axiom underlies, for example, testing unfamiliar
bathroom scales by hopping on and off a few times. If results vary,
the scale is broken. If consistent, the scale could be accurate.
Self-report scales must likewise be consistent, which psychome-
tricians call reliability. To determine reliability, psychometricians
use various tests that examine consistency across individual items
(internal reliability), random subsets of items (split-half reliabil-
ity), scale versions (alternate-form reliability), perspectives (inter-
rater reliability), time (test–retest reliability), and so forth (Anastasi & Urbina, 1997; Kazdin, 1998). Instead of passing categorical
judgment, these tests produce continuous indicators, usually be-
tween 0.00 and 1.00, that require interpretation informed by a
conceptual understanding of partial reliability and incremental
change in reliability. DeVellis’ (2003) well-known text provides
an appropriate starting point for that conceptual understanding. He
defines scale reliability as (a) “the proportion of variance” that is
(b) “attributable to the true score of the latent variable” (p. 27). I
will unpack this two-part definition by discussing an example
reliability indicator and the importance of item independence. I
end this subsection by describing the typical role of reliability
analysis in modern scale-building.
Cronbach’s alpha. The most commonly examined type of
reliability is internal reliability, which concerns consistency be-
tween each item, and the most commonly used indicator of internal
reliability is Cronbach's coefficient alpha, α (Cronbach, 1951; John & Soto, 2007; McCrae, Kurtz, Yamagata, & Terracciano, 2011; McNeish, 2018; Peterson, 1994; Santos, 1999). Most scale builders also use the same arbitrary threshold of α > .70 to indicate scale reliability, though other more nuanced standards have been suggested. For example, DeVellis (2003, pp. 95–96) describes standards for what is unacceptable (α < .60), undesirable (.60 < α < .65), minimally acceptable (.65 < α < .70), respectable (.70 < α < .80), very good (.80 < α < .90), and unnecessarily high such that one should consider shortening one's scale (α > .90). Due to its widespread use and computational simplicity, my discussion of reliability versus validity trade-offs in this article will rely heavily on α, and more than a cursory understanding of it is helpful. For a more detailed discussion of and alternatives to α, see McNeish (2018).
Despite its popularity, α is not well understood; John and Soto (2007) call it the misunderstood giant of psychological research. Yet standardized α is a simple enough calculation: α = kr/(1 + r[k − 1]). Both denominator and numerator are functions of the number of items in a scale (k) and mean interitem correlation (r), calculated such that the numerator will always be less than the denominator to a diminishing degree as either the number of items or interitem correlations increase. Because α is determined exclusively by the number of items in a scale and the degree to which they covary, DeVellis (2003) accurately describes it as a proportion of covariance among items. John and Soto (2007) illustrate α's dependence on these two attributes by noting α meets DeVellis' (2003) standard of very good at .87 for both a six-item scale with a mean interitem correlation of .52 and a nine-item scale with a mean interitem correlation of .42. This dependence on the number of items in a scale and the degree to which they covary also means that α does not indicate validity, even though it is commonly used to do so in even flagship journals like the Journal of Personality and Social Psychology (Flake, Pek, & Hehman, 2017). It also does not indicate unidimensionality, as psychometricians have pointed out for decades (Flake et al., 2017; Schmitt, 1996). The reader might imagine a fictional nine-item scale involving a five-item set concerning gender and an orthogonal four-item set concerning hair color. If interitem correlations averaged .95 within sets and .00 between sets, the average correlation across all nine items would be .42, α would be very good at .87, yet the scale would measure nothing. Therefore, incremental change in α indicates that the proportion of covariance among items is increasing without providing insight into why. Therefore, in order
for α to be attributable to the true score of the latent variable as
DeVellis (2003) suggests, scales must meet an additional assump-
tion called item independence.
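The standardized α formula above is easy to check numerically. The sketch below, which uses only the k and mean-interitem-correlation values stated in the text, reproduces John and Soto's (2007) two-scale illustration and the fictional gender/hair-color scale:

```python
# Standardized alpha from number of items (k) and mean interitem correlation (r)
def standardized_alpha(k, r):
    return k * r / (1 + r * (k - 1))

# John and Soto's (2007) illustration: two very different scales, same alpha
print(round(standardized_alpha(6, 0.52), 2))  # 0.87
print(round(standardized_alpha(9, 0.42), 2))  # 0.87

# The fictional nine-item scale: five gender items and four orthogonal
# hair-color items, r = .95 within sets and .00 between sets. Of the 36
# item pairs, 10 + 6 = 16 fall within a set, so the mean correlation is:
r_bar = (10 + 6) * 0.95 / 36  # ~0.42
print(round(standardized_alpha(9, r_bar), 2))  # 0.87, despite measuring nothing
```

The last line makes the text's point concrete: a "very good" α of .87 can arise from two orthogonal item clusters that share no latent variable at all.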
Item independence. Despite different theoretical approaches
to scale-building, including classical test theory and item response
theory, all are committed to item independence. Classical test
theory requires it of every item—items are called independent
parallel tests because every item must do a more-or-less-equally
good job at measuring the latent phenomenon along its entire
continuum (DeVellis, 2003; paraphrase of DeVellis, 2006, p. 57).
Item response theory requires item independence only within
subsets of items—termed local independence—as each item must
perform more-or-less equally compared with just those items intended to measure a particular part of the continuum (Hays, Morales, & Reise, 2000). Indeed, it is difficult to imagine how any
theoretical approach to solving the latency problem by pooling
item responses could avoid relying on item independence. This is
because all items inevitably involve a degree of error.
Every item captures signal from the latent variable as well as
error associated with any number of other things, such as a
respondent misreading a word, defining a term differently, being
distracted by a bumble bee, and so forth. This can be expressed as
item score = true score + error. When attempting to solve the
latency problem, error associated with each item is typically sub-
stantial, which is why most scales require more than one item. The
primary purpose of additional items is not to gather more true
score as some might think, but to involve new and different types
of error, called random or unsystematic error, that will cancel out
the substantial error captured by other items. When this happens,
pooled scores isolate true signal from the latent phenomena and
reliability indicators like α elegantly approximate true signal versus noise as DeVellis (2003) suggests. For example, if items are truly independent and α = .90, then 81% of variance in the scale score (.90 × .90 = .81) is attributable to systematic variance, all of which is true signal, and 19% is due to random variance, all of which is noise (Tavakol & Dennick, 2011). However, if item error is not random, then error terms covary, error does not cancel out, and pooled scores express it. When this happens and α is .90, then, in addition to unsystematic error (i.e., noise) explaining 19% of item variance, systematic error (i.e., signal not attributable to the latent variable) explains some unknown proportion of the remaining 81% of variance. Unless scales avoid systematic error, reliability indicators like α measure the communality of items while
saying nothing about the source of communality. It could be signal
from the latent variable. It could also be systematic, as in not
random, error signal sent by something else.
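A short simulation can make this distinction concrete. The sketch below is illustrative only (the variable names, error magnitudes, and the acquiescence-style nuisance factor are my assumptions, not the article's): it builds one scale whose item error is purely random and a second whose items also share a systematic nuisance signal, then compares α with validity, operationalized here as the correlation between the summed score and the known true score.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(0)
n, k = 2000, 8
true = rng.normal(size=n)        # the latent variable we intend to measure
noise = rng.normal(size=(n, k))  # unsystematic error, unique to each item
nuisance = rng.normal(size=n)    # systematic error shared by every item
                                 # (e.g., an acquiescence-style response set)

# Scale A: true signal plus purely random error
scale_a = true[:, None] + 1.5 * noise
# Scale B: identical, except every item also absorbs the shared nuisance
scale_b = scale_a + 1.5 * nuisance[:, None]

alpha_a, alpha_b = cronbach_alpha(scale_a), cronbach_alpha(scale_b)
validity_a = np.corrcoef(scale_a.sum(axis=1), true)[0, 1]
validity_b = np.corrcoef(scale_b.sum(axis=1), true)[0, 1]
print(f"random error only:     alpha={alpha_a:.2f}, r(score, true)={validity_a:.2f}")
print(f"plus systematic error: alpha={alpha_b:.2f}, r(score, true)={validity_b:.2f}")
```

Adding the shared nuisance raises α (roughly .78 to .92 under these parameters) while the score's correlation with the true latent variable drops sharply; α cannot tell the two kinds of communality apart.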
Of course, in practice, error is never completely random. Ide-
ally, items would be randomly selected from an infinite universe of
items (DeVellis, 2003; Ghiselli et al., 1981; Thorndike, 1982), but
scale builders have no access to infinity. Each time a scale is
administered, it is an open question not if but to what degree
communality is being determined by shared error variance. Indeed,
some theorists suggest that the basic assumptions of test theory are
unrealistic (e.g., McDonald, 2011). Perhaps so, but even if perfec-
tion is unattainable, a degree of independence among items is
within the scale builder’s power, as this article will show.
Reliability analysis in modern scale-building. After defin-
ing the latent variable of interest, generating an initial item pool,
editing items, piloting, and administering the survey, scale builders
analyze items in two stages that build on each other and are subject
to similar problems: dimensionality analysis and reliability analysis (DeVellis, 2003; Robinson et al., 1991; Thorndike, 1982).
Dimensionality analysis involves some sort of multivariate analy-
sis, such as exploratory factor analysis, principal components
analysis, or confirmatory factor analysis. All aim to determine
“how many latent variables underlie a set of items” (DeVellis,
2003, p. 103). For example, exploratory factor analysis—regard-
less of rotation or prerotation—identifies patterns explaining the
largest amount of common variance in the data, the second largest,
and so forth, removing these patterns— called extraction— until
only insubstantial patterns remain. Items that contribute the most
to dimensions, often called top-loading items, are typically relied
on most when naming the dimension. Items that do not substan-
tially contribute to any dimensions, often called nonloading or
low-loading items, are typically deleted from the scale. In this
article, I will refer to all such analyses as dimensionality analysis,
the factors they derive as dimensions, and the latent dimensions
they may or may not identify as aspects.
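The fictional gender/hair-color scale from the reliability discussion shows why dimensionality analysis is a necessary complement to α. A bare eigendecomposition of the implied correlation matrix (a minimal stand-in for a full factor analysis, using Kaiser's common eigenvalue-greater-than-one heuristic) recovers the two orthogonal item sets that a single reliability coefficient conflates:

```python
import numpy as np

# Implied 9x9 correlation matrix: r = .95 within the five-item and
# four-item sets, .00 between sets, 1.0 on the diagonal
R = np.zeros((9, 9))
R[:5, :5] = 0.95
R[5:, 5:] = 0.95
np.fill_diagonal(R, 1.0)

eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
n_dimensions = int((eigenvalues > 1.0).sum())  # Kaiser's eigenvalue > 1 rule
print(np.round(eigenvalues, 2))  # two large eigenvalues (4.80, 3.85), seven tiny
print(n_dimensions)              # 2: the scale is two-dimensional, not one
```

Two dominant eigenvalues flag two dimensions, information that the "very good" α of .87 for the same items entirely conceals.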
In reliability analysis, scale builders check if dimensions— often
called subscales—involve strong enough signal from a latent vari-
able to justify a claim that the subscale score reflects an actual
phenomenon in nature. Researchers often skip dimensionality
analysis entirely, however, and conduct reliability analysis on
items assumed to be unidimensional (Crutzen & Peters, 2017).
Reliability analysis often begins by computing α for each subscale and what α would be if each item was deleted. The scale builder may then delete the “worst” item, meaning the item that does not correlate as much with other items and thus decreases α or increases it marginally, before recomputing α, deleting the next worst item, and so forth. This process incrementally increases α until some threshold, often .70, is satisfied. If the threshold cannot
be reached, the scale builder concludes that the subscale is unre-
liable, indicating either (a) the underlying phenomena does not
exist or (b) the scale is too inconsistent to measure it accurately,
like a broken bathroom scale.
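The "α if item deleted" loop just described can be sketched in a few lines. This is an illustrative reconstruction of the common procedure, not the article's own software, and the demo data (five coherent items plus one pure-noise item) are my assumptions:

```python
import numpy as np

def cronbach_alpha(items):
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                          / items.sum(axis=1).var(ddof=1))

def prune_for_alpha(items, threshold=0.70):
    """Greedy loop: repeatedly drop the item whose deletion most
    increases alpha, until the threshold is met or nothing helps."""
    keep = list(range(items.shape[1]))
    while len(keep) > 2:
        alpha = cronbach_alpha(items[:, keep])
        if alpha >= threshold:
            break
        alphas_if_deleted = [cronbach_alpha(items[:, [j for j in keep if j != i]])
                             for i in keep]
        best = int(np.argmax(alphas_if_deleted))
        if alphas_if_deleted[best] <= alpha:
            break  # no deletion raises alpha: subscale judged unreliable
        keep.pop(best)
    return keep, cronbach_alpha(items[:, keep])

# Demo: five coherent items plus one pure-noise item
rng = np.random.default_rng(1)
true = rng.normal(size=1000)
items = np.column_stack([true + 0.8 * rng.normal(size=1000) for _ in range(5)]
                        + [rng.normal(size=1000)])
kept, alpha = prune_for_alpha(items, threshold=0.85)
print(kept, round(alpha, 2))  # the noise item (index 5) is the first casualty
```

Note that, as the article goes on to argue, nothing in this loop distinguishes true signal from systematic error; it simply maximizes covariance among the retained items.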
Though reliability indicators do not signal dimensionality, reli-
ability and dimensionality analyses ask interrelated questions: Is
there anything here (and if so how many)? Yet neither can tell you
what those latent patterns actually are, and both are dependent on
item independence. Like reliability indicators, dimensionality in-
dicators just as easily capture systematic signal from the latent
variable as systematic signal from something else. Thus, removing
items based on either reliability or dimensionality will increase or
“double down” on systematic error as easily as true signal.
For those interested in solving the latency problem, the degree
of systematic error matters because the more systematic error,
the less validity. Unlike reliability, which involves a specific
mathematical definition of a portion of variance ostensibly due to
some latent variable, validity involves the correct identification of
the latent variable, which is a conceptual determination of meaning
(DeVellis, 2003). Stated plainly, validity concerns the degree to
which the scale builder is measuring what she claims to be mea-
suring. A bathroom scale can be consistent, for example, and
consistently wrong.
Types of validity are even more undifferentiated and numerous
than types of reliability. They include construct, content, concurrent, predictive, criterion, face, factorial, convergent, and discriminant validity (e.g., John & Soto, 2007; Kazdin, 1998). Different theorists organize them in various ways (e.g., DeVellis, 2003; Ghiselli et al., 1981; Hogan & Nicholson, 1988; John & Soto, 2007; Thorndike, 1982). I organize them into two categories—
content validity and predictive validity— based on method, topic,
and when they are examined in a typical scale-creation process.
Content validity. Content validity concerns the degree to
which items denote the right construct, the entire construct, and
nothing else. It is often conceptualized in educational contexts
(Thorndike, 1982). For example, a 19th century American history
exam becomes an increasingly poor test the more questions that
are (a) off topic (e.g., about the Ancient Roman Empire); or (b) too
narrowly focused (e.g., most questions concern Abraham Lincoln).
In psychology, this type of validity is sometimes referred to as face
validity or construct validity, with the former sometimes meant as
content validity from a lay person’s perspective (i.e., a bald judg-
ment of how respondents interpret items based on item content)
and the latter as content validity from the psychometricians’ perspective (i.e., an expert judgment based on item content, the
research literature, and a vast array of empirical indicators of item
performance). Of course, expert or not, informed by empirical
evidence or not, determining content validity is based on subjec-
tive interpretation.
Content validity is what should determine scale labels. Because
scale builders want their work to matter and get attention as most
humans do, they often use labels that, in the view of Anastasi and
Urbina (1997, p. 113), are “far too broad and vague” to be justified
by the scale itself. Decades ago, Brigham (1930, p. 159) saw most
scale builders as “guilty of a naming fallacy that easily enables
them to slide mysteriously from the score in the test to the
hypothetical faculty suggested by the name given to the test.” For
example, romantic relationship quality might involve many aspects
relating to say sexual compatibility, communication quality,
shared values, and so forth. If a scale builder created a scale
labeled romantic relationship quality that only concerned sexual
compatibility, the scale would lack content validity. Presumably,
romantic relationships are about more than sex just as 19th century
American history is about more than Abraham Lincoln. Scale
labels must be as narrow as the items composing each scale.
Predictive validity. In addition to interpreting items concep-
tually (i.e., examining content validity), scale builders can empir-
ically check interpretations against other observable phenomena.
Predictive validity, sometimes called criterion-related validity,
concerns the degree to which scale scores occupy the right spot in
the nomological net or, as DeVellis (2003, p. 50) puts it, having the
right empirical associations. Are scores orthogonal to variables
presumed to be unrelated (divergent/discriminant validity), signif-
icantly correlated with variables presumed to be related (conver-
gent validity), explaining new variance in variables presumed to be
related (incremental validity), and very highly correlated with
variables presumed to reflect the same latent phenomenon (con-
current validity)? Answering these questions typically requires
dozens of hypotheses about how the supposed latent variable
should covary with other things in nature (DeVellis, 2003).
Though each particular hypothesis can be tested, the basis for
such hypotheses is always subjective (e.g., Hogan & Nicholson,
1988; Robinson et al., 1991). For example, why are certain criterion variables selected and not others? What does it mean if
criterion variables correlate a great deal less or more than ex-
pected, and what constitutes a great deal? What if one of 20
hypothesized relationships fails? What if five fail? At what point is
the scale builder simply wrong about what he measured; wrong
about criteria; or getting the first glimpse of an amazing new
discovery? Indeed, often scale builders seek a “sweet spot” where
most variables correlate as expected in order to establish predictive
validity but enough surprises spark interest in further research. A
handful of obvious, common-sense suggestions provide little insight into what constitutes too surprising (Anastasi & Urbina, 1997; DeVellis, 2003). Methodologists note, for example, that scale scores should covary with convergent validity criteria beyond what might be attributable to shared method variance, and
especially important theoretical relationships should be statisti-
cally significant. Not only is validity in the eye of the beholder,
standards for achieving it are low.
Three Observations
Having discussed validity and reliability, I can now make three
distinctions that are key to understanding why and how validity
versus reliability trade-offs occur in the scale-building context.
First, it is difficult for scale builders to prioritize incremental
change in validity relative to incremental change in reliability
because validity is subjectively determined. No numeric indicators
are applicable across scales, let alone a widely used indicator like α, where the threshold of .70 often determines publishability.
Anastasi and Urbina (1997) even note that distinguishing high
versus low validity is meaningless because validity is non-
numerical. But how can a priority be maximized if it cannot
meaningfully increase or decrease? High versus low reliability, on
the other hand, is an easily computed indication of an objective and
important psychometric quality. The general incentive among re-
searchers to pursue what is measurable at the expense of what is
not has been commented on in scale-building since Loevinger
(1954) introduced the attenuation paradox. Due to this incentive,
scale-building decisions that incrementally increase reliability
seem more rigorous and defensible to peers than decisions that
increase validity. The result is that reliability indicators are widely
reported and dominate item-retention decisions (John & Soto,
2007). Of course, in any industry or discipline, when numeric
indicators drive decisions at the expense of meaning, higher, more meaningless numbers must be expected, which the next section
confirms has happened in the scale-building context.
A second observation concerns sequence. Validity consider-
ations— especially the most speculative ones— dominate early
stages of the scale-building process as the scale builder streamlines
definitions, generates items, and identifies criteria to assess pre-
dictive validity. Reliability considerations come afterward, once
researchers have data. However, this is also the first point at which
a variety of powerful indicators, including item means, skew,
kurtosis, interitem and item-total correlations, factor loadings, item
information curves, and so forth, can provide empirical insight into
which sources of systematic error are present and how to
strengthen validity. However, because maximizing α is currently
more defensible to peers, these more empirically based validity
considerations are often given little to no weight in item retention
decisions. Even if validity were prioritized, John and Soto (2007)
note that how scale builders should prioritize validity relative to
reliability during item analysis remains unclear.
The third and most important distinction between validity and
reliability for the purposes of this article concerns their inverse
relationship to systematic error. All measures, from atomic clocks
to bathroom scales, incorporate three types of information: true
signal, random error, and systematic error. Changes in true signal
and random error impact reliability and validity similarly. For
example, bathroom scales that capture more true signal will pro-
duce readings that are both more consistent and more accurate, and
bathroom scales that capture more random error will be less
consistent and less accurate—no trade-offs here. Increasing sys-
tematic error, however, makes readings more consistent with each
other, thereby increasing reliability, but does this by becoming a
less accurate measure of the intended phenomenon, which de-
creases validity. The source of validity versus reliability trade-offs,
therefore, is systematic error.
Systematic error plays the same role in self-report scales. In this
context, systematic error can be defined as dynamics unrelated to
the intended latent phenomenon that influence multiple items or
scores in similar (i.e., nonrandom) ways. Reliability indicators
(and even sophisticated dimensionality indicators) examine cova-
riance among items and scores, but are blind to whether covariance
stems from systematic error or true signal, because both are sys-
tematic. Thus, as item responses capture more systematic error two
things happen: Reliability indicators like α increase as a mathematical consequence of rising interitem or score-to-score correlations, and validity declines as the scale increasingly becomes a
measure of the wrong phenomena. When conscientious scale
builders remove systematic error from their scales, interitem and
interscore correlations decrease, reliability decreases, but validity
improves. Therefore, increasing systematic signal that is not true
signal inevitably increases scale reliability at validity’s expense.
These trade-offs will occur regardless of whether scale reliability
is high or low, what the source of systematic error might be, or
what is being measured. Moreover, because opportunities to in-
crease systematic error pervade the scale-building process, validity
versus reliability trade-offs are legion. For naïve scale builders,
resolving trade-offs in favor of reliability will seem desirable as
the resulting damage to validity is not apparent to them. For
devious scale builders, resolving trade-offs in favor of reliability
will be desirable as the damage to validity is not apparent to others.
Several of these trade-off decisions are discussed in the next
section. Many more are not.
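This dynamic is easy to demonstrate numerically. The sketch below is my own illustration, not a procedure from the article (the weights, sample size, and seed are arbitrary): ten items each load on a latent true score plus a method bias shared by all items, and as the bias weight grows, Cronbach's α rises while the scale score's correlation with the true score falls.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5000, 10  # respondents, items

def cronbach_alpha(X):
    """Cronbach's alpha from an n-by-k matrix of item responses."""
    k_items = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return k_items / (k_items - 1) * (1 - item_vars.sum() / total_var)

true_score = rng.normal(size=n)    # the intended latent phenomenon
method_bias = rng.normal(size=n)   # systematic error shared by all items

def simulate(bias_weight):
    """Each item = true score + shared bias + independent random error."""
    X = true_score[:, None] + bias_weight * method_bias[:, None] + rng.normal(size=(n, k))
    scale_scores = X.mean(axis=1)
    validity = np.corrcoef(scale_scores, true_score)[0, 1]
    return cronbach_alpha(X), validity

for w in (0.0, 1.0, 2.0):
    alpha, validity = simulate(w)
    print(f"bias weight {w}: alpha = {alpha:.2f}, validity = {validity:.2f}")
```

With these settings, α climbs from roughly .91 toward .98 as the bias weight rises, while the validity coefficient collapses from about .95 toward .44: the reliability indicator rewards exactly the error that destroys validity.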
Validity Versus Reliability Trade-Offs in Context
Whereas the preceding section discussed the nature of reliabil-
ity, validity, and the theoretical source of trade-offs between them,
this section makes plain that scale builders face nontrivial validity
versus reliability trade-offs throughout the scale-building process,
including choices about item content, item construction, item
difficulty, item scoring, item order, and item analysis. I will show
how trade-offs at each point stem straightforwardly from decisions
that amplify systematic error that increases consistency while
weakening the relation of the scale to the phenomenon of interest.
I will also note several prominent measurement problems that
persist because a validity versus reliability trade-off perpetually
makes trading validity for reliability the more attractive option.
Item Content
A trade-off exists between reliability and diversity of item
content, also called domain coverage, which is crucial for validity.
Yet scale builders commonly undervalue domain coverage in both
the initial item pool and the analysis phase. This is likely to
continue because the core issue is this trade-off which renders
domain coverage perennially unattractive to scale builders.
Measurement efforts begin by defining the latent phenomena of
interest. This definition includes a tentative theoretical model that
serves a variety of functions, one of which is to map out every
aspect of the construct necessary for content validity (DeVellis,
2003). Ideally, items are randomly selected from an infinite uni-
verse of potential items so that no particular aspects are privileged
and item error cancels out (DeVellis, 2003). However, to accom-
plish this in reality, aspect diversity must be intentionally culti-
vated. According to Goldberg and Velicer (2006, p. 211) and
echoed by Gorsuch (1983), aspect diversity in the initial item pool
is “by far the most important” part of the scale-building process.
Evenly distributing items across many aspects is critical for valid-
ity not only because it increases item independence but also
because it allows dimensionality analyses to empirically identify
distinctions between natural phenomena that conceptual analyses
may overlook or overvalue. One never knows perfectly the true
shape of the latent phenomena, and while later analyses can
exclude items, they cannot add new ones. This is why Clark and
Watson (1995, p. 311) note that “one always should err on the side
of overinclusiveness” by including many items examining periph-
eral aspects in the initial item pool.
Unfortunately, cultivating aspect diversity in the initial item
pool always decreases reliability because the more evenly items
touch on different aspects of a latent phenomenon, the less items
intercorrelate. To use the content validity example above, a 20-
item quality of romantic relationships scale will have a much
higher if 15 items concern sex than if all 20 items are evenly
spread across sex, conversation quality, shared values, and other
aspects. Items covering a single aspect of a latent phenomenon
measure that phenomenon in a similar (i.e., nonrandom) way,
which systematically increases intercorrelations among error
terms. Therefore, by making the initial item pool lopsided, validity
is traded for reliability.
Maximizing reliability after items are administered is likewise
problematic because, as Clark and Watson (1995, p. 316) note
“maximizing internal consistency almost invariably produces a
scale that is quite narrow in content.” Items that offer the most
diversity of item content are more likely to be the first to be
removed from scales during reliability analysis precisely because
they offer more diversity, causing them to correlate less with other
items. In turn, items that offer the least diversity are more likely to
be retained because they correlate more with other items. Some-
times entire aspects can be thoughtlessly deleted. For example, say
that conversation quality is an important aspect of a quality of
romantic relationship scale but happens to be less central than
others (one aspect is inevitably least related). This will mean that
the item contributing least to α is likely a conversation quality
item. If the scale builder deletes it to maximize reliability as is
This document is copyrighted by the American Psychological Association or one of its allied publishers.
This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.
typically done, the mean interitem correlations of all other con-
versation quality items decrease more than items related to other
aspects. As a result, the next time the scale builder determines
which items contribute least to α, another conversation quality
item will be identified for deletion. This process easily builds
momentum as conversation quality items fall like dominoes until
the entire aspect is gone. The result is an appreciably shorter,
dramatically less valid, and much more reliable scale.
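The domino dynamic can be reproduced in a short simulation. The sketch below is my own illustration (the aspect structure and loadings are invented, not taken from the article): eight items tap central aspects, four tap conversation quality, and a greedy "maximize α if item deleted" rule removes the conversation items one after another.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
quality = rng.normal(size=n)      # latent quality of the relationship
conv_aspect = rng.normal(size=n)  # aspect-specific variance for conversation items

# Eight items tapping central aspects, plus four conversation-quality items
# that relate to the latent variable a bit less strongly (hypothetical loadings).
central = quality[:, None] + rng.normal(size=(n, 8))
conversation = (0.5 * quality + 0.5 * conv_aspect)[:, None] + rng.normal(size=(n, 4))
X = np.hstack([central, conversation])
labels = ["central"] * 8 + ["conversation"] * 4

def cronbach_alpha(X):
    k = X.shape[1]
    return k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum() / X.sum(axis=1).var(ddof=1))

deleted = []
while X.shape[1] > 8:  # delete four items, one at a time
    # "alpha if item deleted": drop whichever item leaves the highest alpha
    alphas = [cronbach_alpha(np.delete(X, j, axis=1)) for j in range(X.shape[1])]
    j = int(np.argmax(alphas))
    deleted.append(labels.pop(j))
    X = np.delete(X, j, axis=1)

print(deleted)  # the conversation items tend to fall first, like dominoes
```

Each deletion lowers the remaining conversation items' item-total correlations further, so the rule keeps targeting them until the aspect is gone, exactly the momentum described above.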
Many psychometricians conclude that fairly low average interi-
tem correlations are desirable because, in order to maximize va-
lidity, the scale builder should seek items that correlate as much as
possible with the latent variable and as little as possible with each
other (e.g., Clark & Watson, 1995; Kline, 1986). Boyle (1991),
concerned about item redundancy, believes scale builders should
be empowered to select items to “maximize the breadth of mea-
surement of the given factor” (p. 2) even at the expense of
reliability or dimension loadings. It is also worth noting that if the
latent variable is truly narrow, narrowness of topic may be appro-
priate (e.g., Robinson et al., 1991).
Item Difficulty
A trade-off exists between reliability and diversity of item
means, also called item difficulty, which is critical for validity. To
be useful to researchers, all variables, including scale scores, must
vary across persons, ideally in a normal distribution. This requires
achieving the right level of difficulty, what Loevinger (1948) called
popularity. A slight misnomer, the difficulty term comes to psy-
chometricians from educational and ability testing (Anastasi &
Urbina, 1997; DeVellis, 2003; Thorndike, 1982). For example, if
all participants receive As on an exam, the exam might be con-
sidered too easy; if all receive Fs, it is too difficult. An average
somewhere in the middle is ideal. Likewise, items in psychological
scales are considered difficult when item means are low (because
they are too hard to agree with) and not difficult enough when
means are high (because they are too easy to agree with). The usual
aim in both educational and psychological testing is to construct
scales capable of meaningfully differentiating between levels
throughout the distribution. For example, if all items are at a high
difficulty level, then the overall test is suited mainly for discrim-
inating between subjects only at this high level (e.g., DeVellis,
2003). This may be desirable when the target population is at this
high level, but scale builders must include a range of easier and
harder items in order to discriminate between all levels of the
latent variable (Anastasi & Urbina, 1997; Bock, Gibbons, & Muraki, 1988).
There is a trade-off between reliability and diversifying item
difficulty because some variance among items is an artifact of
certain levels of difficulty (Loevinger, 1948). For example, among
items involving one latent variable, there may be two groups of
highly correlating items: hard items and easy items. Dimensional-
ity analyses will blindly capture this variance in different dimen-
sions which, when extreme, results in spurious dimensions actually
termed difficulty factors (e.g., Bock et al., 1988). Researchers since
Loevinger (1948) have warned against assigning psychological
meaning to such difficulty dimensions which can seem meaningful
and have high internal reliability, but no validity. Fortunately,
difficulty-driven dimensionality is easy to spot by comparing item
means across and within dimensions.
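One simple way to run this check is to profile item means within each proposed dimension. The helper below is my own sketch (the function name and toy data are hypothetical); it flags the telltale pattern of one dimension collecting only easy (high-mean) items and another only hard (low-mean) items.

```python
import numpy as np

def factor_mean_profile(X, assignments):
    """Average item mean within each assigned dimension; a large gap
    between dimensions suggests a difficulty factor rather than content."""
    item_means = X.mean(axis=0)
    return {f: float(np.mean([item_means[j] for j, a in enumerate(assignments) if a == f]))
            for f in sorted(set(assignments))}

rng = np.random.default_rng(0)
easy = rng.normal(4.0, 1.0, size=(500, 5))  # high-mean, easy-to-agree-with items
hard = rng.normal(2.0, 1.0, size=(500, 5))  # low-mean, hard-to-agree-with items
X = np.hstack([easy, hard])

profile = factor_mean_profile(X, ["A"] * 5 + ["B"] * 5)
print(profile)  # if "A" and "B" split cleanly by mean, suspect difficulty-driven dimensions
```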
Whether or not dimensionality analyses produce difficulty di-
mensions, deleting items to maximize reliability is much more
likely to homogenize item difficulty than diversify it. The in-
creased communality of items associated with similar item diffi-
culty suggests that items tapping this source of systematic error the
least—the items most needed—are the worst for increasing reli-
ability. Fortunately, one can simply review item-retention deci-
sions to see if deletions resulted in homogenizing item difficulty.
Item Construction
A trade-off exists between reliability and diversity in item
language, called item form or item construction, which is critical
for validity. Scholars stress the importance of diversifying item
construction (e.g., Clark & Watson, 1995; Robinson et al., 1991).
They encourage using different nouns, different verbs, and differ-
ent sentence structures. This is because similar words or grammat-
ical structures introduce systematic error variance associated with
those words or structures. Unfortunately, semantic redundancy
problems are likely to continue because the core issue is this
trade-off between reliability and validity that makes semantic
diversity chronically unappealing to scale builders.
Though the term shared-method variance usually refers to sys-
tematic error associated with broad categories of assessment meth-
odology such as self-report, peer-report, or researcher observation,
the same idea is applicable within a self-report scale. For example,
if a 10-item scale has five items that begin with “I feel that
. . .”—what can be called Method A—while the remaining five
items have five quite different structures—what can be called
Methods B, C, D, E, and F—the “I feel that . . .” items are likely
to correlate higher with each other and thus dominate dimensions
and do more to increase reliability—all driven by systematic error
associated with Method A. Language variety, therefore, both low-
ers reliability and is essential for validity because it decreases
systematic error associated with particular words, phrases, and
grammatical structures. For this reason, similarly worded items
that dominate dimensions are often not the superior representatives
of the latent variable they appear (e.g., the “. . . boring” construc-
tion dominating our factor concerning the belief that the world is
an interesting place; Clifton et al., 2019). In addition to retaining
these top-loading items and the systematic error they capture, scale
builders usually rely most on these top-loading items when naming
the latent variable, thus compromising validity.
Despite broad theoretical commitment to language variety, this
is a difficult area in which to conduct research because of the
extremely subjective nature of assessing language diversity. More
research has been done on the utilization of forward-scored versus
reverse-scored items, to be discussed below, which can be consid-
ered one way of respecting the need for item diversity. Moreover,
as DeVellis (2003) notes, item redundancy can be useful, espe-
cially in early stages of scale creation. Until item analysis, scale
builders often cannot tell when seeming redundancy is worthwhile
or excessive. However, an approach that values redundancy in
early stages must also empower its removal at later stages. This is
difficult when redundant items almost always load higher on the
dimension and increase reliability more than their nonredundant
counterparts, yet typically do so because of systematic error.
Item Scoring
A trade-off exists between reliability and using a balance of
opposite-scored items, which often aids validity. Many sources of
error, including acquiescence bias, are associated with scoring
items in only one direction (e.g., Boyle, 1991). Acquiescence bias,
sometimes called agreement bias, is the tendency for respondents
to agree with items regardless of what the item is about; it is seen
as a major problem in scale development (e.g., Anastasi & Urbina,
1997; DeVellis, 2003; Kazdin, 1998; Paulhus, 1991). For decades,
psychometricians have suggested that the effects of agreement bias
can be mitigated by using opposite-scored items so that error
associated with a tendency to agree with items is cancelled out by
a tendency to agree with other items measuring the opposite (e.g.,
Comrey, 1988; Paulhus, 1991; Robinson et al., 1991). Despite this,
using only forward-scored items remains common (or using only
reverse-scored items, e.g., Dweck, Chiu, & Hong, 1995). This is
partly due to genuine scholarly disagreement and partly to a
validity versus reliability trade-off that makes using opposite-
scored items unappealing.
All agree that opposite-scored items tend to misbehave. When
scales only have one or two opposite-scored items, those items are
almost always the ones that hurt reliability the most (e.g., DeVel-
lis, 2003). Dimensionality analysis often reveals that items scored
the same way clump or even dominate dimensions entirely, mak-
ing dimensionality suspect and, at minimum, difficult to interpret
(McPherson & Mohr, 2005; Schriesheim & Eisenbach, 1995;
Woods, 2006). In response, some psychometricians suggest miti-
gation techniques such as using 50/50 opposite-scored items (e.g.,
Comrey, 1988) and ipsatizing responses, which theoretically re-
moves variance associated with agreement bias by subtracting item
responses from the average response across all items (Hicks,
1970). Other psychometricians advise avoiding opposite-scored
items entirely. In their view, opposite-scoring, particularly reverse-
scoring, is directly to blame for low reliability and artificial di-
mensionality (e.g., Schriesheim & Eisenbach, 1995). However,
using items scored only one way increases reliability precisely
because it increases systematic error associated with shared-
method variance, thereby damaging validity.
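Ipsatizing as Hicks (1970) describes it is mechanically simple. The sketch below is my own minimal illustration: it centers each respondent's answers on their own mean response, so a uniform tendency to agree no longer contributes to interitem covariance.

```python
import numpy as np

def ipsatize(X):
    """Center each respondent's answers on their own mean response,
    removing variance tied to an overall tendency to agree (Hicks, 1970)."""
    return X - X.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
true_scores = rng.normal(size=(200, 1))
acquiescence = rng.normal(size=(200, 1))  # per-respondent agreement tendency
X = true_scores + acquiescence + rng.normal(size=(200, 6))

Z = ipsatize(X)
print(np.allclose(Z.mean(axis=1), 0))  # each respondent's mean response is now zero
```

Ipsatized scores are relative rather than absolute within each respondent, which is why ipsatizing is contested for between-person comparisons; the sketch shows only the mechanics.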
There are only two ways to score items: forward and reverse.
Relying on one for all items creates the intractable problem where
every single item is impacted by the same sources of systematic
error variance—not just agreement bias—with no way of gauging
the damage. When scoring method defines two dimensions in a
scale, systematic error is associated with both dimensions, not
merely the dimension with reverse-scored items. Picking one or
the other merely hides the extent of the problem. Using both ways
of scoring elegantly corrects much of this error because each
source of error has largely opposite effects on the two types of
items. Scale builders can leapfrog a host of problems without
needing to identify each source of systematic error and creative
The reason opposite-scored items often misbehave is because
scale builders use one method to create forward-scored items
and a different method to create reverse-scored items. Using
different methods to create item subsets leads to subset-specific
artifactual systematic variance that damages reliability, validity, or
both. For example, scale builders often create what I call derivative
reverse-scored items by taking carefully crafted forward scored
items and minimally editing them with antonyms or negators such
as not, no, mis-, and dis- (McPherson & Mohr, 2005; Paulhus,
1991; Schriesheim & Eisenbach, 1995; Thorndike, 1982). But
derivative reverse-scored items commonly introduce three sources
of systematic error. First, because minor edits are often overlooked
by fast-reading respondents, derivative reverse-scored items cap-
ture variance associated with respondent carelessness and not the
latent variable (e.g., Robinson et al., 1991). If even 10% of a
sample overlook a negator, dimensionality analysis can yield en-
tirely different results (Schmitt & Stults, 1985; Woods, 2006).
Second, the means of derivative reverse-scored items often differ
systematically from their forward-scored counterparts, which, as
explained in the Item Difficulty section, introduces difficulty-
related communality that increases reliability at validity’s expense.
Third, derivative reverse-scored items often concern the wrong
latent phenomena. Robinson, Shaver, and Wrightsman (1991), for
example, found that responses to Obedience is an important thing
for children to learn were orthogonal to responses to an identical
item about disobedience. Unlike mathematics, which is composed
of discrete interchangeable parts, natural language has combinative
emergent meanings that obscure what negators like not mean,
whether the absence of a thing, the partial absence of a thing, the
presence of an opposite, or a lack of knowledge either way. For
example, while the phrase I’m bad at cricket clearly posits low
ability, I’m not bad at cricket may posit (a) high ability as the
phrase not bad can often mean quite good, (b) moderate ability
because extremely low ability is excluded, (c) the mere absence of
poor ability, or (d) ignorance about one’s cricket abilities. For
these reasons, latent opposites are not easily captured by derivative
opposite-scored items (e.g., Paulhus, 1991). Therefore, scale build-
ers should generate all items nonderivatively, with opposite-scored
items crafted with equal care and ingenuity.
A final reason that opposite-scored items are essential for va-
lidity while often lowering reliability is because they allow for
what Tay and Jebb (2018) call continuum specification. Scales are
valid when both high and low scores reflect the same latent
phenomenon. This means scale builders must know the meaning of
high and low scores. For example, if forward-scored items such as
I am a tall person are not mirrored by responses to I am a short
person, the scale builder seeking to measure self-perceived height
faces a profound validity issue. According to the law of noncon-
tradiction, the basis of all meaning is exclusion. Just as part of
knowing the nature of Y is knowing what is not-Y, part of knowing
what it means to score high on Y includes knowing what it means
to score low on Y. If the scale builder’s theory is found to be false
(i.e., responses to forward-scored items do not mirror responses to
reverse-scored items), the hypothesized shape of the latent variable
was incorrect, its true shape remains unknown, and more research
is needed (Paulhus, 1991). Validity mistakes, ranging from small to
discipline-impacting, can result when scholars fail to realize that
supposed opposites are actually empirically distinct dimensions. The
historical corrective of positive psychology, for example, would have
been unnecessary if the presence of wellbeing was never construed
as merely the absence of mental illness (see Seligman & Csik-
szentmihalyi, 2000). Thus, instead of deleting pesky opposite-
scored items, scale builders should administer more. Clark and
Watson’s (1995) suggestion about domain coverage applies here
as well: an overly inclusive item pool touching on many different
opposite-scored aspects can allow the true shape of the latent
phenomenon to reveal itself.
Item Order
A trade-off exists between validity and two common scale
administration practices that typically increase reliability: admin-
istering items in a fixed order and in unidimensional blocks. Even
though both practices are already known to damage validity (e.g.,
Schwarz, 1999), they remain widespread partly because of survey
administration logistical constraints and partly because a trade-off
makes item administration practices that increase validity peren-
nially unattractive to scale builders.
Item order is important because each item serves as a prime for
the next. For example, reversing the order of a marital satisfaction
item and a life satisfaction item causes the interitem correlation to
switch from .32 to .67, presumably because when the marriage
item comes first, marital considerations play an inflated role in the
assessment of life more generally (Schwarz, Strack, & Mai,
1991). Though sequence effects are not always this large, every
item administered in a sequence inevitably influences those that
follow in a nonrandom way. The result is increased communality
among items, increased reliability, and weakened validity.
To explore the severity of sequence effects, Knowles (1988)
reordered items of several well-known scales in many iterations,
thereby isolating sequence effects from item content effects. He
found that—regardless of which item it was—coming near the end
of a scale caused items to elicit more extreme responses and share
nearly twice as much variance with the total scale score. Therefore,
a major portion of common variance among later items in fixed
order scales should be understood as systematic error. In reliability
analysis, early items that express less of this systematic error are
more likely to be removed precisely because they are more
independent (i.e., better) tests of the latent variable. Indeed, the
most independent item in any scale is the first item because it is not
systematically influenced by any items preceding it, yet this item
is most likely to be removed.
When item order causes items administered near each other to
behave similarly, dimensionality analyses blindly incorporate this
variance. To test this, Weinberger, Darkes, Del Boca, Greenbaum,
and Goldman (2006) administered 10 items: three about alcohol,
four about sex, and three purposefully ambiguous items
about having wild hedonistic fun. When administered first, the
ambiguous items formed their own factor. When administered
second, the ambiguous items attached themselves to either the sex
factor or the alcohol factor depending entirely on item order. This
shows that item order alone can introduce enough systematic error
to drive dimensionality.
The obvious solution to sequence effects is to randomize item
order differently for each respondent and intersperse items among
many other items measuring different phenomena (e.g., Knowles,
1988). However, making each item a more independent measure of
the latent variable in this way always lowers reliability. For ex-
ample, Alterman et al. (2003) found that internal reliability for the
socialization subscale in the California Psychological Inventory
(CPI) fell from α = .77 to α = .68 merely by interspersing items
within the larger CPI. Likewise, we (Clifton et al., 2019) found
that α for a subscale concerning the belief that the world is
characterizable fell from .80 when administered in a unidimen-
sional block to .66 when interspersed with six other items in a
fixed order, even though other subscales were reliable when inter-
spersed and randomized across several times as many items. Ran-
domizing item order and interspersing items can be detrimental to
other forms of reliability as well, such as test–retest reliability.
When a respondent answers items administered in one order at
Time 1 and a completely different order at Time 2, sequence
effects at Time 1 differ from those at Time 2, these effects cannot
contribute to communality of scores, and this lowers test–retest
reliability while validity improves.
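Per-respondent randomization and interspersal are straightforward to implement in survey software. The sketch below is my own illustration (the item identifiers are hypothetical): it draws an independent presentation order for each respondent, mixing the scale's items among filler items measuring other phenomena.

```python
import random

random.seed(42)

scale_items = [f"scale_{i}" for i in range(10)]
filler_items = [f"filler_{i}" for i in range(30)]  # items measuring other phenomena

def presentation_order():
    """An independent random order per respondent: sequence effects still occur,
    but they no longer contribute shared (systematic) variance across respondents."""
    order = scale_items + filler_items
    random.shuffle(order)
    return order

respondent_1 = presentation_order()
respondent_2 = presentation_order()
print(respondent_1[:3], respondent_2[:3])  # different respondents see different orders
```

Sequence effects still operate on each individual, but because every respondent sees a different order, those effects cannot covary across the sample; reliability estimates typically drop as a result, as in the CPI example above.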
Item Analysis
A trade-off exists between reliability and excluding items tap-
ping into a response bias, which improves validity. Decisions that
ignore the impact of response biases are likely to continue, how-
ever, because this trade-off perpetually makes including such items
appealing. Many response biases can be conceptualized by extend-
ing Orne’s (e.g., 1962) insight about experimental research into
survey research. Just as subjects in experiments act differently
because they know they are being experimented on— each exper-
iment has certain demand characteristics—survey respondents
answer (and think) differently because they know they are being
assessed. In survey research, psychometricians call the results of
demand characteristics response biases. Among dozens of known
response biases are consistent responding, deviant responding,
careless responding, agreement bias (discussed above), affect bias,
and social desirability. I will briefly discuss the latter two.
Many items can reflect the respondent’s affect or mood (e.g.,
Clark & Watson, 1995). For example, phrases like I worry that or
I’m concerned that can tap negative affect regardless of what
exactly may be worrying or concerning, and phrases like It’s
wonderful that and It makes me feel good when can capture
variance associated with positive affect. If multiple items reflect
affect, affect-related variance will increase reliability and influence
dimensionality. If so, items that successfully avoid this source of
systematic error will load lower on dimensions, contribute less to
reliability, and therefore be among the first considered for removal
during dimensionality and reliability analyses.
This same dynamic occurs when multiple items reflect social
desirability bias. Social desirability reflects “the tendency to give
answers that make the respondent look good” (Paulhus, 1991, p.
17). It is a major problem in self-report surveys that has concerned
psychometricians for the last 70 years (e.g., Anastasi & Urbina,
1997; Kazdin, 1998; Paulhus, 1991; Robinson et al., 1991).
Theoretically, social desirability obscures truth to greater degrees as
respondents are asked to admit more taboo behaviors (e.g., How
often do you molest children?). Social desirability bias can also
result from prosocial eagerness to help the researcher based on
perceptions of the researcher’s goals.
Various item administration decisions can help address these
biases. For example, to mitigate social desirability bias, scale
builders can ensure respondent anonymity (Paulhus, 1991). Be-
cause it is difficult to look good if one does not know what is being
looked at (Anastasi & Urbina, 1997), scale builders can obscure
their purpose by interspersing scale items among many unrelated
items. Ironically, an excellent way for researchers to measure one
construct is to measure several simultaneously. When examining
extremely sensitive topics like criminality, researchers can use a
technique called randomized response which introduces an ele-
ment of chance (i.e., a coin flip) that assures subjects of plausible
deniability while being statistically correctable at the group level
(Warner, 1965). To counter the impact of affect bias, other tactics
are required, such as ensuring no preceding activity homogenizes
participant mood and crafting items without emotionally laden
words, phrases, or topics (Clark & Watson, 1995; Jackson, 1971).
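Warner's (1965) randomized response correction, mentioned above, is simple arithmetic at the group level. The simulation below is my own sketch (the spinner probability and prevalence are arbitrary): each respondent answers the sensitive statement with probability p and its negation otherwise, so no single "yes" is diagnostic, yet the true prevalence is recoverable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
p = 0.7            # probability the randomizer selects the sensitive statement
true_rate = 0.12   # hypothetical true prevalence of the sensitive behavior

has_trait = rng.random(n) < true_rate
asked_sensitive = rng.random(n) < p
# A respondent says "yes" if shown the sensitive statement and it is true,
# or shown its negation and the statement is false for them.
says_yes = np.where(asked_sensitive, has_trait, ~has_trait)

lam = says_yes.mean()
estimate = (lam - (1 - p)) / (2 * p - 1)  # Warner's (1965) estimator
print(f"estimated prevalence: {estimate:.3f} (true value: {true_rate})")
```

The estimator works because the expected "yes" rate is p·π + (1 − p)(1 − π), which can be solved for the prevalence π once p is known.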
It is often difficult to tell if biases warrant countermeasures or if
countermeasures are working. Fortunately, many individual differ-
ences in response biases are also measurable. Researchers can
administer social desirability scales such as Crowne and Mar-
lowe’s (1960) original scale or one of its newer (and better)
variants (e.g., Zook & Sipps, 1985). Affect is also measurable
(e.g., Watson, Clark, & Tellegen, 1988). Having assessed a re-
sponse bias and analyzed its relationship to the latent variable,
scale builders can rewrite or remove bias-related items or partial
bias-related variance. Paulhus (1991) recommends using dimen-
sionality analyses to produce dimensions associated as closely as
possible with social desirability, thus isolating seemingly ever-
present social desirability variance from remaining dimensions. At
the least, scale builders should examine top-loading items for clues
indicating variance associated with a response bias, which can
often be quite noticeable.
Informed by sophisticated statistical techniques or not, any
judgment to remove items in the analysis phase because of a
response bias relies on subjective interpretation. If several items in
a scale covary with a bias, for example, it may be that the latent
variable itself is correlated with the bias without the bias influenc-
ing responses. The presence and severity of the problem, as well as
the best solution, are up for debate. As a result, unless the bias is
obvious and debilitating, scale builders rarely make aggressive
efforts to remove systematic error caused by response biases. This
is especially true for response biases that remain unmeasurable
(several are noted by Schwarz, 1999). For example, Dawis (1987)
suggests that some individuals are biased against agreeing with
items using formal language because formality feels unfamiliar.
Because there is currently no validated scale to measure bias
against formal language and many other biases, all this unidenti-
fied systematic error exists, right now, in researchers’ data, inflat-
ing reliability and altering dimensions assumed to underlie scales.
Five Suggestions for Managing Trade-Offs (and Seven
“Dirty Tricks” for Exploiting Them)
Having described how validity versus reliability trade-offs per-
vade the scale-building process and the tendency to resolve them
at validity’s expense, this final section turns to the practical ques-
tion of what can be done about it. After noting a few of these
“paradoxes,” Clark and Watson (1995, p. 316) remind the reader
that “the goal of scale construction is to maximize validity rather
than reliability.” Most psychometricians agree that validity is
paramount (e.g., DeVellis, 2003). However, how one navigates
validity versus reliability trade-offs depends largely on whether
one’s goal is to maximize content validity or predictive validity.
This has not been previously recognized.
I began this article noting that psychologists want to study many
phenomena that we cannot see, smell, taste, or touch. The philo-
sophical assumption that makes the latency problem a problem is
that these phenomena are thought to exist before being measured;
latent variables and observable variables are assumed to be onto-
logically alike. If so, the scale builder is committed to a type of
realism that has faith in the existence of unseen things. Within this
paradigm, the initial goal of scale-building is not the construction
of something new but the discovery of something already there.
For discovery-minded scale builders, only by first peeling away
layers of systematic error do reliability indicators such as α elegantly reveal, when high, that something hidden has been found
and, when low, that something that was thought to exist actually
does not. Either result provides insight.
The same cannot be said of construction-minded scale-building,
though it too has its place. What if the scale builder is not
interested in solving the latency problem? For instance, what if
the latent phenomenon has already been discovered? For exam-
ple, when creating any short-form scale, the shape of the latent
variable is presumably already known. Now the aim is mimicry in
the nomological net with fewer items. Put another way, the scale
builder creating a short-form scale constructs a variable with
predictive validity that borrows content validity (i.e., the scale
label) from elsewhere. If so, validity versus reliability trade-offs
can be decided in favor of reliability because items that are not
content valid (i.e., items about other topics or overly narrow
subtopics) are permitted to help as long as predictive validity
improves. For example, when item-total correlations are especially
high, scale builders can rely on single aspects (or even single
items) as proxies for the whole, but only among populations for
which the indicator is validated. In a construction-minded
approach, driving up reliability is also useful because α provides an
upper limit for covariance with other variables (John & Soto,
2007). Increasing reliability is also necessary because, without a
reliable scale, the effort fails utterly. Unlike discovery-minded
efforts, construction-minded efforts that produce unreliable scales
shed little light on the underlying phenomenon. They are as useful
as broken bathroom scales.
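The ceiling John and Soto (2007) describe follows from the classical attenuation inequality, |r_xy| ≤ √(r_xx · r_yy): an observed correlation cannot exceed the geometric mean of the two measures' reliabilities. A minimal sketch in Python illustrates both quantities (the function names and simulated data here are mine, for illustration only, not from any cited source):

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for an (n_respondents, k_items) score matrix (Cronbach, 1951)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the sum score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def correlation_ceiling(rel_x, rel_y):
    """Upper bound on the observed correlation between two measures."""
    return (rel_x * rel_y) ** 0.5

# Simulated four-item scale: one latent trait plus random error.
rng = np.random.default_rng(0)
trait = rng.normal(size=(500, 1))
items = trait + 0.5 * rng.normal(size=(500, 4))

alpha = cronbach_alpha(items)
print(f"alpha = {alpha:.2f}; max observable r with a .80-reliable "
      f"criterion = {correlation_ceiling(alpha, 0.80):.2f}")
```

Note that the ceiling is a bound, not a guarantee: high reliability permits high covariance with other variables but cannot supply it.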
Given the implications of these alternative priorities, my first
suggestion for managing validity versus reliability trade-offs is
that scale builders succinctly declare their overarching measure-
ment goal—to maximize content validity (i.e., discovery), predic-
tive validity (i.e., construction), or something in between—and
how that goal will guide the resolution of validity versus reliability
trade-offs. Consider the following discovery-minded example:
Since our primary goal was to discover latent phenomena, we
maximized reliability only when we saw no potential trade-off with
content validity. This prioritization impacted five decisions, each
of which we explain below.
When one type of validity is not the sole concern, declarations
should describe how competing priorities will be balanced. For
example, Clark and Watson (1995) suggest that once α exceeds
.80, increasing reliability—especially at the expense of content
validity—is no longer desirable. In creating the final
version of the Primals Inventory, for instance, we (Clifton et al., 2019) sought
to prioritize reliability of subscales progressively less as α
exceeded .70, .80, and .90, respectively. Also, an initial goal of
discovery may often switch to construction once the latent phe-
nomenon is sufficiently understood.
Once goal-stating allows different measurement tasks to be
held to appropriately different standards, my second suggestion
is that reviewers hold scale builders to those standards. This
means that, if predictive validity is paramount, even greater
allowances for increasing reliability at the expense of content
validity may be appropriate. But if the goal is discovery,
reviewers must demand that multiple important validity versus
reliability trade-offs be resolved in favor of content validity,
even if these decisions come later in the scale development
process, even if they are based on subjective interpretations of
item content and metrics, and even if they fly in the face of
reliability or dimensionality indicators. Discovery-minded scale
builders must demonstrate actual instances where validity was
improved at the expense of reliability. Goals should also inform
methodology broadly. For example, confirmatory factor analy-
sis—a tool that quantifies the insufficiency of one’s prior ex-
pectations—is probably better suited for construction-minded
than discovery-minded efforts, which is why confirmatory fac-
tor analysis often follows exploratory factor analysis in scale-
building efforts. Reviewers should also check whether item diversity
in the initial item pool left sufficient flexibility in the
correlation matrix for exploratory factor analysis to avoid
becoming the technical term for what was actually a confirmatory,
construction-minded approach. There is nothing “exploratory”
about administering distinct subsets of highly redundant
items and then observing each subset form its own dimension.
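The point can be made numerically. In the toy correlation matrix below (my construction, purely illustrative), two subsets of three near-redundant items guarantee exactly two eigenvalues above 1, so a "two-factor" result is baked in before any participant responds:

```python
import numpy as np

# Hypothetical 6-item correlation matrix: two subsets of near-redundant
# items (within-subset r = .85) that relate only loosely across subsets
# (between-subset r = .30).
R = np.full((6, 6), 0.30)
R[0:3, 0:3] = 0.85
R[3:6, 3:6] = 0.85
np.fill_diagonal(R, 1.0)

# Descending eigenvalues; values above 1 count as "dimensions" by the
# conventional retention rule. Exactly two exceed 1 here: 3.6 and 1.8.
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
print(np.round(eigenvalues, 2))
```

The remaining four eigenvalues (all 0.15) reflect nothing but the within-subset redundancy; the "dimensionality" was a property of the item-writing, not of any latent structure.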
My third suggestion is that each instance in which content
validity plays a determinative role in a trade-off decision, such
as an item-retention decision, must be succinctly explained so it
can be evaluated. Trade-off decisions between reliability and
validity are subjective and require transparency. Currently,
when scale builders face serious content-validity concerns in
the item-analysis stage, the incentive is to hope nobody notices
or mask efforts to preserve content validity by appealing to
some reliability or dimensionality metric that happens to sug-
gest the desired course of action. Underlying rationales will
come out of hiding if they are made welcome. This leads to my
fourth suggestion: Reviewer critiques must (a) value arguments
favoring validity over reliability without dismissing them as
speculative (all are speculative) and (b) reserve truly scathing
critiques for those who treat validity versus reliability trade-offs
in a manner inconsistent with stated goals. Hypocrisy, not
interpretive license, must be the greater sin.
One promising approach to responsibly incorporating content
validity concerns is goal preregistration, a new option at the
European Journal of Psychological Assessment that allows for
outcome-independent article preacceptance (Greiff & Allen, 2018).
All predictive validity criteria could be preregistered alongside
how effect sizes would be interpreted. Standards for interpreting
scoring-related dimensionality and test–retest reliability results
could also be specified. A future article that develops a standard-
ized preregistration methodology for discovery-minded scale-
building—preferably illustrated by a full scale-building process
utilizing it— could fundamentally shape how scales are made.
My fifth and final suggestion is that reviewers should require
scale-building articles to include at least one paragraph discussing
validity versus reliability trade-offs that did not impact measure-
ment decisions but were explored or seriously considered, so
reviewers can determine whether more exploration is necessary.
Many discovery-minded scale builders already run correlational
and creative experimental studies investigating the influence of
specific sources of systematic error. For example, we (Clifton et
al., 2019) conducted a simple randomized experiment that manip-
ulated survey order to examine the degree to which affect influ-
ences Primals Inventory scores or taking the Primals Inventory
influences affect. Such scale builders would merely be asked to
provide another sentence or two describing potential sources of
systematic error that were not investigated.
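Such an order experiment reduces to a two-group comparison. The sketch below is a generic simulation of the design with a made-up 0.2-SD order effect; it is not the Clifton et al. (2019) analysis or data:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400

# Random assignment: 0 = target scale first, 1 = affect measure first.
condition = rng.integers(0, 2, size=n)

# Hypothetical scores with an injected 0.2-SD order effect.
scores = rng.normal(loc=5.0, scale=1.0, size=n) + 0.2 * condition

g0, g1 = scores[condition == 0], scores[condition == 1]
diff = g1.mean() - g0.mean()
# Welch standard error of the mean difference.
se = (g0.var(ddof=1) / g0.size + g1.var(ddof=1) / g1.size) ** 0.5
print(f"order effect = {diff:.2f} (t = {diff / se:.2f})")
```

A nonzero difference here quantifies one specific source of systematic error (context effects) rather than leaving it to inflate communality unnoticed.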
To help reviewers spot potential sources of systematic error, I
have compiled a list of seven common practices that introduce
major sources of systematic error. Because many reviewers cur-
rently do not mind these practices, one can think of them as an
assortment of “dirty tricks” used to exploit validity versus reliability
trade-offs and inflate reliability indicators like α by introducing
systematic error. When these practices are desirable or unavoidable
(e.g., pen-and-paper scales), scale builders can examine the
practice’s effects in prior validation studies.
1. Administering items in unidimensional blocks. To prior-
itize validity, instead intersperse scale items among sev-
eral other measures.
2. Administering items in a fixed order. To prioritize validity,
instead randomize item order differently for each respondent.
3. Administering no (or few) opposite-scored items. To
prioritize validity, instead use a roughly equal number of
opposite-scored items in the initial item pool or admin-
ister more in a second study, maximize internal reliability
only within similarly scored sets of items, or note the
continuum remains unspecified.
4. Using items with similar language. To prioritize validity,
instead include highly diverse items in the initial item
pool or in a second study, check if similarly worded items
dominate dimensions, and maximize internal reliability
only within similarly worded sets of items.
5. Deleting items with diverse means. To prioritize validity,
instead include a range of easy and difficult items in the
initial item pool or in a second study, check for difficulty-related
dimensionality, maximize α only within sets of
items with similar means, and report item means for
retained and deleted items.
6. Deleting all items concerning a less-related aspect of the
latent phenomenon. To prioritize validity, instead maxi-
mize internal reliability only within aspects, explore if
items touching on other aspects are similarly dependent
on aspect-related communality by systematically deleting
all but one aspect-related item and checking internal
reliability, or delete the less-related aspect for validity
reasons and relabel the scale something narrower.
7. Ignoring response biases. To prioritize validity, instead
discuss relevant biases; measure any measurable ones,
including, at minimum, affect and social desirability;
examine top-loading items for patterns that might indi-
cate a response bias; and explore major concerns with
In this article, I sought to reframe unnecessarily mysterious
measurement “paradoxes” as straightforward reliability versus
validity trade-offs, clarify their nature and prevalence, and provide
guidance on managing them. In the first section, I reviewed and
contrasted concepts of reliability and validity to identify the source
of these trade-offs—systematic error that contributes to item or
score communality—and why trade-offs are usually resolved in
favor of reliability. Increasing reliability seems objective and de-
fensible while increasing validity is interpretive and conceptual.
Validity concerns drive early stages of scale creation while reli-
ability considerations are given the last word during item analysis.
Most importantly, altering the degree of systematic error in a scale
has an inverse impact on reliability and validity. This is because
reliability analyses (and dimensionality analyses) are blind to
communality source, whether true signal from latent phenomena or
systematic error from something else.
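This blindness is easy to demonstrate. In the simulation below (the parameters are mine, chosen for illustration), six items share no latent trait at all, yet adding one common systematic-error component, such as an acquiescence tendency, to every item makes the set look reliable:

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for an (n_respondents, k_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    return (k / (k - 1)) * (1.0 - scores.var(axis=0, ddof=1).sum()
                            / scores.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(1)
n, k = 1000, 6

# Six items with no common latent trait: alpha hovers near zero.
pure_noise = rng.normal(size=(n, k))

# The same items plus one shared systematic-error component.
bias = rng.normal(size=(n, 1))    # e.g., a per-person acquiescence tendency
contaminated = pure_noise + bias  # broadcast the bias into every item

print(f"no common trait:         alpha = {cronbach_alpha(pure_noise):.2f}")
print(f"shared systematic error: alpha = {cronbach_alpha(contaminated):.2f}")
```

Alpha rises sharply in the second case even though nothing substantive is being measured; only inspection of item content, not the coefficient itself, can tell the two situations apart.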
Turning from theory to practice, the second section revealed
how nontrivial trade-offs pervade the scale-building process. I
identified trade-offs concerning item content, item construction,
item difficulty, item scoring, item order, and item analysis. These
trade-offs, recognized or not, are typically decided in favor of
reliability even though maximizing validity is the de jure priority.
Indeed, long-standing measurement problems persist in most of
these scale-building contexts because overlooked trade-offs render
steps to increase validity perennially unattractive to scale builders.
In the final section, I made five recommendations for managing
trade-offs. In the spirit of demand-side economics, all are aimed at
consumers of scale-based research, whom I call “reviewers.”
1. Reviewers should require scale builders to declare their
intention to maximize content validity (discovery), pre-
dictive validity (construction), or something in between.
2. Reviewers should require scale builders to resolve trade-
offs in a manner consistent with declared intentions.
3. Reviewers should require transparent and brief explana-
tions of each instance when reliability is traded for in-
creased validity.
4. Reviewers should vet these explanations, reserving
strong rebukes for those resolving trade-offs in goal-
inconsistent ways.
5. Reviewers should require scale builders aiming to maximize
content validity to discuss other sources of systematic
error that did not impact trade-off decisions.
Finally, to help reviewers spot scale-building practices that favor
reliability at validity’s expense, I compiled seven “dirty tricks” scale
builders often use to exploit trade-offs to inflate reliability.
Many scientific disciplines go through seasons in which experts
emphasize cold adherence to metrics, realize something theoretically
important was lost, and make slight adjustments. My suggestions
about how to manage reliability versus validity trade-offs can be
summarized as follows: Can we make a bit more room for content
validity when content validity is the goal? Others have suggested
variations on this theme (Boyle, 1991; Clark & Watson, 1995; Flake
et al., 2017; Hogan & Nicholson, 1988; John & Soto, 2007). My hope
is that we are headed for a minor adjustment phase in which more
latitude is given to both (a) scale builders committed to discovering
latent phenomena and (b) scale builders committed to constructing
increasingly better ways to predict them. The days of a blanket
in-between standard that frustrates efforts on both extremes can end.
Alterman, A. I., McDermott, P. A., Mulvaney, F. D., Cacciola, J. S.,
Rutherford, M. J., Searles, J. S.,...Cnaan, A. (2003). Comparison of
embedded and isolated administrations of the California Psychological
Inventory’s Socialization subscale. Psychological Assessment, 15, 64 –
Anastasi, A., & Urbina, S. (1997). Psychological testing. Upper Saddle
River, NJ: Prentice Hall.
Bock, R. D., Gibbons, R., & Muraki, E. (1988). Full-information item
factor analysis. Applied Psychological Measurement, 12, 261–280.
Boyle, G. J. (1991). Does item homogeneity indicate internal consistency
or item redundancy in Psychometric Scales? Personality and Individual
Differences, 12, 291–294.
Brigham, C. C. (1930). Intelligence tests of immigrant groups. Psycholog-
ical Review, 37, 158 –165.
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in
objective scale development. Psychological Assessment, 7, 309 –319.
Clifton, J. D. W., Baker, J. D., Park, C. L., Yaden, D. B., Clifton, A. B. W.,
Terni, P.,...Seligman, M. E. P. (2019). Primal world beliefs. Psycho-
logical Assessment, 31, 82–99.
Comrey, A. L. (1988). Factor-analytic methods of scale development in
personality and clinical psychology. Journal of Consulting and Clinical
Psychology, 56, 754 –761.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests.
Psychometrika, 16, 297–334.
Crowne, D. P., & Marlowe, D. (1960). A new scale of social desirability
independent of psychopathology. Journal of Consulting Psychology, 24,
349 –354.
Crutzen, R., & Peters, G. Y. (2017). Scale quality: Alpha is an inadequate
estimate and factor-analytic evidence is needed first of all. Health
Psychology Review, 11, 242–247.
Dawis, R. V. (1987). Scale construction. Journal of Counseling Psychol-
ogy, 34, 481– 489.
DeVellis, R. F. (2003). Scale development: Theory and applications (Vol.
26). London, UK: Sage.
DeVellis, R. F. (2006). Classical test theory. Medical Care, 44, S50 –S59.
Dweck, C. S., Chiu, C. Y., & Hong, Y. Y. (1995). Implicit theories and
their role in judgments and reactions: A word from two perspectives.
Psychological Inquiry, 6, 267–285.
Flake, J. K., Pek, J., & Hehman, E. (2017). Construct validation in social
and personality research: Current practice and recommendations. Social
Psychological & Personality Science, 8, 370 –378.
Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). Measurement theory
for the behavioral sciences. San Francisco, CA: WH Freeman.
Goldberg, L. R., & Velicer, W. F. (2006). Principles of exploratory factor
analysis. In S. Strack (Ed.), Differentiating normal and abnormal per-
sonality (pp. 209 –237). New York, NY: Springer Publishing Company.
Gorsuch, R. L. (1983). Factor analysis (2nd ed.). London, UK: Erlbaum.
Greiff, S., & Allen, M. S. (2018). EJPA introduces registered reports as
new submission format. European Journal of Psychological Assessment,
34, 217–219.
Gulliksen, H. (1945). The relation of item difficulty and inter-item
correlation to test variance and reliability. Psychometrika, 10, 79–91.
Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item response theory
and health outcomes measurement in the 21st century. Medical Care, 38,
II28 –II42.
Hicks, L. E. (1970). Some properties of ipsative, normative and forced-choice
normative measures. Psychological Bulletin, 74, 167–184.
Hogan, R., & Nicholson, R. A. (1988). The meaning of personality test
scores. American Psychologist, 43, 621– 626.
Jackson, D. N. (1971). The dynamics of structured personality tests.
Psychological Review, 78, 229.
John, O. P., & Soto, C. J. (2007). The importance of being valid: Reliability
and the process of construct validation. In R. W. Robins, R. C. Fraley,
& R. F. Krueger (Eds.), Handbook of research methods in personality
psychology (pp. 461– 494). New York, NY: Guilford Press.
Kazdin, A. E. (1998). Research design in clinical psychology (3rd ed.).
Needham Heights, MA: Allyn & Bacon.
Kline, P. (1986). A handbook of test construction: Introduction to psycho-
metric design. New York, NY: Methuen Press.
Knowles, E. S. (1988). Item context effects on personality scales: Mea-
suring changes the measure. Journal of Personality and Social Psychol-
ogy, 55, 312–320.
Loevinger, J. (1948). The technic of homogeneous tests compared with
some aspects of scale analysis and factor analysis. Psychological Bul-
letin, 45, 507–529.
Loevinger, J. (1954). The attenuation paradox in test theory. Psychological
Bulletin, 51, 493–504.
McCrae, R. R., Kurtz, J. E., Yamagata, S., & Terracciano, A. (2011).
Internal consistency, retest reliability, and their implications for person-
ality scale validity. Personality and Social Psychology Review, 15,
28 –50.
McDonald, R. P. (2011). Test theory: A unified treatment. New York, NY:
McNeish, D. (2018). Thanks coefficient alpha, we’ll take it from here.
Psychological Methods, 23, 412– 433.
McPherson, J., & Mohr, P. (2005). The role of item extremity in the
emergence of keying-related factors: An exploration with the life orien-
tation test. Psychological Methods, 10, 120 –131.
Orne, M. T. (1962). On the social psychology of the psychological exper-
iment: With particular reference to demand characteristics and their
implications. American Psychologist, 17, 776 –783.
Paulhus, D. L. (1991). The measurement and control of response bias. In J. P.
Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of personality
and social psychological attitudes (pp. 17–60). San Diego, CA:
Academic Press.
Peterson, R. A. (1994). A meta-analysis of Cronbach’s coefficient alpha.
The Journal of Consumer Research, 21, 381–391.
Ponterotto, J. G., & Ruckdeschel, D. E. (2007). An overview of coefficient
alpha and a reliability matrix for estimating adequacy of internal con-
sistency coefficients with psychological research measures. Perceptual
and Motor Skills, 105, 997–1014.
Robinson, J. P., Shaver, P. R., & Wrightsman, L. S. (1991). Criteria for
scale selection and evaluation. In J. P. Robinson, P. R. Shaver, & L. S.
Wrightsman (Eds.), Measures of personality and social psychological
attitudes (pp. 1–16). San Diego, CA: Academic Press.
Santos, J. R. A. (1999). Cronbach’s alpha: A tool for assessing the
reliability of scales. Journal of Extension, 37, 1–5.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological
Assessment, 8, 350 –353.
Schmitt, N., & Stuits, D. M. (1985). Factors defined by negatively keyed
items: The result of careless respondents? Applied Psychological Mea-
surement, 9, 367–373.
Schriesheim, C. A., & Eisenbach, R. J. (1995). An exploratory and con-
firmatory factor-analytic investigation of item wording effects on the
obtained factor structures of survey questionnaire measures. Journal of
Management, 21, 1177–1193.
Schwarz, N. (1999). Self-reports: How the questions shape the answers.
American Psychologist, 54, 93–105.
Schwarz, N., Strack, F., & Mai, H. P. (1991). Assimilation and contrast effects
in part-whole question sequences: A conversational logic analysis. Public
Opinion Quarterly, 55, 3–23.
Seligman, M. E. P., & Csikszentmihalyi, M. (2000). Positive psychology.
An introduction. American Psychologist, 55, 5–14.
Tavakol, M., & Dennick, R. (2011). Making sense of Cronbach’s alpha.
International Journal of Medical Education, 2, 53–55.
Tay, L., & Jebb, A. T. (2018). Establishing construct continua in construct
validation: The process of continuum specification. Advances in Methods
and Practices in Psychological Science, 1, 375–388.
Thorndike, R. L. (1982). Applied psychometrics. Boston, MA: Houghton
Mifflin Company.
Warner, S. L. (1965). Randomized response: A survey technique for
eliminating evasive answer bias. Journal of the American Statistical
Association, 60, 63– 66.
Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and vali-
dation of brief measures of positive and negative affect: The PANAS
scales. Journal of Personality and Social Psychology, 54, 1063–1070.
Weinberger, A. H., Darkes, J., Del Boca, F. K., Greenbaum, P. E., &
Goldman, M. S. (2006). Items as context: Effects of item order and
ambiguity on factor structure. Basic and Applied Social Psychology, 28,
Williams, R. H., & Zimmerman, D. W. (1982). Reconsideration of the
“attenuation paradox”—and some new paradoxes in test validity. Jour-
nal of Experimental Education, 50, 164 –171.
Woods, C. M. (2006). Careless responding to reverse-worded items: Im-
plications for confirmatory factor analysis. Journal of Psychopathology
and Behavioral Assessment, 28, 189 –194.
Zook, A., & Sipps, G. J. (1985). Cross-validation of a short form of the
Marlowe–Crowne Social Desirability Scale. Journal of Clinical Psy-
chology, 41, 236 –238.
Received July 10, 2018
Revision received June 19, 2019
Accepted June 21, 2019
... This phenomenon has been termed the attenuation SCALE CONSTRUCTION OF HETEROGENEOUS CONSTRUCTS 5 paradox (Loevinger, 1954), where an increase in reliability ultimately leads to a less diverse item pool, which is most likely representative for only a narrower content domain of the construct. This item selection procedure results in a measure that only poorly reflects the target construct (Clark & Watson, 1995;Clifton, 2020). Thus, what is well intended in the case of scale construction does not always lead to the desired result. ...
... Although psychological measures should be both reliable and valid, in practice, validity is often neglected in test construction in favor of internal consistency (Clifton, 2020;Flake et al., 2017). Optimizing internal consistency is comparatively easy and straightforward to achieve. ...
... First, if the resulting scale is consistent, but it is not possible to retain coverage of the measured trait, the latent variable is narrower than intended. In this case, it is advisable to reconsider the construct label to highlight the reduced scope of the measurement (Clifton, 2020). Alternatively, researchers could revise and extend the scope of the item pool and start over. ...
Full-text available
Sound scale construction is pivotal to the measurement of psychological constructs. Common item sampling procedures emphasize aspects of reliability to the disadvantage of aspects of validity, which are less tangible. We use a health knowledge test as an example to demonstrate how item sampling strategies that either focus on factor saturation or construct coverage influence scale composition and demonstrate how to find a trade-off between these two opposing needs. More specifically, we compile three 50-item health knowledge scales using Ant Colony Optimization, a metaheuristic algorithm that is inspired by the foraging behavior of ants, to optimize factor saturation, construct coverage, or a compromise of both. We demonstrate that our approach is well suited to balance out construct coverage and factor saturation when constructing a health knowledge test. Finally, we discuss conceptual problems of modeling of declarative knowledge and provide recommendations for the assessment of health knowledge.
... Different measurement tools are used to determine the fear of childbirth in the literature. [9][10][11][12][13][14][15] The most commonly used scales to determine fear in pregnancy and the postpartum period are versions A and B of Wijma's Birth Exper ience /Expe ctati on Scale, 10,11 Prenatal Self Evaluation Questionnaire, 9 and Delivery Fear Scale. 14 Women's fear of childbirth during pregnancy is measured with Wijma Birth Scale version A, and in the early postpartum period, fear of childbirth with version B. 10,11 Other scales are planned for the postpartum period. ...
... Women Information Form: Data such as the age, marital status, income status, educational status, the presence of children, and childbirth information of women of childbearing age were obtained using this form developed by using the literature. [12][13][14][15] Childbirth Fear Scale Draft: The scale is a 5-point Likert-type scale that is evaluated between strongly agree and strongly disagree. The development stages of the scale are included in the procedure section: ...
... The CVR value of 34 items in the study was found to be 0.80 and above. Sixteen items (1,3,7,8,10,11,12,15,16,19,21,24,27,29,30, and 37) were removed from the draft measurement tool since their CVR value was below 0.80. After 16 items were removed from the scale, the content validity index value was obtained to be 0.84. ...
Full-text available
Conclusion: It was determined that the Childbirth Fear Scale was a valid and reliable measurement tool for determining women's (pregnant, with/without babies and children) fears related to childbearing.
... The external nature (we can observe) of BWC footage affords the opportunity to estimate the trustworthiness of BWC footage via reliability statistics and designs [15]. The presence of reliable data, though, does not automatically ensure the measurement validity [16,17], but greater agreement can result in stronger validities [18,19]. Given the intersect of a general under representation reliability index coverage [11,[20][21][22][23] and the strong inferences drawn from BWC footage, the present paper proposes a strategy of integrating a representation of reliability indexes. ...
... Internal consistency measures the ratio of observed scores to the associated true scores, which assesses the error measurement in a scale score. True scores refer to a scale score in a perfect world, or administrating a test under all possible circumstances, obtaining an unbiased mean of all observed scores [16,30]. Through the use of sample standard deviations, a true score is estimated. ...
Full-text available
Strong inferences are drawn from police-worn body camera (BWC) footage, frequently without an assessment of reliability. Unique characteristics of BWC footage (i.e., capturing trends and less frequent behavior, a focal actor (officer) is absent) suggest a specific reliability strategy. A three-step strategy of selecting appropriate reliability indexes, providing salient reliability categories, and ranking the reliability categories was applied to BWC footage. Five interrater agreement ( A C 1 , α ^ K , B − P coefficient, r wg , ad . m ), two interrater consistency ICC 1 , 1 , ICC 2 , 1 , and three internal consistency ( ω t , ω h , GLB ) indexes were applied to police BWC footage. A focus was to ascertain the upper limits of reliability for BWC footage. Item development and rater training were conducted to optimize rating reliability. Using a within design and confidence intervals, the relatively stronger and weaker reliabilities across the six domains of video completeness, respect (passive, active, discourse), threats, and behavioral stance were assessed. Applied to the admissibility of court evidence, central aspects of video completeness have relatively stronger reliabilities. For research, lower reliabilities have a cost of limited generalizability and ecological validity. Policy recommendations include the usage of a standardized scale with multiple ratings to determine what information should be used in high-stake decision-making based on BWC footage. The three-step strategy integrated the reliability indexes into a single figure to reflect a reliability summary of each component of BWC footage. Weighted rankings found the Overall Audio Quality (-4.9) and Empathy (-4.9) items to have the weakest reliabilities and the Clarification (5.1) and Physical Resistance (4.9) items to have the strongest reliabilities.
... From a statistical perspective, sum-scoring ignores reciprocal interactions between symptoms 25 and assumes a common cause, which is not always empirically supported, thus causing bias 28, 29 . Modelling item-level indicators of positive and negative states together may better capture complexity, and afford validity, since sum-score methods can favour unipolar wording for the sake of reliability 30 . This method can be particularly important in exploring sex differences, as males and females report, for example, different depression symptoms 31 . ...
Full-text available
There is growing concern about the role of social media use in the documented increase of adolescent mental health difficulties. However, the current evidence remains complex and inconclusive. While increasing research on this area of work has allowed for significant progress, the impact of social media use within the complex systems of adolescent mental health and development is yet to be examined. The current study addresses this conceptual and methodological oversight by applying a panel network analysis to explore the role of social media on the interacting systems of mental health, wellbeing, and social life of 12,041 UK adolescents. We find that across time, social media is one of the least influential factors of adolescent mental health with other factors (e.g. bullying, lack of family support) deserving greater attention. Our findings suggest that the current depiction of social media use as the culprit of adolescent mental health difficulties is unwarranted and highlight the need for social policy initiatives that focus on the home and school environment to foster resilience.
... CTT (Lord & Novick, 1968;Novick, 1966;Raykov & Marcoulides, 2019) assumes that observed scores consist of true scores and errors (i.e., = + ) and defines reliability as the ratio of the true score variance to the observed score variance (i.e., 2 / 2 ). A counterintuitive aspect of CTT is that the true score covers everything except random errors (Borsboom & Mellenbergh, 2002;Clifton, 2020). For example, a true score on an intelligence quotient test mixes general intelligence, verbal comprehension, arithmetic comprehension, and many more. ...
Full-text available
The current guidelines for estimating reliability recommend using two omega combinations in multidimensional data. One omega is for factor analysis (FA) reliability estimators, and the other omega is for omega hierarchical estimators (i.e., ωh). This study challenges these guidelines. Specifically, the following three questions are asked: (a) Do FA reliability estimators outperform non-FA reliability estimators? (b) Is it always desirable to estimate ωh? (c) What are the best reliability and ωh estimators? This study addresses these issues through a Monte Carlo simulation of reliability and ωh estimators. The conclusions are given as follows. First, the performance differences among most reliability estimators are small, and the performance of FA estimators is comparable to that of non-FA estimators. However, the current, most-recommended estimators, that is, estimators based on the bifactor model and exploratory factor analysis, tend to overestimate reliability. Second, the accuracy of ωh estimators is much lower than that of reliability estimators, so we should perform ωh estimation selectively only on data that meet several requirements. Third, exploratory bifactor analysis is more accurate than confirmatory bifactor analysis only in the presence of cross-loading; otherwise, exploratory bifactor analysis is less accurate than confirmatory bifactor analysis. Fourth, techniques known to improve the Schmid-Leiman (SL) transformation are not superior to SL transformation but have different advantages. This study provides an R Shiny app that allows users to obtain multidimensional reliability and ωh estimates with a few mouse clicks. (PsycInfo Database Record (c) 2022 APA, all rights reserved).
... Items were summed, with higher scores indicating higher levels of distrust. In line with research suggesting that measures of broad concepts (e.g., distrust), as opposed to specific ones (e.g., trust in health systems), yield lower internal consistency (Clifton, 2020), internal consistency for the measure of distrust was low (omega total (ω) = 0.51). ...
Deviant peer affiliation predicts externalizing behavior in adolescence, but no research explores how having negative or suspicious expectations of others (i.e., distrust) may evoke or buffer against the relationship between deviant peer affiliation and externalizing behavior. The current study used data across two timepoints to investigate the impact of deviant peer affiliation and distrust on externalizing behavior 3 years later and whether race/ethnicity moderated this relationship. The sample consisted of 611 adolescents from the Project on Human Development in Chicago Neighborhoods Study (48% male; Mage = 15.5 years, SD = 1.6; 17% White; 34% Black; 49% Hispanic). Higher levels of distrust buffered against the influence of deviant peer affiliation on externalizing behaviors. Further, this buffering was evident in Black compared to White adolescents. Understanding externalizing behavior warrants considering the intersection between the person and their environment.
... Looking instead at the internal linkage strength inside a construct's item set is, strictly speaking, not about validity but about reliability. The distinction between internal linkage strength (i.e., reliability) and external linkage strength (i.e., validity) is fundamental because internal reliability and external validity stand in a partial trade-off relation, as Clifton (2020) demonstrates with full clarity. OA-Section 3 of our original article provides another example, juxtaposing two constructs capturing how people understand democracy: a three-item index of authoritarian notions of democracy (ANDs) and another three-item index tapping liberal notions of democracy (LNDs). ...
Our original 2021 SMR article “Non-Invariance? An Overstated Problem with Misconceived Causes” disputes the conclusiveness of non-invariance diagnostics in diverse cross-cultural settings. Our critique targets the increasingly fashionable use of Multi-Group Confirmatory Factor Analysis (MGCFA), especially in its mainstream version. We document—both by mathematical proof and an empirical illustration—that non-invariance is an arithmetic artifact of group mean disparity on closed-ended scales. Precisely this artifactualness renders standard non-invariance markers inconclusive of measurement inequivalence under group-mean diversity. Using the Emancipative Values Index (EVI), OA-Section 3 of our original article demonstrates that such artifactual non-invariance is inconsequential for multi-item constructs’ cross-cultural performance in nomological terms, that is, explanatory power and predictive quality. Given these limitations of standard non-invariance diagnostics, we challenge the unquestioned authority of invariance tests as a tool of measurement validation. Our critique provoked two teams of authors to launch counter-critiques. We are grateful to the two comments because they give us a welcome opportunity to restate our position in greater clarity. Before addressing the comments one by one, we reformulate our key propositions more succinctly.
The Rosenberg Self-Esteem Scale is the most frequently used measure of self-esteem in the social sciences. These items are often administered with a different number of response options, but it is unclear how the number of response options impacts the psychometric properties of this measure. Across three experiments ( Ns = 739, 2,358, and 1,461), we evaluated how different response options of the Rosenberg influenced (a) coefficient alpha estimates, (b) distributions of scores, and (c) associations with criterion-related variables. Observed coefficient alpha estimates were lowest for a 2-point format compared with response formats with more options. However, supplemental analyses using ordinal alpha pointed to similar estimates across conditions. Using four or more response options better approximated a normal distribution for observed summary scores. We found no consistent evidence that criterion-related correlations increased with more response options. Collectively, these results suggest that the Rosenberg should be administered with at least four response options and we favor a 5-point Likert-type response format.
Five comments below provide strong and interesting perspectives on multi‐item scale use. They define contexts and research areas where developed scales are valuable and where they are vulnerable. Katsikeas and Madan begin by taking a global perspective on scale use, demonstrating how the use and transferability of scales becomes even more problematic as researchers move across languages and cultures. They provide guidance for scale use that is particularly relevant to international marketing and marketing strategy research. Brendl and Calder acknowledge the use of well‐formed scales as measured variables in psychological experiments, both as independent and dependent variables, but critique the use of multi‐item scales to directly reveal latent unobservable constructs. As with any observed variable, scales should be used to test empirical predictions based on theoretical hypotheses about causal connections between theoretical constructs. Lehmann applauds the variability of multi‐item scales and urges the exploration of the impact of various items within a scale. He advocates for flexibility and variation in multi‐item scales related to psychological theories, simple three‐item scales for manipulation checks, and one‐item scales when measuring objective actions or beliefs. Baumgartner and Weijters focus on how to validate multi‐item scales, particularly when used as mediators or moderators where a unique interpretation of the scale is so central. They recommend meta‐analyses of scales that test relationships among measured scales. Like Lehmann, they worry about the impact of exhaustive scales on respondents and the impact of exhausted respondents on the scales themselves. In the final comment, Wang and Huang update our thinking on emerging ways to define and refine scales. 
They discuss ways to identify focal and orbital constructs and suggest Item Response Theory as a way to adapt scales to subsets of items that best contribute to identifying individual differences between respondents. They support confirmatory factor analysis across different studies to assess scale equivalence across different contexts, cultures and languages.
The mass adoption of digital technology continues to generate debate on how it impacts people and society. Associations are regularly observed between media use and a variety of negative outcomes, including depression and anxiety. However, pre-registered studies have failed to replicate these findings. Regardless of direction, many designs rely on self-reported ‘usage’ scales that aim to define and quantify a construct associated with technology engagement. This includes clinical notions of usage, including disorders and addictions. Given their importance for research integrity, we consider what these scales are measuring. Across three studies, we observe that many scales align with a single, identical construct despite claims they capture something unique. We conclude that many technology measures appear to measure a similar, poorly defined construct that often overlaps with pre-existing measures of well-being. Social scientists should critically consider how they proceed both methodologically and conceptually when developing psychometric scales in this domain if research findings are to be drawn together into a coherent body of knowledge.
Beck’s insight—that beliefs about one’s self, future, and environment shape behavior—transformed depression treatment. Yet environment beliefs remain relatively understudied. We introduce a set of environment beliefs— primal world beliefs or primals —that concern the world’s overall character (e.g., the world is interesting, the world is dangerous ). To create a measure, we systematically identified candidate primals (e.g., analyzing tweets, historical texts, etc.); conducted exploratory factor analysis ( N = 930) and two confirmatory factor analyses ( N = 524; N = 529); examined sequence effects ( N = 219) and concurrent validity ( N = 122); and conducted test-retests over 2 weeks ( n = 122), 9 months ( n = 134), and 19 months (n = 398). The resulting 99-item Primals Inventory (PI-99) measures 26 primals with three overarching beliefs— Safe, Enticing , and Alive (mean α = .93)—that typically explain ∼55% of the common variance. These beliefs were normally distributed; stable (2 weeks, 9 months, and 19 month test-retest results averaged .88, .75, and .77, respectively); strongly correlated with many personality and wellbeing variables (e.g., Safe and optimism, r = .61; Enticing and depression, r = −.52; Alive and meaning, r = .54); and explained more variance in life satisfaction, transcendent experience, trust, and gratitude than the BIG 5 (3%, 3%, 6%, and 12% more variance, respectively). In sum, the PI-99 showed strong psychometric characteristics, primals plausibly shape many personality and wellbeing variables, and a broad research effort examining these relationships is warranted.
Empirical studies in psychology commonly report Cronbach's alpha as a measure of internal consistency reliability despite the fact that many methodological studies have shown that Cronbach's alpha is riddled with problems stemming from unrealistic assumptions. In many circumstances, violating these assumptions yields estimates of reliability that are too small, making measures look less reliable than they actually are. Although methodological critiques of Cronbach's alpha are being cited with increasing frequency in empirical studies, in this tutorial we discuss how the trend is not necessarily improving methodology used in the literature. That is, many studies continue to use Cronbach's alpha without regard for its assumptions or merely cite methodological papers advising against its use to rationalize unfavorable Cronbach's alpha estimates. This tutorial first provides evidence that recommendations against Cronbach's alpha have not appreciably changed how empirical studies report reliability. Then, we summarize the drawbacks of Cronbach's alpha conceptually, without relying on mathematical or simulation-based arguments, so that these arguments are accessible to a broad audience. We continue by discussing several alternative measures that make less rigid assumptions and provide justifiably higher estimates of reliability compared to Cronbach's alpha. We conclude with empirical examples to illustrate advantages of alternative measures of reliability, including omega total, Revelle's omega total, the greatest lower bound, and Coefficient H. A detailed software appendix is also provided to help researchers implement alternative methods.
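For readers who want to see exactly what the much-criticized coefficient computes, here is a minimal sketch of Cronbach's alpha from an item-score matrix. The data-generating values are invented for illustration; the omega alternatives discussed in the abstract additionally require estimated factor loadings (e.g., ω = (Σλ)² / ((Σλ)² + Σψ)) and are better obtained from dedicated psychometrics packages:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Toy data: three items driven by one common factor plus unique noise
rng = np.random.default_rng(1)
factor = rng.normal(size=(500, 1))
items = factor + rng.normal(scale=0.8, size=(500, 3))
print(round(cronbach_alpha(items), 2))  # high, since all items share one factor
```

Because this toy data satisfies alpha's essentially tau-equivalent structure (equal loadings), alpha is unbiased here; with unequal loadings it would tend to understate reliability, which is the paper's central complaint.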
Cronbach's alpha is a commonly reported estimate to assess scale quality in health psychology and related disciplines. In this paper we argue that alpha is an inadequate estimate for both validity and reliability - two key elements of scale quality. Omega is a readily available alternative that can be used for both interval and ordinal data. More importantly, we argue that factor-analytic evidence should be presented before assessing the internal structure of a scale. Finally, pointers for readers and reviewers of manuscripts on making judgements about scale quality are provided and illustrated by examples from the field of health psychology.
Many areas of psychological science rely heavily on theoretical constructs, such as personality traits, attitudes, and emotions, and many of these measured constructs are defined by a continuum that represents the different degrees of the attribute. However, these continua are not usually considered by psychologists during the process of scale development and validation. Unfortunately, this can lead to numerous scientific problems, such as incomplete measurement of the construct, difficulties in distinguishing between constructs, and compromised evidence for validity. The purpose of the current article is to propose an approach for carefully considering these issues in psychological measurement. This approach, which we term continuum specification, is a two-stage process in which the researcher defines and then properly operationalizes the target continuum. Defining the continuum involves specifying its polarity (i.e., the meaning of its poles, or ends) and the nature of its gradations (i.e., the quality that separates high from low scores). Operationalizing the continuum means using this definition to develop a measure that (a) sufficiently captures the entire continuum, (b) has appropriate response options, (c) uses correct procedures for assessing dimensionality, and (d) accounts for the underlying response process. These issues have significant implications for psychological measurement.
The verity of results about a psychological construct hinges on the validity of its measurement, making construct validation a fundamental methodology to the scientific process. We reviewed a representative sample of articles published in the Journal of Personality and Social Psychology for construct validity evidence. We report that latent variable measurement, in which responses to items are used to represent a construct, is pervasive in social and personality research. However, the field does not appear to be engaged in best practices for ongoing construct validation. We found that validity evidence of existing and author-developed scales was lacking, with coefficient alpha often being the only psychometric evidence reported. We provide a discussion of why the construct validation framework is important for social and personality researchers and recommendations for improving practice.
Describes a method of item factor analysis based on Thurstone's multiple-factor model and implemented by marginal maximum likelihood estimation and the EM algorithm. Statistical significance of successive factors added to the model was tested by the likelihood ratio criterion. Provisions for effects of guessing on multiple-choice items, and for omitted and not-reached items, are included. Bayes constraints on the factor loadings were found to be necessary to suppress Heywood cases. Applications to simulated and real data are presented to substantiate the accuracy and practical utility of the method. (PsycINFO Database Record (c) 2000 APA, all rights reserved)
Item wording effects were investigated using scenarios depicting a fictitious leader's behavior, 496 respondents, and a questionnaire containing regular, polar opposite, negated polar opposite, and negated regular item versions. Oblique-rotated exploratory factor-analytic results showed clear item wording factors; confirmatory factor-analytic results showed the item formats to yield separate wording factors but that the regular items had substantially more trait variance than did the other item formats. Implications for future research are discussed.
A number of mental-test theorists have called attention to the fact that increasing test reliability beyond an optimal point can actually lead to a decrement in the validity of that test with respect to a criterion. This non-monotonic relation between reliability and validity has been referred to by Loevinger as the “attenuation paradox,” because Spearman’s correction for attenuation leads one to expect that increasing reliability will always increase validity. In this paper a mathematical link between test reliability and test validity is derived which takes into account the correlation between error scores on a test and error scores on a criterion measure the test is designed to predict. It is proved that when the correlation between these two sets of error scores is positive, the non-monotonic relation between test reliability and test validity which has been viewed as a paradox occurs universally.
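Spearman's correction for attenuation, whose expectation the paradox overturns, is a one-line computation; the observed correlation and reliabilities below are hypothetical values for illustration:

```python
import math

def disattenuate(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Spearman's correction for attenuation: the estimated true-score
    correlation, assuming test and criterion error scores are uncorrelated."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical observed validity of .30 with reliabilities .70 and .80
print(round(disattenuate(0.30, 0.70, 0.80), 2))  # → 0.4
```

The correction's key assumption is that the two sets of error scores are uncorrelated; as the abstract notes, when that correlation is positive, raising reliability no longer monotonically raises validity, and the "paradox" occurs universally.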