doi:10.1093/iwc/iwt013
Critical Review of ‘The Usability Metric
for User Experience’
James R. Lewis
IBM Corporation, 8051 Congress Avenue (Suite 2088), Boca Raton, FL 33487, USA
*Corresponding author: jimlewis@us.ibm.com
In 2010, Kraig Finstad published (in this journal) ‘The Usability Metric for User Experience’—the
UMUX. The UMUX is a standardized usability questionnaire designed to produce scores similar
to the System Usability Scale (SUS), but with 4 rather than 10 items. The development of the
questionnaire followed standard psychometric practice. Psychometric evaluation of the final version
of the UMUX indicated acceptable levels of reliability (internal consistency), concurrent validity,
and sensitivity. Critical review of this research suggests that its weakest element was the structural
analysis, which concluded that the UMUX is unidimensional based on insufficient evidence. Mixed-
tone item content and parallel analysis of the eigenvalues point to a possible two-factor structure.
This weakness, however, is of more theoretical than practical importance, given the overall scale’s
apparent reliability, validity, and sensitivity.
Keywords: System Usability Scale; Usability Metric for User Experience; psychometric evaluation;
standardized questionnaire; satisfaction; perceived usability
Special Issue Editors: Gitte Lindgaard and Jurek Kirakowski
1. INTRODUCTION
1.1. Purpose of the review
The Usability Metric for User Experience (UMUX) is a
new addition to the existing set of standardized usability
questionnaires. The purpose of this paper is to provide a critical
review of Finstad’s (2010b), ‘The Usability Metric for User
Experience’. The goal of the review is to identify (1) aspects of
Finstad (2010b) that are consistent with current psychometric
practice and have appeared to produce promising results and
(2) elements of the method, data analyses and conclusions
that may have weaknesses. Throughout this paper, specific
criticisms are numbered and labeled as ‘minor’, ‘moderate’ or
‘major’.
1.2. Summary of Finstad (2010b)
The primary goal of the UMUX was to get a measurement of
perceived usability consistent with the System Usability Scale
(SUS) but using fewer items that more closely conformed to
the ISO (1998) definition of usability (effective, efficient, sat-
isfying). UMUX items vary in tone (odd-numbered items have
a positive tone, even-numbered items have a negative tone) and
have seven scale steps from 1 (strongly disagree) to 7 (strongly
agree). Starting with an initial pool of 12 items, the final
UMUX has four items that include a general question similar to
the single-ease question (SEQ—see Sauro and Dumas, 2009;
Tedesco and Tullis, 2006, e.g., ‘[This system] is easy to use’)
and the best candidate item from each of the item sets associ-
ated with efficiency, effectiveness and satisfaction, where ‘best’
means the item with the highest correlation to the concurrently
collected overall SUS score. Those three items, respec-
tively, for effectiveness, satisfaction and efficiency were as
follows:
(1) [This system’s] capabilities meet my requirements.
(2) Using [this system] is a frustrating experience.
(3) I have to spend too much time correcting things with
[this system].
To validate the UMUX, users of two systems, one with a rep-
utation for poor usability (System 1, n=273) and the other per-
ceived as having good usability (System 2, n=285), completed
the UMUX and the SUS. Using an item-recoding scheme simi-
lar to the SUS (recoding raw item scores to a 0–6 scale where 0 is
poor and 6 is good), a UMUX score can range from 0 to 100 (sum
the four items, divide by 24, then multiply by 100). Consistent
with previous research (Lewis and Sauro, 2009), the reliability
of the SUS was high, with a coefficient alpha of 0.97. The reli-
ability of the UMUX was also high, with coefficient alpha of
0.94. The high correlation between the SUS and UMUX scores
(r=0.96, p<0.001) provided evidence of concurrent valid-
ity. The UMUX scores for the two systems were significantly
different (t(533)=39.04, p<0.01) with System 2 getting
better scores than System 1, providing evidence of sensitivity.
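To make the scoring scheme concrete, the following sketch (in Python, using hypothetical responses) recodes the positive-tone odd-numbered items as response minus 1 and the negative-tone even-numbered items as 7 minus response, then rescales the sum of the recoded items to the 0 to 100 range.

def umux_score(responses):
    """Return a 0-100 UMUX score from four raw item responses (each 1-7).

    Odd-numbered items are positively worded and recoded as (response - 1);
    even-numbered items are negatively worded and recoded as (7 - response).
    The recoded items (each 0-6) are summed, divided by 24, and multiplied by 100.
    """
    if len(responses) != 4 or not all(1 <= r <= 7 for r in responses):
        raise ValueError("expected four responses on a 1-7 scale")
    recoded = [r - 1 if i % 2 == 0 else 7 - r  # indices 0 and 2 are items 1 and 3 (positive tone)
               for i, r in enumerate(responses)]
    return 100 * sum(recoded) / 24

# Hypothetical respondent who agrees with the positive items and disagrees with the negative ones:
print(umux_score([6, 2, 7, 1]))  # (5 + 5 + 6 + 6) / 24 * 100 = approximately 91.67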
2. CRITIQUE
2.1. Motivation for scale development
As part of a corporate effort to create a comprehensive user
experience questionnaire, Finstad (2010b) reported that their
initial strategy was to use the standard 10-item SUS for the
usability module. That initial strategy was abandoned because
10 items seemed like too many (practical motivation), and
the item content of the SUS did not map well onto the three
primary factors of usability (theoretical motivation). Other (less
compelling) reasons given to develop a new set of items rather
than to use the SUS were that non-native English speakers had
trouble understanding the word ‘cumbersome’ and that there
is evidence that 7-point scales are more reliable than 5-point
scales (the standard SUS uses 5-point scales).
Criticism 1 (moderate): By creating a new instrument rather
than using an existing instrument, the ability to compare results
with SUS scores reported in other research or with the different
sets of recently published norms (Bangor et al., 2008, 2009;
Lewis and Sauro, 2009; Sauro, 2011; Sauro and Lewis, 2012)
is potentially lost.
Criticism 2 (minor): The goal of obtaining equivalent
measurement with fewer items is reasonable and consistent
with the psychometric practice of item analysis during scale
development. From a practical perspective, however, how much
time is really saved by asking 4 rather than 10 questions? Is this
savings worth the potential loss mentioned in Criticism 1?
Criticism 3 (minor): Finstad (2006) had already published a
solution to the problem of non-native speakers having trouble
with the word ‘cumbersome’, demonstrating that the word
‘awkward’ was a workable substitute. In fact, the version of
the SUS used as a baseline in this study used ‘the updated,
internationally appropriate SUS (with “cumbersome” clarified
as “awkward”)’ (Finstad, 2010b, p. 324). For these reasons, this
seems like a very weak argument for making the investment in
the development of an entirely new scale.
Criticism 4 (minor): It is well known that increasing the
number of scale steps increases the reliability of single items
(Nunnally, 1978). For scales with multiple items, the number
of scale steps per item is much less important. The decision to
use 7-point scales in the UMUX is not wrong, but the rationale
provided in the paper (tendency for respondents who use the
standard 5-point version of the SUS to interpolate relative to
those using a 7-point version; Finstad, 2010a) should have been
supplemented with citation of at least some of the research on
the relationship between the number of scale steps and scale
reliability and sensitivity. The occasional interpolation in the
standard SUS is a weak argument for the development of an
entirely new scale, and even with citation of the relationship
between number of scale steps and single-item reliability the
argument would not be much stronger given that the SUS
is a multi-item scale. The standard SUS, after combining
item scores, can take 41 values (from 0 to 100 in 2.5 point
increments). The UMUX, using its final scoring scheme, can
take 25 values (from 0 to 100 in 4 1/6 point increments).
Consequently, you would expect the UMUX to be slightly
less reliable than the SUS but not necessarily unreliable—an
expectation consistent with the final reliability estimates.
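A quick way to verify these counts is to enumerate the attainable scores directly; the sketch below assumes integer item responses and the recoding and rescaling schemes described above.

# SUS: ten items recoded to 0-4, so raw sums run 0-40 and are multiplied by 2.5.
sus_values = [2.5 * s for s in range(41)]
# UMUX: four items recoded to 0-6, so raw sums run 0-24 and are multiplied by 100/24.
umux_values = [100 * s / 24 for s in range(25)]

print(len(sus_values), sus_values[1] - sus_values[0])     # 41 possible values in 2.5-point steps
print(len(umux_values), umux_values[1] - umux_values[0])  # 25 possible values in roughly 4.17-point steps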
2.2. Method for item selection
Having made the decision to create a new scale, the method used
to obtain the items to use in the scale seems reasonably adequate.
The decision to include the SEQ was based on previously
reported findings (Sauro and Dumas, 2009; Tedesco and Tullis,
2006). The remaining 3 items were selected from an initial
pool of 12 items distributed evenly across the 3 theoretical
dimensions of usability. The chosen items were those that, in a
pilot study, correlated most highly with a concurrently collected
SUS score.
2.3. Structural analysis
The only structural analysis was an unrotated principal com-
ponent analysis, with reported eigenvalues of 3.37, 0.31, 0.20
and 0.12. Citing Tabachnik and Fidell (1989), Finstad (2010b,
pp. 325–326) interpreted these results as indicative of alignment
along one usability component: ‘Tabachnik and Fidell (1989)
recommend the point where the scree plot line changes direc-
tion as a determinant of the number of components; this plot’s
direction drops off dramatically after the first component. This
is strong evidence for the scale measuring one “usability” com-
ponent. Because no secondary components emerged from this
analysis, no attempts at further extractions or rotations were per-
formed. The SUS provided a similar one-component extraction,
with no additional elements emerging’.
Criticism 5 (major): It is true that a reasonable first step
in structural analysis is to conduct an unrotated principal
component analysis, but that is not where structural analysis
should ever stop. Bangor et al. (2008) used the same approach
in the analysis of their SUS data (an unrotated factor analysis
and inspection of the resulting eigenvalues), and thus missed
the additional structure revealed by a varimax-rotated principal
factor analysis (Lewis and Sauro, 2009) which indicated
bidimensional structure for the SUS (confirmed by Borsci
et al., 2009, using a different statistical procedure applied to
an independent data set).
Note that the mechanics of PCA maximize the assignment
of variance to the first unrotated component, leading to
some controversy regarding its interpretability. A large first
eigenvalue is not evidence for a latent factor structure with
only one factor; rather, it is evidence for an overall usability
construct that might or might not have an additional latent
factor structure. To his credit, Finstad (2010b) did not invoke
the discredited rule-of-thumb used by some practitioners and
computer programs to set the appropriate number of factors
to the number of eigenvalues greater than 1 (for discussion
of why this rule-of-thumb does not work, see Cliff, 1987 and
Coovert and McNelis, 1988). Nonetheless, Finstad should have
also conducted a rotated principal or confirmatory (maximum
likelihood) factor analysis to evaluate the possibility that the
UMUX has two underlying factors.
Why check for two factors? There are two reasons. First,
the items that make up the UMUX have a mixed tone—two
are positive statements and two are negative. Although this is
a common practice in questionnaire design, a body of research
indicates that mixing the tone in this way can create undesirable
structure in a metric in which positive items align with one factor
and negative items align with the other (Barnette, 2000; Davis,
1989; Pilotte and Gable, 1990; Sauro and Lewis, 2011; Schmitt
and Stults, 1985; Schriesheim and Hill, 1981; Stewart and Frye,
2004). The use of mixed tone is not necessarily bad, but from
the cited research it follows that researchers using mixed tone
should check to see if it is affecting the factor structure.
Second, Finstad (2010b) seems to have misunderstood the
method of checking eigenvalues for discontinuities as a way to
estimate the number of underlying factors (slope analysis). Of
course ‘this plot’s direction drops off dramatically after the first
component’—that is how the eigenvalues derived from principal
components analysis work. As much variance as possible goes
to the first component through the assignment of weights; then
weights are derived for an orthogonal component to assign
as much as possible of the residual variance to the second
component, and so on. Consequently, any eigenvalue other than
the first must necessarily be smaller than the preceding
eigenvalues. The basic
approach of discontinuity (slope) analysis is first to calculate the
differences between adjacent eigenvalues and then to see if any
difference is greater than an immediately preceding difference
(Cliff, 1987; Coovert and McNelis, 1988). When this happens, it
is reasonable to retain the same number of factors as the number
of eigenvalues that precede the discontinuity.
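Applying that check to the eigenvalues Finstad reported takes only a few lines; the sketch below computes the adjacent differences and flags any difference that exceeds the one immediately before it.

eigenvalues = [3.37, 0.31, 0.20, 0.12]  # reported in Finstad (2010b)

# Differences between adjacent eigenvalues.
diffs = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]

# A discontinuity is a difference larger than the immediately preceding difference.
discontinuities = [i for i in range(1, len(diffs)) if diffs[i] > diffs[i - 1]]

print(diffs)            # approximately [3.06, 0.11, 0.08]
print(discontinuities)  # empty list: no difference exceeds the one before it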
For the UMUX eigenvalues, there are no discontinuities,
so this method is inconclusive. Coovert and McNelis (1988)
describe an alternative method called parallel analysis which,
given the eigenvalues reported by Finstad as input, suggests the
retention of two factors. Given the sample size available, Finstad
should probably have performed confirmatory (maximum
likelihood) rotated factor analyses for one-, two-, three- and
four-factor solutions to see which would provide the best-fitting
structural model.
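The raw UMUX item responses are not available here, but the kind of comparison recommended above can be sketched with simulated stand-in data, using exploratory maximum-likelihood factor analysis from scikit-learn as a substitute for a proper confirmatory analysis. The data-generating assumptions below are hypothetical and serve only to make the sketch runnable; the mean log-likelihood from each fit (penalized for model complexity in a real analysis) indicates which number of factors fits best.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 558  # approximate combined sample size in Finstad (2010b)

# Hypothetical recoded item scores: a shared 'usability' factor plus a weaker
# tone-related factor that pushes positive- and negative-tone items apart.
usability = rng.normal(size=n)
tone = rng.normal(size=n)
items = np.column_stack([
    usability + 0.3 * tone + rng.normal(scale=0.5, size=n),  # positive-tone item
    usability - 0.3 * tone + rng.normal(scale=0.5, size=n),  # negative-tone item
    usability + 0.3 * tone + rng.normal(scale=0.5, size=n),  # positive-tone item
    usability - 0.3 * tone + rng.normal(scale=0.5, size=n),  # negative-tone item
])

# Fit one- through four-factor maximum-likelihood models and compare fit.
for k in range(1, 5):
    fit = FactorAnalysis(n_components=k, random_state=0).fit(items)
    print(k, round(fit.score(items), 3))  # mean log-likelihood; higher indicates better fit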
2.4. Scale reliability
In psychometrics, reliability is quantified consistency, typically
estimated using coefficient alpha (Nunnally, 1978; Schmitt,
1996; Yu, 2001). Coefficient alpha can range from 0
(no reliability) to 1 (perfect reliability). Measures of individual
aptitude (such as IQ tests or college entrance exams) should have
a minimum reliability of 0.90—preferably a reliability in the
mid-0.90s (DeVellis, 2003; Nunnally, 1978). For other research
or evaluation, measurement reliability should be at least 0.70
(DeVellis, 2003; Landauer, 1997). The reliability of the UMUX
as assessed using coefficient alpha is very high—0.94 in the
survey study.
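For readers who want to compute the statistic themselves, coefficient alpha follows directly from the item and total-score variances; a minimal sketch with hypothetical recoded responses is below.

import numpy as np

def coefficient_alpha(scores):
    """Cronbach's coefficient alpha for a respondents-by-items matrix of scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical recoded UMUX responses (0-6) from five respondents.
responses = [[5, 5, 6, 6],
             [2, 3, 2, 2],
             [6, 6, 6, 5],
             [1, 2, 1, 2],
             [4, 4, 5, 4]]
print(round(coefficient_alpha(responses), 2))  # close to 1 because the items covary strongly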
When coefficient alpha for noncritical questionnaires is very
high (>0.90), DeVellis (2003, p. 97) recommended that ‘the
scale developer should give some thought to the optimal tradeoff
between brevity and reliability’. In other words, when reliability
is very high, it might be worthwhile to make the questionnaire
shorter, and thus easier to complete.
Another potential concern when a questionnaire with a small
number of items has a large coefficient alpha is that the items
are highly correlated because they are essentially the same item
with slightly different wording (Lewis, 2002). Inspection of the
UMUX items suggests that this is not likely to have caused the
high value of coefficient alpha because the wording of the items
is not highly similar.
2.5. Scale validity
The significant correlation between the UMUX and SUS
reported for the survey study of 0.96 (p<0.001) is an indicator
of concurrent validity.
Criticism 6 (minor): I agree with Finstad’s statement of the
limitation of validity assessment in this study: ‘Its scoring has
yet to be compared to objective metrics, such as error rates and
task timings, in a full experiment’ (Finstad, 2010b, pp. 326–
327). However, given (1) the finding that in industrial usability
studies there is generally a significant correlation between
instruments like the SUS and task-based metrics like completion
rates, completion times, and error rates (Sauro and Lewis, 2009)
and (2) the high correlation between the UMUX and SUS
scores, it is very likely (though not yet proved) that the UMUX
will also correlate significantly with concurrently collected
performance metrics when used in a standard industrial usability
study.
2.6. Scale sensitivity
As expected for a scale with high reliability and concurrent
validity, a t-test comparing UMUX ratings for one product with
a reputation for good usability and one with a poorer reputation
was significant (p<0.001). There was very little difference in
the resulting mean UMUX and SUS scores for the two products.
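The comparison itself is a standard independent-samples t-test; a short sketch with simulated (entirely hypothetical) scores for two systems of differing usability illustrates the form of the analysis.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical 0-100 UMUX scores: the reported study used n = 273 (poorer system)
# and n = 285 (better system); the score distributions here are invented for illustration.
system1 = np.clip(rng.normal(45, 20, size=273), 0, 100)
system2 = np.clip(rng.normal(75, 15, size=285), 0, 100)

t_stat, p_value = stats.ttest_ind(system2, system1)
print(round(t_stat, 2), p_value)  # a large t and very small p indicate the scale separates the systems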
3. DISCUSSION
For the important psychometric properties of reliability, validity,
and sensitivity, the UMUX appears to work very well. My main
criticisms of Finstad (2010b) are in the areas of motivation and
structural analysis.
Regarding motivation—do we really need a four-item
instrument that appears to provide the same information as the
standard 10-item SUS? Granted, the UMUX and SUS appear
to correlate very highly and, as far as the limited evidence
indicates, appear to provide scores with similar magnitudes. So,
at the very least, the findings of Finstad (2010b) are of significant
value to practitioners who use (and plan to continue using) the
SUS because they provide strong evidence of concurrent validity
of the SUS with an alternative instrument designed for closer
alignment with the standard ISO definition of usability.
I would expect that usability practitioners who currently
use the SUS will probably continue doing so, even if they
know about the UMUX, because the perceived risk in changing
instruments likely outweighs the perceived benefits. I could
be wrong about this. Time will tell. While I was working
on this review, one of my colleagues contacted me asking
if I knew of a shorter, psychometrically qualified version
of the SUS. I suggested that he obtain a copy of Finstad
(2010b) because the UMUX might fit his practical needs. It
is possible that there is a niche for the UMUX in the current
ecology of usability questionnaires. This is quite likely if
other researchers replicate the findings of Finstad (2010b),
especially with regard to the extent to which UMUX scores
can accurately predict or map onto concurrently collected SUS
scores.
In my opinion, the weakest element in this paper is the
structural analysis. The conclusion that the UMUX is
unidimensional is based on insufficient evidence, and there is
good reason to suspect that it may be bidimensional. On the
other hand, even if those dimensions exist, as long as they are due
to the varying tone of the items they are of little practical interest
given the overall scale’s reliability, validity and sensitivity. It
would have been better, however, to have conducted a more
thorough structural analysis.
4. CONCLUSIONS
The development of the UMUX followed standard psycho-
metric practice. Psychometric evaluation of the final version
of the UMUX indicated acceptable levels of reliability (inter-
nal consistency), concurrent validity, and sensitivity. Criti-
cal review of this research suggests that its weakest element
was the structural analysis, which concluded that the UMUX
is unidimensional based on insufficient evidence; its
mixed-tone item content and parallel analysis of the eigen-
values point to a possible two-factor structure. This weak-
ness, however, is of more theoretical than practical importance.
Given the overall scale’s apparent reliability, validity, and sen-
sitivity, researchers who need a shorter questionnaire than the
SUS but want a SUS-type metric should consider using the
UMUX.
REFERENCES
Bangor, A., Kortum, P.T. and Miller, J.T. (2008) An empirical
evaluation of the System Usability Scale. Int. J. Hum. Comput.
Interact., 24, 574–594.
Bangor, A., Kortum, P.T. and Miller, J.T. (2009) Determining what
individual SUS scores mean: adding an adjective rating scale.
J. Usability Stud., 4, 114–123.
Barnette, J.J. (2000) Effects of stem and Likert response option
reversals on survey internal consistency: if you feel the need, there
is a better alternative to using those negatively worded stems. Educ.
Psychol. Meas., 60, 361–370.
Borsci, S., Federici, S. and Lauriola, M. (2009) On the dimensionality
of the System Usability Scale: a test of alternative measurement
models. Cogn. Process., 10, 193–197.
Cliff, N. (1987) Analyzing Multivariate Data. Harcourt Brace
Jovanovich, San Diego.
Coovert, M.D. and McNelis, K. (1988) Determining the number of
common factors in factor analysis: a review and program. Educ.
Psychol. Meas., 48, 687–693.
Davis, F.D. (1989) Perceived usefulness, perceived ease of use, and user
acceptance of information technology. MIS Q., 13, 319–339.
DeVellis, R.F. (2003) Scale Development: Theory and Applications.
Sage Publications, Thousand Oaks, CA.
Finstad, K. (2006) The System Usability Scale and non-native English
speakers. J. Usability Stud., 1, 185–188.
Finstad, K. (2010a) Response interpolation and scale sensitivity:
evidence against 5-point scales. J. Usability Stud., 5, 104–110.
Finstad, K. (2010b) The usability metric for user experience. Interact.
Comput., 22, 323–327.
ISO 9241-11. (1998) Ergonomic Requirements for Office Work
with Visual Display Terminals (VDTs). Part 11: Guidance on
Usability.
Landauer, T.K. (1997) Behavioral Research Methods in Human–
Computer Interaction. In Helander, M., Landauer, T.K. and Prabhu,
P. (eds), Handbook of Human–Computer Interaction (2nd edn),
pp. 203–227. Elsevier, Amsterdam, Netherlands.
Lewis, J.R. (2002) Psychometric evaluation of the PSSUQ using data
from five years of usability studies. Int. J. Hum. Comput. Interact.,
14, 463–488.
Lewis, J.R. and Sauro, J. (2009) The Factor Structure of the System
Usability Scale. In Kurosu, M. (ed.), Human Centered Design, HCII
2009, pp. 94–103. Springer, Heidelberg, Germany.
Nunnally, J.C. (1978) Psychometric Theory. McGraw-Hill, New York.
Pilotte, W.J. and Gable, R.K. (1990) The impact of positive and
negative item stems on the validity of a computer anxiety scale.
Educ. Psychol. Meas., 50, 603–610.
Sauro, J. (2011) A Practical Guide to the System Usability Scale (SUS):
Background, Benchmarks & Best Practices. Measuring Usability
LLC, Denver, CO.
Sauro, J. and Dumas, J.S. (2009) Comparison of Three One-question,
Post-task Usability Questionnaires. In Proceedings of CHI 2009,
Boston, MA, pp. 1599–1608. ACM, Boston.
Sauro, J. and Lewis, J.R. (2009) Correlations among prototypical
usability metrics: Evidence for the construct of usability. In
Proceedings of CHI 2009, Boston, MA, pp. 1609–1618. ACM,
Boston.
Sauro, J. and Lewis, J.R. (2011) When designing usability
questionnaires, does it hurt to be positive? In Proceedings of CHI
2011, Vancouver, BC, Canada, pp. 2215–2223. ACM, Vancouver,
Canada.
Sauro, J. and Lewis, J.R. (2012) Quantifying the User Experi-
ence: Practical Statistics for User Research. Morgan Kaufmann,
Waltham, MA.
Schmitt, N. (1996) Uses and abuses of coefficient alpha. Psychol.
Assessment, 8, 350–353.
Schmitt, N. and Stults, D.M. (1985) Factors defined by negatively keyed
items: the result of careless respondents? Appl. Psych. Meas., 9,
367–373.
Schriesheim, C.A. and Hill, K.D. (1981) Controlling acquiescence
response bias by item reversals: the effect on questionnaire validity.
Educ. Psychol. Meas., 41, 1101–1114.
Stewart, T.J. and Frye, A.W. (2004) Investigating the use of
negatively-phrased survey items in medical education settings:
Common wisdom or common mistake? Acad. Med., 79(10 Suppl.),
S1–S3.
Tabachnik, B.G. and Fidell, L.S. (1989) Using Multivariate Statistics
(2nd edn). Harper Collins, New York.
Tedesco, D.P. and Tullis, T.S. (2006) A comparison of methods
for eliciting post-task subjective ratings in usability testing.
Paper presented at the Usability Professionals Association Annual
Conference. UPA, Broomfield, CO.
Yu, C.H. (2001) An introduction to computing and interpreting
Cronbach coefficient alpha in SAS. In Proceedings of SUGI 26,
Paper 246-26. Long Beach, CA. SAS Institute, Cary, NC.