© The Author 2013. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved.
For Permissions, please email: journals.permissions@oup.com
doi:10.1093/iwc/iwt013
Critical Review of ‘The Usability Metric for User Experience’
James R. Lewis
IBM Corporation, 8051 Congress Avenue (Suite 2088), Boca Raton, FL 33487, USA
Corresponding author: jimlewis@us.ibm.com
In 2010, Kraig Finstad published (in this journal) ‘The Usability Metric for User Experience’—the
UMUX. The UMUX is a standardized usability questionnaire designed to produce scores similar
to the System Usability Scale (SUS), but with 4 rather than 10 items. The development of the
questionnaire followed standard psychometric practice. Psychometric evaluation of the final version
of the UMUX indicated acceptable levels of reliability (internal consistency), concurrent validity,
and sensitivity. Critical review of this research suggests that its weakest element was the structural
analysis, which concluded that the UMUX is unidimensional based on insufficient evidence. Mixed-
tone item content and parallel analysis of the eigenvalues point to a possible two-factor structure.
This weakness, however, is of more theoretical than practical importance, given the overall scale’s
apparent reliability, validity, and sensitivity.
Keywords: System Usability Scale; Usability Metric for User Experience; psychometric evaluation;
standardized questionnaire; satisfaction; perceived usability
Special Issue Editors: Gitte Lindgaard and Jurek Kirakowski
1. INTRODUCTION
1.1. Purpose of the review
The Usability Metric for User Experience (UMUX) is a
new addition to the existing set of standardized usability
questionnaires. The purpose of this paper is to provide a critical
review of Finstad’s (2010b) ‘The Usability Metric for User
Experience’. The goal of the review is to identify (1) aspects of
Finstad (2010b) that are consistent with current psychometric
practice and have appeared to produce promising results and
(2) elements of the method, data analyses and conclusions
that may have weaknesses. Throughout this paper, specific
criticisms are numbered and labeled as ‘minor’, ‘moderate’ or
‘major’.
1.2. Summary of Finstad (2010b)
The primary goal of the UMUX was to get a measurement of
perceived usability consistent with the System Usability Scale
(SUS) but using fewer items that more closely conformed to
the ISO (1998) definition of usability (effective, efficient, sat-
isfying). UMUX items vary in tone (odd-numbered items have
a positive tone, even-numbered items have a negative tone) and
have seven scale steps from 1 (strongly disagree) to 7 (strongly
agree). Starting with an initial pool of 12 items, the final
UMUX has four items that include a general question similar to
the single-ease question (SEQ—see Sauro and Dumas, 2009;
Tedesco and Tullis, 2006, e.g., ‘[This system] is easy to use’)
and the best candidate item from each of the item sets associ-
ated with efficiency, effectiveness and satisfaction, where ‘best’
means the item with the highest correlation to the concurrently
collected overall SUS score. Those three items, respec-
tively, for effectiveness, satisfaction and efficiency were as
follows:
(1) [This system’s] capabilities meet my requirements.
(2) Using [this system] is a frustrating experience.
(3) I have to spend too much time correcting things with
[this system].
To validate the UMUX, users of two systems, one with a rep-
utation for poor usability (System 1, n=273) and the other per-
ceived as having good usability (System 2, n=285), completed
the UMUX and the SUS. Using an item-recoding scheme simi-
lar to the SUS (recoding raw item scores to a 0–6 scale where 0 is
poor and 6 is good), a UMUX score can range from 0 to 100 (sum
the four items, divide by 24, then multiply by 100). Consistent
with previous research (Lewis and Sauro, 2009), the reliability
of the SUS was high, with a coefficient alpha of 0.97. The reli-
ability of the UMUX was also high, with coefficient alpha of
0.94. The high correlation between the SUS and UMUX scores
(r=0.96, p<0.001) provided evidence of concurrent valid-
ity. The UMUX scores for the two systems were significantly
different (t(533)=39.04, p<0.01) with System 2 getting
better scores than System 1, providing evidence of sensitivity.
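To make the scoring scheme concrete, here is a minimal sketch (my illustration, not code from Finstad, 2010b), assuming raw responses from 1 to 7 with odd-numbered items positive in tone and even-numbered items negative in tone:

```python
def umux_score(responses):
    """Compute a 0-100 UMUX score from four raw 1-7 item responses.

    Positive-tone items (1st and 3rd) are recoded as (raw - 1);
    negative-tone items (2nd and 4th) are recoded as (7 - raw), so
    every recoded item runs from 0 (poor) to 6 (good).
    """
    if len(responses) != 4:
        raise ValueError("The UMUX has exactly four items")
    recoded = [r - 1 if i % 2 == 0 else 7 - r
               for i, r in enumerate(responses)]  # 0-based: indices 0 and 2 are positive-tone
    return sum(recoded) / 24 * 100

print(umux_score([6, 2, 7, 3]))  # -> 83.33..., a fairly favorable score
```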
2. CRITIQUE
2.1. Motivation for scale development
As part of a corporate effort to create a comprehensive user
experience questionnaire, Finstad (2010b) reported that their
initial strategy was to use the standard 10-item SUS for the
usability module. That initial strategy was abandoned because
10 items seemed like too many (practical motivation), and
the item content of the SUS did not map well onto the three
primary factors of usability (theoretical motivation). Other (less
compelling) reasons given to develop a new set of items rather
than to use the SUS were that non-native English speakers had
trouble understanding the word ‘cumbersome’ and that there
is evidence that 7-point scales are more reliable than 5-point
scales (the standard SUS uses 5-point scales).
Criticism 1 (moderate): By creating a new instrument rather
than using an existing instrument, the ability to compare results
with SUS scores reported in other research or with the different
sets of recently published norms (Bangor et al., 2008, 2009;
Lewis and Sauro, 2009; Sauro, 2011; Sauro and Lewis, 2012)
is potentially lost.
Criticism 2 (minor): The goal of obtaining equivalent
measurement with fewer items is reasonable and consistent
with the psychometric practice of item analysis during scale
development. From a practical perspective, however, how much
time is really saved by asking 4 rather than 10 questions? Is this
savings worth the potential loss mentioned in Criticism 1?
Criticism 3 (minor): Finstad (2006) had already published a
solution to the problem of non-native speakers having trouble
with the word ‘cumbersome’, demonstrating that the word
‘awkward’ was a workable substitute. In fact, the version of
the SUS used as a baseline in this study used ‘the updated,
internationally appropriate SUS (with “cumbersome” clarified
as “awkward”)’ (Finstad, 2010b, p. 324). For these reasons, this
seems like a very weak argument for making the investment in
the development of an entirely new scale.
Criticism 4 (minor): It is well known that increasing the
number of scale steps increases the reliability of single items
(Nunnally, 1978). For scales with multiple items, the number
of scale steps per item is much less important. The decision to
use 7-point scales in the UMUX is not wrong, but the rationale
provided in the paper (tendency for respondents who use the
standard 5-point version of the SUS to interpolate relative to
those using a 7-point version; Finstad, 2010a) should have been
supplemented with citation of at least some of the research in
the relationship between the number of scale steps and scale
reliability and sensitivity. The occasional interpolation in the
standard SUS is a weak argument for the development of an
entirely new scale, and even with citation of the relationship
between number of scale steps and single-item reliability the
argument would not be much stronger given that the SUS
is a multi-item scale. The standard SUS, after combining
item scores, can take 41 values (from 0 to 100 in 2.5 point
increments). The UMUX, using its final scoring scheme, can
take 25 values (from 0 to 100 in 4 1/6 point increments).
Consequently, you would expect the UMUX to be slightly
less reliable than the SUS but not necessarily unreliable—an
expectation consistent with the final reliability estimates.
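The counts of attainable score values are easy to verify; the following is a sketch of the arithmetic only, not an analysis from either paper:

```python
# SUS: ten items recoded to 0-4, so item sums run 0..40 and the
# score is the sum multiplied by 2.5.
sus_values = [s * 2.5 for s in range(41)]        # 41 values in 2.5-point steps
# UMUX: four items recoded to 0-6, so item sums run 0..24 and the
# score is the sum divided by 24 and multiplied by 100.
umux_values = [s / 24 * 100 for s in range(25)]  # 25 values in 4 1/6-point steps
print(len(sus_values), len(umux_values))         # -> 41 25
print(umux_values[1] - umux_values[0])           # -> 4.1666... (4 1/6 points)
```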
2.2. Method for item selection
Having made the decision to create a new scale, the method used
to obtain the items to use in the scale seems reasonably adequate.
The decision to include the SEQ was based on previously
reported findings (Sauro and Dumas, 2009; Tedesco and Tullis,
2006). The remaining 3 items were selected from an initial
pool of 12 items distributed evenly across the 3 theoretical
dimensions of usability. The chosen items were those that from a
pilot study correlated most highly with a concurrently collected
SUS score.
2.3. Structural analysis
The only structural analysis was an unrotated principal
component analysis, with reported eigenvalues of 3.37, 0.31, 0.20
and 0.12. Citing Tabachnick and Fidell (1989), Finstad (2010b,
pp. 325–326) interpreted these results as indicative of alignment
along one usability component: ‘Tabachnick and Fidell (1989)
recommend the point where the scree plot line changes direction
as a determinant of the number of components; this plot’s
direction drops off dramatically after the first component. This
is strong evidence for the scale measuring one “usability”
component. Because no secondary components emerged from this
analysis, no attempts at further extractions or rotations were
performed. The SUS provided a similar one-component extraction,
with no additional elements emerging’.
Criticism 5 (major): It is true that a reasonable first step
in structural analysis is to conduct an unrotated principal
component analysis, but that is not where structural analysis
should ever stop. Bangor et al. (2008) used the same approach
in the analysis of their SUS data (an unrotated factor analysis
and inspection of the resulting eigenvalues), and thus missed
the additional structure revealed by a varimax-rotated principal
factor analysis (Lewis and Sauro, 2009), which indicated
bidimensional structure for the SUS (confirmed by Borsci
et al., 2009, using a different statistical procedure applied to
an independent data set).
Note that the mechanics of PCA maximize the assignment
of variance to the first unrotated component, leading to
some controversy regarding its interpretability. A large first
eigenvalue is not evidence for a latent factor structure with
only one factor, rather, it is evidence for an overall usability
construct that might or might not have an additional latent
factor structure. To his credit, Finstad (2010b) did not invoke
the discredited rule-of-thumb used by some practitioners and
computer programs to set the appropriate number of factors
to the number of eigenvalues greater than 1 (for discussion
of why this rule-of-thumb does not work, see Cliff, 1987 and
Coovert and McNelis, 1988). Nonetheless, Finstad should have
also conducted a rotated principal or confirmatory (maximum
likelihood) factor analysis to evaluate the possibility that the
UMUX has two underlying factors.
Why check for two factors? There are two reasons. First,
the items that make up the UMUX have a mixed tone—two
are positive statements and two are negative. Although this is
a common practice in questionnaire design, a body of research
indicates that mixing the tone in this way can create undesirable
structure in a metric in which positive items align with one factor
and negative items align with the other (Barnette, 2000; Davis,
1989; Pilotte and Gable, 1990; Sauro and Lewis, 2011; Schmitt
and Stults, 1985; Schriesheim and Hill, 1981; Stewart and Frye,
2004). The use of mixed tone is not necessarily bad, but from
the cited research it follows that researchers using mixed tone
should check to see if it is affecting the factor structure.
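Before turning to the second reason, here is an illustrative sketch of such a check via a two-factor varimax-rotated extraction. The data are simulated purely for illustration (a shared signal plus a tone-specific nuisance signal; real item-level responses would be used in practice), and the rotation argument assumes scikit-learn 0.24 or later:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500
overall = rng.normal(size=n)               # shared 'usability' signal
method_pos = rng.normal(scale=0.7, size=n) # nuisance signal shared by positive-tone items
method_neg = rng.normal(scale=0.7, size=n) # nuisance signal shared by negative-tone items

# Columns alternate positive- and negative-tone items, as in the UMUX
# (negative items assumed already reverse-scored).
items = np.column_stack([
    overall + method_pos + rng.normal(scale=0.5, size=n),
    overall + method_neg + rng.normal(scale=0.5, size=n),
    overall + method_pos + rng.normal(scale=0.5, size=n),
    overall + method_neg + rng.normal(scale=0.5, size=n),
])

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(items)
# Rows = items; if positive- and negative-tone items load on separate
# columns, the mixed tone is shaping the factor structure.
print(np.round(fa.components_.T, 2))
```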
Second, Finstad (2010b) seems to have misunderstood the
method of checking eigenvalues for discontinuities as a way to
estimate the number of underlying factors (slope analysis). Of
course ‘this plot’s direction drops off dramatically after the first
component’—that is how the eigenvalues derived from principal
components analysis work—as much variance as possible goes
to the first component through the assignment of weights, then
weights are derived for an orthogonal component to assign
as much as possible of the residual variance to the second
component, and so on, so any eigenvalue other than the first must
necessarily be smaller than the preceding eigenvalues. The basic
approach of discontinuity (slope) analysis is first to calculate the
differences between adjacent eigenvalues and then to see if any
difference is greater than an immediately preceding difference
(Cliff, 1987; Coovert and McNelis, 1988). When this happens, it
is reasonable to retain the same number of factors as the number
of eigenvalues that precede the discontinuity.
For the UMUX eigenvalues, there are no discontinuities,
so this method is inconclusive. Coovert and McNelis (1988)
describe an alternative method called parallel analysis which,
given the eigenvalues reported by Finstad as input, suggests the
retention of two factors. Given the sample size available, Finstad
should probably have performed confirmatory (maximum
likelihood) rotated factor analyses for one-, two-, three- and
four-factor solutions to see which would provide the best-fitting
structural model.
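The discontinuity check can be run directly on the published eigenvalues; a minimal sketch follows. (The parallel analysis credited to Coovert and McNelis, 1988, involves comparison against eigenvalues derived from random data and is not reproduced here.)

```python
import numpy as np

eigs = np.array([3.37, 0.31, 0.20, 0.12])  # eigenvalues reported by Finstad (2010b)

# Differences between adjacent eigenvalues.
gaps = eigs[:-1] - eigs[1:]                # -> [3.06, 0.11, 0.08]

# A discontinuity exists where a gap exceeds the immediately preceding
# gap; the number of factors retained equals the number of eigenvalues
# preceding the discontinuity.
print(np.any(gaps[1:] > gaps[:-1]))        # -> False: no discontinuity, so inconclusive
```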
2.4. Scale reliability
In psychometrics, reliability is quantified consistency, typically
estimated using coefficient alpha (Nunnally, 1978; Schmitt,
1996; Yu, 2001). Coefficient alpha can range from 0
(no reliability) to 1 (perfect reliability). Measures of individual
aptitude (such as IQ tests or college entrance exams) should have
a minimum reliability of 0.90—preferably a reliability in the
mid-0.90s (DeVellis, 2003; Nunnally, 1978). For other research
or evaluation, measurement reliability should be at least 0.70
(DeVellis, 2003; Landauer, 1997). The reliability of the UMUX
as assessed using coefficient alpha is very high—0.94 in the
survey study.
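For reference, coefficient alpha is straightforward to compute from item-level data. A minimal sketch, assuming a hypothetical array of already-recoded item scores with one row per respondent and one column per item:

```python
import numpy as np

def coefficient_alpha(data):
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of totals)."""
    data = np.asarray(data, dtype=float)
    k = data.shape[1]                         # number of items
    item_vars = data.var(axis=0, ddof=1)      # per-item variances
    total_var = data.sum(axis=1).var(ddof=1)  # variance of summed scale scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```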
When coefficient alpha for noncritical questionnaires is very
high (>0.90), DeVellis (2003, p. 97) recommended that, ‘the
scale developer should give some thought to the optimal tradeoff
between brevity and reliability’. In other words, when reliability
is very high, it might be worthwhile to make the questionnaire
shorter, and thus easier to complete.
Another potential concern when a questionnaire with a small
number of items has a large coefficient alpha is that the items
are highly correlated because they are essentially the same item
with slightly different wording (Lewis, 2002). Inspection of the
UMUX items suggests that this is not likely to have caused the
high value of coefficient alpha because the wording of the items
is not highly similar.
2.5. Scale validity
The significant correlation between the UMUX and SUS
reported for the survey study of 0.96 (p<0.001) is an indicator
of concurrent validity.
Criticism 6 (minor): I agree with Finstad’s statement of the
limitation of validity assessment in this study: ‘Its scoring has
yet to be compared to objective metrics, such as error rates and
task timings, in a full experiment’ (Finstad, 2010b, pp. 326–
327). However, given (1) the finding that in industrial usability
studies there is generally a significant correlation between
instruments like the SUS and task-based metrics like completion
rates, completion times, and error rates (Sauro and Lewis, 2009)
and (2) the high correlation between the UMUX and SUS
scores, it is very likely (though not yet proved) that the UMUX
will also correlate significantly with concurrently collected
performance metrics when used in a standard industrial usability
study.
2.6. Scale sensitivity
As expected for a scale with high reliability and concurrent
validity, a t-test comparing UMUX ratings for one product with
a reputation for good usability and one with a poorer reputation
was significant (p<0.001). There was very little difference in
the resulting mean UMUX and SUS scores for the two products.
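The sensitivity analysis is a standard independent-samples t-test. The sketch below uses placeholder score distributions invented purely for illustration (the per-respondent scores are not published), keeping only the reported sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-respondent UMUX scores; the means and SDs are invented.
system1 = rng.normal(55, 15, size=273).clip(0, 100)  # poorer-usability system
system2 = rng.normal(80, 12, size=285).clip(0, 100)  # better-usability system

t, p = stats.ttest_ind(system2, system1)
print(f"t = {t:.2f}, p = {p:.3g}")
```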
3. DISCUSSION
For the important psychometric properties of reliability, validity,
and sensitivity, the UMUX appears to work very well. My main
criticisms of Finstad (2010b) are in the areas of motivation and
structural analysis.
Regarding motivation—do we really need a four-item
instrument that appears to provide the same information as the
standard 10-item SUS? Granted, the UMUX and SUS appear
to correlate very highly and, as far as the limited evidence
indicates, appear to provide scores with similar magnitudes. So,
at the very least, the findings of Finstad (2010b) are of significant
value to practitioners who use (and plan to continue using) the
SUS because it provides strong evidence of concurrent validity
of the SUS with an alternative instrument designed for closer
alignment with the standard ISO definition of usability.
I would expect that usability practitioners who currently
use the SUS will probably continue doing so, even if they
know about the UMUX, because the perceived risk in changing
instruments likely outweighs the perceived benefits. I could
be wrong about this. Time will tell. While I was working
on this review, one of my colleagues contacted me asking
if I knew of a shorter, psychometrically qualified version
of the SUS. I suggested that he obtain a copy of Finstad
(2010b) because the UMUX might fit his practical needs. It
is possible that there is a niche for the UMUX in the current
ecology of usability questionnaires. This is quite likely if
other researchers replicate the findings of Finstad (2010b),
especially with regard to the extent to which UMUX scores
can accurately predict or map onto concurrently collected SUS
scores.
In my opinion, the weakest element in this paper is the
structural analysis. The conclusion that the UMUX is
unidimensional is based on insufficient evidence, and there is
good reason to suspect that it may be bidimensional. On the
other hand, even if those dimensions exist, as long as they are due
to the varying tone of the items they are of little practical interest
given the overall scale’s reliability, validity and sensitivity. It
would have been better, however, to have conducted a more
thorough structural analysis.
4. CONCLUSIONS
The development of the UMUX followed standard psycho-
metric practice. Psychometric evaluation of the final version
of the UMUX indicated acceptable levels of reliability (inter-
nal consistency), concurrent validity, and sensitivity. Criti-
cal review of this research suggests that its weakest element
was the structural analysis, which concluded that the UMUX
is unidimensional based on insufficient evidence because its
mixed-tone item content and parallel analysis of the eigen-
values point to a possible two-factor structure. This weak-
ness, however, is of more theoretical than practical importance.
Given the overall scale’s apparent reliability, validity, and sen-
sitivity, researchers who need a shorter questionnaire than the
SUS but want a SUS-type metric should consider using the
UMUX.
REFERENCES
Bangor, A., Kortum, P.T. and Miller, J.T. (2008) An empirical
evaluation of the System Usability Scale. Int. J. Hum. Comput.
Interact., 24, 574–594.
Bangor, A., Kortum, P.T. and Miller, J.T. (2009) Determining what
individual SUS scores mean: adding an adjective rating scale. J.
Usability Studies, 4, 114–123.
Barnette, J.J. (2000) Effects of stem and Likert response option
reversals on survey internal consistency: if you feel the need, there
is a better alternative to using those negatively worded stems. Educ.
Psychol. Meas., 60, 361–370.
Borsci, S., Federici, S. and Lauriola, M. (2009) On the dimensionality
of the System Usability Scale: a test of alternative measurement
models. Cogn. Process., 10, 193–197.
Cliff, N. (1987) Analyzing Multivariate Data. Harcourt Brace
Jovanovich, San Diego.
Coovert, M.D. and McNelis, K. (1988) Determining the number of
common factors in factor analysis: a review and program. Educ.
Psychol. Meas., 48, 687–693.
Davis, F.D. (1989) Perceived usefulness, perceived ease of use, and user
acceptance of information technology. MIS Q., 13, 319–340.
DeVellis, R.F. (2003) Scale Development: Theory and Applications.
Sage Publications, Thousand Oaks, CA.
Finstad, K. (2006) The System Usability Scale and non-native English
speakers. J. Usability Stud., 1, 185–188.
Finstad, K. (2010a) Response interpolation and scale sensitivity:
evidence against 5-point scales. J. Usability Stud., 5, 104–110.
Finstad, K. (2010b) The usability metric for user experience. Interact.
Comput., 22, 323–327.
ISO 9241-11. (1998) Ergonomic Requirements for Office Work
with Visual Display Terminals (VDTs). Part 11: Guidance on
Usability.
Landauer, T.K. (1997) Behavioral Research Methods in Human–
Computer Interaction. In Helander, M., Landauer, T.K. and Prabhu,
P. (eds), Handbook of Human–Computer Interaction (2nd edn),
pp. 203–227. Elsevier, Amsterdam, Netherlands.
Lewis, J.R. (2002) Psychometric evaluation of the PSSUQ using data
from five years of usability studies. Int. J. Hum. Comput. Interact.,
14, 463–488.
Lewis, J.R. and Sauro, J. (2009) The Factor Structure of the System
Usability Scale. In Kurosu, M. (ed.), Human Centered Design, HCII
2009, pp. 94–103. Springer, Heidelberg, Germany.
Nunnally, J.C. (1978) Psychometric Theory. McGraw-Hill, New York.
Pilotte, W.J. and Gable, R.K. (1990) The impact of positive and
negative item stems on the validity of a computer anxiety scale.
Educ. Psychol. Meas., 50, 603–610.
Sauro, J. (2011) A Practical Guide to the System Usability Scale (SUS):
Background, Benchmarks & Best Practices. Measuring Usability
LLC, Denver, CO.
Sauro, J. and Dumas, J.S. (2009) Comparison of Three One-question,
Post-task Usability Questionnaires. In Proceedings of CHI 2009,
Boston, MA, pp. 1599–1608. ACM, Boston.
Sauro, J. and Lewis, J.R. (2009) Correlations among prototypical
usability metrics: Evidence for the construct of usability. In
Proceedings of CHI 2009, Boston, MA, pp. 1609–1618. ACM,
Boston.
Sauro, J. and Lewis, J.R. (2011) When designing usability
questionnaires, does it hurt to be positive? In Proceedings of CHI
2011, Vancouver, BC, Canada, pp. 2215–2223. ACM, Vancouver,
Canada.
Sauro, J. and Lewis, J.R. (2012) Quantifying the User Experi-
ence: Practical Statistics for User Research. Morgan Kaufmann,
Waltham, MA.
Schmitt, N. (1996) Uses and abuses of coefficient alpha. Psychol.
Assessment, 8, 350–353.
Schmitt, N. and Stults, D. (1985) Factors defined by negatively keyed
items: the result of careless respondents? Appl. Psych. Meas., 9,
367–373.
Schriesheim, C.A. and Hill, K.D. (1981) Controlling acquiescence
response bias by item reversals: the effect on questionnaire validity.
Educ. Psychol. Meas., 41, 1101–1114.
Stewart, T.J. and Frye, A.W. (2004) Investigating the use of
negatively-phrased survey items in medical education settings:
Common wisdom or common mistake? Acad. Med., 79(10 Suppl.),
S1–S3.
Tabachnick, B.G. and Fidell, L.S. (1989) Using Multivariate Statistics
(2nd edn). Harper Collins, New York.
Tedesco, D.P. and Tullis, T.S. (2006) A comparison of methods
for eliciting post-task subjective ratings in usability testing.
Paper presented at the Usability Professionals Association Annual
Conference. UPA, Broomfield, CO.
Yu, C.H. (2001) An introduction to computing and interpreting
Cronbach coefficient alpha in SAS. In Proceedings of SUGI 26,
Paper 246-26. Long Beach, CA. SAS Institute, Cary, NC.