Article

Measuring User Experience With 3, 5, 7, or 11 Points: Does It Matter?

Authors:
  • MeasuringU

Abstract

Objective: To assess versions of the short-form variant of the Usability Metric for User Experience (UMUX-LITE) questionnaire differing in the number of response options for the items (3, 5, 7, and 11).

Background: The UMUX-LITE is an efficient (two-item) standardized questionnaire that measures perceived usability. A growing body of evidence shows it closely corresponds to one of the most widely used standardized usability questionnaires, the System Usability Scale (SUS), with regard to both correlation and magnitude of concurrently collected means. Although the "standard" version of the UMUX-LITE uses items with seven response options, there is some variance in practice.

Method: Members of a corporate user experience panel (n = 242) completed surveys rating a recent Web site experience with the SUS and UMUX-LITE, also providing ratings of overall experience and likelihood-to-recommend.

Results: Scale reliabilities were acceptable (coefficient α > .70) with the exception of the UMUX-LITE with three response options. All UMUX-LITE correlations with the SUS, overall experience, and likelihood-to-recommend were highly significant. For likelihood-to-recommend, there was a significant difference in the magnitude of correlations, with 11 response options higher than three. Although some statistically significant differences were observed in the correspondence between SUS and UMUX-LITE scores, these did not appear to translate into practically significant differences.

Conclusion: The number of UMUX-LITE response options does not matter much, especially in practice. Because the version with three response options showed some weakness with regard to reliability and correlation with likelihood-to-recommend, practitioners should avoid it.

Application: Unless there is a strong reason to do otherwise, use the "standard" version with seven response options.
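The reliability criterion in the abstract (coefficient α > .70) refers to Cronbach's alpha. As a minimal sketch of how that statistic is computed for a two-item scale like the UMUX-LITE (the function name and the toy ratings below are illustrative, not from the study):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient (Cronbach's) alpha for an (n_respondents x k_items) matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical ratings: 6 respondents x 2 items on a 7-point scale (1-7)
ratings = np.array([
    [6, 7],
    [5, 5],
    [7, 6],
    [3, 4],
    [6, 6],
    [4, 5],
])
alpha = cronbach_alpha(ratings)  # roughly 0.89 for this toy data
```

With only two items, alpha is driven entirely by the inter-item correlation, which is why very short scales such as the three-option UMUX-LITE variant can fall below the .70 threshold.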


... A recent between-subjects study compared differences in the number of response options (3, 5, 7, and 11) of the Usability Metric for User Experience (UMUX-LITE) questionnaire, which was used to assess the perceived usability of an auto insurance Web site [17]. The same questionnaire items were used for this online survey, while the number of response options varied for different groups of participants. ...
... This similarity between the binary and Likert scales in our study might be due to the number of times that participants were allowed to provide a response, and to the fact that responses were allowed during the drive. For example, as a comparison, [17] administered a questionnaire to measure the perceived usability of a website, which was evaluated using a Likert-type scale with a range of response options (3, 5, 7, and 11). Evaluation was provided once, after participants had finished interacting with the website. ...
Conference Paper
Full-text available
This paper compared two different methodologies, used in two driving simulator studies, for real-time evaluation of the comfort imposed by the driving style of different Automated Vehicle (AV) controllers. The first method provided participants with two options for assessing three different AV controllers. Participants rated each controller in terms of whether or not it was comfortable/safe/natural as it navigated a simulated road. The evaluation was either positive (yes) or negative (no), indicated by pressing one of two buttons on a handset. In the second study, an 11-point Likert-type scale (from −5 to +5) was used to evaluate the extent to which a controller's driving style was "comfortable" and/or "natural", separately. Participants provided this evaluation for three different AV controllers. Here, they were instructed to utter a number from the scale at designated points during the drive. To understand which method is better for such evaluations, we compared the data collected from the two studies and investigated the patterns of data obtained for the two methodologies. Results showed that, despite the multiple response options provided by the 11-point scale, a pattern similar to that of the binary method emerged, with more positive responses provided for all controllers. The Likert scale is useful for identifying differences because of its multiple levels of response. However, allowing people to present their ratings as often as they want also makes the binary technique useful for such evaluations.
... For each chatbot, after the interaction, the participants were asked to fill in the BUS-15 [5] and the two items of the UMUX-LITE [28]. The UMUX-LITE was presented to the participants on a 5-point Likert scale instead of the classic 7-point scale, a reduction commonly considered safe [27,35]. The UMUX-LITE items were presented in one of the four languages. ...
Article
Full-text available
The Bot Usability Scale (BUS) is a standardised tool to assess and compare the satisfaction of users after interacting with chatbots, to support the development of usable conversational systems. The English version of the 15-item BUS scale (BUS-15) was the result of an exploratory factor analysis; a confirmatory factor analysis tests the replicability of the initial model and further explores the properties of the scale, aiming to optimise the tool by seeking stability of the original model, a potential reduction of items, and multiple language versions of the scale. BUS-15 and the Usability Metric for User Experience (UMUX-LITE), used here for convergent validity purposes, were translated from English to Spanish, German, and Dutch. A total of 1292 questionnaires were completed in multiple languages; these were collected from 209 participants interacting with an overall pool of 26 chatbots. BUS-15 was acceptably reliable; however, a shorter and more reliable solution with 11 items (BUS-11) emerged from the data. The satisfaction ratings obtained with the translated versions of BUS-11 were not significantly different from the original version in English, suggesting that the BUS-11 could be used in multiple languages. The results also suggested that the age of participants seems to affect the evaluation when using the scale, with older participants rating the chatbots as significantly less satisfactory than younger participants. In line with expectations based on reliability, BUS-11 positively correlates with the UMUX-LITE scale. The new version of the scale (BUS-11) aims to facilitate evaluation with chatbots, and its diffusion could help practitioners compare performance and benchmark chatbots during the product assessment stage. This tool could be a way to harmonise and enable comparability in the field of human and conversational agent interaction.
... The scale points were set at 5, as in the Portuguese version [33], in order to make it simpler and faster for respondents to decide on their answer. Although this may be questionable, since decreasing the number of anchor points from 7 to 5 may cause some loss of reliability and precision, recent research shows that "improvements in psychometric precision were identified past 6 response options" [41] for personality questionnaires and, similarly, that results do not differ much between 3-, 5-, 7-, and 11-point scales for measuring user experience [42]. Rollercoaster experience (in this study). ...
Article
Full-text available
The I-group Presence Questionnaire (IPQ), which is used to evaluate the mediated experience of presence, especially for virtual reality applications, was originally developed in German and has been translated into several other languages. However, there are no psychometric studies for these translations, including the English version, except for the Portuguese and Persian translations. We evaluated the English translation of the IPQ with 36 participants through 12 VR sessions, with an overall of 432 samples. Using a partial least squares based factor analysis approach, the original 14-item set was trimmed to 11 items in order to achieve better psychometric qualities. In addition, a covariance-based confirmatory factor analysis was executed to compare models. Several indices, even the conservative Cronbach's alpha, indicated that the subscales of the 11-item version are reliable, but not those of the 14-item version. Eliminated items did not lead to a decrease in the scales' sensitivity to identify different levels of Realism, Spatial Presence, and Involvement for different virtual environments. Although we provided evidence for removing the identically worded and inversely coded items that were causing measurement error, we suggest that researchers employ the 14 items but report results for both the 14-item and 11-item versions, until the psychometric qualities of the IPQ in English are confirmed with a larger sample of participants.
... Each item of the BUS was presented as a statement to the participants, and they were asked to assess their agreement with each statement on a five-point Likert scale from 1 ('Strongly Disagree') to 5 ('Strongly Agree'). A five-point Likert scale version of the UMUX-LITE was used in line with the recommendations of Sauro [73] and Lewis [51]. ...
Article
Full-text available
Standardised tools to assess a user’s satisfaction with the experience of using chatbots and conversational agents are currently unavailable. This work describes four studies, including a systematic literature review, with an overall sample of 141 participants in the survey (experts and novices), focus group sessions and testing of chatbots to (i) define attributes to assess the quality of interaction with chatbots and (ii) the designing and piloting a new scale to measure satisfaction after the experience with chatbots. Two instruments were developed: (i) A diagnostic tool in the form of a checklist (BOT-Check). This tool is a development of previous works which can be used reliably to check the quality of a chatbots experience in line with commonplace principles. (ii) A 15-item questionnaire (BOT Usability Scale, BUS-15) with estimated reliability between .76 and .87 distributed in five factors. BUS-15 strongly correlates with UMUX-LITE by enabling designers to consider a broader range of aspects usually not considered in satisfaction tools for non-conversational agents, e.g. conversational efficiency and accessibility, quality of the chatbot’s functionality and so on. Despite the convincing psychometric properties, BUS-15 requires further testing and validation. Designers can use it as a tool to assess products, thus building independent databases for future evaluation of its reliability, validity and sensitivity.
... This score has also been associated with a verbal user rating of 'good' (score 72-85) according to other studies [24,25]. However, the primary use of the SUS is to classify the ease of use of a website; it is not a diagnostic tool for identifying areas of improvement [21,23,25,26]. The qualitative comments from the survey (D) corroborate the high SUS score. ...
Article
Full-text available
Background Feedback is essential in a self-regulated learning environment such as medical education. When feedback channels are widely spread, the need arises for a system of integrating this information in a single platform. This article reports on the design and initial testing of a feedback tool for medical students at Charité-Universitätsmedizin, Berlin, a large teaching hospital. Following a needs analysis, we designed and programmed a feedback tool in a user-centered approach. The resulting interface was evaluated prior to release with usability testing and again post release using quantitative/qualitative questionnaires. Results The tool we created is a browser application for use on desktop or mobile devices. Students log in to see a dashboard of “cards” featuring summaries of assessment results, a portal for the documentation of acquired practical skills, and an overview of their progress along their course. Users see their cohort’s average for each format. Learning analytics rank students’ strengths by subject. The interface is characterized by colourful and simple graphics. In its initial form, the tool has been rated positively overall by students. During testing, the high task completion rate (78%) and low overall number of non-critical errors indicated good usability, while the quantitative data (system usability scoring) also indicates high ease of use. The source code for the tool is open-source and can be adapted by other medical faculties. Conclusions The results suggest that the implemented tool LevelUp is well-accepted by students. It therefore holds promise for improved, digitalized integrated feedback about students’ learning progress. Our aim is that LevelUp will help medical students to keep track of their study progress and reflect on their skills. Further development will integrate users’ recommendations for additional features as well as optimizing data flow.
Article
Full-text available
Computer-based learning applications and mobile technology have transformed many aspects of the educational experience over the last decade, producing software aimed at improving learning efficiency and streamlining the presentation of course materials. One such class of software, purpose-created to take advantage of spaced learning and spaced testing principles, is electronic flashcard applications. We provide a perspective on the novel use of the Quizlet flashcard application in a tertiary educational setting. To reduce cognitive load for international graduate dental students taking a pharmacology review course, we implemented Quizlet, which integrates both spaced learning and self-testing, to improve the student learning experience. This study assessed students' perceptions of the Quizlet flashcard system in a student cohort comprised of two consecutive years' classes (n = 51 students in total). Results indicated broad acceptance of Quizlet based on ease of use of the software and ease of study of the material. Our data provide insight into the use of this common software in a professional healthcare tertiary education setting and further demonstrate the successful application of electronic flashcards for a mixed international student cohort. Further research should include an assessment of the impact of flashcards on long-term knowledge retention in this setting.
Chapter
In intra- and production logistics, monotonous work processes reduce employee motivation and concentration. This leads to errors, decreasing willingness to perform, and a decline in productivity. To reduce these deficits, gamification and innovative technologies such as augmented reality (AR) are attracting increasing attention in research and practice. In the context of this research field, three hypotheses are put forward that assume a positive influence of AR-supported gamification on the work experience, on productivity, and on learning speed. To test the hypotheses, the training application TrainAR was gamified to support the learning of a sorting process. The gamification was evaluated by test subjects who conducted three test runs with the gamified application. The test runs each consisted of sorting ten packages, for which the trainees achieved scores depending on their performance. Before and after the experiment they answered questionnaires. The results of these experiments supported the hypotheses or provided valuable information for further research. Furthermore, a motivating effect of the game elements implemented in TrainAR (rank badges, points, feedback, humanity hero, ranking lists, progress bars, time measurement, achievements, and story) was observed among subjects.
Article
Full-text available
In 2018, Noam Tractinsky published a provocative paper entitled, “The usability construct: A dead end?” He argued the following: • Usability is an umbrella concept. • There is a mismatch between the construct of usability and its empirical measurements. • Scientific progress requires unbundling the usability construct and replacing it with well-defined constructs. Tractinsky (2018) offered the Hirsch and Levin (1999) definition of an umbrella construct as a “broad concept or idea used loosely to encompass and account for a set of diverse phenomena” (p. 200), noting that this diversity of phenomena makes it impossible to achieve a goal of unidimensional measurement. Tractinsky’s (2018) paper is undoubtedly a valuable contribution to the literature of usability science. Although I do not agree with its premises or conclusions, I admire its construction and I learned a lot from reading it. I hope that it will lead to additional research that will improve the understanding of the construct of usability, as it has inspired the writing of this essay. Following are the reasons why I disagree with his arguments. First, I question whether usability is truly an umbrella construct—a “broad concept or idea used loosely to encompass and account for a set of diverse phenomena” (Hirsch & Levin, 1999, p. 200), at least in the context of industrial usability testing. Structural analysis of objective and subjective data from industrial usability studies (Sauro & Lewis, 2009) has provided evidence consistent with an underlying construct of usability that can manifest itself through objective and subjective measurement. It seems plausible that when a system intended for human use has been properly designed, then the users of that system will complete tasks successfully and quickly, and will be sufficiently aware of this to experience, at a minimum, satisfaction as a consequence of perceived usability. 
With regard to perceived usability, it now appears that reports of meaningful factor structure in the SUS may have been premature (Lewis & Sauro, 2009), with more recent analysis indicating a nuisance structure due to the mixed positive and negative tone of its items (Lewis & Sauro, 2017). Furthermore, the development of subscales developed using factor analysis does not preclude the calculation of an overall measure of perceived usability. Investigation of correlation and correspondence of three independently developed usability questionnaires (CSUQ, SUS, and UMUX) has provided compelling evidence that they are measuring the same underlying construct (Lewis 2018a, 2018c, 2018d). There is also evidence that this is the same underlying construct assessed by the PEOU component of the TAM (Lewis, 2018b). So, rather than being a dead end, I believe the construct of usability has a bright future both in usability science (theory) and usability engineering (practice), either alone or as a fundamental part of the larger assessment of user experience. Any report of its death is an exaggeration.
Article
Full-text available
There is a large body of work on the topic of the optimal number of response options to use in multipoint items. The takeaways from the literature are not completely consistent, most likely due to variation in measurement contexts (e.g., clinical, market research, psychology) and optimization criteria (e.g., reliability, validity, sensitivity, ease-of-use). There is also considerable research literature on visual analog scales (VAS), which are endpoint-anchored lines on which respondents place a mark to provide a rating. Typically, a VAS is a 10-cm line with the marked position converted to a 101-point scale (0–100). Multipoint rating items are widely employed in user experience (UX) research. The use of the VAS, on the other hand, is relatively rare. It seems possible that the continuous structure of the VAS could offer some measurement advantages. Our objective for this study was to compare psychometric properties of individual items and multi-item questionnaires using 7- and 11-point Likert-type agreement items and the VAS in the context of UX research. Some characteristics (e.g., means and correlations) of the VAS were different from the Likert-style (7- and 11-point items), so the VAS does not appear to be interchangeable with the Likert-style items. There were no differences in the classical psychometric properties of reliability and concurrent validity. Thus, we did not find any particular measurement advantage associated with the use of 7-point, 11-point, or VAS items. With regard to measurement properties, it doesn't seem to matter (but the literature suggests multipoint items are easier to use).
Conference Paper
Full-text available
In this paper we present the UMUX-LITE, a two-item questionnaire based on the Usability Metric for User Experience (UMUX) [6]. The UMUX-LITE items are "This system's capabilities meet my requirements" and "This system is easy to use." Data from two independent surveys demonstrated adequate psychometric quality of the questionnaire. Estimates of reliability were .82 and .83 -- excellent for a two-item instrument. Concurrent validity was also high, with significant correlations with the SUS (.81, .81) and with likelihood-to-recommend (LTR) scores (.74, .73). The scores were sensitive to respondents' frequency-of-use. UMUX-LITE score means were slightly lower than those for the SUS, but were easily adjusted using linear regression to match the SUS scores. Due to its parsimony (two items), reliability, validity, structural basis (usefulness and usability) and, after applying the corrective regression formula, its correspondence to SUS scores, the UMUX-LITE appears to be a promising alternative to the SUS when it is not desirable to use a 10-item instrument.
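The scoring described above can be sketched as follows. The raw rescaling maps the two-item sum onto a 0-100 scale; the regression coefficients (0.65, 22.9) are those reported in the original UMUX-LITE paper for 7-point items, and any reimplementation should verify them against the source (function names are mine):

```python
def umux_lite(item1: int, item2: int, points: int = 7) -> float:
    """Raw UMUX-LITE: rescale the sum of the two items to 0-100.

    Each item ranges 1..points; subtract the minimum (2) and divide
    by the maximum possible range (2 * (points - 1)).
    """
    return (item1 + item2 - 2) / (2 * (points - 1)) * 100

def umux_lite_r(item1: int, item2: int) -> float:
    """Regression-adjusted UMUX-LITEr for 7-point items.

    The coefficients below are those reported by Lewis, Utesch, and
    Maher (2013) to align UMUX-LITE means with concurrently collected
    SUS scores; they may need recalibration for other samples.
    """
    return 0.65 * umux_lite(item1, item2) + 22.9
```

For example, a respondent answering 7 on both items scores 100 on the raw scale but 87.9 after the adjustment, reflecting the compression toward the SUS mean that the regression introduces.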
Article
Full-text available
This study is a part of a research effort to develop the Questionnaire for User Interface Satisfaction (QUIS). Participants, 150 PC user group members, rated familiar software products. Two pairs of software categories were compared: 1) software that was liked and disliked, and 2) a standard command line system (CLS) and a menu driven application (MDA). The reliability of the questionnaire was high, Cronbach's alpha=.94. The overall reaction ratings yielded significantly higher ratings for liked software and MDA over disliked software and a CLS, respectively. Frequent and sophisticated PC users rated MDA more satisfying, powerful and flexible than CLS. Future applications of the QUIS on computers are discussed.
Article
Full-text available
The Software Usability Measurement Inventory is a rigorously tested and proven method of measuring software quality from the end user's point of view.SUMI is a consistent method for assessing the quality of use of a software product or prototype, and can assist with the detection of usability flaws before a product is shipped.It is backed by an extensive reference database embedded in an effective analysis and report generation tool.
Conference Paper
Full-text available
Correlations between prototypical usability metrics from 90 distinct usability tests were strong when measured at the task-level (r between .44 and .60). Using test-level satisfaction ratings instead of task-level ratings attenuated the correlations (r between .16 and .24). The method of aggregating data from a usability test had a significant effect on the magnitude of the resulting correlations. The results of principal components and factor analyses on the prototypical usability metrics provided evidence for an underlying construct of general usability with objective and subjective factors.
Conference Paper
Full-text available
When designing questionnaires there is a tradition of including items with both positive and negative wording to minimize acquiescence and extreme response biases. Two disadvantages of this approach are respondents accidentally agreeing with negative items (mistakes) and researchers forgetting to reverse the scales (miscoding). The original System Usability Scale (SUS) and an all positively worded version were administered in two experiments (n=161 and n=213) across eleven websites. There was no evidence for differences in the response biases between the different versions. A review of 27 SUS datasets found 3 (11%) were miscoded by researchers and 21 out of 158 questionnaires (13%) contained mistakes from users. We found no evidence that the purported advantages of including negative and positive items in usability questionnaires outweigh the disadvantages of mistakes and miscoding. It is recommended that researchers using the standard SUS verify the proper coding of scores and include procedural steps to ensure error-free completion of the SUS by users. Researchers can use the all positive version with confidence because respondents are less likely to make mistakes when responding, researchers are less likely to make errors in coding, and the scores will be similar to the standard SUS.
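The miscoding risk discussed above arises in the reversal step of standard SUS scoring, which the all-positive variant eliminates. A minimal sketch contrasting the two (function names are mine; the scoring rules themselves are the standard published ones):

```python
def sus_score(responses: list[int]) -> float:
    """Standard SUS: 10 items on a 5-point scale (1-5).

    Odd-numbered items are positively worded (contribution = r - 1);
    even-numbered items are negatively worded (contribution = 5 - r).
    The summed contributions (0-40) are multiplied by 2.5 to give 0-100.
    """
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # i is 0-based, so even i = odd item
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

def sus_score_all_positive(responses: list[int]) -> float:
    """All-positive variant (Sauro & Lewis): no reversal step to get wrong."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    return sum(r - 1 for r in responses) * 2.5
```

Forgetting the reversal in the standard version silently inverts half the items, which is exactly the miscoding error the paper found in 11% of the datasets it reviewed.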
Article
Full-text available
Factor analysis of Post Study System Usability Questionnaire (PSSUQ) data from 5 years of usability studies (with a heavy emphasis on speech dictation systems) indicated a 3-factor structure consistent with that initially described 10 years ago: factors for System Usefulness, Information Quality, and Interface Quality. Estimated reliabilities (ranging from .83-.96) were also consistent with earlier estimates. Analyses of variance indicated that variables such as the study, developer, stage of development, type of product, and type of evaluation significantly affected PSSUQ scores. Other variables, such as gender and completeness of responses to the questionnaire, did not. Norms derived from this data correlated strongly with norms derived from the original PSSUQ data. The similarity of psychometric properties between the original and this PSSUQ data, despite the passage of time and differences in the types of systems studied, provide evidence of significant generalizability for the questionnaire, supporting its use by practitioners for measuring participant satisfaction with the usability of tested systems.
Article
Full-text available
The Usability Metric for User Experience (UMUX) is a four-item Likert scale used for the subjective assessment of an application’s perceived usability. It is designed to provide results similar to those obtained with the 10-item System Usability Scale, and is organized around the ISO 9241–11 definition of usability. A pilot version was assembled from candidate items, which was then tested alongside the System Usability Scale during usability testing. It was shown that the two scales correlate well, are reliable, and both align on one underlying usability factor. In addition, the Usability Metric for User Experience is compact enough to serve as a usability module in a broader user experience metric.
Article
Full-text available
Valid measurement scales for predicting user acceptance of computers are in short supply. Most subjective measures used in practice are unvalidated, and their relationship to system usage is unknown. The present research develops and validates new scales for two specific variables, perceived usefulness and perceived ease of use, which are hypothesized to be fundamental determinants of user acceptance. Definitions for these two variables were used to develop scale items that were pretested for content validity and then tested for reliability and construct validity in two studies involving a total of 152 users and four application programs. The measures were refined and streamlined, resulting in two six-item scales with reliabilities of .98 for usefulness and .94 for ease of use. The scales exhibited high convergent, discriminant, and factorial validity. Perceived usefulness was significantly correlated with both self-reported current usage (r=.63, Study 1) and self-predicted future usage (r =.85, Study 2). Perceived ease of use was also significantly correlated with current usage (r=.45, Study 1) and future usage (r=.59, Study 2). In both studies, usefulness had a significantly greater correlation with usage behavior than did ease of use. Regression analyses suggest that perceived ease of use may actually be a causal antecedent to perceived usefulness, as opposed to a parallel, direct determinant of system usage. Implications are drawn for future research on user acceptance.
Article
In response to recent criticism of the usefulness of the construct of usability, we investigated the relationships between measures of perceived usability and the components of a modified version of the Technology Acceptance Model (mTAM): Perceived Usefulness (PU) and Perceived Ease-of-Use (PEU). In three surveys, respondents used the SUS, UMUX-LITE, and mTAM to rate their actual (as opposed to expected) experience with three software products. As expected, the correlations between PEU and other measures of perceived usability tended to be significantly stronger than those with PU. Additional findings support the use of the UMUX-LITE as a compact measure of perceived usability that has a strong relationship to the mTAM and strong correspondence with concurrently collected SUS scores. The main theoretical result of this research was a set of regression analyses providing evidence that the PEU component of the mTAM appears to be another measure of the construct of perceived usability, connecting the TAM to the construct of perceived usability through the mTAM and providing evidence against the claim that the construct of usability is a theoretical dead end.
Article
This research continued previous investigation of the relationships among measures of perceived usability: the System Usability Scale (SUS), three metrics derived from the Usability Metric for User Experience (UMUX), and the Computer System Usability Questionnaire (CSUQ), this time with ratings of four everyday products (Excel, Word, Amazon, and Gmail). SUS ratings of these products were generally consistent with previous reports. Significant differences in SUS means across studies could be due to differences in frequency of use, with implications for using these data as usability benchmarks. Correspondence among the various measures of perceived usability was also consistent with previous research. Considering frequency of use, mean differences ranged from -2.0 to 1.8 (average shift in Sauro-Lewis grade range from -0.6 to 0.8). When SUS scores were above average, the range restriction of the UMUX-LITEr led to relatively large discrepancies with SUS, suggesting it might not always be better than the unadjusted UMUX-LITE.
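The range restriction mentioned in the abstract above follows directly from how the regression-adjusted UMUX-LITEr is computed. A minimal sketch, assuming the standard raw scoring for seven-point items and the regression coefficients (0.65, 22.9) reported by Lewis, Utesch, and Maher (2013); the function names here are illustrative, not from the original paper:

```python
def umux_lite(item1, item2, points=7):
    """Raw UMUX-LITE score on a 0-100 scale from two item responses.

    With k response options, each item contributes (response - 1) out of
    a possible (k - 1); the two items are summed and rescaled to 0-100.
    """
    return (item1 - 1 + item2 - 1) / (2 * (points - 1)) * 100


def umux_liter(item1, item2, points=7):
    """Regression-adjusted UMUX-LITEr, intended to bring scores into
    closer correspondence with concurrently collected SUS scores.

    Coefficients assumed from Lewis, Utesch, and Maher (2013):
    UMUX-LITEr = 0.65 * UMUX-LITE + 22.9
    """
    return 0.65 * umux_lite(item1, item2, points) + 22.9


print(umux_lite(7, 7))   # 100.0
print(umux_liter(7, 7))  # 87.9
```

The adjustment compresses the attainable range to roughly 22.9-87.9, which is why UMUX-LITEr can diverge from SUS when SUS scores are well above average.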
Article
The System Usability Scale (SUS) is the most widely used standardized questionnaire for the assessment of perceived usability. This review of the SUS covers its early history from inception in the 1980s through recent research and its future prospects. From relatively inauspicious beginnings, when its originator described it as a “quick and dirty usability scale,” it has proven to be quick but not “dirty.” It is likely that the SUS will continue to be a popular measurement of perceived usability for the foreseeable future. When researchers and practitioners need a measure of perceived usability, they should strongly consider using the SUS.
Article
The primary purpose of this research was to investigate the relationship between two widely used questionnaires designed to measure perceived usability: the Computer System Usability Questionnaire (CSUQ) and the System Usability Scale (SUS). The correlation between concurrently collected CSUQ and SUS scores was 0.76 (over 50% shared variance). After converting CSUQ scores to a 0–100-point scale (to match the range of the SUS scores), there was a small but statistically significant difference between CSUQ and SUS means. Although this difference (just under 2 scale points out of a possible 100) was statistically significant, it did not appear to be practically significant. Although usability practitioners should be cautious pending additional independent replication, it appears that CSUQ scores, after conversion to a 0–100-point scale, can be interpreted with the Sauro–Lewis curved grading scale. As a secondary research goal, investigation of variations of the Usability Metric for User Experience (UMUX) replicated previous findings that the regression-adjusted version of the UMUX-LITE (UMUX-LITEr) had the closest correspondence with concurrently collected SUS scores. Thus, even though these three standardized questionnaires were independently developed and have different item content and formats, they largely appear to be measuring the same thing, presumably, perceived usability.
Book
You're being asked to quantify your usability improvements with statistics. But even with a background in statistics, practitioners often hesitate to analyze their data because they are unsure which statistical tests to use and have trouble defending the use of small test sample sizes. This book provides a practical guide to solving the common quantitative problems that arise in usability testing. It addresses questions you face every day, such as: Is the current product more usable than our competition? Can we be sure at least 70% of users can complete the task on the first attempt? How long will it take users to purchase products on the website? The book shows you which test to use and provides a foundation in both the statistical theory and best practices for applying it. The authors draw on decades of statistical literature from human factors, industrial engineering, and psychology, as well as their own published research, to provide the best solutions. They offer concrete solutions (Excel formulas, links to their own Web calculators) along with an engaging discussion of the statistical reasons why the tests work and how to communicate the results effectively. *Provides practical guidance on solving usability testing problems with statistics for any project, including those using Six Sigma practices *Shows practitioners which test to use, why the tests work, and best practices in application, along with easy-to-use Excel formulas and Web calculators for analyzing data *Recommends ways for practitioners to communicate results to stakeholders in plain English. © 2012 Jeff Sauro and James R. Lewis. Published by Elsevier Inc. All rights reserved.
Article
The purpose of this research was to investigate various measurements of perceived usability, in particular, to assess (a) whether a regression formula developed previously to bring Usability Metric for User Experience LITE (UMUX-LITE) scores into correspondence with System Usability Scale (SUS) scores would continue to do so accurately with an independent set of data; (b) whether additional items covering concepts such as findability, reliability, responsiveness, perceived use by others, effectiveness, and visual appeal would be redundant with the construct of perceived usability or would align with other potential constructs; and (c) the dimensionality of the SUS as a function of self-reported frequency of use and expertise. Given the broad use of and emerging interpretative norms for the SUS, it was encouraging that the regression equation for the UMUX-LITE worked well with this independent set of data, although there is still a need to investigate its efficacy with a broader set of products and methods. Results from a series of principal components analyses indicated that most of the additional concepts, such as findability, familiarity, efficiency, control, and visual appeal covered the same statistical ground as the other more standard metrics for perceived usability. Two of the other items (Reliable and Responsive) made up a reliable construct named System Quality. None of the structural analyses of the SUS as a function of frequency of use or self-reported expertise produced the expected components, indicating the need for additional research in this area and a need to be cautious when using the Usable and Learnable components described in previous research.
Article
The System Usability Scale (SUS) is an inexpensive, yet effective tool for assessing the usability of a product, including Web sites, cell phones, interactive voice response systems, TV applications, and more. It provides an easy-to-understand score from 0 (negative) to 100 (positive). While a 100-point scale is intuitive in many respects and allows for relative judgments, information describing how the numeric score translates into an absolute judgment of usability is not known. To help answer that question, a seven-point adjective-anchored Likert scale was added as an eleventh question to nearly 1,000 SUS surveys. Results show that the Likert scale scores correlate extremely well with the SUS scores (r=0.822). The addition of the adjective rating scale to the SUS may help practitioners interpret individual SUS scores and aid in explaining the results to non-human factors professionals.
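The 0-100 SUS score described above comes from a fixed recoding of the ten five-point items. A minimal sketch of the standard scoring (odd items contribute response − 1, even items contribute 5 − response, and the sum is multiplied by 2.5):

```python
def sus_score(responses):
    """Compute a SUS score (0-100) from ten item responses on a 1-5 scale.

    Odd-numbered items are positively worded and contribute (response - 1);
    even-numbered items are negatively worded and contribute (5 - response).
    The sum of contributions (0-40) is multiplied by 2.5 to give 0-100.
    """
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for i, r in enumerate(responses, start=1):
        if not 1 <= r <= 5:
            raise ValueError("responses must be on a 1-5 scale")
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5


# Maximally favorable responses (5 on odd items, 1 on even items) score 100.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
print(sus_score([3] * 10))                         # 50.0
```

Note that the all-neutral response pattern yields 50, not the "average" score; interpretation against empirical norms (such as the Sauro-Lewis grading scale) is a separate step.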
Article
This article presents nearly 10 years' worth of System Usability Scale (SUS) data collected on numerous products in all phases of the development lifecycle. The SUS, developed by Brooke (1996), reflected a strong need in the usability community for a tool that could quickly and easily collect a user's subjective rating of a product's usability. The data in this study indicate that the SUS fulfills that need. Results from the analysis of this large number of SUS scores show that the SUS is a highly robust and versatile tool for usability professionals. The article presents these results and discusses their implications, describes nontraditional uses of the SUS, explains a proposed modification to the SUS to provide an adjective rating that correlates with a given score, and provides details of what constitutes an acceptable SUS score.
Article
This paper describes recent research in subjective usability measurement at IBM. The focus of the research was the application of psychometric methods to the development and evaluation of questionnaires that measure user satisfaction with system usability. The primary goals of this paper are to (1) discuss the psychometric characteristics of four IBM questionnaires that measure user satisfaction with computer system usability, and (2) provide the questionnaires, with administration and scoring instructions. Usability practitioners can use these questionnaires with confidence to help them measure users' satisfaction with the usability of computer systems.
James R. Lewis is a senior human-factors engineer at IBM with a research interest in the measurement and assessment of usability and user experiences. He received a PhD in experimental psychology from Florida Atlantic University (1996). He is a member of the User Experience Professionals Association and the Human Factors and Ergonomics Society, is a Certified Human Factors Professional, and is past president of the Association for Voice Interaction Design. Date received: June 24, 2019