International Encyclopedia of Ergonomics and Human Factors, 2006, Second Edition, Volume 3
Edited by Waldemar Karwowski, Boca Raton, FL: CRC Press
Determining Usability Test Sample Size
Carl W. Turner*, James R. Lewis, and Jakob Nielsen
*State Farm Insurance Cos., Bloomington, IL 61791, USA
IBM Corp., Boca Raton, FL 33487, USA
Nielsen Norman Group, Fremont, CA 94539, USA
Virzi (1992), Nielsen and Landauer (1993), and Lewis
(1994) have published influential articles on the topic of
sample size in usability testing. In these articles, the authors
presented a mathematical model of problem discovery rates
in usability testing. Using the problem discovery rate
model, they showed that it was possible to determine the
sample size needed to uncover a given proportion of
problems in an interface during one test. The authors
presented empirical evidence for the models and made
several important claims:
  • Most usability problems are detected with the first three to five subjects.
  • Running additional subjects during the same test is unlikely to reveal new information.
  • Return on investment (ROI) in usability testing is maximized when testing with small groups using an iterative test-and-design methodology.
Nielsen and Landauer (1993) extended Virzi’s (1992)
original findings and reported case studies that supported
their claims for needing only small samples for usability
tests. They and Lewis (1994) identified important
assumptions about the use of the formula for estimating
problem discovery rates. The problem discovery rate model
was recently re-examined by Lewis (2001).
Virzi (1992) published empirical data supporting the use of
the cumulative binomial probability formula to estimate
problem discovery rates. He reported three experiments in
which he measured the rate at which usability experts and
trained student assistants identified problems as a function
of the number of naive participants they observed. Problem
discovery rates were computed for each participant by
dividing the number of problems uncovered during an
individual test session by the total number of unique
problems found during testing. The average likelihood of
problem detection was computed by averaging all
participants’ individual problem discovery rates.
Virzi (1992) used Monte Carlo simulations to
permute participant orders 500 times to obtain the average
problem discovery curves for his data. Across three sets of
data, the average likelihoods of problem detection (p in Equation 1) were 0.32, 0.36, and 0.42. He also had the
observers (Experiment 2) and an independent group of
usability experts (Experiment 3) provide ratings of problem
severity for each problem. Based on the outcomes of these
experiments, Virzi made three claims regarding sample size
for usability studies: (1) Observing four or five participants
allows practitioners to discover 80% of a product’s usability
problems, (2) observing additional participants reveals
fewer and fewer new usability problems, and (3) observers
detect the more severe usability problems with the first few
participants. Based on these data, he claimed that running
tests using small samples in an iterative test-and-design
fashion would identify most usability problems and save
both time and money.
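Virzi's permutation procedure can be sketched in a few lines of Python. The participant-by-problem matrix below is hypothetical, and the function simply averages the cumulative proportion of unique problems found over random orderings of the participants:

```python
import random

# Hypothetical participant-by-problem matrix: rows are participants,
# columns are problems; 1 means the participant hit that problem.
MATRIX = [
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

def mean_discovery_curve(matrix, permutations=500, seed=1):
    """Average proportion of unique problems found after each participant,
    across random permutations of participant order."""
    total_problems = len(matrix[0])
    sums = [0.0] * len(matrix)
    rng = random.Random(seed)
    for _ in range(permutations):
        order = list(matrix)
        rng.shuffle(order)
        seen = set()
        for i, row in enumerate(order):
            seen.update(j for j, hit in enumerate(row) if hit)
            sums[i] += len(seen) / total_problems
    return [s / permutations for s in sums]

curve = mean_discovery_curve(MATRIX)
# The curve rises steeply over the first few participants and then flattens.
```

The 500 permutations mirror the number Virzi used; the matrix itself and all variable names are ours, invented for illustration.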
The cumulative binomial probability formula that underlies these claims is shown in Equation 1:

Proportion of unique problems found = 1 − (1 − p)^n   (1)
where p is the mean problem discovery rate computed
across subjects (or across problems) and n is the number of
subjects. Seeking to quantify the patterns of problem
detection observed in several fairly large-sample studies of
problem discovery (using either heuristic evaluation or user
testing) Nielsen and Landauer (1993) derived the same
formula from a Poisson process model (constant probability
path independent). They found that it provided a good fit to
their problem-discovery data, and provided a basis for
predicting the number of problems existing in an interface
and performing cost-benefit analyses to determine
appropriate sample sizes. Across 11 studies (five user tests
and six heuristic evaluations), they found the average value
of p to be .33 (ranging from .16 to .60, with associated
estimates of p ranging from .12 to .58). Nielsen and
Landauer used lambda rather than p, but the two concepts
are essentially equivalent. In the literature, λ (lambda), L,
and p are commonly used to represent the average
likelihood of problem discovery. Throughout this article,
we will use p.
Equation 2 gives the corresponding number of problems found:

Number of unique problems found = N(1 − (1 − p)^n)   (2)
where p is the problem discovery rate, N is the total number
of problems in the interface, and n is the number of subjects.
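In code, Equations 1 and 2 are one-liners. This sketch (with illustrative values, not data from any particular study) evaluates them at p = .33, the average rate reported by Nielsen and Landauer:

```python
def proportion_found(p, n):
    """Equation 1: expected proportion of unique problems found
    with n participants and discovery rate p."""
    return 1 - (1 - p) ** n

def number_found(total_problems, p, n):
    """Equation 2: expected number of problems found when the
    interface contains total_problems problems in all."""
    return total_problems * proportion_found(p, n)

print(round(proportion_found(0.33, 5), 2))  # 0.86: five participants find ~86%
print(round(number_found(30, 0.33, 5), 1))  # 25.9 of a hypothetical 30 problems
```

The function names are ours; the formulas are exactly Equations 1 and 2.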
The problem discovery rate was approximately .3
when averaged across a large number of independent tests,
but the rate for any given usability test will vary depending
on several factors (Nielsen & Landauer, 1993). These
factors include:
  • Properties of the system and interface, including the size of the application.
  • Stage in the usability lifecycle at which the product is tested, whether early in the design phase or after several iterations of test and redesign.
  • Type and quality of the methodology used to conduct the test.
  • Specific tasks selected.
  • Match between the test and the context of real-world usage.
  • Representativeness of the test participants.
  • Skill of the evaluator.
Research following these lines of investigation led to other,
related claims. Nielsen (1994) applied the formula in
Equation 2 to a study of problem discovery rate for heuristic
evaluations. Eleven usability specialists evaluated a
complex prototype system for telephone company
employees. The evaluators obtained training on the system
and the goals of the evaluation. They then independently
documented usability problems in the user interface based
on published usability heuristics. The average value of p
across 11 evaluators was .29, similar to the rates found
during talk-aloud user testing (Nielsen & Landauer, 1993;
Virzi, 1992).
Lewis (1994) replicated the techniques applied by
Virzi (1992) to data from a usability study of a suite of
office software products. The problem discovery rate for
this study was .16. The results of this investigation clearly
supported Virzi’s second claim (additional participants
reveal fewer and fewer problems), partially supported the
first (observing four or five participants reveals about 80%
of a product’s usability problems as long as the value of p
for a study is in the approximate range of .30 to .40), and
failed to support the third (there was no correlation between
problem severity and likelihood of discovery). Lewis noted
that it is most reasonable to use small-sample problem
discovery studies “if the expected p is high, if the study will
be iterative, and if undiscovered problems will not have
dangerous or expensive outcomes” (1994, p. 377).
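This sample-size reasoning amounts to inverting Equation 1. A minimal helper (our own, not from the original papers) returns the smallest sample size whose expected discovery meets a goal:

```python
import math

def sample_size(p, goal):
    """Smallest n such that 1 - (1 - p)**n >= goal, by inverting
    Equation 1. p: expected problem discovery rate for the study;
    goal: target proportion of problems to discover."""
    return math.ceil(math.log(1 - goal) / math.log(1 - p))

print(sample_size(0.30, 0.80))  # 5 participants when p = .30
print(sample_size(0.16, 0.80))  # 10 participants at the p = .16 Lewis reported
```

Note how sensitive the answer is to p: halving p roughly doubles the required sample, which is why a low-p study undermines the "four or five participants" rule of thumb.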
Recent challenges to the estimation of problem discovery
rates appear to take two general forms. The first questions
the reliability of problem discovery procedures (user testing,
heuristic evaluation, cognitive walkthrough, etc.). If
problem discovery is completely unreliable, then how can
anyone model it? Furthermore, how can one account for the
apparent success of iterative problem-discovery procedures
in increasing the usability of the products against which they
are applied?
The second questions the validity of modeling the
probability of problem discovery with a single value for p.
Other issues are addressed at length in earlier papers (Lewis, 1994; Nielsen, 1993): that claiming high proportions of problem discovery with few participants requires a fairly high value of p, that different task sets lead to different opportunities to discover problems, and that iteration is important.
3.1 Is Usability Problem Discovery Reliable?
Molich et al. (1998) conducted a study in which four
different usability labs evaluated a calendar system and
prepared reports of the usability problems they discovered.
An independent team of usability professionals compared
the reports produced by the four labs. The number of
unique problems identified by each lab ranged from four to
98. Only one usability problem was reported by all four
labs. The teams that conducted the studies noted difficulties in conducting the evaluations, including a lack of testing goals, no access to the product development team, a lack of user profile information, and no design goals for the system.
Kessner et al. (2001) have also reported data that
question the reliability of usability testing. They had six
professional usability teams test an early prototype of a
dialog box. The total number of usability problems was
determined to be 36. None of the problems were identified
by every team, and only two were reported by five teams.
Twenty of the problems were reported by at least two teams.
After comparing their results with those of Molich et al.
(1999), Kessner et al. suggested that more specific and
focused requests by a client should lead to more overlap in
problem discovery.
Hertzum and Jacobsen (2001) have termed the lack of inter-rater reliability among test observers an 'evaluator effect': "multiple evaluators evaluating the same interface with the same usability evaluation method detect markedly different sets of problems" (p. 421). Across a
review of 11 studies, they found the average agreement
between any two evaluators of the same system ranged from
5% to 65%, with no usability evaluation method (cognitive
walkthroughs, heuristic evaluations, or think-aloud user
studies) consistently more effective than another. Their
review, and the studies of Molich et al. (1999) and Kessner
et al. (2001) point out the importance of setting clear test
objectives, running repeatable test procedures, and adopting
clear definitions of usability problems. Given that multiple
evaluators increase the likelihood of problem detection
(Nielsen, 1994), they suggested that one way to reduce the
evaluator effect is to involve multiple evaluators in usability testing.
The results of these studies are in stark contrast to
earlier studies in which usability problem discovery was
reported to be reliable (Lewis, 1996; Marshall, Brendon, &
Prail, 1990). The widespread use of usability problem
discovery methods indicates that practitioners believe they
are reliable. Despite this widespread belief, an important
area of future research will be to reconcile the studies that
have challenged the reliability of problem discovery with
the apparent reality of usability improvement achieved
through iterative application of usability problem discovery
methods. For example, there might be value in exploring
the application of signal detection theory (Swets, Dawes, &
Monahan, 2000) to the detection of usability problems.
3.2 Issues in the Estimation of p
Woolrych and Cockton (2001) challenged the
assumption that a simple estimate of p is sufficient for the
purpose of estimating the sample size required for the
discovery of a specified percentage of usability problems in
an interface. Specifically, they criticized the formula for
failing to take into account individual differences in
problem discoverability and also claimed that the typical
values used for p (around .30) are overly optimistic. They
also pointed out that the circularity in estimating the key
parameter of p from the study for which you want to
estimate the sample size reduces its utility as a planning
tool. Following close examination of data from a previous
study of heuristic evaluation, they found combinations of
five participants which, if they had been the only five
participants studied, would have dramatically changed the
resulting problem lists, both for frequency and severity.
They recommended the development of a formula that replaces a single value for p with a probability density function.
Caulton (2001) claimed that the simple estimate of
p only applies given a strict homogeneity assumption that
all types of users have the same probability of encountering
all usability problems. To address this, Caulton added to the
standard cumulative binomial probability formula a
parameter for the number of heterogeneous groups. He also
introduced and modeled the concept of problems that
heterogeneous groups share and those that are unique to a
particular subgroup. His primary claims were (1) the more
subgroups, the lower will be the expected value of p and (2)
the more distinct the subgroups are, the lower will be the
expected value of p.
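A toy computation (an extreme case invented here for illustration, not Caulton's actual model) makes both claims concrete: if every problem is unique to one of several equally sized subgroups, a problem hit with probability p by its own subgroup is hit with probability p divided by the number of subgroups by a participant drawn from the pooled population:

```python
def pooled_p(p_within, n_groups):
    """Per-problem discovery rate in a pooled sample when each problem
    is unique to one of n_groups equally sized subgroups and is hit
    with probability p_within only by that subgroup's members.
    An extreme, fully heterogeneous case used for illustration."""
    return p_within / n_groups

print(pooled_p(0.5, 2))  # 0.25: two subgroups halve the effective rate
print(pooled_p(0.5, 4))  # 0.125: more subgroups, lower expected p
```

Real studies fall between this extreme and full homogeneity, since some problems are shared across subgroups; Caulton's model parameterizes exactly that mixture.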
Most of the arguments of Woolrych and Cockton
(2001) were either addressed in previous literature or do not
stand up against the empirical findings reported in previous
literature. It is true that estimates of p can vary widely from
study to study. This characteristic of usability testing can be
addressed by estimating p for a study after running two
subjects and adjusting the estimate as the study proceeds
(Lewis, 2001). There are problems with the estimation of p
from the study to which you want to apply it, but recent
research (discussed below) provides a way to overcome
these problems. Of course, it is possible to select different
subsets of participants who experienced problems in a way
that leads to an overestimate of p (or an underestimate of p,
or any value of p that the person selecting the data wishes).
Test administrators should follow accepted practice and select evaluators who represent the range of knowledge and skills found in the population of end users. There is no
compelling evidence that a probability density function
would lead to an advantage over a single value for p,
although there might be value in computing confidence
intervals for single values of p.
Caulton’s (2001) refinement of the model is
consistent with the observation that different user groups
expose different types of usability problems (Nielsen, 1993).
It is good practice to include participants from each significant user group in every test: three or four per group when there are two groups, and three per group when there are more than two. If
there is a concern that different user groups will uncover
different sets of usability problems then the data for each
group can be analyzed separately, and a separate p
computed for each user group. However, Caulton’s claim
that problem discovery estimates are always inflated when
averaged across heterogeneous groups and problems with
different values of p is inconsistent with the empirical data
presented in Lewis (1994). Lewis demonstrated that p is
robust, showing that the mean value of p worked very well
for modeling problem discovery in a set of problems that
had widely varying values of p.
Lewis (2001), responding to an observation by Hertzum and
Jacobsen (2001) that small-sample estimates of p are almost
always inflated, investigated a variety of methods for
adjusting these small-sample estimates to enable accurate
assessment of sample size requirements and true proportions
of discovered problems. Using data from a series of Monte
Carlo studies applied against four published sets of problem
discovery databases, he found that a technique based on
combining information from a normalization procedure and
a discounting method borrowed from statistical language
modeling produced very accurate adjustments for small-
sample estimates of p. The Good-Turing (GT) discounting
procedure reduced, but did not completely eliminate, the
overestimate of problem discovery rates produced by small-
sample p estimates. The GT adjustment, shown in Equation 3, was:

pGT = pest/(1 + E(N1)/N)   (3)
where pest is the initial estimate computed from the raw data of a usability study, E(N1) is the number of usability problems detected by only one user, and N is the total number of unique usability problems detected by all users.
By contrast, the normalization procedure (Norm)
slightly underestimated problem discovery rates. The
equation was:

pnorm = (pest − 1/n)(1 − 1/n)   (4)

where pest is the initial estimate computed from the raw data of a usability study and n is the number of test participants. He concluded that the overestimation of p
from small-sample usability studies is a real problem with
potentially troubling consequences for usability
practitioners, but that it is possible to apply these procedures
(normalization and Good-Turing discounting) to
compensate for the overestimation bias. Applying each
procedure to the initial estimate of p, then averaging the
results, produces a highly accurate estimate of the problem
discovery rate. Equation 5 shows the formula for an adjusted p estimate based on averaging the Good-Turing and normalization adjustments:

padj = 1/2 [pest/(1 + E(N1)/N)] + 1/2 [(pest − 1/n)(1 − 1/n)]   (5)
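The two adjustments and their average can be written directly from Equations 3 through 5; the function and variable names below are ours:

```python
def adjusted_p(p_est, n, singletons, problems):
    """Small-sample adjustment of p combining the Good-Turing and
    normalization procedures (Lewis, 2001).
    p_est: raw estimate from the study data; n: number of participants;
    singletons: problems seen by exactly one participant, E(N1);
    problems: total unique problems found, N."""
    good_turing = p_est / (1 + singletons / problems)  # Equation 3
    normalized = (p_est - 1 / n) * (1 - 1 / n)         # Equation 4
    return (good_turing + normalized) / 2              # Equation 5

# Hypothetical eight-subject test discussed below: p_est = .375,
# two of the four problems were seen by only a single participant.
print(round(adjusted_p(0.375, 8, 2, 4), 2))  # 0.23
```

Note that the Good-Turing term always discounts p_est (singletons suggest undiscovered problems remain), while the normalization term shrinks it toward what a larger sample would have produced.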
“Practitioners can obtain accurate sample size estimates for
problem-discovery goals ranging from 70% to 95% by
making an initial estimate of the required sample size after
running two participants, then adjusting the estimate after
obtaining data from another two (total of four) participants”
(Lewis, 2001, p. 474).
The results of a return-on-investment (ROI) model
for usability studies (Lewis, 1994) indicated that the
magnitude of p affected the point at which the percentage of
problems discovered maximized ROI. For values of p
ranging from .10 to .5, the appropriate problem discovery
goal ranged from .86 to .98, with lower values of p
associated with lower problem discovery goals.
In the example shown in Table 1, a usability test with eight
participants has led to the discovery of four unique usability
problems. The problem discovery rates (p) for individual
participants ranged from 0.0 to .75. The problem discovery
rates for specific problems ranged from .125 to .875. The
average problem discovery rate (averaged either across
problems or participants), pest, was .375. Note that Problems 2 and 4 were detected by only one participant (Participants 2 and 7, respectively). Applying the Good-Turing estimating procedure from Equation 3 gives pGT = .375/(1 + 2/4) = .25.
Table 1. Data from a Hypothetical Usability Test with Eight Subjects, pest = .375

                  Problem Number
Subject       1      2      3      4   Count      p
   1          1      0      1      0     2     0.500
   2          1      0      1      1     3     0.750
   3          1      0      0      0     1     0.250
   4          0      0      0      0     0     0.000
   5          1      0      1      0     2     0.500
   6          1      0      0      0     1     0.250
   7          1      1      0      0     2     0.500
   8          1      0      0      0     1     0.250
Count         7      1      3      1
p         0.875  0.125  0.375  0.125           0.375
Applying normalization as shown in Equation 4 gives pnorm = (.375 − 1/8)(1 − 1/8) = .219. The adjusted problem discovery rate is obtained by averaging the two estimates as shown in Equation 5: padj = (.25 + .219)/2 ≈ .23.
With this adjusted value of p and the known sample size, it is possible to estimate the sample size adequacy of this study using the cumulative binomial probability formula: 1 − (1 − .23)^8 ≈ .88. If the problem discovery goal for this study had been about 90%, then the sample size was very nearly adequate. If the discovery goal had been lower, the sample size would be excessive, and if the discovery goal had been higher, the sample size would be inadequate. The discovery of only four problems (one problem for every two participants) suggests that the discovery of additional problems would be difficult. If four problems constitute roughly 90% of the problems available for discovery given the specifics of this usability study, then 100% of the problems available for discovery should be about 4/.9, or 4.44. In non-numerical terms, there probably aren't a lot of additional problems to extract from this problem discovery space.
As an example of sample size estimation, suppose
you had data from the first four participants and wanted to
estimate the number of participants you’d need to run to
achieve 90% problem discovery. After running the fourth
participant, there were three discovered problems (because
Problem 2 did not occur until Participant 7), as shown in
Table 2. One of those problems (Problem 4) occurred only once.
Table 2. Data from a Hypothetical Usability Test; First Four Subjects, pest = .500

                  Problem Number
Subject       1      3      4   Count      p
   1          1      1      0     2     0.667
   2          1      1      1     3     1.000
   3          1      0      0     1     0.333
   4          0      0      0     0     0.000
Count         3      2      1
p         0.750  0.500  0.250           0.500
Applying the Good-Turing estimating procedure from Equation 3 gives pGT = .5/(1 + 1/3) = .375. Applying normalization as shown in Equation 4 gives pnorm = (.5 − 1/4)(1 − 1/4) = .188. The average of the two estimates is padj = (.375 + .188)/2 ≈ .28.
Given p = .28, the estimated proportion of discovered problems would be 1 − (1 − .28)^4, or .73. Doing the same
computation with n = 7 gives .90, indicating that the
appropriate sample size for the study would be 7. Note that
in the matrix for this hypothetical study, running the eighth
participant did not reveal any new problems.
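The whole worked example can be reproduced in a few lines of Python (names are ours). Rounding to two decimals matches the .90 reported above; with a strict threshold the goal is first met at n = 8 rather than 7:

```python
def proportion_found(p, n):
    # Equation 1: expected proportion of problems discovered
    return 1 - (1 - p) ** n

def adjusted_p(p_est, n, singletons, problems):
    # Average of the Good-Turing (Eq. 3) and normalization (Eq. 4)
    # adjustments, per Equation 5
    good_turing = p_est / (1 + singletons / problems)
    normalized = (p_est - 1 / n) * (1 - 1 / n)
    return (good_turing + normalized) / 2

# First four participants: p_est = .50, one singleton among three problems.
p = round(adjusted_p(0.50, 4, 1, 3), 2)         # 0.28

n = 1
while round(proportion_found(p, n), 2) < 0.90:  # stop at 90% discovery
    n += 1
print(n)  # 7 participants projected for the 90% goal
```

This is the two-stage procedure Lewis recommends: a rough estimate after two participants, refined after four, then projected forward with Equation 1.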
The cumulative binomial probability formula (given
appropriate adjustment of p when estimated from small
samples) provides a quick and robust means of estimating
problem discovery rates (p). This estimate can be used to
estimate usability test sample size requirements (for studies
that are underway) and to evaluate usability test sample size
adequacy (for studies that have already been conducted).
Further research is needed to answer remaining questions
about when usability testing is reliable, valid, and useful.
References

CAULTON, D.A., 2001, Relaxing the homogeneity assumption in usability testing. Behaviour & Information Technology, 20, 1-7.
HERTZUM, M. and JACOBSEN, N.E., 2001, The evaluator effect: A chilling fact about usability evaluation methods. International Journal of Human-Computer Interaction, 13, 421-443.
KESSNER, M., WOOD, J., DILLON, R.F. and WEST, R.L., 2001,
On the reliability of usability testing. In Jacko, J. and Sears,
A., (eds), Conference on Human Factors in Computing
Systems: CHI 2001 Extended Abstracts (Seattle, WA: ACM
Press), pp. 97-98.
LEWIS, J.R., 1994, Sample sizes for usability studies: Additional
considerations. Human Factors, 36, 368-378.
LEWIS, J.R., 1996, Reaping the benefits of modern usability
evaluation: The Simon story. In Salvendy, G. and Ozok, A.,
(eds), Advances in Applied Ergonomics: Proceedings of the
1st International Conference on Applied Ergonomics ICAE
'96 (Istanbul, Turkey: USA Publishing), pp. 752-757.
LEWIS, J.R., 2001, Evaluation of procedures for adjusting problem-discovery rates estimated from small samples. International Journal of Human-Computer Interaction, 13, 445-479.
MARSHALL, C., BRENDAN, M. and PRAIL, A., 1990, Usability of product X: Lessons from a real product. Behaviour & Information Technology, 9, 243-253.
MOLICH, R., BEVAN, N., CURSON, I., BUTLER, S., KINDLUND, E., MILLER, D. and KIRAKOWSKI, J., 1998, Comparative evaluation of usability tests. In Proceedings of the Usability Professionals Association Conference (Washington, DC: UPA), pp. 83-84.
NIELSEN, J., 1993, Usability engineering (San Diego, CA:
Academic Press).
NIELSEN, J., 1994, Heuristic evaluation. In Nielsen, J. and Mack,
R.L. (eds), Usability Inspection Methods (New York: John
Wiley), pp. 25-61.
NIELSEN, J. and LANDAUER, T.K., 1993, A mathematical
model of the finding of usability problems. In Proceedings of
ACM INTERCHI’93 Conference (Amsterdam, Netherlands:
ACM Press), pp. 206-213.
SWETS, J.A., DAWES, R.M. and MONAHAN, J., 2000, Better decisions through science. Scientific American, 283(4), 82-87.
VIRZI, R.A., 1992, Refining the test phase of usability evaluation:
How many subjects is enough? Human Factors, 34, 457-468.
WOOLRYCH, A. and COCKTON, G., 2001, Why and when five
test users aren’t enough. In Vanderdonckt, J., Blandford, A.
and Derycke A. (eds.) Proceedings of IHM-HCI 2001
Conference, Vol. 2 (Toulouse, France: Cépadèus Éditions),
pp. 105-108.
... The Sample Size Calculator for Discovering Problems in a User Interface (Lewis, 2001;Sauro, 2023) was used to identify the ideal number of participants for both studies. From the sample data, Calculator estimates the problem occurrence (p) using the Good-Turing and Normalization procedure devised by Turner et al. (2006). In both studies, an estimation was made regarding how many respondents would be appropriate to detect at least 85% of the problems encountered. ...
The outbreak of the COVID-19 pandemic caused dashboards to become widely used by the public and decision-makers. Nevertheless, dashboard interfaces have been related to business intelligence since their origins, and the search for improvements in their design is not new. This article's objective is to conduct a user evaluation of COVID-19 dashboards that contain geospatial information. This is done through a formative study to identify problematic aspects of user/dashboard interaction. This is enhanced by comparing two self-developed dashboards that, according to previous tests, have functionalities with different appearances. User evaluation is performed through mixed research that combines objective (eye-tracking) and subjective (a questionnaire and an interview) methods. The results generate recommendations for better-designed dashboard interfaces that can transfer information appropriately. The vital elements needed to achieve this are interactivity, the option to choose the metrics, and the distribution of the elements in the layout, all playing a role in a more user-friendly interaction between the user and the dashboard.
... The required sample size, from a statistical point of view, for user studies, varies [12], but a reasonable rule of thumb is 30 participants (albeit this is not a number with a strong theoretical backing but rather is a convenient rule of thumb). More important than the number is the type and quality of the participants in that they represent the actual end-users of the system being tested. ...
User studies, involving both qualitative and quantitative data about user experiences, can complement and contextualize the insights and data afforded by web and social media analytics, providing an extension to these analytics areas, which we refer to as user study analytics. In this chapter, we discuss some of the dos and don’ts of user studies, including what user studies are, how they are planned and created, and how user study results can be analyzed for positive effects on usability and user experience.
... The survey involved 20 participants (5 expert programmers and 15 students). A sample size of 20 participants is sufficient because, according to the Usability Test Sample Size Model, most usability problems are detected by the first three to five subjects [32]. Running additional subjects during the same test is unlikely to reveal new information. ...
... Examples of these discreet tasks include "navigate to the home page" and "find and click on the section about traveling with your medical equipment." At the end of the usability session, participants completed the brief, 10-item System Usability Scale, the industry standard for measuring usability [22]. Questions are answered on a 5-point response scale (strongly disagree to strongly agree). ...
Full-text available
Background: Bladder cancer survivors and their caregivers face profound practical (eg, use of stoma appliances and care for urinary diversion methods) and psychosocial (eg, depression and anxiety) challenges after surgical treatment with cystectomy. Objective: To improve the health-related quality of life and postsurgical outcomes of both bladder cancer survivors and their caregivers, the team, in collaboration with Sourcetop, Inc (software design) and Dappersmith (graphic design), developed the Cancer Resource and Information Support (CRIS) software. The purpose of this manuscript is to report on the development and usability testing of the CRIS software. Methods: The development of the CRIS software was guided by the Obesity-Related Behavioral Intervention Trials (ORBIT) model for developing behavioral treatments for chronic diseases. The ORBIT model is unique in that it proposes a flexible and progressive process with prespecific clinically significant milestones for forward movement and returns to earlier stages for refinement, and it facilitates communication among diverse groups by using terminology from the drug development model. This paper focuses on 2 phases of the ORBIT model: phase IA: define and IB: refine. During phase IA, the study team developed solutions for the stated clinical problem-adjustment to life post cystectomy-by reviewing the literature and collecting feedback from clinicians, professional organizations, bladder cancer survivors, and their caregivers. During Phase IB, the study team focused on tailoring content in the CRIS software to the user as well as usability testing with 7 participants. Results: The finished product is CRIS, a web-based software for survivors of bladder cancer and their caregivers to serve as a health management and lifestyle resource after surgery. 
Overarching themes from phase IA (participant feedback) included how to use new medical equipment, tips and tricks for easier living with new medical equipment, questions about health maintenance, and questions about lifestyle modifications. To accommodate our target population, we also incorporated recommendations from the Americans with Disabilities Act for website design, such as large text size, large paragraph spacing, highly contrasting text and background colors, use of headings and labels to describe the purpose of the content, portrait orientation without the need for horizontal scrolling, multiple ways to access a web page within a set of pages, ability to navigate web pages in sequential order, and in-text links that are descriptive. Usability participants evaluated CRIS very positively, indicating that it was easy to use, the functions were well-integrated, and if available, they would use CRIS frequently. Conclusions: CRIS, developed over the course of 18 months by integrating feedback from experts, literature reviews, and usability testing, is the first web-based software developed for bladder cancer survivors and their caregivers to help them adjust to life following cystectomy. The efficacy of CRIS in improving patients' and caregivers' quality of life is currently being evaluated in a randomized controlled trial.
... Previous studies have found that 5 users may be sufficient to identify 80% of usability issues, with diminishing returns from additional testing [15]. Hence, the first 33% (5/15) of the participants were invited for a feedback interview at week 3 to identify any critical usability issues at the halfway point of the intervention that might significantly affect participation. ...
Full-text available
Background The responsibilities of being a primary caregiver for a loved one with dementia can produce significant stress for the caregiver, leading to deleterious outcomes for the caregiver’s physical and psychological health. Hence, researchers are developing eHealth interventions to provide support for caregivers. Members of our research team previously developed and tested a positive emotion regulation intervention that we delivered through videoconferencing, in which caregiver participants would meet one-on-one with a trained facilitator. Although proven effective, such delivery methods have limited scalability because they require significant resources in terms of cost and direct contact hours. Objective This study aimed to conduct a pilot test of a socially enhanced, self-guided version of the positive emotion regulation intervention, Social Augmentation of Self-Guided Electronic Delivery of the Life Enhancing Activities for Family Caregivers (SAGE LEAF). Studies have shown that social presence or the perception of others in a virtual space is associated with enhanced learning and user satisfaction. Hence, the intervention leverages various social features (eg, discussion boards, podcasts, videos, user profiles, and social notifications) to foster a sense of social presence among participants and study team members. Methods Usability, usefulness, feasibility, and acceptability data were collected from a pilot test in which participants (N=15) were given full access to the SAGE LEAF intervention over 6 weeks and completed preintervention and postintervention assessments (10/15, 67%). Preliminary outcome measures were also collected, with an understanding that no conclusions about efficacy could be made, because our pilot study did not have a control group and was not sufficiently powered. ResultsThe results suggest that SAGE LEAF is feasible, with participants viewing an average of 72% (SD 42%) of the total available intervention web pages. 
In addition, acceptability was found to be good, as demonstrated by participants’ willingness to recommend the SAGE LEAF program to a friend or other caregiver. Applying Pearson correlational analyses, we found a moderate positive correlation between social presence scores and participants’ willingness to recommend the program to others (r(9)=0.672; P=.03). We also found a positive correlation between social presence scores and participants’ perceptions about the overall usefulness of the intervention (r(9)=0.773; P=.009). This suggests that participants’ sense of social presence may be important for the feasibility and acceptability of the program. Conclusions In this pilot study, the SAGE LEAF intervention demonstrates potential for broad dissemination for dementia caregivers. We aim to incorporate participant feedback about how the social features may be improved in future iterations to enhance usability and to further bolster a sense of social connection among participants and study staff members. Next steps include partnering with dementia clinics and other caregiver-serving organizations across the United States to conduct a randomized controlled trial to evaluate the effectiveness of the intervention.
... The respondent criteria for this study were active Informatics students at UII, Informatics UII alumni, and members of the general public who had accessed the website of the Department of Informatics at UII. The questionnaire method used in user experience testing requires at least 40 respondents to obtain ideal data, with the goal of collecting as many responses as possible to produce a more accurate analysis [21,22]. This minimum number of respondents was determined from a literature review of several studies that performed similar UX testing. ...
The website is one of the media used by some educational institutions to convey information. One of them is the Department of Informatics at the Universitas Islam Indonesia (UII). One of the basic things that need to be considered when developing an academic website is the user experience (UX) a user has when engaging with the website. User experience measures the comfort level of a user when interacting with a product. User experience needs to be evaluated regularly so that the website can continue to meet the expectations of its users. However, there has been no research related to evaluating or testing user experience on the UII Informatics Department website. This study aims to assess the level of experience of UII Informatics Department website visitors. This study employed quantitative research methods, specifically the User Experience Questionnaire (UEQ) method. Using the mean calculation, the UII Informatics Department website received positive assessments on the dimensions of perspicuity, efficiency, attractiveness, dependability, and stimulation, but received negative evaluations on the novelty dimension. According to the benchmark analysis, the dimensions of perspicuity and efficiency are above average, while the dimensions of novelty, stimulation, dependability, and attractiveness are below average. The results of this study can be used as a reference for developing a website for the Department of Informatics with a version that is more in line with user needs.
In this paper, we introduce LagunAR, a mobile outdoor Augmented Reality (AR) application for providing heritage information and 3D visualization on a city scale. The LagunAR application was developed to provide historical information about the city of La Laguna in the 16th century, when it was the main city in the Canary Islands. The application provides a reconstructed 3D model of the city at that time that is shown on a mobile phone, superimposed on the actual city using geolocation. The geolocated position is also used to provide information about several points of interest in the city. The paper describes the design and implementation of the application and details the optimization techniques that have been used to manage the full information of the city using a mobile phone as a sensor and visualization tool. We describe the usability study of the application, carried out using a heuristic test and complemented by a qualitative user test conducted as preliminary research. Results show that it is possible to develop a real-time application that shows the user a city-scale 3D model and also manages the information of the points of interest.
Bullying is a global issue that threatens the safety and wellbeing of children worldwide. While bullying is observed amongst children of all ages, the behavior peaks at ages 11–14 years. One intervention method is through the use of games. 23 mixed-gender children aged 7–11 years participated in this study, which examines the impact of Cairdeas Quest, a fantasy serious game, on increasing bullying awareness and how to respond in different bullying situations. Through an analysis of gameplay metrics, the results show that Cairdeas Quest positively impacted the identification of bullying/cyberbullying and the selection of appropriate responses in various bullying situations. All participants correctly selected the defender role in response to bullying and indirect cyberbullying in the hallway and bedroom scenes. However, the main issue concerned how to respond to verbal bullying in a classroom. In general, female players were more empathic than their male peers in different bullying/cyberbullying situations and were more likely to adopt the defender role. Meanwhile, a greater level of improvement was evidenced in the male players after playing Cairdeas Quest. This is a positive outcome as boys are more likely to adopt the bully perpetrator and bystander roles.
Pulsed-wave Doppler ultrasound is a widely used technique for monitoring pregnancies. As ultrasound equipment becomes more advanced, it becomes harder to train practitioners to be proficient in the procedure as it requires the presence of an expert, access to high-tech equipment as well as several volunteering patients. Immersive environments such as mixed reality can help trainees in this regard due to their capabilities to simulate real environments and objects. In this article, we propose a mixed reality application to facilitate training in performing pulsed-wave Doppler ultrasound when acquiring a spectrogram to measure blood velocity in the umbilical cord. The application simulates Doppler spectrograms while the trainee has the possibility of adjusting parameters such as pulse repetition frequency, sampling depth, and beam-to-flow angle. This is done using a combination of an optimized user interface, 3D-printed objects tracked using image recognition and data acquisition from a gyroscope. The application was developed for Microsoft HoloLens as the archetype of mixed reality, while a 3D-printed abdomen was used to simulate a patient. The application aims to aid in both simulated and real-life ultrasound procedures. Expert feedback and user-testing results were collected to validate the purpose and use of the designed application. Design science research was followed to propose the intended application while contributing to the literature on leveraging immersive environments for medical training and practice. Based on the results of the study, it was concluded that mixed reality can be efficiently used in ultrasound training.
Type 2 diabetic patients benefit significantly if the disease is well controlled through behavioral changes, namely adopting a healthy lifestyle. Currently, there is some evidence that technological strategies can help patient self-management. However, few studies have specifically targeted individuals who solely engage in automatic and personalized self-management practices. This study aims to synthesize the literature regarding personalized feedback recommendation systems to promote behavior change without health professionals’ direct intervention for the management of type 2 diabetes and to verify if the use of these systems improves health-related outcomes. A systematic review was performed from inception to April 13, 2021, based on a search conducted in six databases. According to the defined search expression, studies addressing type 2 diabetic patients and recommendation systems were included. In total, 2186 papers were initially identified, but only 22 met the specific inclusion criteria after screening. Discrepancies in the selection of studies were discussed in consensus meetings. To assess the quality of the articles, two tools were employed according to the types of articles retrieved. Selected papers were summarized regarding specific characteristics such as clinical and technological outcomes. Studies incorporating a recommendation system into their technological solution showed a positive effect on the evaluated outcomes, except for those with longer duration, where the effect was not statistically significant. Although most studies did not report the type of system used, expert systems (rule-based) were found to be the most prevalent among those that did. As behaviors are often difficult to change quickly, it is recommended that future studies extend their follow-up timeframe. 
Nevertheless, the data obtained suggest that technological solutions incorporating a recommendation system show potential to improve health-related outcomes and demonstrated good usability.
Recently, Virzi (1992) presented data that support three claims regarding sample sizes for usability studies: (1) observing four or five participants will allow a usability practitioner to discover 80% of a product's usability problems, (2) observing additional participants will reveal fewer and fewer new usability problems, and (3) more severe usability problems are easier to detect with the first few participants. Results from an independent usability study clearly support the second claim, partially support the first, but fail to support the third. Problem discovery shows diminishing returns as a function of sample size. Observing four to five participants will uncover about 80% of a product's usability problems as long as the average likelihood of problem detection ranges between 0.32 and 0.42, as in Virzi. If the average likelihood of problem detection is lower, then a practitioner will need to observe more than five participants to discover 80% of the problems. Using behavioral categories for problem severity (or impact), these data showed no correlation between problem severity (impact) and rate of discovery. The data provided evidence that the binomial probability formula may provide a good model for predicting problem discovery curves, given an estimate of the average likelihood of problem detection. Finally, data from economic simulations that estimated return on investment (ROI) under a variety of settings showed that only the average likelihood of problem detection strongly influenced the range of sample sizes for maximum ROI.
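The binomial discovery curve this abstract refers to is simple enough to sketch directly. The snippet below (an illustrative sketch, not code from any of the cited papers) computes the expected proportion of problems found, 1-(1-p)^n, and reproduces the point that five participants uncover about 80% or more of the problems when p falls in Virzi's reported 0.32-0.42 range:

```python
def discovered(p, n):
    """Expected proportion of usability problems found with n
    participants, given an average per-participant detection
    likelihood p (binomial discovery model)."""
    return 1 - (1 - p) ** n

# At the low and high ends of Virzi's reported range of p,
# five participants already exceed the 80% discovery target:
for p in (0.32, 0.42):
    print(f"p={p}: {discovered(p, 5):.3f}")
```

With a lower detection likelihood (say p=0.10), the same formula gives only about 41% discovery for five participants, which is the abstract's caveat about needing more than five.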
Simon (TM-Bellsouth Corp.) is a commercially available personal communicator (PC) combining features of a PDA (personal digital assistant) with a full suite of communications features. This paper describes the involvement of human factors engineering in the development of Simon, and summarizes the various approaches to usability evaluation employed during its development. Simon has received a considerable amount of praise from the industry and won several industry awards, with recognition both for its innovative engineering and its usability.
There are 2 excellent reasons to compute usability problem-discovery rates. First, an estimate of the problem-discovery rate is a key component for projecting the required sample size for a usability study. Second, practitioners can use this estimate to calculate the proportion of discovered problems for a given sample size. Unfortunately, small-sample estimates of the problem-discovery rate suffer from a serious overestimation bias. This bias can lead to serious underestimation of required sample sizes and serious overestimation of the proportion of discovered problems. This article contains descriptions and evaluations of a number of methods for adjusting small-sample estimates of the problem-discovery rate to compensate for this bias. A series of Monte Carlo simulations provided evidence that the average of a normalization procedure and Good-Turing (Jelinek, 1997; Manning & Schütze, 1999) discounting produces highly accurate estimates of usability problem-discovery rates from small sample sizes.
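The combined adjustment the abstract describes can be sketched as follows. This is a hedged reconstruction of the commonly cited form of the procedure (a deflation/normalization term averaged with a Good-Turing discount based on the proportion of problems seen exactly once); the exact formula and its derivation should be checked against the article itself, and the variable names here are mine:

```python
def adjusted_p(p_est, n, n_problems, n_singletons):
    """Adjust a small-sample estimate of the problem-discovery rate.

    p_est        -- raw (overestimated) discovery rate from n participants
    n            -- number of participants observed
    n_problems   -- total distinct problems observed
    n_singletons -- problems observed by exactly one participant
    """
    # Normalization (deflation) component
    defl = (p_est - 1.0 / n) * (1 - 1.0 / n)
    # Good-Turing component: discount by the proportion of
    # problems that occurred only once
    gt = p_est / (1 + n_singletons / n_problems)
    # The recommended estimate averages the two components
    return (defl + gt) / 2

# Example: a raw estimate of 0.50 from 5 participants, with 4 of
# 10 observed problems seen only once, deflates to roughly 0.30.
print(round(adjusted_p(0.5, 5, 10, 4), 3))
```

The deflated estimate, rather than the raw one, is what should feed the discovery curve when projecting required sample sizes from a small pilot.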
SUMMARY Nielsen's claim that "Five Users are Enough" (5) is based on a statistical formula (2) that makes unwarranted assumptions about individual differences in problem discovery, combined with optimistic setting of values for a key variable. We present the initial Landauer-Nielsen formula and recent evidence that it can fail spectacularly to calculate the required number of test users for a realistic web-based test. We explain these recent results by examining the assumptions behind the formula. We then re-examine some of our own data, and find that, while the Landauer-Nielsen formula does hold, this is only the case for simple problem counts. An analysis of problem frequency and severity indicates that highly misleading results could have been obtained, as the number of required users almost doubles. Lastly, we identify the structure and components of a more realistic approach to estimating test user requirements.
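The sensitivity this critique points at is easy to make concrete. Inverting the discovery model for n gives the required sample size, and a modestly lower p inflates it sharply (an illustrative sketch under the standard binomial formula, not the authors' own analysis):

```python
import math

def required_n(p, goal=0.80):
    """Smallest n such that 1 - (1 - p)**n >= goal, i.e. the number
    of test users needed to reach the target discovery proportion."""
    return math.ceil(math.log(1 - goal) / math.log(1 - p))

# An optimistic p near Nielsen's published value keeps n small,
# but a realistic low p for a large website more than triples it:
print(required_n(0.31))  # -> 5 users
print(required_n(0.10))  # -> 16 users
```

Because n depends on the logarithm of (1-p), small errors in an optimistically set p translate into large errors in the recommended number of test users.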
Computer professionals have a need for robust, easy-to-use usability evaluation methods (UEMs) to help them systematically improve the usability of computer artifacts. However, cognitive walkthrough (CW), heuristic evaluation (HE), and thinking-aloud study (TA)—3 of the most widely used UEMs—suffer from a substantial evaluator effect in that multiple evaluators evaluating the same interface with the same UEM detect markedly different sets of problems. A review of 11 studies of these 3 UEMs reveals that the evaluator effect exists for both novice and experienced evaluators, for both cosmetic and severe problems, for both problem detection and severity assessment, and for evaluations of both simple and complex systems. The average agreement between any 2 evaluators who have evaluated the same system using the same UEM ranges from 5% to 65%, and no 1 of the 3 UEMs is consistently better than the others. Although evaluator effects of this magnitude may not be surprising for a UEM as informal as HE, it is certainly notable that a substantial evaluator effect persists for evaluators who apply the strict procedure of CW or observe users thinking out loud. Hence, it is highly questionable to use a TA with 1 evaluator as an authoritative statement about what problems an interface contains. Generally, the application of the UEMs is characterized by (a) vague goal analyses leading to variability in the task scenarios, (b) vague evaluation procedures leading to anchoring, or (c) vague problem criteria leading to anything being accepted as a usability problem, or all of these. The simplest way of coping with the evaluator effect, which cannot be completely eliminated, is to involve multiple evaluators in usability evaluations.
Attention has been given to making user interface design and testing less costly so that it might be more easily incorporated into the product development life cycle. Three experiments are reported in this paper that relate the proportion of usability problems identified in an evaluation to the number of subjects participating in that study. The basic findings are that (a) 80% of the usability problems are detected with four or five subjects, (b) additional subjects are less and less likely to reveal new information, and (c) the most severe usability problems are likely to have been detected in the first few subjects. Ramifications for the practice of human factors are discussed as they relate to the type of usability test cycle employed and the goals of the usability test.
Using the example of a real product, this paper shows how various usability assessments, conducted by different human factors engineers, in several phases of the product's development life-cycle, identified similar potential usability problems. Circumstances dictated that no remedial action was taken, so it was possible to track these potential usability defects to customer sites, where it was found that most of the important problems did indeed occur. Thus, it can be demonstrated that human factors advice was valid and reliable. In simpler terms, early usability evaluation by human factors engineers can save hours of wasted development effort and customer frustration, and can help to ensure that a more usable product is produced.
Much attention has been paid to the question of how many subjects are needed in usability research. Virzi (1992) modelled the accumulation of usability problems with increasing numbers of subjects and claimed that five subjects are sufficient to find most problems. The current paper argues that this answer is based on an important assumption, namely that all types of users have the same probability of encountering all usability problems. If this homogeneity assumption is violated, then more subjects are needed. A modified version of Virzi's model demonstrates that the number of subjects required increases with the number of heterogeneous groups. The model also shows that the more distinctive the groups, the more subjects will be required. This paper will argue that the simple answer 'five' cannot be applied in all circumstances. It most readily applies when the probability that a user will encounter a problem is both high and similar for all users. It also only applies to simple usability tests that seek to detect the presence, but not the statistical prevalence, of usability problems.
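Caulton's homogeneity point can be illustrated numerically. In the sketch below (my illustration of the argument, not the paper's actual model), each problem is assumed to affect only one user subgroup, so a problem can only be discovered by participants sampled from that subgroup; splitting a fixed budget of participants across distinct groups lowers the expected discovery proportion:

```python
def discovered_heterogeneous(p, users_per_group):
    """Expected discovery proportion when each problem affects exactly
    one subgroup of users, problems are spread evenly over subgroups,
    and users_per_group gives the sampled users in each subgroup."""
    return sum(1 - (1 - p) ** n for n in users_per_group) / len(users_per_group)

# Six homogeneous users versus six users split over two distinct
# groups, at the same per-user detection likelihood:
print(round(discovered_heterogeneous(0.35, [6]), 3))     # one group
print(round(discovered_heterogeneous(0.35, [3, 3]), 3))  # two groups
```

With two equally sized, fully distinct groups, each problem effectively sees only half the sample, so the discovery proportion drops from roughly 0.92 to roughly 0.73 in this example, which is why the required number of participants grows with the number and distinctiveness of the groups.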