International Encyclopedia of Ergonomics and Human Factors, 2006, Second Edition, Volume 3
Edited by Waldemar Karwowski, Boca Raton, FL: CRC Press
Determining Usability Test Sample Size
Carl W. Turner, State Farm Insurance Cos., Bloomington, IL 61791, USA
James R. Lewis, IBM Corp., Boca Raton, FL 33487, USA
Jakob Nielsen, Nielsen Norman Group, Fremont, CA 94539, USA
1 INTRODUCTION
Virzi (1992), Nielsen and Landauer (1993), and Lewis
(1994) have published influential articles on the topic of
sample size in usability testing. In these articles, the authors
presented a mathematical model of problem discovery rates
in usability testing. Using the problem discovery rate
model, they showed that it was possible to determine the
sample size needed to uncover a given proportion of
problems in an interface during one test. The authors
presented empirical evidence for the models and made
several important claims:
• Most usability problems are detected with the first three to five subjects.
• Running additional subjects during the same test is unlikely to reveal new information.
• Return on investment (ROI) in usability testing is maximized when testing with small groups using an iterative test-and-design methodology.
Nielsen and Landauer (1993) extended Virzi’s (1992)
original findings and reported case studies that supported
their claims for needing only small samples for usability
tests. They and Lewis (1994) identified important
assumptions about the use of the formula for estimating
problem discovery rates. The problem discovery rate model
was recently re-examined by Lewis (2001).
2 THE ORIGINAL FORMULAE
Virzi (1992) published empirical data supporting the use of
the cumulative binomial probability formula to estimate
problem discovery rates. He reported three experiments in
which he measured the rate at which usability experts and
trained student assistants identified problems as a function
of the number of naive participants they observed. Problem
discovery rates were computed for each participant by
dividing the number of problems uncovered during an
individual test session by the total number of unique
problems found during testing. The average likelihood of
problem detection was computed by averaging all
participants’ individual problem discovery rates.
Virzi (1992) used Monte Carlo simulations to
permute participant orders 500 times to obtain the average
problem discovery curves for his data. Across three sets of
data, the average likelihoods of problem detection (p in Equation 1, below) were 0.32, 0.36, and 0.42. He also had the
observers (Experiment 2) and an independent group of
usability experts (Experiment 3) provide ratings of problem
severity for each problem. Based on the outcomes of these
experiments, Virzi made three claims regarding sample size
for usability studies: (1) Observing four or five participants
allows practitioners to discover 80% of a product’s usability
problems, (2) observing additional participants reveals
fewer and fewer new usability problems, and (3) observers
detect the more severe usability problems with the first few
participants. Based on these data, he claimed that running
tests using small samples in an iterative test-and-design
fashion would identify most usability problems and save
both time and money.
Proportion of unique problems found = 1 - (1 - p)^n    (1)
where p is the mean problem discovery rate computed
across subjects (or across problems) and n is the number of
subjects.
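As a quick illustration (our numbers, using one of Virzi's reported discovery rates): with p = .32 and n = 5, Equation 1 predicts 1 - (1 - .32)^5 = 1 - .68^5 ≈ .85, so a five-participant test would be expected to uncover roughly 85% of the problems available for discovery.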
Seeking to quantify the patterns of problem
detection observed in several fairly large-sample studies of
problem discovery (using either heuristic evaluation or user
testing), Nielsen and Landauer (1993) derived the same formula from a Poisson process model (constant probability, path independent). They found that it provided a good fit to
their problem-discovery data, and provided a basis for
predicting the number of problems existing in an interface
and performing cost-benefit analyses to determine
appropriate sample sizes. Across 11 studies (five user tests
and six heuristic evaluations), they found the average value
of p to be .33 (ranging from .16 to .60, with associated
estimates of p ranging from .12 to .58). Nielsen and
Landauer used lambda rather than p, but the two concepts
are essentially equivalent. In the literature, λ (lambda), L,
and p are commonly used to represent the average
likelihood of problem discovery. Throughout this article,
we will use p.
Number of unique problems found = N(1 - (1 - p)^n)    (2)
where p is the problem discovery rate, N is the total number
of problems in the interface, and n is the number of subjects.
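The following short Python sketch (ours, not from the chapter; the function names and the illustrative values of p, N, and n are assumptions) implements Equations 1 and 2 so that readers can reproduce this kind of projection:

```python
def proportion_discovered(p, n):
    """Equation 1: expected proportion of unique problems found with n subjects."""
    return 1 - (1 - p) ** n

def problems_discovered(N, p, n):
    """Equation 2: expected number of unique problems found, given N total problems."""
    return N * proportion_discovered(p, n)

if __name__ == "__main__":
    # p = .33 is the average discovery rate reported by Nielsen and Landauer (1993);
    # N = 40 total problems is an arbitrary illustrative value.
    for n in (1, 3, 5, 10, 15):
        print(n,
              round(proportion_discovered(0.33, n), 2),
              round(problems_discovered(40, 0.33, n), 1))
```

With p = .33, the projected proportion of discovered problems reaches about .86 at n = 5 and about .98 at n = 10.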
The problem discovery rate was approximately .3
when averaged across a large number of independent tests,
but the rate for any given usability test will vary depending
on several factors (Nielsen & Landauer, 1993). These
factors include:
• Properties of the system and interface, including the size of the application.
• Stage in the usability lifecycle at which the product is tested, whether early in the design phase or after several iterations of test and re-design.
• Type and quality of the methodology used to conduct the test.
• Specific tasks selected.
• Match between the test and the context of real-world usage.
• Representativeness of the test participants.
• Skill of the evaluator.
Research following these lines of investigation led to other,
related claims. Nielsen (1994) applied the formula in
Equation 2 to a study of problem discovery rate for heuristic
evaluations. Eleven usability specialists evaluated a
complex prototype system for telephone company
employees. The evaluators obtained training on the system
and the goals of the evaluation. They then independently
documented usability problems in the user interface based
on published usability heuristics. The average value of p
across 11 evaluators was .29, similar to the rates found
during talk-aloud user testing (Nielsen & Landauer, 1993;
Virzi, 1992).
Lewis (1994) replicated the techniques applied by
Virzi (1992) to data from a usability study of a suite of
office software products. The problem discovery rate for
this study was .16. The results of this investigation clearly
supported Virzi’s second claim (additional participants
reveal fewer and fewer problems), partially supported the
first (observing four or five participants reveals about 80%
of a product’s usability problems as long as the value of p
for a study is in the approximate range of .30 to .40), and
failed to support the third (there was no correlation between
problem severity and likelihood of discovery). Lewis noted
that it is most reasonable to use small-sample problem
discovery studies “if the expected p is high, if the study will
be iterative, and if undiscovered problems will not have
dangerous or expensive outcomes” (1994, p. 377).
3 RECENT CHALLENGES
Recent challenges to the estimation of problem discovery
rates appear to take two general forms. The first questions
the reliability of problem discovery procedures (user testing,
heuristic evaluation, cognitive walkthrough, etc.). If
problem discovery is completely unreliable, then how can
anyone model it? Furthermore, how can one account for the
apparent success of iterative problem-discovery procedures
in increasing the usability of the products against which they
are applied?
The second questions the validity of modeling the
probability of problem discovery with a single value for p.
Other issues, such as the fact that claiming high proportions of problem discovery with few participants requires a fairly high value of p, that different task sets lead to different opportunities to discover problems, and that iteration is important, are addressed at length in earlier papers (Lewis, 1994; Nielsen, 1993).
3.1 Is Usability Problem Discovery Reliable?
Molich et al. (1998) conducted a study in which four
different usability labs evaluated a calendar system and
prepared reports of the usability problems they discovered.
An independent team of usability professionals compared
the reports produced by the four labs. The number of
unique problems identified by each lab ranged from four to
98. Only one usability problem was reported by all four
labs. The teams that conducted the studies noted difficulties
in conducting the evaluations that included a lack of testing
goals, no access to the product development team, a lack of
user profile information, and no design goals for the
product.
Kessner et al. (2001) have also reported data that
question the reliability of usability testing. They had six
professional usability teams test an early prototype of a
dialog box. The total number of usability problems was
determined to be 36. None of the problems were identified
by every team, and only two were reported by five teams.
Twenty of the problems were reported by at least two teams.
After comparing their results with those of Molich et al. (1998), Kessner et al. suggested that more specific and
focused requests by a client should lead to more overlap in
problem discovery.
Hertzum and Jacobsen (2001) have termed the lack of inter-rater reliability among test observers an ‘evaluator effect’: “multiple evaluators evaluating the same interface with the same usability evaluation method detect markedly different sets of problems” (p. 421). Across a
review of 11 studies, they found the average agreement
between any two evaluators of the same system ranged from
5% to 65%, with no usability evaluation method (cognitive
walkthroughs, heuristic evaluations, or think-aloud user
studies) consistently more effective than another. Their
review, and the studies of Molich et al. (1998) and Kessner
et al. (2001) point out the importance of setting clear test
objectives, running repeatable test procedures, and adopting
clear definitions of usability problems. Given that multiple
evaluators increase the likelihood of problem detection
(Nielsen, 1994), they suggested that one way to reduce the
evaluator effect is to involve multiple evaluators in usability
tests.
The results of these studies are in stark contrast to
earlier studies in which usability problem discovery was
reported to be reliable (Lewis, 1996; Marshall, Brendon, &
Prail, 1990). The widespread use of usability problem
discovery methods indicates that practitioners believe they
are reliable. Despite this widespread belief, an important
area of future research will be to reconcile the studies that
have challenged the reliability of problem discovery with
the apparent reality of usability improvement achieved
through iterative application of usability problem discovery
methods. For example, there might be value in exploring
the application of signal detection theory (Swets, Dawes, &
Monahan, 2000) to the detection of usability problems.
3.2 Issues in the Estimation of p
Woolrych and Cockton (2001) challenged the
assumption that a simple estimate of p is sufficient for the
purpose of estimating the sample size required for the
discovery of a specified percentage of usability problems in
an interface. Specifically, they criticized the formula for
failing to take into account individual differences in
problem discoverability and also claimed that the typical
values used for p (around .30) are overly optimistic. They
also pointed out that the circularity in estimating the key
parameter of p from the study for which you want to
estimate the sample size reduces its utility as a planning
tool. Following close examination of data from a previous
study of heuristic evaluation, they found combinations of
five participants which, if they had been the only five
participants studied, would have dramatically changed the
resulting problems lists, both for frequency and severity.
They recommended the development of a formula that
replaces a single value for p with a probability density
function.
Caulton (2001) claimed that the simple estimate of
p only applies given a strict homogeneity assumption that
all types of users have the same probability of encountering
all usability problems. To address this, Caulton added to the
standard cumulative binomial probability formula a
parameter for the number of heterogeneous groups. He also
introduced and modeled the concept of problems that
heterogeneous groups share and those that are unique to a
particular subgroup. His primary claims were (1) the more
subgroups, the lower will be the expected value of p and (2)
the more distinct the subgroups are, the lower will be the
expected value of p.
Most of the arguments of Woolrych and Cockton
(2001) were either addressed in previous literature or do not
stand up against the empirical findings reported in previous
literature. It is true that estimates of p can vary widely from
study to study. This characteristic of usability testing can be
addressed by estimating p for a study after running two
subjects and adjusting the estimate as the study proceeds
(Lewis, 2001). There are problems with the estimation of p
from the study to which you want to apply it, but recent
research (discussed below) provides a way to overcome
these problems. Of course, it is possible to select different
subsets of participants who experienced problems in a way
that leads to an overestimate of p (or an underestimate of p,
or any value of p that the person selecting the data wishes).
Test administrators should follow accepted practice and
select evaluators who represent the range of knowledge and
skills found in the population of end users. There is no
compelling evidence that a probability density function
would lead to an advantage over a single value for p,
although there might be value in computing confidence
intervals for single values of p.
Caulton’s (2001) refinement of the model is
consistent with the observation that different user groups
expose different types of usability problems (Nielsen, 1993).
It is good practice to include participants from significant user groups in each test: three or four per group when testing two groups, and three per group when testing more than two groups. If
there is a concern that different user groups will uncover
different sets of usability problems then the data for each
group can be analyzed separately, and a separate p
computed for each user group. However, Caulton’s claim
that problem discovery estimates are always inflated when
averaged across heterogeneous groups and problems with
different values of p is inconsistent with the empirical data
presented in Lewis (1994). Lewis demonstrated that p is
robust, showing that the mean value of p worked very well
for modeling problem discovery in a set of problems that
had widely varying values of p.
4 IMPROVING SMALL-SAMPLE ESTIMATION OF p
Lewis (2001), responding to an observation by Hertzum and
Jacobsen (2001) that small-sample estimates of p are almost
always inflated, investigated a variety of methods for
adjusting these small-sample estimates to enable accurate
assessment of sample size requirements and true proportions
of discovered problems. Using data from a series of Monte
Carlo studies applied against four published sets of problem
discovery databases, he found that a technique based on
combining information from a normalization procedure and
a discounting method borrowed from statistical language
modeling produced very accurate adjustments for small-
sample estimates of p. The Good-Turing (GT) discounting
procedure reduced, but did not completely eliminate, the
overestimate of problem discovery rates produced by small-
sample p estimates. The GT adjustment, shown in Equation 3, was:

p_GT = p_est / (1 + E(N_1)/N)    (3)

where p_est is the initial estimate computed from the raw data of a usability study, E(N_1) is the number of usability problems detected by only one user, and N is the total number of unique usability problems detected by all users.
By contrast, the normalization procedure (Norm) slightly underestimated problem discovery rates. The equation was:

p_norm = (p_est - 1/n)(1 - 1/n)    (4)

where p_est is the initial estimate computed from the raw data of a usability study and n is the number of test participants. He concluded that the overestimation of p
from small-sample usability studies is a real problem with
potentially troubling consequences for usability
practitioners, but that it is possible to apply these procedures
(normalization and Good-Turing discounting) to
compensate for the overestimation bias. Applying each
procedure to the initial estimate of p, then averaging the
results, produces a highly accurate estimate of the problem
discovery rate. Equation 5 shows the formula for an adjusted p estimate based on averaging the Good-Turing and normalization adjustments:

p_adj = (1/2)[p_est / (1 + E(N_1)/N)] + (1/2)[(p_est - 1/n)(1 - 1/n)]    (5)
“Practitioners can obtain accurate sample size estimates for
problem-discovery goals ranging from 70% to 95% by
making an initial estimate of the required sample size after
running two participants, then adjusting the estimate after
obtaining data from another two (total of four) participants”
(Lewis, 2001, p.474).
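A minimal Python sketch of this adjustment, written against Equations 3-5 as reconstructed above (the function name and the input format are our assumptions, not the chapter's), is:

```python
def adjusted_discovery_rate(matrix):
    """Adjust a small-sample estimate of p (Equations 3-5).

    matrix: one list per participant, with 1 if that participant
            experienced a given problem and 0 otherwise.
    Returns (p_est, p_gt, p_norm, p_adj).
    """
    n = len(matrix)                                # number of participants
    counts = [sum(col) for col in zip(*matrix) if sum(col) > 0]
    N = len(counts)                                # unique problems observed
    p_est = sum(counts) / (n * N)                  # raw discovery rate

    singletons = sum(1 for c in counts if c == 1)  # E(N1): problems seen once
    p_gt = p_est / (1 + singletons / N)            # Equation 3
    p_norm = (p_est - 1 / n) * (1 - 1 / n)         # Equation 4
    p_adj = (p_gt + p_norm) / 2                    # Equation 5
    return p_est, p_gt, p_norm, p_adj
```

In line with the quoted recommendation, a practitioner could call such a function after the second participant to make an initial sample size projection, then call it again after the fourth participant to refine the estimate.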
The results of a return-on-investment (ROI) model
for usability studies (Lewis, 1994) indicated that the
magnitude of p affected the point at which the percentage of
problems discovered maximized ROI. For values of p
ranging from .10 to .5, the appropriate problem discovery
goal ranged from .86 to .98, with lower values of p
associated with lower problem discovery goals.
5 AN APPLICATION OF THE ADJUSTMENT PROCEDURES
In the example shown in Table 1, a usability test with eight
participants has led to the discovery of four unique usability
problems. The problem discovery rates (p) for individual
participants ranged from 0.0 to .75. The problem discovery
rates for specific problems ranged from .125 to .875. The
average problem discovery rate (averaged either across problems or participants), p_est, was .375. Note that Problems 2 and 4 were detected by only one participant (Participants 2 and 7, respectively), so two of the four unique problems were detected exactly once.
TABLE 1
Data from a Hypothetical Usability Test with Eight Subjects, p_est = .375

Subject   Problem 1   Problem 2   Problem 3   Problem 4   Count   p
1         1           0           1           0           2       0.500
2         1           0           1           1           3       0.750
3         1           0           0           0           1       0.250
4         0           0           0           0           0       0.000
5         1           0           1           0           2       0.500
6         1           0           0           0           1       0.250
7         1           1           0           0           2       0.500
8         1           0           0           0           1       0.250
Count     7           1           3           1
p         0.875       0.125       0.375       0.125               0.375
Applying the Good-Turing estimating procedure from Equation 3 and the normalization procedure from Equation 4 to these data, then averaging the two estimates as shown in Equation 5, gives the adjusted problem discovery rate.
With this adjusted value of p and the known sample size, it
is possible to estimate the sample size adequacy of this
study using the cumulative binomial probability formula: 1 - (1 - .25)^8 = .90. If the problem discovery goal for this
study had been 90%, then the sample size was adequate. If
the discovery goal had been lower, the sample size would be
excessive, and if the discovery goal had been higher, the
sample size would be inadequate. The discovery of only
four problems (one problem for every two participants)
suggests that the discovery of additional problems would be
difficult. If four problems constitute 90% of the problems
available for discovery given the specifics of this usability
study, then 100% of the problems available for discovery
should be about 4/.9, or 4.44. In non-numerical terms, there
probably aren’t a lot of additional problems to extract from
this problem discovery space.
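This adequacy check can be scripted in a few lines (a sketch; the variable names are ours, and the .25 value is the chapter's adjusted estimate for the Table 1 data):

```python
p_adj, n, problems_found = 0.25, 8, 4

achieved = 1 - (1 - p_adj) ** n               # Equation 1 -> about .90
total_estimated = problems_found / achieved   # about 4.4 problems in total
print(round(achieved, 2), round(total_estimated, 1))
```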
As an example of sample size estimation, suppose
you had data from the first four participants and wanted to
estimate the number of participants you’d need to run to
achieve 90% problem discovery. After running the fourth
participant, there were three discovered problems (because
Problem 2 did not occur until Participant 7), as shown in
Table 2. One of those problems (Problem 4) occurred only
once.
TABLE 2
Data from a Hypothetical Usability Test; First Four Subjects, p_est = .500

Subject   Problem 1   Problem 3   Problem 4   Count   p
1         1           1           0           2       0.667
2         1           1           1           3       1.000
3         1           0           0           1       0.333
4         0           0           0           0       0.000
Count     3           2           1
p         0.750       0.500       0.250               0.500
Applying the Good-Turing estimating procedure from Equation 3 gives .500/(1 + 1/3) = .375. Applying normalization as shown in Equation 4 gives (.500 - .250)(1 - .250) = .1875. The average of the two estimates is (.375 + .1875)/2 ≈ .28. Given p = .28, the estimated proportion of discovered problems would be 1 - (1 - .28)^4, or .73. Doing the same
computation with n = 7 gives .90, indicating that the
appropriate sample size for the study would be 7. Note that
in the matrix for this hypothetical study, running the eighth
participant did not reveal any new problems.
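As a cross-check on this arithmetic (a sketch using our own variable names, with intermediate values carried at full precision), a few lines of Python reproduce the .73 and .90 figures:

```python
# Values read from Table 2 (first four participants):
p_est, n, N, singletons = 0.500, 4, 3, 1    # singletons = E(N1)

p_gt = p_est / (1 + singletons / N)         # Equation 3 -> .375
p_norm = (p_est - 1 / n) * (1 - 1 / n)      # Equation 4 -> .1875
p_adj = (p_gt + p_norm) / 2                 # Equation 5 -> about .28

for subjects in range(4, 9):
    projected = 1 - (1 - p_adj) ** subjects  # Equation 1
    print(subjects, round(projected, 2))     # about .73 at n = 4, .90 at n = 7
```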
6 CONCLUSIONS
The cumulative binomial probability formula (given
appropriate adjustment of p when estimated from small
samples) provides a quick and robust means of estimating
problem discovery rates (p). This estimate can be used to
estimate usability test sample size requirements (for studies
that are underway) and to evaluate usability test sample size
adequacy (for studies that have already been conducted).
Further research is needed to answer remaining questions
about when usability testing is reliable, valid, and useful.
REFERENCES
CAULTON, D.A., 2001, Relaxing the homogeneity assumption in
usability testing. Behaviour & Information Technology, 20,
1-7.
HERTZUM, M. and JACOBSEN, N.E., 2001, The evaluator
effect: a chilling fact about usability evaluation methods.
International Journal of Human-Computer Interaction, 13,
421-443.
KESSNER, M., WOOD, J., DILLON, R.F. and WEST, R.L., 2001,
On the reliability of usability testing. In Jacko, J. and Sears,
A., (eds), Conference on Human Factors in Computing
Systems: CHI 2001 Extended Abstracts (Seattle, WA: ACM
Press), pp. 97-98.
LEWIS, J.R., 1994, Sample sizes for usability studies: Additional
considerations. Human Factors, 36, 368-378.
LEWIS, J.R., 1996, Reaping the benefits of modern usability
evaluation: The Simon story. In Salvendy, G. and Ozok, A.,
(eds), Advances in Applied Ergonomics: Proceedings of the
1st International Conference on Applied Ergonomics ICAE
'96 (Istanbul, Turkey: USA Publishing), pp. 752-757.
LEWIS, J.R., 2001, Evaluation of procedures for adjusting
problem-discovery rates estimated from small samples.
International Journal of Human-Computer Interaction, 13,
445-479.
MARSHALL, C., BRENDAN, M. and PRAIL, A., 1990, Usability of product X: lessons from a real product. Behaviour &
Information Technology, 9, 243-253.
MOLICH, R., BEVAN, N., CURSON, I., BUTLER, S.,
KINDLUND, E., MILLER, D. and KIRAKOWSKI, J., 1998,
Comparative evaluation of usability tests. In Proceedings of
the Usability Professionals Association Conference
(Washington, DC: UPA), pp. 83-84.
NIELSEN, J., 1993, Usability engineering (San Diego, CA:
Academic Press).
NIELSEN, J., 1994, Heuristic evaluation. In Nielsen, J. and Mack,
R.L. (eds), Usability Inspection Methods (New York: John
Wiley), pp. 25-61.
NIELSEN, J. and LANDAUER, T.K., 1993, A mathematical
model of the finding of usability problems. In Proceedings of
ACM INTERCHI’93 Conference (Amsterdam, Netherlands:
ACM Press), pp. 206-213.
SWETS, J.A., DAWES, R.M. and MONAHAN, J., 2000, Better
decisions through science. Scientific American, 283(4), 82-
87.
VIRZI, R.A., 1992, Refining the test phase of usability evaluation:
How many subjects is enough? Human Factors, 34, 457-468.
WOOLRYCH, A. and COCKTON, G., 2001, Why and when five
test users aren’t enough. In Vanderdonckt, J., Blandford, A.
and Derycke A. (eds.) Proceedings of IHM-HCI 2001
Conference, Vol. 2 (Toulouse, France: Cépadèus Éditions),
pp. 105-108.