HUMAN FACTORS, 1994, 36(2), 368-378

Sample Sizes for Usability Studies:
Additional Considerations

JAMES R. LEWIS, International Business Machines, Inc., Boca Raton, Florida

Requests for reprints should be sent to James R. Lewis, IBM Corp., P.O. Box 1328, Boca Raton, FL 33429-1328.
Recently, Virzi (1992) presented data that support three claims regarding sample
sizes for usability studies: (1) observing four or five participants will allow a us-
ability practitioner to discover 80% of a product's usability problems, (2) observ-
ing additional participants will reveal fewer and fewer new usability problems,
and (3) more severe usability problems are easier to detect with the first few
participants. Results from an independent usability study clearly support the sec-
ond claim, partially support the first, but fail to support the third. Problem dis-
covery shows diminishing returns as a function of sample size. Observing four to
five participants will uncover about 80% of a product's usability problems as long
as the average likelihood of problem detection ranges between 0.32 and 0.42, as in
Virzi. If the average likelihood of problem detection is lower, then a practitioner
will need to observe more than five participants to discover 80% of the problems.
Using behavioral categories for problem severity (or impact), these data showed no
correlation between problem severity (impact) and rate of discovery. The data
provided evidence that the binomial probability formula may provide a good
model for predicting problem discovery curves, given an estimate of the average
likelihood of problem detection. Finally, data from economic simulations that
estimated return on investment (ROI) under a variety of settings showed that only
the average likelihood of problem detection strongly influenced the range of sam-
ple sizes for maximum ROI.
INTRODUCTION
The goal of many usability studies is to
identify design problems and recommend
product changes (to either the current prod-
uct or future products) based on the design
problems (Gould, 1988; Grice and Ridgway,
1989; Karat, Campbell, and Fiegel, 1992;
Whitefield and Sutcliffe, 1992; Wright and
Monk, 1991). During a usability study, an ob-
server watches representative participants
perform representative tasks to understand
when and how they have problems using a
product. The problems provide clues about
how to redesign the product to either elimi-
nate the problem or provide easy recovery
from it (Lewis and Norman, 1986; Norman,
1983).
Human factors engineers who conduct in-
dustrial usability evaluations need to under-
stand their sample size requirements. If they
collect a larger sample than necessary, they
© 1994, Human Factors and Ergonomics Society. All rights reserved.
might increase product cost and development
time. If they collect too small a sample, they
might fail to detect problems that, uncor-
rected, would reduce the usability of the
product. Discussing usability testing, Keeler
and Denning (1991) showed a common nega-
tive attitude toward small-sample usability
studies when they stated, "actual [usability]
test procedures cut corners in a manner that
would be unacceptable to true empirical in-
vestigations. Test groups are small (between
6 and 20 subjects per test)" (p. 290). Yet, in
any setting, not just an industrial one, the ap-
propriate sample size accomplishes the goals
of the study as efficiently as possible (Krae-
mer and Thiemann, 1987).
Virzi (1992) investigated sample size re-
quirements for usability evaluations. He re-
ported three experiments in which he mea-
sured the rate at which trained usability
experts identified problems as a function of
the number of naive participants they had ob-
served. For each experiment, he ran a Monte
Carlo simulation to permute participant or-
ders 500 times and measured the cumulative
percentage of problems discovered for each
sample size. In the second experiment, the
observers provided ratings of problem sever-
ity (using a seven-point scale). In addition to
having the observers provide problem sever-
ity ratings (this time using a three-point
scale), an independent set of usability experts
provided estimates of problem severity based
on brief, one-paragraph descriptions of the
problems discovered in the third experiment.
This helped to control the effect of knowledge
of problem frequency on the estimation of
problem severity. The correlation between
problem frequency and test observers' judg-
ment of severity in the second experiment
was 0.463 (p < 0.01). In the third experiment, agreement among the test observers and the independent set of judges was significant, W(16) = 0.471, p < 0.001, for the rank order of
17 problems in terms of how disruptive they
were likely to be to the usability of the sys-
tem. (Virzi did not report the magnitude of
the correlation between problem frequency
and either the test observers' or independent
judges' estimates of problem severity for the
third experiment.) Table 1 shows some of the
key features of the three experiments.
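To make the permutation procedure concrete, the following minimal sketch (written in Python for illustration; it is not part of Virzi's study or the study reported here) applies the same logic to a small hypothetical participant-by-problem detection matrix: participant orders are shuffled repeatedly, and the proportion of distinct problems discovered is averaged at each sample size. The matrix values and the choice of 500 permutations are assumptions made only for the example.

```python
import random

def mean_discovery_curve(detections, n_permutations=500, seed=1):
    """Average cumulative proportion of problems discovered as participants
    are added in random orders (Monte Carlo permutation of participant order)."""
    n_participants = len(detections)
    n_problems = len(detections[0])
    totals = [0.0] * n_participants
    rng = random.Random(seed)
    for _ in range(n_permutations):
        order = list(range(n_participants))
        rng.shuffle(order)
        seen = set()
        for i, participant in enumerate(order):
            # Add every problem this participant was observed to experience.
            seen.update(j for j in range(n_problems) if detections[participant][j])
            totals[i] += len(seen) / n_problems
    return [total / n_permutations for total in totals]

if __name__ == "__main__":
    # Hypothetical 6-participant x 8-problem detection matrix (1 = problem observed).
    detections = [
        [1, 0, 0, 1, 0, 0, 1, 0],
        [0, 1, 0, 1, 0, 0, 0, 0],
        [1, 0, 1, 0, 0, 1, 0, 0],
        [0, 0, 0, 1, 1, 0, 0, 0],
        [1, 1, 0, 0, 0, 0, 0, 1],
        [0, 0, 1, 1, 0, 0, 1, 0],
    ]
    for n, prop in enumerate(mean_discovery_curve(detections), start=1):
        print(f"sample size {n}: {prop:.2f} of problems discovered")
```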
Based on these experiments, Virzi (1992)
made three claims regarding sample size for
usability studies: (1) Observing four or five
participants will allow a practitioner to dis-
cover 80% of a product's usability problems,
(2) observing additional participants will re-
veal fewer and fewer new usability problems,
and (3) more severe usability problems are
easier to detect with the first few partici-
pants. These important findings are in need
of replication. One purpose of this paper is to
report the results of an independent usability
study that clearly support the second claim,
partially support the first, and fail to support
the third. Another purpose is to develop a
mathematical model of problem discovery
based on the binomial probability formula
and examine its extension into economic
simulations that estimate return on investment (ROI) for a usability study as a function of several independent variables.

TABLE 1
Key Features of Virzi's (1992) Three Experiments

Experiment   Sample Size   Number of Tasks   Number of Problems   Average Likelihood of Problem Detection
1            12            3                 13                   0.32
2            20            21                40                   0.36
3            20            7                 17                   0.42
THE OFFICE APPLICATIONS
USABILITY STUDY
Lewis, Henry, and Mack (1990) conducted a
series of usability studies to develop usability
benchmark values for integrated office sys-
tems. The following method and results are
from one of these studies (the only one for
which we kept a careful record of which par-
ticipants experienced which problems). A set
of 11 scenarios served as stimuli in the eval-
uation.
Method
Participants. Fifteen employees of a tempo-
rary help agency participated in the study.
All participants had at least three months' ex-
perience with a computer system but had no
programming training or experience. Five
participants were clerks or secretaries with no
experience in the use of a mouse device, five
were business professionals with no mouse
experience, and five were business profes-
sionals who did report mouse experience.
Apparatus. The office system had a word
processor, a mail application, a calendar ap-
plication, and a spreadsheet on an operating
system that allowed a certain amount of in-
tegration among the applications.
Procedure. A participant began with a brief
tour of the lab, read a description of the pur-
pose of the study, and completed a back-
ground questionnaire. After a short tutorial
on the operating system, the participant be-
gan working on the scenarios. It usually took
a participant about 6 h to complete the sce-
narios. Observers watched participants, one
at a time, by closed-circuit television. In ad-
dition to several performance measures, ob-
servers carefully recorded the problems that
HUMAN FACTORS
participants experienced during the study.
They classified the problems in decreasing
level of impact according to four behavioral
definitions:
1. Scenario failure. The problem caused the participant to fail to complete a scenario by either requiring assistance to recover from the problem or producing an incorrect output (excepting minor typographical errors).
2. Considerable recovery effort. The participant either worked on recovery from the problem for more than a minute or experienced the problem multiple times within a scenario.
3. Minor recovery effort. The participant experienced the problem only once within a scenario and required less than a minute to recover.
4. Inefficiency. The participant worked toward the scenario's goal but deviated from the most efficient path.
Results
Participants experienced 145 different
problems during the usability study. The av-
erage likelihood of problem detection was
0.16. Figure 1 shows the results of applying a
Monte Carlo procedure to calculate the mean
of 500 permutations of participant orders, re-
vealing the general form of the cumulative
problem discovery curve. Figure 1 also shows the predicted cumulative problem discovery curve using the formula 1 - (1 - p)^n (Virzi, 1990, 1992; Wright and Monk, 1991), where p is the probability of detecting a given problem and n is the sample size. The predicted curve shows an excellent fit to the Monte
Carlo curve, Kolmogorov-Smirnov J'3 = 0.73, p = 0.66 (Hollander and Wolfe, 1973). For this study, observing five participants would uncover only 55% of the problems. To uncover 80% of the problems would require 10 participants.
Different participants might experience the same problem but might not experience the same impact. For subsequent data analyses, the impact rating for each problem was the modal impact level across the participants who experienced the problem. (If the distribution was bimodal, then the problem received the more severe classification.) Figure 2 shows the results of applying the same Monte Carlo procedure to problems for each of the four impact levels. The curves overlap considerably, and the Pearson correlation between problem frequency and impact level was not significant, r(143) = 0.06, p = 0.48.

Figure 1. Predicted problem discovery as a function of sample size.
Figure 2. Problem discovery rate as a function of problem impact.

Discussion
These results are completely consistent
with the earlier finding that additional par-
ticipants discover fewer and fewer problems
(Virzi, 1992). If the average likelihood of
problem detection had been in the range of
0.32 to 0.42, then five participants would
have been enough to uncover
80%
of the prob-
lems. However, because the average likeli-
hood of problem detection was considerably
lower in this study than in the three Virzi
studies, usability studies such as this would
need 10 participants to discover 80% of the
problems. This shows that it is important for
usability evaluators to have an idea about the
average likelihood of problem detection for
their types of products and usability studies
before they estimate sample size require-
ments. If a product has poor usability (has a
high average likelihood of problem detec-
tion), it is easy to improve the product (or at
least discover a large percentage of its prob-
lems) with a small sample. However, if a
product has good usability (has a low average
likelihood of problem detection), it will re-
quire a larger sample to discover the remain-
ing problems.
These results showed no significant rela-
tionship between problem frequency and
problem impact. This outcome failed to sup-
port the claim that observers would find se-
vere usability problems faster than they
would less severe problems (Virzi, 1992).
Virzi used the term problem severity, and my colleagues and I (Lewis et al., 1990) described the dimension as problem impact.
Our concep-
tion of problem severity was that it is the
combination of the effects of problem impact
and problem frequency. Because the impact
for a problem in this usability study was as-
signment to behaviorally defined categories,
this impact classification should be indepen-
dent of problem frequency.
In his third experiment, Virzi attempted to
control the confounding caused by having the
observers, who have problem frequency
knowledge, also rate severity (using a three-
point scale). The observers and an indepen-
dent group of usability experts ranked 17 of
the problems along the dimension of disrup-
tiveness to system usability, and the results
indicated significant agreement, W(16) = 0.471, p < 0.001. However, this procedure
(providing one-paragraph problem descrip-
tions to the independent group of experts)
might not have successfully removed the
influence of problem frequency from severity
estimation.
It is unfortunate that Virzi did not report
the magnitude of the correlation between
problem frequency and the severity judg-
ments of the usability experts who did not
have any knowledge of problem frequency.
Given these conflicting results and the logical
independence of problem impact (or severity)
and frequency, human factors engineers and
others who conduct usability evaluations
should take the conservative approach of as-
suming no relationship between problem im-
pact and frequency until future research re-
solves the different outcomes.
PROBLEM DISCOVERY CURVES AND
THE BINOMIAL PROBABILITY FORMULA
Several researchers have suggested that the
formula 1 - (1 - p)^n predicts the rate of
problem discovery in usability studies (Virzi,
1990, 1992; Wright and Monk, 1991). How-
ever, none of these researchers has offered an
explanation of the basis for that formula. In
an earlier paper (Lewis, 1982), I proposed
that the binomial probability theorem could
provide a statistical model for determining
the likelihood of detecting a problem of probability p, r times, in a study with n participants.
The binomial probability formula is P(r) = C(n, r) p^r (1 - p)^(n - r) (Bradley, 1976), where P(r) is the likelihood that an event will occur r times, given a sample size of n and the probability p that the event will occur in the population at large. The conditions under which
the binomial probability formula applies are
random sampling, independent observations,
two mutually exclusive and exhaustive cate-
gories of events, and sample observations
that do not deplete the source. Problem dis-
covery usability studies usually meet these
conditions. Usability practitioners should at-
tempt to sample participants randomly. (Al-
though circumstances rarely allow true
HUMAN FACTORS
random sampling in usability studies,
experimenters do not usually exert any influ-
ence on precisely who participates in the
study, resulting in a quasi-random sampling.)
Observations among participants are inde-
pendent, because the problems experienced
by
one participant cannot have an effect on
those experienced by another participant.
(Note that this model does not require inde-
pendence among the different types of prob-
lems that occur.) The two mutually exclusive
and exhaustive problem detection categories
are (1) the participant encountered the prob-
lem and (2) the participant did not experience
the problem.
Finally, the sampled observations in a usability study do not deplete the source. The probability that a given sample size will produce at least one instance of problem detection is 1 minus the probability of no detections, or 1 - P(0). When r = 0, P(0) = C(n, 0) p^0 (1 - p)^(n - 0), which reduces to P(0) = (1 - p)^n. Thus the cumulative binomial probability for the likelihood that a problem of probability p will occur at least once is 1 - (1 - p)^n.
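As a check on this derivation, the short sketch below (an illustration, not the paper's original program) computes P(0) from the binomial formula, confirms that 1 - P(0) equals 1 - (1 - p)^n, and prints expected proportions for a few representative values of p and n; the values chosen are assumptions for the example.

```python
from math import comb

def binom_pmf(r, n, p):
    """Binomial probability of exactly r detections in n participants."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def p_at_least_once(n, p):
    """Cumulative likelihood that a problem of probability p is seen at least once."""
    return 1 - binom_pmf(0, n, p)   # algebraically equal to 1 - (1 - p)**n

if __name__ == "__main__":
    # Spot-check that the closed form and the binomial term agree.
    assert abs(p_at_least_once(10, 0.25) - (1 - (1 - 0.25)**10)) < 1e-12
    for p in (0.10, 0.25, 0.50):
        for n in (5, 10, 15):
            print(f"p = {p:.2f}, n = {n:2d}: expected proportion discovered = "
                  f"{p_at_least_once(n, p):.2f}")
```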
Problem Discovery Curves for Specific
Problems of Varying Probabilities
As shown in Figure 1, the formula 1 - (1 - p)^n provides a good fit to Monte Carlo estimations where p is the average likelihood of problem detection for a set of problems. Another approach is to select specific problems of varying probabilities of detection and compare Monte Carlo problem discovery curves with curves predicted with the cumulative binomial probability formula. Figure 3 shows the problem discovery likelihoods (each based on 500 Monte Carlo participant-order permutations) for five specific problems, with problem probabilities ranging from 0.14 to 0.74. The figure also shows the predicted cumulative problem discovery curves.
Figure 3. Predicted problem discovery rates as a function of individual problem likelihood.

As seen in Figure 3, 1 - (1 - p)^n provides an excellent fit to the results of Monte Carlo
permutations. For low-probability problems (0.14, 0.20, and 0.40) the Kolmogorov-Smirnov J'3 was 0.54 (p = 0.93), and for high-probability problems (0.57 and 0.74) J'3 was 0.18 (p = 1.00). Figure 3 shows that the bino-
mial and Monte Carlo curves deviated more
when problem probability was low. This is
probably because the curves based on the
Monte Carlo simulations must end at a dis-
covery likelihood of 1.0, but the binomial
curves do not have this constraint. A sample
size of 15 was probably not sufficient to dem-
onstrate perfectly accurate problem discov-
ery curves for low-probability problems us-
ing Monte Carlo permutations.
Note also that the binomial curves for high-
probability problems (0.57 and 0.74) matched
the Monte Carlo curve very closely, but low-
probability problems (0.14, 0.20, and 0.40)
underpredicted the Monte Carlo curve. This
lends support to Virzi's (1992) suggestion that
the tendency of the formula 1 - (1 - p)^n to
overpredict problem discovery for sets of
problems is a Jensen's Inequality artifact.
Jensen's Inequality is a general inequality
satisfied by a convex function:
f(a_1 x_1 + a_2 x_2 + ... + a_n x_n) ≤ a_1 f(x_1) + a_2 f(x_2) + ... + a_n f(x_n),

where x_i is any number in the region where f is convex, and a_i is nonnegative and sums to one (Parker, 1984). Because any a_i can equal 1/n, the formula applies to the arithmetic mean. Applied to the data in this paper, the function of a mean (such as the average of a series of Monte Carlo trials) will be less than or equal to the mean of a function (such as p averaged over a set of problems, then placed into the binomial probability formula).
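A small numerical illustration of this point follows (the problem probabilities below are assumed for the example, not taken from either study). Because (1 - p)^n is convex in p, Jensen's Inequality implies that applying 1 - (1 - p)^n to the average p yields a value at least as large as the average of the individual problem curves, which is why a curve based on a single average p tends to overpredict discovery for a set of problems.

```python
def at_least_once(p, n):
    """Cumulative likelihood of discovering a problem of probability p with n participants."""
    return 1 - (1 - p) ** n

if __name__ == "__main__":
    # Illustrative set of individual problem detection probabilities.
    probs = [0.05, 0.10, 0.20, 0.40, 0.74]
    p_bar = sum(probs) / len(probs)
    for n in (3, 5, 10, 15):
        curve_of_mean = at_least_once(p_bar, n)   # formula applied to the average p
        mean_of_curves = sum(at_least_once(p, n) for p in probs) / len(probs)
        # (1 - p)^n is convex in p, so (1 - p_bar)^n <= mean of (1 - p_i)^n by Jensen;
        # subtracting from 1 makes curve_of_mean >= mean_of_curves.
        print(f"n = {n:2d}: 1-(1-p_bar)^n = {curve_of_mean:.3f}, "
              f"mean over problems = {mean_of_curves:.3f}")
```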
Detecting Problems at Least Twice
If the binomial probability formula is a reasonable model for problem discovery in usability studies, then it should also predict the likelihood that a problem will occur at least twice. (In practice, some usability practitioners use this criterion to avoid reporting problems that might be idiosyncratic to a single participant.) The cumulative binomial probability for P(at least two detections) is 1 - [P(0) + P(1)]. Because P(1) = np(1 - p)^(n - 1), P(at least two detections) = 1 - [(1 - p)^n + np(1 - p)^(n - 1)].
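A minimal sketch of this second criterion follows (illustrative only; the values of p and n printed below are assumptions, not data from the study):

```python
def p_at_least_twice(n, p):
    """Cumulative likelihood that a problem of probability p is seen at least twice
    in a sample of n participants: 1 - [P(0) + P(1)]."""
    p0 = (1 - p) ** n
    p1 = n * p * (1 - p) ** (n - 1)
    return 1 - (p0 + p1)

if __name__ == "__main__":
    for p in (0.10, 0.25, 0.50):
        for n in (5, 10, 15, 20):
            print(f"p = {p:.2f}, n = {n:2d}: "
                  f"P(at least twice) = {p_at_least_twice(n, p):.2f}")
```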
Figure 4 shows the Monte Carlo (based on
500 participant-order permutations) and bi-
nomial problem discovery curves for the like-
lihood that a problem will occur at least
twice (Lewis et al., 1990). A Kolmogorov-Smirnov goodness-of-fit test revealed that the binomial probability formula did not provide an adequate fit to the Monte Carlo data, J'3 = 1.27, p = 0.08. However, the average likelihood of detecting a problem at least twice was quite low in this study (0.03), so it
was possible that a sample size of 15 might
not be adequate to model this problem dis-
covery situation. With data reported by Virzi
(1990), Figure 4 also shows the Monte Carlo
(based on 500 participant-order permuta-
tions) and binomial problem discovery
curves for a usability study in which the av-
erage likelihood of detecting a problem at
least twice was 0.12 and there were 20 par-
ticipants. In that situation, the Kolmogorov-
Smirnov goodness-of-fit test provided strong
support for prediction based on the binomial
probability formula,J'3
=
0.47,p
=
0.98.
Discussion
These data provide support for the hypoth-
esis that the cumulative binomial probability
formula is a reasonable model for problem
discovery in usability studies. To help human
factors engineers select an appropriate sam-
ple size for a usability study, Table 2 shows
the expected proportion of detected problems
(at least once) for various problem detection
probabilities through a sample size of 20. Table 3 shows the minimum sample size re-
quired to detect problems of varying proba-
bilities at least once (or, as shown in
parentheses, at least twice). When the prob-
lem probability is the average across a set of
problems, then the cumulative likelihood
that the problem will occur is also the ex-
pected proportion of discovered problems.
For example, if a practitioner planned to
discover problems from a set with an average
probability of detection of 0.25, was willing
to treat a single detection of a problem seri-
ously, and planned to discover 90% of the
problems, the study would require 8 partici-
pants. If the practitioner planned to see a
problem at least twice before taking it
seriously, the sample size requirement would
be 14. If a practitioner planned to discover
problems at least once with probabilities as
low as 0.01 and with a cumulative likelihood
of discovery of 0.99, the study would require 418 participants (an unrealistic requirement in most settings, implying unrealistic study goals).

TABLE 2
Expected Proportion of Detected Problems (at Least Once) for Various Problem Detection Probabilities and Sample Sizes

Sample Size   p = 0.01   p = 0.05   p = 0.10   p = 0.15   p = 0.25   p = 0.50   p = 0.90
1             0.01       0.05       0.10       0.15       0.25       0.50       0.90
2             0.02       0.10       0.19       0.28       0.44       0.75       0.99
3             0.03       0.14       0.27       0.39       0.58       0.88       1.00
4             0.04       0.19       0.34       0.48       0.68       0.94       1.00
5             0.05       0.23       0.41       0.56       0.76       0.97       1.00
6             0.06       0.26       0.47       0.62       0.82       0.98       1.00
7             0.07       0.30       0.52       0.68       0.87       0.99       1.00
8             0.08       0.34       0.57       0.73       0.90       1.00       1.00
9             0.09       0.37       0.61       0.77       0.92       1.00       1.00
10            0.10       0.40       0.65       0.80       0.94       1.00       1.00
11            0.10       0.43       0.69       0.83       0.96       1.00       1.00
12            0.11       0.46       0.72       0.86       0.97       1.00       1.00
13            0.12       0.49       0.75       0.88       0.98       1.00       1.00
14            0.13       0.51       0.77       0.90       0.98       1.00       1.00
15            0.14       0.54       0.79       0.91       0.99       1.00       1.00
16            0.15       0.56       0.81       0.93       0.99       1.00       1.00
17            0.16       0.58       0.83       0.94       0.99       1.00       1.00
18            0.17       0.60       0.85       0.95       0.99       1.00       1.00
19            0.17       0.62       0.86       0.95       1.00       1.00       1.00
20            0.18       0.64       0.88       0.96       1.00       1.00       1.00

TABLE 3
Sample Size Requirements as a Function of Problem Detection Probability and the Cumulative Likelihood of Detecting the Problem at Least Once (Twice)

Problem Detection   Cumulative Likelihood of Detecting the Problem at Least Once (Twice)
Probability         0.50       0.75        0.85        0.90        0.95        0.99
0.01                68 (166)   136 (266)   186 (332)   225 (382)   289 (462)   418 (615)
0.05                14 (33)    27 (53)     37 (66)     44 (76)     57 (91)     82 (121)
0.10                7 (17)     13 (26)     18 (33)     22 (37)     28 (45)     40 (60)
0.15                5 (11)     9 (17)      12 (22)     14 (25)     18 (29)     26 (39)
0.25                3 (7)      5 (10)      7 (13)      8 (14)      11 (17)     15 (22)
0.50                1 (3)      2 (5)       3 (6)       4 (7)       5 (8)       7 (10)
0.90                1 (2)      1 (2)       1 (3)       1 (3)       2 (3)       2 (4)

Note. These are the minimum sample sizes that result after rounding cumulative likelihoods to two decimal places. Strictly speaking, therefore, the cumulative probability for the 0.50 column is 0.495, that for the 0.75 column is 0.745, and so on. If a practitioner requires greater precision, the method described in the paper will allow the calculation of a revised sample size, which will always be equal to or greater than the sample sizes in this table. The discrepancy will increase as problem probability decreases, cumulative probability increases, and the number of times a problem must be detected increases.
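The logic behind Table 3 can be sketched as follows (a reconstruction for illustration, not the original program): search for the smallest n whose cumulative likelihood, rounded to two decimal places as described in the table note, reaches the target. With p = 0.25 and a 0.90 goal, the function returns 8 for a single detection and 14 for detection at least twice, matching the example above, and it returns 418 for p = 0.01 at a 0.99 goal.

```python
def cumulative_likelihood(n, p, at_least=1):
    """Likelihood that a problem of probability p is detected at least once or twice."""
    p0 = (1 - p) ** n
    if at_least == 1:
        return 1 - p0
    return 1 - (p0 + n * p * (1 - p) ** (n - 1))

def min_sample_size(p, goal, at_least=1, max_n=10000):
    """Smallest n whose cumulative likelihood, rounded to two decimals
    (as in the Table 3 note), meets the goal."""
    for n in range(1, max_n + 1):
        if round(cumulative_likelihood(n, p, at_least), 2) >= goal:
            return n
    raise ValueError("goal not reached within max_n participants")

if __name__ == "__main__":
    print(min_sample_size(0.25, 0.90))              # 8
    print(min_sample_size(0.25, 0.90, at_least=2))  # 14
    print(min_sample_size(0.01, 0.99))              # 418
```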
A RETURN-ON-INVESTMENT MODEL
FOR USABILITY STUDIES
The preceding analyses show that, given an
estimate of the average likelihood of problem
detection, it is possible to generate problem
discovery curves with the cumulative bino-
mial probability distribution. These curves
provide a basis for selecting an appropriate
sample size for usability studies. However, a
more complete analysis should address the
costs associated with running additional par-
ticipants, fixing problems, and failing to dis-
cover problems. Such an analysis should al-
low usability practitioners to specify the
relationship between sample size and return
on investment.
Method
Six variables were manipulated in ROI simulations to determine those variables that exert influence on (1) the sample size at the maximum ROI, (2) the magnitude of the maximum ROI, and (3) the percentage of problems discovered at the maximum ROI. (Table 4 lists the variables and their values.) The equation for the simulations was ROI = Savings/Costs, where Savings is the cost of the discovered problems had they remained undiscovered, minus the cost of fixing the discovered problems, and Costs is the sum of the daily cost to run a study, plus the costs associated with problems that remain undiscovered. Thus a better ROI will have a higher numerical value.
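The original simulations were run in BASIC; the sketch below is a minimal re-expression of the stated ROI equation in Python. The expected number of discovered problems is taken from the cumulative binomial curve, one participant per day is assumed (as in the simulations), and the specific cost figures in the example are assumptions drawn from the ranges listed in Table 4.

```python
def expected_roi(n, p, n_problems, daily_cost, fix_cost, undiscovered_cost):
    """ROI = Savings / Costs for a problem-discovery study with n participants.

    Savings: value of the discovered problems had they remained undiscovered,
             minus the cost of fixing them.
    Costs:   daily cost of running the study (one participant per day assumed)
             plus the cost of the problems that remain undiscovered.
    """
    discovered = n_problems * (1 - (1 - p) ** n)   # expected problems found
    undiscovered = n_problems - discovered
    savings = discovered * undiscovered_cost - discovered * fix_cost
    costs = n * daily_cost + undiscovered * undiscovered_cost
    return savings / costs

if __name__ == "__main__":
    # Illustrative settings drawn from the ranges used in the simulations.
    for n in range(1, 21):
        roi = expected_roi(n, p=0.25, n_problems=150, daily_cost=500,
                           fix_cost=100, undiscovered_cost=1000)
        print(f"{n:2d} participants: ROI = {roi:.2f}")
```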
The simulations included cumulative bino-
mial problem discovery curves for sample
sizes from 1 to 20 for three average likeli-
hoods of problem discovery (0.10, 0.25, and
0.50). For each sample size and average like-
lihood of problem discovery, a BASIC pro-
gram provided the expected number of dis-
covered and undiscovered problems. The
program then crossed the discovered prob-
lem cost of $100 with undiscovered problem
costs of $200, $500, and $1000 (low set), and
the discovered problem cost of $1000 with
undiscovered problem costs of $2000, $5000,
and $10,000 (high set) to calculate ROIs.
TABLE 4
Main Effects for the ROI Simulations

Independent Variable                          Value    Sample Size at   Magnitude of   Percentage of Problems
                                                       Maximum ROI      Maximum ROI    Discovered at Maximum ROI
Average likelihood of problem discovery       0.10     19.0             3.1            86
                                              0.25     14.6             22.7           97
                                              0.50     7.7              52.9           99
                                              Range:   11.3             49.8           13
Number of problems available for discovery    30       11.5             7.0            91
                                              150      14.4             26.0           95
                                              300      15.4             45.6           95
                                              Range:   3.9              38.6           4
Daily cost to run study                       500      14.3             33.4           94
                                              1000     13.2             19.0           93
                                              Range:   1.1              14.4           1
Cost to fix a discovered problem              100      11.9             7.0            92
                                              1000     15.6             45.4           96
                                              Range:   3.7              38.4           4
Cost of an undiscovered problem (low set)     200      10.2             1.9            89
                                              500      12.0             6.4            93
                                              1000     13.5             12.6           94
                                              Range:   3.3              10.7           5
Cost of an undiscovered problem (high set)    2000     14.7             12.3           95
                                              5000     15.7             41.7           96
                                              10000    16.4             82.3           96
                                              Range:   1.7              70.0           1

For the simulations, the sample size variable of 20 participants covered a reasonable range and should result in the discovery of a
large proportion of the problems that are
available for discovery in many usability
studies (Virzi, 1992). The values for the num-
ber of problems available for discovery are
consistent with those reported in the litera-
ture (Lewis et al., 1990; Virzi, 1990, 1992), as
are the values for the average likelihood of
problem discovery. Assuming one participant
per day, the values for the daily cost to run a
study are consistent with current laboratory,
observer, and participant costs. The ratio of
the costs to fix a discovered problem to the
costs of an undiscovered problem are congru-
ent with software engineering indexes re-
ported by Boehm (1981).
Results
Table 4 shows the results of the main ef-
fects of the independent variables in the sim-
ulations on the dependent variables of (1) the sample size at which the maximum ROI occurred, (2) the magnitude of the maximum ROI, and (3) the percentage of problems dis-
covered at the maximum ROI. The table
shows the average value of each dependent
variable for each level of all the independent
variables, and the range of the average values
for each independent variable. Across all the
variables, the average percentage of discov-
ered problems at the maximum ROI was 94%.
Discussion
All of the independent variables influenced the sample size at the maximum ROI, but the variable with the broadest influence (as indicated by the range) was the average likelihood of problem discovery (p). It also had the
strongest influence on the percentage of prob-
lems discovered at the maximum ROI. There-
fore, it is very important for usability practi-
tioners to estimate the magnitude of this
variable for their studies because it largely
determines the appropriate sample size. If
the expected value of p is small (for example, 0.10), practitioners should plan to discover about 86% of the problems. If the expected value of p is larger (for example, 0.25 or 0.50), practitioners should plan to discover about 98% of the problems. If the value of p is between 0.10 and 0.25, practitioners should interpolate in Table 4 to determine an appropriate goal for the percentage of problems to discover.
Contrary to expectation, the cost of an un-
discovered problem had a minor effect on
sample size at maximum ROI, but, like all the
other independent variables, it had a strong
effect on the magnitude of the maximum
ROI. Usability practitioners should be aware
of these costs and their effect on ROI, but
these costs have relatively little effect on the
appropriate sample size for a usability study.
The definitions of the cost variables for the
ROI simulations are purposely vague. Each
practitioner needs to consider the potential
elements of cost for a specific work setting.
For example, the cost of an undiscovered
problem in one setting might consist primar-
ily of the cost to send personnel to user loca-
tions to repair the problem. In another set-
ting the primary cost of an undiscovered
problem might be the loss of future sales re-
sulting from customer dissatisfaction.
GENERAL DISCUSSION
The law of diminishing returns, based on
the cumulative binomial probability for-
mula, applies to problem discovery usability
studies. To use this formula to determine an
appropriate sample size, practitioners must form an idea about the expected value of p (the average likelihood of problem detection) for the study and the percentage of problems that the study should uncover. Practitioners can use the data in Table 4 or their own ROI formulas to estimate an appropriate goal for the percentage of problems to discover and can examine data from their own or published usability studies to estimate p. (The data from this office applications study have shown that p can be as low as 0.16.) With these two estimates, Table 3 (or, more generally, the cumulative binomial probability distribution) can provide the appropriate sample size for the usability study.
Practitioners who wait to see a problem at
least twice before giving it serious consider-
ation can see in Table 3 the sample size im-
plications of this strategy. Certainly, all other
things being equal, it is more important to
correct a problem that occurs frequently than
one that occurs infrequently. However, it is
unrealistic to assume that the frequency of
detection of a problem is the only criterion to
consider in the analysis of usability prob-
lems. The best strategy is to consider problem
frequency and impact simultaneously to de-
termine which problems are most important
to correct rather than establishing a cutoff
rule such as "fix every problem that appears
two or more times."
The results of the present office applica-
tions usability study raise a serious question
about the relationship between problem fre-
quency and impact (or severity). In this
study, problem discovery rates were the same
regardless of the problem impact rating.
Clearly, the conservative approach for prac-
titioners is to assume independence of fre-
quency and impact until future research re-
solves the discrepancy in findings between
this office applications study and the studies
reported by Virzi (1992).
It is important for practitioners to consider
the risks as well as the gains when they use
small samples in usability studies. Although
the diminishing returns for inclusion of addi-
tional participants strongly suggest that the
most efficient approach is to run a small sam-
ple (especially if the expected p is high, if the
study will be iterative, and if undiscovered
problems will not have dangerous or expen-
sive outcomes), human factors engineers and
other usability practitioners must not be-
come complacent regarding the risk of fail-
ing to detect low-frequency but important
problems.
The goal of this paper was to address con-
siderations for the selection of a sample size
of participants for problem discovery usabil-
ity studies. However, this is only one element
among several that usability practitioners
must consider. Another important topic is the
selection and construction of the tasks and
scenarios that participants encounter in a
study. Certainly what an evaluator asks par-
ticipants to do influences the likelihood of
problem discovery. If the likelihood of discov-
ery of a specific problem on a single perfor-
mance of a task is low, the likelihood of dis-
covery will increase if participants have
multiple opportunities to perform the task (or
variations on the task). Repeating tasks also
allows an evaluator to determine if particular
problems that occur early in a participant's
experience with a system diminish or persist
with practice.
Conversely, repeating tasks increases study
time. The decision about whether to have
multiple trials depends on the purpose of the
study. The concern about what tasks to ask participants to do is similar to the problem of assessing content validity in psychometrics
(Nunnally, 1978). This topic (adequate task
coverage in usability studies) deserves more
detailed treatment.
ACKNOWLEDGMENTS
I thank my colleagues in the IBM Human Factors group and the Human Factors reviewers for their helpful comments concerning this work. In particular, I thank Robert
J. Wherry, Jr., who, in reviewing the first draft of this
paper, wrote his own BASIC programs to recreate my ta-
bles. In doing so, he uncovered an error in my ROI simu-
lation program and prevented me from publishing inaccu-
rate data. This was truly reviewing beyond the call of duty,
and I greatly appreciate it.
REFERENCES

Boehm, B. W. (1981). Software engineering economics. Englewood Cliffs, NJ: Prentice-Hall.
Bradley, J. V. (1976). Probability; decision; statistics. Englewood Cliffs, NJ: Prentice-Hall.
Gould, J. D. (1988). How to design usable systems. In M. Helander (Ed.), Handbook of human-computer interaction (pp. 757-789). New York: North-Holland.
Grice, R. A., and Ridgway, L. S. (1989). A discussion of modes and motives for usability evaluation. IEEE Transactions on Professional Communications, 32, 230-237.
Hollander, M., and Wolfe, D. A. (1973). Nonparametric statistical methods. New York: Wiley.
Karat, C. M., Campbell, R., and Fiegel, T. (1992). Comparison of empirical testing and walkthrough methods in user interface evaluation. In Human factors in computing systems: CHI '92 conference proceedings (pp. 397-404). New York: Association for Computing Machinery.
Keeler, M. A., and Denning, S. M. (1991). The challenge of interface design for communication theory: From interaction metaphor to contexts of discovery. Interacting with Computers, 3, 283-301.
Kraemer, H. C., and Thiemann, S. (1987). How many subjects? Statistical power analysis in research. Newbury Park, CA: Sage.
Lewis, J. R. (1982). Testing small system customer set-up. In Proceedings of the Human Factors Society 26th Annual Meeting (pp. 718-720). Santa Monica, CA: Human Factors and Ergonomics Society.
Lewis, J. R., Henry, S. C., and Mack, R. L. (1990). Integrated office software benchmarks: A case study. In Human-Computer Interaction-INTERACT '90 (pp. 337-343). London: Elsevier.
Lewis, C., and Norman, D. A. (1986). Designing for error. In D. A. Norman and S. W. Draper (Eds.), User-centered system design: New perspectives on human-computer interaction (pp. 411-432). Hillsdale, NJ: Erlbaum.
Norman, D. A. (1983). Design rules based on analyses of human error. Communications of the ACM, 4, 254-258.
Nunnally, J. C. (1978). Psychometric theory. New York: McGraw-Hill.
Parker, S. P. (1984). Dictionary of scientific and technical terms. New York: McGraw-Hill.
Virzi, R. A. (1990). Streamlining the design process: Running fewer subjects. In Proceedings of the Human Factors Society 34th Annual Meeting (pp. 291-294). Santa Monica, CA: Human Factors and Ergonomics Society.
Virzi, R. A. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human Factors, 34, 457-468.
Whitefield, A., and Sutcliffe, A. (1992). Case study in human factors evaluation. Information and Software Technology, 34, 443-451.
Wright, P. C., and Monk, A. F. (1991). A cost-effective evaluation method for use by designers. International Journal of Man-Machine Studies, 35, 891-912.

Date received: November 2, 1992
Date accepted: October 7, 1993