_____________________
Technical Report
_____________________________________________________________________________
IBM
Sample Sizes for Usability Studies:
Additional Considerations
James R. Lewis
October 25, 1992
Technical Report 54.711
Unclassified
Boca Raton, Florida
_____________________________________________________________________________
Sample Sizes for Usability Studies: Additional Considerations
James R. Lewis
Design Center/Human Factors
Boca Raton, FL
TR54.711 (Revised March 3, 1993)
ABSTRACT
Recently, Virzi (1992) presented data that support three claims regarding sample sizes for
usability studies. The claims were (1) observing four or five participants will allow a
usability practitioner to discover 80% of a product's usability problems, (2) observing
additional participants will reveal fewer and fewer new usability problems, and (3) more
severe usability problems are easier to detect with the first few participants. Results from
an independent usability study clearly support the second claim, partially support the
first, but fail to support the third. Problem discovery shows diminishing returns as a
function of sample size. Observing four to five participants will uncover about 80% of a
product's usability problems, as long as the average likelihood of problem detection
ranges between .32 and .42, as in Virzi (1992). If the average likelihood of problem
detection is lower, then a practitioner will need to observe more than five participants to
discover 80% of the problems. Using behavioral categories for problem severity (or
impact), these data showed no correlation between problem severity (impact) and rate of
discovery. The data provided evidence that the binomial probability formula may
provide a good model for predicting problem-discovery curves, given an estimate of the
average likelihood of problem detection. Finally, data from economic simulations that
estimated return-on-investment (ROI) under a variety of settings showed that only the
average likelihood of problem detection strongly influenced the range of sample sizes for
maximum ROI. Other variables, such as the number of problems available for discovery,
costs to run a study, costs of fixing problems and costs of failing to fix problems, had a
much smaller influence on the sample size that maximized ROI.
Copyright IBM Corp. 1993. All rights reserved.
Table of Contents
List of Tables.................................................................................................................iv
Introduction.....................................................................................................................1
The Office Applications Usability Study.........................................................................2
Method........................................................................................................................2
Results.........................................................................................................................3
Discussion ...................................................................................................................5
Problem-Discovery Curves and the Binomial Probability Formula..................................6
Problem-Discovery Curves for Specific Problems of Varying Probabilities..................6
Detecting Problems at Least Twice ..............................................................................8
Discussion ...................................................................................................................9
A Return-on-Investment Model for Usability Studies.................................................... 11
Method......................................................................................................................11
Results.......................................................................................................................13
Discussion .................................................................................................................13
General Discussion........................................................................................................15
References ....................................................................................................................17
List of Tables
1 Key Features of Virzi's (1992) Three Experiments . . . . . . . . . . . . . . . . . . 2
2 Goodness-of-Fit Tests for Binomial and Monte Carlo Problem-Discovery
Curves for Problems of Varying Probabilities of Detection . . . . . . . . . . . 9
3 Expected Proportion of Detected Problems (at Least Once) for Various
Problem Detection Probabilities and Sample Sizes . . . . . . . . . . . . . . . . . 10
4 Sample Size Requirements as a Function of Problem Probability and
the Cumulative Likelihood that the Problem Will Occur at Least Once
(Twice) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
5 Variables and Their Values in Return-on-Investment Simulations . . . . . . . 12
6 Main Effects for the ROI Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
List of Figures
1 Predicted Problem Discovery as a Function of Sample Size . . . . . . . . . . . . 4
2 Problem-Discovery Rate as a Function of Problem Impact . . . . . . . . . . . . 4
3 Predicted Problem-Discovery Rates as a Function of
Individual Problem Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4 Predicted Problem-Discovery Rates for Detecting
Problems at Least Twice (Lewis, Henry, and Mack, 1990) . . . . . . . . . . . . 9
5 Predicted Problem-Discovery Rates for Detecting
Problems at Least Twice (Virzi, 1990) . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Introduction
The goal of many usability studies is to identify design problems and recommend
product changes (to either the current product or future products) based on the design
problems (Gould, 1988; Grice and Ridgway, 1989; Karat, Campbell, and Fiegel, 1992;
Whitefield and Sutcliffe, 1992; Wright and Monk, 1991). During a usability study, an
observer watches representative participants perform representative tasks to understand
when and how they have problems using a product. The problems provide clues about
how to redesign the product to either eliminate the problem or provide easy recovery
from it (Lewis and Norman, 1986; Norman, 1983).
Human factors engineers who conduct industrial usability evaluations need to
understand their sample size requirements. If they collect a larger sample than necessary,
they might increase product cost and development time. If they collect too small a
sample, they might fail to detect problems that, uncorrected, would reduce the usability of
the product. Discussing usability testing, Keeler and Denning (1991) showed a common
negative attitude toward small-sample usability studies when they stated, "actual
[usability] test procedures cut corners in a manner that would be unacceptable to true
empirical investigations. Test groups are small (between 6 and 20 subjects per test)" (p.
290). Yet, in any setting, not just an industrial one, the appropriate sample size
accomplishes the goals of the study as efficiently as possible (Kraemer and Thiemann,
1987).
Virzi (1992) investigated sample size requirements for usability evaluations. He
reported three experiments in which he measured the rate at which trained usability
experts identified problems as a function of the number of naive participants they had
observed. For each experiment, he ran a Monte Carlo simulation to permute participant
orders 500 times, and measured the cumulative percentage of problems discovered for
each sample size. In the second experiment, the observers provided ratings of problem
severity (using a seven-point scale). In addition to having the observers provide problem-
severity ratings (this time using a three-point scale), an independent set of usability
experts provided estimates of problem severity based on brief, one-paragraph
descriptions of the problems discovered in the third experiment. This helped control the
effect of knowledge of problem frequency on the estimation of problem severity. The
correlation between problem frequency and test observers' judgment of severity in the
second experiment was .463 (p<.01). In the third experiment, the agreement among the
test observers and the independent set of judges was significant (W(16)=.471, p<.001) for
the rank order of 17 problems in terms of how disruptive they were likely to be to the
usability of the system. (Virzi did not report the magnitude of the correlation between
problem frequency and either the test observers' or independent judges' estimates of
problem severity for the third experiment.) Table 1 shows some of the key features of the
three experiments.
________________________________________________________________________
Table 1. Key Features of Virzi's (1992) Three Experiments
________________________________________________________________________
               Sample      Number        Number of      Average Likelihood of
Experiment     Size        of Tasks      Problems       Problem Detection
________________________________________________________________________
    1            12            3             13                 .32
    2            20           21             40                 .36
    3            20            7             17                 .42
________________________________________________________________________
Based on these experiments, Virzi (1992) made three claims regarding sample
size for usability studies. The claims were (1) observing four or five participants will
allow a practitioner to discover 80% of a product's usability problems, (2) observing
additional participants will reveal fewer and fewer new usability problems, and (3) more
severe usability problems are easier to detect with the first few participants. These
important findings are in need of replication. One purpose of this paper is to report the
results of an independent usability study that clearly support the second claim, partially
support the first, and fail to support the third. Another purpose is to develop a
mathematical model of problem discovery based on the binomial probability formula, and
examine its extension into economic simulations that estimate return-on-investment
(ROI) for a usability study as a function of several independent variables.
The Office Applications Usability Study
Lewis, Henry, and Mack (1990) conducted a series of usability studies to develop
usability benchmark values for integrated office systems. The following method and
results are from one of these studies (the only one for which we kept a careful record of
which participants experienced which problems). A set of 11 scenarios served as stimuli
in the evaluation.
Method
Participants. Fifteen employees of a temporary-help agency participated in the
study. All participants had at least three months' experience with a computer system, but
had no programming training or experience. Five participants were clerks or secretaries
with no experience in the use of a mouse device, five were business professionals with no
mouse experience, and five were business professionals who did report mouse
experience.
Apparatus. The office system had a word processor, a mail application, a
calendar application, and a spreadsheet on an operating system that allowed a certain
amount of integration among the applications.
Procedure. A participant began with a brief tour of the lab, read a description of
the purpose of the study, and completed a background questionnaire. After a short
tutorial on the operating system, the participant began working on the scenarios. It
usually took a participant about six hours to complete the scenarios. Observers watched
participants, one at a time, by closed-circuit television. In addition to several
performance measures, we carefully recorded the problems that participants experienced
during the study. We classified the problems in decreasing level of impact according to
four behavioral definitions:
1. Scenario failure. The problem caused the participant to fail to complete a scenario, by either requiring
assistance to recover from the problem or producing an incorrect output (excepting minor
typographical errors).
2. Considerable recovery effort. The participant either worked on recovery from the problem for more
than a minute or experienced the problem multiple times within a scenario.
3. Minor recovery effort. The participant experienced the problem only once within a scenario and
required less than a minute to recover.
4. Inefficiency. The participant worked toward the scenario's goal, but deviated from the most efficient
path.
Results
Participants experienced 145 different problems during the usability study. The
average likelihood of problem detection was .16. Figure 1 shows the results of applying
a Monte Carlo procedure to calculate the mean of 500 permutations of participant orders,
revealing the general form of the cumulative problem-discovery curve. Figure 1 also
shows the predicted cumulative problem-discovery curve using the formula 1 - (1 - p)^n
(Virzi, 1990; Virzi, 1992; Wright and Monk, 1991), where p is the probability of
detecting a given problem, and n is the sample size. The predicted curve shows an
excellent fit to the Monte Carlo curve (Kolmogorov-Smirnov J'3=.73, p=.66) (Hollander
and Wolfe, 1973). For this study, observing five participants would uncover only 55% of
the problems. To uncover 80% of the problems would require 10 participants.
________________________________________________________________________
[Line graph: Percentage of Problems Discovered (0-100) as a function of Sample Size (1-15),
comparing the Binomial prediction with the Monte Carlo curve.]
Figure 1. Predicted Problem Discovery as a Function of Sample Size
________________________________________________________________________
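The Monte Carlo procedure and the binomial prediction plotted in Figure 1 can be sketched in a
few lines of code. The following Python sketch is illustrative only: the detection matrix is
randomly generated with a uniform problem probability of .16 rather than taken from the study's
actual 15 x 145 participant-by-problem record, and the function names are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical participant-by-problem detection matrix (1 = participant experienced the problem).
    # The study's real record was 15 participants by 145 problems with p averaging about .16.
    n_participants, n_problems, p_avg = 15, 145, 0.16
    detections = (rng.random((n_participants, n_problems)) < p_avg).astype(int)

    def monte_carlo_curve(detections, permutations=500, rng=rng):
        """Mean cumulative proportion of problems discovered at each sample size,
        averaged over random permutations of participant order."""
        n, _ = detections.shape
        curves = np.zeros((permutations, n))
        for i in range(permutations):
            order = rng.permutation(n)
            seen = np.cumsum(detections[order], axis=0) > 0  # problem found by the first j participants?
            curves[i] = seen.mean(axis=1)                    # proportion of problems discovered so far
        return curves.mean(axis=0)

    def binomial_curve(p, n_max):
        """Predicted proportion discovered: 1 - (1 - p)^n for n = 1..n_max."""
        return 1 - (1 - p) ** np.arange(1, n_max + 1)

    print(np.round(monte_carlo_curve(detections)[[4, 9]], 2))  # discovery at n = 5 and n = 10
    print(np.round(binomial_curve(p_avg, 15)[[4, 9]], 2))      # about .58 and .83 for p = .16

With the study's actual record, the Monte Carlo curve reached 55% at five participants and 80% at
ten, slightly below the curve generated from the average p (a point taken up again in the
discussion of Jensen's Inequality below).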
Different participants might experience the same problem but might not
experience the same impact. For subsequent data analyses, the impact rating for each
problem was the modal impact level across the participants who experienced the
problem. (If the distribution was bimodal, then the problem received the more severe
classification.) Figure 2 shows the results of applying the same Monte Carlo procedure
to problems for each of the four impact levels. The curves overlap considerably, and the
Pearson correlation between problem frequency and impact level was not significant
(r(143)=.06, p=.48).
________________________________________________________________________
[Line graph: Percentage of Problems Discovered (0-100) as a function of Sample Size (1-15),
with separate curves for Impact 1, Impact 2, Impact 3, and Impact 4.]
Figure 2. Problem-Discovery Rate as a Function of Problem Impact
________________________________________________________________________
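The modal-impact rule described above reduces to a few lines of code. The following Python sketch
is illustrative only (the function name and example ratings are hypothetical); it assigns each
problem the modal impact level across the participants who experienced it and breaks bimodal ties
toward the more severe, lower-numbered level.

    from collections import Counter

    def modal_impact(impact_ratings):
        """Modal impact level across participants who experienced the problem.
        Levels run 1 (scenario failure, most severe) to 4 (inefficiency); ties
        go to the more severe (lower-numbered) level, as in the bimodal rule."""
        counts = Counter(impact_ratings)
        top = max(counts.values())
        return min(level for level, count in counts.items() if count == top)

    # Hypothetical examples: three participants hit the problem at level 3, two at level 2.
    print(modal_impact([3, 3, 2, 2, 3]))  # -> 3
    print(modal_impact([1, 4, 4, 1]))     # bimodal -> 1 (more severe)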
Discussion
These results are completely consistent with the earlier finding that additional
participants discover fewer and fewer problems (Virzi, 1992). If the average likelihood
of problem detection had been in the range of .32 to .42, then five participants would
have been enough to uncover 80% of the problems. However, because the average
likelihood of problem detection was considerably lower in this study than in the three
Virzi studies, usability studies such as this would need 10 participants to discover 80% of
the problems. This shows that it is very important for usability evaluators to have an idea
about the average likelihood of problem detection for their types of products and types of
usability studies before they estimate sample size requirements. If a product has poor
usability (has a high average likelihood of problem detection), it is easy to improve the
product (or at least discover a large percentage of its problems) with a small sample.
However, if a product has good usability (has a low average likelihood of problem
detection), it will require a larger sample to discover the remaining problems.
These results showed no significant relationship between problem frequency and
problem impact. This outcome failed to support the claim that observers would find
severe usability problems faster than less severe problems (Virzi, 1992). Virzi used the
term "problem severity," where we (Lewis, Henry, and Mack, 1990) described the
dimension as "problem impact." Our conception of "problem severity" is that it is the
combination of the effects of problem impact and problem frequency. Because the
impact for a problem in this usability study was assigned using behaviorally defined
categories, the impact classification should be independent of problem frequency. In his
third experiment, Virzi attempted to control the confounding caused by having the
observers, who have problem-frequency knowledge, also rate severity (using a 3-point
scale). The observers and an independent group of usability experts ranked 17 of the
problems along the dimension of disruptiveness to system usability, and the results
indicated significant agreement (W(16)=.471, p<.001). However, this procedure
(providing one-paragraph problem descriptions to the independent group of experts)
might not have successfully removed the influence of problem frequency from severity
estimation. It is unfortunate that Virzi did not report the magnitude of the correlation
between problem frequency and the severity judgments of the usability experts who did
not have any knowledge of problem frequency. Given these conflicting results and the
logical independence of problem impact (or severity) and frequency, human factors
engineers and others who conduct usability evaluations should take the conservative
approach of assuming no relationship between problem impact and frequency until future
research resolves the different outcomes.
Problem-Discovery Curves and the Binomial Probability Formula
Several researchers have suggested that the formula 1 - (1 - p)^n predicts the rate of
problem discovery in usability studies (Virzi, 1990; Virzi, 1992; Wright and Monk,
1991). However, none of these researchers has offered an explanation of the basis for
that formula. In an earlier paper (Lewis, 1982), I proposed that the binomial probability
theorem could provide a statistical model for determining the likelihood of detecting a
problem of probability p exactly r times in a study with n participants.
The binomial probability formula is P(r) = C(n, r) p^r (1 - p)^(n-r) (Bradley, 1976), where
C(n, r) is the binomial coefficient (the number of ways to choose r cases from a sample of n),
P(r) is the likelihood that an event will occur r times, given a sample size of n and the
probability p that the event will occur in the population-at-large. The conditions under
which the binomial probability formula applies are random sampling, independent
observations, two mutually exclusive and exhaustive categories of events, and sample
observations that do not deplete the source. Problem-discovery usability studies usually
meet these conditions. Usability practitioners should attempt to sample participants
randomly. (Although circumstances rarely allow true random sampling in usability
studies, experimenters do not usually exert any influence on precisely who participates in
the study, resulting in a quasi-random sampling.) Observations among participants are
independent, because the problems experienced by one participant cannot have an effect
on those experienced by another participant. (Note that this model does not require
independence among the different types of problems that occur.) The two mutually
exclusive and exhaustive problem-detection categories are (1) the participant encountered
the problem, and (2) the participant did not experience the problem. Finally, the sampled
observations in a usability study do not deplete the source. The probability that a given
sample size will produce at least one instance of problem detection is 1 minus the
probability of no detections, or 1 - P(0). When r = 0, P(0) = C(n, 0) p^0 (1 - p)^(n-0), which
reduces to P(0) = (1 - p)^n. Thus, the cumulative binomial probability for the likelihood that a
problem of probability p will occur at least once is 1 - (1 - p)^n.
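A minimal Python sketch of these formulas (illustrative only; the function names are hypothetical)
shows the relation between the general binomial term and its cumulative "at least once" form:

    from math import comb

    def p_exactly(r, n, p):
        """Binomial probability of exactly r detections among n participants:
        C(n, r) * p^r * (1 - p)^(n - r)."""
        return comb(n, r) * p**r * (1 - p)**(n - r)

    def p_at_least_once(n, p):
        """Cumulative likelihood of at least one detection: 1 - P(0) = 1 - (1 - p)^n."""
        return 1 - (1 - p)**n

    # With p = .25 and n = 5, about 76% of such problems should appear at least once (cf. Table 3).
    print(round(p_at_least_once(5, 0.25), 2))    # 0.76
    print(round(1 - p_exactly(0, 5, 0.25), 2))   # same value, computed from the general formula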
Problem-Discovery Curves for Specific Problems of Varying Probabilities
As shown in Figure 1, the formula 1 - (1 - p)^n provided a good fit to the data when p
is the average likelihood of problem detection for a set of problems. Another approach is
to select specific problems of varying probabilities of detection and compare Monte Carlo
problem-discovery curves with curves predicted with the cumulative binomial probability
formula. Figure 3 shows the problem-discovery likelihoods (each based on 500 Monte
Carlo participant-order permutations) for five specific problems, with problem
probabilities ranging from .14 to .74. The figure also shows the predicted cumulative
problem-discovery curves. Table 2 contains the results of Kolmogorov-Smirnov
goodness-of-fit tests for the Monte Carlo and binomial curves for each problem.
________________________________________________________________________
[Line graph: Discovery Likelihood (0-1.0) as a function of Sample Size (1-15), with paired
Binomial and Monte Carlo curves for problem probabilities .14, .20, .40, .57, and .74.]
Figure 3. Predicted Problem-Discovery Rates as a Function of Individual Problem Likelihood
________________________________________________________________________
________________________________________________________________________
Table 2. Goodness-of-Fit Tests for Binomial and Monte Carlo Problem-Discovery Curves for
Problems of Varying Probabilities of Detection
________________________________________________________________________
              Probability of      Kolmogorov-
Problem       Detection           Smirnov J'3         p
________________________________________________________________________
   1               .14                0.54           .93
   2               .20                0.54           .93
   3               .40                0.54           .93
   4               .57                0.18          1.00
   5               .74                0.18          1.00
________________________________________________________________________
As Figure 3 and Table 2 show, 1 - (1 - p)^n provided an excellent fit to the results of
Monte Carlo permutations. Figure 3 shows that the binomial and Monte Carlo curves
deviated more when problem probability was low. This is probably because the curves
based on the Monte Carlo simulations must end at a discovery likelihood of 1.0, but the
binomial curves do not have this constraint. A sample size of 15 was probably not
sufficient to demonstrate accurate problem-discovery curves for low-probability
problems using Monte Carlo permutations. Note also that the binomial curves for
individual problems either matched the Monte Carlo curve very closely (problems 4 and
5), or underpredicted the Monte Carlo curve (problems 1, 2, and 3). This lends support to
Virzi's (1992) suggestion that the tendency of the formula 1 - (1 - p)^n to overpredict
problem discovery for sets of problems is a Jensen's Inequality artifact.
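The direction of this artifact is easy to demonstrate numerically. In the following Python sketch
(the individual problem probabilities are hypothetical, chosen only so that their mean is .16),
the curve generated from the average p never falls below the average of the individual problem
curves, because 1 - (1 - p)^n is concave in p.

    import numpy as np

    # Hypothetical spread of individual problem probabilities with mean .16
    p_individual = np.array([0.02, 0.05, 0.10, 0.14, 0.20, 0.45])
    p_bar = p_individual.mean()
    n = np.arange(1, 16)

    curve_from_average = 1 - (1 - p_bar) ** n                                # single curve at the mean p
    average_of_curves = (1 - (1 - p_individual[:, None]) ** n).mean(axis=0)  # mean of per-problem curves

    # Jensen's Inequality: because 1 - (1 - p)^n is concave in p, the curve built from the
    # average p lies at or above the average of the individual curves.
    print(np.all(curve_from_average >= average_of_curves - 1e-12))           # True
    print(round(curve_from_average[4], 2), round(average_of_curves[4], 2))   # at n = 5: 0.58 versus 0.48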
Detecting Problems at Least Twice
If the binomial probability formula is a reasonable model for problem discovery
in usability studies, then it should also predict the likelihood that a problem will occur at
least twice. (In practice, some usability practitioners use this criterion to avoid reporting
problems that might be idiosyncratic to a single participant.) The cumulative binomial
probability for P(at least two detections) is 1 - (P(0) + P(1)). Because P(1) = np(1 - p)^(n-1),
P(at least two detections) = 1 - ((1 - p)^n + np(1 - p)^(n-1)). Figure 4 shows the Monte Carlo (based on
500 participant-order permutations) and binomial problem-discovery curves for the
likelihood that a problem will occur at least twice (Lewis, Henry, and Mack, 1990). A
Kolmogorov-Smirnov goodness-of-fit test did not lead to rejection of the hypothesis that
the binomial probability formula provided an adequate fit to the Monte Carlo data
(J'3=1.27, p=.08), but did not provide very strong support for the null hypothesis of good
fit. The average likelihood of detecting a problem at least twice was quite low in this
study (.03), so it was possible that a sample size of 15 might not be adequate to model
this problem-discovery situation. With data provided by Virzi (1990), Figure 5 shows the
Monte Carlo (based on 500 participant-order permutations) and binomial problem-
discovery curves for a usability study in which the average likelihood of detecting a
problem at least twice was .12 and there were 20 participants. In that situation, the
Kolmogorov-Smirnov goodness-of-fit test provided much stronger support for prediction
based on the binomial probability formula (J'3=.47, p=.98).
________________________________________________________________________
[Line graph: Percentage of Problems Discovered (0-100) as a function of Sample Size (1-15),
comparing the Binomial prediction with the Monte Carlo curve.]
Figure 4. Predicted Problem-Discovery Rates for Detecting Problems at Least Twice (Lewis,
Henry, and Mack, 1990)
________________________________________________________________________
________________________________________________________________________
[Line graph: Percentage of Problems Discovered (0-100) as a function of Sample Size (1-20),
comparing the Binomial prediction with the Monte Carlo curve.]
Figure 5. Predicted Problem-Discovery Rates for Detecting Problems at Least Twice
(Virzi, 1990)
________________________________________________________________________
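The "at least twice" criterion shown in Figures 4 and 5 follows directly from the formula above.
A minimal Python sketch (illustrative only; the function name is hypothetical):

    def p_at_least_twice(n, p):
        """Cumulative likelihood of at least two detections:
        1 - (P(0) + P(1)) = 1 - ((1 - p)^n + n*p*(1 - p)^(n - 1))."""
        return 1 - ((1 - p)**n + n * p * (1 - p)**(n - 1))

    # For p = .25, roughly 63% of such problems should appear at least twice with 8 participants,
    # and about 90% with 14 (consistent with the "at least twice" column of Table 4 below).
    for n in (8, 14):
        print(n, round(p_at_least_twice(n, 0.25), 2))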
Discussion
These data provide support to the hypothesis that the cumulative binomial
probability formula is a reasonable model for problem discovery in usability studies. To
help human factors engineers select an appropriate sample size for a usability study,
Table 3 shows the expected proportion of detected problems (at least once) for various
problem detection probabilities through a sample size of 20, and Table 4 shows the
minimum sample size required to detect problems of varying probabilities at least once
(or, as shown in parentheses, at least twice). (See Lewis, 1990, for more extensive tables,
similar to Table 3, for sample sizes from 1 to 100 and discovery likelihoods for
discovering problems one, two, or three times.) When the problem probability is the
average across a set of problems, then the cumulative likelihood that the problem will
occur is also the expected proportion of discovered problems. For example, if a
practitioner planned to discover problems from a set with an average probability of
detection of .25, was willing to treat a single detection of a problem seriously, and
planned to discover 90% of the problems, the study would require eight participants. If
the practitioner planned to see a problem at least twice before taking it seriously, the
sample size requirement would be 14. If a practitioner planned to discover problems at
least once with probabilities as low as .01 and with a cumulative likelihood of discovery
of .99, the study would require 418 participants (an unrealistic requirement in most
settings, implying unrealistic study goals).
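The worked example above can also be computed directly rather than read from the tables. The
following Python sketch is illustrative only (the function names are hypothetical); it follows the
note to Table 4 and rounds each cumulative likelihood to two decimal places before comparing it
with the target, which is how the tables arrive at values such as 418 for p = .01 at a criterion
of .99.

    def discovered_at_least_once(n, p):
        """Cumulative likelihood of at least one detection: 1 - (1 - p)^n."""
        return 1 - (1 - p)**n

    def discovered_at_least_twice(n, p):
        """Cumulative likelihood of at least two detections."""
        return 1 - ((1 - p)**n + n * p * (1 - p)**(n - 1))

    def min_sample_size(p, target, curve=discovered_at_least_once):
        """Smallest n whose cumulative likelihood, rounded to two decimals
        (as in the note to Table 4), reaches the target."""
        n = 1
        while round(curve(n, p), 2) < target:
            n += 1
        return n

    print(min_sample_size(0.25, 0.90))                              # 8 participants (single detection)
    print(min_sample_size(0.25, 0.90, discovered_at_least_twice))   # 14 participants (detection at least twice)
    print(min_sample_size(0.01, 0.99))                              # 418 participants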
________________________________________________________________________
Table 3. Expected Proportion of Detected Problems (at Least Once) for Various Problem
Detection Probabilities and Sample Sizes
________________________________________________________________________
Sample
Size p=.01 p=.05 p=.10 p=.15 p=.25 p=.50 p=.90
________________________________________________________________________
1 0.01 0.05 0.10 0.15 0.25 0.50 0.90
2 0.02 0.10 0.19 0.28 0.44 0.75 0.99
3 0.03 0.14 0.27 0.39 0.58 0.88 1.00
4 0.04 0.19 0.34 0.48 0.68 0.94 1.00
5 0.05 0.23 0.41 0.56 0.76 0.97 1.00
6 0.06 0.26 0.47 0.62 0.82 0.98 1.00
7 0.07 0.30 0.52 0.68 0.87 0.99 1.00
8 0.08 0.34 0.57 0.73 0.90 1.00 1.00
9 0.09 0.37 0.61 0.77 0.92 1.00 1.00
10 0.10 0.40 0.65 0.80 0.94 1.00 1.00
11 0.10 0.43 0.69 0.83 0.96 1.00 1.00
12 0.11 0.46 0.72 0.86 0.97 1.00 1.00
13 0.12 0.49 0.75 0.88 0.98 1.00 1.00
14 0.13 0.51 0.77 0.90 0.98 1.00 1.00
15 0.14 0.54 0.79 0.91 0.99 1.00 1.00
16 0.15 0.56 0.81 0.93 0.99 1.00 1.00
17 0.16 0.58 0.83 0.94 0.99 1.00 1.00
18 0.17 0.60 0.85 0.95 0.99 1.00 1.00
19 0.17 0.62 0.86 0.95 1.00 1.00 1.00
20 0.18 0.64 0.88 0.96 1.00 1.00 1.00
________________________________________________________________________
________________________________________________________________________
Table 4. Sample Size Requirements as a Function of Problem Probability and the Cumulative
Likelihood that the Problem Will Occur at Least Once (Twice)
________________________________________________________________________
Cumulative Likelihood That the Problem Will Occur at Least Once (Twice)
Problem
Probability .50 .75 .85 .90 .95 .99
________________________________________________________________________
.01 68 (166) 136 (266) 186 (332) 225 (382) 289 (462) 418 (615)
.05 14 ( 33) 27 ( 53) 37 ( 66) 44 ( 76) 57 ( 91) 82 (121)
.10 7 ( 17) 13 ( 26) 18 ( 33) 22 ( 37) 28 ( 45) 40 ( 60)
.15 5 ( 11) 9 ( 17) 12 ( 22) 14 ( 25) 18 ( 29) 26 ( 39)
.25 3 ( 7) 5 ( 10) 7 ( 13) 8 ( 14) 11 ( 17) 15 ( 22)
.50 1 ( 3) 2 ( 5) 3 ( 6) 4 ( 7) 5 ( 8) 7 ( 10)
.90 1 ( 2) 1 ( 2) 1 ( 3) 1 ( 3) 2 ( 3) 2 ( 4)
________________________________________________________________________
Table Note: These are the minimum sample sizes that result after rounding cumulative likelihoods
to two decimal places.
________________________________________________________________________
A Return-on-Investment Model for Usability Studies
The preceding analyses show that, given an estimate of the average likelihood of
problem detection, it is possible to generate problem-discovery curves with the
cumulative binomial probability distribution. These curves provide a basis for selecting
an appropriate sample size for usability studies. However, a more complete analysis
should address the costs associated with running additional participants, fixing problems,
and failing to discover problems. Such an analysis should allow usability practitioners to
specify the relationship between sample size and return-on-investment (ROI).
Method
Table 5 shows the variables and their values in ROI simulations designed to
determine which variables exert the most influence on (1) the sample size at the
maximum ROI, (2) the magnitude of the maximum ROI, and (3) the percentage of
problems discovered at the maximum ROI. The equation for the simulations was
ROI=SAVINGS/COSTS, where SAVINGS is the cost of the discovered problems had
they remained undiscovered minus the cost of fixing the discovered problems, and
COSTS is the sum of the daily cost to run a study plus the costs associated with problems
that remain undiscovered. Thus, a better ROI will have a higher numerical value.
________________________________________________________________________
Table 5. Variables and Their Values in Return-on-Investment Simulations
________________________________________________________________________
Variable                                         Values
________________________________________________________________________
Sample Size                                      From 1 to 20
Number of Problems Available for Discovery       30, 150, 300
Average Likelihood of Problem Discovery          .10, .25, .50
Daily Cost to Run the Study                      $500/day, $1000/day
Cost to Fix a Discovered Problem                 $100, $1000
Cost of an Undiscovered Problem (Low Set)        $200, $500, $1000
Cost of an Undiscovered Problem (High Set)       $2000, $5000, $10000
________________________________________________________________________
The simulations included cumulative binomial problem-discovery curves for
sample sizes from 1 to 20 for three average likelihoods of problem discovery (.10, .25,
.50). For each sample size and average likelihood of problem discovery, a BASIC
program provided the expected number of discovered and undiscovered problems. The
program then crossed the discovered-problem cost of $100 with undiscovered-problem
costs of $200, $500, and $1000 (low set), and the discovered-problem cost of $1000 with
undiscovered-problem costs of $2000, $5000, and $10,000 (high set) to calculate ROIs.
For the simulations, the sample-size variable of 20 participants covered a
reasonable range, and should result in the discovery of a large proportion of the problems
that are available for discovery in many usability studies (Virzi, 1992). The values for
the number of problems available for discovery are consistent with those reported in the
literature (Lewis, Henry, and Mack, 1990; Virzi, 1990; Virzi, 1992), as are the values for
the average likelihood of problem discovery. Assuming one participant per day, the
values for the daily cost to run a study are consistent with current laboratory, observer,
and participant costs. The ratios of the cost to fix a discovered problem to the cost of an
undiscovered problem are congruent with software-engineering indices reported by
Boehm (1981).
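The original simulations were run with a BASIC program. The following Python sketch is an
illustrative reconstruction based only on the description above (expected discoveries from the
cumulative binomial curve, one participant per day, and ROI = SAVINGS/COSTS); it is not the
original code, and the specific cell shown uses values taken from Table 5.

    def roi(n, p, n_problems, daily_cost, fix_cost, miss_cost):
        """Return on investment for one simulation cell.
        SAVINGS: avoided cost of the problems expected to be discovered, minus the cost of fixing them.
        COSTS:   cost of running the study (one participant per day assumed) plus the cost of the
                 problems expected to remain undiscovered."""
        discovered = n_problems * (1 - (1 - p) ** n)   # expected number of problems found at least once
        undiscovered = n_problems - discovered
        savings = discovered * miss_cost - discovered * fix_cost
        costs = daily_cost * n + undiscovered * miss_cost
        return savings / costs

    # One cell from Table 5: p = .25, 150 problems, $500/day, $100 to fix, $500 per undiscovered problem.
    curve = [roi(n, 0.25, 150, 500, 100, 500) for n in range(1, 21)]
    best_n = max(range(1, 21), key=lambda n: curve[n - 1])
    print(best_n, round(curve[best_n - 1], 1))   # sample size at maximum ROI and its magnitude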
Results
Table 6 shows the results of the main effects of the independent variables in the
simulations on the dependent variables of (1) the sample size at which the maximum ROI
occurred, (2) the magnitude of the maximum ROI, and (3) the percentage of problems
discovered at the maximum ROI. The table shows the average value of each dependent
variable for each level of all the independent variables, and the range of the average
values for each independent variable. Across all the variables, the average percentage of
discovered problems at the maximum ROI was 94%.
Discussion
All of the independent variables influenced the sample size at the maximum ROI,
but the variable with the strongest influence (as indicated by the range) was the average
likelihood of problem discovery (p). It also had the strongest influence on the percentage
of problems discovered at the maximum ROI. Therefore, it is very important for
usability practitioners to estimate the magnitude of this variable for their studies, because
it largely determines the appropriate sample size. If the expected value of p is small (for
example, .10), practitioners should plan to discover about 86% of the problems. If the
expected value of p is larger (for example, .25 or .50), practitioners should plan to
discover about 98% of the problems. If the value of p is between .10 and .25,
practitioners should interpolate in Table 6 to determine an appropriate goal for the
percentage of problems to discover.
Contrary to expectation, the cost of an undiscovered problem had a minor effect
on sample size at maximum ROI, but, like all the other independent variables, it had a
strong effect on the magnitude of the maximum ROI. Usability practitioners should be
aware of these costs and their effect on ROI, but they have relatively little effect on the
appropriate sample size for a usability study.
The definitions of the cost variables for the ROI simulations are purposely vague.
Each practitioner needs to consider the potential elements of cost for a specific work
setting. For example, the cost of an undiscovered problem might, in one setting, consist
primarily of the cost to send personnel to user locations to repair the problem. In another
setting, the primary cost of an undiscovered problem might be the loss of future sales due
to customer dissatisfaction.
________________________________________________________________________
Table 6. Main Effects for the ROI Simulations
________________________________________________________________________
                                             Sample Size at   Magnitude of   Percent Problems Discovered
Independent Variable          Value          Maximum ROI      Maximum ROI    at Maximum ROI
________________________________________________________________________
Average Likelihood of          .10               19.0              3.1              86
Problem Discovery              .25               14.6             22.7              97
                               .50                7.7             52.9              99
                               Range:            11.4             49.8              13

Number of Problems              30               11.5              7.0              91
Available for Discovery        150               14.4             26.0              95
                               300               15.4             45.6              95
                               Range:             3.9             38.6               4

Daily Cost to Run Study        500               14.3             33.4              94
                              1000               13.2             19.0              93
                               Range:             1.1             14.4               1

Cost to Fix a                  100               11.9              7.0              92
Discovered Problem            1000               15.6             45.4              96
                               Range:             3.7             38.4               4

Cost of an Undiscovered        200               10.2              1.9              89
Problem (Low Set)              500               12.0              6.4              93
                              1000               13.5             12.6              94
                               Range:             3.3             10.7               5

Cost of an Undiscovered       2000               14.7             12.3              95
Problem (High Set)            5000               15.7             41.7              96
                             10000               16.4             82.3              96
                               Range:             1.7             70.0               1
________________________________________________________________________
General Discussion
The law of diminishing returns, based on the cumulative binomial probability
formula, applies to problem-discovery usability studies. To use this formula to determine
an appropriate sample size, practitioners must form an idea about the expected value of p
(the average likelihood of problem detection) for the study, and the percentage of
problems that the study should uncover. Practitioners can use the data in Table 6 or their
own ROI formulas to estimate an appropriate goal for the percentage of problems to
discover, and can examine data from their own or published usability studies to estimate
p. (The data from this office-applications study have shown that p can be as low as .16.)
With these two estimates, Table 4 (or, more generally, the cumulative binomial
probability distribution) can provide the appropriate sample size for the usability study.
Practitioners who wait to see a problem at least twice before giving it serious
consideration can see in Table 4 the sample-size implications of this strategy. Certainly,
all other things being equal, it is more important to correct a problem that occurs
frequently than one that occurs infrequently. However, it is unrealistic to assume that the
frequency of detection of a problem is the only criterion to consider in the analysis of
usability problems. The best strategy is to consider problem frequency and impact
simultaneously to determine which problems are most important to correct rather than
establishing a cut-off rule such as "fix every problem that appears two or more times."
The results of the present office-applications usability study raise a serious
question about the relationship between problem frequency and impact (or severity). In
this study, problem-discovery rates were the same regardless of the problem-impact
rating. Clearly, the conservative approach for practitioners is to assume independence of
frequency and impact.
It is important for practitioners to consider the risks as well as the gains when they
use small samples in usability studies. Although the diminishing returns for inclusion of
additional participants strongly suggest that the most efficient approach is to run a small
sample (especially if the expected p is high, the study will be iterative, and undiscovered
problems will not have dangerous or expensive outcomes), human factors engineers and
other usability practitioners must not become complacent regarding the risk of failing to
detect low-frequency but important problems.
The goal of this paper was to address considerations for the selection of a sample
size of participants for problem-discovery usability studies. However, this is only one
element among several that usability practitioners must consider. Another important
topic is the selection and construction of the tasks and scenarios that participants do in a
study. Certainly, what an evaluator asks participants to do influences the likelihood of
problem discovery. If the likelihood of discovery of a specific problem on a single
performance of a task is low, the likelihood of discovery will increase if participants have
multiple opportunities to perform the task (or variations on the task). Repeating tasks
also allows an evaluator to determine if particular problems that occur early diminish or
persist with practice, but repeating tasks increases study time. The decision about
whether to have multiple trials depends on the purpose of the study. Concerns about
what tasks to ask participants to do are similar to those in assessing content validity in
psychometrics (Nunnally, 1978). This topic (adequate task coverage in usability studies)
deserves detailed treatment.
References
Boehm, B. W. (1981). Software engineering economics. Englewood Cliffs, NJ:
Prentice-Hall.
Bradley, J. V. (1976). Probability; decision; statistics. Englewood Cliffs, NJ: Prentice-
Hall.
Gould, J. D. (1988). How to design usable systems. In M. Helander (Ed.), Handbook of
Human-Computer Interaction (pp. 757-789). New York, NY: North-Holland.
Grice, R. A., and Ridgway, L. S. (1989). A discussion of modes and motives for
usability evaluation. IEEE Transactions on Professional Communications, 32, 230-237.
Hollander, M., and Wolfe, D. A. (1973). Nonparametric statistical methods. New York,
NY: Wiley.
Karat, C. M., Campbell, R., and Fiegel, T. (1992). Comparison of empirical testing and
walkthrough methods in user interface evaluation. In Human Factors in Computing
Systems: CHI ' 92 Conference Proceedings (pp. 397-404). New York, NY: Association
for Computing Machinery.
Keeler, M. A., and Denning, S. M. (1991). The challenge of interface design for
communication theory: From interaction metaphor to contexts of discovery. Interacting
with Computers, 3, 283-301.
Kraemer, H. C., and Thiemann, S. (1987). How many subjects? Statistical power
analysis in research. Newbury Park, CA: Sage Publications.
Lewis, C., and Norman, D. A. (1986). Designing for error. In D. A. Norman and S. W.
Draper (Eds.), User Centered System Design: New Perspectives on Human-Computer
Interaction (pp. 411-432). Hillsdale, NJ: Lawrence Erlbaum.
Lewis, J. R. (1982). Testing small system customer set-up. In Proceedings of the
Human Factors Society 26th Annual Meeting (pp. 718-720). Santa Monica, CA: Human
Factors Society.
Lewis, J. R. (1990). Sample sizes for observational usability studies: Tables based on
the binomial probability formula (Tech. Report 54.571). Boca Raton, FL: International
Business Machines Corp.
Lewis, J. R., Henry, S. C., and Mack, R. L. (1990). Integrated office software
benchmarks: A case study. In Human-Computer Interaction -- INTERACT '90 (pp. 337-
343). London, UK: Elsevier Science Publishers.
Norman, D. A. (1983). Design rules based on analyses of human error.
Communications of the ACM, 26, 254-258.
Nunnally, J. C. (1978). Psychometric theory. New York, NY: McGraw-Hill.
Virzi, R. A. (1990). Streamlining the design process: Running fewer subjects. In
Proceedings of the Human Factors Society 34th Annual Meeting (pp. 291-294). Santa
Monica, CA: Human Factors Society.
Virzi, R. A. (1992). Refining the test phase of usability evaluation: How many subjects
is enough? Human Factors, 34, 457-468.
Whitefield, A., and Sutcliffe, A. (1992). Case study in human factors evaluation.
Information and Software Technology, 34, 443-451.
Wright, P. C., and Monk, A. F. (1991). A cost-effective evaluation method for use by
designers. International Journal of Man-Machine Studies, 35, 891-912.
... According to a previous study [34], physiological monitoring is a relatively emerging technology in usability and UX studies, thus an ideal estimation for the sample size in such studies is not readily available. However, several studies [87][88][89] suggest that the commonly known "magic number 5" may suffice to uncover approximately 80% of usability issues in technology acceptance. Nevertheless, employing small sample sizes may result in significant variability in test outcomes that cannot be adequately mitigated [90]. ...
... • Sample size. In [87][88][89] claimed that 5 participants are sufficient to identify 80% of usability problems. However, small sample sizes can lead to high variability in test results that cannot be fully adjusted [90]. ...
Article
Full-text available
In the new industrial contexts, the workers’ well-being is the central pillar. Therefore, research on methods and technics to improve the workers’ user experience in a human–robot collaborative environment is necessary. While the effects of kinematic variables, such as speed and acceleration, on human safety have been extensively studied, their impact on human perception has not been fully explored. This study investigates the effects of the robot’s speed and acceleration on humans. An experimental research approach was used, where 20 participants (10 women and 10 men) performed an assembly task collaborating with a robot. An experiment was defined with two procedures, and the participants were evenly distributed: in the first experiment, the participants started by performing the task at a slow robot speed and then performed the same task at a faster speed. In the second experiment, the other half followed the opposite procedure. Key Performance Indicators (KPIs), physiological values (via EEG and EDA), and perceptual values (using the standardised UEQ-S questionnaire) were collected. The results showed that the robot’s speed and acceleration impact the task’s completion time and participants’ emotional responses. Our results lead to a new concept, “HRI speed bell”, which indicates that it is necessary to investigate the optimal speed and acceleration to ensure good trust and perceived safety. Furthermore, the task sequence also influences participants’ expectations and performance. Finally, the results are examined according to gender perspective.
... Finally, a physical workshop (120′) was held with 16 of the stakeholders from the LIBERTY project to present the results and exchange opinions regarding the practical use of circularity criteria and indicators for the sustainable development and LCA of EV batteries. Regarding the size of the interviewed population, Lewis (1994) and Aguayo (2024) highlighted that 5 participants are enough to identify 80 % of the usability problems for a product, tool or process. Nevertheless, small sample sizes may increase the variability and limit the applicability of the results (Cazañas, and S., and Parra, E., 2017), which could be seen a limitation of this paper. ...
Article
Full-text available
The implementation of circular economy (CE) criteria and indicators in the design stage of electric vehicle (EV) batteries could optimise their life cycle resource efficiency and environmental performance. However, the viability of using circularity criteria and indicators to develop more environmentally sustainable EV batteries remains unclear due to the lack of scientific and industrial case studies. The goal of this paper is to show the perceptions from relevant stakeholders about the suitability of the implementation of circularity criteria and indicators for EV batteries design and life cycle management (LCM). A total of 24 industrial and academic stakeholders were engaged in individual meetings to assess the importance and applicability of 30 circularity design criteria and 15 product-level circularity indicators, collected through a review of academic papers, policy regulations, and industry reports. According to the consulted stakeholders, i) “focus on quality of performance”, ii) “favour cleaner production”, and iii) “use digitalisation and internet-of-things solutions” were identified as the most suitable criteria for implementation. Regarding circularity indicators, End of Life Indices and the Product Circularity Indicator were considered the most relevant for use due to their coverage of multiple life cycle stages and circularity strategies. However, the results suggest a discrepancy in stakeholders' views regarding the best circular design criteria and the most suitable circularity indicators. Consequently, there is yet a lack of adequate indicators for sustainable EV battery design and LCM incorporating the required circular design criteria. Accordingly, future research should focus on defining and aligning specific circularity criteria and indicators for EV batteries to support and monitor sustainable innovation.
... Although the sample size in our study was deemed sufficient for usability testing [73][74][75][76] and composed of diverse physically inactive participants across different genders, a wide range of age groups, and varying severities of depressive symptoms, ongoing modifications to MoodMover may warrant a larger sample size. However, we ensured that any modifications were reviewed by at least three subsequent participants. ...
Article
Full-text available
Background Physical activity (PA) is recognized as a modifiable lifestyle factor for managing depression. An application(app)-based intervention to promote PA among individuals with depression may be a viable alternative or adjunct to conventional treatments offering increased accessibility. Objective This paper describes the early stages of the development process of MoodMover, a 9-week app-based intervention designed to promote PA for people with depression, including its usability testing. Methods Development of MoodMover followed the initial stages of the Integrate, Design, Assess, and Share (IDEAS) framework. The development process included (1) identifying intervention needs and planning; (2) intervention development; and (3) usability testing and refinement. Usability testing employed a mixed-methods formative approach via virtual semi-structured interviews involving goal-oriented tasks and administration of the mHealth App Usability Questionnaire (MAUQ). Results Drawing on formative research, a multidisciplinary research team developed the intervention, guided by the Multi-Process Action Control framework. Nine participants engaged in the usability testing with the MoodMover prototypes receiving an average MAUQ score of 5.79 (SD = 1.04), indicating good to high usability. Necessary modifications were made based on end-users' feedback. Conclusions The development of MoodMover, the first theoretically informed app-based PA intervention for individuals with depression, may provide another treatment option, which has wide reach. The comprehensive usability testing indicated interest in the app and strong perceptions of usability enabling a user-centered approach to refine the app to better align with end-users' preferences and needs. Testing the feasibility and preliminary efficacy of the refined MoodMover is now recommended.
... Among the essential planning steps, after having established the test objectives, it is determined how the product will be tested and how the user groups will be established. On the other hand, there has been a debate about how many participants are needed in a reliable usability test to identify usability issues [26][27] and [28] consider that a majority or about 80 % (given a 30 probability of detection) of usability issues will be observed with the first five participants [29] . In fact, a study can be conducted with 5 users and get excellent results as long as the users are all from the same subgroup. ...
Article
Full-text available
Today, visualization of 3D medical images is an essential tool for medical education. Web-based 3D tools for the teaching-learning process have turned out to be an efficient alternative to conventional systems. In this work, we aim the modeling process and 3D web-based interactive visualization of the human brain using 3D web technologies and an improvement of the Methodology for the Development of Virtual Reality Educational Environments (MEDEERV, for its acronym in Spanish). 20 undergraduate medicine, dentistry, gerontology, and computer science students performed a brain model usability test (9 women; 11 men, mean age = 22.1 years, SD = 0.70). To this end, we used a post-test questionnaire with Likert scale answers whose alpha of Cronbach was 0.93. The proof of concept of the brain model that we have developed in this work provides evidence of the viability of the system to be used as a web tool for basic neuroanatomy learning. The main contribution of this work focuses on the implementation of MEDEERV to model the 3D human brain, plus the usability testing for reengineering feedback. This approach to modeling, visualizing, and evaluating could be used in other areas of human anatomical teaching. Although the experimental results show a good user experience, functionality and usability, it is necessary to generate a new version and carry out a study with a larger and more specific population with knowledge of brain anatomy.
... We are poised to reach data saturation within the given sample size for the single-group pilot trial. [22][23][24] During the interviews, we will ask participants to share whether they suggested protocol change(s) that might make home testing simpler or clearer. These queries may elucidate reasons for non-participation and improvements that could be implemented for the roll-out of the randomised trial. ...
Article
Full-text available
Introduction Lupus nephritis (LN) is a frequent complication of SLE, occurring in up to 60% of adult patients and ultimately progressing from acute inflammation to chronicity with fibrosis and end-stage kidney failure in 10%–30% of patients. Racial/ethnic minority patients with lupus have worse long-term outcomes, including progression to end-stage renal disease and overall mortality. A major challenge in the management of patients with SLE is delayed identification of early kidney disease, which ultimately leads to a greater burden on both patients and the health system. Methods and analysis Using a mixed methods approach, this study will develop, adapt and evaluate a home urine sampling protocol with a text-messaging reminder and data capture system for patients at elevated risk of de novo LN or relapse. First, a feasibility pilot using a single-group trial design (n=18) will be implemented, with a feasibility assessment and qualitative, debriefing interviews with patients to further refine the intervention. The second phase is a comparative effectiveness trial of the intervention (n=160) with the primary outcome of biopsy eligibility, that is, the participant has a clinical indication for a kidney biopsy (urine protein–creatinine ratio≥0.5), whether or not the patient actually undergoes the biopsy procedure. The randomised trial includes an economic evaluation of the adapted home urinalysis protocol. Discussion and dissemination It is unknown whether weekly home-based urine sampling can identify proteinuria sooner than standard care; if found sooner, kidney problems could be diagnosed earlier, hopefully leading to earlier care for less-involved disease and subsequent reduced morbidity. The data collected in this trial will inform future feasibility and effectiveness of text-messaging-based home urine sampling interventions. Trial registration number The randomised trial will be registered with ClincialTrials.gov prior to enrolment start.
Article
Full-text available
Introduction A lack of culturally appropriate genetic information prevents the British Pakistani community from engaging with genetic services. The GENE-Ed project focussed on the development of an educational app with and for the Pakistani community. A secondary aim was understanding how to engage the community in research. Methods We used an iterative co-design and co-creation approach including four phases to develop the Gene app. Phase 1 included seven interviews with community members to explore genetics understanding and define the requirements. Phase 2 included reviewing smartphone apps and research on digital patient-facing interventions for genetics understanding. Phase 3 included developing the app and obtaining initial feedback. In Phase 4, feedback was obtained from five community members using the System Usability Scale (SUS), a bespoke survey and observations. Results Four themes were identified in the interviews: current awareness of genetics; consanguinity, religion and cultural influence; presenting genetics information in a new digital resource and dissemination; information-sharing and uptake. The reviews highlighted an absence of culturally sensitive, accessible and evidence-based digital resources. Initial feedback included altering the animations and images within the app and simplifying the text. The mean SUS score was 87, indicating excellent usability. The written information, animations and videos were acceptable to participants, and they tended to trust the information in the app. During feedback, community members responded well to different methods but struggled with written open-ended survey questions. Conclusion The co-design approach was essential to developing an acceptable resource for the British Pakistani community. Future clinical testing is needed.
Article
Full-text available
Classification of construction resource states using sensor data analytics has implications for improving informed decision-making for safety and productivity. However, training in sensor data analytics within construction education faces challenges owing to the complexity of the analytical processes and the large streams of raw data involved. This research presents the development and user evaluation of ActionSens, a block-based end-user programming platform for training students from construction-related disciplines to classify resources using sensor data analytics, for example through activity recognition on construction sites. ActionSens was compared with the traditional tools used for such analyses (i.e., a combination of Excel and MATLAB) in terms of usability, workload, visual attention, and processing time, measured with the System Usability Scale, the NASA Task Load Index, eye-tracking, and qualitative feedback. Twenty students performed data analytics tasks with both approaches. ActionSens provided a better user experience than the conventional tools, with higher usability scores and lower cognitive workload; participants' interaction behavior also showed more efficient allocation of visual attention across key tasks. The study illustrates how integrating construction domain information into block-based programming environments can equip students with the skills needed for sensor data analytics. The development of ActionSens contributes to the Learning-for-Use framework by employing graphical, interactive programming objects to foster procedural knowledge for addressing challenges in sensor data analytics, and the formative evaluation provides insight into how students engage with the programming environment and how it affects their cognitive load.
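ActionSens itself is a block-based environment, but the computation its blocks wrap is conventional windowed activity recognition. The Python sketch below is a hypothetical, minimal illustration of that idea (simulated accelerometer data, arbitrary window size and threshold); it is not ActionSens code and is not taken from the study.

# Illustrative sketch of windowed activity recognition from an accelerometer
# stream. All data, window sizes, and thresholds are hypothetical.
import math
import random

def windows(signal, size, step):
    """Yield fixed-size windows from a 1-D signal."""
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]

def features(win):
    """Mean and standard deviation of one window."""
    mean = sum(win) / len(win)
    var = sum((x - mean) ** 2 for x in win) / len(win)
    return mean, math.sqrt(var)

def classify(win, sd_threshold=0.3):
    """Hypothetical rule: high variability -> 'active', otherwise 'idle'."""
    _, sd = features(win)
    return "active" if sd > sd_threshold else "idle"

# Simulated vertical-acceleration magnitudes (g) at 50 Hz: 2 s idle, then 2 s active.
random.seed(1)
stream = [1.0 + random.gauss(0, 0.05) for _ in range(100)] + \
         [1.0 + random.gauss(0, 0.6) for _ in range(100)]

labels = [classify(w) for w in windows(stream, size=50, step=50)]
print(labels)  # typically: ['idle', 'idle', 'active', 'active']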
Chapter
Self-awareness of human anatomy varies widely among the general public, leading to challenges in comprehending the intricate functions and spatial relationships of internal organs. To bridge this gap, the integration of active learning methodologies into science, technology, engineering, and mathematics (STEM) education has become imperative. These methodologies empower students to actively engage in their learning process through interactive discussions and hands-on activities, fostering a deeper understanding of complex anatomical concepts. In anatomical education, creative learning interventions, such as body painting and crafting, have proven effective in enhancing active learning skills including motor skills, observational abilities, and visuospatial aptitude. Furthermore, the accessibility of emerging technologies such as augmented reality (AR) has presented promising opportunities to revolutionize science curricula, offering innovative and engaging methods to facilitate a more comprehensive understanding of human anatomy. This chapter discusses the variability in human anatomy awareness and associated challenges. Emphasizing the significance of active learning, the chapter underscores its role in aiding students to grasp complex scientific concepts. This research focuses on a creative learning approach to teaching the complexities of the brain, lungs, and heart, utilizing a combination of innovative techniques, including anatomical baking, photogrammetry, 3D modeling, and AR application development. Preliminary findings underscore the usability of the AR application and its efficacy in fostering increased motivation for learning while recognizing the need for further comprehensive user testing. The approach aims to enhance public understanding of these vital organs, leveraging the popularity of baking programs to ensure widespread accessibility and engagement.
Article
Full-text available
We propose the addition of usability validation to the extended V3 framework, now "V3+", and describe a pragmatic approach to ensuring that sensor-based digital health technologies can be used optimally at scale by diverse users. Alongside the original V3 components (verification; analytical validation; clinical validation), usability validation will ensure user-centricity of digital measurement tools, paving the way for more inclusive, reliable, and trustworthy digital measures within clinical research and clinical care.
Conference Paper
Full-text available
Customer Set-Up is a proven approach to reducing service costs and providing products at lower prices to customers. To ensure the effectiveness of Customer Set-Up instructions and procedures, these instructions and procedures must be studied before being shipped with their associated product. This paper addresses several points to consider when planning a study of a Customer Set-Up system, such as procedure, appropriateness of subjects, number of subjects, the iterative procedure, studies versus tests, and development of test criteria.
Article
A summary is presented of the current state of the art and recent trends in software engineering economics. It provides an overview of economic analysis techniques and their applicability to software engineering and management. It surveys the field of software cost estimation, including the major estimation techniques available, the state of the art in algorithmic cost models, and the outstanding research issues in software cost estimation.
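As an illustration of the kind of algorithmic cost model surveyed there, the Python sketch below evaluates the Basic COCOMO effort equation, effort ≈ a·(KLOC)^b person-months, using the published Basic COCOMO coefficients. The 32 KLOC project size is a hypothetical example; real estimates would use the more detailed model variants and their cost drivers.

# Basic COCOMO effort estimate: effort (person-months) = a * KLOC ** b.
# Coefficients are the published Basic COCOMO values; the 32 KLOC example is hypothetical.

COEFFICIENTS = {
    "organic":       (2.4, 1.05),  # small teams, familiar problem domain
    "semi-detached": (3.0, 1.12),  # intermediate size and constraints
    "embedded":      (3.6, 1.20),  # tight hardware/software/operational constraints
}

def basic_cocomo_effort(kloc, mode="organic"):
    a, b = COEFFICIENTS[mode]
    return a * kloc ** b

for mode in COEFFICIENTS:
    print(f"{mode}: {basic_cocomo_effort(32, mode):.1f} person-months")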
Conference Paper
This chapter describes how usable systems can be designed. Usability is a combination of many factors, each of which is often developed independently. User-interface code is becoming an increasingly large percentage of total system code, and standards for user interface design are beginning to emerge, although establishing standards for the software aspects of the user interface is probably premature. Many guidelines for good system design exist, but guidelines alone are not enough to produce good systems. Designers should focus on prospective users and their work from the start of development and throughout it. Although it is often said that people buy computer systems for the functions they provide, one is unlikely to determine what those functions should be without talking with users. Each aspect of usability should be measured continuously, iterating in a hill-climbing way toward a better system. All aspects of usability should begin evolving from the beginning and should be under one focus.
Article
Recent attention has been focused on making user interface design less costly and more easily incorporated into the product development life cycle. This paper reports an experiment conducted to determine the minimum number of subjects required for a usability test. It replicates work done by Jakob Nielsen and extends it by incorporating problem importance into the curves relating the number of subjects used in an evaluation to the number of usability problems revealed. The basic findings are that (1) with between four and five subjects, 80% of the usability problems are detected, and (2) additional subjects are less and less likely to reveal new information. Moreover, the correlation between expert judgements of problem importance and likelihood of discovery is significant, suggesting that the most disruptive usability problems are found with the first few subjects. Ramifications for the practice of human factors are discussed as they relate to the type of usability test cycle the practitioner is employing and the goals of the usability test.
Article
Attention has been given to making user interface design and testing less costly so that it might be more easily incorporated into the product development life cycle. Three experiments are reported in this paper that relate the proportion of usability problems identified in an evaluation to the number of subjects participating in that study. The basic findings are that (a) 80% of the usability problems are detected with four or five subjects, (b) additional subjects are less and less likely to reveal new information, and (c) the most severe usability problems are likely to have been detected in the first few subjects. Ramifications for the practice of human factors are discussed as they relate to the type of usability test cycle employed and the goals of the usability test.
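Both of these abstracts report the diminishing-returns discovery curve that the present report models with the binomial probability formula: if each problem is detected independently with average probability p, the expected proportion of problems found by n participants is 1 − (1 − p)^n. The short Python sketch below evaluates that formula for the .32 to .42 range of p reported by Virzi and shows why four to five participants land near the 80% figure.

# Expected proportion of usability problems discovered by n participants,
# assuming each problem is detected independently with average probability p:
#     discovered(n) = 1 - (1 - p) ** n

def proportion_discovered(p, n):
    return 1 - (1 - p) ** n

for p in (0.32, 0.42):
    curve = [round(proportion_discovered(p, n), 2) for n in range(1, 11)]
    print(f"p = {p}: {curve}")

# p = 0.32: about 0.79 of the problems after 4 participants, 0.85 after 5.
# p = 0.42: about 0.89 of the problems after 4 participants, 0.93 after 5.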