RESEARCH REPORT

The Bootstrap Discovery Behaviour (BDB): a new outlook on usability evaluation

Simone Borsci · Alessandro Londei · Stefano Federici

Received: 5 July 2010 / Accepted: 12 October 2010 / Published online: 3 November 2010
© Marta Olivetti Belardinelli and Springer-Verlag 2010
Abstract The value of λ is one of the main issues debated in international usability studies. The debate is centred on the deficiencies of the mathematical return on investment model (ROI model) of Nielsen and Landauer (1993). The ROI model is discussed in order to identify the basis of another model that, while respecting Nielsen and Landauer's one, tries to consider a larger number of variables for the estimation of the number of evaluators needed for an interface. Using the bootstrap model (Efron 1979), we can take into account: (a) the interface properties, as the properties at the zero condition of evaluation, and (b) the probability that the population discovery behaviour is represented by all the possible discovery behaviours of a sample. Our alternative model, named Bootstrap Discovery Behaviour (BDB), provides an alternative estimation of the number of experts and users needed for a usability evaluation. Two experimental groups of users and experts were involved in the evaluation of a website (http://www.serviziocivile.it). Applying the BDB model to the problems identified by the two groups, we found that 13 experts and 20 users are needed to identify 80% of usability problems, instead of the 6 experts and 7 users required according to the estimation of the discovery likelihood provided by the ROI model. The consequence of the difference between the results of these models is that, in following the BDB, the costs of usability evaluation increase, although this is justified considering that the results obtained have the best probability of representing the entire population of experts and users.
Keywords Asymptotic test · Bootstrap · Effectiveness · Return on investment · User experience evaluation
Introduction
Nielsen and Landauer (1993) show that, generally, the least number of evaluators (experts or users) required for usability evaluation techniques ranges from three to five. These authors' mathematical model was run on the problems identified by the experts or users in order to evaluate whether a technique is efficient or cost effective: "observing additional participants reveals fewer and fewer new usability problems" (Turner et al. 2006, p. 3084); thus,
adding more than four or five users (participants) does not
provide an advantage in estimating rates of discovery of
new problems in terms of costs, benefits, efficiency and
effectiveness. This model, known as return on investment
(ROI), is an asymptotic test able to estimate the number of
evaluators needed with the following formula:
FoundðiÞ¼N½1ð1kÞið1Þ
Electronic supplementary material The online version of this
article (doi:10.1007/s10339-010-0376-6) contains supplementary
material, which is available to authorized users.
S. Borsci (✉) · A. Londei · S. Federici
ECoNA—Interuniversity Centre for Research on Cognitive
Processing in Natural and Artificial Systems,
Sapienza University of Rome,
Rome, Italy
e-mail: simone.borsci@gmail.com
S. Federici
Department of Human and Education Sciences,
University of Perugia,
Perugia, Italy
In (1), N is the total number of problems in the interface, λ (see footnote 1) is defined by Nielsen and Landauer (1993, p. 208) as "the probability of finding the average usability problem when running a single average subject test" (i.e. the individual detection rate), and i is the number of users. As some international studies (Lewis 1994; Nielsen 2000; Nielsen and Mack 1994; Virzi 1990, 1992; Wright and Monk 1991) have shown, a sample size of five participants is sufficient to find approximately 80% of the usability problems in a system when the individual detection rate (λ) is at least 0.30. By using this mathematical model, the range of evaluators required for a usability test can be found, and therefore the increase in the number of problems found by adding users to the evaluation can be calculated. For instance, by applying formula (1), practitioners can estimate whether five users are sufficient for an efficient assessment or, otherwise, how many users (n) are needed in order to increase the percentage of usability problems found, as follows:
FoundðsÞ¼½1ð10:3Þs¼0:83
In this example of a potential application of the formula, provided by Nielsen (2000), the problem detection rate obtained with five users is 0.83 (i.e. 83% of usability problems will be detected). However, we must emphasise that many studies (Lewis 1994; Turner et al. 2006; Virzi 1990, 1992) show λ ranging from 0.16 to 0.42 (see Lewis 2006). The increase in the problem detection rate can then be estimated by adding more users to this sample of five, as reported in Fig. 1. The analysis of that hypothetical sample shows that almost 100% of usability problems can be found with 15 users: with just 5 users the likelihood of problem discovery is equal to 83%, but in order to discover the remaining (less than 20%) usability problems not yet identified, at least 10 more users need to be added to the evaluation.
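As an illustration of how formula (1) is typically applied, the following minimal sketch (ours, in Python; not part of the original study) computes the discovery likelihood for a hypothetical λ of 0.30 and the smallest sample that reaches an 80% discovery target; it reproduces the 83% five-user example above.

```python
# Minimal sketch of the ROI discovery curve, Found(i) = 1 - (1 - lambda)^i,
# for a hypothetical individual detection rate of 0.30 (not study data).

def discovery_likelihood(lam: float, i: int) -> float:
    """Expected proportion of problems found by i evaluators."""
    return 1.0 - (1.0 - lam) ** i

lam = 0.30
for i in (1, 5, 10, 15):
    print(f"{i:2d} evaluators -> {discovery_likelihood(lam, i):.1%}")

target = 0.80
n = 1
while discovery_likelihood(lam, n) < target:   # smallest sample reaching 80%
    n += 1
print(f"{n} evaluators needed to reach {target:.0%} when lambda = {lam}")
```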
The deficiencies of the λ value estimation
As Nielsen and Landauer (1993) underline when discussing their model, the discoverability rate (λ) for any given usability test depends on at least seven main factors:

- the properties of the system and its interface;
- the stage of the usability lifecycle;
- the type and quality of the methodology used to conduct the test;
- the specific tasks selected;
- the match between the test and the context of real-world usage;
- the representativeness of the test participants;
- the skill of the evaluator.
These factors have an effect on the evaluation of the
interaction between system and user that, in our opinion,
the ROI model is not able to estimate. Indeed, the ROI model assumes that all the evaluators have the same probability of finding all problems. As Caulton (2001) states, the ROI model is
based on the idea that all types of subjects have the same
probability of encountering all potential usability problems,
without considering their different evaluation skills. At the
same time, the ROI model does not take into consideration
the effects of the evaluation methodologies being used, the
representativeness of the sample of participants, or, finally,
the similarity between the test and its context in the real
world. In particular, as Woolrych and Cockton (2001) have
claimed, the ROI model fails to integrate all of the indi-
vidual differences in problem discoverability; in this sense,
the probability of the participants encountering all of the
problems remains a relevant issue that needs clarification.
Recently, Schmettow (2008), while discussing the assumption of a homogeneous detection probability, claimed that it is intuitively unrealistic. This author proposes a beta-binomial distribution, in which the value of "p" is estimated using a process that is able to take heterogeneity into account. However, Schmettow (2008) demonstrates the problems that heterogeneity creates for the ROI model without proposing a real solution.
Fig. 1 The asymptotic behaviour of discovery likelihood in relation to our hypothetical sample with λ = 0.30
1 Actually, only Nielsen and Landauer (1993) used λ instead of the p used by other authors (Lewis 1994, 2006; Virzi 1992; Wright and Monk 1991; Schmettow 2008) in formula (1), partly because they derived their formula from the "Poisson process" (see Nielsen and Landauer 1993). Many authors (Lewis 1994, 2006; Virzi 1992; Wright and Monk 1991; Schmettow 2008) write formula (1) as P = 1 − (1 − p)^n, where P is the proportion of the problems in the interface expected to be found, p is the probability of finding the average usability problem when running a single average subject test, and n is the number of participants.
The λ value estimation does not take into account the differences between the systems evaluated. This means that the effects on the evaluation results caused by the properties of the system, the interface lifecycle stage and the methodologies selected for the evaluation are not considered by the model. In fact, the ROI model starts from a "one evaluator" condition and not from a zero condition. This means
that the characteristics of the system are considered only as
the differences between problems found by the first eval-
uators. Nielsen (1994) pointed out that the first evaluator
(a user or an expert) generally finds 30% of the problems,
because these problems are generally the most evident. The
subsequent evaluators usually find a smaller percentage of
new problems, simply because the most evident ones have
already been detected by the first evaluator. The number of
evident problems is determined empirically, and it varies
because it is dependent on the evaluator’s skills, which, as
we have already stated, is a factor that this model does not
consider. The value of 30% was derived through Monte
Carlo (MC) resampling of multiple evaluators and could
also be estimated using the full matrix of problems as
discovered by independent evaluators (see Lewis 2001).
A serious limitation of Nielsen's (and both Landauer's and Virzi's) work is that they happened to be working with products for which the value of p across evaluators/users was about 0.3, but as Lewis (1994) showed, it is possible for the composite value of p to be much lower than 0.3. For Lewis (1994), the value was 0.16, and for Spool and Schroeder (2001; see also Lewis 2006) it was 0.029. In order to assess the completeness of a problem-discovery usability study, the practitioner(s) running the study must have some idea of the value of p, which differs from study to study as a function of the properties of the system, the interface lifecycle stage, the methodologies selected for the evaluation, and the skill of the evaluators/users; ergo, it is not necessarily 0.3 (30%).
The international debate on the estimation of the value of λ also shows that the ROI model suffers from an overestimation of λ or, as Woolrych and Cockton (2001) claim, at least from an optimistic estimation of the discovery rate. As Schmettow (2008, p. 94) underlines, Lewis (2001), in order to resolve this problem of overestimation, "compared several correction terms in application to real data sets. The final suggestion was an equally weighted combination of a simplified Good-Turing (GT) adjustment and a normalization procedure (NORM) proposed by Hertzum and Jacobsen (2003)". This adjustment is able to deflate the overestimated value of λ estimated using a small sample, but without solving all of the problems, discussed above, that Nielsen and Landauer's model generates.

Taking into account the deficiencies of the ROI model, in a usability evaluation practitioners must consider that the estimation of λ could have a variable range of values and, as a consequence, that this model cannot guarantee the reliability of the evaluation results obtained by the first five participants.
This analysis allows us to propose an alternative model to the ROI one, based on the probabilistic behaviour in the evaluation. As its first feature, our alternative model should be able to take into account the probabilistic individual differences in problem identification. The second feature of our model is that it must consider the evaluated interface as an object per se: interfaces are considered different not in terms of the number of problems found by the first evaluator (evaluation condition), but as objects (zero condition), estimating the probabilistic number of evident problems that all the evaluators can detect by testing the interface. The third feature of the model is that, in order to calculate the number of evaluators needed for the evaluation, it must consider the representativeness of the sample (as regards the population of all the possible evaluation behaviours of the participants). Our model is based on the statistical inference method known as bootstrapping.
A new look: the Bootstrap Discovery Behaviour
Bootstrapping is a general approach to statistical inference based on building a sampling distribution for a statistic by resampling from the data at hand. The term "bootstrapping", defined by Efron (1979), is an allusion to the expression "pulling oneself up by one's bootstraps"; in this case, the sample data are used as a population from which repeated samples are drawn (for a general introduction to bootstrapping methods see Fox 2002). The present bootstrapping approach starts from the assumption that discovering new problems should be the main goal of both users' and experts' evaluations, as expressed in Formula (1) by Nielsen and Landauer (1993).
Given a generic problem x, the probability that a subject will find x is p(x). If two subjects (experts or users) navigate the same interface, the probability that at least one of them will detect the problem x is:

p(x1 ∨ x2)    (2)
In (2), x1 and x2 represent the problem x as detected by subjects 1 and 2, and ∨ is the logical OR operator. According to De Morgan's law (Goodstein 1963), (2) is equivalent to:

p[¬(¬x1 ∧ ¬x2)]    (3)

Equation (3) expresses the probability of "the degree to which it is false that none of the subjects finds anything" (¬ is the logic operator for negation), so that the complement rule p(¬x) = 1 − p(x) can be applied. Since the probabilities of different subjects finding a specific problem are mutually independent, Equation (3) can be written as:

p[¬(¬x1 ∧ ¬x2)] = 1 − [1 − p(x1)] · [1 − p(x2)]    (4)
Following Caulton's (2001) homogeneity assumption that all subjects have the same probability (p) of finding the problem x, (4) can also be expressed as:

p(x1 ∨ x2) = 1 − [1 − p]^2    (5)

Of course, we can extend this case to a generic number of evaluators L:

p(x1 ∨ x2 ∨ … ∨ xL) = 1 − [1 − p]^L    (6)

Equation (6) expresses the probability that, in a sample composed of L evaluators, at least one of them will identify the problem x.
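For instance, with p = 0.3 and L = 5, Equation (6) gives 1 − (1 − 0.3)^5 ≈ 0.83: the five-user example from the Introduction is recovered as the probability that at least one of five evaluators identifies a given problem.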
According to Nielsen and Landauer (1993), given N problems in an interface, the probability of any problem being detected by any evaluator can be considered constant (p(x) = p). The mean number of problems detected by L evaluators is then:

F(L) = N[1 − (1 − p)^L]    (7)

This leads to the same model presented by Nielsen (Equation 1). In (7), in order to estimate p(x) we adopted the bootstrap model, avoiding an estimation based merely on the addition of detected problems. Such an estimation could in fact be invalidated by the small size of the analysed samples or by the differences in the subjects' probabilities of problem detection. Our idea is that the bootstrap model should be able to grant a more reliable estimation of the probability of identifying a problem.
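The resampling idea can be illustrated with a short sketch (ours, not the paper's code): given a binary discovery matrix with one row per evaluator and one column per problem, evaluators are resampled with replacement and p is estimated as the normalized mean of problems found, with a percentile confidence interval. The matrix below is synthetic and only stands in for the real expert or user data.

```python
# Sketch of a bootstrap estimate of p (Efron 1979); the discovery matrix is
# synthetic and only illustrates the shape of the real expert/user data.
import numpy as np

rng = np.random.default_rng(0)
discovery = rng.random((20, 39)) < 0.12      # rows = evaluators, cols = problems

n_boot = 5000
p_hat = np.empty(n_boot)
for b in range(n_boot):
    rows = rng.integers(0, discovery.shape[0], discovery.shape[0])  # resample with replacement
    p_hat[b] = discovery[rows].mean()        # normalized mean number of problems found

low, high = np.percentile(p_hat, [2.5, 97.5])
print(f"p = {p_hat.mean():.3f}, 95% CI ({low:.3f}, {high:.3f})")
```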
Experiment
In order to test the Bootstrap Discovery Behaviour:

- Two experimental groups are asked to evaluate a target interface: 25 experts by means of the cognitive walkthrough (CW) technique and 20 users by means of the thinking aloud (TA) technique.
- Using the "Fit" function in Matlab (http://www.mathworks.com), we applied a bootstrap with 5,000 samplings. The result of each subsample was obtained by placing the evaluators (experts and users) in a random order, with repetition (a minimal sketch of this resampling step is given after this list).
- In order to identify the best fit of the data within a 95% confidence interval, the result of each bootstrap sampling allowed us to estimate three parameters: (i) the probable number of problems found (p), obtained as the normalized mean number of problems found by each subgroup of subjects; (ii) the maximum number of problems that all possible samples could identify (a), known as the maximum limit; and (iii) the value of the known term q.
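A minimal sketch of the resampling step just described, under stated assumptions: a synthetic binary discovery matrix stands in for the collected data, and each of the 5,000 bootstrap samples draws evaluators in a random order with repetition to build a mean cumulative-discovery curve.

```python
# Sketch of the bootstrap sampling described above; the discovery matrix is
# synthetic, standing in for the problems identified by the real evaluators.
import numpy as np

rng = np.random.default_rng(7)
discovery = rng.random((20, 39)) < 0.12        # rows = evaluators, cols = problems
n_evaluators = discovery.shape[0]

n_boot = 5000
mean_found = np.zeros(n_evaluators)
for _ in range(n_boot):
    order = rng.integers(0, n_evaluators, n_evaluators)      # random order, with repetition
    seen = np.cumsum(discovery[order], axis=0) > 0           # problems found by the first L evaluators
    mean_found += seen.sum(axis=1)
mean_found /= n_boot                                          # mean discovery curve, L = 1..20

for L in (1, 5, 10, 20):
    print(f"L = {L:2d}: {mean_found[L - 1]:.1f} problems found on average")
```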
Participants
Experts group
This group comprised 25 experts (10 males, 15 females, mean age = 26.6) with different levels of expertise: 10 experts had more than 3 years of experience and 15 had less than 1 year of experience. All the experts evaluated the target website with the CW technique.
Users group
Twenty students from Sapienza University of Rome (5 males, 15 females, mean age = 21.3) were involved in the TA analysis of the target website.
Evaluation techniques
Cognitive walkthrough
This starts with a task analysis that allows (a) the sequence
of steps a user should take in order to accomplish a task to
be specified and (b) the system responses to the actions to
be observed. Once the task analysis is over, the expert
simulates the actions of the potential user and identifies the
problems the user is supposed to find. As Rieman, Franzke,
and Redmiles (1995) claim, this technique is based on three
elements: ‘‘(1) a general description of who the users will
be and what relevant knowledge they possess, (2) a specific
description of one or more representative tasks to be per-
formed with the system, and (3) a list of the correct actions
required to complete each of these tasks with the interface
being evaluated’’ (p. 387).
The experts perform the walkthrough by asking themselves a set of questions for each subtask (Lewis and Rieman 1993; Polson et al. 1992; Wharton et al. 1994):

- The user sets a goal to be accomplished with the system (for example, checking the spelling of this document).
- The user searches the interface for currently available actions (menu items, buttons, command-line inputs, etc.).
- The user selects the action that seems likely to lead to progress towards the goal.
- The user performs the selected action and evaluates the system's feedback for evidence that progress is being made towards the current goal.
Thinking aloud
Also known as verbal protocol analysis, this technique has been widely applied in the study of consumer and judgement-making processes (Bellman and Park 1980; Bettman 1979; Biehal and Chakravarti 1982a, b, 1986, 1989; Green 1995; Kuusela et al. 1998). In describing this user-based evaluation process, Kuusela and Pallab (2000) state: "The premise of this procedure is that the way subjects search for information, evaluate alternatives, and choose the best option can be registered through their verbalization and later be analysed to discover their decision processes and patterns. Protocol data can provide useful information about cue stimuli, product associations, and the terminology used by consumers" (p. 388). The TA can be performed according to two main experimental procedures: the first, and the most popular, is the concurrent verbal protocol, with which data are collected during the decision task;
the second procedure is the retrospective verbal protocol,
with which data are collected when the decision task is over.
Our experimental work has used the concurrent TA because
it is one of the most frequently applied techniques of verbal
reporting used in HCI studies. Indeed, in the concurrent TA,
users express their problems, strategies, stress and impres-
sions without the influence of a ‘‘rethinking’’ perception, as
happens in retrospective analysis (Borsci and Federici 2009;
Federici and Borsci 2010; Federici et al. 2010a;b).
Each test was performed at the laboratory of cognitive
psychology of the Sapienza University of Rome with a
specific setting represented in Fig. 2.
Apparatus
Each participant used an Intel Pentium 4 computer with 4 GB of RAM, a GeForce 8800 video card, and a Creative Sound Blaster X-Fi audio card. The monitor was a 19-inch SyncMaster 900p, and the speakers were two Creative GigaWorks T20 Series II. Each test was video recorded with a 3-megapixel Sony camera, and each user's screen activity was recorded with the CamStudio 2.0 screen recorder. A 28-inch Sony monitor was used by the expert to monitor the user's actions. Each user used Internet Explorer 8 as a browser.
Target websites
http://www.serviziocivile.it was chosen as the target website. It was selected from the websites of the Italian Public Administration considered accessible by the CNIPA evaluation (http://www.pubbliaccesso.gov.it/logo/elenco.php). We chose serviziocivile.it for two main reasons: (1) from a structural point of view, it offers a large quantity of information collected in a large number of pages; (2) from the point of view of the analysis we had to carry out, the fact that the website's target users are people between 18 and 28 years old eased our enrolment of participants for the usability evaluation samples involved in the user-based evaluations.
The expert-based and user-based analyses were carried
out on four scenarios. These scenarios were created and
approved by three external evaluators with more than
3 years of experience in the field. These evaluators did not
participate in the experimental sessions.
Procedure
Experts group
In a meeting with all experts, the evaluation coordinator (the
second author of this paper) presented the procedure, goals
and scenarios provided by three external experts with more
than 5 years of experience in accessibility and usability
evaluation. Then, all experts were invited to evaluate the
system through a CW and to provide independent evaluations.
Users group
After 20 min of free navigation as training, users started
the TA evaluation following four scenarios (see Appendix).
The evaluation coordinator reported all problems identified
in the TA session, and checked and integrated the report
using the video of verbalization and mouse action recorded
by CamStudio.
Measurements and tools
We compared the number of evaluators needed in order to
achieve identification of 80% of problems, applying both
the ROI and our bootstrap model. The analysis was carried
out using SPSS 16 and Matlab software.
Alternative model of estimation
In order to identify the number of experts and users needed to detect more than 80% of problems, we must obtain the best fit with our results. Our model must also provide an estimation of those parameters able to represent the properties of the interface and the representativeness of the sample.

Fig. 2 The users' experimental setting for the TA analysis (the user, the facilitator, the evaluation coordinator, and a monitor showing the user's actions)

The bootstrap analysis was used in order to obtain the following parameters:
- All the possible discovery behaviours of participants. Considering our 5,000 possible bootstrap samples (with repetition), at each bootstrap step a subsample composed of collected data (i.e. the identified problems) presented in a random order was selected. The maximum value of collected problems represents our maximum limit value (indicated below in (8) as a). This value indicates the representativeness of our sample.
- A rule in order to select the representative data. As representative data for the subsamples, we used the normalized mean of the number of problems found by each subsample (indicated below in (8) as p). As already mentioned, p is the estimated probability of the detection of a generic problem by an evaluator in the chosen population.
The model expressed below in (8) represents the best fit of the data obtained by the bootstrapped subsamples of the expert and user groups:

F(L) = Nt[a − (1 − p)^(L+q)]    (8)
In (8), Nt represents the total number of problems in the interface, and the q variable expresses the hypothetical condition L = 0 (an analysis without evaluators). In other words, since F does not vanish when L = 0, F(0) represents the amount of evident problems that can be effortlessly detected by any subject, and q the possibility of detecting a certain number of problems that have already been identified (or are evident) and were not fixed by the designer:

F(0) = Nt[a − (1 − p)^q]    (9)

The value q represents the properties of the interface from the evaluation perspective. This is, at least, the "zero condition" of the interface properties.
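To make the fitting step concrete, here is a small sketch (ours; the study used Matlab's "Fit" function) that fits Equation (8) to a mean discovery curve such as the one produced by the bootstrap step in the Experiment section, then reads off F(0) and the smallest sample exceeding 80% of Nt. The data, Nt and the scipy substitution are assumptions for illustration only.

```python
# Sketch: fitting F(L) = Nt[a - (1 - p)^(L + q)] (Equation 8) with scipy in
# place of Matlab's "Fit"; Nt and the mean discovery curve are assumed values.
import numpy as np
from scipy.optimize import curve_fit

Nt = 39                                        # assumed total number of problems
L_values = np.arange(1, 21, dtype=float)
# Assumed mean cumulative discovery curve (e.g. from the bootstrap step above)
mean_found = Nt * (1.0 - (1.0 - 0.12) ** L_values)

def bdb_curve(L, a, p, q):
    """Equation (8): expected number of problems found by L evaluators."""
    return Nt * (a - (1.0 - p) ** (L + q))

(a, p, q), _ = curve_fit(bdb_curve, L_values, mean_found,
                         p0=[0.9, 0.1, 1.0], bounds=(0.0, [2.0, 1.0, 10.0]))
print(f"a = {a:.3f}, p = {p:.3f}, q = {q:.3f}")
print(f"F(0) = {bdb_curve(0.0, a, p, q):.2f} evident problems (Equation 9)")

L = 1
while bdb_curve(float(L), a, p, q) < 0.8 * Nt and L < 100:
    L += 1
print(f"{L} evaluators needed to exceed 80% of problems under the fitted model")
```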
Results obtained by applying the ROI model
Experts identified 46 problems with a value of λ equal to 0.26. The number of experts needed to find 80% of problems equalled 6 (Fig. 3).

Users identified 39 problems with a value of λ equal to 0.22. The number of users needed to find 80% of problems equalled 7 (Fig. 4).
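These sample sizes follow directly from formula (1); as a quick check (our arithmetic, using the λ values just reported), the smallest n with 1 − (1 − λ)^n ≥ 0.80 is 6 for λ = 0.26 and 7 for λ = 0.22.

```python
# Quick check of the ROI sample sizes reported above, using formula (1).
from math import ceil, log

for group, lam in (("experts", 0.26), ("users", 0.22)):
    n = ceil(log(1 - 0.80) / log(1 - lam))   # smallest n with 1-(1-lam)^n >= 0.80
    print(f"{group}: lambda = {lam} -> {n} evaluators for 80% of problems")
```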
Results obtained by applying the BDB
Experts group
Applying our model to the data obtained by the experts
with the classic CW, we obtained the probable discovery
likelihood expressed in Fig. 5.
Fig. 3 The discovery likelihood of the experts group (number of experts, 1–20, against discovery likelihood, 20–100%)
Fig. 4 The discovery likelihood of the users group (number of users, 1–20, against discovery likelihood, 20–100%)
Fig. 5 Discovery likelihood of experts with CW estimated by 5,000 bootstrap samplings
The values of the parameters needed for calculating the model (a, p, q) are reported in Table 1. Our results show that 13 experts are needed for the evaluation in order to identify more than 80% of problems (the number of evident problems, F(0), is 3.77).
Users group
The data for the user group were processed as for the
experts’ one. The probable discovery likelihood of the user
sample is reported in Fig. 6.
The parameters obtained are reported in Table 2. Applying these parameters to Equation (8), the result shows that 20 users are needed for the evaluation in order to identify more than 80% of problems (the number of evident problems, F(0), is approximately 6).
The convergent validity of the BDB model
In order to verify the significance of the results obtained through the BDB model (see footnote 2), a convergent validity test was carried out by comparing the results with those obtained using the MC method (Lewis 2001, 2006).

The results (Tables 3 and 4) show that the λ values obtained using the two techniques nearly overlap. These results confirm the validity of the BDB model with respect to the MC method.
The aim of this work is not to provide a new method for estimating the value of λ, since the BDB model does not discuss the discovery rate obtained using practitioners' data. The BDB model should be applied when the λ value has already been estimated using a test with three or five users, in order to calculate how many users are required to detect more than 80% of the errors in a target interface. By doing so, the BDB model enlarges the perspective of analysis by adding two new parameters which are not considered in the classic estimation model: this model considers all of the possible discovery behaviours of participants (a) and encompasses a rule for the selection of representative data (q). These parameters take into account the variability of the different interfaces (q) and the different behaviours of the samples used in a usability study (a). However, our model does not supersede the λ value estimation obtained using the classic ROI model or by GT adjustment. In fact, by using the BDB model, practitioners might receive confirmation that the number of users/experts involved in their sample test is already sufficient for a reliable evaluation.
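For readers who want to reproduce the comparison, the Monte Carlo step can be sketched as follows (our illustration, with a synthetic matrix; Lewis 2001 describes the original procedure): for each sample size n, random subsets of n evaluators are drawn and λ is computed only over the problems that the subset actually discovered, which helps explain why the small-sample MC estimates in Tables 3 and 4 are higher than the 20-user and 13-expert values.

```python
# Sketch of the Monte Carlo lambda estimate for a given sample size n
# (cf. Lewis 2001); the discovery matrix is synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
discovery = rng.random((20, 39)) < 0.12        # rows = evaluators, cols = problems

def mc_lambda(matrix: np.ndarray, n: int, n_iter: int = 5000) -> float:
    """Mean detection rate over random subsets of n evaluators, counted only
    on the problems that each subset actually discovered."""
    rates = []
    for _ in range(n_iter):
        rows = rng.choice(matrix.shape[0], size=n, replace=False)
        sub = matrix[rows]
        found = sub[:, sub.any(axis=0)]        # drop problems the subset never found
        if found.size:
            rates.append(found.mean())
    return float(np.mean(rates))

for n in (3, 6, 10, 20):
    print(f"n = {n:2d}: lambda = {mc_lambda(discovery, n):.3f}")
```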
Discussion
The results given by the BDB are very different from those obtained by the ROI model. For the ROI model, the estimation of the costs of the usability evaluation required a sample composed of 6 experts and one of 7 users, while applying the BDB the estimation shows that a sample of 13 experts and one of 20 users are needed in order to identify more than 80% of problems. As a consequence, in following the BDB model, there is an increase in the usability evaluation costs with respect to the data provided by the ROI model. However, the increase in costs enlarges the evaluator's perspective by providing a more reliable set of results. Indeed, the BDB approach allows the behaviour of the whole population (parameter a), the representativeness of the sample data (i.e. the problems found, expressed by the parameter p) and the different properties of the interface (parameter q) to be taken into account.

Table 1 The values of the parameters a, p and q calculated by the bootstrap of the experts' data

Parameter   Value     Confidence interval
a           0.9623    (0.9583–0.9664)
p           0.1414    (0.1387–0.1440)
q           0.8356    (0.7706–0.9006)

Table 2 The values of the parameters a, p and q calculated by the bootstrap of the users' data

Parameter   Value     Confidence interval
a           0.8691    (0.8440–0.8942)
p           0.1235    (0.1116–0.1355)
q           2.3910    (2.0980–2.6850)

Fig. 6 Discovery likelihood of users with TA estimated by the bootstrap analysis

2 In the review phase of this work, a reviewer claimed that "The authors should do a Monte Carlo resampling exercise to assess the extent to which randomly selected sets of 6 experts (for the CW data) and 7 users (for the TA data) find or fail to find at least 80% of the problems discovered by the full samples", since, according to the reviewer's opinion, "The authors simply state the different sample size estimates and appear to assume that the BDB are correct without further evaluation or any tests of significance". In accordance with the reviewer's suggestions, we have added this section.
Conclusion
The BDB, while respecting the assumption of the ROI
model, opens a new perspective on the discovery likelihood
and on the costs of usability evaluation. Indeed, the pos-
sibility of considering both the properties of the interface
and the representativeness of data grants the practitioner a
representative evaluation of the interface. A practitioner can run a test by applying the BDB model after the first five experts and users (i.e. after applying the ROI model) in order to estimate the parameters a, p, and q and the number of evaluators needed for an evaluation that considers the specific properties of the interface and the representativeness of the sample. In this sense, in the evaluation a practitioner can take into account both the BDB model and the ROI one.
Our perspective offers a new model for the usability
evaluation, guaranteeing the representativeness of the data
and overcoming the deficiencies of the ROI model. In this
sense, the increase in costs is justified by the possibility of
obtaining representativeness of the entire potential popu-
lation with a small sample.
Appendix: User scenarios
1. A friend of yours is enrolled on a 1-year activity in
social service. You are interested in finding more
information about social service activities and acquir-
ing information in order to apply for a one-year job.
Go to the website http://www.serviziocivile.it/, find
that information and download the documents for the
job application.
2. A friend of yours, who lives in Rome, has some
internet connection problems, so he or she telephones
you for assistance. In fact, he or she is interested in
social service work, but he or she does not know where
the office is and when it is open in order to present his
curriculum vitae. Go to the website http://www.
serviziocivile.it/ in order to find that information for
him or her.
3. You are interested in social service activities, so you
go to the website http://www.serviziocivile.it/ in order
to see whether this website offers a newsletter service,
even though you are not enrolled on the social service
activities. If the newsletter service requires you to log in, sign up for the newsletter.
4. A friend of yours is working on a 1-year social service
project in the Republic of the Philippines. You are
interested in applying for a job on this project. Go to
the website http://www.serviziocivile.it/ in order to
find information about the project and whether it is possible to obtain a job.
References
Bellman JR, Park CW (1980) Effects of prior knowledge and
experience and phase of the choice process on consumer decision
processes: a protocol analysis. J Consum Res 7(3):234–248
Bettman JR (1979) An information processing theory of consumer
choice. Addison-Wesley, Cambridge
Biehal G, Chakravarti D (1982a) Experiences with the Bettman-Park
verbal-protocol coding scheme. J Consum Res 8(4):442–448
Biehal G, Chakravarti D (1982b) Information-presentation format and
learning goals as determinants of consumers’ memory retrieval
and choice processes. J Consum Res 8(4):431–441
Biehal G, Chakravarti D (1986) Consumers’ use of memory and
external information in choice: Macro and micro perspectives.
J Consum Res 12(4):382–405
Biehal G, Chakravarti D (1989) The effects of concurrent verbaliza-
tion on choice processing. J Mark Res 26(1):84–96
Borsci S, Federici S (2009) The partial concurrent thinking aloud: a
new usability evaluation technique for blind users. In: Emiliani
PL, Burzagli L, Como A, Gabbanini F, Salminen A-L (eds)
Assistive technology from adapted equipment to inclusive
environments—AAATE 2009, vol 25. Assistive technology
research series. IOS Press, Florence, pp 421–425. doi:10.3233/
978-1-60750-042-1-421
Caulton D (2001) Relaxing the homogeneity assumption in usability testing. Behav Inf Technol 20(1):1–7. doi:10.1080/01449290010020648
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann
Statist 7(1):1–26. doi:10.1214/aos/1176344552
Table 3 A comparison of the values of λ obtained using the BDB model with those obtained using MC resampling with 3, 6, 10 and 20 users

BDB λ value with 20 users   MC λ value with 20 users   MC λ value with 10 users   MC λ value with 6 users   MC λ value with 3 users
0.123                       0.119                      0.243                      0.266                     0.364

Table 4 A comparison of the values of λ obtained by the BDB model with those obtained using MC resampling with 3, 6 and 13 experts

BDB λ value with 13 experts   MC λ value with 13 experts   MC λ value with 6 experts   MC λ value with 3 experts
0.141                         0.165                        0.235                       0.313
Federici S, Borsci S (2010) Usability evaluation: models, methods, and
applications. International encyclopedia of rehabilitation. Center
for International rehabilitation research information and exchange
(CIRRIE), Buffalo. http://cirrie.buffalo.edu/encyclopedia/article.php?id=277&language=en. Accessed 20 Sept 2010
Federici S, Borsci S, Mele ML (2010a) Usability evaluation with screen
reader users: a video presentation of the PCTA's experimental
setting and rules. Cogn Process 11(3):285–288. doi:10.1007/
s10339-010-0365-9
Federici S, Borsci S, Stamerra G (2010b) Web usability evaluation
with screen reader users: implementation of the partial concur-
rent thinking aloud technique. Cogn Process 11(3):263–272. doi:
10.1007/s10339-009-0347-y
Fox J (2002) An R and S-PLUS companion to applied regression. SAGE, California
Goodstein RL (1963) Boolean algebra. Pergamon Press, Oxford
Green A (1995) Verbal protocol analysis. Psychologist 8(3):126–129
Hertzum M, Jacobsen NE (2003) The evaluator effect: a chilling fact
about usability evaluation methods. Int J Hum Comput Interact
15(4):183–204. doi:10.1207/S15327590IJHC1501_14
Kuusela H, Pallab P (2000) A comparison of concurrent and
retrospective verbal protocol analysis. Am J Psychol 113(3):
387–404
Kuusela H, Spence MT, Kanto AJ (1998) Expertise effects on
prechoice decision processes and final outcomes: A protocol
analysis. Eur J Mark 32(5/6):559
Lewis JR (1994) Sample sizes for usability studies: additional
considerations. Hum Factors 36(2):368–378
Lewis JR (2001) Evaluation of procedures for adjusting problem-
discovery rates estimated from small samples. Int J Hum Comput
Interact 13(4):445–479
Lewis JR (2006) Sample sizes for usability tests: mostly math, not
magic. Interactions 13(6):29–33. doi:10.1145/1167948.1167973
Lewis C, Rieman J (1993) Task-centered user interface design: a prac-
tical introduction. http://users.cs.dal.ca/~jamie/TCUID/tcuid.pdf.
Accessed 20 Jun 2010
Nielsen J (2000) Why you only need to test with 5 users. www.useit.com/alertbox/20000319.html. Accessed 20 Jun 2010
Nielsen J, Landauer TK A mathematical model of the finding of
usability problems. In: Proceedings of the INTERACT ‘93 and
CHI ‘93 Conference on human factors in computing systems,
Amsterdam, 24–29 Apr 1993. ACM, New York, NY, USA,
pp 206–213
Nielsen J, Mack RL (eds) (1994) Usability inspection methods. Wiley,
New York
Polson PG, Lewis C, Rieman J, Wharton C (1992) Cognitive
walkthroughs: a method for theory-based evaluation of user
interfaces. Int J Man Mach Stud 36(5):741–773. doi:10.1016/
0020-7373(92)90039-N
Rieman J, Franzke M, Redmiles D Usability evaluation with the
cognitive walkthrough. In: Conference companion on human
factors in computing systems, Denver, Colorado, United States,
1995. ACM, 223735, pp 387–388. doi:10.1145/223355.223735
Schmettow M Heterogeneity in the usability evaluation process. In:
Proceedings of the 22nd British HCI group annual conference on
people and computers: culture, creativity, interaction—Volume
1, Liverpool, United Kingdom, 2008. British Computer Society,
1531527, pp 89–98
Spool J, Schroeder W Testing web sites: Five users is nowhere near
enough. In: CHI ‘01 extended abstracts on human factors in
computing systems, Seattle, Washington, 2001. ACM, 634236,
pp 285–286. doi:10.1145/634067.634236
Turner CW, Lewis JR, Nielsen J (2006) Determining usability test
sample size, vol 2. International encyclopedia of ergonomics and
human factors, Second edn. CRC Press, Boca Raton
Virzi RA (1990) Streamlining the design process: running fewer
subjects. Human factors and ergonomics society annual meeting
proceedings 34:291–294
Virzi RA (1992) Refining the test phase of usability evaluation: how
many subjects is enough? Hum Factors 34(4):457–468
Wharton C, Rieman J, Lewis C, Polson PG (1994) The cognitive
walkthrough method: a practitioner’s guide. In: Nielsen J, Mack RL
(eds) Usability inspection methods. Wiley, New York, pp 105–140
Woolrych A, Cockton G Why and when five test users aren’t enough.
In: Vanderdonckt J, Blandford A, Derycke A (eds) Proceedings
of IHM-HCI 2001 conference, Toulouse, FR, 10–14 Sept 2001.
Cépaduès-Éditions, pp 105–108
Wright PC, Monk AF (1991) A cost-effective evaluation method for
use by designers. Int J Man Mach Stud 35(6):891–912. doi:
10.1016/s0020-7373(05)80167-1

Supplementary resource (1)

... This rule, known as the five-user assumption, proposes a one-size-fits-all solution in which five users are considered enough for reliable usability testing. The five-user assumption has, however, been strongly criticized in the literature, notably because the (Return on Investment (ROI) based) estimation model behind it was too optimistic [30][31][32][33][34][35][36][37]. In fact the p-value in the ROI model was estimated as the problems identified by each users against the total number identified by the cohort. ...
... Federici [30,31], is another re-sampling method that adopts a bootstrapping approach [42,43]. ...
... The estimation of the final number of users for an evaluation sample can be calculated by inserting the p-value into the following well-known error distribution formula [22,23,25,29,31]: ...
Article
Full-text available
Before releasing a product, manufacturers have to follow a regulatory framework and meet standards, producing reliable evidence that the device presents low levels of risk in use. There is, though, a gap between the needs of the manufacturers to conduct usability testing while managing their costs, and the requirements of authorities for representative evaluation data. A key issue here is the number of users that should complete this evaluation to provide confidence in a product's safety. This paper reviews the US FDA's indication that a sample composed of 15 participants per major group (or a minimum of 25 users) should be enough to identify 90-97% of the usability problems and argues that a more nuanced approach to determining sample size (which would also fit well with the FDA's own concerns) would be beneficial. The paper will show that there is no a priori cohort size that can guarantee a reliable assessment, a point stressed by the FDA in the appendices to its guidance, but that manufacturers can terminate the assessment when appropriate by using a specific approach - illustrated in this paper through a case study - called the 'Grounded Procedure'.
... While the reduction of assessment costs remains a key outcome associated with p-value estimation, the debate has become increasingly focused on the reliability of the data gathered when using a small sample of users [Borsci et al. 2012; Nielsen 2012; Sauro and Lewis 2012; Schmettow 2012]. The approach taken to p-value estimation is also relevant to both issues of cost and reliability, with different approaches having focused on such factors as the order of the subjects in the evaluation [Nielsen and Landauer 1993], the nature of the errors and problems identified by the sample [Turner, Lewis and Nielsen 2006], and the properties of the interface [Borsci et al. 2011]. From the 1960s to the 1980s two main barriers prevented developers from adopting a systematic approach to evaluation in the design cycle. ...
... Lewis Lewis 2000] applied this in conjunction with the Good-Turing procedure and showed that it delivers a more conservative and more reliable value of p than the classic ROI model. Third, the Bootstrap Discovery Behavior model, proposed by Borsci, Londei, and Federici [2011; see also Borsci, Federici, Mele, Polimeno, & Londei, 2012] is another re-sampling method that builds on the Good-Turing and Monte Carlo methods. It adopts a bootstrapping approach [Efron 1979; Fox 2002] and modifies the ROI equation (1) as follows: ...
Article
Full-text available
The debate concerning how many participants represents a sufficient number for interaction testing is well-established and long-running, with prominent contributions arguing that five users provide a good benchmark when seeking to discover interaction problems. We argue that adoption of five users in this context is often done with little understanding of the basis for, or implications of, the decision. We present an analysis of relevant research to clarify the meaning of the five-user assumption and to examine the way in which the original research that suggested it has been applied. This includes its blind adoption and application in some studies, and complaints about its inadequacies in others. We argue that the five-user assumption is often misunderstood, not only in the field of Human-Computer Interaction, but also in fields such as medical device design, or in business and information applications. The analysis that we present allows us to define a systematic approach for monitoring the sample discovery likelihood, in formative and summative evaluations, and for gathering information in order to make critical decisions during the interaction testing, while respecting the aim of the evaluation and allotted budget. This approach -- which we call the Grounded Procedure -- is introduced and its value argued.
... From a statistical perspective, the current estimation procedure is based on a model of how the usability problems are detected; this is considered to be a binomial process. The literature suggests that the total number of usability problems can be estimated from the discovery matrix's problem margin (the sum of the columns) (Kanis 2011;Lewis 2001;Hertzum and Jacobsen 2003;Schmettow 2012;Borsci, Londei, and Federici 2011). However, this estimation is complicated by (i) the small sample size usually encountered in usability testing of medical P a g e 118 | 207 devices (Faulkner 2003) and (ii) as-yet unobserved problems that truncate the margin and bias estimates (Lewis 2000;Sauro and Lewis 2016;Thomas and Gart 1971). ...
Thesis
Introduction. Une information nouvelle possède une valeur si elle permet de réduire le risque de prendre une mauvaise décision. En utilisant une caractérisation bayésienne de l’incertitude, les méthodes fondées sur la valeur de l’information (VoI) estiment cette valeur au regard des conséquences associées à la mauvaise décision. Du fait de ses spécificités, le dispositif médical constitue un domaine de choix pour ces méthodes. L’objectif de cette thèse était, à partir d’exemples sélectionnés, d’illustrer dans quelle mesure les méthodes VoI pouvaient être utiles lors des décisions prises dans la vie du dispositif médical.Méthodes. Deux cadres d’utilisation décrits dans la littérature ont été utilisés : la priorisation des efforts de recherche et l’optimisation des designs d’études. Ils ont été appliqués à deux temps distincts de l’évaluation du dispositif médical : la détermination des études post-inscription (EPI) demandées à l’occasion de la demande de remboursement et la détermination de la taille d’échantillon lors de l’évaluation précoce de l’utilisabilité réalisée par l’industriel en vue de l’obtention du marquage CE. Ces deux applications ont nécessité de préciser le contexte décisionnel (ensemble des choix, fonction-objectif des parties prenantes, etc.) de développer le modèle d’aide à la décision correspondant à la perspective décisionnelle adoptée et enfin, de caractériser l’incertitude. Les exemples concernaient respectivement les demandes d’EPI pour les endoprothèses aortiques et l’évaluation de l’utilisabilité d’un dispositif innovant d’auto-injection d’adrénaline. Ces exemples ont permis d’identifier les conditions de mise en oeuvre des analyses VoI en termes de données requises, de délai, et de complexité.Résultats. L’analyse menée sur l’exemple des endoprothèses aortiques a requis une re-paramétrisation de l’ensemble du modèle d’aide à la décision existant pour : incorporer les données françaises, intégrer l’incertitude relative à l’ensemble des paramètres en tenant compte des corrélations existantes, et prendre en compte l’hétérogénéité des effets selon les sous-groupes de patients. Sur l’ensemble de la population, l’EVPI élevée quel que soit le critère de jugement justifiait d’envisager l’acquisition de données supplémentaires. Les calculs de l’EVPPI confirmaient l’intérêt d’une d’EPI sur les paramètres d’efficacité, particulièrement sur le long terme. La prise en compte de l’hétérogénéité, originale tant d’un point de vue méthodologique qu’applicatif, a montré qu’il aurait été pertinent de restreindre ces EPI aux patients jeunes et en bon état général. Ce qui n’était pas envisagé dans les EPI demandées aux industriels.Dans le cadre des études d’utilisabilité, le développement de novo d’un modèle statistique de la découverte des erreurs d’usage a été rendu nécessaire au regard des insuffisances des modèles existants. En effet, notre approche reposant sur la modélisation de la matrice complète de découverte des erreurs dominait les modèles existants en termes de biais, de consistance et de probabilité de couverture de l’intervalle de confiance, notamment pour des petits échantillons. Cette approche originale a permis d’intégrer la fonction-objectif du décideur dans la détermination de la taille optimale des échantillons, dans une logique de gestion, plutôt que d’évitement du risque. Les tailles d’échantillon estimées étaient plus importantes que dans la littérature (environ 100 participants). 
L’implémentation de notre méthode est permise par la mise à disposition de l’outil de calcul en libre accès. Plusieurs enseignements émergent de ces applications [...]
... Accordingly, to fully model the perceived experience of a user, practitioners should include a set of repeated objective and subjective measures in their evaluation protocols to enable satisfaction and benefit analysis as a "subjective sum of the interactive experience" [4]. Several standardized tools have been developed to measure satisfaction, realization of benefit and perceived usability of user with and without disabilities [5][6][7][8][9][10][11]. It is also well known that if the UX of a product is assessed at the end of the design process, product changes are much more expensive than if the same evaluation were conducted throughout the development process (i.e., according to a usercentered design, UCD) [5,12]. ...
Chapter
Full-text available
To fully model the perceived experience of a user, practitioners should include a set of repeated objective and subjective measures in their evaluation protocols to enable satisfaction and benefit analysis as a “subjective sum of the interactive experience.” It is also well known that if the UX of a product is assessed at the end of the design process, product changes are much more expensive than if the same evaluation were conducted throughout the development process. In this study, we aim to present how these concepts of UX and UCD inform the process of selecting and assigning assistive technologies (ATs) for people with disabilities (PWD) according to the Matching Person and Technology (MPT) model and assessments. To make technology the solution to the PWD’s needs, the MPT was developed as an international measure evidence-based tool to assess the best match between person and technology, where the user remains the main actor in all the selection, adaptation, and assignment process (user-driven model). The MPT model and tools assume that the characteristics of the person, environment, and technology should be considered as interacting when selecting the most appropriate AT for a particular person’s use. It has demonstrated good qualitative and quantitative psychometric properties for measuring UX, realization of benefit and satisfaction and, therefore, it is a useful resource to help prevent the needs and preferences of the users from being met and can reduce early technology abandonment and the consequent waste of money and energy.
... However, even for those applications, national and international regulations (ie, CE marking in Europe or FDA regulations) and harmonized standards (ie, EN 62366 advised for CE marking in Europe) strengthened the requirements for premarket certifications but did not standardize a threshold for usability or technical performance. We must recognize that recommending the minimum required sample size (eg, 15 users identified by user profile numbers) makes an improvement to the summative usability assessment method [9]. It is also necessary to assess user profiles accurately. ...
Article
Full-text available
This viewpoint argues that the clinical effects of mobile health (mHealth) interventions depends on the acceptance and adoption of these interventions and their mediators, such as usability of the mHealth software, software performance and features, training and motivation of patients and health care professionals to participate in the experience, or characteristics of the intervention (eg, personalized feedback).
... From a statistical perspective, the current estimation procedure is based on a model of how the usability problems are detected; this is considered to be a binomial process. The literature suggests that the total number of usability problems can be estimated from the discovery matrix's problem margin (the sum of the columns) [7][8][9][10][11]. However, this estimation is complicated by (i) the small sample size usually encountered in usability testing of medical devices [12] and (ii) as-yet unobserved problems that truncate the margin and bias estimates [13][14][15]. ...
Article
Full-text available
Background: Usability testing of medical devices are mandatory for market access. The testings' goal is to identify usability problems that could cause harm to the user or limit the device's effectiveness. In practice, human factor engineers study participants under actual conditions of use and list the problems encountered. This results in a binary discovery matrix in which each row corresponds to a participant, and each column corresponds to a usability problem. One of the main challenges in usability testing is estimating the total number of problems, in order to assess the completeness of the discovery process. Today's margin-based methods fit the column sums to a binomial model of problem detection. However, the discovery matrix actually observed is truncated because of undiscovered problems, which corresponds to fitting the marginal sums without the zeros. Margin-based methods fail to overcome the bias related to truncation of the matrix. The objective of the present study was to develop and test a matrix-based method for estimating the total number of usability problems. Methods: The matrix-based model was based on the full discovery matrix (including unobserved columns) and not solely on a summary of the data (e.g. the margins). This model also circumvents a drawback of margin-based methods by simultaneously estimating the model's parameters and the total number of problems. Furthermore, the matrix-based method takes account of a heterogeneous probability of detection, which reflects a real-life setting. As suggested in the usability literature, we assumed that the probability of detection had a logit-normal distribution. Results: We assessed the matrix-based method's performance in a range of settings reflecting real-life usability testing and with heterogeneous probabilities of problem detection. In our simulations, the matrix-based method improved the estimation of the number of problems (in terms of bias, consistency, and coverage probability) in a wide range of settings. We also applied our method to five real datasets from usability testing. Conclusions: Estimation models (and particularly matrix-based models) are of value in estimating and monitoring the detection process during usability testing. Matrix-based models have a solid mathematical grounding and, with a view to facilitating the decision-making process for both regulators and device manufacturers, should be incorporated into current standards.
... The extent to which this should affect the use of the binomial formula in modeling problem discovery is an ongoing topic of research (Briand, El Emam, Freimut, & Laitenberger, 2000; Kanis, 2011; Lewis, 2001; Schmettow, 2008 Schmettow, , 2009). Some researchers have investigated alternative means of discovery modeling that do not make the assumption of the homogeneity of p, including the beta-binomial (Schmettow, 2008), logit-normal binomial (Schmettow, 2009Schmettow, , 2012), bootstrapping (Borsci, Londei, & Federici, 2011), and capture–recapture models (Briand et al., 2000). These more complex alternative models may turn out to be advantageous over the simple binomial model, but they may also have disadvantages, particularly in the sample sizes required to accurately estimate their parameters. ...
Article
Full-text available
The philosopher of science J. W. Grove (1989) once wrote, “There is, of course, nothing strange or scandalous about divisions of opinion among scientists. This is a condition for scientific progress” (p. 133). Over the past 30 years, usability, both as a practice and as an emerging science, has had its share of controversies. It has inherited some from its early roots in experimental psychology, measurement, and statistics. Others have emerged as the field of usability has matured and extended into user-centered design and user experience. In many ways, a field of inquiry is shaped by its controversies. This article reviews some of the persistent controversies in the field of usability, starting with their history, then assessing their current status from the perspective of a pragmatic practitioner. Put another way: Over the past three decades, what are some of the key lessons we have learned, and what remains to be learned? Some of the key lessons learned are:• When discussing usability, it is important to distinguish between the goals and practices of summative and formative usability.• There is compelling rational and empirical support for the practice of iterative formative usability testing—it appears to be effective in improving both objective and perceived usability.• When conducting usability studies, practitioners should use one of the currently available standardized usability questionnaires.• Because “magic number” rules of thumb for sample size requirements for usability tests are optimal only under very specific conditions, practitioners should use the tools that are available to guide sample size estimation rather than relying on “magic numbers.”
Article
Full-text available
Background: Patients with neuromuscular knee instability who are assisted with orthotic devices experience problems including pain, falls, mobility issues and limited engagement in daily activities.
Objectives: The aim of this study was to analyse the current real-life burden, needs and orthotic device outcomes in patients in need of advanced orthotic knee-ankle-foot orthoses (KAFOs).
Methodology: An observer-based, semi-structured telephone interview with orthotic care experts in Germany was applied. Interviews were transcribed and content-analysed. Quantitative questions were analysed descriptively.
Findings: Clinical experts participated from eight centres which had delivered an average of 49.9 KAFOs per year and 13.3 microprocessor stance-and-swing-phase-controlled knee-ankle-foot orthoses (MP-SSCOs) since product availability. Reported underlying conditions comprised incomplete paraplegia (18%), peripheral nerve lesions (20%), poliomyelitis (41%), post-traumatic lesions (8%) and other disorders (13%). The leading observed patient burdens were "restriction of mobility" (n=6), followed by "emotional strain" (n=5) and "impaired gait pattern" (n=4). Corresponding results for potential patient benefits were seen in "improved quality of life" (n=8) and "improved gait pattern" (n=8), followed by "high reliability of the orthosis" (n=7). In total, experts reported falls occurring in 71.5% of patients, at a combined annual frequency of 7.0 fall events per year, when using KAFOs or stance control orthoses (SCOs). In contrast, falls were observed in only 7.2% of MP-SSCO users.
Conclusion: Advanced orthotic technology might contribute to patients' better quality of life, improved gait pattern and perceived reliability of the orthosis. In terms of safety, a substantial decrease in the frequency of falls was observed when comparing KAFO and MP-SSCO users.
Conference Paper
The selection of participants for usability assessment, together with the minimum number of subjects required to obtain a set of reliable data, is a hot topic in Human Computer Interaction (HCI). Although prominent contributions, applying different p estimation models, have argued that five users provide a good benchmark when seeking to discover interaction problems, many studies have challenged this five-user assumption. The sample size topic is today a central issue for the assessment of critical systems, such as medical devices, because shortcomings in the usability, and above all in the safety in use, of these kinds of products may seriously harm the final users. We argue that relying on one-size-fits-all solutions, such as the five-user assumption (for websites) or the mandated size of 15 users per major group (for medical devices), leads manufacturers to release unsafe products. Nevertheless, although there are no magic numbers for determining the cohort size "a priori", by using a specific procedure it is possible to monitor the sample discovery likelihood after the first five users, in order to obtain reliable information about the gathered data and determine whether the problems discovered by the sample have a certain level of representativeness (i.e., reliability). We call this approach the "Grounded Procedure" (GP). The goal of this study is to present the GP's assumptions and steps, exemplifying its application in the assessment of a home medical device.
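A minimal sketch of the kind of monitoring the abstract describes is given below. It is our illustration, not the authors' Grounded Procedure: after each new participant beyond the fifth, it re-estimates the mean discovery likelihood from the observed matrix and projects the proportion of currently known problems covered. The synthetic matrix and parameter values are assumptions.

```python
import numpy as np

def running_discovery(matrix: np.ndarray):
    """matrix: participants x problems, 1 = problem found by that participant."""
    for n in range(5, matrix.shape[0] + 1):      # start monitoring after the first 5 users
        sub = matrix[:n]
        found = sub[:, sub.sum(axis=0) > 0]      # problems observed so far
        p_hat = found.mean()                     # naive mean detection rate (no small-sample adjustment)
        projected = 1.0 - (1.0 - p_hat) ** n     # projected share of the known problem set covered
        yield n, p_hat, projected

# Example run on a small synthetic discovery matrix (assumed data)
rng = np.random.default_rng(0)
demo = rng.binomial(1, 0.3, size=(10, 25))
for n, p_hat, proj in running_discovery(demo):
    print(f"after {n:2d} users: p_hat={p_hat:.2f}, projected discovery={proj:.2f}")
```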
Article
Full-text available
"Really, how many users do you need to test? Three answers, all different." ---User Experience, Vol. 4, Issue 4, 2005
Article
Article
Subjects made an initial choice using external product information. Some concurrently verbalized this choice, whereas others did not. Next, they received more information on new brands and a new attribute for all brands. Both verbalizers and nonverbalizers then made a second choice using some of the first choice information incidentally acquired in memory. All subjects verbalized this second choice. The effects of the first choice verbalization manipulation were examined by analyzing the second choice protocols along with the choice outcome and task perception measures. In comparison with verbalizers, nonverbalizers did more problem framing and brand processing during earlier phases of the second choice. However, choice outcomes did not differ. The findings suggest that the verbalization manipulation may have altered the first choice, creating memory differences that affected some subsequent tasks. Retrieval measures corroborate this conclusion. The concurrent verbal protocol method is evaluated on the basis of these findings.
Article
Recent attention has been focused on making user interface design less costly and more easily incorporated into the product development life cycle. This paper reports an experiment conducted to determine the minimum number of subjects required for a usability test. It replicates work done by Jakob Nielsen and extends it by incorporating problem importance into the curves relating the number of subjects used in an evaluation to the number of usability problems revealed. The basic findings are that (1) with between 4 and 5 subjects, 80% of the usability problems are detected, and (2) additional subjects are less and less likely to reveal new information. Moreover, the correlation between expert judgements of problem importance and likelihood of discovery is significant, suggesting that the most disruptive usability problems are found with the first few subjects. Ramifications for the practice of human factors are discussed as they relate to the type of usability test cycle the practitioner is employing and the goals of the usability test.
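The reported correlation between problem importance and likelihood of discovery can be mimicked in a toy simulation: if more severe problems are more detectable, they tend to surface within the first few subjects. The sketch below is purely illustrative; the severity scale, the link between severity and detection probability, and the sample sizes are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_problems, n_subjects = 30, 10
severity = rng.uniform(1, 5, size=n_problems)        # hypothetical severity ratings (1-5)
p = np.clip(0.10 + 0.08 * severity, 0.0, 1.0)        # detection probability grows with severity

matrix = rng.binomial(1, p, size=(n_subjects, n_problems))
first_found = np.where(matrix.any(axis=0),
                       matrix.argmax(axis=0) + 1,     # subject at which each problem is first detected
                       np.nan)                        # problem never found in this sample

mask = ~np.isnan(first_found)
r = np.corrcoef(severity[mask], first_found[mask])[0, 1]
print(f"severity vs. first-detection subject: r = {r:.2f} (negative: severe problems are found earlier)")
```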
Article
Attention has been given to making user interface design and testing less costly so that it might be more easily incorporated into the product development life cycle. Three experiments are reported in this paper that relate the proportion of usability problems identified in an evaluation to the number of subjects participating in that study. The basic findings are that (a) 80% of the usability problems are detected with four or five subjects, (b) additional subjects are less and less likely to reveal new information, and (c) the most severe usability problems are likely to have been detected in the first few subjects. Ramifications for the practice of human factors are discussed as they relate to the type of usability test cycle employed and the goals of the usability test.
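The diminishing-returns pattern summarised in findings (a) and (b) above can be reproduced with a short Monte Carlo sketch. The detection probability, problem count, and run count below are assumptions chosen only to make the curve visible; they are not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_problems, p_detect, n_runs = 30, 0.32, 2000   # assumed values for illustration

for n_subjects in (1, 2, 3, 4, 5, 8, 10, 15):
    hits = rng.binomial(1, p_detect, size=(n_runs, n_subjects, n_problems))
    found = (hits.sum(axis=1) > 0).mean()        # average share of problems found at least once
    print(f"{n_subjects:2d} subjects -> {found:.0%} of problems detected")
```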
Article
Structural equation models (SEMs), also called simultaneous equation models, are multivariate (i.e., multi-equation) regression models. Unlike the more traditional multivariate linear model, however, the response variable in one regression equation in an SEM may appear as a predictor in another equation; indeed, variables in an SEM may influence one another reciprocally, either directly or through other variables as intermediaries. These structural equations are meant to represent causal relationships among the variables in the model. A cynical view of SEMs is that their popularity in the social sciences reflects the legitimacy that the models appear to lend to causal interpretation of observational data, when in fact such interpretation is no less problematic than for other kinds of regression models applied to observational data. A more charitable interpretation is that SEMs are close to the kind of informal thinking about causal relationships that is common in social-science theorizing, and that, therefore, these models facilitate translating such theories into data analysis. In economics, in contrast, structural-equation models may stem from formal theory. To my knowledge, the only facility in S for fitting structural equation models is my sem library, which at present is available for R but not for S-PLUS. The sem library includes functions for estimating structural equations in observed-variables models by two-stage least squares, and for fitting general structural equation models with multinormal errors and latent variables by full-information maximum likelihood. These methods are covered (along with the associated terminology) in the subsequent sections of the appendix. As I write this appendix, the sem library is in a preliminary form, and the capabilities that it provides are modest compared with specialized structural equation software. Structural equation modeling is a large subject. Relatively brief introductions may be found in Fox (1984: Ch. 4) and in Duncan (1975); Bollen (1989) is a standard book-length treatment, now slightly dated; and most general econometric texts (e.g., Greene, 1993: Ch. 20; Judge et al., 1985: Part 5) take up at least observed-variables structural equation models.
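The observed-variables estimation method mentioned in this appendix excerpt, two-stage least squares, can be sketched in a few lines. The code below is a generic numpy illustration with invented data, not the R sem library or its API: stage one projects the endogenous regressor onto the instruments, stage two regresses the outcome on the fitted values.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
z = rng.normal(size=(n, 2))                     # instruments (assumed exogenous)
u = rng.normal(size=n)                          # structural error
x = z @ np.array([1.0, -0.5]) + 0.8 * u + rng.normal(size=n)   # endogenous regressor
y = 2.0 * x + u + rng.normal(size=n)            # structural equation; true coefficient is 2.0

Z = np.column_stack([np.ones(n), z])            # instrument matrix with intercept
X = np.column_stack([np.ones(n), x])            # regressor matrix with intercept

X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # stage 1: fitted regressors
beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]    # stage 2: 2SLS estimate
print("2SLS estimate of the structural coefficient:", round(beta[1], 3))
```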