
Heterogeneity in the Usability Evaluation Process

Martin Schmettow
University of Passau
Information Systems II
Innstr. 43
94032 Passau, Germany
schmettow@web.de
ABSTRACT
Current prediction models for usability evaluations are based
on stochastic distributions derived from series of Bernoulli pro-
cesses. The underlying assumption of these models is a homogeneous detection probability, although this is intuitively unrealistic. This paper contributes a simple statistical test for the existence
of heterogeneity in the process. The compound beta-binomial
model is proposed to incorporate sources of heterogeneity and
compared to the binomial model. Analysis of several data sets
from the literature illustrates the methods and reveals that het-
erogeneity occurs in most situations. Finally, it is demonstrated
how heterogeneity biases the prediction of evaluation processes.
Open research questions are discussed and preliminary advice for
practitioners for controlling their processes is given.
Categories and Subject Descriptors
H.5.2 [User Interfaces (e.g. HCI)]: Evaluation/methodology
General Terms
Measurement, Human Factors
Keywords
Usability Evaluation, Five Users Debate, Evaluation, Stochastic
Models, Overdispersion, Process Prediction
1. INTRODUCTION
Usability evaluation for finding usability defects is a vital activity
in the development of complex interactive products. According to a well-known law of Software Engineering, the cost of fixing a defect grows more than linearly with how early the defect was introduced and how late it was uncovered [1]. Thus, when usability is critical for business success, a rational choice is to set a very ambitious goal, say 99% of defects detected. This topic was dis-
cussed by Jared Spool (CEO of User Interface Engineering) in
his keynote at the HCI 2007 conference. He argued that with e-
commerce and peer-to-peer business models usability is a crucial
quality which may cost or save a company millions of dollars a
day (the examples were Amazon and eBay).
For usability consulting companies (or departments) this poses considerable problems for planning and managing usability evaluation studies. The first problem is to calculate project costs accurately. At the time of negotiation with the customer, little is known about how many defects are in the system, how easily they can be detected, and how many test persons are needed.
Instead, the contracting manager has to rely on her expectations
based on past experiences. This is problematic, as reflected by
the “Five users is not enough” debate, where it turned out that
evaluation studies vary a lot in the required sample size. The
risks at this stage are underestimating the effort and running into a project deficit. On the other hand, when the calculation is too generous, the customer may choose the tightly calculated offer of a competitor (who may then struggle with project deficits).
Assuming that an evaluation study starts with a reasonable budget, it is still essential to track its progress, meaning the rate of usability defects found at each stage of the study. If Spool's prediction is taken seriously, customers of usability consulting companies may in the future even demand a guarantee for a certain rate of
detected defects. This implies that both – customer and usability
company – have means to accurately estimate what proportion of
defects were truly revealed by the study.
Since the early nineties, this has been the motivation for studying models that describe and predict the evaluation process. The principal problem is that the usability evaluation process is stochastic. That is, there is always random variation in how defects are detected. Consequently, estimators like the detection rate are always uncertain. Fortunately, the precision of estimators usually increases with larger sample sizes. For planning and controlling
the evaluation process this has two implications: First, the more
projects and their outcomes are known to the contracting man-
ager the better will be her guess of required effort. Second, at the
beginning of the evaluation process the estimators are very unre-
liable. This changes when the study proceeds and more data sets
become available. The closer the study approaches the defined
goal, the more reliable will be the estimator, which allows for an
informed decision of whether to test another couple of partici-
pants or not.
But, in order to finally have reliable estimators it is essential
to apply an appropriate statistical model for the process. If the
wrong model is chosen, the random variation will still decrease, but ultimately the estimator will either under- or overestimate the real situation. Either is harmful.
In this paper I will first give a brief overview of past and recent approaches to predicting the evaluation process. Subsequently,
the approaches and their underlying arguments will undergo a
more thorough statistical discussion. In particular, this will show that a basic assumption of these models is violated: the probability of detecting defects is not homogeneous, but heterogeneous.
It differs across inspectors (or test participants) and defects. For
revealing the heterogeneity in the process, a statistical test is de-
veloped which is simple enough to be applied by practitioners
without a comprehensive statistical software at hand.
Furthermore, the concept of compound distributions is introduced, which gives a more accurate description of what is going on in the
process. The compound beta-binomial model allows for estimat-
ing the factors introducing the extra variance. Finally, I will inves-
tigate to what extent incorporating heterogeneity improves mon-
itoring and predicting the evaluation process.
All models and procedures introduced here are applied to five real
data sets previously published. Two of these are from think-aloud
testing studies, three are inspection-based evaluations. Please note that, from the mostly mathematical perspective taken throughout the paper, the evaluation method is not essential. Therefore, the term session will always refer to both individual test participants and expert evaluators (inspectors). Accordingly, the term process size denotes the number of independent sessions conducted at a certain stage in the evaluation process.
2. RELATED WORK
One of the first authors examining the stochastic nature of the
usability evaluation process was Virzi [24]. The process of finding usability defects in a series of think-aloud testing sessions was proposed to follow the cumulative function of the geometric distribution (CGF, also known as the curve of diminishing returns).
The CGF plots the relative outcome (i.e. percentage of defects
detected so far) of the process as a function of the process size s
(i.e. number of independent testing sessions or inspections) and
the detection probability p.
$P(\text{identified at least once} \mid n_{\text{Inspectors}}, p) = 1 - (1 - p)^n \qquad (1)$
Virzi ran a series of usability tests and used a Monte Carlo pro-
cedure to estimate the amount of defects found with each process
size s. Notably, Virzi was already aware of heterogeneity in the process: he estimated different curves for defects grouped by a severity ranking. But finally, the average detection probability was estimated to be p = .35 and it was concluded that under these circumstances five sessions suffice to find approximately 80% of the defects.
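For illustration, equation (1) and the process size it implies for a given goal are easy to evaluate in R, the statistical environment used later in this paper (a minimal sketch; the function names are chosen here for illustration and are not part of the original work):

```r
# Cumulative geometric function (CGF): expected proportion of defects
# found after s independent sessions with homogeneous detection probability p
cgf <- function(s, p) 1 - (1 - p)^s

# Smallest process size that reaches a target proportion of defects
sessions_needed <- function(goal, p) ceiling(log(1 - goal) / log(1 - p))

cgf(4:5, 0.35)               # about .82 and .88 with Virzi's average estimate
sessions_needed(0.80, 0.35)  # 4 sessions to pass the 80% mark
```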
Nielsen and Landauer continued the topic. They claimed that un-
der the assumption of independence of single evaluation sessions
the margin sum of defects found per session follows a Poisson
distribution [15]. Again, these authors are aware of heterogene-
ity in the evaluation process. In addition to the detectability of
defects they mention the heterogeneity of test persons or inspec-
tors and the possibility that usability problems co-occur, which
violates the assumption of stochastic independence.
This study also revealed that the basic probability varies a lot
between studies [15]. Consequently, one cannot rely upon an a
priori value for p for controlling an evaluation study. A way out
is to estimate the basic probability from the early part of the
study in order to extrapolate the rest of the process towards the
given process goal. A number of procedures have been tried to estimate p for a particular study. Nielsen and Landauer used a least-squares fitting procedure for estimating the basic probability. Later, Lewis introduced another approach by simply estimating p from the response matrix of detection events observed so far. As he had to admit later, this procedure is inherently flawed and leads to an overestimation of p [10].
A main goal of this contribution is to show that the problem of imprecise process prediction is not due to the procedures for estimating p, but to a fundamentally false assumption of the CGF model. The CGF model assumes that there is a homogeneous probability for any defect and session throughout the process. Whereas Virzi and also Nielsen and Landauer argue about differences between sessions and defects regarding detection probability, Lewis seems to mostly ignore this topic.
In fact, a few other authors have treated the problem of hetero-
geneity in a variety of ways. Caulton extended the CGF model for
heterogeneity of groups regarding defect detectability in usabil-
ity testing [5]. It was assumed that there exist distinct subgroups
of test persons that are sensitive to certain subsets of usability
defects. This model may be applicable for situations with highly
diverse user groups. A stricter formulation of this model is to assume that users or evaluators differ in their sensitivity or skill to detect defects. This can be expressed with the following modification to the CGF model, where p_i denotes the sensitivity or skill of an individual person:
$P(\text{identified at least once} \mid p_1 \ldots p_n) = 1 - \prod_{i=1}^{n} (1 - p_i) \qquad (2)$
Woolrych and Cockton [26] re-examined data from a previous study and considered the CGF model risky, because it does not account for heterogeneous detection probability of defects. They suggest replacing p with a kind of density distribution.
Recently, Schmettow and Vietze suggested the Rasch model for
measuring the impact factors of defect detectability and evaluator
skills [22]. Under certain assumptions this model provides a density distribution for p with evaluator skill s_i and defect difficulty-to-detect d_j as parameters.
$P(j \text{ identified by } i \mid s_i, d_j) = \frac{e^{s_i - d_j}}{1 + e^{s_i - d_j}} \qquad (3)$
They suggested that these individual measures may be used for prediction of process outcome. This appears promising, given
that the model turns out to fit the data from real processes.
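Equation (3) is simply the logistic function of the difference between evaluator skill and defect difficulty, which can be sketched in one line of R (the parameter values below are invented for illustration):

```r
# Probability that inspector i with skill s_i detects defect j with
# difficulty-to-detect d_j under the Rasch model (equation 3)
p_rasch <- function(s_i, d_j) plogis(s_i - d_j)

p_rasch(s_i = 1.0, d_j = 0.5)   # skilled inspector, moderately hard defect
p_rasch(s_i = -0.5, d_j = 1.5)  # weak inspector, hard-to-detect defect
```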
Faulkner observed a large variance of outcome in small-sample studies [7]. This was not directly attributed to heterogeneity in the process, but it stresses an important point for cases where
a guaranteed defect detection rate is at stake: For applications
where usability is mission-critical it does not suffice to size the
sample according to the point estimates of the CGF. Instead,
confidence intervals have to be taken into account. Expressed as
a statement of guarantee, this reads like “the study will reveal
99% of existing defects with a probability of 95%”. Determining
the confidence interval for the outcome estimators again requires
an appropriate stochastic process model.
3. DATA SETS
The following statistical approaches to deal with heterogeneity
will be showcased with an analysis of five published data sets.
Three of the data sets were previously used by Lewis to assess
the adjustment terms for p̂ for small sample sizes [10]: the two data sets MANTEL and SAVINGS stem from a publication assessing the performance of the Heuristic Evaluation [16]; the set MACERR is the result of a usability testing study [10]. The data set WC01 [26] is from a usability testing study that was
conducted to verify the results of a previous assessment of the
Heuristic Evaluation – thus, it is a special case in that most
usability defects were already known (a so-called falsification
testing study). Finally, the data set UPI07 is from a small-scale
experiment where a novel inspection method (called Usability
Pattern Inspection) was compared to the Heuristic Evaluation
[20]. The results from both experimental conditions have been
merged for the analysis here. This data set can be obtained from
the accompanying website [19] or on request. Note that in this experiment both groups of participants performed virtually equally, so session heterogeneity is not the first guess. For an overview of the process sizes, numbers of defects and average detection probabilities in the data sets see table 1.
4. TESTING FOR HETEROGENEITY
Several authors have expressed their doubts that a homogeneous
p, as assumed by the CGF model, is appropriate. However, no
Table 1. Usability evaluation data sets

Data Set   Type        p     Sessions  Defects  Ref
MACERR     User Test   .16   15        145      [10]
MANTEL     HE          .36   76        30       [16]
SAVINGS    HE          .26   34        44       [16]
UPI07      HE, UPI     .30   10        35       [20]
WC01       User Test   .43   12        16       [26]
studies have proven heterogeneity with proper inferential statis-
tics so far. In the following, a simple statistical test for hetero-
geneity is developed and applied. It is based on certain properties of the binomial distribution.
4.1 Binomial Sums and Mixtures
Assuming there is a homogeneous p for all detection events, the basic process can be regarded as a series of equal Bernoulli
processes. Under this assumption the CGF model applies for
predicting the outcome after a number of sessions. Another way
to look at the data is to plot a data matrix with two dimensions
– sessions and defects – and denote every successful detection
as 1 and a missed defect as 0. Under homogeneous p the margin sums of the data matrix follow a binomial distribution. For the small example in figure 1 this applies to the number of times a defect is found, S_D ~ B(4, 0.3), and the number of defects found in each session, S_I ~ B(5, 0.3). The binomial distribution has a mean of µ = np and a variance of σ² = np(1-p). Thus we can expect µ = 1.2 and σ² = 0.84 for S_D. Indeed, the observed values for S_D come close to this, with a mean of 1.4 and a variance of 0.8.
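These figures can be checked directly from the margin sums of figure 1 (a minimal sketch in R):

```r
SD <- c(2, 1, 2, 2, 0)  # times each of the five defects was found (figure 1)
n <- 4; p <- 0.3        # four sessions, homogeneous detection probability

c(mu = n * p, sigma2 = n * p * (1 - p))     # expected: 1.20 and 0.84
c(mu_obs = mean(SD), sigma2_obs = var(SD))  # observed: 1.40 and 0.80
```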
Nielsen & Landauer followed this idea, but approximated the
margin sums as a Poisson distribution. This is common practice,
but problematic in this case: The Poisson distribution is only
applicable as a limiting case of the binomial distribution with
a very large series of independent Bernoulli experiments with
a very low probability (n → ∞, p → 0). A rule of thumb states n > 30 and p < .05 as preconditions for approximating a binomial distribution (e.g. [23]). This is hardly fulfilled for the data sets revisited here, where the probability of detecting a defect varies between .12 and .58 and the number of usability defects between 9 and 145. For the general case, the binomial distribution is the
appropriate model of how often a defect is detected and how
many defects each session reveals.
When discussing the Poisson model, Nielsen and Landauer ar-
gue that this model still holds under heterogeneity because of the
Poisson distribution’s additivity property, where the sum of two
Poisson distributed random variables is again Poisson distributed.
This is a weak argument because it does not allow the conclusion that the CGF prediction stays unaffected as well. And, in fact, the situation is completely different for the binomial distribution. Instead, if there is heterogeneity in defects, S_D becomes a mixed binomial distribution, whereas S_I is a sum of binomial variables. As an example consider the case depicted in figure 2 with two sets of n1 = n2 = 20 defects with detection probabilities p1 = .2 and p2 = .3. These n = n1 + n2 = 40 defects are evaluated
      S1  S2  S3  S4   S_D
D1     1   0   0   1     2
D2     0   1   0   0     1
D3     0   1   1   0     2
D4     0   1   1   0     2
D5     0   0   0   0     0
S_I    1   3   2   1

Figure 1. Example data matrix from a usability evaluation study with five defects, four sessions and homogeneous probability p = .3. A successful defect detection event is denoted as 1.
Figure 2. Sums and mixtures of two subsets of defects with differing detectability: two response matrices m1 (n1 defects, p1 = 0.2) and m2 (n2 defects, p2 = 0.3), each over s sessions; combining them yields the overdispersed mixture S_D and the underdispersed sum S_I.
in s = 30 sessions. This results in two dichotomous response matrices m1 [s × n1] and m2 [s × n2], for which the margin sums can be computed. For m1 the number of defects found per session is binomially distributed with S_I1 ~ B(n1, p1), analogously for m2. Orthogonally, the number of times a defect is detected in m1 is S_D1 ~ B(s, p1), and analogously in m2. If we now combine the two matrices into m [s × n] with n = n1 + n2 and two new margin sums S_I and S_D, neither of these margin sums is binomially distributed. Whereas the average margin sum is still np, with p = p1/2 + p2/2, the variances deviate from those of the binomial distribution.
Instead, S_I (the number of defects found in each session) is now the sum of two binomial variables and thus underdispersed, which means that Var(S_I) < np(1-p). Vice versa, S_D (the number of times a defect is detected) is a mixed binomial variable and thus overdispersed, with Var(S_D) > np(1-p). The same phe-
nomena appear in the general case with more than two mixtures
or sums. A general proof that sums of unequal binomial random variables (S_I in the example) are underdispersed relative to the binomial distribution is given by Marshall and Olkin [13] (cited after [18]). The overdispersion of mixed binomial distributions (S_D in the example) is proven by Whitt ([25], cited after [18]).
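The opposite effects of mixing and summing can be reproduced with a short simulation along the lines of figure 2 (a sketch with illustrative names; note that the theoretical underdispersion of S_I is slight and may not be visible in every single run):

```r
set.seed(1)
s <- 30; n1 <- 20; n2 <- 20   # sessions and two subsets of defects
p1 <- 0.2; p2 <- 0.3          # differing detectabilities

# combined dichotomous response matrix (sessions in rows, defects in columns)
m <- cbind(matrix(rbinom(s * n1, 1, p1), nrow = s),
           matrix(rbinom(s * n2, 1, p2), nrow = s))

p  <- mean(m)      # overall detection probability, close to (p1 + p2) / 2
SD <- colSums(m)   # times each defect was detected (mixture)
SI <- rowSums(m)   # defects found per session (sum)

c(binomial = s * p * (1 - p), observed = var(SD))          # overdispersed mixture
c(binomial = (n1 + n2) * p * (1 - p), observed = var(SI))  # slightly underdispersed sum
```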
4.2 A Simple Test on Overdispersion
Nielsen and Landauer restrict their model to the homogeneous
case and assume “that all evaluations [.. . ] find exactly the mean
number of problems”. Even in the homogeneous case this is an oversimplification because the binomial process already imposes a variance of np(1-p). The introduced properties of sums and
mixtures of different binomial variables are the key to determine
whether there is additional variance that does not stem from
a homogeneous Bernoulli series. Establishing a statistical test
specific for overdispersion towards the binomial distribution is
straightforward by means of stochastic Monte-Carlo simulation:
Testing the sessions for homogeneity from the matrix m [s × n] is done in the following steps:
1. Compute the observed variance Var_obs = Var(S_I)
Figure 3. Overdispersion of defect margin sums in the MACERR data set (observed versus expected frequency of how often a defect was detected).
2. Estimate p from the matrix m as the relative amount of detection events
3. Generate s binomial random numbers X_1 ... X_s ~ B(n, p)
4. Compute the variance V = Var(X_1 ... X_s)
5. Repeat 3 and 4 a number of times (e.g. r = 1000), resulting in v_sim = {V_1 ... V_r}.
6. Count the elements of v_sim that are equal to or larger than Var_obs and compute the relative frequency α = |{V ∈ v_sim : V ≥ Var_obs}| / r
The relative frequency α can be interpreted in the usual way under the null hypothesis that the observed margin sum has the variance to be expected under the binomial distribution.
Note that in the case of both being heterogeneous – defects and
sessions – there must appear a slight compensation of overdis-
persion by the underdispersive effect of the orthogonal binomial
sums. But, as Rivest showed, the amount of underdispersion usu-
ally is much weaker than overdispersion [18]. In any case, there
is no risk of overconfidence.
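A minimal R sketch of the six steps for the session margin sums is given below (not the original program; for defect heterogeneity, simply transpose the response matrix before calling the function):

```r
# Monte-Carlo test on binomial overdispersion of the session margin sums.
# m is the dichotomous response matrix with sessions in rows and defects in columns.
overdispersion_test <- function(m, runs = 10000) {
  s <- nrow(m); n <- ncol(m)
  p_hat   <- mean(m)            # step 2: relative amount of detection events
  var_obs <- var(rowSums(m))    # step 1: observed variance of S_I
  var_sim <- replicate(runs,    # steps 3-5: variances under a homogeneous
                       var(rbinom(s, n, p_hat)))  # binomial model
  list(var_obs  = var_obs,
       var_theo = n * p_hat * (1 - p_hat),
       var_sim  = mean(var_sim),
       alpha    = mean(var_sim >= var_obs))  # step 6
}

# defect heterogeneity: overdispersion_test(t(m))
```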
Figure 3 illustrates how this test works. It shows the observed
frequency of how often a defect is detected versus the frequency
expected by the binomial model (from the MAC ERR data set). It
appears that there are too many defects only detected once and
too many detected more than six times. This results in greater
variance which is revealed by the overdispersion test.
4.3 Data Analysis
The Monte-Carlo test on overdispersion with respect to the bino-
mial distribution was implemented as a program in R [17]. The
program takes the dichotomous response matrix (as depicted in figure 1) of a usability evaluation as input and outputs the observed variance in the matrix Var_obs, the theoretical variance Var_theo, the mean of the variances generated with the Monte-Carlo experiment, and the relative frequency α. If the absolute frequency of V ≥ Var_obs is zero, a maximum value for α is printed, based on the number of Monte-Carlo runs.
The five data sets introduced above have undergone the overdis-
persion test with r = 10000 Monte-Carlo runs in both directions: defects found per session and number of times a defect was found. A significant α (α ≤ .05) can be interpreted as evidence of heterogeneity in the skills of inspectors or, respectively, in the sensitivity of participants in a usability testing study.
The results are shown in table 2. In four cases the observed variance was larger than both the theoretical and the simulated variance. In three studies this is highly significant. In one case (UPI07) there is still a tendency. But note that the observed variance is still twice as large as the theoretical one, and this was the smallest sample. Only in the WC01 study is the observed variance below the theoretical value. As explained above, this may be an effect of underdispersion due to sums of binomial random variables. For verification an underdispersion test was conducted, by simply changing the last step in the procedure to
Table 2. Test on heterogeneity of session sensitivity

Data Set   Var_obs   Var_theo   Var_sim   P(Var_obs ≤ Var_sim)
MACERR     97.26     19.76      19.79     < 0.0001 ***
MANTEL     13.25     7.05       7.04      < 0.001 ***
SAVINGS    18.45     8.71       8.73      < 0.001 ***
UPI07      13.39     7.35       7.40      .06 +
WC01       3.17      3.93       3.90      .64
Table 3. Test on heterogeneity of defect detectability

Data Set   Var_obs   Var_theo   Var_sim   P(Var_obs ≤ Var_sim)
MACERR     4.93      2.04       2.04      < 0.0001 ***
MANTEL     546.15    17.86      17.86     < 0.0001 ***
SAVINGS    51.77     6.73       6.68      < 0.0001 ***
UPI07      3.80      2.10       2.11      0.0015 ***
WC01       19.63     2.94       2.96      < 0.0001 ***
α = |{V ∈ v_sim : V ≤ Var_obs}| / r. With a result of α = .37, underdispersion could not be verified for this data set.
Table 3 shows the results of the overdispersion tests regarding defect heterogeneity. In all five studies the observed variance of the margin sums lies significantly above the variance expected under the binomial model.
4.4 Discussion
The overdispersion test is based on a simple procedure and is easy to conduct and interpret, yet it is a powerful tool for revealing heterogeneity – at least in mid- to large-size evaluation studies.
When conducting usability evaluations, one definitely has to ac-
count for heterogeneous detectability of defects. This appeared in
all five data sets, regardless of the evaluation method employed
or other contextual factors.
In most cases one also has to expect session heterogeneity. An
exception is the WC01 study. This is interesting for two reasons:
First, this was a falsification study, where the aim is to verify pre-
viously proposed defects. This most likely results in a very sys-
tematic design of testing tasks, which may in some way equalize the participants' sensitivity. Second, the original authors argued
that the high variance in process outcome (with subsets of three
participants) is due to variance in users [26]. They were wrong as
the overdispersion test shows.
5. DETERMINING HETEROGENEITY
In the previous section the impact of heterogeneity on the margin sums of the response matrix was introduced with a discrete mixture of binomial variables. The overdispersion test can be used to detect heterogeneity easily. But a further aim is to determine the amount of heterogeneity in the sample. For example, one may want to analyse whether a certain method has the desired effect of equalizing the process – by making some intractable defects easier to detect or by helping novices catch up with experts. Another purpose
in evaluation method research is to compare different studies.
Clearly, the results from two evaluation studies (or conditions in a
strictly experimental study) can only be meaningfully compared when the conditions are approximately equal – the variance of the impact factors is one such criterion.
5.1 Fitting the beta-binomial distribution
In general, estimating the variance of the heterogeneity factor requires assuming a certain distribution for p. As p is a probability, the distribution must range in [0, 1]. In such cases the beta distribution is often employed: it has exactly the required range and comes with two parameters, allowing it to take a variety of shapes with arbitrary mean and variance.
If we assume p ~ Beta(a, b) and undertake a series of n Bernoulli experiments with p, then the sum of results is beta-binomially distributed, S_I ~ BetaBin(a, b, n), with mean and variance

$\mu = \frac{na}{a+b} \qquad (4)$

$\sigma^2 = \frac{nab(n+a+b)}{(a+b)^2(1+a+b)} \qquad (5)$
Estimating the parameters a and b from the margin sum is probably not an everyday statistical technique, but a few programs exist with an implementation. In the following, the VGAM package [27] for the statistical computing environment R [17] serves for estimating the parameters. This package allows estimating the parameters of a large variety of distributions with the maximum likelihood (ML) method. The ML method identifies the set of parameters x_1 ... x_k for a certain distribution that is most likely given the observed data D. For some types of distributions there exists an analytical solution (e.g. µ and σ² of the normal distribution), but in most cases numerical algorithms have to be used to find the maximum of the likelihood function L(x_1 ... x_k | D).
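The analyses below rely on VGAM; as a package-free illustration, the same estimation can be sketched in base R by maximizing the beta-binomial log-likelihood numerically (the function names and the log-parameterisation are choices made here, not part of the original analysis):

```r
# Beta-binomial log-likelihood for a margin sum vector x, where every entry
# counts successes out of `size` Bernoulli trials
betabin_loglik <- function(par, x, size) {
  a <- exp(par[1]); b <- exp(par[2])   # log scale keeps a and b positive
  sum(lchoose(size, x) + lbeta(x + a, size - x + b) - lbeta(a, b))
}

# ML estimation of a and b, plus the variance of the underlying beta distribution
fit_betabin <- function(x, size) {
  fit <- optim(c(0, 0), betabin_loglik, x = x, size = size,
               control = list(fnscale = -1))        # maximise
  a <- exp(fit$par[1]); b <- exp(fit$par[2])
  list(a = a, b = b,
       var_p  = a * b / ((a + b)^2 * (a + b + 1)),  # heterogeneity measure
       logLik = fit$value)
}

# Example: defect margin sums S_D out of s sessions
# fit_betabin(colSums(m), size = nrow(m))
```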
There are two restrictions of the beta-binomial model: First, it
relies on a single margin sum vector and thus is only capable
of capturing one heterogeneity factor at a time. Second, the
beta-binomial model is not appropriate in the case of underdis-
persion. As was explained in section 4.1, slight underdispersion
may appear from the sum of binomial random variables, but this
is usually overcompensated by the overdispersive mixture. Con-
sequently, if both factors are mixtures a slight underestimation of
the factor's variance may appear. There are critical cases where underdispersion may happen: first, if one factor is purely binomially distributed and the other is a mixture. Underdispersion may also
arise, when there is stochastic dependence between defects or
sessions. This may happen in usability testing studies, if a severe
defect causes a test person to give up early and in effect “shields”
other defects later in the task flow. Stochastic dependencies be-
tween sessions may appear, if a previously identified defect is
not recorded on later occurrences; obviously this can be avoided
by holding the sessions strictly independent. Fortunately, this is
common practice in industrial and research applications.
In the presence of underdispersion it is not even possible to estimate the parameters of the underlying beta distribution, as this would require the variance of the prior beta distribution to be negative, which cannot happen with any real-valued distribution. Accordingly, the ML estimation will stop at unreasonably high values for a and b and throw an error. It is therefore recommended to
first check for over- and underdispersion with the statistical test
introduced above.
5.2 Model selection criteria
The previous section 4.2 introduced a test that employs a simple
frequentist approach to test for a margin sum having the variance
expected under the binomial model. But, it is a stronger claim that
the beta-binomial model is a better approximation of the evalua-
tion process than the binomial model. In order to decide between
competing models regarding their match to the data, an appro-
priate selection criterion is needed. Because model selection is
a statistical approach rarely found in the HCI literature [4], it will briefly be introduced in the following.
Two straightforward selection criteria are the residual deviance
(e.g. residual sum of squares, RSS) after applying the model or
the value of the maximized likelihood function. The better of two models would have the smaller residual deviance and the larger likelihood. However, this may ignore an important directive for scientific reasoning – Occam's Razor, which demands that the more parsimonious of two theories must be preferred if it has the same explanatory power. In statistical reasoning, parsimony
Table 4. Estimated beta-binomial parameters for session margin sums

Data Set   a       b       Var
MACERR     5.25    26.57   .004
MANTEL     12.57   20.73   .006
SAVINGS    11.76   31.20   .005
UPI07      15.61   36.42   .004
WC01       nc      nc      nc
refers to the number of parameters of the competing models.
Obviously, the more parameters a model has, the more versatile
it is in taking the shape of the observed data. Consequently, a
proper criterion for model selection balances goodness-of-fit and
parsimony of models.
Several so-called information criteria put a penalty on the number of parameters and thus allow for proper model selection respecting the directive of parsimony. One of the most widely known is the Akaike Information Criterion (AIC), which has a deep foundation in mathematical information theory [3]. It is amazingly simple to apply after an ML estimation of a model's parameters (or the special case of the least-squares method with normally distributed errors). Usually, one adds a further correction term for small sample sizes to the AIC. The corrected AIC_c is computed as follows, with the maximized likelihood function L_max, k model parameters and n observations:

$AIC_c = -2 \log L_{max}(x_1 \ldots x_k) + 2k + \frac{2k(k+1)}{n-k-1} \qquad (6)$
As the AIC_c grows with the number of parameters, the model with the lowest value fits the data best with respect to parsimony. The AIC does not assume that one of the model candidates is the true model (which is assumed by the Bayes Information Criterion, see [3] for a thorough discussion). This can in most cases be regarded as a realistic and favorable approach. In the case of modelling evaluation process data there is considerable interest in finding a better alternative to the binomial model, but choosing the beta distribution is more a pragmatic than a well-founded decision. Another issue with interpreting the AIC is that values of log L_max are usually very large in magnitude, making the difference between two AICs appear quite small. This may tempt the naive conclusion that the models do not differ much in explanatory power. In fact, this is not a problem, as it is the absolute difference between AICs, not their relative size, that determines the selection of the best model [3].
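Given a maximized log-likelihood, equation (6) is a one-liner; the comparison below assumes that the log-likelihoods of the binomial (k = 1) and beta-binomial (k = 2) fits have already been obtained for a margin sum vector of length n:

```r
# Corrected Akaike Information Criterion (equation 6)
aicc <- function(logLmax, k, n) -2 * logLmax + 2 * k + 2 * k * (k + 1) / (n - k - 1)

# The candidate with the smaller AICc is preferred, e.g.:
# aicc(logL_binomial, k = 1, n)  versus  aicc(logL_betabinomial, k = 2, n)
```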
5.3 Data Analysis
It was shown so far that heterogeneity due to defect detectability arises in virtually all situations. In most cases one also has to account for session heterogeneity. Consequently, it is to be expected that in most cases the beta-binomial model will fit the data better than the binomial model. Once the two parameters a and b have been estimated, the variance of the underlying beta distribution follows as Var(p) = ab / ((a+b)²(a+b+1)) (cf. equation 5). This may serve as a measure of how large the heterogeneity is.
Beta-binomial parameters are estimated via the ML method (provided by the VGAM package for R [27]). This procedure also yields the maximized likelihood function, which enables calculating the AIC_c. Consequently, an analogous estimation of the binomial parameter p, in order to select the better model with the smaller AIC_c, complements the parameter estimation. The estimated p is simply the probability of detection events as in table 1 and is not reported again.
As WC01 has an observed session variance lower than the binomial one, this data set is not analysed, because this would result in unreasonable values (virtually b → ∞) which may even produce program errors. As table 4 shows, the estimation proce-
Table 5. Model selection for session margin sums

                binomial               beta-binomial
Data Set   logL        AICc       logL        AICc
WC01       -131.32     265.04     nc          nc
MACERR     -966.17     1934.65    -948.20     1901.41
MANTEL     -1511.40    3024.85    -1502.13    3008.42
SAVINGS    -875.60     1753.33    -870.07     1744.52
UPI07      -213.80     430.11     -213.04     431.80
Table 6. Estimated beta-binomial parameters for defect margin sums

Data Set   a      b       Var
MACERR     2.37   12.00   .009
MANTEL     .79    1.21    .080
SAVINGS    1.18   3.15    .037
UPI07      3.62   8.41    .016
WC01       .62    .65     .110
Table 7. Model selection for defect margin sums

                binomial               beta-binomial
Data Set   logL        AICc       logL        AICc
MACERR     -966.17     1934.37    -937.42     1878.93
MANTEL     -1511.40    3024.93    -1083.99    2172.43
SAVINGS    -875.60     1753.30    -770.11     1544.52
UPI07      -213.80     429.73     -210.11     424.59
WC01       -131.32     264.92     -95.31      195.55
dure yields reasonable values for a and b, and the variance in the remaining four data sets ranges from .004 to .006.
For model selection the AIC_c is computed as introduced above. Table 5 shows that, as expected, the beta-binomial model is usually to be preferred over the binomial model. Only for the UPI07 data set and, most likely, WC01 does the binomial model fit better, which is consistent with the results from the overdispersion test.
Table 6 shows the parameters of the beta-binomial model on
defects. In all cases the estimation produced reasonable values
without errors. The variance ranges from .009 to .110.
As can be seen from the AIC_c values in table 7, for all data sets the beta-binomial model is to be preferred over the binomial model.
5.4 Discussion
The results from estimating the beta-binomial parameters clearly
confirm the previous results from the overdispersion test. The val-
ues for beta-binomial variance suggest that session heterogeneity
is quite comparable between the studies (except the WC01 data
set, of course), whereas there appears a larger range for defects.
In most cases, the beta-binomial model is preferred due to a smaller value of AIC_c. In two cases, however, the binomial model fits better: session heterogeneity in the data sets WC01 and UPI07 is better explained with a binomial model. Again, this is expected, as no significant overdispersion could previously be revealed. But note that for UPI07 the two log L values are virtually equal. Thus, there is no harm in applying the beta-binomial model right from the start, as it will always fit equally well or better than the binomial model. It is, though, required to check for the rare case of underdispersion first, as underdispersion will give useless results, if any.
6. PROCESS PREDICTION UNDER
HETEROGENEITY
The main argument of Nielsen and Landauer is to estimate the process outcome given the process size n and probability p with the CGF introduced by Virzi. Their aim is to predict the process
from the data of the process itself and they suggest a curve fitting
procedure using the number of previously undetected defects in
each step. Based on the additive property of Poisson distributed
variables it was believed that the CGF approach is robust to
heterogeneity in the process. As the next section will show, this
is not the case.
6.1 Biases in estimators for p
The curve-fitting procedure of estimating p from previous process data was later replicated by Lewis [10] and found to be less efficient than his approach of directly estimating p from the number of defects detected so far, n_d, and the average number of detection events per session, e_d.
$\hat{p} = e_d / n_d \qquad (7)$
This idea originated in an earlier publication [9], but later Lewis
undertook a major revision [10]. With small samples (in the early
phase of a usability evaluation process) the naive estimator of p
as the average proportion of defects found by each person is biased towards overestimation. This is caused by still undetected defects not being taken into account. In his revision Lewis compared several correction terms in application to real data sets. The final suggestion was an equally weighted combination of a simplified Good-Turing (GT) adjustment and a normalization procedure (NORM) proposed by Hertzum and Jacobsen [8] (see equation (8)). The GT term adjusts for still unseen defects by taking into account the number of defects detected exactly once, n_1. NORM accounts for a possible overestimation due to the small sample size in the beginning of the process. The adjusted estimator for p is written as
$\hat{p}_{GT\text{-}Norm} = \frac{(\hat{p} - \frac{1}{n})(1 - \frac{1}{n})}{2} + \frac{\hat{p}}{2(1 + n_1/n_d)} \qquad (8)$
It was shown with several real data sets that p̂_GT-Norm yields the best estimate of p for small process sizes. For the moment, this seems to be the choice for practitioners to control their processes.
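A sketch of equation (8) as it would be applied during an ongoing study (the function name is illustrative; the matrix holds only the defects discovered so far):

```r
# Small-sample adjusted estimator of p (equation 8): an equally weighted
# combination of the Good-Turing and normalisation corrections.
# m: dichotomous response matrix with the sessions run so far in rows and
#    the defects discovered so far in columns
p_gt_norm <- function(m) {
  n     <- nrow(m)               # process size so far
  p_hat <- mean(m)               # naive estimate of p
  n_d   <- ncol(m)               # defects detected so far
  n_1   <- sum(colSums(m) == 1)  # defects detected exactly once
  (p_hat - 1/n) * (1 - 1/n) / 2 + p_hat / (2 * (1 + n_1 / n_d))
}
```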
However, from a more theoretical point of view it does not fully
convince. The particular combination of both adjustment terms
arose empirically, but is not theoretically justified by a consistent
model of the process, in other words: it is a more or less deliber-
ate choice. This is problematic, because the properties of the estimator p̂_GT-Norm are mostly unknown. Thus, it is not clear how robust the estimator is to violations (esp. heterogeneity) and how it behaves with more extreme parameters (e.g. a very high number of defects or a very low p). In fact, Lewis already observed an increasing underestimation for larger process sizes (n > 6). In-
dustrial evaluation studies are regularly conducted with several
tens of sessions, so this is clearly an annoying effect.
6.2 Fitting the cumulative beta-geometric function
A principal problem of all estimators for p is that they do not take the heterogeneity of the process into account. In general, it seems to be tacitly assumed that the CGF is only sensitive to the expected value of p, but is not biased by variance between defects. This is not the case; instead, the function mapping process size to process outcome is different if defect heterogeneity is present: the process outcome approaches the asymptote at 1 more slowly. This effect is depicted in figure 4 for three mixed distributions with the same average p but different discrete distributions underlying p.
This may be why Lewis' GT-adjusted estimator failed to predict the process properly. If defect heterogeneity is indeed the source of bias in the CGF, then the cumulative beta-geometric function (CBGF) should fit the data better than the CGF. The procedure to test this is straightforward from what was introduced so far: p for the CGF and a and b for the CBGF are estimated
Table 8. Comparison of cumulative beta-geometric vs. cumulative geometric model fit: residual sum of squares

Data Set   RSS_CGF   RSS_CBGF
MacErr     .023      .081
Mantel     .416      .077
Savings    .330      .003
Upi07      .010      .013
WC01       .173      .063
from the margin sum S_D. The empirical curve of process outcome is generated by a Monte-Carlo procedure. Both theoretical functions – CGF and CBGF – are then plotted against the empirical function, and the deviance is computed as the residual sum of squares (RSS). Estimating the beta-geometric parameters of
Monte-Carlo sampled data with the VGAM package usually pro-
duced program errors instead of usable values. Therefore, the fol-
lowing analysis will not include a model selection step by AIC,
but compares the RSS alone, which is not perfect with respect
to model parsimony. In any case, process prediction is largely a
practical problem, where model fit (i.e. minimizing the deviance)
counts more than scientific truth.
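Because E[(1-p)^s] = B(a, b+s) / B(a, b) for a beta-distributed p, the CBGF has a closed form and both model curves can be plotted directly. A sketch with the SAVINGS estimates from table 6 (helper names are illustrative; the closed form is a standard beta-function identity, not taken from the original analysis):

```r
cgf  <- function(s, p)    1 - (1 - p)^s                           # equation (1)
cbgf <- function(s, a, b) 1 - exp(lbeta(a, b + s) - lbeta(a, b))  # beta-geometric

s <- 1:34                             # process sizes of the SAVINGS study
plot(s, cgf(s, 0.27), type = "l", ylim = c(0, 1),
     xlab = "Process Size", ylab = "Process Outcome")
lines(s, cbgf(s, a = 1.18, b = 3.15), lty = 2)   # estimates from table 6
legend("bottomright", c("CGF", "CBGF"), lty = 1:2)
```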
6.3 Data Analysis
The previous analysis dealt with showing the existence and de-
termining the size of heterogeneity from the response matrix. It is still an open question how heterogeneity in the data sets impacts the process outcome, i.e. how the outcome differs from that predicted by the CGF. It has already been demonstrated
that with defect heterogeneity the curve flattens more rapidly with
larger process sizes. To show this effect on the data sets, three
functions will be drawn and compared:
1. The CGF with p estimated from the relative frequency of detection in the response matrix. Note that this is a post-hoc analysis on the whole response matrix, so there is no need for a small-sample adjustment like GT-NORM here.
2. The cumulative beta-geometric function with a and b estimated with the beta-binomial procedure (see table 6)
3. The average “true” process outcome estimated with Monte-Carlo sampling
Table 8 shows the CBGF to fit better for the data sets MANTEL, SAVINGS and WC01. In the remaining two studies (MACERR, UPI07) the CGF has the lower deviation from the observed data. Next, the graphs for the two process models are plotted and visually compared. Figure 5 shows the three data sets where the CBGF fitted better. A first observation is that the CGF predicts
a steeper progress for small process sizes. It generally appears
closer to a rectangular shape. In all three cases the CBGF curve
is much closer to the observed process outcome for the first third
Figure 4. Impact of defect heterogeneity on process outcome (three panels showing distributions of p with no, medium and high heterogeneity, and the resulting outcome curves).
Figure 5. Graphical comparison of process outcome models for studies with RSS_geom > RSS_betageom: Mantel (CGF(0.38), CBGF(0.79, 1.21)), Savings (CGF(0.27), CBGF(1.18, 3.15)) and WC01 (CGF(0.43), CBGF(0.62, 0.65)), each plotted as process outcome over process size against the real data (100 Monte-Carlo runs).
of process size, i.e. the steep part. In the MANTEL and WC01 studies the “asymptotic” last part of the process is closer to the CGF curve. Only for the SAVINGS study is a close-to-perfect match of the beta-geometric model given.
Figure 6 shows the same comparison for the two studies where the CGF provided a slightly better fit. As expected, all three curves are quite close to each other. In particular, there is much less difference in the “steep” part of the process. Interestingly, both models underestimate the process outcome for the last few sessions of the studies, with a slightly better performance of the CGF.
6.4 Discussion
The comparison of process outcome models gives a mixed pic-
ture. In three cases the CBGF fits the data better than the geo-
Figure 6. Graphical comparison of process outcome models for studies with RSS_geom < RSS_betageom: MacErr (CGF(0.16), CBGF(2.37, 12)) and UPI07 (CGF(0.3), CBGF(3.62, 8.41)), each plotted as process outcome over process size against the real data (100 Monte-Carlo runs).
metric model. Two further observations can be made: First, for MACERR and UPI07 the CGF fits better, but the differences in RSS are comparably small. Second, these two studies showed the smallest variance in defect detectability (cf. table 6). This is confirmed by the graphical comparison: when the variance of detectability is low, the process predictions are quite similar. When the variance gets larger, the CBGF is favored, but mainly for the early process.
A final clarification of these mixed findings is out of reach at the
moment, but some speculations can be given: The deviance in
the “asymptotic” part of the process may be due to a limitation
of the Monte-Carlo procedure. In fact, there is no longer fine-grained sampling when the maximum process size is reached. For example, there are only two samples left for s - 1. This makes the smallest possible steps larger than at the beginning of the process. Incidentally, this is a similar problem to the one that made Lewis introduce the NORM term for small process sizes. Also, the models did not include the variance of sessions. It may be that this causes a further bias in the process outcome model, which takes effect at larger process sizes.
Be aware that the post-hoc approach presented here is not applicable for controlling an industrial process. But in the hope that future research comes up with online prediction models accounting for heterogeneity, some practical advice can already be given: in general, if finding defects is mission-critical, it is safe to account for defect heterogeneity. In the worst case this leads to a slight underestimation of the current process outcome, which, unlike an optimistic bias, does no harm. In all less ambitious studies, where the process goal is around 80% or smaller, it is highly recommended to apply a heterogeneous prediction model, as it performs at least equally well and better in most cases.
A special caveat on using prediction models is revealed by a further look at the data set UPI07. The authors reported that the experiment was held under tight time constraints with a limited set of user tasks [20]. They subsequently conducted a usability testing study in order to validate the defects via falsification testing. In this study they found a larger number of defects not detected by the ten inspection sessions. Yet the incomplete data set neatly complies with the CGF prediction model and asymptotically reaches the upper bound. This is well in line with the findings of Lindgaard and Chattratichart, who re-analysed the CUE-4 data sets [14] and found that the effectiveness of a usability study was largely affected by task design and task coverage [12]. It follows that no prediction model will uncover the fact that some defects have not been found due to flawed study design. In other words: usability practitioners are not freed from their responsibility to carefully compile relevant user tasks and assign sufficient resources to their studies.
7. GENERAL DISCUSSION
Several researchers have already guessed it and the results re-
ported here confirm it: Heterogeneity regularly appears in both
factors – sessions and defects – of usability evaluation studies.
Defects differ in their detectability in all five studies analysed
here. In most cases one also has to expect sessions to differ in their detection capability, regardless of whether usability tests or expert evaluations are being conducted.
What has largely been overlooked in research is that heterogene-
ity has considerable impact on the prediction of process outcome.
Defect heterogeneity causes a lower process outcome than is to be expected under the CGF model. For industrial usability studies this raises the risk of stopping the process too early. As an example, have a look at the prediction for the SAVINGS data set, where a total of 44 defects was observed (see middle graph in figure 5). Assume it was decided to stop the process when 80% of defects are detected. The CGF predicts this to be the case with the “magic” number of five evaluators. But only 67% are detected at this stage, which means that only 30 instead of 35 defects are revealed. In fact, double the process size, 10, is required to reach the 80% goal. The bias of the CGF becomes largest at a process size of 9. Here, the CGF predicts 94%, whereas the observed rate is only 78%, an absolute difference of around 7 defects. Obviously, overestimating the outcome by this magnitude is very harmful for large, mission-critical studies. In comparison, the CBGF accounts for defect heterogeneity and predicts this data set very well, with a maximum deviation of only one defect.
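This example can be recomputed from the closed forms sketched in section 6.2, using p ≈ .27 and the beta parameters of SAVINGS from table 6 (approximate values; the helpers are restated for completeness):

```r
cgf  <- function(s, p)    1 - (1 - p)^s
cbgf <- function(s, a, b) 1 - exp(lbeta(a, b + s) - lbeta(a, b))

cgf(5, 0.27)         # ~0.79: the CGF promises the 80% goal after five sessions
cbgf(5, 1.18, 3.15)  # ~0.67: under defect heterogeneity only about two thirds
cgf(9, 0.27)         # ~0.94 predicted by the CGF at process size 9 ...
cbgf(9, 1.18, 3.15)  # ... versus ~0.79, close to the observed 78%
```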
Another issue rarely addressed in the current literature is the variance of outcome at different points in the process. This was not the major topic here, but it is known from an earlier simulation experiment that session heterogeneity causes additional vari-
ance of outcome [22]. Consequently, when an evaluation study
has a strict goal an extra safety margin is required.
Determining the size of this safety margin is beyond the scope
of this work. But the methodological approach introduced here is the way to go: statistical model selection provides powerful means to determine the underlying stochastic process (or at least the best known approximation). An appropriate statistical model may serve for both point estimates and the variance of the process outcome. For simple models (which are unlikely to be good approxi-
mations) confidence intervals can easily be obtained from general
textbooks on statistics. For more complex models Monte-Carlo
simulations provide a flexible alternative.
The beta-geometric model suggested here is an advance regard-
ing better point estimators. But, in order to also determine the
confidence interval for the estimators a model is required that
includes both impact factors – session and defect heterogeneity
– at the same time. This is the case with the Rasch model (see
section 2). But this approach requires large sample sizes for pa-
rameter estimation and is thus not applicable for extrapolating the
process from early sessions.
Another good starting point are the so-called capture-recapture
models, which have their origins in biostatistics (e.g. [6]).
Capture-recapture models address a problem analogous to the
defect detection process: The number of animals living in an area
has to be estimated without counting them all. There already exists a variety of statistical models, some of which allow for time effects (comparable to session heterogeneity) and animal heterogeneity (comparable to defect heterogeneity). In fact, these have
successfully been applied to software inspection processes in
Software Engineering (e.g. [2]). Work on evaluating and adapt-
ing capture-recapture models for usability evaluation processes
is currently in progress.
Another viable path is to use the beta-geometric model for the planning of studies (i.e. at the time of project negotiation). A usability service company may have tens or hundreds of data sets from past projects. These can be fed into the beta-binomial estimation procedure to derive the heterogeneity measures. Stored in a database, these can act as a priori estimators for to-be-planned studies. Probably, the value will increase further when data sets are classified according to some relevant factors. Common statistical techniques like ANOVA or factor analysis may reveal these factors and, at the same time, may give further insights into the origins of heterogeneity.
8. CONCLUSION
Heterogeneity in usability evaluations is a fact. This puts forth
several open research questions:
- How large is heterogeneity in both factors under various conditions?
- What causes heterogeneity?
- Is it a continuous variable or are there distinct classes of usability inspectors (test users) and defects?
- What kind of model can replace the CGF in order to extrapolate the process?
Heterogeneity puts a harmful bias on the well-known CGF predictor. As a result, usability practitioners run the risk of stopping the process far too early. A proven solution for reliable process prediction is currently work in progress. Until then, advice to practitioners running usability studies can be given as follows:
- Run the test on binomial overdispersion on your past data sets to convince yourself of heterogeneity.
- Pay special attention to defect heterogeneity (which is likely to occur), as this leads to harmful overestimation of outcome.
- For small studies or at the beginning of a larger study (n < 6), apply the CGF model with the GT-NORM estimator suggested by Lewis [11].
- When the process size increases (n > 10), estimate the beta-binomial parameters and feed them to the CBGF. Repeat this when new data points arrive and control your process towards the targeted goal.
- Make sure to always have a generous safety margin when usability is mission-critical. Expect the actual outcome to fall up to 17% short of the CGF prediction. Have a look at [21, 7] to get an idea of the random variation at different process sizes.
- Keep in mind that the predicted outcome is only reliable for well-designed studies. Pay attention to complete task coverage and appropriate tasks.
Also, practitioners are encouraged to store their project data and begin to use it for the planning of studies. Possibly, a mature approach to experience-based sample size prediction will eventually appear, which makes use of past data sets and statistical models to give a best guess of the required process size.
Some programs for analysing evaluation process data are avail-
able online [19] or on request. The author is willing to further
assist and cooperate.
9. ACKNOWLEDGEMENTS
Work on this paper was made possible by a stipend of the Pas-
sau Graduate School of Business and Economics and generous
support of Chair Prof. Dr. Franz Lehner, Passau University.
Thanks to all authors who have published their complete data sets
[10, 16, 26]. Find another one at [19].
10. REFERENCES
[1] B. W. Boehm and V. R. Basili. Software defect reduction top 10
list. IEEE Computer, 34(1):135–137, 2001.
[2] L. C. Briand, K. El Emam, B. G. Freimut, and O. Laitenberger.
A comprehensive evaluation of capture-recapture models for
estimating software defect content. IEEE Transactions on Software
Engineering, 26(6):518–540, 2000.
[3] K. P. Burnham and D. R. Anderson. Multimodel inference.
understanding AIC and BIC in model selection. Sociological
Methods & Research, 33(2):261–304, 2004.
[4] P. Cairns. HCI...not as it should be: Inferential statistics in HCI
research. In L. J. Ball, M. A. Sasse, C. Sas, T. C. Ormerod,
A. Dix, P. Bagnall, and T. McEwan, editors, Proceedings of the
HCI 2007, volume 1 of People and Computers, pages 195–201.
British Computing Society, 2007.
[5] D. A. Caulton. Relaxing the homogeneity assumption in usability
testing. Behaviour & Information Technology, 20(1):1–7, 2001.
[6] A. Chao. An overview of closed capture-recapture models. Journal
of Agricultural, Biological, and Environmental Statistics, 6(2):158–
175, 2001.
[7] L. Faulkner. Beyond the five-user assumption: Benefits of increased
sample sizes in usability testing. Behavior Research Methods,
Instruments & Computers, 35(3):379–383, 2003.
[8] M. Hertzum and N. E. Jacobsen. The evaluator effect: A chilling
fact about usability evaluation methods. International Journal of
Human-Computer Interaction, 13(4):421–443, 2001.
[9] J. R. Lewis. Sample sizes for usability studies: Additional
considerations. Human Factors, 36:368–378, 1994.
[10] J. R. Lewis. Evaluation of procedures for adjusting problem-
discovery rates estimated from small samples. International
Journal of Human-Computer Interaction, 13(4):445–479, 2001.
[11] J. R. Lewis. Sample sizes for usability tests: Mostly math, not
magic. Interactions, 13(6):29–33, 2006.
[12] G. Lindgaard and J. Chattratichart. Usability testing: What have we
overlooked? In CHI ’07: Proceedings of the SIGCHI conference
on Human factors in computing systems, pages 1415–1424, New
York, NY, USA, 2007. ACM Press.
[13] A. W. Marshall and I. Olkin. Inequalities: Theory of Majorization
and Its Applications. Academic Press, New York, 1979.
[14] R. Molich and R. Jeffries. Comparative expert review. In
Proceedings of the CHI 2003, Extended Abstracts, pages 1060
– 1061. ACM Press, 2003.
[15] J. Nielsen and T. K. Landauer. A mathematical model of the
finding of usability problems. In CHI ’93: Proceedings of the
SIGCHI conference on Human factors in computing systems, pages
206–213, New York, NY, USA, 1993. ACM Press.
[16] J. Nielsen and R. Molich. Heuristic evaluation of user interfaces.
In Proceedings of the CHI 1990, 1990.
[17] R Development Core Team. R: A Language and Environment for
Statistical Computing. R Foundation for Statistical Computing,
Vienna, Austria, 2006. ISBN 3-900051-07-0.
[18] L.-P. Rivest. Why a time effect often has a limited impact on
capture-recapture estimates in closed populations. The Canadian
Journal of Statistics, 35(4), 2007. In press.
[19] M. Schmettow. Heterogeneity in the usability evaluation process – accompanying website. Website, January 2008. http://schmettow.info/Heterogeneity.
[20] M. Schmettow and S. Niebuhr. A pattern-based usability inspection
method: First empirical performance measures and future issues.
In D. Ramduny-Ellis and D. Rachovides, editors, Proceedings of
the HCI 2007, volume 2 of People and Computers, pages 99–102.
BCS, September 2007.
[21] M. Schmettow and W. Vietze. Introducing item response theory for
measuring usability inspection processes. submitted to CHI2008,
September 2007.
[22] M. Schmettow and W. Vietze. Introducing item response theory
for measuring usability inspection processes. In CHI 2008
Proceedings, pages 893–902. ACM SIGCHI, April 2008.
[23] S. S. Shapiro and A. J. Gross. Statistical Modeling Techniques.
Marcel Dekker, New York, 1981.
[24] R. A. Virzi. Refining the test phase of usability evaluation: How
many subjects is enough? Human Factors, 34(4):457–468, 1992.
[25] W. Whitt. Uniform conditional variability ordering of probability
distributions. Journal of Applied Probability, 22:619–633, 1985.
[26] A. Woolrych and G. Cockton. Why and when five test users aren’t
enough. In J. Vanderdonckt, A. Blandford, and A. Derycke, editors,
Proceedings of IHM-HCI 2001 Conference, volume 2, pages 105–
108. Cépaduès-Éditions, Toulouse, France, 2001.
[27] T. W. Yee. VGAM: Vector Generalized Linear and Additive Models,
2007. R package version 0.7-5.
The author is concerned with log-linear estimators of the size N of a population in a capture-recapture experiment featuring heterogeneity in the individual capture probabilities and a time effect. He also considers models where the first capture influences the probability of subsequent captures. He derives several results from a new inequality associated with a dispersive ordering for discrete random variables. He shows that in a log-linear model with inter-individual heterogeneity, the estimator N̂ is an increasing function of the heterogeneity parameter. He also shows that the inclusion of a time effect in the capture probabilities decreases N̂ in models without heterogeneity. He further argues that a model featuring heterogeneity can accommodate a time effect through a small change in the heterogeneity parameter. He demonstrates these results using an inequality for the estimators of the heterogeneity parameters and illustrates them in a Monte Carlo experiment.