International Encyclopedia of Ergonomics and Human Factors, 2006, Second Edition, Volume 3
Edited by Waldemar Karwowski, Boca Raton, FL: CRC Press
Determining Usability Test Sample Size
Carl W. Turner*, James R. Lewis, and Jakob Nielsen
*State Farm Insurance Cos., Bloomington, IL 61791, USA
IBM Corp., Boca Raton, FL 33487 USA
Nielsen Norman Group, Fremont, CA 94539, USA
1 INTRODUCTION
Virzi (1992), Nielsen and Landauer (1993), and Lewis
(1994) have published influential articles on the topic of
sample size in usability testing. In these articles, the authors
presented a mathematical model of problem discovery rates
in usability testing. Using the problem discovery rate
model, they showed that it was possible to determine the
sample size needed to uncover a given proportion of
problems in an interface during one test. The authors
presented empirical evidence for the models and made
several important claims:
• Most usability problems are detected with the first three to five subjects.
• Running additional subjects during the same test is unlikely to reveal new information.
• Return on investment (ROI) in usability testing is maximized when testing with small groups using an iterative test-and-design methodology.
Nielsen and Landauer (1993) extended Virzi’s (1992)
original findings and reported case studies that supported
their claims for needing only small samples for usability
tests. They and Lewis (1994) identified important
assumptions about the use of the formula for estimating
problem discovery rates. The problem discovery rate model
was recently re-examined by Lewis (2001).
2 THE ORIGINAL FORMULAE
Virzi (1992) published empirical data supporting the use of
the cumulative binomial probability formula to estimate
problem discovery rates. He reported three experiments in
which he measured the rate at which usability experts and
trained student assistants identified problems as a function
of the number of naive participants they observed. Problem
discovery rates were computed for each participant by
dividing the number of problems uncovered during an
individual test session by the total number of unique
problems found during testing. The average likelihood of
problem detection was computed by averaging all
participants’ individual problem discovery rates.
Virzi (1992) used Monte Carlo simulations to
permute participant orders 500 times to obtain the average
problem discovery curves for his data. Across three sets of
data, the average likelihoods of problem detection (p in Equation 1, below) were 0.32, 0.36, and 0.42. He also had the
observers (Experiment 2) and an independent group of
usability experts (Experiment 3) provide ratings of problem
severity for each problem. Based on the outcomes of these
experiments, Virzi made three claims regarding sample size
for usability studies: (1) Observing four or five participants
allows practitioners to discover 80% of a product’s usability
problems, (2) observing additional participants reveals
fewer and fewer new usability problems, and (3) observers
detect the more severe usability problems with the first few
participants. Based on these data, he claimed that running
tests using small samples in an iterative test-and-design
fashion would identify most usability problems and save
both time and money.
Proportion of unique problems found = 1 − (1 − p)^n   (1)
where p is the mean problem discovery rate computed
across subjects (or across problems) and n is the number of
subjects. Seeking to quantify the patterns of problem
detection observed in several fairly large-sample studies of
problem discovery (using either heuristic evaluation or user testing), Nielsen and Landauer (1993) derived the same formula from a Poisson process model (constant probability, path independent). They found that it provided a good fit to
their problem-discovery data, and provided a basis for
predicting the number of problems existing in an interface
and performing cost-benefit analyses to determine
appropriate sample sizes. Across 11 studies (five user tests
and six heuristic evaluations), they found the average value
of lambda to be .33 (ranging from .16 to .60, with associated
estimates of p ranging from .12 to .58). Nielsen and
Landauer used lambda rather than p, but the two concepts
are essentially equivalent. In the literature, λ (lambda), L,
and p are commonly used to represent the average
likelihood of problem discovery. Throughout this article,
we will use p.
Number of unique problems found = N(1 − (1 − p)^n)   (2)
where p is the problem discovery rate, N is the total number
of problems in the interface, and n is the number of subjects.
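As a concrete illustration of Equations 1 and 2, the following Python sketch computes both quantities. It is our illustration, not code from the studies cited here; the example values (p = .31, n = 5, and an assumed N = 30 problems) are hypothetical, with p chosen to be close to the approximately .3 average discussed below.

```python
def proportion_found(p: float, n: int) -> float:
    """Equation 1: expected proportion of unique problems found by n subjects."""
    return 1 - (1 - p) ** n

def problems_found(N: int, p: float, n: int) -> float:
    """Equation 2: expected number of unique problems found by n subjects."""
    return N * proportion_found(p, n)

# Illustrative values only: p = .31, n = 5 subjects, N = 30 problems assumed to exist.
print(round(proportion_found(0.31, 5), 2))    # ~0.84
print(round(problems_found(30, 0.31, 5), 1))  # ~25.3
```

With these hypothetical values, five subjects would be expected to uncover about 84% of the problems, or roughly 25 of the assumed 30.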
The problem discovery rate was approximately .3
when averaged across a large number of independent tests,
but the rate for any given usability test will vary depending
on several factors (Nielsen & Landauer, 1993). These
factors include:
• Properties of the system and interface, including the size of the application.
• Stage in the usability lifecycle at which the product is tested, whether early in the design phase or after several iterations of test and re-design.
• Type and quality of the methodology used to conduct the test.
• Specific tasks selected.
• Match between the test and the context of real-world usage.
• Representativeness of the test participants.
• Skill of the evaluator.
Research following these lines of investigation led to other,
related claims. Nielsen (1994) applied the formula in
Equation 2 to a study of problem discovery rate for heuristic
evaluations. Eleven usability specialists evaluated a
complex prototype system for telephone company
employees. The evaluators obtained training on the system
and the goals of the evaluation. They then independently
documented usability problems in the user interface based
on published usability heuristics. The average value of p
across 11 evaluators was .29, similar to the rates found
during talk-aloud user testing (Nielsen & Landauer, 1993;
Virzi, 1992).
Lewis (1994) replicated the techniques applied by
Virzi (1992) to data from a usability study of a suite of
office software products. The problem discovery rate for
this study was .16. The results of this investigation clearly
supported Virzi’s second claim (additional participants
reveal fewer and fewer problems), partially supported the
first (observing four or five participants reveals about 80%
of a product’s usability problems as long as the value of p
for a study is in the approximate range of .30 to .40), and
failed to support the third (there was no correlation between
problem severity and likelihood of discovery). Lewis noted
that it is most reasonable to use small-sample problem
discovery studies “if the expected p is high, if the study will
be iterative, and if undiscovered problems will not have
dangerous or expensive outcomes” (1994, p. 377).
3 RECENT CHALLENGES
Recent challenges to the estimation of problem discovery
rates appear to take two general forms. The first questions
the reliability of problem discovery procedures (user testing,
heuristic evaluation, cognitive walkthrough, etc.). If
problem discovery is completely unreliable, then how can
anyone model it? Furthermore, how can one account for the
apparent success of iterative problem-discovery procedures
in increasing the usability of the products against which they
are applied?
The second questions the validity of modeling the
probability of problem discovery with a single value for p.
Other issues, such as the need for a fairly high value of p when claiming high proportions of problem discovery with few participants, the fact that different task sets lead to different opportunities to discover problems, and the importance of iteration, are addressed at length in earlier papers (Lewis, 1994; Nielsen, 1993).
3.1 Is Usability Problem Discovery Reliable?
Molich et al. (1998) conducted a study in which four
different usability labs evaluated a calendar system and
prepared reports of the usability problems they discovered.
An independent team of usability professionals compared
the reports produced by the four labs. The number of
unique problems identified by each lab ranged from four to
98. Only one usability problem was reported by all four
labs. The teams that conducted the studies noted difficulties
in conducting the evaluations that included a lack of testing
goals, no access to the product development team, a lack of
user profile information, and no design goals for the
product.
Kessner et al. (2001) have also reported data that
question the reliability of usability testing. They had six
professional usability teams test an early prototype of a
dialog box. The total number of usability problems was
determined to be 36. None of the problems were identified
by every team, and only two were reported by five teams.
Twenty of the problems were reported by at least two teams.
After comparing their results with those of Molich et al.
(1999), Kessner et al. suggested that more specific and
focused requests by a client should lead to more overlap in
problem discovery.
Hertzum and Jacobsen (2001) have termed the lack of inter-rater reliability among test observers an ‘evaluator effect’, meaning that “multiple evaluators evaluating the same interface with the same usability evaluation method detect markedly different sets of problems” (p. 421). Across a
review of 11 studies, they found the average agreement
between any two evaluators of the same system ranged from
5% to 65%, with no usability evaluation method (cognitive
walkthroughs, heuristic evaluations, or think-aloud user
studies) consistently more effective than another. Their
review, and the studies of Molich et al. (1999) and Kessner
et al. (2001) point out the importance of setting clear test
objectives, running repeatable test procedures, and adopting
clear definitions of usability problems. Given that multiple
evaluators increase the likelihood of problem detection
(Nielsen, 1994), they suggested that one way to reduce the
evaluator effect is to involve multiple evaluators in usability
tests.
The results of these studies are in stark contrast to
earlier studies in which usability problem discovery was
reported to be reliable (Lewis, 1996; Marshall, Brendon, &
Prail, 1990). The widespread use of usability problem
discovery methods indicates that practitioners believe they
are reliable. Despite this widespread belief, an important
area of future research will be to reconcile the studies that
have challenged the reliability of problem discovery with
the apparent reality of usability improvement achieved
through iterative application of usability problem discovery
methods. For example, there might be value in exploring
the application of signal detection theory (Swets, Dawes, &
Monahan, 2000) to the detection of usability problems.
3.2 Issues in the Estimation of p
Woolrych and Cockton (2001) challenged the
assumption that a simple estimate of p is sufficient for the
purpose of estimating the sample size required for the
discovery of a specified percentage of usability problems in
an interface. Specifically, they criticized the formula for
failing to take into account individual differences in
problem discoverability and also claimed that the typical
values used for p (around .30) are overly optimistic. They
also pointed out that the circularity in estimating the key
parameter of p from the study for which you want to
estimate the sample size reduces its utility as a planning
tool. Following close examination of data from a previous
study of heuristic evaluation, they found combinations of
five participants which, if they had been the only five
participants studied, would have dramatically changed the
resulting problem lists, both for frequency and severity.
They recommended the development of a formula that
replaces a single value for p with a probability density
function.
Caulton (2001) claimed that the simple estimate of
p only applies given a strict homogeneity assumption that
all types of users have the same probability of encountering
all usability problems. To address this, Caulton added to the
standard cumulative binomial probability formula a
parameter for the number of heterogeneous groups. He also
introduced and modeled the concept of problems that
heterogeneous groups share and those that are unique to a
particular subgroup. His primary claims were (1) the more
subgroups, the lower will be the expected value of p and (2)
the more distinct the subgroups are, the lower will be the
expected value of p.
Most of the arguments of Woolrych and Cockton
(2001) were either addressed in previous literature or do not
stand up against the empirical findings reported in previous
literature. It is true that estimates of p can vary widely from
study to study. This characteristic of usability testing can be
addressed by estimating p for a study after running two
subjects and adjusting the estimate as the study proceeds
(Lewis, 2001). There are problems with the estimation of p
from the study to which you want to apply it, but recent
research (discussed below) provides a way to overcome
these problems. Of course, it is possible to select different
subsets of participants who experienced problems in a way
that leads to an overestimate of p (or an underestimate of p,
or any value of p that the person selecting the data wishes).
Test administrators should follow accepted practice and
select evaluators who represent the range of knowledge and
skills found in the population of end users. There is no
compelling evidence that a probability density function
would lead to an advantage over a single value for p,
although there might be value in computing confidence
intervals for single values of p.
Caulton’s (2001) refinement of the model is
consistent with the observation that different user groups
expose different types of usability problems (Nielsen, 1993).
It is good practice to include participants from each significant user group in a test: three or four participants per group when testing two groups, and three per group when testing more than two groups. If
there is a concern that different user groups will uncover
different sets of usability problems then the data for each
group can be analyzed separately, and a separate p
computed for each user group. However, Caulton’s claim
that problem discovery estimates are always inflated when
averaged across heterogeneous groups and problems with
different values of p is inconsistent with the empirical data
presented in Lewis (1994). Lewis demonstrated that p is
robust, showing that the mean value of p worked very well
for modeling problem discovery in a set of problems that
had widely varying values of p.
4 IMPROVING SMALL-SAMPLE
ESTIMATION OF p
Lewis (2001), responding to an observation by Hertzum and
Jacobsen (2001) that small-sample estimates of p are almost
always inflated, investigated a variety of methods for
adjusting these small-sample estimates to enable accurate
assessment of sample size requirements and true proportions
of discovered problems. Using data from a series of Monte
Carlo studies applied against four published sets of problem
discovery databases, he found that a technique based on
combining information from a normalization procedure and
a discounting method borrowed from statistical language
modeling produced very accurate adjustments for small-
sample estimates of p. The Good-Turing (GT) discounting
procedure reduced, but did not completely eliminate, the
overestimate of problem discovery rates produced by small-
sample p estimates. The GT adjustment, shown in Equation 3, was:

padj(GT) = pest / (1 + E(N1)/N)   (3)

where pest is the initial estimate computed from the raw data of a usability study, E(N1) is the number of usability problems detected by only one user, and N is the total number of unique usability problems detected by all users.
By contrast, the normalization procedure (Norm) slightly underestimated problem discovery rates. The normalization adjustment, shown in Equation 4, was:

padj(Norm) = (pest − 1/n)(1 − 1/n)   (4)

where pest is the initial estimate computed from the raw data of a usability study and n is the number of test
participants. He concluded that the overestimation of p
from small-sample usability studies is a real problem with
potentially troubling consequences for usability
practitioners, but that it is possible to apply these procedures
(normalization and Good-Turing discounting) to
compensate for the overestimation bias. Applying each
procedure to the initial estimate of p, then averaging the
results, produces a highly accurate estimate of the problem
discovery rate. Equation 5 shows the formula for an
adjusted p estimate based on averaging Good-Turing and
normalization adjustments.
padj = 1/2 [pest / (1 + E(N1)/N)] + 1/2 [(pest − 1/n)(1 − 1/n)]   (5)
“Practitioners can obtain accurate sample size estimates for
problem-discovery goals ranging from 70% to 95% by
making an initial estimate of the required sample size after
running two participants, then adjusting the estimate after
obtaining data from another two (total of four) participants”
(Lewis, 2001, p. 474).
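As a sketch of how a practitioner might script Equations 3 through 5, the following Python function is our own illustration; the function and parameter names are assumptions, not taken from Lewis (2001).

```python
def adjust_p(p_est: float, n: int, n_once: int, n_unique: int) -> float:
    """Combine the Good-Turing and normalization adjustments (Equations 3-5).

    p_est    -- raw discovery rate averaged across participants (or problems)
    n        -- number of participants observed so far
    n_once   -- number of problems detected by exactly one participant, E(N1)
    n_unique -- total number of unique problems detected, N
    """
    p_gt = p_est / (1 + n_once / n_unique)   # Equation 3 (Good-Turing discounting)
    p_norm = (p_est - 1 / n) * (1 - 1 / n)   # Equation 4 (normalization)
    return (p_gt + p_norm) / 2               # Equation 5 (average of the two)

# Hypothetical example: raw estimate .42 after n = 4 participants,
# with 3 of 10 unique problems seen only once.
print(round(adjust_p(0.42, 4, 3, 10), 2))    # ~0.23
```

Following the quoted procedure, a practitioner would call this after two participants for a first estimate and again after four, then use Equation 1 to project the sample size needed for the chosen discovery goal.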
The results of a return-on-investment (ROI) model
for usability studies (Lewis, 1994) indicated that the
magnitude of p affected the point at which the percentage of
problems discovered maximized ROI. For values of p
ranging from .10 to .5, the appropriate problem discovery
goal ranged from .86 to .98, with lower values of p
associated with lower problem discovery goals.
5 AN APPLICATION OF THE
ADJUSTMENT PROCEDURES
In the example shown in Table 1, a usability test with eight
participants has led to the discovery of four unique usability
problems. The problem discovery rates (p) for individual
participants ranged from 0.0 to .75. The problem discovery
rates for specific problems ranged from .125 to .875. The
average problem discovery rate (averaged either across
problems or participants), pest, was .375. Note that Problems 2 and 4 were detected by only one participant (Participants 2 and 7, respectively), so E(N1) = 2 and N = 4.
TABLE 1
Data from a Hypothetical Usability Test with Eight Subjects, pest = .375

Subject   Problem 1   Problem 2   Problem 3   Problem 4   Count       p
1             1           0           1           0         2      0.500
2             1           0           1           1         3      0.750
3             1           0           0           0         1      0.250
4             0           0           0           0         0      0.000
5             1           0           1           0         2      0.500
6             1           0           0           0         1      0.250
7             1           1           0           0         2      0.500
8             1           0           0           0         1      0.250
Count         7           1           3           1
p           0.875       0.125       0.375       0.125              0.375
Applying the Good-Turing estimating procedure from Equation 3 gives

padj(GT) = .375 / (1 + 2/4) = .25

Applying normalization as shown in Equation 4 gives

padj(Norm) = (.375 − 1/8)(1 − 1/8) ≈ .22

The adjusted problem discovery rate is obtained by averaging the two estimates as shown in Equation 5:

padj = (.25 + .22)/2 ≈ .23
With this adjusted value of p and the known sample size, it is possible to estimate the sample size adequacy of this study using the cumulative binomial probability formula: 1 − (1 − .23)^8 ≈ .88. If the problem discovery goal for this study had been about 88%, then the sample size was adequate. If the discovery goal had been lower, the sample size would be excessive, and if the discovery goal had been higher, the sample size would be inadequate. The discovery of only four problems (one problem for every two participants) suggests that the discovery of additional problems would be difficult. If four problems constitute about 88% of the problems available for discovery given the specifics of this usability study, then 100% of the problems available for discovery should be about 4/.88, or roughly 4.5. In non-numerical terms, there probably aren’t a lot of additional problems to extract from this problem discovery space.
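The arithmetic for this example can be checked directly from the Table 1 matrix. The short Python sketch below is ours; it simply recomputes pest, E(N1), N, and the adjusted estimate from the hypothetical data.

```python
# Detection matrix from Table 1: rows = eight participants, columns = four problems.
hits = [
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
n = len(hits)                                   # 8 participants
N = len(hits[0])                                # 4 unique problems
p_est = sum(map(sum, hits)) / (n * N)           # 0.375
col_counts = [sum(row[j] for row in hits) for j in range(N)]
n_once = sum(1 for c in col_counts if c == 1)   # E(N1) = 2 (Problems 2 and 4)

p_gt = p_est / (1 + n_once / N)                 # Equation 3: 0.25
p_norm = (p_est - 1 / n) * (1 - 1 / n)          # Equation 4: ~0.22
p_adj = (p_gt + p_norm) / 2                     # Equation 5: ~0.23

print(round(p_adj, 2), round(1 - (1 - p_adj) ** n, 2))   # 0.23 0.88
```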
As an example of sample size estimation, suppose
you had data from the first four participants and wanted to
estimate the number of participants you’d need to run to
achieve 90% problem discovery. After running the fourth
participant, there were three discovered problems (because
Problem 2 did not occur until Participant 7), as shown in
Table 2. One of those problems (Problem 4) occurred only
once.
TABLE 2
Data from a Hypothetical Usability Test; First Four Subjects, pest = .500

Subject   Problem 1   Problem 3   Problem 4   Count       p
1             1           1           0         2      0.667
2             1           1           1         3      1.000
3             1           0           0         1      0.333
4             0           0           0         0      0.000
Count         3           2           1
p           0.750       0.500       0.250              0.500
Applying the Good-Turing estimating procedure from Equation 3 gives

padj(GT) = .500 / (1 + 1/3) = .375

Applying normalization as shown in Equation 4 gives

padj(Norm) = (.500 − 1/4)(1 − 1/4) ≈ .19

The average of the two estimates is

padj = (.375 + .19)/2 ≈ .28
Given p = .28, the estimated proportion of discovered problems would be 1 − (1 − .28)^4, or .73. Doing the same
computation with n = 7 gives .90, indicating that the
appropriate sample size for the study would be 7. Note that
in the matrix for this hypothetical study, running the eighth
participant did not reveal any new problems.
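The same projection can be scripted from the Table 2 matrix; the sketch below (ours, using the hypothetical data) adjusts p from the first four participants and tabulates the expected proportion of problems discovered as the sample size grows.

```python
# Detection matrix from Table 2: first four participants, Problems 1, 3, and 4.
hits = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 0],
]
n = len(hits)                                   # 4 participants so far
N = len(hits[0])                                # 3 unique problems
p_est = sum(map(sum, hits)) / (n * N)           # 0.500
n_once = sum(1 for j in range(N) if sum(row[j] for row in hits) == 1)  # 1 (Problem 4)

p_gt = p_est / (1 + n_once / N)                 # Equation 3: 0.375
p_norm = (p_est - 1 / n) * (1 - 1 / n)          # Equation 4: ~0.19
p_adj = (p_gt + p_norm) / 2                     # Equation 5: ~0.28

for size in range(4, 10):
    print(size, round(1 - (1 - p_adj) ** size, 2))   # reaches 0.90 at size = 7
```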
6 CONCLUSIONS
The cumulative binomial probability formula (given
appropriate adjustment of p when estimated from small
samples) provides a quick and robust means of estimating
problem discovery rates (p). This estimate can be used to
estimate usability test sample size requirements (for studies
that are underway) and to evaluate usability test sample size
adequacy (for studies that have already been conducted).
Further research is needed to answer remaining questions
about when usability testing is reliable, valid, and useful.
REFERENCES
CAULTON, D.A., 2001, Relaxing the homogeneity assumption in
usability testing. Behaviour & Information Technology, 20,
1-7.
HERTZUM, M. and JACOBSEN, N.E., 2001, The evaluator
effect: a chilling fact about usability evaluation methods.
International Journal of Human-Computer Interaction, 13,
421-443.
KESSNER, M., WOOD, J., DILLON, R.F. and WEST, R.L., 2001,
On the reliability of usability testing. In Jacko, J. and Sears,
A., (eds), Conference on Human Factors in Computing
Systems: CHI 2001 Extended Abstracts (Seattle, WA: ACM
Press), pp. 97-98.
LEWIS, J.R., 1994, Sample sizes for usability studies: Additional
considerations. Human Factors, 36, 368-378.
LEWIS, J.R., 1996, Reaping the benefits of modern usability
evaluation: The Simon story. In Salvendy, G. and Ozok, A.,
(eds), Advances in Applied Ergonomics: Proceedings of the
1st International Conference on Applied Ergonomics ICAE
'96 (Istanbul, Turkey: USA Publishing), pp. 752-757.
LEWIS, J.R., 2001, Evaluation of procedures for adjusting
problem-discovery rates estimated from small samples.
International Journal of Human-Computer Interaction, 13,
445-479.
MARSHALL, C., BRENDAN, M. and PRAIL, A., 1990, Usability of product X: Lessons from a real product. Behaviour &
Information Technology, 9, 243-253.
MOLICH, R., BEVAN, N., CURSON, I., BUTLER, S.,
KINDLUND, E., MILLER, D. and KIRAKOWSKI, J., 1998,
Comparative evaluation of usability tests. In Proceedings of
the Usability Professionals Association Conference
(Washington, DC: UPA), pp. 83-84.
NIELSEN, J., 1993, Usability engineering (San Diego, CA:
Academic Press).
NIELSEN, J., 1994, Heuristic evaluation. In Nielsen, J. and Mack,
R.L. (eds), Usability Inspection Methods (New York: John
Wiley), pp. 25-61.
NIELSEN, J. and LANDAUER, T.K., 1993, A mathematical
model of the finding of usability problems. In Proceedings of
ACM INTERCHI’93 Conference (Amsterdam, Netherlands:
ACM Press), pp. 206-213.
SWETS, J.A., DAWES, R.M. and MONAHAN, J., 2000, Better
decisions through science. Scientific American, 283(4), 82-87.
VIRZI, R.A., 1992, Refining the test phase of usability evaluation:
How many subjects is enough? Human Factors, 34, 457-468.
WOOLRYCH, A. and COCKTON, G., 2001, Why and when five
test users aren’t enough. In Vanderdonckt, J., Blandford, A.
and Derycke A. (eds.) Proceedings of IHM-HCI 2001
Conference, Vol. 2 (Toulouse, France: Cépadèus Éditions),
pp. 105-108.