Testing Web Sites: Five Users Is Nowhere Near Enough
Jared Spool
User Interface Engineering
242 Neck Rd.
Bradford, MA 01835 USA
+1 978 374 8300
Will Schroeder
User Interface Engineering
242 Neck Rd.
Bradford, MA 01835 USA
+1 978 374 8300
We observed the same task executed by 49 users on four production web sites. We tracked the rates of discovery of new usability problems on each site and, using that data, estimated the total number of usability problems on each site and the number of tests we would need to discover every problem. Our findings differ sharply from rules of thumb derived from earlier work by Virzi [1] and Nielsen [2,3] commonly viewed as “industry standards.” We found that the four sites we studied would need considerably more than five users to find 85% of their usability problems.

Keywords: usability testing, number of users, usability engineering, web usability
Previous work on this topic [1,2,3] addressed the question “How many tests are needed to find x% of the usability problems?” with a priori models that ignore the effects of specific products, investigators, and techniques. And yet specific site or product features, individual testing techniques, the complexity of usability tasks, and the type or level of problem looked for must all affect the number of problems that evaluators find.
We feel the analytical tools used in previous work [1,2] are
generally valid and we’ve extended them in this paper.
However, we challenge the “rule of thumb” conclusion
drawn from them [3] using the test data we’ve presented
here. We believe a new approach that’s based on rate of
discovery of problems as testing progresses is required.
To test our theory, we conducted usability tests of four
sites. Three of the sites primarily sell music CDs and movie
videos and DVDs. The fourth site sells electronic gadgets.
The users for the study all had a history of purchasing these types of products on-line. Each user made a list of specific products they wanted to purchase in each of the product categories sold by the sites.
We conducted 49 tests, each with the same purchasing
mission: “Describe an item that you want and buy it on this
site.” The only difference between tests was the items each user attempted to purchase, which were taken from the shopping list each user brought to the test.
We designed the test to accentuate problems that arise in
the purchase process. To this end, we gave each user
enough money to purchase the items within a
predetermined budget for the site they were testing. Using
that money, we instructed them to buy anything they
wanted on the site, or nothing at all. Users kept what they bought, as well as any money they did not spend. From these tests we identified 378 problems that prevented people from completing their purchases.
Although each usability test in this case comprised just a single task (“purchase a product”), it exercised the complete purchasing process on each site, making it equivalent to the usability tests that provided the data in prior work [1,2].
Table 1 shows, for each site, the problems we discovered in each successive test and how many of them were new (had not been observed previously in the test series).
The estimates in [2] of the number of tests needed to find a given fraction of all usability problems are based on an assumed value of L, the expected proportion of usability problems found by testing any single user. The probability of finding a new obstacle in test i is:

pi = (1 − L)^(i−1)    (1)
pi is not only the expected fraction of obstacles found in the ith test that will be new, but also the fraction of all obstacles not yet found. If we can estimate L, we can also estimate the test number i at which a given percentage of obstacles will remain undiscovered.
The average of the probabilities of finding a new obstacle in tests 1 through i can be estimated as ai = newi/alli, the cumulative count of new obstacles divided by the cumulative count of all obstacles observed. Estimates of L based directly on ai are too noisy to use, so we recover pi from successive cumulative averages and solve (1) for L:

L = 1 − (i·ai − (i−1)·ai−1)^(1/(i−1))    (2)
From (1), the test number at which x% of obstacles remain to be found is:

Tx% = Log(x/100)/Log(1 − L) + 1    (3)
The rule-of-thumb given in [3] would estimate T10% as 6.2.
For the first site, after we’d conducted 6 tests, the
calculated value was 27, far above the rule-of-thumb
estimate. Figure 1 gives estimates of T10% made after each
completed test for sites I-IV. Note how all of the values are
above 6.2, in some cases as high as 33.
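The estimation procedure can be sketched against Table 1’s Site I counts. This is our reconstruction of the arithmetic, not the study’s original analysis code, and the exact smoothing may differ, so the printed values are illustrative:

```python
import math

# Per-test obstacle counts for Site I, from Table 1.
all_counts = [14, 6, 8, 7, 12, 9, 12, 7, 11, 14, 7, 8, 6, 10, 7, 4, 8, 14]
new_counts = [14, 5, 7, 5, 7, 5, 11, 6, 9, 10, 5, 5, 2, 8, 3, 1, 2, 9]

# Cumulative averages a_i = (new obstacles so far) / (all obstacles so far).
a = []
cum_new = cum_all = 0
for new, total in zip(new_counts, all_counts):
    cum_new += new
    cum_all += total
    a.append(cum_new / cum_all)

# Recover p_i = i*a_i - (i-1)*a_{i-1}, solve p_i = (1 - L)^(i-1) for L,
# then convert each estimate of L into T10%, the test number at which
# 10% of obstacles remain to be found.
L_est, T10_est = [], []
for i in range(2, len(a) + 1):
    p = i * a[i - 1] - (i - 1) * a[i - 2]
    L = 1.0 - p ** (1.0 / (i - 1))
    L_est.append(L)
    T10_est.append(math.log(0.10) / math.log(1.0 - L) + 1)
    print(f"after test {i:2d}: L = {L:.3f}, T10% = {T10_est[-1]:.1f}")
```

Every estimate of L from this data stays far below the .31 average of [2], and every resulting T10% stays well above the 6.2 rule of thumb.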
Figure 1: Estimates of T10% made after each completed test (x-axis: test number).

Figure 2: Estimates of L derived from Table 1 (x-axis: test number).
Figure 2 shows L derived from Table 1’s data. None of the 11 products in [2] showed L near the values we saw in our tests.

Nielsen [3] proclaims that 5 users are enough to catch 85% of the problems on practically any web site. But our data show otherwise. On sites I and II, taking L = 0.1, we had found approximately 35% of the problems after the first five users.
According to [1], serious usability problems are found
“early in the testing.” In our studies, serious usability
problems that prevented intended purchases were found on
Site I first in tests 13 and 15. Is halfway through “early”?
The design of the task obviously bears on when (and
whether) serious problems surface. If all serious problems
are to be found, the task must take users over every
possible path. The magnitude of L and the number of
problems found in each test (new + old) together measure
the effectiveness of a test. A good test finds a lot of
problems with each user, and a rapidly decreasing fraction
of new problems.
Of the tests used to develop the “average” L =.31 in [2],
three were voice response systems (averaging .34), two
mainframe applications (averaging .33), one videotex (.51)
and five were PC applications (averaging .28). The lowest L (“Office system (integrated spreadsheet, etc.)”) was .16.
In testing four sites, we found no L higher than .16.
It is not hard to come up with reasons why e-commerce
web sites might not have the same L values as voice-
response systems, or even as a circa-1992 Office Suite.
Today’s web sites have millions of pages of complex data
where users have to make many personal choices.
These findings imply that fundamental changes in our thinking about usability engineering will need to occur. What happens to a development process when the number of tests can’t be predicted until testing has begun?
Analysis of this new data indicates that the formula given in [3] can still be usefully applied, but more work needs to be done in determining an appropriate value of L to use for web site testing. It looks very much like five users are nowhere near enough.
1. Virzi, Robert A. Refining the Test Phase of Usability Evaluation: How Many Subjects Is Enough? Human Factors 34, 4 (1992), 457-468.
2. Nielsen, Jakob and Landauer, Thomas K. A Mathematical Model of the Finding of Usability Problems. Proceedings of INTERCHI ’93 (1993), 206-213.
3. Nielsen, Jakob. Why You Only Need To Test With Five Users. Jakob Nielsen’s Alertbox, March 19, 2000.
Test      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
All I    14   6   8   7  12   9  12   7  11  14   7   8   6  10   7   4   8  14
New I    14   5   7   5   7   5  11   6   9  10   5   5   2   8   3   1   2   9
All II   15   5   2   7   6  12   5   6   1  13   6   6   1   7   3   6   2   7
New II   15   4   1   5   2   9   5   2   0   8   2   4   0   2   2   3   1   1
All III   6   7   4  11   7  10   5
New III   6   7   1   9   5   6   2
All IV    6   7  17  11   6   7
New IV    6   5  13   6   5   2

Table 1. Obstacles Found By Test