An analysis of standard deviations for UEQ scales
Research Report
Martin Schrepp
May 2023
Contents
Introduction
Collection of data
Results
Applications
   Plan the required sample size
   Interpret how much users agree or disagree
Can these results be generalized to UEQ-S and UEQ+ scales?
Summary
References
Introduction
The User Experience Questionnaire (short UEQ) is an established standardized UX questionnaire that
contains 26 items grouped into the six scales Attractiveness, Efficiency, Perspicuity, Dependability,
Stimulation, and Novelty. A detailed description of these scales is provided in Laugwitz, Schrepp &
Held (2006, 2008) or on the UEQ web site www.ueq-online.org.
The UEQ items are semantic differentials with a 7-point answer scale. In half of the items the positive term is placed in the right position, in the other half in the left position. The scale Attractiveness contains 6 items, all other scales 4 items.
For example, the scale Perspicuity contains the following items:
not understandable – understandable
easy to learn – difficult to learn
complicated – easy
clear – confusing
Items with the positive term in the right position (first of the items shown above) are scaled from -3
to +3 from left to right. Items with the positive term in the left position (second of the items shown
above) are scaled from -3 to +3 from right to left (opposite direction).
The scale score per participant is the average of all item scores in the corresponding scale. The scale
mean is the average over all items in a scale and all participants.
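As an illustration, this scoring can be sketched in a few lines of code (a minimal sketch; the coding of raw answers as 1 to 7 from left to right and the flag names are our assumptions for this example, not part of the official UEQ tooling):

```python
# Sketch of UEQ item and scale scoring, assuming raw answers are
# coded 1..7 from left to right on the questionnaire form.

def item_score(answer, positive_right):
    """Map a 1..7 raw answer to -3..+3; flip items whose positive
    term is in the left position."""
    return (answer - 4) if positive_right else (4 - answer)

def scale_score(answers, positive_right_flags):
    """Scale score of one participant: average of the item scores."""
    scores = [item_score(a, p) for a, p in zip(answers, positive_right_flags)]
    return sum(scores) / len(scores)

# Hypothetical participant on a 4-item scale; items 2 and 3 have the
# positive term on the left and are therefore reverse-scored.
print(scale_score([6, 2, 1, 6], [True, False, False, True]))  # 2.25
```

The scale mean over a study is then simply the average of these per-participant scale scores.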
What is typically interpreted in studies is the scale mean of the six UEQ scales. Scale means are compared to the results of other products or a previous version of the same product to detect differences in UX quality. In addition, a benchmark is available (Schrepp, Hinderks & Thomaschewski, 2017) that allows one to interpret how good or bad a measured scale mean is compared to a large set of products.
A piece of additional information that is often not interpreted or mentioned in research papers using the UEQ is the standard deviation of a scale mean. This statistic provides insight into how much participants agree or disagree concerning a UX quality aspect represented by a UEQ scale. The underlying problem is that it is difficult to decide when such a standard deviation should be considered low or high.
The standard deviation describes how much the impression of single participants deviates on average from the scale mean. Thus, a low standard deviation can be interpreted in the sense that the participants of a study show a high level of agreement concerning the UX quality measured by a scale. On the other hand, a high standard deviation points to a low level of agreement concerning this UX aspect, i.e., participants differ massively concerning their satisfaction with the UX aspect measured by that scale.
For designers interested in improving a product this can be important information. A low level of agreement concerning a UX aspect will most likely result from different groups of users that have quite different requirements concerning a product. Thus, detailed user research might be necessary to avoid that changes to the existing product positively influence the UX impression of one group of users, but negatively influence the UX impression of another group.
We investigate in this report the standard deviations in a sample of 123 studies using the UEQ. The results should help UX practitioners, on the one hand, to judge whether the standard deviations measured in a study point to a high or low level of agreement concerning a UX quality. On the other hand, knowledge of typical ranges for UEQ standard deviations can help to estimate the required sample size for a planned study in advance. An Excel-based tool to support this planning process was developed based on the results described in this report and is available for download at https://www.ueq-online.org/.
We first describe the process for the selection of the UEQ studies analyzed in this paper. Then we
summarize our insights concerning the observed standard deviations for the scales. Finally, we show
how these insights can be used to interpret standard deviations and to plan sample sizes for further
studies.
Collection of data
The data were collected from two sources. First, there were 51 studies from my own research or research results that were shared with me in the past. For these studies the raw data were available. Second, a search in Google Scholar for papers that cited the English publication describing the construction of the UEQ (Laugwitz, Schrepp & Held, 2008) was performed. To get a manageable result list, the search was restricted to papers published between 2019 and March 2023 (the point in time the search was executed). The result list contained 1197 entries (data extracted from Google Scholar in March 2023). Only papers that fulfilled the following criteria were considered:
• An empirical study with the UEQ and a sufficient sample size was reported and described in
sufficient detail
• Paper was written in English or German (the language of the application of the UEQ could be
different)
• The mean and standard deviation per scale and sample size of the study was reported in an
understandable way
• The language version of the UEQ used could be inferred. Here we must note that the language of the UEQ questionnaire was in nearly all cases not explicitly mentioned in the paper. The language was inferred from the description of the investigated target group. If the study was, for example, done with patients of an Italian hospital, we implicitly assumed that the Italian language version was used.
72 studies matched these criteria. For those studies there are, of course, no raw data available.
The information from all 123 studies was consolidated in a file that contained per scale the information about scale mean, standard deviation, sample size, and language of the UEQ version used in the study. The UEQ contains 6 scales, thus we had 738 data points.
Results
We concentrate on the analysis of the standard deviations of the UEQ scales. The mean standard deviation of a UEQ scale was 0.92 (standard deviation 0.24). Thus, we have a 95% confidence interval of [0.90, 0.94] for the standard deviation of a UEQ scale in the sample of selected studies.
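As a quick sanity check, this interval follows from the standard formula mean ± 1.96 · sd / √n with n = 738 data points (a sketch reproducing the reported numbers):

```python
import math

# Reported values: mean 0.92, standard deviation 0.24, 738 data
# points (123 studies x 6 scales).
half_width = 1.96 * 0.24 / math.sqrt(738)
lo, hi = 0.92 - half_width, 0.92 + half_width
print(round(lo, 2), round(hi, 2))  # 0.9 0.94
```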
The following figure shows the distribution of the standard deviations of the UEQ scales.
Figure 1: Distribution of the observed standard deviations for the scales of the UEQ.
As described above, the data come from two sources: data collected in my research or shared with me (subset A) and data from literature research (subset B). There are some differences between these two subsets of studies. For example, the average sample size is 104 for subset A and 82 for subset B. In addition, the investigated products differ. Subset A contains mainly studies concerning classical software products or web sites. In subset B there are many applications to apps for patient care or rehabilitation training and in the areas of virtual or augmented reality.
Thus, we can expect some differences between these two subsets. This is confirmed by an ANOVA. The mean standard deviation of scales was 0.98 in subset A and 0.88 in subset B, and this difference is statistically significant (df=737, F=35.93, p < .001).
How do the 6 scales differ? Table 1 shows the mean standard deviation per scale.
Scale            Av. Std. Deviation   Variance
Attractiveness   0.92                 0.06
Perspicuity      0.92                 0.07
Efficiency       0.91                 0.05
Dependability    0.88                 0.05
Stimulation      0.96                 0.06
Novelty          0.94                 0.05

Table 1: Average standard deviations per UEQ scale.
There are some smaller differences between the six scales. However, an ANOVA showed no
significant impact of the UEQ scale on the standard deviation (df=732, F=1.496, p=0.189).
Another factor that can have an impact on the standard deviation is the language version of the UEQ. Of course, translations of UX items can cause small changes in the semantic meaning of an item. Such subtle changes are impossible to avoid and hard to detect, but they may have an impact on the standard deviation of a scale.
For the following analysis we considered only languages with more than 50 data points, i.e., since each study contributes 6 scales, more than 8 studies. Only the language versions for German, English, Indonesian, Italian, and Spanish fulfilled this criterion (102 studies and 612 measured standard deviations).
Language     Av. Std. Deviation   Variance
German       0.92                 0.04
English      0.88                 0.06
Indonesian   0.91                 0.08
Italian      0.84                 0.04
Spanish      0.80                 0.05

Table 2: Average standard deviation per language version.
An ANOVA shows that the language has a statistically significant impact on the value of the standard deviation (df=607, F=5.58, p < .001). Interestingly, the standard deviations of the Italian and Spanish translations are lower than the standard deviations observed for the German original version. As always in such cases, it is unclear whether this is caused by changes in the meaning of items introduced by the translation or by cultural differences in the way surveys are answered (see, for example, Santoso & Schrepp, 2019 or Schrepp & Santoso, 2019).
There is an interaction between scale mean and standard deviation. If the mean is very high, then most participants give high ratings. Thus, the fluctuation of the ratings will be lower than in the case of a medium overall scale mean (the same is of course true in the case of a low scale value). Thus, we have a ceiling or floor effect here. This effect could also be observed in our data.
The following figure shows the dependency between the scale mean and the standard deviation of the scale. As expected, the more extreme the scale mean is (in the direction of an extremely positive or extremely negative result), the lower the standard deviation (see the polynomial trendline).
Figure 2: Scale means and standard deviations from our sample of studies.
Applications
How can we use these results in practice? In fact, they are valuable with respect to two different problems. First, they make it easier to plan UEQ surveys. Second, they allow one to judge how much participants agree or disagree concerning their impression of the UX quality aspect measured by a scale.
Plan the required sample size
One of the most difficult questions during the planning phase of a study based on a UX survey is: How many responses do I need to get interpretable results? There is no simple rule that can be followed to answer this question. Clearly, the required number of responses depends on how accurately you need to measure the UEQ scale values. The statistical concept that describes what "accurate measurement" means is the confidence interval.
Assume, for example, that we measure the UX of a web site with the UEQ and collect 50 responses. For the scale Efficiency we computed a mean scale score of 1.13. But of course, different respondents have different opinions on the efficiency of that web site, and if we collected a second sample of 50 responses, we could not expect to measure the same score. Sampling effects will always have an impact.
To control the impact of such sampling effects, we can calculate the 95% confidence interval for the scale score (if you use the data analysis tool that can be downloaded from www.ueq-online.org, these values will be calculated automatically):

CI = [ m − 1.96 · sd / √n , m + 1.96 · sd / √n ]

where n is the number of participants, m the mean scale score, sd the standard deviation of the scale score, and 1.96 the z-value corresponding to the 95% confidence level.
How can we interpret the 95% confidence interval? Assume we were able to repeat our survey infinitely often, i.e., measure the same product and draw for each repetition the same number of participants from the same population. Then the measured scale score would lie inside the 95% confidence interval in 95% of these repetitions. Thus, the width of the confidence interval is an indicator of the accuracy of your measurement: the smaller it is, the more accurate the measurement.
If you want to use a different confidence level, then simply replace 1.96 with the corresponding z-value for this level (for example, use 1.64 for a 90% confidence level). Most statistical textbooks contain tables of z-values for typical confidence levels, but nearly all UX research papers use 95% or 90% confidence levels.
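The calculation can be sketched in a few lines, reusing the Efficiency example above (a sketch; the standard deviation of 0.95 is an assumed value for illustration, and the function name is ours):

```python
import math

def confidence_interval(mean, sd, n, z=1.96):
    """Confidence interval for a UEQ scale mean
    (z=1.96 for the 95% level, z=1.64 for the 90% level)."""
    half_width = z * sd / math.sqrt(n)
    return (mean - half_width, mean + half_width)

# Efficiency example: mean 1.13 from 50 responses, assumed sd 0.95.
lo, hi = confidence_interval(1.13, 0.95, 50)
print(round(lo, 2), round(hi, 2))  # 0.87 1.39
```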
As we can see from the formula above, the width of the confidence interval for a given confidence level is influenced by the sample size and the standard deviation. The larger the sample size, the more accurate the measurement and the smaller the confidence interval. The more the participants in the study agree concerning the measured UX aspect, the smaller the standard deviation and the lower the impact of the random sampling of participants on the result.
If we know the standard deviation, then the sample size required to reach a certain width w of the confidence interval can be calculated by:

n = (2 · 1.96 · sd / w)²
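In code, rounding up to the next whole participant, this reads (a sketch; the function name is ours):

```python
import math

def required_sample_size(sd, width, z=1.96):
    """Smallest n for which the confidence interval of a scale mean
    has at most the given width."""
    return math.ceil((2 * z * sd / width) ** 2)

# Example: expected standard deviation 0.95, desired width 0.4.
print(required_sample_size(0.95, 0.4))  # 87
```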
The concept of a confidence interval clarifies what accuracy of measurement means. Thus, to determine how accurate your measurement must be, you need to specify the width of the confidence interval. But what are good values in practice? We describe in the following a simple heuristic that helps to answer this question.
Assume your goal is to establish a permanent monitoring of the UX quality of a product with the
UEQ. Thus, you plan to apply the UEQ after each new release of your product.
The items of the UEQ are semantic differentials with a 7-point response scale:
inefficient – efficient
The answers are scored from -3 (most negative) to +3 (most positive). If the impression of a user concerning an item improves compared to a previous measurement point, then he or she will score at least one point higher than in the previous result (for example, the rating changes from a -1 to a 0 or from a +1 to a +2).
Would you consider it an indicator of improved satisfaction with the product if 3% of the users express a better impression concerning the efficiency items for the new version? Most likely not. But maybe 20% of users with a better impression would be a notable improvement?
If this is the case, then you can directly infer a good choice for the width of the confidence interval. If 20% of users express a higher satisfaction, that means that 20% of the participants of your survey score at least 1 point higher, so you can expect an increase in the scale score of at least 0.2. Since the scale mean is in the center of the confidence interval and you want to be able to detect such a change, you can go for a width of 0.4.
Figure 3: Width of the confidence interval in relation to expected differences.
Let us generalize the example above. To determine a reasonable width of the confidence interval for your planned study, you can ask yourself what you would define as a notable change in ratings. What percentage of improved responses to the items of a scale would you interpret as a sign of improved user experience concerning this scale? If x% is the answer to that question, then you should go for a confidence interval width of 2 · x/100.
Thus, if you follow such a basic heuristic, it is quite easy to define a reasonable width of the expected
confidence interval for your research project.
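The heuristic itself is a one-liner (a sketch; the function name is ours):

```python
def ci_width_for_notable_change(percent_improved):
    """If x% of users score one point higher, the scale mean rises by
    x/100; choose a confidence interval of twice that width so the
    shift can be detected."""
    return 2 * percent_improved / 100

print(ci_width_for_notable_change(20))  # 0.4
```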
Given the formula above, the required number of participants can be estimated from the desired width of the confidence interval and the standard deviation. Of course, you only know the exact value of the standard deviation after you complete your data collection. But for planning your research you can often estimate it.
If you have, for example, already done a survey with the same evaluated product and a similar way
to recruit users (a similar target group with respect to demographics and usage behavior), then it is
unlikely that the standard deviation of your new research differs much from this value.
As an example, assume you had done a previous similar study, the highest standard deviation of one of the six UEQ scales was 0.95, and the width of the 95% confidence interval for your planned study should be 0.4. Then you can estimate the required sample size n as:

n = (2 · 1.96 · 0.95 / 0.4)² ≈ 87

Thus, planning with approximately 90 participants would be a reasonable solution.
In addition, the data reported above can be used at least for an educated guess of the required sample size, since they describe quite well the typical ranges of standard deviations for UEQ scales. To support this, we grouped the observed standard deviations into three equally sized groups.
The observed standard deviations were sorted in increasing order according to their value. Then the set of scores was split into three equally sized subsets:
• High agreement: The first 33.33% of values. This represents the cases in which the standard
deviation was low and therefore the agreement amongst participants concerning the UX
quality represented by the scale was high.
• Medium agreement: The second 33.33% of values. This represents the cases with a medium
standard deviation and thus a medium level of agreement concerning the UX quality
represented by the scale.
• Low agreement: The third 33.33% of values. This represents the cases in which the standard
deviation was high and therefore the agreement amongst participants concerning the UX
quality represented by the scale was low.
This can also be used for an estimation of the required sample size. If you have a rough idea
concerning the level of agreement you expect for your study, then the means in the three groups
described above can be used as an estimation (High: 0.66, Medium: 0.93, Low: 1.17).
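The split procedure can be sketched generically (toy data for illustration only; the report's real group means over 738 data points are the values quoted above):

```python
from statistics import mean

def agreement_group_means(sds):
    """Sort observed standard deviations and split them into three
    equally sized groups: high, medium, and low agreement."""
    s = sorted(sds)
    k = len(s) // 3
    return mean(s[:k]), mean(s[k:2 * k]), mean(s[2 * k:])

# Toy data: six observed scale standard deviations.
high, med, low = agreement_group_means([0.9, 0.5, 1.2, 0.7, 1.3, 1.0])
print(high, med, low)  # 0.6 0.95 1.25
```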
Both methods (use standard deviation from a similar previous study and use the three classes for an
educated guess) are implemented in the Excel tool for the estimation of the required sample size
that can be downloaded from the UEQ homepage.
Interpret how much users agree or disagree
The standard deviations collected in our study can be used as a benchmark that allows to interpret if
a measured standard deviation represents a low, medium of high level of agreement of the
participants concerning the UX aspect represented by the scale.
We described above the split of the results into the three groups representing high, medium, and low
agreement. We can determine the borders of these three groups and define the interpretation of the
standard deviation of a score as:
• High level of agreement: Measured standard deviation of scale < 0.83
• Medium level of agreement: Measured standard deviation of scale between 0.83 and 1.01
• Low level of agreement: Measured standard deviation of scale > 1.01
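These borders can be turned directly into a small helper (a sketch; the function name and labels are ours):

```python
def agreement_level(sd):
    """Interpret a measured UEQ scale standard deviation using the
    tertile borders derived in this report (0.83 and 1.01)."""
    if sd < 0.83:
        return "high agreement"
    if sd <= 1.01:
        return "medium agreement"
    return "low agreement"

# Three illustrative values:
print(agreement_level(1.31))  # low agreement
print(agreement_level(0.99))  # medium agreement
print(agreement_level(0.74))  # high agreement
```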
Of course, this is based on the implicit assumption that the UEQ studies collected for our sample are somewhat representative. Most of these studies come from published papers, so it can be assumed that these studies meet at least the typical quality criteria. On the other hand, it is impossible to judge whether the investigated products in these studies are somehow representative of industrial research, since such results are typically not published or shared.
Thus, it is advisable (as always) to apply the recommendations in this report with care and to try to collect some data of your own (for example in pre-studies) that fit well to the product domain of your products.
We finish this section with three concrete examples of scales with a low, medium, and high level of agreement. They should help to give a direct impression of what these agreement levels mean in practice.
Figure 4: Example of a scale with low agreement (Stimulation scale, standard deviation 1.31). Bars represent the distribution of the scale scores of the participants.
Figure 4 shows the results for a scale with a low agreement between participants. We see that the
ratings are distributed widely and that there is a peak around 0 (relatively low impression) and
another one around 2.5 (very good impression).
Figure 5: Example of a scale with medium agreement (Perspicuity scale, standard deviation 0.99).
Figure 5 shows the results for a scale with a medium level of agreement between participants. Nearly all ratings are concentrated in the positive area (0.5 to 3), but inside this area they are spread out quite widely.
Figure 6: Example of a scale with high agreement (Dependability scale, standard deviation 0.74).
Figure 6 shows the results for a scale with a high level of agreement between participants. There is a clear peak between 1 and 2; more than 50% of the participants' scores lie in this small corridor. More than 30% of the remaining scores are close to this area, and higher deviations are extremely rare.
Can these results be generalized to UEQ-S and UEQ+ scales?
There is a short form of the UEQ (called UEQ-S) containing just 8 items (Schrepp, Hinderks &
Thomaschewski, 2017) and a modular extension (called UEQ+) that contains additional scales
(Schrepp & Thomaschewski, 2019). Both versions differ a bit in the scale format.
The UEQ-S contains the two scales Pragmatic Quality and Hedonic Quality (with 4 items per scale).
The items for Pragmatic Quality are selected from the UEQ scales Efficiency, Perspicuity and
Dependability and the items for Hedonic Quality are selected from the UEQ scales Stimulation and
Novelty. Thus, these scales cover semantically a wider range than a single UEQ scale. In addition, the
item format is a bit different, since the positive term of the semantic differential is always placed in
the right position.
Currently the number of available UEQ-S studies is of course much lower than the corresponding number for the UEQ. But first results point in the direction that the average standard deviation of the UEQ-S scales is higher than the values we observed for the UEQ scales (first results indicate that it is higher than 1.1; however, the data basis is too small to draw stable conclusions). Theoretically, this could be expected, since the items of the two UEQ-S scales cover a much bigger semantic space than the items in a single UEQ scale. However, due to the limited number of available data it is currently not possible to make a clear statement concerning this point.
The UEQ+ contains currently 20 scales concerning different UX aspects (see Schrepp &
Thomaschewski, 2019 or the descriptions on the UEQ+ home page ueqplus.ueq-research.org). Each
scale consists of 4 items in the form of a semantic differential. But these items are grouped per scale
and a short sentence is used to put them into a common context. In addition, as in the UEQ-S the
positive term of an item is always in the right position.
There are already a couple of studies available that use the UEQ+. But due to the modular character of the UEQ+ these studies differ in the scales used (the researcher can pick those scales out of the catalogue of the currently 20 available scales that make the most sense for the investigated product). Thus, again there are not enough data to draw clear conclusions about the standard deviations of the UEQ+ scales. It is especially difficult to judge whether these standard deviations are similar for the different UEQ+ scales or whether they differ heavily per scale. Some UEQ+ scales cover aspects that are only relevant for very special types of products, for example household appliances or voice assistants,
thus it is relatively likely that differences between such scales are higher than for the six UEQ scales
that all cover general UX aspects that can be applied to many different product types.
The arguments above show that it is currently not a good idea to apply the recommendations in this report to the UEQ-S and UEQ+ as well. Please use them only for the original UEQ.
Summary
We presented an analysis of standard deviations for UEQ scales from a sample of 123 studies. The
results should provide some guidance for planning the sample size of studies and for the
interpretation of measured standard deviations.
Of course, the recommendations in this study are based on the implicit assumption that the UEQ studies collected for our sample are somewhat representative. Most of these studies come from published papers, so it can be assumed that they meet at least the typical quality criteria. On the other hand, it is impossible to judge whether the investigated products in these studies are somehow representative of industrial research, since such results are typically not published or shared.
Thus, it is advisable (as always) to apply the recommendations in this report with care and to try to collect some data of your own (for example in pre-studies) that fit well to the product domain of your products.
The results in this report focus on the UEQ. A similar approach for the SUS is already described by
Lewis & Sauro (2022, 2023).
References
Laugwitz, B.; Schrepp, M. & Held, T. (2006). Konstruktion eines Fragebogens zur Messung der User
Experience von Softwareprodukten. In: A.M. Heinecke & H. Paul (Eds.): Mensch & Computer 2006 –
Mensch und Computer im Strukturwandel. Oldenbourg Verlag, pp. 125 – 134. DOI:
10.1524/9783486841749.125.
Laugwitz, B., Schrepp, M. & Held, T. (2008). Construction and evaluation of a user experience
questionnaire. In: Holzinger, A. (Ed.): USAB 2008, LNCS 5298, pp. 63-76. DOI: 10.1007/978-3-540-
89350-9_6.
Santoso, H.B. & Schrepp, M. (2019). The Impact of Culture and Product on the Subjective Importance of User Experience Aspects. Heliyon, Vol. 5. https://doi.org/10.1016/j.heliyon.2019.e02434
Lewis, J. & Sauro, J. (2022). Sample Sizes for a SUS Score. Available online:
https://measuringu.com/sample-sizes-for-sus-ci/ (last accessed 2.5.2023).
Lewis, J. & Sauro, J. (2023). The Variability and Reliability of Standardized UX Scales. Available online:
https://measuringu.com/reliability-and-variability-of-standardized-ux-scales/ (last accessed
2.5.2023).
Schrepp, M.; Hinderks, A. & Thomaschewski, J. (2017). Construction of a benchmark for the User
Experience Questionnaire (UEQ). International Journal of Interactive Multimedia and Artificial
Intelligence, Vol. 4, No. 4, pp. 40-44.
Schrepp, M.; Hinderks, A. & Thomaschewski, J. (2017). Design and Evaluation of a Short Version of the User Experience Questionnaire (UEQ-S). International Journal of Interactive Multimedia and Artificial Intelligence, Vol. 4, No. 6, pp. 103-108. DOI: 10.9781/ijimai.2017.09.001.
Schrepp, M. & Santoso, H. B. (2019). Impact of Culture on the choice of relevant UX Scales. Mensch
und Computer 2019 - Workshopband. Bonn: Gesellschaft für Informatik e.V. DOI:
10.18420/muc2019-ws-624
Schrepp, M. & Thomaschewski, J. (2019). Design and Validation of a Framework for the Creation of
User Experience Questionnaires. International Journal of Interactive Multimedia and Artificial
Intelligence. DOI:10.9781/ijimai.2019.06.006.